
ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature

Overview
Date 2016 Sep 27
PMID 27669338
Citations 88
Abstract

The emergence of "big data" initiatives has led to the need for tools that can automatically extract valuable chemical information from large volumes of unstructured data, such as the scientific literature. Since chemical information can be present in figures, tables, and textual paragraphs, successful information extraction often depends on the ability to interpret all of these domains simultaneously. We present a complete toolkit for the automated extraction of chemical entities and their associated properties, measurements, and relationships from scientific documents; the extracted data can be used to populate structured chemical databases. Our system provides an extensible, chemistry-aware, natural language processing pipeline for tokenization, part-of-speech tagging, named entity recognition, and phrase parsing. Within this scope, we report improved performance for chemical named entity recognition through the use of unsupervised word clustering based on a massive corpus of chemistry articles. For phrase parsing and information extraction, we present the novel use of multiple rule-based grammars that are tailored for interpreting specific document domains such as textual paragraphs, captions, and tables. We also describe document-level processing to resolve data interdependencies and show that this is particularly necessary for the autogeneration of chemical databases, since captions and tables commonly contain chemical identifiers and references that are defined elsewhere in the text. We evaluated how accurately the toolkit extracts various types of data, obtaining F-scores of 93.4%, 86.8%, and 91.5% for chemical identifiers, spectroscopic attributes, and chemical property attributes, respectively. On the CHEMDNER chemical name extraction challenge, ChemDataExtractor yields a competitive F-score of 87.8%. All tools have been released under the MIT license and are available to download from http://www.chemdataextractor.org.
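
The F-scores quoted in the abstract are conventionally the harmonic mean of precision and recall. The sketch below shows how such a score could be computed for a set of extracted chemical names against a hand-annotated reference set. It is a minimal illustration only: the function names, the exact-match comparison, and the example compound lists are assumptions made here, not the evaluation code or data used by the authors.

```python
from __future__ import annotations


def f_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Balanced F-score: harmonic mean of precision and recall."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score_extraction(extracted: set[str], gold: set[str]) -> float:
    """Score extracted names against a reference set using exact string matches."""
    tp = len(extracted & gold)   # names found and correct
    fp = len(extracted - gold)   # names found but not in the reference set
    fn = len(gold - extracted)   # reference names that were missed
    return f_score(tp, fp, fn)


# Hypothetical example data (not taken from the paper).
gold = {"benzene", "toluene", "acetone", "ethanol"}
extracted = {"benzene", "toluene", "acetone", "water"}

score = score_extraction(extracted, gold)
print(f"F-score: {score:.3f}")
```

With three of four names matched and one spurious extraction, precision and recall are both 0.75, so the F-score is also 0.75. Real evaluations typically match entity spans within the text rather than bare strings, which this sketch does not attempt.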

Citing Articles

Life Cycle Inventory Availability: Status and Prospects for Leveraging New Technologies.

Wright M, Tan E, Tu Q, Martins A, Parvatker A, Yao Y ACS Sustain Chem Eng. 2025; 12(34):12708-12718.

PMID: 40017908 PMC: 11864275. DOI: 10.1021/acssuschemeng.4c02519.


Auto-generating a database on the fabrication details of perovskite solar devices.

Valencia A, Liu F, Zhang X, Bo X, Li W, Daoud W Sci Data. 2025; 12(1):270.

PMID: 39952948 PMC: 11828846. DOI: 10.1038/s41597-025-04566-z.


Cost-Efficient Domain-Adaptive Pretraining of Language Models for Optoelectronics Applications.

Huang D, Cole J J Chem Inf Model. 2025; 65(5):2476-2486.

PMID: 39933074 PMC: 11898057. DOI: 10.1021/acs.jcim.4c02029.


MechBERT: Language Models for Extracting Chemical and Property Relationships about Mechanical Stress and Strain.

Kumar P, Kabra S, Cole J J Chem Inf Model. 2025; 65(4):1873-1888.

PMID: 39888859 PMC: 11863389. DOI: 10.1021/acs.jcim.4c00857.


A review of large language models and autonomous agents in chemistry.

Ramos M, Collison C, White A Chem Sci. 2025; 16(6):2514-2572.

PMID: 39829984 PMC: 11739813. DOI: 10.1039/d4sc03921a.