Structured Information Extraction from Scientific Text with Large Language Models

Overview

Journal: Nat Commun

Specialty: Biology

Date: 2024 Feb 15

PMID: 38360817

Authors

John Dagdelen

Alexander Dunn

Sanghoon Lee

Nicholas Walker

Andrew S Rosen

Gerbrand Ceder

Kristin A Persson

Anubhav Jain

Abstract

Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers.
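As a rough illustration of the "list of JSON objects" output format mentioned above, the following minimal sketch shows how a model completion for the dopant/host linking task might be parsed into records. The field names `host` and `dopants`, the helper `parse_records`, and the example completion are all hypothetical and chosen for illustration; they are not taken from the paper's actual schema or code.

```python
import json


def parse_records(completion: str) -> list[dict]:
    """Parse a completion that is expected to hold a JSON list of objects.

    Returns an empty list if the text is not valid JSON or is not a list,
    so that malformed model output does not crash a downstream pipeline.
    """
    try:
        data = json.loads(completion)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only entries that are JSON objects (Python dicts).
    return [item for item in data if isinstance(item, dict)]


# Invented completion for an invented sentence about Al- and Ga-doped ZnO.
# The "host" and "dopants" keys are assumed names, not the paper's schema.
completion = '[{"host": "ZnO", "dopants": ["Al", "Ga"]}]'
records = parse_records(completion)

for record in records:
    print(record["host"], "doped with", ", ".join(record["dopants"]))
```

Returning an empty list on malformed output is one possible design choice; a real pipeline might instead log or retry such completions.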

Citing Articles

Using Generative AI to Extract Structured Information from Free Text Pathology Reports.

Shahid F, Hsu M, Chang Y, Jian W. J Med Syst. 2025; 49(1):36.

PMID: 40080229 PMC: 11906504. DOI: 10.1007/s10916-025-02167-2.

LLM-IE: a python package for biomedical generative information extraction with large language models.

Hsu E, Roberts K. JAMIA Open. 2025; 8(2):ooaf012.

PMID: 40078164 PMC: 11901043. DOI: 10.1093/jamiaopen/ooaf012.

Pipeline to explore information on genome editing using large language models and genome editing meta-database.

Suzuki T, Bono H. Database (Oxford). 2025; 2025.

PMID: 40056431 PMC: 11890094. DOI: 10.1093/database/baaf022.

Developing a named entity framework for thyroid cancer staging and risk level classification using large language models.

Fung M, Tang E, Wu T, Luk Y, Au I, Liu X. NPJ Digit Med. 2025; 8(1):134.

PMID: 40025285 PMC: 11873034. DOI: 10.1038/s41746-025-01528-y.

Evaluating GPT Models for Automated Literature Screening in Wastewater-Based Epidemiology.

Chibwe K, Mantilla-Calderon D, Ling F. ACS Environ Au. 2025; 5(1):61-68.

PMID: 39830716 PMC: 11741058. DOI: 10.1021/acsenvironau.4c00042.
