
Structured Information Extraction from Scientific Text with Large Language Models

Overview
Journal Nat Commun
Specialty Biology
Date 2024 Feb 15
PMID 38360817
Abstract

Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers.
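The abstract mentions that extracted records can be returned as a list of JSON objects. As a rough illustration only, the sketch below parses such a completion and flattens it into dopant-host pairs. The completion string is made up for this example, and the `host` and `dopants` keys and both helper functions are hypothetical; the record schema used in the paper is not given here.

```python
import json

# Hypothetical completion text standing in for a fine-tuned model's output.
# The "host" and "dopants" keys are an assumed schema for illustration only.
completion = (
    '[{"host": "ZnO", "dopants": ["Al", "Ga"]},'
    ' {"host": "TiO2", "dopants": ["Nb"]}]'
)


def parse_records(text):
    """Parse a completion into a list of dict records.

    Returns an empty list when the text is not valid JSON or is not a
    JSON list, since generated text is not guaranteed to be well-formed.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only JSON objects; skip stray strings or numbers in the list.
    return [record for record in data if isinstance(record, dict)]


def dopant_host_pairs(records):
    """Flatten records into (dopant, host) tuples, one per linked pair."""
    pairs = []
    for record in records:
        host = record.get("host")
        for dopant in record.get("dopants", []):
            pairs.append((dopant, host))
    return pairs


records = parse_records(completion)
pairs = dopant_host_pairs(records)
print(pairs)
```

Returning an empty list on malformed output is one possible design choice; a real pipeline might instead log or retry the failed completion.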

Citing Articles

Using Generative AI to Extract Structured Information from Free Text Pathology Reports.

Shahid F, Hsu M, Chang Y, Jian W. J Med Syst. 2025; 49(1):36.

PMID: 40080229 PMC: 11906504. DOI: 10.1007/s10916-025-02167-2.


LLM-IE: a python package for biomedical generative information extraction with large language models.

Hsu E, Roberts K. JAMIA Open. 2025; 8(2):ooaf012.

PMID: 40078164 PMC: 11901043. DOI: 10.1093/jamiaopen/ooaf012.


Pipeline to explore information on genome editing using large language models and genome editing meta-database.

Suzuki T, Bono H. Database (Oxford). 2025; 2025.

PMID: 40056431 PMC: 11890094. DOI: 10.1093/database/baaf022.


Developing a named entity framework for thyroid cancer staging and risk level classification using large language models.

Fung M, Tang E, Wu T, Luk Y, Au I, Liu X. NPJ Digit Med. 2025; 8(1):134.

PMID: 40025285 PMC: 11873034. DOI: 10.1038/s41746-025-01528-y.


Evaluating GPT Models for Automated Literature Screening in Wastewater-Based Epidemiology.

Chibwe K, Mantilla-Calderon D, Ling F. ACS Environ Au. 2025; 5(1):61-68.

PMID: 39830716 PMC: 11741058. DOI: 10.1021/acsenvironau.4c00042.

