Comparative Experiments on Learning Information Extractors for Proteins and Their Interactions

Overview

Journal: Artif Intell Med

Specialty: Biomedical Engineering

Date: 2005 Apr 7

PMID: 15811782

Citations: 97

Authors

Razvan Bunescu

Ruifang Ge

Rohit J Kate

Edward M Marcotte

Raymond J Mooney

Arun K Ramani

Yuk Wah Wong

Abstract

Objective: Automatically extracting information from biomedical text holds the promise of easily consolidating large amounts of biological knowledge in computer-accessible form. This strategy is particularly attractive for extracting data relevant to genes of the human genome from the 11 million abstracts in Medline. However, extraction efforts have been frustrated by the lack of conventions for describing human genes and proteins. We have developed and evaluated a variety of learned information extraction systems for identifying human protein names in Medline abstracts and subsequently extracting information on interactions between the proteins.

Methods and material: We used a variety of machine learning methods to automatically develop information extraction systems for extracting information on gene/protein name, function and interactions from Medline abstracts. We present cross-validated results on identifying human proteins and their interactions by training and testing on a set of approximately 1000 manually annotated Medline abstracts that discuss human genes/proteins.

Results: We demonstrate that machine learning approaches using support vector machines and maximum entropy are able to identify human proteins with higher accuracy than several previous approaches. We also demonstrate that various rule induction methods are able to identify protein interactions with higher precision than manually-developed rules.

Conclusion: Our results show that it is promising to use machine learning to automatically build systems for extracting information from biomedical text. The results also give a broad picture of the relative strengths of a wide variety of methods when tested on a reasonably large human-annotated corpus.

Citing Articles

Annotated corpus for traditional formula-disease relationships in biomedical articles.

Yea S, Jang H, Kim S, Lee S, Kim J. Sci Data. 2025; 12(1):26.

PMID: 39774689 PMC: 11707285. DOI: 10.1038/s41597-025-04377-2.

Learning to explain is a good biomedical few-shot learner.

Chen P, Wang J, Luo L, Lin H, Yang Z. Bioinformatics. 2024; 40(10).

PMID: 39360976 PMC: 11483110. DOI: 10.1093/bioinformatics/btae589.

Evaluating GPT and BERT models for protein-protein interaction identification in biomedical text.

Rehana H, Cam N, Basmaci M, Zheng J, Jemiyo C, He Y. Bioinform Adv. 2024; 4(1):vbae133.

PMID: 39319026 PMC: 11419952. DOI: 10.1093/bioadv/vbae133.

STRING-ing together protein complexes: corpus and methods for extracting physical protein interactions from the biomedical literature.

Mehryary F, Nastou K, Ohta T, Jensen L, Pyysalo S. Bioinformatics. 2024; 40(9).

PMID: 39276156 PMC: 11441320. DOI: 10.1093/bioinformatics/btae552.

RegulaTome: a corpus of typed, directed, and signed relations between biomedical entities in the scientific literature.

Nastou K, Mehryary F, Ohta T, Luoma J, Pyysalo S, Jensen L. Database (Oxford). 2024; 2024.

PMID: 39265993 PMC: 11394941. DOI: 10.1093/database/baae095.