Overview of the Protein-protein Interaction Annotation Extraction Task of BioCreative II

Journal Genome Biol

Specialties Biology, Genetics

Date 2008 Oct 18

PMID 18834495

Citations 115

Authors

Martin Krallinger

Florian Leitner

Carlos Rodriguez-Penagos

Alfonso Valencia

Abstract

Background: The biomedical literature is the primary information source for manual protein-protein interaction annotations. Text-mining systems have been implemented to extract binary protein interactions from articles, but a comprehensive comparison between the different techniques as well as with manual curation was missing.

Results: We designed a community challenge, the BioCreative II protein-protein interaction (PPI) task, based on the main steps of a manual protein interaction annotation workflow. It was structured into four distinct subtasks: (a) detection of protein interaction-relevant articles; (b) extraction and normalization of protein interaction pairs; (c) retrieval of the interaction detection methods used; and (d) retrieval of the actual text passages that provide evidence for protein interactions. A total of 26 teams submitted runs for at least one of the proposed subtasks. In the interaction article detection subtask, the top-scoring team reached an F-score of 0.78. In the subtask on interaction pair extraction and mapping to SwissProt, a precision of 0.37 (with a recall of 0.33) was obtained. For associating articles with an experimental interaction detection method, an F-score of 0.65 was achieved. In the retrieval of the PPI passages that best summarize a given protein interaction in full-text articles, 19% of the submissions returned by one of the runs corresponded to curator-selected sentences. Because curators extracted only the passages that best summarized a given interaction, many of the automatically extracted passages may contain interaction information without corresponding to the most informative sentences.
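The precision and recall quoted for pair extraction can be combined into a single balanced score. The following minimal Python sketch computes the standard F-measure (the weighted harmonic mean of precision and recall). The function name `f_score` and the `beta` parameter are illustrative choices, not part of the task's evaluation software. The printed value is only what the two quoted numbers imply arithmetically, not a figure reported in the abstract, and scores that were averaged per article may combine differently.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1.0 gives the balanced F-score (F1)."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        # Both precision and recall are zero: define the score as 0.
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


# Pair-extraction figures quoted in the abstract (precision 0.37, recall 0.33).
pair_f1 = f_score(0.37, 0.33)
print(f"{pair_f1:.3f}")  # prints 0.349
```

Because the harmonic mean is dominated by the lower of the two values, a system cannot reach a high F-score by trading away recall for precision or the reverse.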

Conclusion: The BioCreative II PPI task is the first attempt to compare the performance of text-mining tools specific to each of the basic steps of the PPI extraction pipeline. The challenges identified range from problems in converting the format of full-text articles to difficulties in detecting interactor protein pairs and linking them to their database records. Limitations were also encountered when using a single (and possibly incomplete) reference database for protein normalization, and when restricting the search for interactor proteins to co-occurrence within a single sentence, although a mention might span neighboring sentences. Finally, distinguishing novel, experimentally verified interactions (annotation relevant) from previously known interactions adds further complexity to these tasks.
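The single-sentence co-occurrence limitation can be illustrated with a small Python sketch. It assumes that protein mentions have already been detected and normalized per sentence, which is itself one of the hard steps described above. The function `cooccurring_pairs`, its `window` parameter, and the identifiers `ProtA`, `ProtB`, and `ProtC` are hypothetical and not taken from any participating system.

```python
from itertools import combinations


def cooccurring_pairs(sentence_mentions, window=0):
    """Collect candidate interactor pairs by co-occurrence.

    sentence_mentions: list of sets, one set of protein identifiers per sentence.
    window: 0 pairs proteins within the same sentence only;
            1 also pairs proteins found in neighboring sentences, and so on.
    """
    pairs = set()
    for i in range(len(sentence_mentions)):
        # Pool the mentions of sentence i and the next `window` sentences.
        pooled = set().union(*sentence_mentions[i:i + window + 1])
        # Sort so each unordered pair has one canonical representation.
        for a, b in combinations(sorted(pooled), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical passage: ProtA and ProtB are mentioned in adjacent sentences.
mentions = [{"ProtA"}, {"ProtB"}, {"ProtA", "ProtC"}]

same_sentence = cooccurring_pairs(mentions, window=0)
neighboring = cooccurring_pairs(mentions, window=1)
print(sorted(same_sentence))
print(sorted(neighboring))
```

With `window=0`, the pair (`ProtA`, `ProtB`) is missed because the two mentions never share a sentence. Widening the window recovers it, but it also adds candidates such as (`ProtB`, `ProtC`) that may not interact, which is the recall-versus-precision trade-off behind this limitation.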

Citing Articles

Artificial Intelligence Methods in Infection Biology Research.

Anter J, Yakimovich A. Methods Mol Biol. 2025; 2890:291-333.

PMID: 39890733 DOI: 10.1007/978-1-0716-4326-6_15.

JTIS: enhancing biomedical document-level relation extraction through joint training with intermediate steps.

Li J, Pan D, Yang Z, Sun Y, Lin H, Wang J. Database (Oxford). 2024; 2024.

PMID: 39700498 PMC: 11658465. DOI: 10.1093/database/baae125.

Biomedical relation extraction method based on ensemble learning and attention mechanism.

Jia Y, Wang H, Yuan Z, Zhu L, Xiang Z. BMC Bioinformatics. 2024; 25(1):333.

PMID: 39425010 PMC: 11488084. DOI: 10.1186/s12859-024-05951-y.

CoNECo: a Corpus for Named Entity recognition and normalization of protein Complexes.

Nastou K, Koutrouli M, Pyysalo S, Jensen L. Bioinform Adv. 2024; 4(1):vbae116.

PMID: 39411448 PMC: 11474106. DOI: 10.1093/bioadv/vbae116.

Optimized biomedical entity relation extraction method with data augmentation and classification using GPT-4 and Gemini.

Phan C, Phan B, Chiang J. Database (Oxford). 2024; 2024.

PMID: 39383312 PMC: 11463225. DOI: 10.1093/database/baae104.
