Is Searching Full Text More Effective Than Searching Abstracts?

Overview

Journal BMC Bioinformatics

Publisher BioMed Central

Specialty Biology

Date 2009 Feb 5

PMID 19192280

Citations 23

Authors

Jimmy Lin


Abstract

Background: With the growing availability of full-text articles online, scientists and other consumers of the life sciences literature can now go beyond searching bibliographic records (title, abstract, metadata) and directly access full-text content. Motivated by this emerging trend, I posed the following question: is searching full text more effective than searching abstracts? This question is answered by comparing text retrieval algorithms on MEDLINE abstracts, full-text articles, and spans (paragraphs) within full-text articles, using data from the TREC 2007 genomics track evaluation. Two retrieval models are examined: BM25 and the ranking algorithm implemented in the open-source Lucene search engine.
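To make the BM25 retrieval model mentioned above concrete, here is a minimal sketch of the standard BM25 scoring formula in Python. It is not the paper's implementation: the function name `bm25_scores`, the parameter defaults (`k1=1.2`, `b=0.75`), the smoothed IDF variant, and the assumption of a non-empty, pre-tokenized collection are all illustrative choices.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against a tokenized `query`.

    Assumes `docs` is a non-empty list of token lists (hypothetical input
    format chosen for this sketch).
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant that stays positive for common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

The same function can score abstracts, whole articles, or paragraphs; only the choice of indexing unit passed in as `docs` changes, which is the variable the study compares.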

Results: Experiments show that treating an entire article as an indexing unit does not consistently yield higher effectiveness than abstract-only search. However, retrieval based on spans, or paragraph-sized segments of full-text articles, consistently outperforms abstract-only search. Results suggest that the highest overall effectiveness may be achieved by combining evidence from spans and full articles.
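One way to picture combining span-level and article-level evidence is sketched below in Python. The abstract does not say how the evidence is combined, so everything here is an assumption made for illustration: the blank-line paragraph splitter, the toy `overlap_score` scorer, the use of the best-scoring span, and the linear interpolation weight `alpha`.

```python
def split_into_spans(article_text):
    """Treat blank-line-separated paragraphs as spans (assumed convention)."""
    return [p.strip() for p in article_text.split("\n\n") if p.strip()]


def overlap_score(query_terms, text):
    """Toy scorer: fraction of query terms that occur in the text."""
    tokens = set(text.lower().split())
    return sum(t in tokens for t in query_terms) / len(query_terms)


def combined_article_score(query_terms, article_text, alpha=0.5,
                           scorer=overlap_score):
    """Interpolate the best span score with the whole-article score."""
    spans = split_into_spans(article_text)
    # Best-matching paragraph; 0.0 if the article has no text.
    best_span = max((scorer(query_terms, s) for s in spans), default=0.0)
    whole = scorer(query_terms, article_text)
    return alpha * best_span + (1 - alpha) * whole
```

In practice the `scorer` argument would be a real ranking function such as BM25; the toy scorer only keeps the example self-contained. With `alpha=1.0` the function reduces to pure span retrieval, and with `alpha=0.0` to whole-article retrieval.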

Conclusion: Users searching full text are more likely to find relevant articles than those searching only abstracts. This finding affirms the value of full-text collections for text retrieval and provides a starting point for future work on algorithms that take advantage of rapidly growing digital archives. Experimental results also highlight the need to develop distributed text retrieval algorithms, since full-text articles are significantly longer than abstracts and may require the computational resources of multiple machines in a cluster. The MapReduce programming model provides a convenient framework for organizing such computations.
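As a rough illustration of how MapReduce can organize a retrieval computation, the Python sketch below builds an inverted index with separate map and reduce steps, simulated in a single process. It is not the paper's code: the function names, the whitespace tokenizer, and the in-memory grouping that stands in for a cluster's shuffle phase are all assumptions.

```python
from collections import Counter, defaultdict


def map_phase(doc_id, text):
    """Map: emit (term, (doc_id, term_frequency)) for one document."""
    for term, tf in Counter(text.lower().split()).items():
        yield term, (doc_id, tf)


def reduce_phase(term, postings):
    """Reduce: collect all postings for one term into a sorted list."""
    return term, sorted(postings)


def build_inverted_index(docs):
    """Run map, group by key (the 'shuffle'), then reduce, all in-process."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_phase(doc_id, text):
            grouped[term].append(posting)
    return dict(reduce_phase(term, postings)
                for term, postings in grouped.items())
```

Because each map call sees only one document and each reduce call sees only one term, the two phases could in principle be spread across many machines, which is the property that matters when indexing long full-text articles.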

Citing Articles

Unsupervised learning and natural language processing highlight research trends in a superbug.

Mendez-Cruz C, Rodriguez-Herrera J, Varela-Vega A, Mateo-Estrada V, Castillo-Ramirez S. Front Artif Intell. 2024; 7:1336071.

PMID: 38576460 PMC: 10991725. DOI: 10.3389/frai.2024.1336071.

Predicting substantive biomedical citations without full text.

Hoppe T, Arabi S, Hutchins B. Proc Natl Acad Sci U S A. 2023; 120(30):e2213697120.

PMID: 37463199 PMC: 10372685. DOI: 10.1073/pnas.2213697120.

Towards a unified search: Improving PubMed retrieval with full text.

Kim W, Yeganova L, Comeau D, Wilbur W, Lu Z. J Biomed Inform. 2022; 134:104211.

PMID: 36152950 PMC: 9561061. DOI: 10.1016/j.jbi.2022.104211.

GeneCup: mining PubMed and GWAS catalog for gene-keyword relationships.

Gunturkun M, Flashner E, Wang T, Mulligan M, Williams R, Prins P. G3 (Bethesda). 2022; 12(5).

PMID: 35285473 PMC: 9073678. DOI: 10.1093/g3journal/jkac059.

Text mining for modeling of protein complexes enhanced by machine learning.

Badal V, Kundrotas P, Vakser I. Bioinformatics. 2020; 37(4):497-505.

PMID: 32960948 PMC: 8088328. DOI: 10.1093/bioinformatics/btaa823.
