
Is Searching Full Text More Effective Than Searching Abstracts?

Overview
Publisher Biomed Central
Specialty Biology
Date 2009 Feb 5
PMID 19192280
Citations 23
Abstract

Background: With the growing availability of full-text articles online, scientists and other consumers of the life sciences literature now have the ability to go beyond searching bibliographic records (title, abstract, metadata) to directly access full-text content. Motivated by this emerging trend, I posed the following question: is searching full text more effective than searching abstracts? This question is answered by comparing text retrieval algorithms on MEDLINE abstracts, full-text articles, and spans (paragraphs) within full-text articles using data from the TREC 2007 genomics track evaluation. Two retrieval models are examined: BM25 and the ranking algorithm implemented in the open-source Lucene search engine.
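For readers unfamiliar with BM25, the following is a minimal sketch of the standard Okapi BM25 scoring function over tokenized documents. It is not the paper's implementation: the function name `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, and the smoothed IDF variant are assumptions chosen for illustration, and the abstract does not state which settings were used. The same function applies whether the indexing unit is an abstract, a full article, or a span.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query` with Okapi BM25.

    k1 and b are common default values, assumed here for illustration.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization penalizes documents longer than average.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

The length-normalization term is what makes the choice of indexing unit matter: a full article is far longer than an abstract, so the same term frequency contributes less to its score.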

Results: Experiments show that treating an entire article as an indexing unit does not consistently yield higher effectiveness compared to abstract-only search. However, retrieval based on spans, or paragraph-sized segments of full-text articles, consistently outperforms abstract-only search. Results suggest that the highest overall effectiveness may be achieved by combining evidence from spans and full articles.
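One way to picture combining evidence from spans and full articles is a weighted mix of an article's best span score and its whole-article score. The sketch below is an assumption for illustration only: the abstract does not say how the evidence was combined, and the function `rank_articles`, the use of the maximum span score, and the `weight` parameter are all hypothetical. It also assumes both score types are on comparable scales.

```python
def rank_articles(span_scores, article_scores, weight=0.5):
    """Rank article IDs by a weighted mix of best-span and whole-article scores.

    span_scores: dict mapping article ID -> list of scores for its spans.
    article_scores: dict mapping article ID -> score of the whole article.
    weight: hypothetical mixing weight; 1.0 uses spans only, 0.0 articles only.
    """
    combined = {}
    for article_id, whole_score in article_scores.items():
        # An article with no scored spans contributes 0.0 span evidence.
        best_span = max(span_scores.get(article_id, []), default=0.0)
        combined[article_id] = weight * best_span + (1 - weight) * whole_score
    # Highest combined score first.
    return sorted(combined, key=combined.get, reverse=True)
```

Setting `weight` to 1.0 or 0.0 recovers span-only or article-only ranking, which makes it easy to compare the three configurations on the same data.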

Conclusion: Users searching full text are more likely to find relevant articles than those searching only abstracts. This finding affirms the value of full-text collections for text retrieval and provides a starting point for future work in exploring algorithms that take advantage of rapidly growing digital archives. Experimental results also highlight the need to develop distributed text retrieval algorithms, since full-text articles are significantly longer than abstracts and may require the computational resources of multiple machines in a cluster. The MapReduce programming model provides a convenient framework for organizing such computations.
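To illustrate how the MapReduce model organizes a retrieval computation, here is a minimal single-process sketch that builds an inverted index in map, shuffle, and reduce steps. It only simulates the programming model; it does not distribute work across machines, and the function names `map_phase`, `reduce_phase`, and `build_index` are hypothetical rather than taken from the paper or from any MapReduce framework.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit (term, doc_id) pairs for one document."""
    for term in text.lower().split():
        yield term, doc_id

def reduce_phase(term, doc_ids):
    """Collapse all doc IDs for a term into a sorted postings list."""
    return term, sorted(set(doc_ids))

def build_index(docs):
    """Run map, shuffle (group by key), and reduce in a single process."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, value in map_phase(doc_id, text):
            grouped[term].append(value)
    return dict(reduce_phase(term, ids) for term, ids in grouped.items())
```

Because each call to `map_phase` depends only on one document and each call to `reduce_phase` only on one term, a real framework could run them on separate machines, which is the property that makes the model suited to long full-text articles.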

Citing Articles

Unsupervised learning and natural language processing highlight research trends in a superbug.

Mendez-Cruz C, Rodriguez-Herrera J, Varela-Vega A, Mateo-Estrada V, Castillo-Ramirez S Front Artif Intell. 2024; 7:1336071.

PMID: 38576460 PMC: 10991725. DOI: 10.3389/frai.2024.1336071.


Predicting substantive biomedical citations without full text.

Hoppe T, Arabi S, Hutchins B Proc Natl Acad Sci U S A. 2023; 120(30):e2213697120.

PMID: 37463199 PMC: 10372685. DOI: 10.1073/pnas.2213697120.


Towards a unified search: Improving PubMed retrieval with full text.

Kim W, Yeganova L, Comeau D, Wilbur W, Lu Z J Biomed Inform. 2022; 134:104211.

PMID: 36152950 PMC: 9561061. DOI: 10.1016/j.jbi.2022.104211.


GeneCup: mining PubMed and GWAS catalog for gene-keyword relationships.

Gunturkun M, Flashner E, Wang T, Mulligan M, Williams R, Prins P G3 (Bethesda). 2022; 12(5).

PMID: 35285473 PMC: 9073678. DOI: 10.1093/g3journal/jkac059.


Text mining for modeling of protein complexes enhanced by machine learning.

Badal V, Kundrotas P, Vakser I Bioinformatics. 2020; 37(4):497-505.

PMID: 32960948 PMC: 8088328. DOI: 10.1093/bioinformatics/btaa823.

