Is Searching Full Text More Effective Than Searching Abstracts?
Background: With the growing availability of full-text articles online, scientists and other consumers of the life sciences literature can now go beyond searching bibliographic records (title, abstract, metadata) and directly access full-text content. Motivated by this emerging trend, I posed the following question: is searching full text more effective than searching abstracts? This question is answered by comparing text retrieval algorithms on MEDLINE abstracts, full-text articles, and spans (paragraphs) within full-text articles, using data from the TREC 2007 genomics track evaluation. Two retrieval models are examined: BM25 and the ranking algorithm implemented in the open-source Lucene search engine.
Results: Experiments show that treating an entire article as an indexing unit does not consistently yield higher effectiveness than abstract-only search. However, retrieval based on spans, or paragraph-sized segments of full-text articles, consistently outperforms abstract-only search. Results suggest that the highest overall effectiveness may be achieved by combining evidence from spans and full articles.
Conclusion: Users searching full text are more likely to find relevant articles than users searching only abstracts. This finding affirms the value of full-text collections for text retrieval and provides a starting point for future work on algorithms that take advantage of rapidly growing digital archives. Experimental results also highlight the need to develop distributed text retrieval algorithms, since full-text articles are significantly longer than abstracts and may require the computational resources of multiple machines in a cluster. The MapReduce programming model provides a convenient framework for organizing such computations.
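To make the span-based retrieval idea concrete, the following is a minimal sketch, not the implementation evaluated in the study. It scores paragraph-sized spans with one common BM25 variant and ranks each article by its best-scoring span. The function names, the parameter defaults (k1 = 1.2, b = 0.75), the whitespace-style token lists, and the max-over-spans aggregation are all illustrative assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query.

    Uses a common BM25 variant with a non-negative IDF; the exact
    formula and parameters used in the study are not assumed here.
    """
    if not docs:
        return []
    n = len(docs)
    # Average document length; fall back to 1.0 if every document is empty.
    avgdl = (sum(len(d) for d in docs) / n) or 1.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def rank_articles_by_best_span(query, articles):
    """Rank article ids by the highest BM25 score among their spans.

    `articles` maps a hypothetical article id to a list of spans,
    where each span is a list of tokens.
    """
    owners, spans = [], []
    for article_id, article_spans in articles.items():
        for span in article_spans:
            owners.append(article_id)
            spans.append(span)
    best = {article_id: 0.0 for article_id in articles}
    for article_id, score in zip(owners, bm25_scores(query, spans)):
        best[article_id] = max(best[article_id], score)
    return sorted(best, key=best.get, reverse=True)
```

Treating each span as its own indexing unit means document-length normalization operates at the paragraph level, which is one plausible reason a long article with a single highly relevant paragraph can rank well under this scheme.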
Unsupervised learning and natural language processing highlight research trends in a superbug.
Mendez-Cruz C, Rodriguez-Herrera J, Varela-Vega A, Mateo-Estrada V, Castillo-Ramirez S Front Artif Intell. 2024; 7:1336071.
PMID: 38576460 PMC: 10991725. DOI: 10.3389/frai.2024.1336071.
Predicting substantive biomedical citations without full text.
Hoppe T, Arabi S, Hutchins B Proc Natl Acad Sci U S A. 2023; 120(30):e2213697120.
PMID: 37463199 PMC: 10372685. DOI: 10.1073/pnas.2213697120.
Towards a unified search: Improving PubMed retrieval with full text.
Kim W, Yeganova L, Comeau D, Wilbur W, Lu Z J Biomed Inform. 2022; 134:104211.
PMID: 36152950 PMC: 9561061. DOI: 10.1016/j.jbi.2022.104211.
GeneCup: mining PubMed and GWAS catalog for gene-keyword relationships.
Gunturkun M, Flashner E, Wang T, Mulligan M, Williams R, Prins P G3 (Bethesda). 2022; 12(5).
PMID: 35285473 PMC: 9073678. DOI: 10.1093/g3journal/jkac059.
Text mining for modeling of protein complexes enhanced by machine learning.
Badal V, Kundrotas P, Vakser I Bioinformatics. 2020; 37(4):497-505.
PMID: 32960948 PMC: 8088328. DOI: 10.1093/bioinformatics/btaa823.