Author Name Disambiguation for PubMed

Overview
Date 2017 Aug 1
PMID 28758138
Citations 19
Abstract

Log analysis shows that PubMed users frequently use author names in queries for retrieving scientific literature. However, author name ambiguity may lead to irrelevant retrieval results. To improve the PubMed user experience with author name queries, we designed an author name disambiguation system consisting of similarity estimation and agglomerative clustering. A machine-learning method was employed to score the features for disambiguating a pair of papers with ambiguous names. These features enable the computation of pairwise similarity scores that estimate the probability that a pair of papers belongs to the same author. This probability drives an agglomerative clustering algorithm regulated by 2 factors: name compatibility and probability level. With transitivity violation correction, high-precision author clustering is achieved by focusing on minimizing false-positive pairing. Disambiguation performance is evaluated by manual verification of random samples of pairs from the clustering results. When compared with a state-of-the-art system, our evaluation shows that, among all the pairs, the lumping error rate drops from 10.1% to 2.2% for our system, while the splitting error rate rises from 1.8% to 7.7%. This results in an overall error rate of 9.9% (2.2% + 7.7%), compared with 11.9% (10.1% + 1.8%) for the state-of-the-art method. Other evaluations based on gold-standard data also show an increase in the accuracy of our clustering. We attribute the performance improvement to the machine-learning method driven by a large-scale training set and to the clustering algorithm regulated by a name compatibility scheme that prefers precision. With integration of the author name disambiguation system into the PubMed search engine, the overall click-through rate of PubMed users on author name query results improved from 34.9% to 36.9%.
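
The abstract describes agglomerative clustering driven by pairwise same-author probabilities and regulated by name compatibility and a probability level. The following is a minimal sketch of that general idea, not the authors' implementation. The prefix-based compatibility rule, the average-linkage score, the 0.9 threshold, and all names (`names_compatible`, `cluster_papers`, the toy data) are assumptions made for illustration; in the paper the probabilities come from a trained machine-learning model, whereas here they are hard-coded.

```python
from itertools import combinations


def names_compatible(a, b):
    """Hypothetical rule: same last name, and one first-name string
    is a prefix of the other (so "j" is compatible with "john")."""
    last_a, first_a = a
    last_b, first_b = b
    if last_a != last_b:
        return False
    return first_a.startswith(first_b) or first_b.startswith(first_a)


def cluster_papers(papers, prob, threshold=0.9):
    """Greedy agglomerative clustering.

    papers: dict mapping paper id -> (last_name, first_name_or_initial)
    prob:   dict mapping frozenset({i, j}) -> assumed same-author probability
    Two clusters may merge only if every cross pair has compatible names
    and the average cross-pair probability reaches the threshold.
    """
    clusters = [{pid} for pid in papers]

    def link(c1, c2):
        total = 0.0
        for i in c1:
            for j in c2:
                if not names_compatible(papers[i], papers[j]):
                    return None  # name conflict blocks the merge
                total += prob.get(frozenset((i, j)), 0.0)
        return total / (len(c1) * len(c2))

    while True:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            score = link(clusters[x], clusters[y])
            if score is None or score < threshold:
                continue
            if best is None or score > best[0]:
                best = (score, x, y)
        if best is None:
            return clusters
        _, x, y = best  # x < y, so deleting y leaves index x valid
        clusters[x] |= clusters[y]
        del clusters[y]


# Toy data (invented for illustration only).
papers = {
    1: ("smith", "john"),
    2: ("smith", "j"),
    3: ("smith", "jane"),
    4: ("smith", "j"),
}
prob = {
    frozenset((1, 2)): 0.95,
    frozenset((2, 3)): 0.92,
    frozenset((1, 3)): 0.10,
    frozenset((2, 4)): 0.50,
    frozenset((1, 4)): 0.40,
    frozenset((3, 4)): 0.30,
}
result = sorted(sorted(c) for c in cluster_papers(papers, prob))
print(result)
```

In this toy run, papers 1 and 2 merge first. Paper 3 then cannot join them even though its probability with paper 2 is high, because "jane" conflicts with "john" already in the cluster. This mirrors the precision-first trade-off reported in the abstract: fewer lumping errors at the cost of more splitting errors.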

Citing Articles

Author name disambiguation based on heterogeneous graph neural network.

Wang G, Sun Z, Hu W, Cai M PLoS One. 2025; 20(2):e0310992.

PMID: 40009590 PMC: 11864548. DOI: 10.1371/journal.pone.0310992.


PubMed Computed Authors in 2024: an open resource of disambiguated author names in biomedical literature.

Tian S, Chen Q, Comeau D, Wilbur W, Lu Z Bioinformatics. 2024; 40(11).

PMID: 39520405 PMC: 11588201. DOI: 10.1093/bioinformatics/btae672.


An analysis of the effects of sharing research data, code, and preprints on citations.

Colavizza G, Cadwallader L, LaFlamme M, Dozot G, Lecorney S, Rappo D PLoS One. 2024; 19(10):e0311493.

PMID: 39475849 PMC: 11524460. DOI: 10.1371/journal.pone.0311493.


Bridging the gap in author names: building an enhanced author name dataset for biomedical literature system.

Zhang L, Song N, Gui S, Wu K, Lu W J Am Med Inform Assoc. 2024; 31(8):1648-1656.

PMID: 38916911 PMC: 11258411. DOI: 10.1093/jamia/ocae127.


Development and Validation of an Automated Tool to Retrieve and Curate Faculty Publications of Academic Departments.

Epstein R, Mueller D, Walco J, Manresa C, Banks S, Freundlich R Cureus. 2023; 15(10):e47976.

PMID: 38034270 PMC: 10685054. DOI: 10.7759/cureus.47976.

