Revealing and Avoiding Bias in Semantic Similarity Scores for Protein Pairs

Overview

Journal BMC Bioinformatics

Publisher BioMed Central

Specialty Biology

Date 2010 Jun 1

PMID 20509916

Citations 14

Authors

Jing Wang

Xianxiao Zhou

Jing Zhu

Chenggui Zhou

Zheng Guo


Abstract

Background: Semantic similarity scores for protein pairs are widely applied in functional genomics research for finding functional clusters of proteins, predicting protein functions and protein-protein interactions, and identifying putative disease genes. However, because some proteins, such as those related to diseases, tend to be studied more intensively, annotations are likely to be biased, which may affect applications based on semantic similarity measures. Thus, it is necessary to evaluate the effects of this bias on semantic similarity scores between proteins and then find a method to avoid them.

Results: First, we evaluated 14 commonly used semantic similarity scores for protein pairs and demonstrated that they were significantly correlated with the number of annotation terms for the proteins (also known as the protein annotation length). These results suggested that current applications of semantic similarity scores between proteins might be unreliable. Then, to reduce this annotation bias effect, we proposed normalizing the semantic similarity scores between proteins using a power transformation of the scores. We provide evidence that this improves performance in some applications.

Conclusions: Current semantic similarity measures for protein pairs are highly dependent on protein annotation lengths, which are subject to biological research bias. This affects applications that are based on these semantic similarity scores, especially in clustering studies that rely on score magnitudes. The normalized scores proposed in this paper can reduce the effects of this bias to some extent.
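The bias check described in the abstract can be illustrated with a minimal sketch, using only the Python standard library. It computes a Spearman rank correlation between pair scores and annotation lengths, and applies a plain fixed-exponent power transformation. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy data, the use of a per-pair annotation length, the choice of Spearman correlation, the function names, and the exponent. The abstract does not state the exact form of the transformation or how its parameters are chosen.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def power_transform(scores, exponent):
    """Raise each non-negative score to a fixed exponent (hypothetical choice)."""
    return [s ** exponent for s in scores]


# Toy data (invented): one annotation length and one similarity score per pair.
pair_annotation_lengths = [4, 7, 10, 15, 22, 30]
pair_scores = [0.10, 0.25, 0.22, 0.48, 0.61, 0.80]

bias = spearman(pair_annotation_lengths, pair_scores)
rescaled = power_transform(pair_scores, 0.5)
```

A value of `bias` close to 1 on real data would indicate that higher scores go with longer annotations. Note that a fixed positive exponent applied to non-negative scores preserves their ordering, so in this simplified form it changes score magnitudes but not rank-based statistics.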

Citing Articles

Integration of probabilistic functional networks without an external Gold Standard.

James K, Alsobhe A, Cockell S, Wipat A, Pocock M. BMC Bioinformatics. 2022; 23(1):302.

PMID: 35879662 PMC: 9316706. DOI: 10.1186/s12859-022-04834-4.

LePrimAlign: local entropy-based alignment of PPI networks to predict conserved modules.

Maskey S, Cho Y. BMC Genomics. 2019; 20(Suppl 9):964.

PMID: 31874635 PMC: 6929407. DOI: 10.1186/s12864-019-6271-3.

CommWalker: correctly evaluating modules in molecular networks in light of annotation bias.

Luecken M, Page M, Crosby A, Mason S, Reinert G, Deane C. Bioinformatics. 2017; 34(6):994-1000.

PMID: 29112702 PMC: 5860269. DOI: 10.1093/bioinformatics/btx706.

Exploring Approaches for Detecting Protein Functional Similarity within an Orthology-based Framework.

Weichenberger C, Palermo A, Pramstaller P, Domingues F. Sci Rep. 2017; 7(1):381.

PMID: 28336965 PMC: 5428484. DOI: 10.1038/s41598-017-00465-5.

Microbial Community Responses to Increased Water and Organic Matter in the Arid Soils of the McMurdo Dry Valleys, Antarctica.

Buelow H, Winter A, Van Horn D, Barrett J, Gooseff M, Schwartz E. Front Microbiol. 2016; 7:1040.

PMID: 27486436 PMC: 4947590. DOI: 10.3389/fmicb.2016.01040.
