Evaluation of the Vector Space Representation in Text-based Gene Clustering
Thanks to its increasing availability, electronic literature can now be a major source of information when developing complex statistical models where data are scarce or noisy. This raises the question of how to integrate information from domain literature deeply with experimental data. Which statistical text representations can bring literature knowledge into clustering remains insufficiently explored. In this work we discuss how the bag-of-words representation can be used successfully to represent genetic annotation and free-text information coming from different databases. We demonstrate the effect of various weighting schemes and information sources in a functional clustering setup. As a quantitative evaluation, we contrast, for different parameter settings, the functional groupings obtained from text with those obtained from expert assessments, and we link each result to a biological discussion.
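To make the bag-of-words idea concrete, the following is a minimal sketch and not the authors' pipeline. It assumes each gene is described by a short text, applies one common weighting scheme (term frequency times inverse document frequency, chosen here only for illustration), and compares genes by cosine similarity. The gene names, the toy texts, and the function names `tfidf_vectors` and `cosine` are all hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document name to a sparse {term: weight} vector.

    docs: dict of name -> list of tokens (the "bag of words").
    Weight = raw term count * log(N / document frequency).
    This is one possible weighting scheme, assumed for illustration.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document

    vectors = {}
    for name, tokens in docs.items():
        term_counts = Counter(tokens)
        vectors[name] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy annotations; real input would come from curated databases.
gene_docs = {
    "geneA": "dna repair damage response".split(),
    "geneB": "dna repair recombination".split(),
    "geneC": "lipid transport membrane".split(),
}

vecs = tfidf_vectors(gene_docs)
sim_ab = cosine(vecs["geneA"], vecs["geneB"])
sim_ac = cosine(vecs["geneA"], vecs["geneC"])
```

In this toy setting, `geneA` and `geneB` share annotation terms and so receive a positive similarity, while `geneA` and `geneC` share none and receive zero. A pairwise similarity matrix built this way could then be passed to any standard clustering method and the resulting groups compared against expert-defined functional groupings.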