
Unsupervised Learning of Natural Languages

Overview
Specialty: Science
Date: 2005 Aug 10
PMID: 16087885
Citations: 34
Abstract

We address the problem, fundamental to linguistics, bioinformatics, and certain other disciplines, of using corpora of raw symbolic sequential data to infer underlying rules that govern their production. Given a corpus of strings (such as text, transcribed speech, chromosome or protein sequence data, sheet music, etc.), our unsupervised algorithm recursively distills from it hierarchically structured patterns. The ADIOS (automatic distillation of structure) algorithm relies on a statistical method for pattern extraction and on structured generalization, two processes that have been implicated in language acquisition. It has been evaluated on artificial context-free grammars with thousands of rules, on natural languages as diverse as English and Chinese, and on protein data correlating sequence with function. This unsupervised algorithm is capable of learning complex syntax, generating grammatical novel sentences, and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics.
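
As a rough illustration of what recursively distilling hierarchical patterns from a corpus of strings can look like, here is a minimal Python sketch. It is not the algorithm described in the abstract: it swaps in a much simpler greedy pair-merging scheme (in the spirit of Re-Pair or byte-pair encoding) and has no statistical significance test or generalization step. The function names `distill`, `merge`, and `expand`, the `min_count` threshold, and the `P1`, `P2`, ... pattern labels are all assumptions made for this example.

```python
from collections import Counter


def merge(sent, a, b, name):
    """Replace each non-overlapping occurrence of the pair (a, b) with `name`."""
    out, i = [], 0
    while i < len(sent):
        if i + 1 < len(sent) and sent[i] == a and sent[i + 1] == b:
            out.append(name)
            i += 2
        else:
            out.append(sent[i])
            i += 1
    return out


def distill(corpus, min_count=2):
    """Toy pattern extraction: repeatedly merge the most frequent adjacent pair.

    Returns (rules, sentences), where rules maps a hypothetical pattern label
    such as "P1" to the pair of symbols it stands for. Labels may themselves
    appear inside later rules, which gives a hierarchy of patterns.
    Assumes the corpus contains no tokens that look like "P1", "P2", ...
    """
    sentences = [s.split() for s in corpus]
    rules = {}
    while True:
        pairs = Counter()
        for sent in sentences:
            pairs.update(zip(sent, sent[1:]))
        if not pairs:
            break
        # Highest count wins; ties are broken by the pair itself for determinism.
        (a, b), count = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if count < min_count:
            break
        name = f"P{len(rules) + 1}"
        rules[name] = (a, b)
        sentences = [merge(sent, a, b, name) for sent in sentences]
    return rules, sentences


def expand(symbol, rules):
    """Recursively expand a pattern label back into its original tokens."""
    if symbol not in rules:
        return [symbol]
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


if __name__ == "__main__":
    rules, sentences = distill(["the cat sat", "the cat ran", "the dog sat"])
    print(rules)
    print(sentences)
```

Each merge shortens at least one sentence, so the loop always terminates. A pair must recur at least `min_count` times before it becomes a pattern, which is a crude stand-in for the statistical criterion the abstract mentions.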

Citing Articles

(Mis)align: a simple dynamic framework for modeling interpersonal coordination.

Miao G, Dale R, Galati A. Sci Rep. 2023; 13(1):18325.

PMID: 37884542 PMC: 10603172. DOI: 10.1038/s41598-023-41516-4.


Morphology in a Parallel, Distributed, Interactive Architecture of Language Production.

Kapatsinski V. Front Artif Intell. 2022; 5:803259.

PMID: 35310958 PMC: 8927966. DOI: 10.3389/frai.2022.803259.


Models of Language and Multiword Expressions.

Contreras Kallens P, Christiansen M. Front Artif Intell. 2022; 5:781962.

PMID: 35252848 PMC: 8892141. DOI: 10.3389/frai.2022.781962.


Modelling how cleaner fish approach an ephemeral reward task demonstrates a role for ecologically tuned chunking in the evolution of advanced cognition.

Prat Y, Bshary R, Lotem A. PLoS Biol. 2022; 20(1):e3001519.

PMID: 34986149 PMC: 8765642. DOI: 10.1371/journal.pbio.3001519.


Physician Knowledge Base: Clinical Decision Support Systems.

Kim S, Kim E, Kim H. Yonsei Med J. 2021; 63(1):8-15.

PMID: 34913279 PMC: 8688369. DOI: 10.3349/ymj.2022.63.1.8.

