
Automated Phrase Mining from Massive Text Corpora

Overview
Date 2019 May 21
PMID 31105412
Citations 10
Abstract

As one of the fundamental tasks in text analysis, phrase mining aims at extracting quality phrases from a text corpus and has various downstream applications, including information extraction/retrieval, taxonomy construction, and topic modeling. Most existing methods rely on complex, trained linguistic analyzers, and thus are likely to perform unsatisfactorily on text corpora of new domains and genres without additional, expensive adaptation. None of the state-of-the-art models, even the data-driven ones, is fully automated, because they require human experts to design rules or label phrases. In this paper, we propose a novel framework for automated phrase mining, AutoPhrase, which supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, while benefiting from, but not requiring, a POS tagger. Compared to the state-of-the-art methods, AutoPhrase has shown significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. In addition, AutoPhrase can be extended to model single-word quality phrases.
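
The abstract does not say how a general knowledge base stands in for human labeling. One plausible reading is distant supervision: frequent n-grams that match a knowledge-base title are treated as positive examples, and the rest are left unlabeled. The sketch below illustrates only that idea and is not the AutoPhrase implementation. The function names, the whitespace tokenizer, the `min_count` threshold, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def candidate_ngrams(tokens, max_len=3):
    """Yield every contiguous n-gram of 2..max_len tokens as a tuple."""
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def label_candidates(docs, knowledge_base, min_count=2, max_len=3):
    """Split frequent n-grams into a positive pool (found in the knowledge
    base) and an unlabeled pool (everything else).

    Hypothetical helper: whitespace tokenization and a raw frequency
    threshold are simplifying assumptions, not details from the paper.
    """
    counts = Counter()
    for doc in docs:
        counts.update(candidate_ngrams(doc.lower().split(), max_len))

    # Normalize knowledge-base titles the same way as the corpus text.
    kb = {tuple(title.lower().split()) for title in knowledge_base}

    frequent = {gram for gram, c in counts.items() if c >= min_count}
    positives = sorted(g for g in frequent if g in kb)
    unlabeled = sorted(g for g in frequent if g not in kb)
    return positives, unlabeled


# Toy corpus and toy knowledge-base titles (illustrative only).
docs = [
    "phrase mining supports topic modeling",
    "topic modeling and information extraction",
    "information extraction uses phrase mining",
]
kb_titles = ["Topic modeling", "Information extraction"]

positives, unlabeled = label_candidates(docs, kb_titles)
print("positive pool:", positives)
print("unlabeled pool:", unlabeled)
```

In a setup like this, the two pools could feed a phrase-quality classifier with no manual annotation. A POS tagger, where one exists, could supply extra features but would not be required.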

Citing Articles

Weakly Supervised Concept Map Generation through Task-Guided Graph Translation.

Lu J, Dong X, Yang C IEEE Trans Knowl Data Eng. 2024; 35(10):10871-10883.

PMID: 38389564 PMC: 10883073. DOI: 10.1109/tkde.2023.3252588.


Use of Machine Learning Tools in Evidence Synthesis of Tobacco Use Among Sexual and Gender Diverse Populations: Algorithm Development and Validation.

Ma S, Jiang S, Yang O, Zhang X, Fu Y, Zhang Y JMIR Form Res. 2024; 8:e49031.

PMID: 38265858 PMC: 10851114. DOI: 10.2196/49031.


Heterogeneous Network Representation Learning: A Unified Framework with Survey and Benchmark.

Yang C, Xiao Y, Zhang Y, Sun Y, Han J IEEE Trans Knowl Data Eng. 2023; 34(10):4854-4873.

PMID: 37915376 PMC: 10619966. DOI: 10.1109/tkde.2020.3045924.


An Empirical Methodology for Detecting and Prioritizing Needs during Crisis Events.

Sarol M, Dinh L, Rezapour R, Chin C, Yang P, Diesner J Proc Conf Empir Methods Nat Lang Process. 2023; 2020:4102-4107.

PMID: 37908863 PMC: 10616930. DOI: 10.18653/v1/2020.findings-emnlp.366.


Thyroidkeeper: a healthcare management system for patients with thyroid diseases.

Zhang J, Li J, Zhu Y, Fu Y, Chen L Health Inf Sci Syst. 2023; 11(1):49.

PMID: 37860050 PMC: 10582002. DOI: 10.1007/s13755-023-00251-w.


References
1. Liu J, Shang J, Wang C, Ren X, Han J. Mining Quality Phrases from Massive Text Corpora. Proc ACM SIGMOD Int Conf Manag Data. 2015; 2015:1729-1744. PMC: 4688018. DOI: 10.1145/2723372.2751523.