PMID: 34853666

Natural Language-based Integration of Online Review Datasets for Identification of Sex Trafficking Businesses

Abstract

There is increasing interest in automatically identifying advertisements related to sex trafficking on online review sites. The main challenge is to identify the changing patterns in text reviews that indicate illegal businesses. This work describes a novel means of identifying illegal business advertisements using natural language processing and machine learning. The method relies on building a training set of reviews of known illegal businesses. This training data is created by integrating a small, high-precision set of known illegal businesses (Rubmaps) with a large collection of online reviews from a general-purpose review site (Yelp). Standard natural language pre-processing techniques are then applied to the text reviews, which are converted into a bag-of-words model with term frequency-inverse document frequency (TF-IDF) weighting. The resulting document-term matrix is used to train a classifier, which is then used to identify suspicious activity in the remaining reviews. This approach therefore leverages a high-precision, low-recall dataset to identify relevant instances in a large low-precision, high-recall dataset. The approach was evaluated on a collection of 456,050 reviews from the Yelp online forum with a variety of machine learning algorithms and different numbers of text features. The method achieved an F1-score of 0.77 with a random forest classifier. The number of text features could also be reduced from 1,473 to 447, yielding a more compact classifier with only a small drop in accuracy.
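The steps summarized in the abstract (labeling reviews from a seed set, TF-IDF weighting of a bag-of-words model, and scoring with F1) can be illustrated with a minimal sketch. This is not the authors' implementation. The toy reviews, the business keys, and the helper names `label_reviews`, `tfidf_matrix`, and `f1_score` are hypothetical, and the abstract does not say how Rubmaps and Yelp entries were matched, so a shared business key is assumed here. The IDF variant `ln(N / df) + 1` is likewise one common choice, not necessarily the one used in the paper.

```python
import math
from collections import Counter

# Hypothetical toy data: (business key, review text). Real inputs would be
# Yelp reviews plus a seed list of businesses that also appear on Rubmaps.
reviews = [
    ("spa_a", "great massage friendly staff"),
    ("spa_b", "private room cash only late night"),
    ("cafe_c", "great coffee friendly staff"),
]
known_illicit = {"spa_b"}  # assumed small, high-precision seed set


def label_reviews(reviews, known_illicit):
    """Label a review 1 if its business is in the seed set, else 0."""
    return [1 if key in known_illicit else 0 for key, _ in reviews]


def tfidf_matrix(texts):
    """Build a document-term matrix with TF-IDF weights.

    Returns (vocab, rows), where rows[i][j] is the weight of vocab[j] in
    texts[i]. TF is the relative term count; IDF is ln(N / df) + 1
    (an assumed variant).
    """
    tokenized = [text.lower().split() for text in texts]
    vocab = sorted({word for doc in tokenized for word in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each word.
    df = Counter(word for doc in tokenized for word in set(doc))
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for empty reviews
        rows.append([
            counts[word] / length * (math.log(n_docs / df[word]) + 1.0)
            for word in vocab
        ])
    return vocab, rows


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


labels = label_reviews(reviews, known_illicit)
vocab, matrix = tfidf_matrix([text for _, text in reviews])
```

In a fuller pipeline, `matrix` and `labels` would be passed to a classifier such as a random forest, and `f1_score` would be computed on held-out reviews. Words that occur in few reviews (here "massage") receive a higher weight than words shared across many reviews (here "great").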

Citing Articles

Operations research and analytics to combat human trafficking: A systematic review of academic literature.

Dimas G, Konrad R, Maass K, Trapp A PLoS One. 2022; 17(8):e0273708.

PMID: 36037198 PMC: 9423650. DOI: 10.1371/journal.pone.0273708.
