
Multi-label Multi-class COVID-19 Arabic Twitter Dataset with Fine-grained Misinformation and Situational Information Annotations

Overview
Date 2022 Dec 19
PMID 36532803
Abstract

Since the inception of the current COVID-19 pandemic, related misleading information has spread at a remarkable rate on social media, leading to serious implications for individuals and societies. Although COVID-19 appears to be ending in most places after the sharp shock of Omicron, severe new variants can emerge and cause new waves, especially if they can evade the insufficient immunity provided by prior infection and incomplete vaccination. Fighting the fake news that promotes vaccine hesitancy, for instance, is crucial for the success of global vaccination programs and thus for achieving herd immunity. To combat the proliferation of COVID-19-related misinformation, considerable research efforts have been and are still being dedicated to building and sharing COVID-19 misinformation detection datasets and models for Arabic and other languages. However, most of these datasets provide only binary (true/false) misinformation classifications. Moreover, the few studies that support multi-class misinformation classification either deal with a small set of misinformation classes or mix them with situational information classes. False news stories about COVID-19 are not equal; some tend to have more sinister effects than others (e.g., fake cures and false vaccine information). This suggests that identifying the sub-type of misinformation is critical for choosing a suitable action based on its level of seriousness, ranging from assigning a warning label to the suspicious post to removing the misleading post immediately. In this work, we develop comprehensive annotation guidelines that define 19 fine-grained misinformation classes. We then release the first Arabic COVID-19-related misinformation dataset, comprising about 6.7K tweets with multi-class and multi-label misinformation annotations. In addition, we release a version of the dataset that is the first Arabic Twitter dataset annotated exclusively with six different situational information classes. Identifying situational information (e.g., caution, help-seeking) helps authorities and individuals understand the situation during emergencies. To confirm the validity of the collected data, we define three classification tasks and experiment with various machine learning and transformer-based classifiers to offer baseline results for future research. The experimental results indicate the quality and validity of the data and its suitability for constructing misinformation and situational information classification models. The results also demonstrate the superiority of AraBERT-COV19, a transformer-based model pretrained on COVID-19-related tweets, with micro-averaged F-scores of 81.6% and 78.8% for the multi-class misinformation and situational information classification tasks, respectively. For multi-label misinformation classification, Label Powerset with linear SVC achieved the best performance among the presented methods, with a micro-averaged F-score of 76.69%.
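The abstract reports a Label Powerset transformation and micro-averaged F-scores for the multi-label task. As a rough illustration only, and not the authors' code, the following sketch shows how a Label Powerset encoding turns each distinct combination of labels into a single class, and how a micro-averaged F-score pools true positives, false positives, and false negatives across all labels. The label names (`fake_cure`, `vaccine_misinfo`), the toy data, and the function names are hypothetical; the classifier itself (linear SVC in the paper) is omitted and replaced by hand-written predictions.

```python
from typing import Dict, FrozenSet, List, Sequence, Set, Tuple


def powerset_encode(
    label_sets: Sequence[Set[str]],
) -> Tuple[List[int], Dict[FrozenSet[str], int]]:
    """Label Powerset: map every distinct label combination to one class id."""
    class_of: Dict[FrozenSet[str], int] = {}
    encoded: List[int] = []
    for labels in label_sets:
        key = frozenset(labels)
        if key not in class_of:
            class_of[key] = len(class_of)  # next unused class id
        encoded.append(class_of[key])
    return encoded, class_of


def powerset_decode(
    class_ids: Sequence[int], class_of: Dict[FrozenSet[str], int]
) -> List[Set[str]]:
    """Turn predicted class ids back into label sets."""
    inverse = {cid: set(key) for key, cid in class_of.items()}
    return [inverse[cid] for cid in class_ids]


def micro_f1(true_sets: Sequence[Set[str]], pred_sets: Sequence[Set[str]]) -> float:
    """Micro-averaged F1: pool TP/FP/FN over all labels and all examples."""
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, pred_sets):
        tp += len(truth & pred)
        fp += len(pred - truth)
        fn += len(truth - pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical gold annotations for four tweets (label names are assumptions).
gold = [
    {"fake_cure"},
    {"fake_cure", "vaccine_misinfo"},
    {"vaccine_misinfo"},
    {"fake_cure"},
]
encoded, class_of = powerset_encode(gold)

# Stand-in for a single-label classifier's output over the powerset classes;
# the second tweet is mapped to the "vaccine_misinfo only" class by mistake.
predicted_ids = [0, 2, 2, 0]
predicted = powerset_decode(predicted_ids, class_of)

score = micro_f1(gold, predicted)
print(encoded, round(score, 4))
```

In this toy run, one label is missed on the second tweet, so the pooled counts are 4 true positives, 0 false positives, and 1 false negative. A practical consequence of the Label Powerset design is that the classifier can only predict label combinations it has seen during training.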

Citing Articles

Understanding the determinants of vaccine hesitancy in the United States: A comparison of social surveys and social media.

Sasse K, Mahabir R, Gkountouna O, Crooks A, Croitoru A. PLoS One. 2024; 19(6):e0301488.

PMID: 38843170 PMC: 11156396. DOI: 10.1371/journal.pone.0301488.


Special issue on analysis and mining of social media data.

Zubiaga A, Rosso P. PeerJ Comput Sci. 2024; 10:e1909.

PMID: 38435569 PMC: 10909232. DOI: 10.7717/peerj-cs.1909.
