Open-source LLMs for Text Annotation: a Practical Guide for Model Setting and Fine-tuning

Overview

Journal J Comput Soc Sci

Date 2024 Dec 23

PMID 39712076

Authors

Meysam Alizadeh

Mael Kubli

Zeynab Samei

Shirin Dehghani

Mohammadmasiha Zahedivafa

Juan D Bermeo

Maria Korobeynikova

Fabrizio Gilardi

Abstract

Supplementary Information: The online version contains supplementary material available at 10.1007/s42001-024-00345-9.

Citing Articles

Developing a named entity framework for thyroid cancer staging and risk level classification using large language models.

Fung M, Tang E, Wu T, Luk Y, Au I, Liu X. NPJ Digit Med. 2025; 8(1):134.

PMID: 40025285 PMC: 11873034. DOI: 10.1038/s41746-025-01528-y.

References

Rudin C. Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. Nat Mach Intell. 2019; 1(5):206-215. PMC: 9122117. DOI: 10.1038/s42256-019-0048-x.

Gilardi F, Alizadeh M, Kubli M. ChatGPT outperforms crowd workers for text-annotation tasks. Proc Natl Acad Sci U S A. 2023; 120(30):e2305016120. PMC: 10372638. DOI: 10.1073/pnas.2305016120.

Frei J, Kramer F. Annotated dataset creation through large language models for non-English medical NLP. J Biomed Inform. 2023; 145:104478. DOI: 10.1016/j.jbi.2023.104478.

van Dis E, Bollen J, Zuidema W, van Rooij R, Bockting C. ChatGPT: five priorities for research. Nature. 2023; 614(7947):224-226. DOI: 10.1038/d41586-023-00288-7.

Alizadeh M, Hoes E, Gilardi F. Tokenization of social media engagements increases the sharing of false (and other) news but penalization moderates it. Sci Rep. 2023; 13(1):13703. PMC: 10444751. DOI: 10.1038/s41598-023-40716-2.