ORIGINAL RESEARCH article
Front. Big Data
Sec. Data Science
Volume 7 - 2024 | doi: 10.3389/fdata.2024.1501154
Enhancing Sentiment and Intent Analysis in Public Health via Fine-Tuned Large Language Models on Tobacco and E-cigarette-Related Tweets
Provisionally accepted
University of Bath, Bath, United Kingdom
Background: Accurate sentiment analysis and intent categorisation of tobacco and e-cigarette-related social media content are critical for public health research, yet they require specialised natural language processing approaches.
Objective: To compare pre-trained and fine-tuned Flan-T5 models for intent classification and sentiment analysis of tobacco and e-cigarette tweets, demonstrating the effectiveness of fine-tuning a lightweight large language model for domain-specific tasks.
Methods: Three Flan-T5 classification models were developed: (1) tobacco intent, (2) e-cigarette intent, and (3) sentiment analysis. Domain-specific datasets of tobacco and e-cigarette tweets were created using GPT-4 and validated by tobacco control specialists, who applied a standardized rubric and a consensus mechanism to ensure dataset quality. The Flan-T5 models were fine-tuned using Low-Rank Adaptation (LoRA) and evaluated against pre-trained baselines on these datasets using accuracy. To further assess generalizability and robustness, the fine-tuned models were evaluated on real-world tweets collected around the COP9 event.
Results: The fine-tuned models outperformed the pre-trained models on every task. For tobacco intent classification, the fine-tuned model achieved an overall accuracy of 0.91, compared with 0.33 for the pre-trained model. For e-cigarette intent classification, the fine-tuned model achieved an accuracy of 0.93, compared with 0.36 for the pre-trained model. For sentiment analysis, the fine-tuned model achieved an accuracy of 0.94, compared with 0.65 for the pre-trained model.
Conclusion: Domain-specific fine-tuning substantially improves the effectiveness of lightweight Flan-T5 models in analysing tobacco and e-cigarette-related tweets, providing accurate instruments for tracking public conversation on tobacco and e-cigarettes. The involvement of domain specialists in dataset validation helped ensure that the generated content accurately represented real-world discussions, thereby enhancing the quality and reliability of the results. These findings could inform tobacco control research and the formulation of public policy.
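The accuracy comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the `accuracy` helper, the label names, and the toy predictions are all hypothetical, chosen only to show how a pre-trained baseline and a fine-tuned model would be scored against the same gold labels.

```python
def accuracy(gold, predicted):
    """Return the fraction of examples whose predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical toy labels for illustration only; NOT data from the study.
gold = ["positive", "negative", "neutral", "negative", "positive", "neutral"]
pretrained_preds = ["positive", "positive", "positive", "negative", "positive", "positive"]
finetuned_preds = ["positive", "negative", "neutral", "negative", "positive", "positive"]

# Score both models against the same gold labels so the numbers are comparable.
baseline_acc = accuracy(gold, pretrained_preds)
finetuned_acc = accuracy(gold, finetuned_preds)

print(f"pre-trained accuracy: {baseline_acc:.2f}")
print(f"fine-tuned accuracy:  {finetuned_acc:.2f}")
```

Scoring both models on one shared, expert-validated label set is what makes the reported pairs (for example 0.33 versus 0.91) directly comparable.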
Keywords: social media analysis, Sentiment Analysis (SA), Intent Classification, Large Language Models (LLMs), Public Health, Domain adaptation, Tobacco, e-cigarette
Received: 24 Sep 2024; Accepted: 04 Nov 2024.
Copyright: © 2024 Elmitwalli, Mehegan, Gallagher and Alebshehy. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence:
Sherif Elmitwalli, University of Bath, Bath, United Kingdom