AUTHOR=Chen Minjun , Wu Yue , Wingerd Byron , Liu Zhichao , Xu Joshua , Thakkar Shraddha , Pedersen Thomas J. , Donnelly Tom , Mann Nicholas , Tong Weida , Wolfinger Russell D. , Bao Wenjun TITLE=Automatic text classification of drug-induced liver injury using document-term matrix and XGBoost JOURNAL=Frontiers in Artificial Intelligence VOLUME=7 YEAR=2024 URL=https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2024.1401810 DOI=10.3389/frai.2024.1401810 ISSN=2624-8212 ABSTRACT=Introduction

Regulatory agencies generate a vast amount of textual data in the review process. For example, drug labeling serves as a valuable resource for regulatory agencies, such as the U.S. Food and Drug Administration (FDA) and the European Medicines Agency (EMA), to communicate drug safety and effectiveness information to healthcare professionals and patients. Drug labeling also serves as a resource for pharmacovigilance and drug safety research. Automated text classification would significantly improve the analysis of drug labeling documents and conserve reviewer resources.

Methods

We used artificial intelligence in this study to classify drug-induced liver injury (DILI)-related content in drug labeling documents, based on FDA’s DILIrank dataset. We employed text mining and XGBoost models, and used the Preferred Terms of Medical queries for adverse event standards to simplify the removal of common words and phrases while retaining standard medical terms in the FDA and EMA drug label datasets. We then constructed a document-term matrix, weighting each included word, term, or token by its Term Frequency-Inverse Document Frequency (TF-IDF).
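
As an illustration of the document-term matrix step described above, the following minimal sketch builds a TF-IDF weighted matrix restricted to a retained vocabulary. It is not the authors' implementation: the function name `build_tfidf_matrix`, the toy label sentences, the three-term vocabulary, the whitespace tokenizer, and the particular TF-IDF variant (relative term frequency times log(N/df)) are all assumptions chosen for clarity, and multi-word terms are not handled.

```python
import math
from collections import Counter


def build_tfidf_matrix(documents, vocabulary):
    """Return (terms, matrix); matrix[i][j] is the TF-IDF weight of terms[j] in documents[i].

    Hypothetical helper for illustration. Assumes single-token, lowercase
    vocabulary entries and a simple whitespace tokenizer.
    """
    terms = sorted(vocabulary)
    # Keep only tokens found in the retained vocabulary; all other words are dropped.
    tokenized = [
        [tok for tok in doc.lower().split() if tok in vocabulary]
        for doc in documents
    ]
    n_docs = len(documents)

    # Document frequency: number of documents containing each term at least once.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        row = []
        for term in terms:
            if total == 0 or doc_freq[term] == 0:
                row.append(0.0)
                continue
            tf = counts[term] / total                  # relative term frequency
            idf = math.log(n_docs / doc_freq[term])    # assumed IDF variant
            row.append(tf * idf)
        matrix.append(row)
    return terms, matrix


# Toy sentences, invented for this sketch (not taken from any drug label).
documents = [
    "hepatotoxicity and jaundice were reported",
    "no jaundice was observed",
    "headache was reported",
]
vocabulary = {"hepatotoxicity", "jaundice", "headache"}

terms, matrix = build_tfidf_matrix(documents, vocabulary)
for doc, row in zip(documents, matrix):
    print(doc, [round(w, 3) for w in row])
```

Each row of the resulting matrix could then serve as the feature vector for a classifier such as XGBoost; that training step is omitted here.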

Results

The automatic text classification model exhibited robust performance in predicting DILI, achieving cross-validation AUC scores exceeding 0.90 both for the FDA and EMA drug labels and for the literature abstracts from the Critical Assessment of Massive Data Analysis (CAMDA).
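
For readers unfamiliar with the metric reported above, the sketch below computes an AUC as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` and the labels and scores are invented for illustration and are not results from the study; in a cross-validation setting, a score like this would be computed on each held-out fold.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Hypothetical helper for illustration; labels are 1 (positive) or 0 (negative).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up labels and classifier scores for a single held-out fold.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

The pairwise form is quadratic in the number of examples, which is fine for a small illustration; library implementations typically use a sort-based computation instead.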

Discussion

The text mining and XGBoost functions demonstrated in this study can also be applied to other text processing and classification tasks.