AUTHOR=Lamurias Andre, Jesus Sofia, Neveu Vanessa, Salek Reza M., Couto Francisco M.
TITLE=Information Retrieval Using Machine Learning for Biomarker Curation in the Exposome-Explorer
JOURNAL=Frontiers in Research Metrics and Analytics
VOLUME=6
YEAR=2021
URL=https://www.frontiersin.org/journals/research-metrics-and-analytics/articles/10.3389/frma.2021.689264
DOI=10.3389/frma.2021.689264
ISSN=2504-0537
ABSTRACT=
Objective: In 2016, the International Agency for Research on Cancer, part of the World Health Organization, released the Exposome-Explorer, the first database dedicated to biomarkers of exposure for environmental risk factors for diseases. The database contents resulted from a manual literature search that yielded over 8,500 citations, but only a small fraction of these publications were used in the final database. Manually curating a database is time-consuming and requires domain expertise to gather relevant data scattered throughout millions of articles. This work proposes a supervised machine learning pipeline to assist the manual literature retrieval process.
Methods: The manually retrieved corpus of scientific publications used in the Exposome-Explorer served as the training and test sets for the machine learning models (classifiers). Several parameters and algorithms were evaluated to predict an article’s relevance from different datasets built from titles, abstracts and metadata.
Results: The top-performing classifier was built with the Logistic Regression algorithm using the title and abstract set, achieving an F2-score of 70.1%. Furthermore, we extracted 1,143 entities from these articles with a classifier trained for biomarker entity recognition. Of these, we manually validated 45 new candidate entries to the database.
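For readers unfamiliar with the F2-score reported above, the following minimal sketch computes the general F-beta measure from confusion-matrix counts. With beta = 2, recall is weighted more heavily than precision, which suits a screening task where missing a relevant article costs more than reviewing an irrelevant one. The function name `f_beta` and the counts are hypothetical illustrations and are not taken from the study.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """Return the F-beta score from true positives, false positives and false negatives.

    beta > 1 weights recall above precision; beta = 1 gives the usual F1-score.
    """
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts, NOT results from the paper.
tp, fp, fn = 80, 60, 20
f2 = f_beta(tp, fp, fn, beta=2.0)  # recall-weighted score
f1 = f_beta(tp, fp, fn, beta=1.0)  # balanced score, for comparison
print(f"F2 = {f2:.3f}, F1 = {f1:.3f}")
```

In this made-up example recall (0.80) exceeds precision (about 0.57), so the F2-score comes out higher than the F1-score.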
Conclusion: Our methodology reduced the number of articles to be manually screened by the database curators by nearly 90%, while misclassifying only 22.1% of the relevant articles. We expect that this methodology can also be applied to similar biomarker datasets or be adapted to assist the manual curation process of similar chemical or disease databases.
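The two figures in the conclusion can be expressed as simple ratios. The sketch below assumes that the workload reduction is the share of articles the classifier labels irrelevant, so curators need not screen them, and that the misclassification rate is the share of truly relevant articles among those discarded (one minus recall). Both readings are assumptions about the definitions used, and the function name and counts are hypothetical, not the study's data.

```python
def screening_summary(n_total: int, n_predicted_relevant: int,
                      n_relevant: int, n_relevant_missed: int) -> tuple[float, float]:
    """Return (workload_reduction, missed_fraction) for a screening classifier.

    workload_reduction: fraction of all articles curators no longer need to read.
    missed_fraction: fraction of truly relevant articles labelled irrelevant.
    """
    workload_reduction = 1 - n_predicted_relevant / n_total
    missed_fraction = n_relevant_missed / n_relevant
    return workload_reduction, missed_fraction


# Hypothetical counts, NOT results from the paper.
reduction, missed = screening_summary(
    n_total=1000,              # articles retrieved by the literature search
    n_predicted_relevant=100,  # articles the classifier passes on to curators
    n_relevant=50,             # articles that are truly relevant
    n_relevant_missed=10,      # relevant articles the classifier discarded
)
recall = 1 - missed
print(f"reduction = {reduction:.1%}, missed = {missed:.1%}, recall = {recall:.1%}")
```

The trade-off is explicit in this form: a stricter classifier raises the workload reduction but also tends to raise the fraction of relevant articles that are missed.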