AUTHOR=Zhang Yuanzhao , Walecki Robert , Winter Joanne R. , Bragman Felix J. S. , Lourenco Sara , Hart Christopher , Baker Adam , Perov Yura , Johri Saurabh 

TITLE=Applying Artificial Intelligence Methods for the Estimation of Disease Incidence: The Utility of Language Models

JOURNAL=Frontiers in Digital Health

VOLUME=Volume 2 - 2020

YEAR=2020

URL=https://www.frontiersin.org/journals/digital-health/articles/10.3389/fdgth.2020.569261

DOI=10.3389/fdgth.2020.569261

ISSN=2673-253X

ABSTRACT=Background: AI-driven digital health tools often rely on estimates of disease incidence or prevalence, but obtaining these estimates is costly and time-consuming. We explored the use of machine learning models that leverage contextual information about diseases from unstructured text to estimate disease incidence.
Methods: We used a class of machine learning models, called language models, to extract contextual information relating to disease incidence. We evaluated three different language models: BioBERT, Global Vectors for Word Representation (GloVe), and the Universal Sentence Encoder (USE), as well as a combined approach. The outputs of these models are mathematical representations of the underlying data, known as 'embeddings'. We used these to train neural network models to predict disease incidence. The neural networks were trained and validated using data from the Global Burden of Disease study, and tested using independent data sourced from the epidemiological literature.
Findings: The method was evaluated in terms of mean absolute error (MAE) on a logarithmic scale. We achieved an MAE of 0.152 when predicting disease incidence for specific disease-country pairs, and 0.196 predicting incidence for previously unseen countries. Performance was weaker when predicting the incidence of previously unseen diseases (MAE 0.736). The MAE when predicting previously unseen countries against an external test set was 0.881.
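As an illustration of the evaluation metric named above, the following is a minimal sketch of an MAE computed on log-transformed incidence values. The abstract does not state the logarithm base or the incidence units, so base 10 is an assumption here, and the function name `log_mae` and the sample numbers are hypothetical rather than taken from the study.

```python
import math


def log_mae(predicted, observed):
    """Mean absolute error between log10-transformed incidence values.

    Assumption: base-10 logarithm; the abstract only says "logarithmic scale".
    Both inputs are equal-length sequences of strictly positive numbers.
    """
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, o in zip(predicted, observed):
        if p <= 0 or o <= 0:
            raise ValueError("incidence values must be positive to take a log")
        # Absolute difference in log space, i.e. |log10(p / o)|
        total += abs(math.log10(p) - math.log10(o))
    return total / len(predicted)


# Hypothetical values: one prediction off by a factor of 10, one exact.
example = log_mae([100.0, 1000.0], [10.0, 1000.0])
```

Under the base-10 assumption, an error of e in log space corresponds to a multiplicative factor of 10^e, so an MAE of 0.152 would mean predictions are off by a factor of roughly 1.4 on average.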
Interpretation: We demonstrate that context-aware machine learning models can be used for estimating disease burden. This method is quicker to implement than traditional epidemiological approaches. We therefore suggest it complements existing modelling efforts, where data is required more rapidly or at larger scale. This may particularly benefit AI-driven digital health products where the data will undergo further processing and a validated approximation of the disease incidence is adequate.
Competing Interests: This manuscript was developed as part of a research initiative at Babylonhealth. All authors are full-time employees of Babylonhealth. There are no competing interests to declare.