AUTHOR=Chen Brian H.

TITLE=Minimum standards for evaluating machine-learned models of high-dimensional data

JOURNAL=Frontiers in Aging

VOLUME=Volume 3 - 2022

YEAR=2022

URL=https://www.frontiersin.org/journals/aging/articles/10.3389/fragi.2022.901841

DOI=10.3389/fragi.2022.901841

ISSN=2673-6217

ABSTRACT=The maturation of machine learning and of technologies that generate high-dimensional data has led to growth in the number of predictive models that result from the use of machine learning algorithms, such as the “epigenetic clock”. While powerful, machine learning algorithms run a high risk of overfitting, particularly when training data are limited, as is often the case with high-dimensional data (“large p, small n”). Making independent validation a requirement of “algorithmic biomarker” development would bring greater clarity to the field by more efficiently identifying prediction or classification models to prioritize for further validation and characterization. Reproducibility has long been a mainstay of science, but only recently has attention been paid to defining its various aspects and to applying these principles to machine learning models. Furthermore, the relative ease of developing new models has led to the abandonment of fundamental scientific practices: clearly defining aims, designing studies properly, eliminating biases, and describing study methods in sufficient detail for independent replication. The goal of this paper is to serve as a call to arms for greater rigor in, and closer attention to, newly developed models for prediction or classification.