Molecular radiation biomarkers are an emerging tool in radiation research, with applications in cancer radiotherapy, radiation risk assessment, and even human space travel. However, biomarker screening in genome-wide expression datasets using conventional tools is time-consuming and subject to analyst (human) bias. Machine learning (ML) methods can improve the sensitivity and specificity of biomarker identification, increase analytical speed, and avoid multicollinearity and human bias.
To develop a resource-efficient ML framework for radiation biomarker discovery using gene expression data from irradiated normal tissues, and to identify biomarker panels that predict radiation dose with tissue specificity.
A strategic search of the Gene Expression Omnibus database identified a transcriptomic dataset (GSE44762) of normal-tissue radiation responses (murine kidney cortex and medulla) suited to biomarker discovery using an ML approach. The dataset was pre-processed in R and split into training and test subsets. The high computational cost of the Genetic Algorithm/k-Nearest Neighbor (GA/KNN) approach mandated optimization, and 13 ML models were tested using the caret package in R. Biomarker performance was evaluated and visualized
Caret-based feature selection and ML methods substantially reduced processing time compared with the GA approach. The KNN method yielded the best overall performance on the training and test data and was implemented in the framework. The top-ranking genes were
The caret framework is a powerful tool for radiation biomarker discovery, combining predictive performance with resource efficiency for broad implementation in the field. The KNN-based approach identified