We applied machine-learning methods to data from a large survey of students, with the goal of identifying and validating a parsimonious panel of important factors associated with lower respiratory tract infection (LRTI).
Cross-sectional cluster sampling was used to survey students aged 6–14 years who attended primary or junior high school in Beijing in January 2022. Data were collected
Data from 11,308 students (5,527 girls and 5,781 boys) were analyzed; 909 of them had LRTI, a prevalence of 8.01%. After a comprehensive evaluation, the Gaussian naive Bayes (gNB) algorithm outperformed the other machine-learning algorithms, with an accuracy of 0.856, precision of 0.140, recall of 0.165, F1 score of 0.151, and area under the receiver operating characteristic curve (AUROC) of 0.652. Under the optimal gNB algorithm, the top five factors (age, rhinitis, sitting time, dental caries, and food or drug allergy) had decent prediction performance, comparable to that of all factors modeled together. For example, under the sequential deep-learning model, accuracy and loss were 92.26% and 25.62%, respectively, when incorporating the top five factors, and 92.22% and 25.52%, respectively, when incorporating all factors.
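As a quick consistency check on the metrics reported above, the F1 score can be recomputed from the stated precision and recall, since F1 is their harmonic mean. The sketch below is illustrative only: the function name `f1_score` and the variable names are assumptions introduced here, not part of the study's code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the gNB algorithm.
reported_precision = 0.140
reported_recall = 0.165

f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 3))  # prints 0.151, matching the reported F1 score
```

The low precision and recall alongside a high accuracy are what one would expect when only about 8% of students are positive cases, which is why F1 and AUROC are reported in addition to accuracy.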
Our findings indicate that the top five factors identified by the gNB algorithm can sufficiently represent all factors examined in predicting LRTI risk among Chinese students aged 6–14 years.