Development and economic assessment of machine learning models to predict glycosylated hemoglobin in type 2 diabetes

Tong, Yi-Tong; Gao, Guang-Jie; Chang, Huan; Wu, Xing-Wei; Li, Meng-Ting

doi:10.3389/fphar.2023.1216182

ORIGINAL RESEARCH article

Front. Pharmacol., 30 June 2023

Sec. Experimental Pharmacology and Drug Discovery

Volume 14 - 2023 | https://doi.org/10.3389/fphar.2023.1216182

This article is part of the Research Topic: Machine Learning and Pharmacotherapy

Yi-Tong Tong¹

Guang-Jie Gao²

Huan Chang²

Xing-Wei Wu²,³*

Meng-Ting Li²*

¹Chengdu Second People’s Hospital, Chengdu, Sichuan, China
²Personalized Drug Therapy Key Laboratory of Sichuan Province, Department of Pharmacy, Sichuan Provincial People’s Hospital, School of Medicine, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
³Chinese Academy of Sciences Sichuan Translational Medicine Research Hospital, Chengdu, Sichuan, China

Background: Glycosylated hemoglobin (HbA1c) is recommended for diagnosing and monitoring type 2 diabetes (T2D). However, in real-world practice, HbA1c is monitored less frequently than the guidelines recommend. Machine learning models that screen patients with T2D for poor glycemic control could optimize management and decrease medical service costs.

Methods: This study was carried out on patients with T2D whose HbA1c was measured at the Sichuan Provincial People’s Hospital from April 2018 to December 2019. Characteristics were extracted from interviews and electronic medical records. After data pre-processing, the data (with or without the fasting blood glucose [FBG] value) were randomly divided into a training dataset and a test dataset at a ratio of 8:2. Four imputing methods, four screening methods, and six machine learning algorithms were used to optimize data and develop models. Models were compared on the basis of predictive performance metrics, especially the model benefit (MB, a confusion matrix combined with the economic burden associated with therapeutic inertia). The contributions of features were interpreted using SHapley Additive exPlanation (SHAP). Finally, we validated the sample size on the best model.

Results: The study included 980 patients with T2D, of whom 513 (52.3%) were defined as positive (needing the HbA1c test). Models trained on data that included FBG showed better predictive performance than models trained on data that excluded the FBG value. The best of the 192 trained models used modified random forest as the imputation method, ElasticNet as the feature screening method, and LightGBM as the algorithm. Its MB, AUC, accuracy, and AUPRC were 43475.750 (¥), 0.972, 0.944, and 0.974, respectively. The FBG values, previous HbA1c values, having a rational and reasonable diet, health status scores, type of manufacturers of metformin, interval of measurement, EQ-5D scores, occupational status, and age were the most significant contributors to the prediction model.

Conclusion: We found that MB could be an indicator to evaluate the model prediction performance. The proposed model performed well in identifying patients with T2D who need to undergo the HbA1c test and could help improve individualized T2D management.

1 Introduction

Diabetes mellitus (DM) is one of the most common and fastest-growing endocrine diseases, with both types (type 1 and type 2) contributing substantially to the healthcare costs of society. According to the International Diabetes Federation (IDF), the number of people with diabetes reached approximately 537 million worldwide by 2021 (1 in 10 adults lives with diabetes), and approximately 90%–95% of diabetes cases are type 2 diabetes (T2D) (Sun et al., 2022). T2D has become a global threat to public health in the 21st century (Wang et al., 2022). Fasting blood glucose (FBG) and random blood glucose (RBG) have been the traditional measures for assessing the risk of T2D, but they have an obvious shortcoming: they change over short periods of time due to behavioral changes (Christine et al., 2017). In contrast, glycated hemoglobin A1c (HbA1c), which represents the average plasma glucose level over the past 2–3 months (Rohlfing et al., 2002), was recommended for diagnosing and monitoring diabetes by the American Diabetes Association (ADA) in 2010 and the World Health Organization (WHO) in 2011 (Leong et al., 2018).

According to the latest criteria, the American Diabetes Association (ADA) and the European Association for the Study of Diabetes (EASD) have recommended that glycemic management be evaluated primarily with the HbA1c test, with a therapeutic goal of reducing HbA1c to <7.0% (Davies et al., 2022). The Chinese guidelines are in line with international consensus and stress the importance of regular HbA1c measurements (twice a year or four times a year) (Chinese Diabetes Society, 2021). Studies have shown that glycemic control is required in order to reduce the risk of onset and progression of complications (Williams et al., 2005; Yu et al., 2022). Once the target HbA1c is exceeded by 0.5% (>5 mmol/mol) after 3–6 months, therapy should be intensified. However, in practice, this does not always happen. The delay in intensifying therapy is referred to in clinical terms as therapeutic inertia and is due to underestimation of the need for therapy or failure to monitor the HbA1c level (Reach et al., 2017).

Machine learning (ML) is a branch of artificial intelligence and has been widely applied in clinical research and practice to construct high-performing prediction models, such as models that predict disease progression and outcomes (Griffith et al., 2020; Lewin-Epstein et al., 2020; Wang et al., 2020). In the field of T2D management in particular, identifying patients with T2D and estimating the risk of developing complications have become hot topics in recent years (Ahlqvist et al., 2019; Dennis et al., 2019; Heerspink, 2019). ML has been shown to provide a useful management tool and has played a key role in the recognition of systems as routine therapeutic aids for patients with T2D. We therefore asked whether machine learning methods based on readily available daily data can identify patients at high risk of poor glycemic control.

2 Methods

2.1 Data sources

The participants in the study were recruited from outpatients attending the Endocrinology Section of the Sichuan Provincial People’s Hospital. Participants were selected according to the following criteria: (1) over 18 years of age; (2) diagnosed with T2D and receiving hypoglycemic treatment (the diagnostic criteria for T2D were in line with China’s 2017 guidelines on preventing and treating type 2 diabetes (Chinese Diabetes Society, 2017)); (3) HbA1c levels were measured on the day of collection; and (4) agreed to take part after researchers explained the purpose and scope of the survey. Ethics approval was obtained through the Ethics Committee of the Sichuan Provincial People’s Hospital (approval # 2018-53).

Characteristics of participants were obtained from face-to-face interviews and electronic medical records (EMRs). The adherence status was defined according to the proportion of days covered (PDC). PDC higher than 80% was regarded as good medication compliance (Wu et al., 2020).
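
The adherence rule above can be illustrated with a short Python sketch. The paper does not describe how PDC was computed from the records, so the fill-record format (`fill_date`, `days_supply`), the handling of overlapping fills, and the function names here are assumptions made for illustration only.

```python
from datetime import date, timedelta


def proportion_of_days_covered(fills, period_start, period_end):
    """Fraction of days in [period_start, period_end] covered by at least one fill.

    fills: list of (fill_date, days_supply) tuples (hypothetical record format).
    Overlapping fills are counted once per calendar day.
    """
    covered = set()
    for fill_date, days_supply in fills:
        for offset in range(days_supply):
            day = fill_date + timedelta(days=offset)
            if period_start <= day <= period_end:
                covered.add(day)
    total_days = (period_end - period_start).days + 1
    return len(covered) / total_days


def is_adherent(pdc, threshold=0.80):
    """Good medication compliance is PDC higher than 80%, as stated in the text."""
    return pdc > threshold


# Hypothetical patient: three 30-day fills within a 90-day observation window.
fills = [(date(2019, 1, 1), 30), (date(2019, 2, 5), 30), (date(2019, 3, 10), 30)]
pdc = proportion_of_days_covered(fills, date(2019, 1, 1), date(2019, 3, 31))
print(round(pdc, 3), is_adherent(pdc))
```

In this hypothetical case, 82 of the 90 days are covered, so the patient would be classed as adherent.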

2.2 Outcome definition

HbA1c values on the day of the clinic visit were measured at the clinical laboratory of Sichuan Provincial People’s Hospital and collected from EMRs. In this study, an HbA1c value of more than 7.0% was defined as positive and a value of less than 7.0% was defined as negative. Furthermore, patients with a positive HbA1c were considered to need HbA1c testing on the day of attending the Endocrinology Section.

2.3 Data pre-processing

After data collection ended, the information was converted to an Excel format. Each column represented a candidate variable, and each row represented a sample. To acquire high-quality data for modeling, a series of interventions were performed, including data pre-screening, data imputing, and variable selection.

First, data pre-screening was carried out using the following criteria: (1) the columns with missing values > 90% were removed, (2) the columns with a single value occupying >90% were removed, and (3) the columns with the coefficients of variation <0.1 were removed.
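
A minimal sketch of the three pre-screening rules, written in plain Python, is given below. It assumes the table is held as a dictionary of columns with `None` for missing values, that the dominant-value share is computed over non-missing entries, and that the coefficient of variation is only checked for numeric columns; none of these details are specified in the text, and the column names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def prescreen(columns, max_missing=0.90, max_dominant=0.90, min_cv=0.1):
    """Return the names of columns that survive the three pre-screening rules.

    columns: dict mapping column name -> list of values (None = missing).
    """
    kept = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        # Rule 1: remove columns with more than 90% missing values.
        if (len(values) - len(present)) / len(values) > max_missing:
            continue
        # Rule 2: remove columns where a single value occupies more than 90%
        # (assumption: share is computed over non-missing entries).
        top_count = Counter(present).most_common(1)[0][1]
        if top_count / len(present) > max_dominant:
            continue
        # Rule 3: remove numeric columns with coefficient of variation < 0.1.
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        )
        if numeric:
            m = mean(present)
            if m != 0 and pstdev(present) / abs(m) < min_cv:
                continue
        kept.append(name)
    return kept


# Hypothetical toy table with 20 rows.
toy = {
    "age": [45, 60, 72, 58, 66, 39, 80, 51, 63, 70] * 2,
    "nationality": ["Han"] * 19 + ["Yi"],        # one value occupies 95%
    "rare_lab": [None] * 19 + [5.1],             # 95% missing
    "near_constant": [100, 101, 99, 100, 102] * 4,  # CV of about 0.01
}
print(prescreen(toy))
```

Only `age` survives in this toy table; the other three columns each trigger one of the rules.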

Missing information was inevitable in clinical data, such as the FBG value and PBG value. Missing data were filled using four imputing methods, including simple imputing (marked as SI), random forest imputing (marked as RF), k-nearest neighbor imputing (marked as KNN), and optimal deletion (marked as OD).

In order to eliminate irrelevant variables, reduce the number of variables, and improve the accuracy of the model, variable selection was performed. In this study, four algorithms—LASSO (Tibshirani, 1997), ridge regression (Marquardt and Snee, 1975), ElasticNet (Simon et al., 2011), and Boruta (Kursa M B, 2010)—were used to screen the key variables. The four aforementioned algorithms were marked as LA, RD, EN, and BOR, respectively.

2.4 Data partition

Eighty percent of the data were assigned to the training set and the rest to the test set. The training set was used to train a classification model, and the test set was used to evaluate the model performance. Meanwhile, to assess whether the FBG value on the day of the visit was an important variable, models were also trained on the original data that included the FBG value.
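
The 8:2 partition can be sketched as follows. The paper used the sklearn package; this stdlib-only version is an illustrative stand-in, and the seed value and function name are assumptions.

```python
import random


def split_train_test(samples, test_fraction=0.2, seed=42):
    """Shuffle the samples and hold out `test_fraction` of them as the test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# With the 980 participants reported in this study, an 8:2 split gives 784 / 196.
train_ids, test_ids = split_train_test(range(980))
print(len(train_ids), len(test_ids))
```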

2.5 Model building

Sixteen datasets were generated in the training set by combining four data imputation methods with four variable selection methods. Six machine learning algorithms were then applied to each dataset, giving 96 models for each data variant (192 models in total across the data that excluded and included FBG). The machine learning algorithms in this study were random forest (RF), logistic regression (LR), multilayer perceptron (MLP), extreme gradient boosting (XGBoost), light gradient boosting machine (LGBM), and categorical boosting (CB).
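
The size of the modeling grid can be checked by enumerating the combinations. The abbreviations are the ones used in this article; reading the 192 models reported in the Results as two data variants (excluded FBG and included FBG) times the per-variant grid is our interpretation of the design.

```python
from itertools import product

IMPUTERS = ["SI", "RF", "KNN", "OD"]
SELECTORS = ["LA", "RD", "EN", "BOR"]
ALGORITHMS = ["RF", "LR", "MLP", "XGBoost", "LGBM", "CB"]
DATA_VARIANTS = ["excluded FBG", "included FBG"]

# One dataset per (imputing method, selection method) pair.
datasets = list(product(IMPUTERS, SELECTORS))
# One model per (dataset, algorithm) pair within a data variant.
models_per_variant = list(product(datasets, ALGORITHMS))
# Full grid across both data variants.
all_models = list(product(DATA_VARIANTS, IMPUTERS, SELECTORS, ALGORITHMS))

print(len(datasets), len(models_per_variant), len(all_models))
```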

RF, an ensemble learning algorithm proposed by Breiman, is very commonly used for classification (Breiman, 2001). Individual decision trees are built using a random subset of the training dataset in the training process. The final classification is then based on the majority voting results of all decision trees (Singha et al., 2019).

LR is widely used to solve binary classification problems (Jaillard et al., 2020). It predicts the probability of whether a dependent variable belongs to a particular class. The principle of LR is to first fit the decision boundary and then establish the probability relationship between the boundary and the classification so as to obtain the probability in the case of two classifications (Wang et al., 2020).
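
For reference, the standard form of this relationship can be written as follows, where $x$ is the vector of variables and the weights $w$ and intercept $b$ are generic symbols introduced here rather than notation from the study:

$P(y = 1 \mid x) = \dfrac{1}{1 + e^{-(w^{\top} x + b)}},$

The decision boundary is the set of points where $w^{\top} x + b = 0$, that is, where the predicted probability equals 0.5. Classifying a sample as positive when $P(y = 1 \mid x) \geq 0.5$ is the usual default and is an assumption here, since the cut-off used in the study is not stated.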

MLP, also known as a feed-forward neural network, is one of the most common deep learning approaches (Wan et al., 2018). It is mainly used to address supervised learning problems by learning the dependencies between the input layer (the variables) and output layer (the classification decision) using a fully connected hidden layer (Wang et al., 2020).

XGBoost (Chen and Guestrin, 2016), LightGBM (Guolin et al., 2017), and CatBoost (Bentéjac et al., 2021) are the three most popular implementations of gradient-boosting tree-based ensemble methods (Friedman, 2001). Although they are built on structurally similar ideas, these libraries differ slightly in how decision trees are grown and how categorical variables are handled, and only experimentation can show which performs best on a given dataset.

2.6 Model evaluation

The test set was used for external validation. A confusion matrix is commonly used to evaluate the accuracy of a classifier. In this study, the confusion matrix was combined with relevant economic indicators to define a new performance measure, the model benefit (MB). The test fee for HbA1c at our hospital was ¥73 per test. The mean additional economic burden of therapeutic inertia (Lindvig et al., 2021) was regarded as the cost of missed detection. At the current exchange rate, the renminbi (RMB, ¥) and the Danish krone were treated as approximately 1:1, so the cost of missed detection was ¥786.77. The calculation formula was as follows:

$MB\ (¥) = \text{Total cost}\ (TC, ¥) - \text{Model cost}\ (MC, ¥),$

$TC\ (¥) = \text{Total participants} \times \text{HbA1c test fee},$

$MC\ (¥) = TP \times \text{HbA1c test fee} + FP \times \text{HbA1c test fee} + FN \times \text{cost of missed detection} - TN \times \text{HbA1c test fee},$

$(TP = \text{true positive},\ FP = \text{false positive},\ FN = \text{false negative},\ TN = \text{true negative}).$
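
The formulas above translate directly into a few lines of Python. The sketch below follows them term by term, including the subtraction of the test fee for true negatives; the confusion-matrix counts in the example are hypothetical and are not results from this study.

```python
HBA1C_TEST_FEE = 73.0            # ¥ per test, as stated in the text
COST_OF_MISSED_DETECTION = 786.77  # ¥, as stated in the text


def model_benefit(tp, fp, fn, tn,
                  test_fee=HBA1C_TEST_FEE,
                  missed_cost=COST_OF_MISSED_DETECTION):
    """Model benefit (MB, ¥) computed from confusion-matrix counts."""
    total_participants = tp + fp + fn + tn
    total_cost = total_participants * test_fee           # TC
    model_cost = (tp * test_fee + fp * test_fee
                  + fn * missed_cost - tn * test_fee)    # MC
    return total_cost - model_cost                       # MB = TC - MC


# Hypothetical confusion matrix for a 196-patient test set.
print(round(model_benefit(tp=95, fp=5, fn=8, tn=88), 2))
```

With these hypothetical counts, TC is ¥14,308.00, MC is ¥7,170.16, and MB is ¥7,137.84. Because each false negative costs roughly eleven test fees, MB penalizes missed detections much more heavily than unnecessary tests.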

In addition, the area under the receiver operating characteristic curve (AUC), area under the precision recall curve (AUPRC), and decision curve analysis (DCA) were summarized to assess the model performance. The contribution of each variable to the predictive model was estimated with SHapley Additive exPlanation (SHAP).
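
As a reminder of what the AUC measures, it can be computed as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses this pairwise definition in plain Python; it is an illustration of the metric, not the sklearn routine used in the study, and the labels and scores are made up.

```python
def roc_auc(y_true, scores):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = HbA1c positive) and predicted probabilities.
labels = [1, 1, 0, 0, 1, 0]
probs = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
print(round(roc_auc(labels, probs), 3))
```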

2.7 Sample size validation

The best model (assessed by MB) was employed to estimate the impact of sample size on the predictive performance (Wu et al., 2020). The total samples were randomly separated into the training set and the test set at a ratio of 8:2. First, 10% of the training set was extracted to train the model, and the AUC was evaluated in the test set. The proportion of samples selected from the training set was then increased from 10% to 100% in steps of 10%. These steps were repeated 10 times to generate 10 independent AUC values for each proportion. The relationship between sample size and the prediction performance of the models was assessed from the inflection point of the line graph. A steeper line indicated that a larger sample size would improve the prediction performance of the model, and a gentler slope indicated that the performance was only slightly affected by the sample size.
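
The resampling loop can be sketched as below. The callback `fit_and_score` is a hypothetical placeholder for "train the best model on this subset and return its AUC on the test set"; the sketch also assumes that the test set is held fixed while the training subsets are redrawn, which the text does not state explicitly.

```python
import random


def sample_size_curve(train, test, fit_and_score, repeats=10, seed=0):
    """Return {percentage: [score, ...]} for 10%, 20%, ..., 100% of the training set.

    fit_and_score(train_subset, test) -> float is supplied by the caller
    (hypothetical callback; in the study it would return the test-set AUC).
    """
    rng = random.Random(seed)
    curve = {}
    for pct in range(10, 101, 10):
        k = max(1, round(len(train) * pct / 100))
        # Draw `repeats` independent random subsets of size k without replacement.
        curve[pct] = [fit_and_score(rng.sample(train, k), test)
                      for _ in range(repeats)]
    return curve


# Toy usage: a stand-in scorer that just reports the subset size.
demo = sample_size_curve(list(range(100)), [], lambda subset, test: len(subset))
print(demo[10][0], demo[100][0])
```

Plotting the mean and spread of each list against the percentage gives the kind of line graph described above.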

Overall, the workflow for the development and validation of the models is summarized in Figure 1.

FIGURE 1. Overview of the main modeling steps.

2.8 Statistical analysis

Continuous variables were described by mean and standard deviation, whereas categorical variables were expressed as frequencies and percentages. Multivariate analysis by multiple linear regression was performed to identify the factors associated with the model performance. Model development was performed using the sklearn package and SHAP package in Python (Python Software Foundation, Python Language Reference, version 3.6.8) on PyCharm (developed by JetBrains s.r.o., version 11.0.4). Grid search was applied to tune the hyperparameters.

3 Results

3.1 Participant characteristics

Overall, 980 patients completed the survey, among which 571 were male and 409 were female. The mean age was 59.2 ± 11.9 years. A total of 513 patients were defined as positive (52.3%). Participants were grouped according to the HbA1c value, and detailed characteristics of the participants are shown in Table 1.

TABLE 1. Characteristics of participants when grouped according to HbA1c.

After data pre-screening, 15 variables were removed (nationality, marital status, with or without complications, vascular complications, neurological complications, complications with lesions of the extremities, ocular complications, nephropathy complications, complications (other diseases), glinides, thiazolidinediones, GLP-1 RAs, SGLT2 inhibitors, SGLT2 inhibitors, and the use of Chinese medicine).

3.2 Dataset building

A total of 16 datasets were set up by applying different imputing methods and variable selection methods with 41 variables. The numbers of variables and samples in each dataset are listed in Table 2.

TABLE 2. Detailed information of 16 datasets.

3.3 Model validation

A total of 192 models (with or without FBG) were validated in the test set, which was considered external validation, and the performance metrics were recorded. Table 3 and Figure 2 list the five best models (excluded FBG) in order of MB value. The five best models were trained on the No. 12 dataset (modified random forest as the imputing method and ridge regression as the selection method). The MB, AUC, accuracy, and AUPRC values of the best model (model 1) were 3163.800 (¥), 0.852, 0.811, and 0.845, respectively.

TABLE 3. Summary of the performance of the five best models (excluded the FBG value).

FIGURE 2. Performance of the five best models (excluded FBG). (A) ROC curves. (B) Precision–recall curves. (C) Calibration plots. (D) DCA plots.

As listed in Table 4 and Figure 3, the five best models (included FBG) were trained on the No. 11 and No. 10 datasets (modified random forest as the imputing method and ElasticNet or LASSO as the selection method). The MB, AUC, accuracy, and AUPRC values of the best model (model 1) were 43475.750 (¥), 0.972, 0.944, and 0.974, respectively. The calibration and DCA curves also displayed excellent predictive performance (Figures 2C, D; Figures 3C, D). The models that included FBG showed better predictive performance than the models that excluded the FBG value.

TABLE 4. Summary of the performance of the five best models (included the FBG value).

FIGURE 3. Performance of the five best models (included FBG). (A) ROC curves. (B) Precision–recall curves. (C) Calibration plots. (D) DCA plots.

3.4 SHapley Additive exPlanation evaluation

SHAP was used to interpret the results from the best model. The result of SHAP in the best model (excluded FBG) is shown in Figure 4. Figure 4A shows how SHAP quantifies the contribution of each feature in a single sample. As shown in Figure 4B, previous HbA1c values, having a rational and reasonable diet, course of diabetes, BMI, interval of measurement, duration of treatment regimen, type of manufacturers of metformin, age, waistline, and marital status were the 10 most important variables.
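
To make the idea of a per-sample feature contribution concrete, the sketch below computes exact Shapley values for a toy three-feature model by enumerating all feature subsets. It is not the SHAP package, which relies on faster model-specific or sampling-based estimators and a background dataset; the single baseline vector, the toy linear model, and its coefficients are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley value of each feature for one sample.

    Features outside the current subset are replaced by their baseline value.
    """
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(features):
    """Hypothetical linear risk score; coefficients are made up."""
    fbg, previous_hba1c, exercise_sessions = features
    return 0.5 * fbg + 0.3 * previous_hba1c - 0.2 * exercise_sessions


sample = [8.0, 7.5, 3.0]
reference = [6.0, 7.0, 1.0]
print([round(v, 3) for v in shapley_values(toy_model, sample, reference)])
```

For this toy model, the contributions are 1.0, 0.15, and -0.4, and they sum to the difference between the prediction for the sample and the prediction for the reference, which is the additivity property that SHAP plots rely on.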

FIGURE 4. Results of SHAP in the best model (excluded FBG). (A) SHAP value contribution graph of each indicator of a single sample. (B) Complete distribution of the SHAP values for the top 10 variables.

The contributions of variables in the best model (included FBG) are shown in Figure 5. As illustrated in Figure 5A, waistline, previous HbA1c values, interval of measurement, number of oral drugs, psychological status, EQ-5D scores, type of manufacturers of metformin, and FBG values made a positive contribution to the SHAP value, while exercise session and course of diabetes made a negative contribution. As presented in Figure 5B, the most important variables were FBG values, previous HbA1c values, having a rational and reasonable diet, health status scores, type of manufacturers of metformin, interval of measurement, EQ-5D scores, occupational status, and age.

FIGURE 5. Results of SHAP in the best model (included FBG). (A) SHAP value contribution graph of each indicator of a single sample. (B) Complete distribution of the SHAP values for the top 10 variables.

3.5 Sample size assessment

The adequacy of the sample size was verified using the bootstrap resampling method, and the results are plotted in Figure 6. As the percentage of the sample size increased, the AUC gradually increased and the dispersion of the AUC values decreased. When the sample size reached 60%, the curve flattened. The results indicated that expanding the sample size further would affect the performance of the model only slightly.

FIGURE 6. Results of sample size validation.

3.6 Multivariate analysis

As shown in Table 5, the number of samples could significantly affect the model prediction performance, including MB, AUC, accuracy, precision, and recall (p < 0.01). We found that the number of variables could affect the MB and recall of the prediction model significantly (p < 0.01). MB and AUC were influenced by screening methods (p < 0.05). Precision was affected by imputing methods (p < 0.01).

TABLE 5. Results of multivariate analysis.

4 Discussion

In this research, we developed a total of 192 models (with or without FBG) to identify poor glycemic control in patients with T2D. The MB, AUC, accuracy, and AUPRC values of the best model were 43475.750 (¥), 0.972, 0.944, and 0.974, respectively. FBG values, previous HbA1c values, having a rational and reasonable diet, health status scores, type of manufacturers of metformin, interval of measurement, EQ-5D scores, occupational status, and age were the most important contributors to the prediction model.

In recent years, with the continuous development of artificial intelligence techniques, machine learning algorithms have been increasingly applied in clinical prediction models, and disease prediction models have become a hot spot in clinical research. According to the TRIPOD checklist (Collins et al., 2015), the performance measures (with CI) for the prediction model should be reported. In most studies, the AUC on validation data has been used to represent the predictive ability of models (Chan et al., 2021; Gibbons et al., 2021; de Souza et al., 2022; Yu et al., 2022). In addition, some prediction models have been internally validated by Harrell’s concordance index, the Brier score, and a satisfactory calibration curve (Qu et al., 2021; Lo-Ciganic et al., 2022). These performance metrics focus on the accuracy of the model and pay less attention to the clinical cost caused by wrong predictions, particularly false negatives. In this study, we explored a novel measure that could overcome this limitation. Referring to the principles of pharmacoeconomic analysis, the parameters for a cost–benefit analysis are the costs of drugs and the benefits of treatments. In this study, the worst outcome of a missed HbA1c test was considered to be therapeutic inertia. The economic burden associated with therapeutic inertia was regarded as the cost of a false-negative prediction, and these data were obtained from a study of patients with type 2 diabetes in Denmark (Lindvig et al., 2021). The fee for the HbA1c test was considered as the cost of treatment. On this basis, the MB of the best model in the study was 43475.750 (¥), suggesting that significant gains may result from the prediction model.

The primary goals in the treatment of patients with T2D are to maintain blood glucose levels as close to normal as possible and to achieve a relatively normal quality of life. It was recognized early that both of these goals are influenced by a multitude of somatic and psychological factors (Rose et al., 2002; Williams et al., 2005). In addition, studies reported that educational level, age, duration of diabetes, BMI, and HbA1c at baseline were associated with HbA1c (Hu et al., 2020). One study suggested that occupational categories were related to T2D (Baek et al., 2019). According to the results of SHAP in our study, the most important variables were FBG values, previous HbA1c values, having a rational and reasonable diet, health status scores, type of manufacturers of metformin, interval of measurement, EQ-5D scores, occupational status, and age. The relationship between HbA1c and average glucose levels has been explored in many studies (Law et al., 2017). Meanwhile, this study developed prediction models on two versions of the data (excluded FBG vs. included FBG). The results suggested that incorporating FBG into the models further improves predictive performance (MB of 3163.800 (¥) vs. 43475.750 (¥)).

In this study, multiple methods and algorithms were applied to build models. Because of their different principles, the methods and algorithms have different strengths and weaknesses. Specifically, four imputing methods were used to fill in missing values. The SI method fills with fixed values (Löw et al., 2019): the missing value of a continuous variable is replaced by the mean of the variable, and the missing value of a categorical variable is filled with the median. KNN (Beretta and Santaniello, 2016) and RF (Liao et al., 2022) are prediction-based methods that fill in a missing value with a value predicted from the variables without missing values. Compared to a fixed value, the predicted value should theoretically be closer to the true value. However, this will also artificially strengthen the associations between variables. OD is a method we recently proposed for excluding variables with missing values. The principle of the algorithm is to keep the maximum sample size with no missing values by deleting variables (columns of the table) or samples (rows of the table). According to the results of the multivariate analysis (shown in Table 5), methods and algorithms could significantly affect the prediction performance. It is therefore necessary to test which method is the most suitable for data preprocessing or modeling. Along the same lines, XGBoost, LGBM, and CatBoost are implementations of gradient-boosting tree-based ensemble methods. The MB of LGBM was higher than that of the others both in the data that excluded the FBG value and in the data that included the FBG value (Table 3; Table 4), which was similar to previous research (Zhang et al., 2022).

5 Limitation

First, although the data were collected prospectively, our study has the inherent limitations of a single-center retrospective analysis. Although the sample size in our study has been demonstrated to be suitable for modeling, more samples need to be collected to verify this prediction model, and a large multicenter study is desirable to substantiate the applicability of the model. Second, because of the retrospective design, recall bias still exists for some variables, such as the intensity of exercise and exercise sessions.

6 Conclusion

In summary, the present research introduced 192 machine learning models to predict poor glycemic control in patients with T2D and proposed a new indicator to evaluate the performance of the prediction model. In fact, we developed a prediction model with better classifier performance. This work also reconfirmed that variables such as FBG values, previous HbA1c values, having a rational and reasonable diet, health status scores, interval of measurement, EQ-5D scores, occupational status, and age were risk factors for glycemic control. We are in the process of developing a mobile app or a Web server for caregivers and patients in an effort to integrate the glycemic control enhancement intervention into daily T2D management.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding authors.

Ethics statement

The studies involving human participants were reviewed and approved by the Ethics Committee of the Sichuan Provincial People’s Hospital (approval # 2018-53). The ethics committee waived the requirement of written informed consent for participation.

Author contributions

Y-TT contributed to data collection, data analysis, writing, and approval of the final manuscript. G-JG and HC assisted in data analysis and model design. X-WW and M-TL were responsible for designing and coordinating the research. All authors contributed to the article and approved the submitted version.

Funding

This study was funded by the National Natural Science Foundation of China (grant nos. 72004020 and 72174038), the National Key R&D Research Program of China, the National Key Research Program (grant no. 2020YFC2005506), and the Scientific Research Foundation of Sichuan Provincial People’s Hospital (grant no. 2022BH10).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Ahlqvist, E., Tuomi, T., and Groop, L. (2019). Clusters provide a better holistic view of type 2 diabetes than simple clinical features. Lancet Diabetes Endocrinol. 7, 668–669. doi:10.1016/S2213-8587(19)30257-8

Baek, Y., Kim, M., Kim, G. R., and Park, E. C. (2019). Cross-sectional study of the association between long working hours and pre-diabetes: 2010-2017 Korea National Health and Nutrition Examination Survey. BMJ Open 9, e033579. doi:10.1136/bmjopen-2019-033579
