This study aimed to identify the risk factors for pancreatic cancer through machine learning.
We investigated the relationships between different risk factors and pancreatic cancer in a real-world retrospective cohort study conducted at West China Hospital of Sichuan University. Multivariable logistic regression, with pancreatic cancer as the outcome, was used to identify covariates associated with pancreatic cancer. Extreme gradient boosting (XGBoost) was adopted as the final machine learning model because of its high performance. SHapley Additive exPlanations (SHAP) were used to visualize the relationships between these potential risk factors and pancreatic cancer.
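To make the attribution step concrete, the following is a minimal sketch of the Shapley value computation that SHAP is built on. It does not reproduce the study's XGBoost model or its data: the toy log-odds model, its coefficients, the interaction term, and the all-zero baseline patient are hypothetical values chosen only for illustration, and the function names are assumptions introduced here.

```python
from itertools import combinations
from math import factorial

# Hypothetical toy model over binary risk factors. All numbers below are
# arbitrary illustrative values, NOT estimates from the study.
FEATURES = ["kras_mutation", "pancreatitis", "hyperlipidaemia"]
COEF = {"kras_mutation": 1.5, "pancreatitis": 0.8, "hyperlipidaemia": 0.4}
INTERACTION = 0.6   # extra log-odds when KRAS mutation and pancreatitis coexist
INTERCEPT = -2.0
BASELINE = {name: 0 for name in FEATURES}  # reference patient with no risk factors


def predict_logit(x):
    """Log-odds output of the toy model for a feature dict x."""
    main = sum(COEF[f] * x[f] for f in FEATURES)
    return INTERCEPT + main + INTERACTION * x["kras_mutation"] * x["pancreatitis"]


def value(subset, x):
    """Model output when only the features in `subset` take the patient's
    values; all other features are held at the baseline."""
    mixed = {f: (x[f] if f in subset else BASELINE[f]) for f in FEATURES}
    return predict_logit(mixed)


def shapley_values(x):
    """Exact Shapley values obtained by enumerating every feature coalition."""
    n = len(FEATURES)
    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {f}, x) - value(s, x))
        phi[f] = total
    return phi


if __name__ == "__main__":
    patient = {"kras_mutation": 1, "pancreatitis": 1, "hyperlipidaemia": 1}
    for name, contribution in shapley_values(patient).items():
        print(f"{name}: {contribution:+.2f}")
```

In this toy setting the interaction term is split equally between the two interacting features, and the contributions sum to the difference between the patient's prediction and the baseline prediction. Exhaustive enumeration is only feasible for a handful of features; for tree ensembles such as XGBoost, SHAP implementations rely on faster algorithms.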
The cohort included 1,982 patients. The median ages of the pancreatic cancer and non-pancreatic cancer groups were 58.1 years (IQR: 51.3–64.4) and 57.5 years (IQR: 49.5–64.9), respectively. Multivariable logistic regression indicated that Kirsten rat sarcoma viral oncogene homolog (KRAS) gene mutation, hyperlipidaemia, pancreatitis, and pancreatic cysts were significantly associated with an increased risk of pancreatic cancer. The five most highly ranked features in the XGBoost model were KRAS gene mutation status, age, alcohol consumption status, pancreatitis status, and hyperlipidaemia status.
Machine learning algorithms confirmed that KRAS gene mutation, hyperlipidaemia, and pancreatitis are potential risk factors for pancreatic cancer. Additionally, the coexistence of KRAS gene mutation with pancreatitis, and of KRAS gene mutation with pancreatic cysts, is associated with an increased risk of pancreatic cancer. Our findings offer valuable implications for public health strategies targeting the prevention and early detection of pancreatic cancer.