Fairness in the prediction of acute postoperative pain using machine learning models

Davoudi, Anis; Sajdeya, Ruba; Ison, Ron; Hagen, Jennifer; Rashidi, Parisa; Price, Catherine C.; Tighe, Patrick J.

doi:10.3389/fdgth.2022.970281

ORIGINAL RESEARCH article

Front. Digit. Health, 11 January 2023

Sec. Health Informatics

Volume 4 - 2022 | https://doi.org/10.3389/fdgth.2022.970281

This article is part of the Research TopicFairness, Interpretability, Explainability and Accountability in Predictive HealthcareView all 4 articles

Fairness in the prediction of acute postoperative pain using machine learning models

Jennifer Hagen³

Catherine C. Price^1,5

Patrick J. Tighe^1*

¹Department of Anesthesiology, University of Florida College of Medicine, Gainesville, FL, United Sates
²Department of Epidemiology, University of Florida College of Public Health and Health Professions, Gainesville, FL, United States
³Department of Orthopedic Surgery, University of Florida College of Medicine, Gainesville, FL, United States
⁴Department of Biomedical Engineering, University of Florida Herbert Wertheim College of Engineering, Gainesville, FL, United States
⁵Department of Clinical and Health Psychology, University of Florida College of Public Health and Health Professions, Gainesville, FL, United States

Introduction: Overall performance of machine learning-based prediction models is promising; however, their generalizability and fairness must be vigorously investigated to ensure they perform sufficiently well for all patients.

Objective: This study aimed to evaluate prediction bias in machine learning models used for predicting acute postoperative pain.

Method: We conducted a retrospective review of electronic health records for patients undergoing orthopedic surgery from June 1, 2011, to June 30, 2019, at the University of Florida Health system/Shands Hospital. CatBoost machine learning models were trained for predicting the binary outcome of low (≤4) and high pain (>4). Model biases were assessed against seven protected attributes of age, sex, race, area deprivation index (ADI), speaking language, health literacy, and insurance type. Reweighing of protected attributes was investigated for reducing model bias compared with base models. Fairness metrics of equal opportunity, predictive parity, predictive equality, statistical parity, and overall accuracy equality were examined.

Results: The final dataset included 14,263 patients [age: 60.72 (16.03) years, 53.87% female, 39.13% low acute postoperative pain]. The machine learning model (area under the curve, 0.71) was biased in terms of age, race, ADI, and insurance type, but not in terms of sex, language, and health literacy. Despite promising overall performance in predicting acute postoperative pain, machine learning-based prediction models may be biased with respect to protected attributes.

Conclusion: These findings show the need to evaluate fairness in machine learning models involved in perioperative pain before they are implemented as clinical decision support tools.

Introduction

Acute postoperative pain is a significant public health problem. Eighty percent of surgical patients report experiencing postoperative pain, with as high as 88% of those reporting moderate or higher levels of pain (1, 2). Severe acute postoperative pain is associated with the development of persistent postoperative pain, although the nature of this relationship remains unclear (3–5). Poorly managed acute postoperative pain may lead to adverse outcomes, including lower patient satisfaction with pain management, delayed inpatient recovery and discharge, increased costs of care, chronic pain, inappropriate opioid prescribing, increased risk of opioid misuse and opioid use disorder, overdose, and death (2, 6–10).

One reason for suboptimal pain management is the difficulty in predicting severe acute postoperative pain. A preoperative prediction model for acute postoperative pain could, for instance, suggest a preoperative application of regional anesthetic techniques in patients whose surgical procedures may not otherwise qualify for such therapies based on local procedure-based heuristics. Over the past several decades, numerous models have been proposed to understand patient and procedural risk factors for severe acute postoperative pain (11–13). Although these models helped determine relevant predisposing and precipitating factors of moderate to severe postoperative acute pain, they often incorporated features that required extra clinical assessments, such as pain catastrophizing, anxiety, and functional disability tests (11, 14–16). Additionally, most previous research in this domain has focused on determining risk factors for postoperative pain using statistical methodology. Prior work suggests that given similar features, machine learning models can outperform linear statistical models in classifying outcomes related to postoperative pain (17, 18). Previous work using machine learning to predict pain with perioperative data shows promising performance with an area under the curve (AUC) of 0.70 for predicting acute postoperative pain (19).

Although machine learning methods have significantly improved the accuracy of predictions, questions remain concerning their interpretability and fairness. Those aspects are especially important for future implementation and translational research. Previous research using machine learning in healthcare primarily focused on the model's overall performance, which was evaluated based only on how well the model predicted the decided outcome for the test dataset. Recently, there has been growing concern about the performance of these models among underrepresented and marginalized groups that may not be well represented in the dataset used for training the model. This is a crucial consideration for physicians applying population-level models on individual patients from underrepresented backgrounds who may ask how well such models apply to them personally. Here, concerns over machine learning “fairness” refer to the algorithmic bias in machine learning approaches, where the developed models will systematically predict an outcome more likely for one group than another, especially when these groupings are based on sensitive attributes that should not be correlated with the outcomes (e.g., ethnicity, gender, disability status). Using a model with a strong performance in the general population but that is biased against unprivileged groups might be harmful to patients in the unprivileged subcohorts.

To date, there have been no formal assessments of fairness in machine learning models used to predict postoperative pain. This retrospective cohort analysis examines the fairness of a best-in-class machine learning model that predicts acute postoperative pain among patients presenting for orthopedic surgery. We hypothesized that even in models that performed well overall in classification accuracy and the AUC, select population subgroups may suffer from much poorer performance regarding acute postoperative pain risk stratification.

Materials and methods

The study protocol was approved by the University of Florida Institutional Review Board (IRB #201601338), which waived informed consent. This retrospective single-center machine learning study was designed and conducted according to Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research: A Multidisciplinary View (20).

Dataset

The retrospective cohort consisted of adult surgical patients undergoing orthopedic surgery at the University of Florida Health system/Shands Hospital between June 1, 2011, and June 30, 2019, and residing in Florida at the time of surgery. Orthopedic surgical procedures are among the surgical procedures with the highest postoperative pain (21–23). The cohort of patients undergoing orthopedic surgery was not constrained to any specific sociodemographic group. Data were provided by the UF Health Integrated Data Repository via an honest data broker; all variables were validated via a continuous quality control process.

Our primary diagnostic outcome was mean pain on postoperative day 1 (POD1; during the day after the surgery). We used clinical pain intensities assessed using the Defense and Veterans Pain Rating Scale (DVPRS, ranging from 0 to 10) (24) and entered into the electronic health record (EHR) system as part of routine clinical care. Notably, the EHR implementation contains user prompts providing instruction in the bedside application of the DVPRS. The mean of all numerical pain scores of the patients on POD1 was calculated and dichotomized into a binary outcome: no pain or mild pain class (discussed as “low pain” in subsequent sections; pain scores 0–4) and moderate or severe pain class (discussed as “high pain”; pain scores >4). The observation unit for the outcome and predictors was patient-based. This threshold was based on prior work establishing cutpoints for mild pain intensity (25–28).

Pain management guidelines at this institution have been developed in concert with the surgical service and acute pain medicine service. These guidelines heavily emphasize multimodal analgesic strategies with regular use of preoperatively placed continuous catheter-based peripheral, paravertebral, and neuraxial regional anesthesia by faculty with fellowship training in regional anesthesia and acute pain medicine. All patients who receive blocks are reviewed upon arrival to the recovery room or intensive care unit, and block adjustments or additions are made accordingly. The central tenants of these guidelines have largely remained intact over the past decade, with annual review and updates as needed. Additional details of this process have been published previously (16).

We included common sociodemographic and clinical variables routinely available for surgical patients, including age, sex, race, ethnicity, marital status, body mass index (BMI), language, health literacy, insurance, area deprivation index (ADI) (29) of patient's residence, diagnosis categories, current procedure terminology (CPT) category, combined comorbidity score (30, 31), and the American Society of Anesthesiologists physical status (ASA-PS) classification (32). Health literacy was determined using the Rapid Estimate of Adult Literacy in Medicine-Revised (REALM-R) assessment (33), and patients who could correctly pronounce all of the eight proposed words were recorded as having adequate health literacy. Procedure and diagnosis categories were determined via the clinical classification software (CCS), which is a Healthcare Cost and Utilization Project (HCUP) research tool, using patients' CPT codes and International Classification of Diseases (ICD 9 and 10) codes (34–36). These variables were chosen given their general widespread availability in administrative datasets. CPT categories are referred to as CCS-CPT in this article.

We used “sociome” and “sf” packages in R for extracting patient ADI information (37, 38). ADI scores were extracted from the “sociome” R package for each census tract in the state of Florida using 15 variables from the American Community Survey for the year 2019 (5-year data) (Supplementary Table S1). The ADI encompasses education, employment, poverty, and environment indicators in the census tract, with higher values showing worse neighborhood deprivation. We used the shapefiles of census tract borders available from the US Census Bureau (39). Latitude and longitude coordinates of the patients' residences at the time of surgery were spatially joined with the polygons of the census tract borders. The extracted neighborhoods' US Census Bureau geographic identifiers (GEOIDs) were used to find neighborhood tract attributes from the census data tables. ADI is a continuous variable and was used as such in the prediction models. We transformed its value into tertiles for our analysis of fairness.

Machine learning models and statistical considerations

Figure 1 shows the analytical workflow. The variables used in POD1 pain prediction models were summarized and compared between the two patient groups using the student t-test for continuous variables and chi-squared test for categorical variables.

FIGURE 1

Figure 1. Analytical workflow. EHR, electronic health records; ML, machine learning.

We used the CatBoost machine learning classification models to predict pain using EHR data. We used fourfold nested cross-validation for parameter tuning using AUC as the loss function and reported models' performance as the mean (SD) of the model performance metrics on the unseen test data for each of the four outer folds (internal validation with threefold cross-validation with patient-based split). We reported the model performance in terms of accuracy and balanced accuracy, AUC, precision, recall, and F1-score. We used the “CatBoost” library for developing CatBoost models. We did not use stratification based on the outcome in training the models because the outcome was not significantly imbalanced. Similarly, we did not use stratification based on any categorical variables in training the model because the aim of the study was to investigate the severity of algorithmic bias in the machine learning models developed for pain prediction used as decision support tools, which usually relay data reflective of the real patients' data. We reported the ranking of the variable importance in the model's training using variable importance extracted from the model, based on the change in the loss function. Observing the variable importance ranking is helpful for feature selection and informed dimensionality reduction. Information on variable importance ranking in the model also provides insight into the data and the model. A higher importance ranking of protected attributes may cause concern for fairness in that attribute. Further details of data cleaning and preprocessing steps and model developments are reported in the Supplementary Methods, Supplementary Tables S2, S3 and Figure S1.

Fairness

Investigating bias

We studied model bias for the following sociodemographic attributes: age, sex, race, language, health literacy, ADI, and insurance type. In this context, the privileged group was defined as subcohorts with a lower risk of adverse clinical outcomes. The unprivileged groups were determined as subcohorts with a higher risk of adverse clinical outcomes. Table 1 shows the classes of protected attributes and their corresponding privileged and unprivileged values. We used the “Dalex” library in Python to evaluate model fairness (40). In investigating the fairness of the classifier concerning each of the attributes mentioned, several model performance metrics were calculated and compared between the privileged and unprivileged subcohorts. For each protected attribute, model performance was calculated based on the performance metrics defined in Table 2 for each unprivileged subcohort separately and compared to the model's performance for the privileged group (41, 42).

TABLE 1

Table 1. Privileged and unprivileged values of the protected attributes studied.

TABLE 2

Table 2. Model performance metrics definitions.

More specifically, we calculated the ratio (Equation 1) for each unprivileged class and each model performance metric. The closer this ratio is to 1, the fairer the model performance. With ɛ a value between 0 and 1, we used the value of 0.8 as a threshold to determine bias in our models’ performance as a threshold for bias in other domains (known as the “80% rule”) (43). ɛ value of 0.8 resulted in an acceptable range of model performance ratio between 0.8 and 1.25, meaning that if the ratio defined in Equation 1 was between 0.8 and 1.25, the model was reported to not be biased for that metric and that attribute class:

\forall_{i \in {a, b, \dots, z}} ε < \frac{metri c_{i}}{metri c_{privileged}} < \frac{1}{ε} (1)

Mitigating bias

To address the algorithmic bias in the prediction models, we used a reweighing approach to adjust the weight of observations in each attribute-outcome combination in training the model and compared the bias in the new models to the base models. In this approach, a new model was built using the observation weights defined based on the number of observations (patients) in each unprivileged and privileged group.

Data preparation and analysis were performed in R V4.0.0 and Python V3.8.5.