Application of machine learning and artificial intelligence in the diagnosis and classification of polycystic ovarian syndrome: a systematic review

Barrera, Francisco J.; Brown, Ethan D.L.; Rojo, Amanda; Obeso, Javier; Plata, Hiram; Lincango, Eddy P.; Terry, Nancy; Rodríguez-Gutiérrez, René; Hall, Janet E.; Shekhar, Skand

doi:10.3389/fendo.2023.1106625

SYSTEMATIC REVIEW article

Front. Endocrinol., 18 September 2023

Sec. Reproduction

Volume 14 - 2023 | https://doi.org/10.3389/fendo.2023.1106625

Application of machine learning and artificial intelligence in the diagnosis and classification of polycystic ovarian syndrome: a systematic review

Francisco J. Barrera^1,2

Nancy Terry⁵

René Rodríguez-Gutiérrez^2,4,6

Janet E. Hall³

Skand Shekhar^3*

¹Department of Epidemiology, Harvard TH Chan School of Public Health, Boston, MA, United States
²Plataforma INVEST Medicina, Universidad Autónoma de Nuevo León- Knowledge Education Research (UANL-KER), Unit Mayo Clinic (KER Unit Mexico), Universidad Autónoma de Nuevo León, Monterrey, Mexico
³Reproductive Physiology and Pathophysiology Group, Clinical Research Branch, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC, United States
⁴Knowledge and Evaluation Research Unit-Endocrinology (KER-Endo), Mayo Clinic, Rochester, MN, United States
⁵Division of Library Services, Office of Research Services, National Institutes of Health, Bethesda, MD, United States
⁶Endocrinology Division, Department of Internal Medicine, University Hospital “Dr. José E. González”, Universidad Autonoma de Nuevo Leon, Monterrey, Mexico

Introduction: Polycystic Ovarian Syndrome (PCOS) is the most common endocrinopathy in women of reproductive age and remains widely underdiagnosed leading to significant morbidity. Artificial intelligence (AI) and machine learning (ML) hold promise in improving diagnostics. Thus, we performed a systematic review of literature to identify the utility of AI/ML in the diagnosis or classification of PCOS.

Methods: We applied a search strategy using the following databases MEDLINE, Embase, the Cochrane Central Register of Controlled Trials, the Web of Science, and the IEEE Xplore Digital Library using relevant keywords. Eligible studies were identified, and results were extracted for their synthesis from inception until January 1, 2022.

Results: 135 studies were screened and ultimately, 31 studies were included in this study. Data sources used by the AI/ML interventions included clinical data, electronic health records, and genetic and proteomic data. Ten studies (32%) employed standardized criteria (NIH, Rotterdam, or Revised International PCOS classification), while 17 (55%) used clinical information with/without imaging. The most common AI techniques employed were support vector machine (42% studies), K-nearest neighbor (26%), and regression models (23%) were the commonest AI/ML. Receiver operating curves (ROC) were employed to compare AI/ML with clinical diagnosis. Area under the ROC ranged from 73% to 100% (n=7 studies), diagnostic accuracy from 89% to 100% (n=4 studies), sensitivity from 41% to 100% (n=10 studies), specificity from 75% to 100% (n=10 studies), positive predictive value (PPV) from 68% to 95% (n=4 studies), and negative predictive value (NPV) from 94% to 99% (n=2 studies).

Conclusion: Artificial intelligence and machine learning provide a high diagnostic and classification performance in detecting PCOS, thereby providing an avenue for early diagnosis of this disorder. However, AI-based studies should use standardized PCOS diagnostic criteria to enhance the clinical applicability of AI/ML in PCOS and improve adherence to methodological and reporting guidelines for maximum diagnostic utility.

Systematic review registration: https://www.crd.york.ac.uk/prospero/, identifier CRD42022295287.

Introduction

Polycystic Ovary Syndrome (PCOS) is the most common endocrinopathy in reproductive aged women, with an estimated prevalence ranging from 4% to 20% and affecting more than 66 million worldwide in 2019 (1–5). PCOS is associated with increased incidence of cardiovascular disease, infertility, and of endometrial cancer (6–9). Its public health burden is immense, with nearly eight billion US dollars spent in 2020 to manage PCOS-related symptoms among women in the United States alone (10).

The diagnosis of PCOS is based on clinical criteria, with the Rotterdam criteria/International PCOS criteria (11, 12) being the most widely accepted. PCOS is characterized by the presence of a combination of hyperandrogenism, ovulatory dysregulation, and polycystic ovarian morphology (PCOM) (13–15). This already heterogenous clinical phenotype is complicated further by the elaborate interplay of genetic and environmental factors, such as diet related obesity or lifestyle factors, which affect clinical presentation (16). The criteria-based diagnosis of PCOS is complicated by variations in the clinical assessment of hyperandrogenism and determination of menstrual irregularity. Furthermore, the variation in normative standards for PCOM compounds these challenges (17). Estimates suggest that diagnosis is delayed by more than two years in one third of women with PCOS; yet this is likely an underestimation (18).

Artificial intelligence (AI) refers to simulation of human intelligence by computer based systems (19). On the other hand, machine learning (ML) is a subdivision of AI focused on learning from previous events and applying this knowledge to future decision making (20). ML techniques can be sub-classified as either supervised or unsupervised (21). The revolutionary advances in AI and ML over the last decade promise to rapidly advance our ability to diagnose and manage PCOS. This is in part due to the ability of AI to process massive amounts of disparate data, making it an ideal aid in the diagnosis of heterogeneous disorders like PCOS.

Several studies have investigated the ability of ML models to synthesize such disparate data as family genetic history, biomarkers, and demographic information into a unified algorithm for the diagnosis of PCOS, and make diagnostic predictions (22). Some pitfalls of these studies are their small size (22), lack of relevant comparators (23), use of varied diagnostic criteria (24, 25), and heterogeneity in reporting structures. Thus, the real gaps in knowledge and the full scope of AI/ML in the diagnosis of PCOS remain unclear. To better understand and summarize the body of evidence related to the application of AI/ML in PCOS, we conducted a systematic review of all relevant studies published up to January 1, 2022.

Methods

Study overview and eligibility criteria

This manuscript employed the Preferred Reporting Items for Systematic Reviews and Meta-analysis (PRISMA) guidelines, and was submitted to PROSPERO (record number PROSPERO 2022 CRD42022295287) (26). We included English language, peer-reviewed original studies that evaluated the use of AI/ML in diagnosing, classifying, stratifying, or predicting PCOS. We subdivided studies into those that ‘diagnosed’ and those that ‘classified’ PCOS subjects. Studies were considered to diagnose PCOS if they employed standard diagnostic criteria such as NIH, Rotterdam, androgen excess-PCOS and international PCOS criteria. In contrast, those studies that partially used standard criteria or only used some measures to determine PCOS were considered to ‘classify’ subjects as having PCOS.

Data sources and search strategy

We applied a search strategy developed in collaboration with an experienced librarian to find potentially eligible studies. Databases searched were MEDLINE, Embase, the Cochrane Central Register of Controlled Trials, the Web of Science, and the IEEE Xplore Digital Library. The search included all articles from the time of inception of the dataset to May 2019. Conference abstracts were included if they fulfilled the eligibility criteria provided the manuscript wasn’t published. The full search strategy is included in Supplementary Material 3.

Study selection

We uploaded all references to Covidence and performed two rounds of screening, title-and-abstract screening, and full-text screening. Each article was assessed for eligibility by two independent reviewers in both rounds of screening using standardized instructions. Pilot phases were conducted before each screening round to ensure a baseline understanding of the eligibility criteria and resolve misunderstandings between reviewers. Inter-rater reliability assessed through Cohen’s Kappa statistic was high (κ>0.80) in both rounds of screening.

In the first screening round, disagreements were included in the second round. In the second round, disagreements were resolved by consensus between reviewers or by arbitration of a third trained reviewer.

Data collection and management

Five reviewers working independently and in duplicate extracted data from studies using a standardized extraction form. Two pilot phases were performed to ensure proficiency in the data extraction procedure. Further disagreements were discussed and resolved by consensus, and the database was cleaned by two reviewers. The extracted variables were: 1) study characteristics (authors’ information, publication year, country and setting, study design, aim and type of machine learning used, and type of data entered into the models); 2) artificial intelligence/machine learning characteristics (type of dataset used, dataset independence, type of results reported [sensitivity, specificity, area under the curve, diagnostic accuracy, precision]); 3) PCOS characteristics (definition of the disease, sample size); and 4) risk of bias.

Risk of bias

Each study was assessed for risk of bias by two independent reviewers and disagreements were resolved by two separate reviewers. We used a modified version of the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool, which includes four domains: patient selection, index test, reference standard, and flow and timing. As this tool is not designed for systematic reviews of diagnostic accuracy studies using AI/ML interventions, we summarized and complemented it with input from the authors to ensure that critical questions for AI/ML interventions were included in addition to the relevant pre-existing QUADAS-2 questions. Details of the modified QUADAS-2 tool are provided in Supplementary Material 2. The tailored QUADAS-2 tool was piloted on five studies by all reviewers and differences resolved with consensus. If a study had at least two domains at unclear risk of bias without any domain deemed at high risk of bias, it was judged to be at unclear risk of bias. Finally, studies with domains classified as low risk of bias without any domain of unclear or high risk of bias were considered low risk of bias.

Results

Characteristics of the included studies

A total of 31 studies met our inclusion criteria (Figure 1). All studies were observational and used retrospective data samples to assess the performance of the AI/ML process on the diagnosis or classification of patients. Seven of 31 studies (23%) were multi-center studies and many were conducted either in India (29%) or in China (16%). Eleven studies (36%) included subjects who did not have PCOS as controls. Sample size ranged from 9 to 2,000 patients with PCOS and the median age of participants included in studies was 29 years. The rest of the general characteristics can be found in Table 1.

FIGURE 1

Figure 1 Selection process of the studies. Article selection flow chart for studies related to AI/ML and PCOS according to Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.

TABLE 1

Table 1 Characteristics of the included studies.

Nearly half of all studies (48%) used ultrasound images to implement the AI/ML intervention. Twelve studies (39%) used clinical data such as anthropometric features (10%), signs and symptoms (16%), biomarkers (19%), genetics (13%) and metabolomics or proteomics (10%).

Ten (32%) studies used a validated diagnostic criterion to select the population, such as exclusively the Rotterdam criteria (23%), the NIH Criteria (6%), with one study using a combination of NIH and Rotterdam criteria (3%) (11, 12, 57) (Table 1). Another study (3%) used the Adams criteria, an imaging-based criteria which has not been clinically validated (58). The remaining 20 studies (65%) used clinical information to make the diagnosis, with or without complementary imaging (55%), with one study using ICD codes. Two (7%) studies reported that they used age-matched participants without the diagnosis of PCOS as controls, while other studies reported scarce information about controls; including definitions such as “normal ovaries through imaging”, or “normal ovulation cycles”. Five (16%) studies provided no definition for controls.

AI/ML models performance

Among the ten (32%) studies that used standardized diagnostic criteria, the area under the receiver operator curve ranged from 80% to 100% (n=3 studies), diagnostic accuracy from 89% to 100% (n=4 studies), sensitivity from 87% to 100% (n=3 studies), specificity from 90% to 100% (n=3 studies), and positive predictive value from 68% to 81% (n=2 studies), and negative predictive value (NPV) from 94% to 99% (n=2 studies). Performance measures for all the included studies are shown in Table 2. The studies that used standardized PCOS criteria are summarized in Figure 2 by outcome type.

TABLE 2

Table 2 Main findings of the included studies.

FIGURE 2

Figure 2 Unpooled results of studies with well-defined PCOS patient population. Outcomes and interventions are denoted in shorthand as Area Under the Curve (AUC), Partial Least-Squares Discriminant Analysis (PLS-DA), Support Vector Machine (SVM), and K-Nearest Neighbor (K-NN). A parameter threshold of 80% (0.8) indicated by the dotted line was considered a benchmark to evaluate studies assuming a 20% performance error.

Machine learning methods

The majority (71%) of studies we investigated used supervised methods. The most common were support vector machine (SVM) (42%), K-nearest neighbor (26%), regression models (23%), and Random Forest (23%). Unsupervised methods were used in nine (29%) studies and included neural networks (13%), Otsu’s thresholding and Watershed + object growing algorithm (6%), clustering analysis (6%), and self-organizing maps (3%) (Table 2). Various AI/ML models are described in Table 3.

TABLE 3

Table 3 Machine Learning Methods.

Only six (19%) studies performed all major steps of training, testing, and validation in their AI/ML interventions. About three-quarters of studies (74%) performed at least one of these steps. Specifically, ten (32%) studies performed only training and testing, four (13%) only training and validation, and three (10%) completed only one of them. Among those studies that used at least two steps, all used an independent data set for each step by using a proportion of their sample for each step or cross-validation models (where data is trained and tested on different observations).

Nineteen (61%) studies compared the effectiveness of two or more AI/ML interventions on the same sample, while only three (10%) compared AI/ML interventions against a non-machine learning classifier (board-certified physician or ICD-9 codes). Of these three, two studies described the criteria used by the clinician or the codes used to make the diagnosis.

Risk of bias

Overall, the risk of bias was judged to be high across all studies (Supplementary Material 1). Six (19%) studies described using a consecutive or random sample of the enrolled patients. Moreover, five (16%) studies used validated criteria to select their population, which affected risk of bias due to misclassification bias but also applicability bias due to an unclearly defined patient population in the studies. About half of all (52%) studies used an independent dataset to validate the AI/ML intervention. Finally, nine (29%) studies had hospital affiliations or a physician as a co-author of the study.

Discussion

We performed a systematic review of AI/ML interventions in PCOS. All included studies were observational and retrospective. A small number used standard inclusion criteria such as the NIH, Rotterdam, or International PCOS criteria for diagnosis. Most studies achieved a high ability to diagnose PCOS or ‘classify’ patients as having PCOS using AI informed by clinical, radiological, electronic health records or biochemical data. Among the ten studies that used standardized criteria, the area under the receiver operator curve ranged from 80% to 100%, diagnostic accuracy from 89% to 100%, sensitivity from 87% to 100%, specificity from 90% to 100%, and positive predictive value from 68% to 81%. The most common AI/ML methods were SVM in 13 (42%) studies, K-nearest neighbor in eight (26%) studies, and regression models in seven (23%) studies. Importantly, a large number of the studies analyzed in the current review were able to achieve a high degree of diagnostic accuracy relative to standardized criteria. For instance, Deshpande et al. (2014) attained a 95% diagnostic accuracy against the Rotterdam criteria using an SVM algorithm using ultrasound imaging, clinical, and biochemical data (44). Similarly, Bharti et al. (2020) employed multiple ML algorithms to a dataset of 364 women with and without PCOS using clinical and imaging data and reported a > 90% diagnostic accuracy for the best SVM model (28).

AI/ML-based screening techniques for diabetic retinopathy and colorectal cancer have previously been found to be highly cost-effective (59, 60). In the case of colorectal cancer, cost savings of 400 million USD have been estimated when comparing next generation sequencing approaches to AI-based screening techniques (61). The potential use of AI/ML in the diagnosis and management of endocrine disorders has sparked intense research activity. A recent review reported that among the 611 ML-based endocrinology studies published between 2015 and 2020, 52% focused on diabetes, 14% on retinopathy, 14% on thyroid dysfunction, 8% on endocrine-related carcinoma, 7% on osteoporosis, and 5% on other disease states (62). Despite a growth in such studies, FDA-approved applications of AI for diagnostic or therapeutic purposes have lagged and approved devices employing AI/ML are concentrated in the management of diabetes and related conditions (63, 64).

In comparison, polycystic ovarian syndrome represents an ideal setting for future AI-based tools, given its high prevalence, significant healthcare burden, delayed detection, and complex diagnostic criteria spanning clinical, biochemical, and radiological domains. The diagnostic delay of greater than two years in a third of women reporting PCOS symptoms is a potent target for AI/ML-based approaches (18). Furthermore, geographical heterogeneity in clinical features of PCOS suggests an additional role of environmental influences, which may be overcome through adoption of AI/ML (65). Together, high costs and diagnostic delays in PCOS present a major unmet need which could be filled by the adoption of AI technology, as effectively demonstrated in other diseases. AI holds especially high potential for the diagnosis of PCOS because of its heterogeneous nature, with clinical, biochemical and radiological features each being incorporated into its diagnostic criteria (12). The use of AI on electronic health record (EHR) systems holds the potential to integrate these features while reducing diagnostic delays in PCOS.

The current body of research on AI in PCOS has revealed high rates of sensitivity and accuracy of PCOS detection. This implies that a well-designed AI/ML based program has the potential to significantly enhance our capability to diagnose PCOS early, with associated cost savings and a reduced burden of PCOS on patients and on the health system. However, several gaps remain in the domain of AI/ML based detection of PCOS. First, we noted that only a third of studies (32%) used standardized criteria such as the Rotterdam, NIH and International PCOS criteria as reference standards when evaluating AI in PCOS. This presents a high possibility misclassification of disease and biased detection estimates. Second, there was considerable heterogeneity in assessed AI-based studies, with some relying exclusively on a single parameter of PCOS diagnosis such as radiological, biochemical, or clinical features, despite Rotterdam criteria recommending diagnosis based on more than one of these elements. Third, a large number of assessed studies did not exhaustively report methodology/algorithms for AI based diagnosis, presenting concerns about the reproducibility of their findings. Most studies also relied on observational/retrospective data without use of prospective studies or validation datasets, limiting their applicability (66). A fourth major gap was the inadequate utilization of electronic health records, one of the most promising avenues for AI integration due to their potential for synthesizing clinical, biochemical, radiological, and genetic information and reducing lead time to the diagnosis in PCOS. This warrants further investigation in future studies. Finally, we noted that a vast number of AI/ML based studies were conducted in non-healthcare settings (71%) with non-healthcare investigators (97%). This raises the possibility of reduced applicability and relevance of studies in the clinical management of PCOS since such studies, while being technically robust, may not account for clinically important variables and outcomes. It is therefore important for physicians to become more aware of the advantages of AI/ML based methodologies and for physicians and computational scientists interested in AI/ML to work together to optimize the power of these new tools. Moreover, future AI/ML studies with applications for PCOS or other conditions, should make greater efforts to increase the methodological quality to increase the validity of the results. For this, we recommend the following five measures to improve the applicability of AI/ML for diagnosing PCOS and improving its care.

1. Increase collaboration between clinicians, researchers, and computational biologists.

2. Set up combined registries of data that include defined clinical, radiological (including images), and laboratory data (with reference values) of PCOS patients.

3. Use standardized criteria to train machine learning models as the standard reference and perform robust training and validation studies in PCOS patients.

3. Since some of the data used to develop the model may have some variation by time, it is important that future studies also test for performance (accuracy measures) consistency across time.

4. Enhance integration of population-based studies [e.g. All of Us, NHANES (67, 68)] with electronic health datasets to identify risk factors and risk enhancers for PCOS.

5. Include commonly used biochemical tests such as AMH, gonadal hormones, markers of insulin resistance and others in AI/ML to identify reliable biomarkers that can aid the diagnosis of PCOS.

To our knowledge, this is the first systematic review of AI/ML in the diagnosis of PCOS, spanning all published studies to date. We followed the methodological standards for systematic reviews proscribed by PRISMA guidelines. Despite the absence of a methodological assessment tool for evaluation of AI/ML based studies at the time of execution of this review, we performed a thorough evaluation of the quality by adapting the QUADAS-2 tool and adding relevant questions for the AI/ML interventions evaluated. Although not a weakness of our methods, confidence in our results is limited by the relatively small number of studies conducted on this subject, the heterogeneity of available data, and the risk of bias in primary studies. Broadly, poor dataset sourcing using non-standardized criteria, inconsistent use of best-practice machine learning methods, and limited clinical affiliations among authorship all undermined confidence in our selected studies.

In conclusion, our findings suggest that there is a high potential of AI/ML based programs in the diagnosis and care of PCOS, but that future studies should focus on enhancing methodological robustness and incorporating variables and outcomes of clinical importance.

Data availability statement

The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

Author contributions

FB and SS designed the study. FB and EB participated in all phases of the conduction of the study. FB, EB, SS, AR, JO, HP, and EL participated in screening, data extraction, and manuscript writing. JH, RR-G and SS reviewed the final version of the manuscript. All authors contributed to the article and approved the submitted version.

Funding

Intramural Research Program (ZIDES102465 and ZID ES103323) of the National Institute of Environmental Health Sciences, National Institutes of Health, United States.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fendo.2023.1106625/full#supplementary-material

References

1. Carmina E, Lobo RA. Polycystic ovary syndrome (PCOS): arguably the most common endocrinopathy is associated with significant morbidity in women. J Clin Endocrinol Metab (1999) 84(6):1897–9. doi: 10.1210/jcem.84.6.5803

PubMed Abstract | CrossRef Full Text | Google Scholar

2. Knochenhauer ES, Key TJ, Kahsar-Miller M, Waggoner W, Boots LR, Azziz R. Prevalence of the polycystic ovary syndrome in unselected black and white women of the southeastern United States: a prospective study. J Clin Endocrinol Metab (1998) 83(9):3078–82. doi: 10.1210/jc.83.9.3078

PubMed Abstract | CrossRef Full Text | Google Scholar

3. Yildiz BO, Bozdag G, Yapici Z, Esinler I, Yarali H. Prevalence, phenotype and cardiometabolic risk of polycystic ovary syndrome under different diagnostic criteria. Hum Reprod (2012) 27(10):3067–73. doi: 10.1093/humrep/des232

PubMed Abstract | CrossRef Full Text | Google Scholar

4. Azziz R, Woods KS, Reyna R, Key TJ, Knochenhauer ES, Yildiz BO. The prevalence and features of the polycystic ovary syndrome in an unselected population. J Clin Endocrinol Metab (2004) 89(6):2745–9. doi: 10.1210/jc.2003-032046

PubMed Abstract | CrossRef Full Text | Google Scholar

5. IHME. Polycystic ovarian syndrome. GBD Results, 2019 (2022). Available at: https://www.healthdata.org/results/gbd_summaries/2019/polycystic-ovarian-syndrome-level-4-cause.