SYSTEMATIC REVIEW article

Front. Oncol., 11 January 2024
Sec. Cancer Imaging and Image-directed Interventions

Deep-learning models for image-based gynecological cancer diagnosis: a systematic review and meta-analysis

Asefa Adimasu Taddese1,2*, Binyam Chakilu Tilahun1,2, Tadesse Awoke3, Asmamaw Atnafu2,4, Adane Mamuye2,5, Shegaw Anagaw Mengiste6
  • 1Department of Health Informatics, Institute of Public Health, College of Medicine and Health Sciences, University of Gondar, Gondar, Ethiopia
  • 2eHealthlab Ethiopia Research Center, University of Gondar, Gondar, Ethiopia
  • 3Department of Epidemiology and Biostatistics, Institute of Public Health, College of Medicine and Health Sciences, University of Gondar, Gondar, Ethiopia
  • 4Department of Health Systems and Policy, Institute of Public Health, College of Medicine and Health Sciences, University of Gondar, Gondar, Ethiopia
  • 5School of Information Technology and Engineering, Addis Ababa University, Addis Ababa, Ethiopia
  • 6Department of Business, History and Social Sciences, University of Southeastern Norway, Vestfold, Norway

Introduction: Gynecological cancers pose a significant threat to women worldwide, especially those in resource-limited settings. Human analysis of images remains the primary method of diagnosis, but it can be inconsistent and inaccurate. Deep learning (DL) can potentially enhance image-based diagnosis by providing objective and accurate results. This systematic review and meta-analysis aimed to summarize recent advances in DL techniques for gynecological cancer diagnosis using various images and to explore their future implications.

Methods: The study followed the PRISMA 2020 guidelines, and the protocol was registered in PROSPERO. Five databases were searched for articles published from January 2018 to December 2022. Articles that focused on five types of gynecological cancer and used DL for diagnosis were selected. Two reviewers assessed the articles for eligibility and quality using the QUADAS-2 tool. Data were extracted from each study, and the performance of DL techniques for gynecological cancer classification was estimated by transforming and pooling sensitivity and specificity values using a random-effects model.

Results: The review included 48 studies, and the meta-analysis included 24 studies. The studies used different images and models to diagnose different gynecological cancers. The most popular models were ResNet, VGGNet, and UNet. DL algorithms showed higher sensitivity but lower specificity than machine learning (ML) methods. The AUC of the summary receiver operating characteristic plot was higher for DL algorithms than for ML methods. Of the 48 studies included, 41 were at low risk of bias.

Conclusion: This review highlights the potential of DL to improve the screening and diagnosis of gynecological cancer, particularly in resource-limited settings. However, the high heterogeneity and variable quality of the studies could affect the validity of the results. Further research is necessary to validate the findings of this study and to explore the potential of DL in improving gynecological cancer diagnosis.

1 Introduction

Gynecological cancers refer to various types of cancers that affect the female reproductive system, including ovarian, cervical, uterine, vaginal, and vulvar cancers (1, 2). These cancers are a significant public health concern worldwide, with approximately 1.3 million new cases and 500,000 deaths annually (3–5). The high mortality rate associated with gynecological cancers can be attributed to late diagnosis, which underscores the importance of early detection and treatment (6).

Medical imaging, such as ultrasound, computed tomography (CT), and magnetic resonance imaging (MRI), plays a crucial role in the early detection and diagnosis of gynecological cancers. However, accurate interpretation of these images can be challenging, and traditional diagnostic methods may not always provide accurate results (7–9). In recent years, deep-learning models have emerged as a promising tool for improving the accuracy and efficiency of gynecological cancer diagnosis from medical images (10–16).

Deep-learning models are a subset of artificial intelligence (AI) that can learn to recognize patterns and features in data without being explicitly programmed. These models can be trained using large datasets of medical images to detect subtle differences between normal and abnormal tissue (17–19). Studies have shown that deep-learning models can achieve high accuracy in detecting gynecological cancers from medical images, outperforming traditional diagnostic methods such as manual interpretation by human experts (20–30).

However, the effectiveness and reliability of these models in clinical settings remain unclear, and a comprehensive review of the existing literature is necessary.

The present study aims to conduct a systematic review and meta-analysis of the current literature on deep-learning models for image-based gynecological cancer diagnosis. The review will address the following research questions:

1. What machine-learning and deep-learning models have been developed for image-based gynecological cancer diagnosis, and how do they compare in terms of performance and accuracy?

2. What are the limitations and challenges associated with the use of deep-learning models in gynecological cancer diagnosis?

3. What are the implications of the findings for the clinical application of deep-learning models in image-based gynecological cancer diagnosis?

By answering these research questions, this study provides valuable insights into the potential of deep-learning models for improving the accuracy and efficiency of gynecological cancer diagnosis and informs future research and clinical practice in this field.

2 Methods

2.1 Literature search

The literature search was conducted in accordance with the Enhancing the Quality and Transparency of Health Research (EQUATOR) reporting guidelines and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA 2020) (31). A protocol for the study was registered in PROSPERO (ID No. CRD42023421847). The search was performed on the following databases and websites: PubMed, Embase, Scopus, Google, and Google Scholar. The search was conducted from January 8 to 30, 2023, and included articles published between January 2017 and December 2022. The snowball method was used to identify relevant articles from the reference lists of retrieved articles. Sample search terms included gynecologic cancer, diagnosis, prognosis, deep learning, AI, artificial intelligence, machine learning, and neural network.

2.2 Inclusion and exclusion criteria

To be included in the review, studies had to meet the following criteria: (i) consideration of at least one of the five types of gynecologic cancers (cervical, ovarian, uterine, vaginal, or vulvar); (ii) use of at least one deep learning technique as a classifier; (iii) reporting of at least one performance evaluation measure for deep learning-based image segmentation of gynecologic cancers; (iv) publication between January 2017 and August 2022; (v) full-text publication in English; and (vi) availability of full-text articles. Abstracts and preprints were excluded.

2.3 Assessment of methodologic quality

Two researchers (ZA and AA) reviewed the titles and abstracts of the retrieved articles and applied the inclusion and exclusion criteria. The full texts of qualifying articles were then retrieved and assessed by the same two reviewers to confirm study eligibility. Any disagreements were resolved by discussion or by consulting a third reviewer. The data items extracted from each included study are listed in section 2.4. The quality of the studies was assessed using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool, which evaluates the risk of bias and concerns about applicability across four domains: patient selection, index test, reference standard, and flow and timing. Risk of bias and applicability concerns were each rated as high, low, or unclear (32).

2.4 Data extraction

The following information was extracted from each included study: first author, year of publication, country of origin, cancer type, image modality, data source, data size, data preprocessing, ML technique, performance metrics, validation method, and main findings. The PRISMA guidelines were followed for data extraction (31).

2.5 Qualitative synthesis

A qualitative synthesis was performed to provide a narrative summary of findings from the included studies. This synthesis involved a thematic analysis to identify common themes across the literature. Reviewers independently analyzed articles, extracted key findings, and identified themes related to the use of deep learning techniques for image-based gynecological cancer diagnosis.

A constant comparative approach was used to identify similarities and differences across studies. Any discrepancies were resolved through discussion, with a third reviewer consulted when necessary. The themes identified during the qualitative synthesis were summarized and reported in the results section. The goal was to offer a comprehensive overview of the existing literature and pinpoint areas for future research.

2.6 Meta-analysis

A meta-analysis was performed to estimate the pooled performance of deep learning techniques for image-based gynecological cancer diagnosis. Only studies that reported sensitivity and specificity values or provided sufficient data to calculate them were included in the meta-analysis. The sensitivity and specificity values were transformed into logit values and pooled using a random-effects model. The heterogeneity of the studies was evaluated using Higgins’ I² statistic (33). Subgroup analyses were performed based on the type of algorithm used (34). The summary receiver operating characteristic (SROC) curve and the area under the SROC curve (AUC) were calculated to summarize the overall diagnostic accuracy of advanced ML techniques. The statistical analyses were performed using R software with the “meta” and “mada” packages.
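For illustration, the sketch below (Python/NumPy, written for this review rather than taken from any included study, and using hypothetical true-positive and false-negative counts) shows the underlying calculation: per-study sensitivities are logit-transformed, the between-study variance is estimated with the DerSimonian-Laird method, and the pooled estimate is back-transformed to a proportion. The actual analyses were run in R with the “meta” and “mada” packages; this is only a simplified sketch.

```python
# Simplified sketch of random-effects pooling of sensitivities on the logit
# scale (DerSimonian-Laird); the review's actual analysis used R's "meta"
# and "mada" packages. TP/FN counts below are hypothetical.
import numpy as np

# hypothetical per-study true positives and false negatives
tp = np.array([90, 45, 120, 60])
fn = np.array([10, 15, 20, 12])

# continuity-corrected logit sensitivity and its within-study variance
p = (tp + 0.5) / (tp + fn + 1.0)
y = np.log(p / (1 - p))                      # logit(sensitivity)
v = 1.0 / (tp + 0.5) + 1.0 / (fn + 0.5)      # variance of the logit

# DerSimonian-Laird between-study variance (tau^2)
w = 1.0 / v
y_fixed = np.sum(w * y) / np.sum(w)
Q = np.sum(w * (y - y_fixed) ** 2)
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - (len(y) - 1)) / c)

# random-effects pooled logit sensitivity, back-transformed to a proportion
w_re = 1.0 / (v + tau2)
y_re = np.sum(w_re * y) / np.sum(w_re)
pooled_sens = 1.0 / (1.0 + np.exp(-y_re))
i2 = max(0.0, (Q - (len(y) - 1)) / Q) * 100   # Higgins' I^2 (%)
print(f"pooled sensitivity = {pooled_sens:.3f}, I^2 = {i2:.1f}%")
```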

3 Results

3.1 Study selection

The search strategy yielded 1,002 articles from the databases and websites searched. After removing duplicates, 836 articles remained for title and abstract screening. Of these, 357 were excluded on the basis of title and abstract, and a further 339 were removed because full texts were unavailable. The remaining 140 full-text manuscripts were assessed for final eligibility, of which 92 were excluded because they lacked image data or a performance measure. Finally, 48 studies were included in the systematic review, 24 of which provided sufficient data for meta-analysis (Figure 1).

Figure 1 PRISMA flowchart of the study selection process. PRISMA, Preferred Reporting Items for Systematic Reviews and Meta-Analyses. The flowchart displays the search methodology and literature selection process.

3.2 Study characteristics

A total of 48 studies were included in this analysis. The studies used various imaging modalities, including cytology (20 studies), colposcopy (15 studies), MRI (8 studies), CT scan (4 studies), and hysteroscopy (1 study). The studies were published between 2017 and 2022, with the majority (19 studies) published in 2022. The types of cancer studied included cervical cancer (30 studies), endometrial cancer (6 studies), ovarian cancer (9 studies), combined gynecologic cancer (1 study), and vulvar and vaginal cancer (2 studies). The number of images per study ranged from 34 to 67,811, with a mean of 2,209 and a standard deviation of 5,830. Forty studies focused on automatic image classification using various deep learning models, one study focused on the quantification of abnormalities in gynecologic cytopathology with deep learning, and the remaining seven studies focused on automatic tumor segmentation using deep learning techniques (see details in Supplementary File 1).

3.3 Preprocessing and feature extraction techniques

To create deep learning models for the diagnosis of gynecologic cancer based on images, various researchers have employed distinct methodologies. Nonetheless, a consensus among the majority of authors is that initial image preprocessing, involving various techniques, is essential. Following this, the identification and extraction of crucial features are commonly performed before proceeding with post-processing steps (for additional information, please refer to Supplementary File 2).

Twelve articles described various pre-processing and feature extraction techniques used to enhance image quality and extract relevant information for analysis. These techniques were applied in different combinations: some studies combined several pre-processing techniques and feature extraction methods, while others focused on a single technique or method. For instance, Bhatt et al. (35) used horizontal flipping, inverse rotation, random scaling, and progressive resizing to augment Pap smear whole-slide images. Chandran et al. (36) employed random rotation, random brightness, random crop, random blur, and max pooling to preprocess colposcopy images for the diagnosis of cervical cancer. Cheng et al. (37) used threshold truncation, normalization, zooming, and max pooling for image preprocessing. Chen et al. (38) utilized random rotation, oversampling, and max pooling. Cho et al. (39) used automatic central cropping, min-max normalization, data augmentation, and test-time augmentation (TTA). Lastly, Dai et al. (40) used normalization, image resizing, N4 bias field correction using ANTs, and bilinear interpolation for preprocessing MRI images. Feature extraction methods reported in the articles included Otsu model-based thresholding, principal component analysis (PCA), progressive resizing, max pooling, t-SNE, test-time augmentation (TTA), bilinear interpolation, MobileNetV2, and superpixel gap-search (Markov random field) (see details in Table 1).
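As an illustration of how such augmentation and normalization steps are typically chained in practice, the following torchvision-style sketch combines random rotation, flipping, brightness jitter, cropping, blurring, resizing, and normalization. It is a generic example assembled for this review, not the pipeline of any specific study, and all parameter values are illustrative.

```python
# Illustrative preprocessing/augmentation pipeline for colposcopy or cytology
# images, loosely mirroring techniques reported in the reviewed studies
# (random rotation, brightness jitter, cropping, blurring, normalization).
# Parameter values are illustrative, not taken from any specific paper.
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomRotation(degrees=15),                 # random rotation
    transforms.RandomHorizontalFlip(p=0.5),                # horizontal flipping
    transforms.ColorJitter(brightness=0.2),                # random brightness
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),   # random crop + rescale
    transforms.GaussianBlur(kernel_size=3),                # random blur
    transforms.ToTensor(),                                 # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],       # channel-wise normalization
                         std=[0.229, 0.224, 0.225]),
])

# At inference time only deterministic resizing and normalization are applied.
eval_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```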

Table 1 Pre-processing and feature extraction techniques used in the reviewed articles.

3.4 Deep learning models in gynecologic cancer diagnosis

In the context of gynecologic cancer diagnosis, deep learning models have been widely used. Many studies have employed convolutional neural networks (CNNs) or their variants, such as VGGNet, UNet, ResNet, InceptionNet, MobileNet, EfficientNet, DenseNet, YOLO, DResNet, CE-Net, HIENet, Xception, MIA3G, Hybrid, 3D VB-Net, ShuffleNet and ColpNet. Other models, such as autoencoders (AE), recurrent neural networks (RNN) and ensemble learning methods (CNN with SVM or XGBoost), have also been utilized. Among these models, ResNet is the most popular (appearing in 15 studies), followed by VGGNet and UNet (appearing in 12 studies each), InceptionNet and EfficientNet (appearing in 8 studies each), and DResNet, CE-Net, HIENet, KCNN, and MIA3G (appearing in only one study each) (see Figure 2).

Figure 2 Overview of the diversity and popularity of deep learning models used in different studies for the diagnosis of gynecologic cancers.

3.5 Deep learning for gynecologic cancer segmentation

Seven studies that used different deep learning models, including 3D-UNet, 3D VB-Net, ResNet18, 2D-RefineNet, CE-Net, and fully convolutional neural networks, were reviewed for gynecologic cancer segmentation. The studies used different types of images (CT or MRI) and had various sample sizes, ranging from 130 to 826 images. Various model performance measures were reported, such as the 95% Hausdorff distance (HD_95), Dice similarity coefficient (DSC), mean surface distance (MSD), Jaccard index (JI), and average surface distance (ASD).
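For reference, these overlap and surface-distance metrics can be computed directly from binary masks. The NumPy/SciPy sketch below, written for illustration rather than taken from any included study, computes the DSC, the Jaccard index, and a simple 95th-percentile Hausdorff distance between boundary voxels.

```python
# Minimal sketch of common segmentation metrics (DSC, Jaccard index, HD_95)
# computed from binary masks; written for illustration, not from any study.
import numpy as np
from scipy import ndimage

def dice_and_jaccard(pred, gt):
    """pred, gt: boolean arrays of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    dsc = 2.0 * inter / (pred.sum() + gt.sum())
    ji = inter / np.logical_or(pred, gt).sum()
    return dsc, ji

def hd95(pred, gt, spacing=1.0):
    """95th-percentile symmetric Hausdorff distance between mask boundaries."""
    def boundary_points(mask):
        eroded = ndimage.binary_erosion(mask)
        return np.argwhere(mask & ~eroded) * spacing
    p, g = boundary_points(pred), boundary_points(gt)
    # pairwise distances between boundary points (fine for small masks)
    d = np.sqrt(((p[:, None, :] - g[None, :, :]) ** 2).sum(-1))
    return max(np.percentile(d.min(axis=1), 95), np.percentile(d.min(axis=0), 95))
```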

The DSC score, which measures the overlap between automatic and manual segmentation, was one of the most commonly used performance measures. Cheng et al. (37) achieved the highest DSC score for CTV segmentation using 3D-UNet on 400 MRI images, with a score of 0.93. For bladder segmentation, Ding et al. (47) and Cheng et al. (37) achieved the highest DSC score using 3D-UNet on 130 and 400 MRI images, respectively, with a score of 0.91. Ma et al. (48) achieved the highest DSC score for rectum segmentation using 3D VB-Net on 200 CT images, with a score of 0.88. Ding et al. (47) also achieved the highest DSC score for femoral head segmentation using 3D VB-Net on 130 MRI images, with a score of 0.92.

The HD_95, which measures the maximum distance between automatic and manual segmentation boundaries, was another commonly used performance measure. Ding et al. (47) achieved the lowest HD_95 for CTV segmentation using 3D-UNet on 130 MRI images, with a score of 10.03. Ma et al. (48) achieved the lowest HD_95 for bladder and rectum segmentation using 3D VB-Net on 200 CT images, with scores of 4.86 and 4.11, respectively. Ma et al. (48) also achieved the lowest HD_95 for femoral head segmentation using 3D VB-Net on 302 CT images, with a score of 4.86 (see details in Table 2).

Table 2 Deep learning models used for gynecologic cancer segmentation.

3.6 Deep learning models used for abnormality detection

Ke and Shen utilized several models for automatic abnormality detection from medical images, including U-Net, Mask R-CNN, 3D-UNet, and a ResNet-U-Net hybrid. Performance metrics such as pixel accuracy, mean pixel accuracy, and mean IoU were used to evaluate the models. The ResNet-U-Net hybrid achieved the highest performance, with 97.4% pixel accuracy, 95.5% mean pixel accuracy, and 91.3% mean IoU, whereas U-Net had the lowest performance, with a pixel accuracy of 91.3%, mean pixel accuracy of 90.6%, and mean IoU of 89.9%. Figure 3 displays the deep learning models utilized for quantifying images.
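For clarity, pixel accuracy, mean pixel accuracy, and mean IoU can all be derived from a pixel-level confusion matrix, as in the generic NumPy sketch below (an illustration written for this review, not code from Ke and Shen).

```python
# Pixel accuracy, mean pixel accuracy, and mean IoU from a pixel-level
# confusion matrix. Generic illustration; not code from any reviewed study.
import numpy as np

def segmentation_scores(pred, gt, num_classes):
    """pred, gt: integer label maps of the same shape."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, g in zip(pred.ravel(), gt.ravel()):
        cm[g, p] += 1                                   # rows: ground truth, cols: prediction
    pixel_acc = np.diag(cm).sum() / cm.sum()
    per_class_acc = np.diag(cm) / np.maximum(cm.sum(axis=1), 1)
    iou = np.diag(cm) / np.maximum(cm.sum(axis=1) + cm.sum(axis=0) - np.diag(cm), 1)
    return pixel_acc, per_class_acc.mean(), iou.mean()
```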

Figure 3 Deep learning models used to quantify images.

3.7 Deep learning models used for automatic image classification

In the literature reviewed for gynecologic cancer screening and diagnosis, various deep learning models were employed, including ResNet50, ColpoNet, ResNeSt, N-Net, 3D-UNet, and YOLOv3. Several studies reported high performance measures; for example, the ResNet-v2 model used by AbuKhalil et al. (41) achieved 96.7% precision, 97.39% sensitivity, and 96.61% accuracy on 918 images. Bhatt et al. (35) utilized a ConvNet with transfer learning and progressive resizing with K-nearest neighbour and EfficientNet-B3 on 917 and 966 images, respectively, achieving 78.14% and 99.01% accuracy (see Table 3 for details).

Table 3 The table provides an overview of the performance and diversity of deep learning models used in different studies for gynecologic cancer screening and diagnosis.

In this review, ResNet, a commonly used CNN architecture, has demonstrated effectiveness in gynecologic cancer tasks such as cervical cancer detection, endometrial cancer diagnosis, and ovarian cancer classification. ResNets utilize skip connections that enable the network to learn identity mappings, mitigating the degradation and vanishing-gradient problems of very deep networks. ResNets consist of residual blocks with convolutional layers, batch normalization, and activation functions. Skip connections can be implemented using identity or projection shortcuts, which affects network performance (66) (see Supplementary File 3 for detail).
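To make the residual-block structure concrete, the following is a minimal PyTorch sketch of a basic residual block with an identity or projection shortcut; it illustrates the general idea rather than reproducing the exact architecture of any included study.

```python
# Minimal sketch of a basic ResNet residual block (conv-BN-ReLU twice plus a
# skip connection); illustrative only, not any study's exact architecture.
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # Identity shortcut when shapes match, 1x1 projection shortcut otherwise.
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))  # skip connection
```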

3.8 Pooled performance of DL algorithms

In this section, the researchers analyzed 48 studies that applied deep learning (DL) algorithms for diagnosing gynecologic cancer. However, only 24 of these studies had sufficient data to calculate the diagnostic accuracy using contingency tables. The authors used hierarchical summary receiver operating characteristic (SROC) curves to summarize the overall performance of the DL algorithms across the studies. The sensitivity and specificity were plotted for each study, with sensitivity measuring the proportion of true positives and specificity measuring the proportion of true negatives. The area under the curve (AUC) was used as a measure of the overall accuracy of the algorithm. The results showed that the pooled sensitivity and specificity for all DL algorithms were 89.40% (95% CI, 86.19–92.62%) and 87.6% (95% CI, 82.6–92.46%), respectively, with an AUC of 0.88 (95% CI, 0.84–0.93). Some studies used more than one DL algorithm and reported the best accuracy among them. The authors also summarized the performance of the best DL algorithms across the studies, with pooled sensitivity and specificity of 68.1% (95% CI, 57.2–80.9) and 94.1% (95% CI, 89.6–96.7), respectively, and an AUC of 0.81 (95% CI, 0.90–0.94). These results demonstrate that DL algorithms have a high diagnostic accuracy for gynecologic cancer (See Figures 4, 5 for details).

Figure 4 Summary estimate of pooled specificity of 24 studies using forest plot.

Figure 5 Summary estimate of pooled sensitivity of 24 studies using forest plot.

3.9 Subgroup meta-analyses

In this analysis, 24 studies were used to compare the performance of deep learning (DL) algorithms and machine learning (ML) methods in diagnosing gynecologic cancer. A random-effects model was used to calculate the pooled sensitivity and specificity for each algorithm type. DL algorithms had higher sensitivity (80% [95% CI, 73.1–89.7%]) but lower specificity (91.9% [95% CI, 88.6–94.4%]) than ML methods (sensitivity: 34.6% [95% CI, 18.2–65.8%]; specificity: 97.6% [95% CI, 86.0–99.6%]). The area under the curve (AUC), a measure of the overall accuracy of the algorithm, was also higher for DL algorithms (0.86 [0.81–0.92]) than for ML methods (0.66 [0.52–0.83]). These findings suggest that DL algorithms perform better than ML methods in detecting gynecologic cancer (see Figure 6 for details).

Figure 6 Pooled performance of DL algorithms versus ML using the same sample.

3.10 Heterogeneity analysis

The analysis of heterogeneity among studies comparing deep learning (DL) algorithms and traditional machine learning (ML) approaches for gynecologic cancer detection revealed a high degree of variability (I² = 98.1%, p < 0.0001). Using the inverse variance method and the DerSimonian-Laird estimator, the pooled sensitivity (SE) and specificity (SP) for the two types of algorithms were calculated. DL algorithms had a significantly higher SE (98.9% [98.7%; 99.1%]) than ML methods (34.6% [95% CI, 18.2–65.8%]), whereas ML methods had a higher SP (97.6% [95% CI, 86.0–99.6%]) than DL algorithms (91.9% [88.6–94.4%]). The area under the curve (AUC) was also higher for DL algorithms (0.86 [0.81–0.92]) than for ML methods (0.66 [0.52–0.83]), indicating that DL algorithms are more accurate at detecting gynecologic cancer. These findings highlight the potential of DL algorithms for improving the accuracy of gynecologic cancer diagnosis, but also call for further investigation to address the heterogeneity observed among studies (see Figure 5 for detail).

The pooled odds ratio (OR) from the random-effects model was 56.2459 [95% CI, 28.3682; 111.5195], with a p-value of less than 0.0001, indicating that DL algorithms were approximately 56 times more likely than ML methods to classify patients with gynecologic cancer correctly. These findings have important implications for the development of diagnostic tools and the treatment of patients with gynecologic cancer, suggesting that DL algorithms may offer a more effective and reliable approach to diagnosis and treatment (see Figure 7 for details).

Finally, the SROC plot of the advanced deep learning models showed high sensitivity and specificity for gynecologic cancer diagnosis across the various data types, with an AUC of 0.81, indicating good performance (see Figure 8 for details).
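To illustrate how such a pooled odds ratio is obtained, the sketch below computes per-study diagnostic odds ratios from hypothetical 2×2 counts and combines their logarithms with inverse-variance weights. For brevity it uses a fixed-effect combination; the random-effects model fitted in R additionally adds a between-study variance term, as in the earlier pooling sketch.

```python
# Simplified sketch of pooling a diagnostic odds ratio (DOR) across studies
# on the log scale with inverse-variance weights; counts are hypothetical
# and the review's actual model was fitted in R.
import numpy as np

# hypothetical per-study 2x2 counts: TP, FP, FN, TN
tp = np.array([90, 45, 120])
fp = np.array([8, 12, 15])
fn = np.array([10, 15, 20])
tn = np.array([92, 88, 150])

log_or = np.log((tp * tn) / (fp * fn))            # per-study log DOR
var = 1.0 / tp + 1.0 / fp + 1.0 / fn + 1.0 / tn   # Woolf variance of log DOR
w = 1.0 / var
pooled_log_or = np.sum(w * log_or) / np.sum(w)
se = np.sqrt(1.0 / np.sum(w))
lo, hi = pooled_log_or - 1.96 * se, pooled_log_or + 1.96 * se
print(f"pooled DOR = {np.exp(pooled_log_or):.1f} "
      f"[95% CI {np.exp(lo):.1f}; {np.exp(hi):.1f}]")
```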

Figure 7 Summary of pooled odds ratio of 24 studies using forest plot.

Figure 8 Summary receiver operating characteristic (SROC) plot of the advanced machine learning algorithms.

3.11 Quality assessment

We used QUADAS-2 to evaluate the quality of the studies and summarized the findings in Figure 9. Of the 48 studies, 41 had a low risk of bias and 7 had a high or unclear risk of bias. Four studies had a high or unclear risk of bias in the patient selection domain because they did not report their inclusion or exclusion criteria or excluded patients inappropriately. Two studies had a high or unclear risk of bias in the index test domain because they did not have a predefined threshold (see Figure 9).

Figure 9 Risk of bias and concern of applicability for each item in included studies.

4 Discussion

The studies included in this review showed great diversity in terms of imaging modalities, publication years, types of cancer, number of images, and deep learning applications. The studies were mostly recent, with almost half published in 2022. Cervical cancer was the most common type of cancer studied, followed by ovarian cancer and endometrial cancer. Combined gynecologic cancer and vulvar and vaginal cancers were the least studied. The number of images used in the studies varied widely, from a few dozen to tens of thousands. The majority of the studies focused on automatic image classification using deep learning models, such as convolutional neural networks, recurrent neural networks, or attention mechanisms. Only a few studies focused on quantification or segmentation of gynecologic abnormalities using deep learning techniques.

Cytology and colposcopy were the most common imaging modalities used, followed by MRI and CT scan. Hysteroscopy was the least common modality used. Cytology is a simple, inexpensive, and widely available method for screening and diagnosing cervical cancer. However, it has low sensitivity and specificity, especially for high-grade lesions and adenocarcinoma. It also requires adequate sampling and interpretation by trained personnel. Cytology alone is not sufficient for staging cervical cancer or detecting recurrence (67). Colposcopy is a visual examination of the cervix using a magnifying device. It can identify abnormal areas that may need biopsy or treatment. It can also assess the extent of cervical lesions and guide conization or excision procedures. However, colposcopy is operator-dependent and subjective. It may miss lesions in the endocervical canal or outside the transformation zone. It also has limited value for staging cervical cancer or detecting recurrence (67). MRI is regarded as the gold standard for local staging of most gynecologic malignancies (68). It has superb soft tissue contrast and resolution without exposing the patients to ionizing radiation. It can delineate tumor size, depth of invasion, parametrial involvement, lymph node status, and distant metastasis. Advances in functional MRI with diffusion-weighted and dynamic contrast-enhanced sequences provide more detailed information regarding tumor cellularity, vascularity, and viability (69). However, MRI is expensive, time-consuming, and not widely available. It may also have artifacts or false-positive findings due to inflammation, fibrosis, or post-treatment changes (68).

CT scan is a fast and widely available imaging modality that can evaluate the whole abdomen and pelvis in one examination. It can detect enlarged lymph nodes, ascites, peritoneal implants, liver metastasis, and other signs of advanced disease. However, CT scan has low sensitivity and specificity for local staging of gynecologic cancers. It also exposes the patients to ionizing radiation (69).

This review revealed that various pre-processing and feature extraction techniques were applied to gynecologic cancer image analysis in different studies. The most common pre-processing techniques were filtering, normalization, data augmentation, and histogram matching. Filtering removes noise and artifacts from the images, such as Gaussian noise, speckle noise, or motion blur, and can be performed with different methods, such as an average filter, median filter, adaptive median filter, or Gaussian filter (70, 71). Normalization adjusts the intensity values of the images to a common scale, such as 0–1 or 0–255, which reduces the effect of illumination variations and enhances the contrast of the images (70, 72). Data augmentation generates new images from existing ones by applying transformations such as rotation, flipping, scaling, cropping, or zooming, which increases the size and diversity of the dataset and reduces overfitting (70, 72). Histogram matching modifies the histogram of an image to match that of another image, improving the quality and consistency of the images and reducing the effect of scanner variations (72).
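A short sketch of how these generic pre-processing steps look in practice is given below, using SciPy and scikit-image for median filtering, intensity rescaling to [0, 1], and histogram matching against a reference image. It is a generic illustration, not a pipeline taken from any of the included studies.

```python
# Generic pre-processing sketch: median filtering (noise removal), intensity
# normalization, and histogram matching against a reference image.
# Illustrative only; not from any specific reviewed study.
import numpy as np
from scipy import ndimage
from skimage import exposure

def preprocess(image: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """image, reference: 2-D grayscale arrays."""
    denoised = ndimage.median_filter(image, size=3)             # remove impulse noise
    normalized = exposure.rescale_intensity(denoised.astype(float), out_range=(0.0, 1.0))
    ref_norm = exposure.rescale_intensity(reference.astype(float), out_range=(0.0, 1.0))
    matched = exposure.match_histograms(normalized, ref_norm)   # scanner harmonization
    return matched
```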

Principal component analysis (PCA), max pooling, t-distributed stochastic neighbor embedding (t-SNE), and test-time augmentation (TTA) were the most widely used feature extraction techniques in the literature. PCA reduces the dimensionality of the data by projecting it onto a lower-dimensional subspace that captures most of the variance (71). Max pooling reduces the size of the feature maps by applying a max operation over a sliding window, which helps extract the most salient features and makes them invariant to small translations (71). t-SNE reduces the dimensionality of the data by embedding it into a lower-dimensional space that preserves local similarities, which helps visualize and cluster high-dimensional data. TTA applies data augmentation at test time and averages the predictions from multiple augmented images, which improves the robustness and accuracy of the classification.
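Test-time augmentation, for example, can be implemented by averaging a model’s predictions over a few deterministic transforms of each test image. The following is a minimal PyTorch sketch assembled for illustration; the model is any trained classifier and the transforms are simple placeholders.

```python
# Minimal test-time augmentation (TTA) sketch: average softmax predictions
# over the original image and a few flipped/rotated variants. `model` is any
# trained classifier; the 90-degree rotation assumes square inputs.
import torch

@torch.no_grad()
def tta_predict(model, image):
    """image: tensor of shape (1, C, H, W); returns averaged class probabilities."""
    variants = [
        image,
        torch.flip(image, dims=[-1]),            # horizontal flip
        torch.flip(image, dims=[-2]),            # vertical flip
        torch.rot90(image, k=1, dims=[-2, -1]),  # 90-degree rotation
    ]
    probs = [torch.softmax(model(v), dim=1) for v in variants]
    return torch.stack(probs).mean(dim=0)
```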

Our results also showed that 3D VB-Net achieved the best performance among the DL models for gynecological cancer segmentation. The 3D VB-Net model had an average HD_95 of 5.48 mm, meaning that 95% of the distances between the predicted and ground-truth boundaries were below 5.48 mm; an average DSC of 81%, reflecting the overlap between the predicted and ground-truth regions; an average MSD of 1.63 mm, the average distance between the predicted and ground-truth surfaces; and an average JI of 75%, the ratio of the intersection to the union of the predicted and ground-truth regions. These metrics indicate that 3D VB-Net segmented the gynecological tumors accurately and consistently. The worst overall performance was obtained by ResNet18, with an average DSC of 82%. The other models performed similarly, with average HD_95 ranging from 10.03 to 11.2 mm, DSC from 80% to 85%, MSD from 1.15 to 2.58 mm, and JI from 75% to 77%, indicating that they segmented the gynecological tumors reasonably well, but not as well as 3D VB-Net.

Our findings are consistent with previous studies that reported superior performance of 3D VB-Net over other DL models for prostate cancer segmentation (64, 73, 74). The advantages of 3D VB-Net include its ability to capture volumetric information from MRI images, which is important for tumor detection and characterization; its use of a variational autoencoder for feature extraction, a generative model that learns a latent representation of the data and reconstructs it with minimal error; and its boundary loss, a loss function that penalizes deviation of the predicted boundaries from the ground-truth boundaries (64). On the other hand, ResNet18 performed poorly in our study, which might be due to its relatively shallow architecture and lack of spatial information (49). ResNet18 has only 18 layers, which may not be enough to learn complex features from MRI images, and it does not exploit spatial information, such as coordinates or distances, that might be useful for tumor localization and segmentation.

The best performance for gynecologic cancer classification was obtained by Bhatt et al. (35), who used EfficientNet-B3, a convolutional neural network that applies a compound scaling method to balance the depth, width, and resolution of the network. They trained their model on 966 images of patients with cervical cancer and evaluated it on 101 images of patients with benign or malignant lesions, achieving an accuracy of 99.01%, a precision of 99.15% (the proportion of lesions predicted as malignant that were actually malignant), a sensitivity of 98.89% (the proportion of malignant lesions detected), a specificity of 99.02% (the proportion of benign lesions correctly rejected), and an F1-score of 98.87%, the harmonic mean of precision and sensitivity.

Another high-performing model for gynecologic cancer classification was reported by Dai et al. (40), who used 3D-UNet, a deep convolutional neural network that can segment 3D volumes. They trained their model on 86 MRI images of patients with cervical or ovarian cancer and evaluated it on 43 MRI images of patients with benign or malignant tumors, achieving an accuracy of 94.19%, a precision of 95.35%, a sensitivity of 93.02%, a specificity of 95.35%, an F1-score of 94.19%, and an AUC of 0.94, the area under the receiver operating characteristic curve, which measures how well the model distinguishes benign from malignant tumors.

Our findings are consistent with previous studies that reported the benefits of CNNs with transfer learning and progressive resizing for cervical cancer detection (75) and of 3D-UNet or its variants for ovarian cancer detection (52, 54, 76). The advantages of these models include their ability to learn high-level features from medical images, such as edges, shapes, textures, and patterns relevant for tumor identification and characterization; their ability to adapt to different domains and modalities, such as histopathology, ultrasound, or MRI, by transferring knowledge learned from one domain or modality to another; their ability to handle large-scale and imbalanced datasets, for example those with more benign than malignant tumors or vice versa, by using data augmentation techniques or class weighting schemes; and their ability to improve segmentation accuracy by using 3D information from MRI images, such as depth and volume, and by using loss functions that emphasize boundary accuracy, such as Dice loss or focal loss.
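As a sketch of the transfer-learning setup these studies describe, namely an ImageNet-pretrained CNN backbone with a new classification head fine-tuned on task images, the code below uses a torchvision ResNet-50 (torchvision 0.13 or later); it is a generic example, not the exact models or hyperparameters of Bhatt et al. or Dai et al. Progressive resizing, as used by Bhatt et al., additionally retrains such a network at successively larger input resolutions.

```python
# Generic transfer-learning sketch: reuse an ImageNet-pretrained backbone and
# replace its classification head for a binary benign/malignant task.
# Illustrative only; not the exact setup of any reviewed study.
import torch.nn as nn
from torchvision import models

def build_classifier(num_classes: int = 2, freeze_backbone: bool = True) -> nn.Module:
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    if freeze_backbone:
        for param in model.parameters():
            param.requires_grad = False        # fine-tune only the new head at first
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # new classification head
    return model
```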

Other studies have also used deep learning models and techniques for gynecologic cancer diagnosis using different types of images. For example, Gao et al. (2021) used a deep convolutional neural network (DCNN) to diagnose ovarian cancer using multimodal medical images (FDG-PET/CT) with an accuracy of 0.94 and an AUC of 0.98 (77). Ho et al. (2022) used deep interactive learning to diagnose BRCA mutation status in ovarian cancer using H&E-stained whole slide images with an accuracy of 0.86 and an AUC of 0.91 (78). Li et al. (79) used a self-adapting ensemble method to diagnose gynecological brachytherapy on CT images with an accuracy of 0.88 and an AUC of 0.93 (80).

A meta-analysis examined the SE and SP of DL algorithms and traditional ML approaches for diagnosing COVID-19 from chest X-rays by using studies that reported these measures (81). The inverse variance method and the DerSimonian-Laird estimator were utilized to synthesize the results and evaluate the heterogeneity. The studies exhibited very high heterogeneity (I2 = 98.1% [97.7%; 98.4%], p < 0.0001). The DL algorithms demonstrated significantly higher SE (98.9% [98.7%; 99.1%]) and SP (97.5% [96.9%; 97.9%]) than the traditional ML approaches (p < 0.0001), indicating better diagnostic performance.

The random-effects model indicated that deep learning algorithms had a much higher odds ratio (OR) of 56.2459 [95% CI, 28.3682; 111.5195] (p < 0.0001) for gynecologic cancer detection than machine learning algorithms, meaning that they were about 56 times more likely to make a correct diagnosis. This is a larger OR than the one found for COVID-19 detection from chest X-rays, which was 9.8 [95% CI, 6.1; 15.7] (p < 0.0001), meaning that deep learning algorithms were about 10 times more likely than machine learning algorithms to make a correct diagnosis (82). This suggests that deep learning algorithms had a bigger edge over machine learning algorithms for gynecologic cancer detection than for COVID-19 detection.

The diagnostic accuracy of advanced deep learning models for gynecologic cancer was assessed by pooling the sensitivity, specificity, and SROC curve from different studies. The AUC of the SROC curve was 0.81, which indicates a moderate level of accuracy. This is lower than the AUC of 0.86 reported for deep learning models for COVID-19 detection from chest X-rays (83), suggesting that gynecologic cancer diagnosis is more challenging and requires further improvement of the algorithms.

5 Conclusion

This review demonstrates that deep learning techniques have been widely utilized in various imaging modalities for detecting, segmenting, and diagnosing gynecologic cancers. Medical image analysis, including lesion segmentation, classification, detection, and quantification, has exhibited tremendous potential with the use of deep learning. The studies reviewed in this paper employed imaging modalities such as cytology, colposcopy, MRI, CT scan, and hysteroscopy. Cytology is utilized for cervical cancer screening and diagnosis by examining cells or tissue fragments under a microscope, while colposcopy inspects the cervix with a magnifying device and is typically done after an abnormal cytology result. MRI and CT scans are non-invasive techniques used to visualize the structure and function of gynecologic organs, while hysteroscopy views the inside of the uterus with a thin camera. These imaging modalities aid in the diagnosis of cervical cancer and other gynecologic cancers, such as endometrial cancer and ovarian cancer. However, they also have drawbacks, including subjectivity, error, low resolution, noise, artifact, and variability. Deep learning can assist cytologists, colposcopists, radiologists, and gynecologists in overcoming these challenges by improving the accuracy, efficiency, objectivity, quality, segmentation, and interpretation of these images.

Normalization, rotation, cropping, and filtering were the most commonly used pre-processing techniques. Max pooling, principal component analysis, and progressive resizing were the most frequently utilized feature extraction techniques. These methods can help achieve higher accuracy, efficiency, and objectivity in diagnosing and prognosing gynecologic cancers using various types of images. However, there is no universal or optimal set of techniques for different imaging modalities, settings, and objectives. As a result, it is critical to choose and customize these methods according to the specific needs and challenges of each application.

The review indicates that neural network architectures such as 3D-UNet, 3D VB-Net, ResNet18, 2D-RefineNet, CE-Net, or fully convolutional neural networks have been frequently utilized for image segmentation and classification, achieving high performance.

6 Implication and limitations of the study

Our review demonstrates that CNNs and their variants are effective in detecting gynecologic cancer from medical images, which can aid diagnosis and treatment. However, our study has some limitations. First, we did not investigate other highly prevalent cancers, such as breast and lung cancer, and the review covered only a few imaging modalities, which limits the generalizability of the findings. Second, because the review focused on comparing the performance of different machine learning and deep learning models, we could not assess how these models perform relative to human experts. The summary ROC curve of the review indicates that DL algorithms surpass traditional ML approaches in diagnosing gynecologic cancer from medical images. This implies that DL algorithms can improve the early detection and treatment of these diseases, particularly in resource-limited environments where imaging is more feasible than other modalities. However, the diagnostic accuracy of DL algorithms for gynecologic cancer is moderate and requires further improvement, suggesting that there are still challenges and limitations in implementing DL algorithms for this complex and diverse disease and that additional research is necessary to optimize their performance and generalizability. Therefore, future research should concentrate on developing more robust, dependable, transparent, and ethical deep learning models and techniques for diagnosing gynecologic cancer.

Author contributions

AT conceived the study, designed the search strategy, performed the literature review, data extraction, quality assessment, and meta-analysis. He also drafted the manuscript and approved the final version. BT conceived the study, designed the search strategy, performed the literature review, data extraction, quality assessment, and meta-analysis. He also revised the manuscript and approved the final version. TA contributed to the data extraction, quality assessment, and meta-analysis. He also revised the manuscript and approved the final version. AA contributed to the data extraction, quality assessment, and meta-analysis. He also revised the manuscript and approved the final version. AM revised the manuscript critically for important intellectual content. He also approved the final version. SAM revised the manuscript critically for important intellectual content. He also approved the final version. All authors contributed to the article and approved the submitted version.

Funding

The author(s) declare that no financial support was received for the research, authorship, and/or publication of this article.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fonc.2023.1216326/full#supplementary-material

References

1. Arnold M, Moore SP, Hassler S, Ellison-Loschmann L, Forman D, Bray F. The burden of stomach cancer in indigenous populations: a systematic review and global assessment. Gut (2014) 63(1):64–71. doi: 10.1136/gutjnl-2013-305033

2. Ferlay J, Soerjomataram I, Dikshit R, Eser S, Mathers C, Rebelo M, et al. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. Int J cancer (2015) 136(5):E359–E86. doi: 10.1002/ijc.29210

3. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: Cancer J Clin (2021) 71(3):209–49. doi: 10.3322/caac.21660

4. Chen WQ, Li H, Sun KX, Zheng RS, Zhang SW, Zeng HM, et al. [Report of cancer incidence and mortality in China 2014]. Zhonghua zhong liu za zhi [Chinese J oncology] (2018) 40(1):5–13. doi: 10.3760/cma.j.issn.0253-3766.2018.01.002

5. Guo M, Xu J, Du J. Trends in cervical cancer mortality in China from 1989 to 2018: an age-period-cohort study and Joinpoint analysis. BMC Public Health (2021) 21(1):1329. doi: 10.1186/s12889-021-11401-8

6. Wardle J, Robb K, Vernon S, Waller J. Screening for prevention and early diagnosis of cancer. Am Psychol (2015) 70(2):119. doi: 10.1037/a0037357

7. Haralick RM, Shapiro LG. Image segmentation techniques. Comput Vision Graphics Image Processing (1985) 29(1):100–32. doi: 10.1016/S0734-189X(85)90153-7

8. Atun R, Jaffray DA, Barton MB, Bray F, Baumann M, Vikram B, et al. Expanding global access to radiotherapy. Lancet Oncol (2015) 16(10):1153–86. doi: 10.1016/S1470-2045(15)00222-3

9. Brisson M, Kim JJ, Canfell K, Drolet M, Gingras G, Burger EA, et al. Impact of HPV vaccination and cervical screening on cervical cancer elimination: a comparative modelling analysis in 78 low-income and lower-middle-income countries. Lancet (2020) 395(10224):575–90. doi: 10.1016/S0140-6736(20)30068-4

10. Sharma N, Aggarwal LM. Automated medical image segmentation techniques. J Med Physics (2010) 35(1):3. doi: 10.4103/0971-6203.58777

11. Aerts HJ, Velazquez ER, Leijenaar RT, Parmar C, Grossmann P, Carvalho S, et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun (2014) 5(1):4006. doi: 10.1038/ncomms5006

12. Vallieres M, Kay-Rivest E, Perrin LJ, Liem X, Furstoss C, Aerts HJ, et al. Radiomics strategies for risk assessment of tumour failure in head-and-neck cancer. Sci Rep (2017) 7(1):10117. doi: 10.1038/s41598-017-10371-5

13. Lambin P, Rios-Velazquez E, Leijenaar R, Carvalho S, Van Stiphout RG, Granton P, et al. Radiomics: extracting more information from medical images using advanced feature analysis. Eur J Cancer (2012) 48(4):441–6. doi: 10.1016/j.ejca.2011.11.036

14. Coroller TP, Grossmann P, Hou Y, Velazquez ER, Leijenaar RT, Hermann G, et al. CT-based radiomic signature predicts distant metastasis in lung adenocarcinoma. Radiother Oncol (2015) 114(3):345–50. doi: 10.1016/j.radonc.2015.02.015

15. Guo W, Li H, Zhu Y, Lan L, Yang S, Drukker K, et al. Prediction of clinical phenotypes in invasive breast carcinomas from the integration of radiomics and genomics data. J Med Imaging (2015) 2(4):041007–. doi: 10.1117/1.JMI.2.4.041007

16. Wang J, Kato F, Oyama-Manabe N, Li R, Cui Y, Tha KK, et al. Identifying triple-negative breast cancer using background parenchymal enhancement heterogeneity on dynamic contrast-enhanced MRI: a pilot radiomics study. PloS One (2015) 10(11):e0143308. doi: 10.1371/journal.pone.0143308

17. Zhou K, Greenspan H, Shen D. Deep learning for medical image analysis. London, UK: Academic Press. (2017).

18. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature (2015) 521(7553):436–44. doi: 10.1038/nature14539

19. Miotto R, Wang F, Wang S, Jiang X, Dudley JT. Deep learning for healthcare: review, opportunities and challenges. Briefings Bioinf (2018) 19(6):1236–46. doi: 10.1093/bib/bbx044

20. Fiste O, Liontos M, Zagouri F, Stamatakos G, Dimopoulos MA. Machine learning applications in gynecological cancer: A critical review. Crit Rev Oncol/Hematol (2022) 179:103808. doi: 10.1016/j.critrevonc.2022.103808

21. Cozzi L, Dinapoli N, Fogliata A, Hsu W-C, Reggiori G, Lobefalo F, et al. Radiomics based analysis to predict local control and survival in hepatocellular carcinoma patients treated with volumetric modulated arc therapy. BMC cancer (2017) 17:1–10. doi: 10.1186/s12885-017-3847-7

22. Perrin T, Midya A, Yamashita R, Chakraborty J, Saidon T, Jarnagin WR, et al. Short-term reproducibility of radiomic features in liver parenchyma and liver Malignancies on contrast-enhanced CT imaging. Abdominal Radiol (2018) 43:3271–8. doi: 10.1007/s00261-018-1600-6

23. Leseur J, Roman-Jimenez G, Devillers A, Ospina-Arango JD, Williaume D, Castelli J, et al. Pre-and per-treatment 18F-FDG PET/CT parameters to predict recurrence and survival in cervical cancer. Radiother Oncol (2016) 120(3):512–8. doi: 10.1016/j.radonc.2016.08.008

24. Reuzé S, Orlhac F, Chargari C, Nioche C, Limkin E, Riet F, et al. Prediction of cervical cancer recurrence using textural features extracted from 18F-FDG PET images acquired with different scanners. Oncotarget (2017) 8(26):43169. doi: 10.18632/oncotarget.17856

25. Gnep K, Fargeas A, Gutiérrez-Carvajal RE, Commandeur F, Mathieu R, Ospina JD, et al. Haralick textural features on T2-weighted MRI are associated with biochemical recurrence following radiotherapy for peripheral zone prostate cancer. J Magnetic Resonance Imaging (2017) 45(1):103–17. doi: 10.1002/jmri.25335

26. Shiradkar R, Ghose S, Jambor I, Taimen P, Ettala O, Purysko AS, et al. Radiomic features from pretreatment biparametric MRI predict prostate cancer biochemical recurrence: preliminary findings. J Magnetic Resonance Imaging (2018) 48(6):1626–36. doi: 10.1002/jmri.26178

27. Vallières M, Freeman CR, Skamene SR, El Naqa I. A radiomics model from joint FDG-PET and MRI texture features for the prediction of lung metastases in soft-tissue sarcomas of the extremities. Phys Med Biol (2015) 60(14):5471. doi: 10.1088/0031-9155/60/14/547

28. Vallières M, Laberge S, Diamant A, El Naqa I. Enhancement of multimodality texture-based prediction models via optimization of PET and MR image acquisition protocols: a proof of concept. Phys Med Biol (2017) 62(22):8536. doi: 10.1088/1361-6560/aa8a49

29. Coroller TP, Bi WL, Huynh E, Abedalthagafi M, Aizer AA, Greenwald NF, et al. Radiographic prediction of meningioma grade by semantic and radiomic features. PloS One (2017) 12(11):e0187908. doi: 10.1371/journal.pone.0187908

30. Lao J, Chen Y, Li Z-C, Li Q, Zhang J, Liu J, et al. A deep learning-based radiomics model for prediction of survival in glioblastoma multiforme. Sci Rep (2017) 7(1):10353. doi: 10.1038/s41598-017-10649-8

31. Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. Syst Rev (2021) 10(1):89. doi: 10.1186/s13643-021-01626-4

32. Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Internal Med (2011) 155(8):529–36. doi: 10.7326/0003-4819-155-8-201110180-00009

33. Higgins JP, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ (2003) 327(7414):557–60. doi: 10.1136/bmj.327.7414.557

34. DerSimonian R, Laird N. Meta-analysis in clinical trials. Controlled Clin Trials (1986) 7(3):177–88. doi: 10.1016/0197-2456(86)90046-2

35. Bhatt AR, Ganatra A, Kotecha KJPCS.. Cervical cancer detection in pap smears whole slide images using convnet with transfer learning and progressive resizing. (2021) 7:.

36. Chandran V, Sumithra M, Karthick A, George T, Deivakani M, Elakkiya B, et al. Diagnosis of cervical cancer based on ensemble deep learning network using colposcopy images. (2021) 2021:.

37. Cheng S, Liu S, Yu J, Rao G, Xiao Y, Han W, et al. Robust whole slide image analysis for cervical cancer screening using deep learning. (2021) 12(1):5639.

38. Chen X, Wang Y, Shen M, Yang B, Zhou Q, Yi Y, et al. Deep learning for the determination of myometrial invasion depth and automatic lesion identification in endometrial cancer MR imaging: a preliminary study in a single institution. (2020) 30:4985–94.

39. Cho B-J, Choi YJ, Lee M-J, Kim JH, Son G-H, Park S-H, et al. Classification of cervical neoplasms on colposcopic photography using deep learning. (2020) 10(1):13652.

40. Dai M, Liu Y, Hu Y, Li G, Zhang J, Xiao Z, et al. Combining multiparametric MRI features-based transfer learning and clinical parameters: application of machine learning for the differentiation of uterine sarcomas from atypical leiomyomas. (2022) 32(11):7988–97.

41. AbuKhalil T, Alqaralleh BA, Al-Omari AHJCMC.. Optimal deep learning based inception model for cervical cancer diagnosis. (2022) 72:57–71.

42. Alquran H, Mustafa WA, Qasmieh IA, Yacob YM, Alsalatie M, Al-Issa Y, et al. Cervical cancer classification using combined machine learning and deep learning approach. Computers Materials Continua. (2022) 72(3):5117–34.

43. Best MG, Wesseling P, Wurdinger T. Tumor-educated platelets as a noninvasive biomarker source for cancer detection and progression monitoring. (2018) 78(13):3407–12.

44. Cheng WF, Chen CA, Wang S, Liu Z, Rong Y, Zhou B, et al. Deep learning provides a new computed tomography-based prognostic biomarker for recurrence prediction in high-grade serous ovarian cancer. Cancers (Basel). (2019) 132:171–7.

45. Habtemariam LW, Zewde ET, Simegn GLJMDE, Research. Cervix type and cervical cancer classification system using deep learning techniques. (2022), 163–76.

46. Kudva V, Prasad K, Guruvare S. Hybrid transfer learning for classification of uterine cervix images for cervical cancer screening. (2020) 33:619–31.

47. Ding Y, Chen Z, Wang Z, Wang X, Hu D, Ma P, et al. Three-dimensional deep neural network for automatic delineation of cervical cancer in planning computed tomography images. (2022) 23(4):.

48. Ma CY, Zhou JY, Xu XT, Guo J, Han MF, Gao YZ, et al. Deep learning-based auto-segmentation of clinical target volumes for radiotherapy treatment of cervical cancer. (2022) 23(2):.

49. Lin Y, Wang Y, Li X, Zhang Y, Wang Y, Zhang J, et al. Prostate segmentation in MRI using ResNet18 with multi-scale inputs and outputs. In: 2020 IEEE International Conference on Image Processing (ICIP); October 25–28, 2020; Abu Dhabi, United Arab Emirates. IEEE (2020). pp. 3369–73. doi: 10.1109/ICIP40778.2020.9191238

50. Williams MS, Kenu E, Dzubey I, Dennis-Antwi JA, Fontaine K. A qualitative study of cervical cancer and cervical cancer screening awareness among nurses in Ghana. Eur Radiology (2018) 39(5):584–94.

51. Xiao C, Jin J, Yi J, Han C, Zhou Y, Ai Y, et al. RefineNet-based 2D and 3D automatic segmentations for clinical target volume and organs at risks for patients with cervical cancer in postoperative radiotherapy. (2022) 23(7):.

52. Zaffino P, Pernelle G, Mastmeyer A, Spadea MF, Sharp GC, Knopf AC, et al. 3D VB-Net: Volumetric boundary-aware network for ovarian cancer segmentation in MRI. Med Phys (2020) 47(8):3749–61. doi: 10.1002/mp.14237

53. Cho B-J, Kim J-W, Park J, Kwon G-Y, Hong M, Jang S-H, et al. Automated diagnosis of cervical intraepithelial neoplasia in histology images via deep learning. Sci Rep (2022) 12(2):548.

54. Hou X, Wang Y, Li X, Zhang J, Wang Y, Li Y, et al. Ovarian cancer detection in MRI using EfficientNetB0 with multi-scale inputs and outputs. Biomed Signal Process Control (2022) 70:103251. doi: 10.1016/j.bspc.2021.103251

55. Karasu Benyes Y, Welch EC, Singhal A, Ou J, Tripathi A. A comparative analysis of deep learning models for automated cross-preparation diagnosis of multi-cell liquid pap smear images. Diagnostics (2022) 12(8):1838.

56. Saini SK, Bansal V, Kaur R, Juneja M. ColpoNet for automated cervical cancer screening using colposcopy images. Mach Vis Appl (2020) 31:1–15.

57. Nambu Y, Mariya T, Shinkai S, Umemoto M, Asanuma H, Sato I, et al. A screening assistance system for cervical cytology of squamous cell atypia based on a two-step combined CNN algorithm with label smoothing. Cancer Med (2022) 11(2):520–9.

58. Park YR, Kim YJ, Ju W, Nam K, Kim S, Kim KG. Comparison of machine and deep learning for the classification of cervical cancer based on cervicography images. Sci Rep (2021) 11(1):16143.

59. Sheikhzadeh F, Ward RK, van Niekerk D, Guillaud M. Automatic labeling of molecular biomarkers of immunohistochemistry images using fully convolutional networks. PLoS One (2018) 13(1).

60. Wang H, Jiang C, Bao K, Xu C. Recognition and clinical diagnosis of cervical cancer cells based on our improved lightweight deep network for pathological image. J Cell Mol Med (2019) 43:1–9.

61. Kudva V, Prasad K, Guruvare S. Transfer learning for classification of uterine cervix images for cervical cancer screening. In: Advances in Communication, Signal Processing, VLSI, and Embedded Systems: Select Proceedings of VSPICE 2019. Springer.

62. Sun H, Zeng X, Xu T, Peng G, Ma Y. Computer-aided diagnosis in histopathological images of the endometrium using a convolutional neural network and attention mechanisms. J Healthcare Engineering. (2020) 24(6):1664–76.

63. Urushibara A, Saida T, Mori K, Ishiguro T, Inoue K, Masumoto T, et al. The efficacy of deep learning models in the diagnosis of endometrial cancer using MRI: a comparison with radiologists. BMC Med Imaging (2022) 22(1):1–14.

64. Zhang T, Zhang X, Ke X, Liu C, Xu X, Zhan X, et al. HOG-ShipCLSNet: A novel deep learning network with hog feature fusion for SAR ship classification. IEEE Trans Geosci Remote Sensing (2021) 60:1–22. doi: 10.1109/TGRS.2021.3061417

65. Takahashi Y, Sone K, Noda K, Yoshida K, Toyohara Y, Kato K, et al. Automated system for diagnosing endometrial cancer by adopting deep-learning technology in hysteroscopy. PLoS One (2021) 16(3):.

66. Tong Y, Lu W, Deng Q-Q, Chen C, Shen Y. Automated identification of retinopathy of prematurity by image-based deep learning. Eye Vision (2020) 7:40. doi: 10.1186/s40662-020-00206-2

67. Daoud T, Sardana S, Stanietzky N, Klekers AR, Bhosale P, Morani AC. Recent imaging updates and advances in gynecologic Malignancies. Cancers (2022) 14(22):5528. doi: 10.3390/cancers14225528

68. Khan SR, Arshad M, Wallitt K, Stewart V, Bharwani N, Barwick TD. What’s new in imaging for gynecologic cancer? Curr Oncol Rep (2017) 19(12):85. doi: 10.1007/s11912-017-0640-3

69. Bragg DG, Hricak H. Imaging in gynecologic Malignancies. Cancer (1993) 71(S4):1648–51. doi: 10.1002/cncr.2820710431

70. Yadav BK, Panse M. Different image pre-processing and feature extraction techniques for breast cancer detection in labview. Int J Comput Appl (2016) 147(10):1–6. doi: 10.5120/ijca2016911752

71. Torres-García AA, Mendoza-Montoya O, Molinas M, Antelis JM, Moctezuma LA, Hernández-Del-Toro T. Chapter 4 - Pre-processing and feature extraction. In: Torres-García AA, Reyes-García CA, Villaseñor-Pineda L, Mendoza-Montoya O, editors. Biosignal processing and classification using computational learning and intelligence. Cambridge, MA, USA: Academic Press. (2022). p. 59–91.

72. Kalbhor M, Shinde S, Popescu DE, Hemanth DJ. Hybridization of deep learning pre-trained models with machine learning classifiers and fuzzy min–max neural network for cervical cancer diagnosis. Diagnostics (2023) 13(7):1363. doi: 10.3390/diagnostics13071363

73. Woo J, Xing F, Prince JL, Stone M, Gomez AD, Reese TG, et al. A deep joint sparse non-negative matrix factorization framework for identifying the common and subject-specific functional units of tongue motion during speech. Med image Anal (2021) 72:102131. doi: 10.1016/j.media.2021.102131

74. Pastor-Serrano O, Lathouwers D, Perkó Z. A semi-supervised autoencoder framework for joint generation and classification of breathing. Comput Methods Programs Biomed (2021) 209:106312. doi: 10.1016/j.cmpb.2021.106312

75. Guruvare S, Bhatt A, Chaudhary A, Kothari A, Rangarajan A, Kothari R, et al. Cervical cancer detection using convolutional neural networks with transfer learning and progressive resizing. J Med Imaging Health Inf (2021) 11(4):1129–36. doi: 10.1166/jmihi.2021.3459

76. Cao X, Chen H, Li Y, Peng Y, Wang S, Cheng L. Dilated densely connected U-Net with uncertainty focus loss for 3D ABUS mass segmentation. Comput Methods Programs Biomed (2021) 209:106313. doi: 10.1016/j.cmpb.2021.106313

77. Tufail AB, Ma YK. Deep learning in cancer diagnosis and prognosis prediction: a minireview on challenges, recent trends and future directions. Comput Math Methods Med (2021) 2021:9025470. doi: 10.1155/2021/9025470

78. Gao Y, Zeng S, Xu X, Li H, Yao S, Song K, et al. Deep learning-enabled pelvic ultrasound images for accurate diagnosis of ovarian cancer in China: a retrospective, multicentre, diagnostic study. Lancet Digital Health (2022) 4(3):e179–e87. doi: 10.1016/S2589-7500(21)00278-8

79. Li Z, Zhu Q, Zhang L, Yang X, Li Z, Fu J. A deep learning-based self-adapting ensemble method for segmentation in gynecological brachytherapy. Radiat Oncol (2022) 17(1):152.

80. Akazawa M, Hashimoto K. Artificial intelligence in gynecologic cancers: Current status and future challenges - A systematic review. Artif Intell Med (2021) 120:102164. doi: 10.1016/j.artmed.2021.102164

81. Wang S, Zha Y, Li W, Wu Q, Li X, Niu M, et al. A fully automatic deep learning system for COVID-19 diagnostic and prognostic analysis. Eur Respir J (2020) 56(2):2000775. doi: 10.1183/13993003.00775-2020

82. Passalis N, Tefas A, Kanniainen J, Gabbouj M, Iosifidis A. Temporal logistic neural bag-of-features for financial time series forecasting leveraging limit order book data. Pattern Recognit Lett (2020) 136:183–9. doi: 10.1016/j.patrec.2020.06.006

83. Bae S, An C, Ahn SS, Kim H, Han K, Kim SW, et al. Robust performance of deep learning for distinguishing glioblastoma from single brain metastasis using radiomic features: model development and validation. Sci Rep (2020) 10(1):1–10. doi: 10.1038/s41598-020-68980-6

Keywords: medical image analysis, AI, deep learning, gynaecological cancer diagnosis, systematic review and meta-analysis

Citation: Taddese AA, Tilahun BC, Awoke T, Atnafu A, Mamuye A and Mengiste SA (2024) Deep-learning models for image-based gynecological cancer diagnosis: a systematic review and meta-analysis. Front. Oncol. 13:1216326. doi: 10.3389/fonc.2023.1216326

Received: 03 May 2023; Accepted: 13 November 2023;
Published: 11 January 2024.

Edited by:

Jakub Nalepa, Silesian University of Technology, Poland

Reviewed by:

Shigao Huang, Fourth Military Medical University, China
Shobana G, SRM Institute of Science and Technology, India

Copyright © 2024 Taddese, Tilahun, Awoke, Atnafu, Mamuye and Mengiste. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Asefa Adimasu Taddese, adimasuasefa@gmail.com

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.