To explore the feasibility of using a deep learning three-dimensional (3D) V-Net convolutional neural network to build models that recognize and segment the auditory ossicles on high-resolution computed tomography (HRCT) images.
Temporal bone HRCT images of 158 patients were collected retrospectively, and the malleus, incus, and stapes were manually segmented. The 3D V-Net and U-Net convolutional neural networks were selected as the deep learning methods for segmenting the auditory ossicles. The temporal bone images were randomly divided into a training set (126 cases), a test set (16 cases), and a validation set (16 cases). The segmentation results of each model were compared, with manual segmentation serving as the control.
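The random division of the 158 cases into 126, 16, and 16 can be illustrated with a minimal sketch. The function name `split_cases`, the integer case identifiers, and the fixed seed are assumptions made for this illustration; the source does not describe how the randomization was implemented.

```python
import random

def split_cases(case_ids, n_train=126, n_test=16, n_val=16, seed=0):
    """Randomly partition case IDs into training, test, and validation sets."""
    case_ids = list(case_ids)
    if n_train + n_test + n_val != len(case_ids):
        raise ValueError("split sizes must sum to the number of cases")
    # A seeded generator keeps the split reproducible (the seed is an assumption).
    random.Random(seed).shuffle(case_ids)
    train = case_ids[:n_train]
    test = case_ids[n_train:n_train + n_test]
    val = case_ids[n_train + n_test:]
    return train, test, val

# Hypothetical identifiers standing in for the 158 patients.
train, test, val = split_cases(range(158))
print(len(train), len(test), len(val))  # 126 16 16
```

Splitting by case rather than by image slice keeps all images from one patient in a single subset, which avoids leakage between training and evaluation data.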
For the 3D V-Net convolutional neural network, the Dice similarity coefficients (DSCs) between automatic and manual segmentation of the malleus, incus, and stapes on the HRCT images were 0.920 ± 0.014, 0.925 ± 0.014, and 0.835 ± 0.035, respectively. The average surface distances (ASDs) were 0.257 ± 0.054, 0.236 ± 0.047, and 0.258 ± 0.077, respectively, and the 95% Hausdorff distances (HD95) were 1.016 ± 0.080, 1.000 ± 0.000, and 1.027 ± 0.102, respectively. For the 3D U-Net convolutional neural network, the corresponding DSCs were 0.876 ± 0.025, 0.889 ± 0.023, and 0.758 ± 0.044, respectively. The ASDs were 0.439 ± 0.208, 0.361 ± 0.077, and 0.433 ± 0.108, respectively, and the HD95 values were 1.361 ± 0.872, 1.174 ± 0.350, and 1.455 ± 0.618, respectively. The difference between the two groups was statistically significant (
The 3D V-Net convolutional neural network automatically recognized and segmented the auditory ossicles with accuracy similar to that of manual segmentation.
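The Dice similarity coefficient reported in the results measures voxel overlap between an automatic mask and a manual mask, computed as twice the intersection divided by the sum of the two mask sizes. The sketch below assumes binary 3D masks stored as NumPy arrays; the function name `dice_coefficient` and the toy masks are illustrative and not taken from the study. The surface-based metrics (ASD and HD95) require surface extraction and voxel spacing and are not shown here.

```python
import numpy as np

def dice_coefficient(pred, ref):
    """Dice similarity coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    ref = np.asarray(ref, dtype=bool)
    if pred.shape != ref.shape:
        raise ValueError("masks must have the same shape")
    total = pred.sum() + ref.sum()
    if total == 0:
        # Convention assumed here: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * np.logical_and(pred, ref).sum() / total

# Toy 3D masks (hypothetical data, not from the study).
ref = np.zeros((4, 4, 4), dtype=bool)
ref[1:3, 1:3, 1:3] = True    # manual mask: 8 voxels
pred = np.zeros((4, 4, 4), dtype=bool)
pred[1:3, 1:3, 1:4] = True   # automatic mask: 12 voxels, 8 overlapping ref

print(dice_coefficient(pred, ref))  # 0.8
```

A DSC of 1 indicates identical masks and 0 indicates no overlap. Because the measure is overlap-based, an error of the same absolute size lowers the score more for a small structure than for a large one.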