Deep learning-based head and neck lymph node level (HN_LNL) autodelineation is of high relevance to radiotherapy research and clinical treatment planning but still underinvestigated in academic literature. In particular, there is no publicly available open-source solution for large-scale autosegmentation of HN_LNL in the research setting.
An expert-delineated cohort of 35 planning CTs was used for training of an nnU-net 3D-fullres/2D-ensemble model for autosegmentation of 20 different HN_LNL. A second cohort acquired at the same institution later in time served as the test set (n = 20). In a completely blinded evaluation, 3 clinical experts rated the quality of deep learning autosegmentations in a head-to-head comparison with expert-created contours. For a subgroup of 10 cases, intraobserver variability was compared to the average deep learning autosegmentation accuracy on the original and recontoured set of expert segmentations. A postprocessing step to adjust craniocaudal boundaries of level autosegmentations to the CT slice plane was introduced and the effect of autocontour consistency with CT slice plane orientation on geometric accuracy and expert rating was investigated.
Blinded expert ratings for deep learning segmentations and expert-created contours were not significantly different. Deep learning segmentations with slice plane adjustment were rated numerically higher (mean, 81.0 vs. 79.6, p = 0.185) and deep learning segmentations without slice plane adjustment were rated numerically lower (77.2 vs. 79.6, p = 0.167) than manually drawn contours. In a head-to-head comparison, deep learning segmentations with CT slice plane adjustment were rated significantly better than deep learning contours without slice plane adjustment (81.0 vs. 77.2, p = 0.004). Geometric accuracy of deep learning segmentations was not different from intraobserver variability (mean Dice per level, 0.76 vs. 0.77, p = 0.307). Clinical significance of contour consistency with CT slice plane orientation was not represented by geometric accuracy metrics (volumetric Dice, 0.78 vs. 0.78, p = 0.703).
We show that a nnU-net 3D-fullres/2D-ensemble model can be used for highly accurate autodelineation of HN_LNL using only a limited training dataset that is ideally suited for large-scale standardized autodelineation of HN_LNL in the research setting. Geometric accuracy metrics are only an imperfect surrogate for blinded expert rating.