AUTHOR=Ye Xianghua , Guo Dazhou , Tseng Chen-Kan , Ge Jia , Hung Tsung-Min , Pai Ping-Ching , Ren Yanping , Zheng Lu , Zhu Xinli , Peng Ling , Chen Ying , Chen Xiaohua , Chou Chen-Yu , Chen Danni , Yu Jiaze , Chen Yuzhen , Jiao Feiran , Xin Yi , Huang Lingyun , Xie Guotong , Xiao Jing , Lu Le , Yan Senxiang , Jin Dakai , Ho Tsung-Ying TITLE=Multi-Institutional Validation of Two-Streamed Deep Learning Method for Automated Delineation of Esophageal Gross Tumor Volume Using Planning CT and FDG-PET/CT JOURNAL=Frontiers in Oncology VOLUME=Volume 11 - 2021 YEAR=2022 URL=https://www.frontiersin.org/journals/oncology/articles/10.3389/fonc.2021.785788 DOI=10.3389/fonc.2021.785788 ISSN=2234-943X ABSTRACT=Background: The current clinical workflow for esophageal gross tumor volume (GTV) contouring relies on manual delineation, which is labor-intensive and subject to inter-user variability. Purpose: To validate the clinical applicability of a deep learning multi-modality esophageal GTV contouring model that was developed at one institution and tested at multiple institutions. Methods and Materials: We retrospectively collected 606 patients with esophageal cancer from four institutions. 252 patients from institution-1 had both a treatment planning CT (pCT) and a pair of diagnostic FDG-PET/CT scans; 354 patients from the other three institutions had only pCT scans, owing to different staging protocols or a lack of PET scanners. A two-streamed deep learning model for GTV segmentation was developed using the pCT and PET/CT scans of a subset (148 patients) from institution-1. The resulting model can segment GTVs from pCT alone, or from pCT and PET/CT combined when both are available. For independent evaluation, the remaining 104 patients from institution-1 served as an unseen internal test set, and the 354 patients from the other three institutions were used for external testing. Degrees of manual revision were further evaluated by human experts to assess the contour-editing effort.
Furthermore, the deep model’s performance was compared against four radiation oncologists in a multi-user study using 20 randomly chosen external patients. Contouring accuracy and time were recorded for delineation before and after deep learning assistance. Results: Our two-streamed deep model achieved high segmentation accuracy in internal testing (mean Dice score (DSC): 0.81 using pCT and 0.83 using pCT+PET) and generalized well to external evaluation (mean DSC: 0.80 using pCT). Experts’ assessment showed that the predicted contours of 88% of patients needed only minor or no revision. In the multi-user evaluation, with the assistance of the deep model, inter-observer variation and required contouring time were reduced by 37.6% and 48.0%, respectively. Conclusions: Deep learning-predicted GTV contours were in close agreement with the ground truth and could be adopted clinically with mostly minor or no changes.
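The Dice score (DSC) reported above measures the overlap between a predicted contour and a reference contour as twice the number of shared voxels divided by the total number of voxels in both masks. The snippet below is a minimal illustrative sketch of that standard definition on flattened binary masks; it is not the authors' evaluation code, and the function name `dice_score`, the toy masks, and the empty-mask convention are assumptions made for illustration.

```python
from typing import Iterable


def dice_score(pred: Iterable[int], truth: Iterable[int]) -> float:
    """Dice similarity coefficient between two flattened binary masks."""
    pred = [bool(v) for v in pred]
    truth = [bool(v) for v in truth]
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled as tumor in both the prediction and the reference.
    intersection = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical toy masks (1 = voxel inside the GTV contour).
predicted = [0, 1, 1, 1, 0, 0]
reference = [0, 0, 1, 1, 1, 0]
score = dice_score(predicted, reference)
print(f"DSC = {score:.3f}")
```

A score of 1 means the two contours coincide exactly and 0 means they share no voxels, so the mean values of 0.80 to 0.83 quoted in the abstract indicate substantial but not perfect overlap.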