AUTHOR=Akazawa Kentaro, Sakamoto Ryo, Nakajima Satoshi, Wu Dan, Li Yue, Oishi Kenichi, Faria Andreia V., Yamada Kei, Togashi Kaori, Lyketsos Constantine G., Miller Michael I., Mori Susumu
TITLE=Automated Generation of Radiologic Descriptions on Brain Volume Changes From T1-Weighted MR Images: Initial Assessment of Feasibility
JOURNAL=Frontiers in Neurology
VOLUME=10
YEAR=2019
URL=https://www.frontiersin.org/journals/neurology/articles/10.3389/fneur.2019.00007
DOI=10.3389/fneur.2019.00007
ISSN=1664-2295
ABSTRACT=
Purpose: To examine the feasibility and potential difficulties of automatically generating radiologic reports (RRs) to articulate the clinically important features of brain magnetic resonance (MR) images.
Materials and Methods: We focused on examining brain atrophy by using magnetization-prepared rapid gradient-echo (MPRAGE) images. The technology was based on multi-atlas whole-brain segmentation that identified 283 structures, from which larger superstructures were created to represent the anatomic units most frequently used in RRs. Through two layers of data-reduction filters, based on anatomic and clinical knowledge, raw images (~10 MB) were converted to a few kilobytes of human-readable sentences. The tool was applied to images from 92 patients with memory problems, and the results were compared to RRs independently produced by three experienced radiologists. The mechanisms of disagreement were investigated to understand where the machine–human interface succeeded or failed.
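To make the final data-reduction step concrete, the following is a minimal sketch of z-score-based sentence triggering over structure volumes. All structure names, normative statistics, and the atrophy_sentences function are hypothetical illustrations, not the authors' implementation; only the fixed z > 2.0 trigger comes from the abstract.

# Minimal sketch: trigger a report sentence when a structure's volume
# deviates from a normative mean by more than Z_THRESHOLD standard deviations.
# All normative values and structure names below are hypothetical examples.
NORMATIVE = {
    # structure: (mean volume in mL, standard deviation in mL), age-matched controls
    "hippocampus": (7.5, 0.6),
    "lateral ventricle": (25.0, 8.0),
}

Z_THRESHOLD = 2.0  # fixed trigger used by the automated pipeline per the abstract

def atrophy_sentences(volumes_ml):
    """Emit a human-readable sentence for each structure whose volume
    deviates from the normative mean by more than Z_THRESHOLD SDs."""
    sentences = []
    for structure, volume in volumes_ml.items():
        mean, sd = NORMATIVE[structure]
        z = (volume - mean) / sd
        if z <= -Z_THRESHOLD:
            sentences.append(f"Volume loss of the {structure} (z = {z:.1f}).")
        elif z >= Z_THRESHOLD:  # e.g., ventricular enlargement
            sentences.append(f"Enlargement of the {structure} (z = {z:.1f}).")
    return sentences

print(atrophy_sentences({"hippocampus": 5.9, "lateral ventricle": 43.0}))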
Results: The automatically generated sentences had low sensitivity (mean: 24.5%) and precision (mean: 24.9%); both were significantly lower than the inter-rater sensitivity (mean: 32.7%) and precision (mean: 32.2%) of the radiologists. The causes of disagreement fell into six error categories: mismatch of anatomic definitions (7.2 ± 9.3%), data-reduction errors (11.4 ± 3.9%), translator errors (3.1 ± 3.1%), differences in the spatial extent of the anatomic terms used (8.3 ± 6.7%), segmentation quality (9.8 ± 2.0%), and the threshold for sentence triggering (60.2 ± 16.3%).
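These figures presumably follow the standard definitions computed over matched statements; the abstract does not state how matches were counted, so the reading below is an assumption:

\mathrm{sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{precision} = \frac{TP}{TP + FP}

where TP would count generated sentences that matched a radiologist's statement, FN radiologist statements with no matching generated sentence, and FP generated sentences with no matching radiologist statement.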
Conclusion: These error mechanisms raise interesting questions about the potential of automated report generation and the quality of image reading by humans. The largest discrepancy between the human and automatically generated RRs arose from the sentence-triggering threshold (the degree of abnormality deemed worth reporting), which was fixed at a z-score of >2.0 for the automated generation, whereas the thresholds applied by radiologists varied among anatomic structures.
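This conclusion suggests that matching human reporting behavior would require structure-specific trigger thresholds rather than a single fixed cutoff. A hypothetical variant of the earlier sketch (the threshold values below are illustrative only, not estimates from the study):

# Hypothetical per-structure trigger thresholds; values are illustrative only.
STRUCTURE_Z_THRESHOLDS = {
    "hippocampus": 1.5,        # subtle medial temporal atrophy may be flagged early
    "lateral ventricle": 2.5,  # wider normal variation tolerated before reporting
}

def trigger(structure, z):
    """Trigger a sentence using a structure-specific threshold,
    falling back to the fixed cutoff of 2.0 used by the pipeline."""
    return abs(z) >= STRUCTURE_Z_THRESHOLDS.get(structure, 2.0)

print(trigger("hippocampus", -1.8), trigger("lateral ventricle", -1.8))  # True False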