AUTHOR=Theyers Athena E., Zamyadi Mojdeh, O'Reilly Mark, Bartha Robert, Symons Sean, MacQueen Glenda M., Hassel Stefanie, Lerch Jason P., Anagnostou Evdokia, Lam Raymond W., Frey Benicio N., Milev Roumen, Müller Daniel J., Kennedy Sidney H., Scott Christopher J. M., Strother Stephen C., Arnott Stephen R.

TITLE=Multisite Comparison of MRI Defacing Software Across Multiple Cohorts

JOURNAL=Frontiers in Psychiatry

VOLUME=12

YEAR=2021

URL=https://www.frontiersin.org/journals/psychiatry/articles/10.3389/fpsyt.2021.617997

DOI=10.3389/fpsyt.2021.617997

ISSN=1664-0640

ABSTRACT=<p>With improvements to both scan quality and facial recognition software, there is an increased risk of participants being identified by a 3D render of their structural neuroimaging scans, even when all other personal information has been removed. To prevent this, facial features should be removed before data are shared or openly released. Although several publicly available software algorithms do this, there has been no comprehensive review of their accuracy within the general population. To address this, we tested multiple algorithms on 300 scans from three neuroscience research projects, funded in part by the Ontario Brain Institute, to cover a wide range of ages (3–85 years) and multiple patient cohorts. While skull stripping is more thorough at removing identifiable features, we focused mainly on defacing software, as skull stripping also removes potentially useful information that may be required for future analyses. We tested six publicly available algorithms (afni_refacer, deepdefacer, mri_deface, mridefacer, pydeface, quickshear), with one skull stripper (FreeSurfer) included for comparison. Accuracy was measured through a pass/fail system with two criteria: first, that all facial features had been removed, and second, that no brain tissue was removed in the process. A subset of defaced scans was also run through several preprocessing pipelines to ensure that none of the algorithms would alter the resulting outputs. We found that success rates varied strongly between defacers, with afni_refacer (89%) and pydeface (83%) having the highest rates overall. In both cases, the primary source of failure was a single dataset that the defacer appeared to struggle with: the youngest cohort (3–20 years) for afni_refacer and the oldest (44–85 years) for pydeface. This demonstrates not only that defacer performance depends on the data provided, but also that this effect varies between algorithms. While there were some very minor differences between the preprocessing results for defaced and original scans, none of these were significant, and all were within the range of variation seen between different NIfTI converters or when using raw DICOM files.</p>