AUTHOR=Brinks Ralph, Tönnies Thaddäus, Hoyer Annika TITLE=Importance of Diagnostic Accuracy in Big Data: False-Positive Diagnoses of Type 2 Diabetes in Health Insurance Claims Data of 70 Million Germans JOURNAL=Frontiers in Epidemiology VOLUME=2 YEAR=2022 URL=https://www.frontiersin.org/journals/epidemiology/articles/10.3389/fepid.2022.887335 DOI=10.3389/fepid.2022.887335 ISSN=2674-1199 ABSTRACT=

Large data sets comprising diagnoses of chronic conditions are becoming increasingly available for research purposes. In Germany, aggregated claims data from the statutory health insurance, including medical diagnoses and covering roughly 70 million insured persons, are planned to be published regularly. The validity of the diagnoses in such big data sets can hardly be assessed. If the data set comprises prevalence, incidence, and mortality, however, the proportion of false-positive diagnoses can be estimated using mathematical relations from the illness-death model. We apply this method to age-specific aggregated claims data on type 2 diabetes from 70 million Germans, stratified by sex, and report the findings in terms of the age-specific ratio of false-positive diagnoses of type 2 diabetes (FPR) in the data set. The FPR changes with age in both men and women. In men, the FPR increases linearly from 1 to 3 per 1,000 between 30 and 50 years of age. Between 50 and 80 years, the FPR remains below 4 per 1,000. After 80 years of age, it increases to approximately 5 per 1,000. In women, the FPR increases steeply from age 30 to 60 years and peaks at approximately 12 per 1,000 between 60 and 70 years of age. After age 70 years, the FPR in women drops sharply. In all age groups, the FPR is higher in women than in men. In absolute numbers, we find 217,000 people with a false-positive diagnosis in the data set (95% confidence interval, CI: 204,000–229,000), the vast majority being women (172,000; 95% CI: 162,000–180,000). Our work indicates that possible false-positive (and false-negative) diagnoses in claims data should be dealt with appropriately, for example, by including age- and sex-specific error terms in statistical models, to avoid potentially biased or wrong conclusions.
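
As a reading aid, the absolute numbers above can be put into proportion with a few lines of arithmetic. The following is a minimal sketch that uses only the two point estimates reported in the abstract. The variable names are illustrative. The count for men is derived here by subtraction and is not a figure reported by the authors. Because the published counts are rounded to thousands, the derived values are approximate, and no confidence interval can be derived for them this way.

```python
# Point estimates reported in the abstract (persons with a false-positive
# type 2 diabetes diagnosis in the claims data).
total_false_positive = 217_000
women_false_positive = 172_000

# Derived by subtraction; NOT reported in the abstract and only approximate,
# since the published counts are rounded to thousands.
men_false_positive = total_false_positive - women_false_positive

# Share of women among all false-positive diagnoses.
share_women = women_false_positive / total_false_positive


def per_thousand_to_percent(rate_per_thousand: float) -> float:
    """Convert a ratio given per 1,000 into percent."""
    return rate_per_thousand / 10.0


# Peak FPR in women (approximately 12 per 1,000) expressed in percent.
peak_fpr_women_percent = per_thousand_to_percent(12)

print(f"Men (derived): {men_false_positive:,}")
print(f"Share of women: {share_women:.1%}")
print(f"Peak FPR in women: {peak_fpr_women_percent:.1f}%")
```

Under these assumptions, roughly four in five of the false-positive diagnoses fall on women, which is consistent with the abstract's statement that women form the vast majority.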