Language Screening in 3-Year-Olds: Development and Validation of a Feasible and Effective Instrument for Pediatric Primary Care

Holzinger, Daniel; Weber, Christoph; Barbaresi, William; Beitel, Christoph; Fellinger, Johannes

doi:10.3389/fped.2021.752141

ORIGINAL RESEARCH article

Front. Pediatr., 23 November 2021

Sec. Children and Health

Volume 9 - 2021 | https://doi.org/10.3389/fped.2021.752141

This article is part of the Research Topic: Surveillance of Language Development in Pre-School Children

Daniel Holzinger¹,²,³*

Christoph Weber²,⁴

William Barbaresi²,⁵

Christoph Beitel¹

Johannes Fellinger¹,²,⁶

¹Institute of Neurology of Senses and Language, Hospital of St. John of God, Linz, Austria
²Research Institute for Developmental Medicine, Johannes Kepler University Linz, Linz, Austria
³Institute of Linguistics, University of Graz, Graz, Austria
⁴Department for Inclusive Education, University of Education Upper Austria, Linz, Austria
⁵Division of Developmental Medicine, Department of Pediatrics, Boston Children's Hospital and Harvard Medical School, Boston, MA, United States
⁶Division of Social Psychiatry, University Clinic for Psychiatry and Psychotherapy, Medical University of Vienna, Vienna, Austria

Objective: This study aimed to evaluate the validity and feasibility of the SPES-3 (Sprachentwicklungsscreening), a language screening instrument for 3-year-old children, within the constraints of regular preventive medical check-ups.

Methods: A four-component screening measure was used in a sample of 2,044 consecutively seen children in 30 pediatric offices. It comprised parental reports on the child's expressive vocabulary and grammar, based on the MacArthur Communicative Development Inventory, and pediatrician-administered standardized assessments of noun plurals and sentence comprehension. One hundred forty-four children (70 who failed and 74 who passed the screener) comprised the validation sample and underwent follow-up gold standard assessment. To avoid verification and spectrum bias, multiple imputation of missing diagnoses was used for children who did not undergo gold standard assessment. Independent diagnoses by two experts blinded to the screening results were considered the gold standard for diagnosing language disorder. Screening accuracy of each of the four subscales was analyzed using receiver operating characteristic (ROC) curves. Feasibility was assessed by use of a questionnaire completed by the pediatricians.

Results: The two parental screening subscales demonstrated excellent accuracy, with area under the curve (AUC) scores of 0.910 and 0.908, whereas AUC scores were significantly lower for the subscales directly administered by the pediatricians (0.816 and 0.705). A composite score based on both parental screening scales (AUC = 0.946) outperformed the single subscales. A cut-off of 41.69 on a T-scale resulted in about 20% positive screens and showed good sensitivity (0.878) and specificity (0.876). Practicability, acceptability and sustainability of the screening measure were mostly rated as high.

Conclusion: The parent-reported subscales of the SPES-3 language screener are a promising screening tool for use in primary pediatric care settings.

Introduction

Depending on the definition used, 2–10% of pre-school-age children experience delayed language acquisition, which makes language disorder (LD) one of the most prevalent developmental disorders (1, 2). However, there is no generally accepted definition of what constitutes a LD. A recent consensus statement on terminology and criteria for language problems in children (3) has resulted in the endorsement of the term Developmental Language Disorder (DLD) for language difficulties that are associated with functional impairment and poor prognosis but have no known biomedical etiology. DLD continues to be a clinical diagnosis since functional impairment and prognosis need to be assessed by appropriately trained clinicians. In addition, the degree of language delay and the linguistic dimensions (phonology, vocabulary, morphology, syntax, pragmatics) and modalities (expressive and receptive) encompassed have not been specified in the consensus document.

In an English population study, Norbury et al. (2) found a prevalence of DLD (of unknown origin) of 7.58%, while 2.34% of LDs were associated with intellectual disability and/or a medical diagnosis (total approximately 10% of LDs from all causes). They defined DLD as scores of −1.5 standard deviations (SD) and below on at least two of five language domains. Similarly, other researchers (4–6) classified a child as “specifically language impaired” whenever language performance was below −1.25 SD in at least two language domains measured by norm-referenced tests. In the absence of a generally accepted measurable gold standard for the definition of LDs, we based our definition on the previously mentioned classifications, which are commonly used in research, with an expected prevalence rate of about 10%.

LDs can affect multiple domains of development through adolescence and adulthood. Children with delayed language development are at increased risk for poor socio-emotional, health (7), behavior and academic outcomes (8, 9) and later unemployment (10) with corresponding costs and loss of human potential.

There is growing evidence that intervention for children with LD may be effective. Direct treatments by a specialist and indirect treatments mediated by caregivers have been shown to have positive effects (11–14). Hence, it is essential to identify children who require educational or therapeutic support in their language learning in order to offer timely and effective intervention.

Based on a number of methodological problems identified in their systematic review of language screening (e.g., lack of information on the effects of age, setting and administrator on screening accuracy) and insufficient evidence for long-term outcomes of language interventions, Nelson et al. (15) did not recommend universal language screening. In a more recent systematic review, Wallace et al. (16) reported on the accuracy of some screening measures for identification of children with language impairment, although evidence of feasibility in primary-care settings remained inadequate. Consequently, the U.S. Preventive Services Task Force did not recommend universal screening for language delay (17). Given the international systematic reviews (1, 15) and following the National Health and Medical Research Council (18), the German Institute for Quality and Efficiency in Health Care (19) also considered the available evidence insufficient to recommend the implementation of language screening in Germany. In the same vein, a systematic review evaluating the effectiveness of systematic population-based screening for specific language impairment in pre-school children in Germany (20) concluded that the accuracy of German screening measures had not yet been sufficiently examined.

Systems for general health check-ups have been established in many countries in the Western world and could be suitable opportunities for identification of language delays. However, due to a lack of accurate and feasible instruments, routine well-baby check-ups are generally not used for systematic language screening. One exception is the Netherlands, where a five-minute interview/test procedure (VTO language screening) administered by a youth health care physician to 24-month-old children led to more cases with language impairment being identified than by the regular procedure (0.4–2.4%), and at a significantly younger age than in regions in which the regular detection procedures were used (21). However, as demonstrated by the low number of cases, many 2-year-old children with language impairment were not identified (low sensitivity of 24–52%). Given the high instability of language development trajectories in young children, Law et al. (1) recommended the age span of 3–5 years as the optimal period for language screening.

A combination of (i) observations by parents with extensive long-time knowledge of their children's behaviors in everyday life and (ii) standardized assessments by pediatricians is essential to avoid assessment bias (22). A comparative study of direct assessments and parent reports of language and pragmatics revealed patterns of difference that indicated a need to collect assessment data from multiple informants (23). Nevertheless, convergent validity of parent-reported tools in comparison to direct assessments of language is usually high (24, 25). Mere elicitation of parental concerns has therefore also been described as a valid method for identifying an increased risk for developmental disorders (26–28).

Particularly at the age of 3 years, grammatical knowledge is a good marker of language development (29–31). Furthermore, grammatical skills can usually be reliably assessed within a shorter time than expressive or receptive vocabulary knowledge. In addition, a comprehensive measure for identifying language disorders must refer to both production and comprehension (4). Another reason for including language reception in a screening measure is its ability to predict the persistence of language difficulties (32). As parental reports of language comprehension have often been found to overestimate children's skills (33), a direct assessment of language comprehension should be considered.

Upper Austria is a federal state with a population of 1.45 million inhabitants (with 13,297 births in 2007). Well-child visits are provided free of charge by community pediatricians and general practitioners. In 2010, when the validation study was conducted, 9,125 health check-ups (68.6% of the children born in 2007) were carried out with 3-year-old children, 66% of them by pediatricians and the remainder by general practitioners.

The aim of this study was the validation of a screening tool for LDs for 3-year-old German-speaking children in a pediatric primary-care setting in Upper Austria. The new instrument, the SPES-3 (Sprachentwicklungsscreening), which was developed in a pilot study, includes both parent observations and direct assessments of the child by the primary-care pediatrician. As required for comprehensive language assessments, the screening measure takes various linguistic domains (grammar and vocabulary) and modes (receptive and expressive) into account. For use in primary-care settings, the screening tool must have high acceptability and require little time to administer while maximizing accuracy.

Methods and Procedures

Construction of the Screening Measure and Pilot Testing

Both parent-administered screening scales are based on the MacArthur-Bates Communicative Development Inventories (MCDI), Level III (34), and require parents to systematically report their observations of their child's development of expressive grammar and expressive vocabulary. Inspired by the MCDI concept for grammar assessment a number of morphosyntactic structures of German that typically emerge around age 3 were selected to be presented to the child in the form of 27 pairs of correct and incorrect options. Parents are asked to select the option they are more likely to observe in their child's spontaneous use of language. For expressive vocabulary, 100 words from the MCDI-III English word list (34) translated into Austrian German (35) were chosen. Parents are asked to indicate whether their child uses the words in their expressive communication or not.

The screening subscales administered by the pediatricians were compiled from pre-existing subtests of a German standardized language test [Sprachentwicklungstest SETK 3-5; (36)]. The first subscale includes 20 items that assess the production of noun plurals: The pediatrician presents and names a pictured item in the singular and asks the child to produce the respective plural form, supported by a picture that shows several identical items. The second subscale assesses sentence comprehension: Single sentences are read aloud (9 items), and the child is asked to point to the corresponding picture from a selection of four. For an overview of all screening scales, see Table 1.

Table 1. Screening subscales.

The preliminary version of the screening procedure was first used in 2006 in a pilot study in close cooperation with the Pediatric Association and the Department of Health and Social Affairs in Upper Austria, with the aim of developing language screening instruments to be used in pediatric primary care within regular health check-ups at ages 2 and 3. The pilot study included 1,730 non-preselected 3-year-old children with German as their first language, who were consecutively assessed by a group of 24 primary care pediatricians recruited from across the state of Upper Austria (whose participation was voluntary).

Based on the pilot study, the parental report of expressive grammar scale, which initially contained 27 items, was shortened by excluding 14 items with low response rates, low difficulty and low item-scale correlation. Thus, the final screening measure used in the validation study covered 13 items for expressive grammar, 100 items for expressive vocabulary, 20 items for noun plurals and 9 items for sentence comprehension. All items were scored either as 0 (does not apply/false) or 1 (applies/correct), and for each subscale a sum over the item scores was computed. Internal consistency (Kuder-Richardson Formula 20) was excellent for the parental screening scales (ρ_KR20 = 0.98 for expressive vocabulary and 0.90 for expressive grammar) and adequate for the screening scales directly administered with the children [ρ_KR20 = 0.78 for noun plurals and sentence comprehension; (36)]. For the questionnaires to be completed by the parents (expressive grammar and expressive vocabulary), cut-off scores for the validation study were derived from the 10th percentiles (about 1.25 SD below the mean) based on the final instruments, as suggested by the literature (4–6). Normative data to determine cut-off scores were available for both of the standardized assessments to be completed by pediatricians (noun plurals and sentence comprehension).
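The internal consistency coefficient named above can be illustrated with a short sketch. It assumes the standard textbook form of Kuder-Richardson Formula 20 with population variances; the function name `kr20` and the toy data are hypothetical and are not taken from the study.

```python
from statistics import pvariance


def kr20(item_matrix):
    """Kuder-Richardson Formula 20 for dichotomous (0/1) items.

    item_matrix: list of rows, one row per child, one 0/1 entry per item.
    """
    n_children = len(item_matrix)
    n_items = len(item_matrix[0])
    # Proportion of children scoring 1 on each item
    p = [sum(row[j] for row in item_matrix) / n_children for j in range(n_items)]
    # Sum of item variances p * (1 - p)
    sum_pq = sum(pj * (1 - pj) for pj in p)
    # Population variance of the subscale sum scores
    total_variance = pvariance([sum(row) for row in item_matrix])
    return (n_items / (n_items - 1)) * (1 - sum_pq / total_variance)


# Hypothetical toy data: 5 children x 4 items (not study data)
toy_items = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(toy_items), 3))
```

Because every item is scored 0 or 1, the item variance reduces to p(1 − p), which is why KR-20 is the natural special case of Cronbach's alpha for these subscales.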

Gold Standard Assessment of Language Disorder

Given the lack of well-defined standards for the diagnosis of LDs, independent expert diagnoses by two experienced clinical linguists blinded to the language-screening results were considered the gold standard. Their diagnostic decisions on LD were based on the results of two standardized tests assessing the linguistic domains of expressive sentence grammar, noun plural production, sentence comprehension [SETK 3-5; (36)] and expressive vocabulary [AWST-R; (37)] that had been administered by other linguists and a short video sample (5–10 min) of spontaneous language of each child in a play and/or dialogic picture-book situation. For their decision on LD both clinical linguists followed international research (4–6) and classified a child as language delayed when language performance was at about the 10th percentile or lower in at least two of the four measured language domains and observations of spontaneous language production confirmed the significant language difficulties. Inter-rater reliability (kappa) between the linguists' diagnoses (+/– LD) was 0.95. Discrepancies were resolved by consensus decisions between the two raters.
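As an illustration of the agreement statistic reported above, the following sketch computes Cohen's kappa for two raters. The article only states "kappa", so the choice of Cohen's unweighted kappa is an assumption, and the ratings shown are hypothetical.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who classify the same cases."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Proportion of cases on which both raters agree
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal proportions
    expected = sum(
        (rater_a.count(label) / n) * (rater_b.count(label) / n)
        for label in labels
    )
    return (observed - expected) / (1 - expected)


# Hypothetical diagnoses (1 = LD, 0 = no LD) for ten children, not study data
linguist_1 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
linguist_2 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(round(cohens_kappa(linguist_1, linguist_2), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, so a value of 0.95 indicates near-perfect agreement beyond chance.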

For the assessment of non-verbal intelligence, the Snijders Oomen Non-verbal Intelligence Test [SON-R 2 ½-7; (38)] was administered by clinical psychologists. Pure tone audiometry was used to assess hearing.

Feasibility Measures

Feasibility was measured primarily by use of a questionnaire completed by the pediatricians who participated in the study and by the completeness of screening tests administered. Following the guidelines suggested by Bowen et al. (39), four dimensions of feasibility were investigated. All questions of the pediatric questionnaire were rated on a 4-point Likert scale (very good-good-difficult-very difficult).

Practicality

Practicality was operationalized by the extent to which administration of the screening was considered possible within the time constraints of pediatric primary care. In addition, the pediatricians were asked about the ease of administration of each of the two screening measures (noun plurals and sentence comprehension) and to evaluate parental difficulties in completing the two parental subscales of the screening. Finally, the pediatricians ranked five pre-specified factors that might challenge the completion of the screening measure within their respective settings.

Acceptability

Acceptability refers to the children's, parents' and pediatricians' reactions to the screening measures. Child acceptance was measured by the percentage of screening subscales administered by the pediatricians that were fully completed, as this reflects child compliance with the test. In addition, the pediatricians assessed parental acceptance of the inclusion of language screening in the 3-year medical check-up. Finally, the pediatricians were asked to rate the meaningfulness of including a language screening within the regular well-baby check-ups.

Sustainability

Sustainability refers to the likelihood of language screening in this form being continued within the present system of preventive medical care in Austria. Pediatricians were asked in the questionnaire whether they intended to continue the language screening after the study ended (yes, to a limited extent, no).

Study Procedures and Recruitment

In 2009, all primary care pediatricians of the province of Upper Austria were invited to participate in the validation study of the language screening. Thirty-six out of 60 pediatricians participated in a half-day training and in the subsequent implementation of the screening procedures. The participating pediatricians served all major geographical areas of Upper Austria. Over a 1-year period (2010) 2,635 3-year-old children (19.8% of the entire 2007 Upper Austrian birth cohort) were screened by the 30 pediatricians who ultimately participated in the study.

Overall, 591 children were excluded from the study: 31 children with missing data on their date of birth, one child with missing data on the screening date, 95 children who were outside the age range (34–38 months), 349 children from families whose primary family language was other than German, 95 children with missing data on both parental screening subscales and 22 with missing data on both pediatric screening subscales. The remaining sample (n = 2,044) equaled 15.4% of the entire 2007 birth cohort and about 22% of the children born in 2007 by native-born mothers. The mean age was 36.03 months (SD = 0.994). 50.9% were boys, 4.9% were multiple births, 50.7% of the children had older siblings, and 8.6% of children were born prematurely. Compared to the 2007 birth cohort for Upper Austria, the sample was representative in terms of sex ratio and prematurity rate. Multiple births were slightly overrepresented (4.9% vs. 3.4% in the birth cohort; χ²(1) = 14.003, p < 0.001). To assess the representativeness of the sample in terms of maternal education (proportion of mothers with university entrance qualification or tertiary degree), we calculated an age-adjusted comparison value based on the educational level of all women (i.e., not only mothers) from Upper Austria. Among the participating families the percentage of mothers with either a university entrance qualification or a tertiary degree was comparable to the age-adjusted population (39.4% vs. 41.6%; χ²(1) = 4.072, p < 0.05).
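The two χ² statistics above appear consistent with a one-sample goodness-of-fit test of the sample proportion against the reference proportion. This is an inference from the reported figures, not a procedure stated in the article. A minimal sketch under that assumption, using the rounded percentages and n = 2,044 (the function name is hypothetical):

```python
def chi_square_gof_binary(sample_proportion, reference_proportion, n):
    """Chi-square goodness-of-fit statistic (1 df) for a binary characteristic.

    Compares observed counts in a sample of size n with the counts expected
    under a reference proportion.
    """
    observed = [sample_proportion * n, (1 - sample_proportion) * n]
    expected = [reference_proportion * n, (1 - reference_proportion) * n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


n_sample = 2044
# Multiple births: 4.9% in the sample vs. 3.4% in the birth cohort
chi2_multiple_births = chi_square_gof_binary(0.049, 0.034, n_sample)
# Maternal education: 39.4% in the sample vs. 41.6% age-adjusted reference
chi2_maternal_education = chi_square_gof_binary(0.394, 0.416, n_sample)
print(round(chi2_multiple_births, 3), round(chi2_maternal_education, 3))
```

Under this assumption, both results come out close to the reported values of 14.003 and 4.072.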

The full screening was administered in the course of well-child visits in the pediatric medical practices. Parents completed a screening package that included demographic information in addition to their language observations (expressive grammar and expressive vocabulary). All parents gave written permission for the use of their children's anonymized data for scientific purposes. The study was approved by the ethics committee of the Hospital of St. John of God, Linz. Pediatricians administered two screening subscales (noun plurals and sentence comprehension). All screening data were sent back to the clinic conducting the research. In accordance with Tomblin et al. (40), a screening test was considered a fail if the results of at least two of the four subtests were 1.25 standard deviations below the age norm. Using this ex-ante definition for positive screening results, 21.7% of the sample was considered a screening fail.
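The ex-ante fail rule can be expressed as a small decision function. This is a sketch under stated assumptions: subtest results are taken to be available as z-scores relative to the age norm, "1.25 standard deviations below" is read as "at or below −1.25", and missing subtests are simply not counted. All names are hypothetical.

```python
FAIL_THRESHOLD_SD = -1.25   # assumed reading: at or below -1.25 SD counts as failed
MIN_FAILED_SUBTESTS = 2     # at least two of the four subtests


def is_screening_fail(z_scores):
    """Return True if at least two subtests fall at or below the threshold.

    z_scores: dict mapping subtest name to z-score (None if missing).
    """
    failed = [
        name
        for name, z in z_scores.items()
        if z is not None and z <= FAIL_THRESHOLD_SD
    ]
    return len(failed) >= MIN_FAILED_SUBTESTS


# Hypothetical child: low on both parental scales, average on direct tests
example_child = {
    "expressive_vocabulary": -1.6,
    "expressive_grammar": -1.3,
    "noun_plurals": -0.4,
    "sentence_comprehension": 0.2,
}
print(is_screening_fail(example_child))
```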

To validate the screening measures, a sample for full gold standard assessment was recruited in a two-step procedure. First, all pediatricians were instructed to refer children with positive screening results to a single specialized program for gold standard examination, which was performed in 70 children. Second, to evaluate specificity of the screening tool, four pediatricians from different regions were asked to invite a random subsample of children from the whole cohort, excluding those (n = 70) who had already undergone the gold standard assessment. The sample was stratified by gender, maternal education, position among siblings and single parenthood. The recruitment procedures (see Figure 1) led in total to a validation sample of n = 144 (i.e., 7% of the total sample), which did not deviate from the remaining sample in terms of demographic variables (see sample description).

Figure 1. Recruitment and participation.

Analytic Strategy

First, we report descriptive statistics for the total sample and the validation sample. Second, we present results on the screening accuracy of each subscale based on receiver operating characteristic (ROC) analyses. Areas under the curve (AUCs) ≥0.9 are regarded as excellent, AUCs ≥0.8 and < 0.9 as good, AUCs ≥0.7 and < 0.8 as fair, and AUCs < 0.7 as poor (41). DeLong's method for comparing AUCs of different tests (42) was applied to analyze differences in diagnostic accuracy between the subtests. Further, logistic regression was applied to identify subscales with significant independent contributions to the prediction of the gold standard diagnosis. The aim of this analytic step was to reduce the number of screening subtests while optimizing diagnostic accuracy. Lastly, we evaluated possible cut-off scores for the final screening composite by estimating sensitivity, specificity, positive predictive values (PPV), negative predictive values (NPV), and diagnostic likelihood ratios for positive and negative screening results (DLR+ and DLR–, respectively). DLR+ and DLR– are alternative measures of the accuracy of a diagnostic test and, unlike predictive values, have the advantage of not depending on the prevalence of the disorder under investigation [(43); for an explanation see Supplementary Material]. DLR+ is the multiplicative change in the pre-screening odds of having a LD given a positive screening result (i.e., post-screening odds = DLR+ × pre-screening odds), and DLR– is the corresponding change given a negative screening result (post-screening odds = DLR– × pre-screening odds). Following Jaeschke et al. (44), DLR+ ≥ 10 and DLR– ≤ 0.1 indicate large changes in pre-screening odds; DLR+ > 5 and < 10, and DLR– > 0.1 and ≤ 0.2 indicate moderate changes; DLR+ > 2 and ≤ 5, and DLR– > 0.2 and ≤ 0.5 indicate small changes; DLR+ ≤ 2 and DLR– > 0.5 are rarely important.
Logistic regression models were estimated using Mplus 8 (45), the ROC analyses were run with the pROC package (46) in R, and cut-off values were determined with the OptimalCutpoints package (47) in R.
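The relation between likelihood ratios and pre- and post-screening odds described above can be sketched as follows. The formulas DLR+ = sensitivity / (1 − specificity) and DLR– = (1 − sensitivity) / specificity are the standard definitions; the function names and input values are illustrative assumptions rather than study results.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (DLR+, DLR-) from sensitivity and specificity."""
    dlr_positive = sensitivity / (1 - specificity)
    dlr_negative = (1 - sensitivity) / specificity
    return dlr_positive, dlr_negative


def post_screening_probability(pre_probability, dlr):
    """Apply post-screening odds = DLR x pre-screening odds."""
    pre_odds = pre_probability / (1 - pre_probability)
    post_odds = dlr * pre_odds
    return post_odds / (1 + post_odds)


def interpret_dlr_positive(dlr_positive):
    """Verbal label for DLR+ following the thresholds attributed to (44)."""
    if dlr_positive >= 10:
        return "large"
    if dlr_positive > 5:
        return "moderate"
    if dlr_positive > 2:
        return "small"
    return "rarely important"


# Hypothetical screener: sensitivity 0.90, specificity 0.80, prevalence 10%
dlr_pos, dlr_neg = likelihood_ratios(0.90, 0.80)
print(round(dlr_pos, 2), round(dlr_neg, 3), interpret_dlr_positive(dlr_pos))
print(round(post_screening_probability(0.10, dlr_pos), 3))
```

Because the likelihood ratios depend only on sensitivity and specificity, the same DLR+ can be applied to any assumed prevalence, whereas PPV and NPV have to be recomputed for each population.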

Notably, the recruitment procedure led to oversampling of children with positive screening results, which is not uncommon for screening validation studies (48). Ideally, the validation sample should reflect the patient population; the overrepresentation of positive screening results induces bias in measures of screening accuracy [i.e., verification bias; (49)]. To deal with this bias, we used multiple imputation (MI) of missing diagnosis status for children who did not undergo gold standard assessment (50). Missing values on screening subscales (ranging between 0.4 and 12.8% in the full sample) were also imputed. We used the Blimp imputation software (51, 52) to generate 50 imputed data sets using a chained equation imputation procedure that takes the clustering of children within pediatricians into account. In addition to the study variables (i.e., sociodemographic variables and screening results), we used various auxiliary variables (e.g., parental concern about language development) that were predictive of diagnosis status. Estimates and their standard errors were computed according to Rubin's combining rules (53). Even though recent simulation studies do not indicate that high proportions (up to 90%) of missing data might bias estimates based on MI data sets (54), we also report results for the original validation sample (n = 144) as Supplementary Material for completeness.
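Rubin's combining rules, as referenced above, pool an estimate across imputed data sets by averaging the point estimates and combining within- and between-imputation variance. The sketch below assumes the standard form of these rules; the three estimates and standard errors are hypothetical (the study used 50 imputed data sets).

```python
from statistics import mean, variance


def pool_rubin(estimates, standard_errors):
    """Pool per-imputation estimates and standard errors with Rubin's rules.

    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    pooled_estimate = mean(estimates)
    # Within-imputation variance: average of the squared standard errors
    within = mean([se ** 2 for se in standard_errors])
    # Between-imputation variance: sample variance of the point estimates
    between = variance(estimates)
    total_variance = within + (1 + 1 / m) * between
    return pooled_estimate, total_variance ** 0.5


# Hypothetical AUC estimates from three imputed data sets (not study data)
auc_estimates = [0.90, 0.92, 0.94]
auc_standard_errors = [0.02, 0.02, 0.02]
pooled_auc, pooled_se = pool_rubin(auc_estimates, auc_standard_errors)
print(round(pooled_auc, 3), round(pooled_se, 4))
```

The between-imputation term widens the pooled standard error to reflect uncertainty about the imputed diagnoses, which matters here because most children have no observed gold standard result.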

Results

Figure 2 shows the distributions of the screening subscales. As expected, the scores on all scales were skewed to the left. Means and standard deviations were M = 10.22 (SD = 3.67) for expressive grammar, M = 73.00 (SD = 21.76) for expressive vocabulary, M = 12.05 (SD = 5.19) for noun plurals, and M = 7.48 (SD = 1.48) for sentence comprehension. The correlations between screening subscales were moderate to high. The parental scales correlated with r = 0.56 (p < 0.001). The correlation of the pediatricians' scales was r = 0.40 (p < 0.001). Further correlations were: r_{expressive vocabulary, sentence comprehension} = 0.30, p < 0.001; r_{expressive vocabulary, noun plurals} = 0.40, p < 0.001; r_{expressive grammar, sentence comprehension} = 0.36, p < 0.001; r_{expressive grammar, noun plurals} = 0.43, p < 0.001.

Figure 2. Distribution of the screening subscales (raw scores).

Moreover, analyses showed that children with positive screening results (ex-ante definition) who attended the gold standard assessment had lower scores on three subscales than children with positive screening results who did not attend the assessment (expressive grammar: d = 0.60, p < 0.001; sentence comprehension: d = 0.29, p < 0.05; noun plurals: d = 0.51, p < 0.001). Thus, the validation sample may also be subject to spectrum bias (i.e., screening positives include primarily the “sickest of the sick” and not the full spectrum of positive screens), which is associated with overestimation of sensitivity, specificity, and AUC (49). However, the multiple imputation procedure used here is also suitable to counteract spectrum bias. Based on the gold standard, 27.8% of the validation sample (n = 144) had a LD. Notably, all children with LD had positive screening results (i.e., sensitivity = 1.00). After imputation, 11.7% of the children were classified as having a LD.

As children are clustered within pediatricians, we estimated the intraclass correlation (ICC) to evaluate whether there are differences between pediatricians in the subscales. We found that pediatricians accounted for ≈14% of the variance in the pediatrician-administered subscales (ICC_{noun plurals} = 0.144, p < 0.001; ICC_{sentence comprehension} = 0.139, p < 0.001). If differences between pediatricians were due to population differences in the catchment areas, we would also expect comparable ICCs for the parent reports. However, ICCs for parent reports were smaller (ICC_{expressive vocabulary} = 0.049, p = 0.005; ICC_{expressive grammar} = 0.021, p = 0.05). Thus, these findings indicate that pediatricians differed significantly in their application of the screening tools, which calls into question the objectivity of implementation.
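To make the ICC concrete, the following sketch computes a one-way random-effects ICC from an analysis-of-variance decomposition, allowing for unequal numbers of children per pediatrician. The article does not state which estimator was used (a multilevel model is equally plausible), so this is an illustrative assumption, and the scores are hypothetical.

```python
def icc_oneway(clusters):
    """One-way random-effects ICC from ANOVA mean squares.

    clusters: list of lists, one inner list of scores per pediatrician.
    """
    k = len(clusters)
    sizes = [len(c) for c in clusters]
    n_total = sum(sizes)
    grand_mean = sum(sum(c) for c in clusters) / n_total
    cluster_means = [sum(c) / len(c) for c in clusters]
    ss_between = sum(
        n * (m - grand_mean) ** 2 for n, m in zip(sizes, cluster_means)
    )
    ss_within = sum(
        (x - m) ** 2 for c, m in zip(clusters, cluster_means) for x in c
    )
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    # Adjusted average cluster size for unbalanced designs
    n0 = (n_total - sum(n * n for n in sizes) / n_total) / (k - 1)
    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)


# Hypothetical noun-plural scores for three pediatricians (not study data)
scores_by_pediatrician = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(round(icc_oneway(scores_by_pediatrician), 3))
```

An ICC near 0.14 means that roughly one seventh of the score variance lies between pediatricians rather than between children seen by the same pediatrician.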

Diagnostic Accuracy of Subscales

Results of the ROC analysis for the screening subscales are shown in Table 2. The AUCs ranged from fair (AUC_{sentence comprehension} = 0.705, DeLong 95% CI = [0.623, 0.786]) to excellent (AUC_{expressive grammar} = 0.910, DeLong 95% CI = [0.859, 0.960]). The second parent-reported subscale showed an almost identical excellent AUC value (AUC_{expressive vocabulary} = 0.908, DeLong 95% CI = [0.864, 0.952]). The AUC for noun plurals was good (AUC_{noun plurals} = 0.816, DeLong 95% CI = [0.745, 0.887]). As indicated by DeLong tests for paired ROC curves, parent-reported scales outperformed the screening subscales administered by pediatricians. All AUC differences between parent-reported and pediatrician-administered scales were significant, and noun plurals outperformed sentence comprehension.
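The AUC values above can be read as the probability that a randomly chosen child with LD scores lower on the subscale than a randomly chosen child without LD, with ties counted as one half. The sketch below computes the AUC through this pairwise (Mann-Whitney) formulation. It is not the pROC implementation used in the study, and the scores are hypothetical.

```python
def auc_lower_score_indicates_ld(scores_ld, scores_no_ld):
    """AUC as the share of (LD, non-LD) pairs in which the LD child scores lower.

    Ties contribute 0.5. Lower scores are assumed to indicate disorder.
    """
    concordant = 0.0
    for score_ld in scores_ld:
        for score_no_ld in scores_no_ld:
            if score_ld < score_no_ld:
                concordant += 1.0
            elif score_ld == score_no_ld:
                concordant += 0.5
    return concordant / (len(scores_ld) * len(scores_no_ld))


# Hypothetical subscale sum scores (not study data)
ld_scores = [1, 5]
no_ld_scores = [3, 5]
print(auc_lower_score_indicates_ld(ld_scores, no_ld_scores))
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of the two groups.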

Table 2. Diagnostic accuracy of the screening subscales.

To examine independent contributions of subscales to predicting the gold-standard diagnosis, logistic regression analyses were performed (Table 3). The two parent-reported subscales independently predicted LD. After controlling for the parent-reported screening tests, none of the pediatrician-administered subscales significantly predicted LD. The standardized coefficient (b) was −0.408 (p < 0.001) for the expressive vocabulary subscale and −0.388 (p < 0.001) for the expressive grammar subscale. Notably, as indicated by the overlapping confidence intervals of the standardized logistic regression coefficients, both parent-reported scales had roughly the same weight in predicting LD.

Table 3. Logistic regression predicting LD on the basis of screening subtests.

Diagnostic Accuracy of the Composite Screening Score

Given the results of the logistic regression models, a composite screening score based on both significant predictors (expressive vocabulary and expressive grammar) was computed. As both parent-reported scales contributed almost equally to the prediction of LD, the composite score was computed as the mean of the z-scores of expressive vocabulary and expressive grammar. The AUC for the composite score was excellent at 0.946 (DeLong 95% CI = [0.883, 1.000]). DeLong tests for paired ROC curves indicate that the composite outperformed the single parent-reported scales (composite score vs. expressive vocabulary: ΔAUC = 0.038, t-value = 2.380, p = 0.019; composite score vs. expressive grammar: ΔAUC = 0.036, t-value = 2.102, p = 0.037).
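The composite described above, the mean of the two z-scores, can be sketched as follows. The sketch standardizes each scale against the sample at hand using the population standard deviation, which is an assumption (the article does not specify the reference distribution), and all names and values are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize raw scores to mean 0 and standard deviation 1."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]


def composite_scores(vocabulary_sums, grammar_sums):
    """Equal-weight composite: mean of the two z-scores for each child."""
    z_vocabulary = z_scores(vocabulary_sums)
    z_grammar = z_scores(grammar_sums)
    return [(zv + zg) / 2 for zv, zg in zip(z_vocabulary, z_grammar)]


# Hypothetical sum scores for three children (not study data)
vocabulary = [60, 70, 80]
grammar = [8, 10, 12]
print([round(c, 3) for c in composite_scores(vocabulary, grammar)])
```

Equal weighting mirrors the finding that both standardized regression coefficients were of similar size, so neither scale needs to dominate the composite.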

Cut-Off Estimation

In the next step, cut-off values for the screening composite were estimated (Table 4). For ease of interpretation, the composite score was transformed into a T metric (i.e., M = 50, SD = 10). First, we estimated the cut-off by setting sensitivity equal to specificity using the “SpEqualSe” criterion in the OptimalCutpoints package (47). A cut-off at 41.69 was most efficient. Table 5 reports the classification results for this cut-off, which resulted in satisfactory accuracy statistics: sensitivity = 0.878 (95%-CI = [0.770, 0.985]), specificity = 0.876 (95%-CI = [0.856, 0.895]), PPV = 0.438 (95%-CI = [0.333, 0.544]), NPV = 0.984 (95%-CI = [0.967, 1.000]), DLR+ = 7.078 (95%-CI = [5.779, 8.378]), DLR– = 0.140 (95%-CI = [0.018, 0.261]). The cut-off would have resulted in 20.0% screening fails and consequently in a relatively high number of clinical evaluations required. Lower cut-offs resulted in fewer screening fails and higher PPV, DLR+, DLR– and specificity, but also in lower sensitivity (see Table 4). Thus, lower cut-offs would yield more false-negative results.
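The T transformation and the idea behind a cut-off that balances sensitivity and specificity can be sketched as follows. This mimics the criterion in spirit only: it picks the observed score at which the absolute difference between sensitivity and specificity is smallest, and it is not the OptimalCutpoints implementation. A screen is assumed to be positive when the score is at or below the cut-off; all names and data are hypothetical.

```python
from statistics import mean, pstdev


def to_t_scores(values):
    """Rescale scores to a T metric with mean 50 and SD 10."""
    m = mean(values)
    s = pstdev(values)
    return [50 + 10 * (v - m) / s for v in values]


def sensitivity_specificity(scores, has_ld, cutoff):
    """Screen positive = score at or below the cut-off (assumed convention)."""
    true_pos = sum(s <= cutoff and d for s, d in zip(scores, has_ld))
    false_neg = sum(s > cutoff and d for s, d in zip(scores, has_ld))
    true_neg = sum(s > cutoff and not d for s, d in zip(scores, has_ld))
    false_pos = sum(s <= cutoff and not d for s, d in zip(scores, has_ld))
    return true_pos / (true_pos + false_neg), true_neg / (true_neg + false_pos)


def equal_sens_spec_cutoff(scores, has_ld):
    """Observed score that minimizes |sensitivity - specificity|."""
    def imbalance(cutoff):
        sens, spec = sensitivity_specificity(scores, has_ld, cutoff)
        return abs(sens - spec)
    return min(sorted(set(scores)), key=imbalance)


# Hypothetical T-scores and gold standard diagnoses (not study data)
t_scores = [30, 35, 40, 45, 50, 55, 60, 65]
diagnosed_ld = [True, True, True, False, True, False, False, False]
best_cutoff = equal_sens_spec_cutoff(t_scores, diagnosed_ld)
print(best_cutoff, sensitivity_specificity(t_scores, diagnosed_ld, best_cutoff))
```

Raising the cut-off flags more children and increases sensitivity at the cost of specificity. Lowering it does the reverse, which matches the trade-off described for Table 4.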

Table 4. Diagnostic accuracy statistics for various cut-offs.

Table 5. Classification table.