ORIGINAL RESEARCH article

Front. Psychol., 18 July 2018

Sec. Quantitative Psychology and Measurement

Volume 9 - 2018 | https://doi.org/10.3389/fpsyg.2018.01225

Development and Validation of an Item Bank for Depression Screening in the Chinese Population Using Computer Adaptive Testing: A Simulation Study

  • School of Psychology, Jiangxi Normal University, Nanchang, China

Abstract

With the increasing prevalence of depression, a simple and precise tool for measuring depression is becoming more important. This study developed a computerized adaptive test for depression (CAT-Depression) using a Chinese sample. The depression item bank was constructed from a sample of 1,135 participants with or without depression using the Graded Response Model (GRM; Samejima, 1969). The final depression item bank, which was strictly unidimensional, comprised 68 items that showed local independence, good item fit, high discrimination, and no differential item functioning (DIF); each item measured at least one symptom in the ICD-10 diagnostic criteria for depression. In addition, the mean IRT discrimination of the item bank reached 1.784, which indicated that the item bank of CAT-Depression was of high quality. Moreover, a CAT simulation study with real response data was conducted to investigate the characteristics, marginal reliability, criterion-related validity, and predictive utility (sensitivity and specificity) of CAT-Depression. The results revealed that the proposed CAT-Depression had acceptable and reasonable marginal reliability, criterion-related validity, and sensitivity and specificity.

Introduction

Depression is one of the most prevalent psychological and behavioral disorders, and the number of people who commit suicide because of depression is growing. By the year 2020, depression will account for 5.7% of the total burden of disease (Dennis et al., 2016), and will be the second greatest disease leading to disability and death after coronary heart disease according to the World Health Organization (Dennis and Hodnett, 2014). At present, the number of depressed patients who seek medical treatment is growing; it is therefore essential to assess and diagnose patients with depression accurately and to provide timely treatment.

In the past, evaluation of depression was predominantly based on questionnaires that were compiled according to the classical test theory (CTT) framework. These questionnaires include the Center for Epidemiological Studies-Depression Scale (CES-D; Radloff, 1977), the Self-rating depression scale (SDS; Zung, 1965), the Patient health questionnaire (PHQ-9; Kroenke et al., 2001), and the Beck Depression Inventory (BDI; Beck et al., 1987). Questionnaires developed under the CTT framework have fixed lengths and usually contain items corresponding to various levels of depression. Because respondents are commonly required to answer every item of a questionnaire, many items may not match a given respondent's symptoms, which adds unnecessary measurement burden and may reduce respondents' enthusiasm. Moreover, these questionnaires cannot provide respondents with more information about their depressive state.

In recent decades, a large number of researchers have used item response theory (IRT) to improve existing depression scales. For example, Aggen et al. (2005) used the Rasch model and a 2-parameter logistic (2PL) model to test the psychometric characteristics of major depression in the Diagnostic and Statistical Manual of Mental Disorders (DSM-III-R), and Stansbury et al. (2006) also applied the IRT model in an analysis of the CES-D scale. The latest and probably most fascinating new perspective provided by IRT is computerized adaptive testing (CAT), which is a form of testing that uses a computer to automatically select appropriate items for the examinee (Almond and Mislevy, 1998). In brief, CAT selects from an item bank an item that is appropriate to the examinee's current latent trait estimate (theta), then updates the examinee's theta according to the response to this item. This process is repeated until the examinee's theta is accurately estimated. We have found that CAT is an effective way to administer items. Several studies have shown that CAT can substantially reduce the number of items administered, which reduces patients' burden without loss of measurement accuracy and saves patients and diagnosticians considerable time. In addition, examinees' motivation to respond increased because the selected items corresponded closely to the examinee's theta (Gibbons et al., 2008), and the examinee may feel that the test was tailored to his or her own condition. Furthermore, the test administrator can control the standard error (SE) of measurement and reduce test length with negligible loss of reliability and measurement precision (Gershon, 2005). CAT also has disadvantages: it is a complex technique, has high initial costs, and requires a substantial amount of human and financial resources to organize. However, the advantages significantly outweigh the disadvantages (Meijer and Nering, 1999).

Regarding CAT for depression, different versions have been researched. For example, Gardner et al. (2004) used the Graded Response Model (GRM; Samejima, 1969) to model the BDI and developed a computer-adaptive version of the BDI. Smits et al. (2011) developed a computer-adaptive version of the CES-D using the GRM. Fliege et al. (2005) developed a CAT for depression whose items were from several depression scales using the Generalized Partial Credit Model (GPCM; Muraki, 1997). Gibbons et al. (2012) developed a CAT for depression using the bifactor model. Forkmann et al. (2013) developed a CAT for depression with good screening performance (Forkmann et al., 2009). Flens et al. (2017) developed a Dutch-Flemish version of CAT for depression based on the patient-reported outcomes measurement information system (PROMIS).

Although there were several existing studies on the development of CAT for depression, there are still some issues that need to be further addressed. First, some versions of CAT for depression (e.g., Gardner et al., 2004; Smits et al., 2011) were developed based on only one depression scale, which meant that there were very few items in the item bank tailored for different respondents/patients. Second, methodologically, there are a large number of IRT models that can fit CAT under the framework of IRT. However, few studies have compared different IRT models in their CAT and selected one optimal model to fit the CAT based on the test-level model-fit check or other methodological considerations. Thirdly, the samples in existing studies of CAT for depression were from different countries. For example, Gardner et al. (2004) used a European-American sample, Fliege et al. (2005) and Forkmann et al. (2013) used a German sample, while Smits et al. (2011) used a Dutch sample. However, no studies have used a Chinese sample to develop CAT for depression. More importantly, according to the investigation of the National Health and Family Commission of the People's Republic of China, the prevalence of depression in China ranged from 1.6 to 4.1% in 2015 (National Health and Family Commission of the People's Republic of China, 2015). In other words, there were about 22.4 million to 57.4 million people suffering from depression in China. It is therefore necessary to develop an efficient and accurate CAT to measure and diagnose depression in China.

In this study, we aimed to address the aforementioned issues by developing a new, more efficient and accurate CAT for depression (hereafter referred to as CAT-Depression) using a Chinese sample. The items in the CAT-Depression bank were preliminarily selected from ten widely used depression scales according to the symptom criteria of depression defined in ICD-10, so that each preliminarily selected item measured at least one of these symptom criteria. In addition, five commonly used polytomously scored IRT models, that is, the GRM, the GPCM, the Partial Credit Model (PCM; Masters, 1982), the Rating Scale Model (RSM; Andrich, 1978), and the Nominal Response Model (NRM; Bock, 1972), were compared based on test-level model-fit checks to choose one optimal model to fit CAT-Depression. Then, several statistical analyses, including a unidimensionality check, a local independence check, an item-level model-fit check, and discrimination and differential item functioning (DIF) analyses, were conducted to create the final item bank of CAT-Depression. Items that showed local independence, high discrimination, good item fit, and no DIF, and that measured at least one symptom criterion of depression in ICD-10, were included in the final unidimensional item bank of CAT-Depression. Finally, a CAT simulation study was carried out to investigate the marginal reliability, criterion-related validity, and predictive utility (sensitivity and specificity) of CAT-Depression.

Methods

Participants

Participants included healthy individuals and patients with depression, aged from 13 to 80 (M = 30.19, SD = 12.23). The healthy individuals were primarily from some social groups and colleges whereas the patients with depression were recruited from eight psychiatric hospitals or mental health centers in China. A total of 1,135 participants were recruited for the study, including 922 healthy individuals and 213 patients with a doctor's diagnosis. Table 1 contains other detailed demographic information. The patients with depression were recruited with the following exclusion criteria: those with a history of psychosis, schizoaffective disorder, or schizophrenia; those with alcohol or drug dependence in the past 3 months but not excluding patients with mood disorder; and those with organic neuropsychiatric syndromes such as Alzheimer disease, Parkinson disease, etc. There were also exclusion criteria for healthy individuals: those with a history of psychosis, schizoaffective disorder, or schizophrenia; those with any psychiatric diagnosis within the past 12 months; and those who received treatment for psychiatric problems within the past 12 months. Any patients or healthy individuals who met any of the exclusion criteria were not chosen to participate in this study.

Table 1

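| Variables | Category | Frequency | Percent (%) |
|---|---|---|---|
| Gender | Male | 414 | 36.50 |
| | Female | 719 | 63.30 |
| | Missing | 2 | 0.20 |
| Age | Under 25 years | 497 | 43.80 |
| | 25 and above | 521 | 45.90 |
| | Missing | 117 | 10.30 |
| Region | Rural | 728 | 64.10 |
| | City | 403 | 35.50 |
| | Missing | 4 | 0.40 |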

Demographic characteristics (N = 1,135).

The present study was carried out following the recommendations of psychometrics studies on mental health at the Research Center of Mental Health, Jiangxi Normal University. The protocol was approved by the Research Center of Mental Health, Jiangxi Normal University. Informed consent was obtained from all participants in accordance with the Declaration of Helsinki. Parental informed consent was also obtained for participants aged below 16.

Measures

The CAT-Depression originally consisted of 117 items. Based on the depression symptom criteria in the ICD-10, items were carefully selected from 10 Chinese-versions of self-rating questionnaires, including the Center for Epidemiological Studies-Depression Scale (CES-D; Radloff, 1977), the Self-rating depression scale (SDS; Zung, 1965), the Patient health questionnaire (PHQ-9; Kroenke et al., 2001), the Beck Depression Inventory (BDI; Beck et al., 1987), the Automatic Thoughts Questionnaire (ATQ; Hollon and Kendall, 1980), the Hospital Anxiety and Depression Scale (HADS; Zigmond et al., 1983), the Minnesota Multiphasic Personality inventory (MMPI; Hathaway and McKinley, 1942), the self-report symptom inventory Symptom checklist 90 (SCL-90; Derogatis, 1994), the Carroll's depression scale (CRS; Hamilton, 1967), and the Brief depression scale (BDS; Koenig et al., 1992). Eighteen items measured behavior-related depressive symptoms in the ICD-10, 43 items measured cognition-related depressive symptoms, 34 items measured mood-related depressive symptoms, fourteen items measured somatic-related depressive symptoms, and eight items measured the symptom of suicidal thoughts.

Items that measured at least one symptom criterion of depression defined in the ICD-10 were preliminarily chosen, and all items employed a 2-week recall period and four response categories. For example, item 13 (Little interest or pleasure in doing things) measured the ICD-10 depression symptom of loss of interest or pleasure; item 51 (Feeling tired or having little energy) measured the symptom of lack of energy or excessive fatigue; item 64 (Poor appetite or overeating) measured the symptom of appetite change; item 69 (I'm no good) measured the symptom of inferiority and loss of self-confidence; item 89 (Have you thought about ending it all) measured the symptom of suicidal thoughts.

Data analysis

Data analysis included two parts: construction of the CAT-Depression item bank, and the CAT-Depression simulation study. The first part was the development of the CAT-Depression item bank, while the second part focused on determining the reliability, validity, and predictive utility (sensitivity and specificity) of CAT-Depression. In the first part of construction of the CAT-Depression item bank, statistical analyses based on IRT were sequentially carried out, including the IRT analyses of unidimensionality, local independence, item fit, discrimination, and DIF.

Construction of the CAT-depression item bank

Unidimensionality

Unidimensionality is a crucial assumption in IRT: an item bank is regarded as unidimensional if a person's responses are determined by his or her level on the single latent trait that the items measure, rather than by other factors. Many IRT models assume unidimensionality, such as the two-parameter logistic model (2PL) and three-parameter logistic model (3PL) for dichotomous response data, and the GRM, the GPCM, the PCM, the RSM, and the NRM for polytomous response data. Therefore, it is essential to assess whether the item bank is sufficiently unidimensional (Reeve et al., 2007). Confirmatory factor analysis (CFA) was used to evaluate the unidimensionality of the item bank, and a one-factor CFA model was fitted using the program Mplus 7.0 (Muthén and Muthén, 2012). Given that the items were polytomous, we used the weighted least squares mean and variance adjusted (WLSMV) estimation method, which yields more precise estimates when the variables are categorical (Beauducel and Herzberg, 2006; Resnik et al., 2012). Unidimensionality of the item bank was considered sufficient if the comparative fit index (CFI) ≥ 0.90, the Tucker-Lewis index (TLI) ≥ 0.90, and the root mean square error of approximation (RMSEA) ≤ 0.08 (Kline, 2010). Items with factor loadings that were greater than 0.3 and significant at the 0.05 level were retained in the development of the item bank in this procedure.

Local independence

Local independence is also a vital assumption in IRT. We used Yen's Q3 statistic (Yen, 1993) to evaluate this assumption, and item pairs with Q3 values higher than 0.36 were considered locally dependent (Flens et al., 2017). Therefore, for each item pair with Q3 > 0.36, one item of the pair was deleted in this study. Local independence analysis was conducted via the R package mirt (Version 1.24; Chalmers, 2012).
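
As an illustration of the Q3 screening rule, the following minimal sketch computes Q3 as the Pearson correlation between the residuals (observed item score minus model-expected score) of two items and flags pairs above the 0.36 cut-off. It is not the mirt implementation used in the study; the function names and the residual values are hypothetical and chosen only for demonstration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def flag_locally_dependent_pairs(residuals, cutoff=0.36):
    """Return (item_i, item_j, Q3) for every item pair whose Q3 exceeds the cutoff.

    residuals maps an item label to one residual per respondent, where a
    residual is the observed item score minus the model-expected score.
    """
    flagged = []
    for i, j in combinations(sorted(residuals), 2):
        q3 = pearson(residuals[i], residuals[j])
        if q3 > cutoff:
            flagged.append((i, j, q3))
    return flagged


# Hypothetical residuals for three items and five respondents.
residuals = {
    "item_a": [0.5, -0.4, 0.3, -0.6, 0.2],
    "item_b": [0.4, -0.5, 0.2, -0.5, 0.4],
    "item_c": [-0.1, 0.3, -0.4, 0.1, 0.1],
}
flagged = flag_locally_dependent_pairs(residuals)
for item_i, item_j, q3 in flagged:
    print(f"{item_i} and {item_j}: Q3 = {q3:.2f}")
```

In this toy example only the pair item_a and item_b is flagged; which item of a flagged pair to delete remains a substantive decision that the sketch does not make.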

Test fit and IRT model selection

In this study, five polytomously scored IRT models (i.e., the GRM, the GPCM, the PCM, the RSM, and the NRM) were each fitted to the selected items of CAT-Depression, and the optimal model was selected based on test-level model-fit indices for further analysis. The widely used test-level model-fit indices include the −2 log-likelihood (−2LL; Spiegelhalter et al., 1998), Akaike's information criterion (AIC; Akaike, 1974), and the Bayesian information criterion (BIC; Schwarz, 1978). Smaller values of these indices indicate better model fit; thus, we selected the model with the smallest values for the later analysis. Model selection was conducted using the software flexMIRT (Version 3.51; Cai, 2017).

Item fit

Item fit was evaluated by the S-χ2 statistic (Kang and Chen, 2008), which quantifies the differences between observed frequencies and the frequencies expected under the IRT model. Flens et al. (2017) recommended treating items with S-χ2 p values less than 0.001 as poorly fitting. In this study, a stricter rule was used instead: items with S-χ2 p values less than 0.01 were deemed to have poor item fit and were eliminated. Item fit was also evaluated using the R package mirt (Version 1.24; Chalmers, 2012).

Discrimination

Item discrimination shows the extent to which an item can differentiate individuals with similar trait levels. A highly discriminating item is better able to distinguish whether individuals exhibit signs of depression. Therefore, a high discrimination parameter suggests that an item is of high quality and helps to obtain a more precise estimate of the latent trait. Moreover, item discrimination has an important impact on item information and on the standard error of measurement, which were used to decide which item was selected in the CAT environment. We used the software flexMIRT (Version 3.51; Cai, 2017) to estimate item parameters under the optimal model identified by the test-level model-fit check and retained items with discrimination greater than 0.8.

Differential item functioning

Given the importance of a questionnaire having no measurement bias in practice, DIF analysis was used here to assess systematic errors due to group bias (Zumbo, 1999). Ordinal logistic regression (Crane et al., 2006) was employed to perform DIF analysis under the optimal model identified by the test-level model-fit check via the R package lordif (Version 0.3-3; Choi, 2015). Change in McFadden's pseudo R2 was used to evaluate effect size, and the hypothesis of no DIF was rejected when the R2 change was equal to or greater than 0.02 (Flens et al., 2017). Therefore, items with changes in McFadden's pseudo R2 ≥ 0.02 were removed from the final analysis. We evaluated DIF for region (rural, city), gender (male, female), and age (under 25 years, 25 and above) groups.

The IRT analyses of unidimensionality, local independence, item fit, discrimination, and DIF were performed sequentially until all remaining items of CAT-Depression satisfied the above rules (i.e., unidimensionality, local independence, good item fit, high discrimination, and no DIF). Items that satisfied all of the following criteria were included in the item bank of CAT-Depression: (1) measuring at least one depression symptom, (2) satisfying the IRT assumption of measuring one main dimension, (3) satisfying the IRT assumption of local independence, (4) fitting the IRT model well, (5) having a discrimination greater than 0.8, and (6) having no DIF. Subsequently, using the optimal model identified by the test-level model-fit check, the item parameters and theta parameters of the final item bank were re-estimated for the subsequent CAT via the software flexMIRT (Version 3.51; Cai, 2017).

CAT-depression simulation study

A CAT simulation study using the participants' real paper-and-pencil (P&P) response data was conducted to investigate the characteristics, marginal reliability, criterion-related validity, and predictive utility (sensitivity and specificity) of CAT-Depression.

Starting point

In the CAT simulation, item selection depends on the examinee's responses to the items already administered. However, no prior information about the examinee is available at the beginning of the test (Kreitzberg and Jones, 1980). The first item was therefore randomly selected from the final depression item bank (Magis and Barrada, 2017), as this method is simple and effective.

Scoring algorithm

After the administration of each item, the examinee's depression theta was updated with the expected a posteriori method (EAP; Bock and Mislevy, 1982) based on his or her real response to the selected item in the P&P data. The EAP estimate for examinee i is

$$\hat{\theta}_i = \frac{\sum_{k=1}^{q} \theta_k \, L_i(\theta_k) \, A(\theta_k)}{\sum_{k=1}^{q} L_i(\theta_k) \, A(\theta_k)},$$

where $\theta_k$ denotes the k-th of q quadrature points and stands in for a specific ability value; $L_i(\theta_k)$ is the likelihood of examinee i's response pattern given the ability value $\theta_k$; and $A(\theta_k)$ is the weight of quadrature point $\theta_k$, with $\sum_{k=1}^{q} A(\theta_k) = 1$. Because the EAP calculations are simple and noniterative (Bock and Mislevy, 1982), the EAP algorithm is an efficient and stable choice.
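
The EAP update can be sketched as follows. This is a minimal illustration rather than the catR routine used in the study: it assumes the GRM in the logistic metric with boundary curves 1 / (1 + exp(−a(θ − b))), a standard normal prior, and 61 equally spaced quadrature points on [−4, 4], all of which are assumptions made here. The function names are hypothetical, and the two example items use the parameters reported for items 1 and 13 in Table 4.

```python
from math import exp, sqrt


def grm_category_probs(theta, a, thresholds):
    """Category probabilities (scores 0..K) under the GRM, logistic metric assumed."""
    cumulative = [1.0] + [1.0 / (1.0 + exp(-a * (theta - b))) for b in thresholds] + [0.0]
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


def eap_estimate(responses, items, n_points=61, lower=-4.0, upper=4.0):
    """Return (EAP theta, posterior SD) for scores 0..K on items given as (a, [b1, ..., bK])."""
    step = (upper - lower) / (n_points - 1)
    nodes = [lower + k * step for k in range(n_points)]
    # Quadrature weights A(theta_k): standard normal prior, normalised to sum to 1.
    raw = [exp(-0.5 * t * t) for t in nodes]
    weights = [w / sum(raw) for w in raw]
    posterior = []
    for t, w in zip(nodes, weights):
        likelihood = 1.0
        for score, (a, thresholds) in zip(responses, items):
            likelihood *= grm_category_probs(t, a, thresholds)[score]
        posterior.append(likelihood * w)
    norm = sum(posterior)
    theta_hat = sum(t * p for t, p in zip(nodes, posterior)) / norm
    variance = sum((t - theta_hat) ** 2 * p for t, p in zip(nodes, posterior)) / norm
    return theta_hat, sqrt(variance)


# Items 1 and 13 of Table 4: (a, [b1, b2, b3]).
items = [(1.68, [-0.67, 1.40, 2.49]), (2.32, [0.36, 1.62, 2.50])]
theta_low, sd_low = eap_estimate([0, 0], items)    # lowest category on both items
theta_high, sd_high = eap_estimate([3, 3], items)  # highest category on both items
print(round(theta_low, 2), round(theta_high, 2))
```

With no responses the function returns the prior mean, which is one way to see why a CAT needs a separate rule for choosing the first item.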

Item selection algorithm

Each time the examinee's theta was updated, the new item with the highest information at the current theta estimate was selected using the maximum Fisher information (MFI) criterion (Baker, 1992). The Fisher information is defined as

$$I_j(\hat{\theta}) = \sum_{k=0}^{K} \frac{\left[P'_{jk}(\hat{\theta})\right]^2}{P_{jk}(\hat{\theta})},$$

where $I_j(\hat{\theta})$ is the item information function of item j given $\hat{\theta}$, $\hat{\theta}$ is the estimated theta, $P_{jk}(\hat{\theta})$ is the probability of obtaining score k given $\hat{\theta}$, K is the total (maximum) score of item j, and $P'_{jk}(\hat{\theta})$ is the first derivative of $P_{jk}(\theta)$ with respect to $\theta$, evaluated at $\hat{\theta}$. The MFI criterion not only ensures efficiency but also actively controls measurement error, and it is a widely used item selection algorithm.
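
A minimal sketch of MFI selection under the GRM is given below. It assumes the logistic metric, in which each boundary curve P* has derivative a·P*(1 − P*), so that the derivative of a category probability is the difference of two such terms. The function names are hypothetical, and the three-item bank uses the parameters reported for items 5, 16, and 64 in Table 4.

```python
from math import exp


def grm_item_information(theta, a, thresholds):
    """Fisher information of one GRM item at theta: sum over categories of P'^2 / P."""
    cumulative = [1.0] + [1.0 / (1.0 + exp(-a * (theta - b))) for b in thresholds] + [0.0]
    information = 0.0
    for k in range(len(thresholds) + 1):
        prob = cumulative[k] - cumulative[k + 1]
        # Derivative of each boundary curve is a * P* * (1 - P*); it is 0 at the fixed ends.
        slope = a * (cumulative[k] * (1.0 - cumulative[k])
                     - cumulative[k + 1] * (1.0 - cumulative[k + 1]))
        if prob > 0.0:
            information += slope * slope / prob
    return information


def select_next_item(theta_hat, bank, administered):
    """Return the label of the not-yet-administered item with maximum information."""
    candidates = [label for label in bank if label not in administered]
    return max(candidates, key=lambda label: grm_item_information(theta_hat, *bank[label]))


# Items 5, 16, and 64 of Table 4: label -> (a, [b1, b2, b3]).
bank = {
    5: (0.84, [0.13, 3.17, 5.39]),
    16: (2.49, [-0.57, 1.17, 2.24]),
    64: (3.14, [0.41, 1.48, 2.45]),
}
first_choice = select_next_item(0.0, bank, administered=set())
second_choice = select_next_item(0.0, bank, administered={first_choice})
choice_for_low_theta = select_next_item(-1.0, bank, administered=set())
print(first_choice, second_choice, choice_for_low_theta)
```

The example shows the adaptive behaviour the criterion is meant to produce: the most informative item changes with the provisional theta estimate, and weakly discriminating items are rarely preferred.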

Stopping rule

In this study, the termination rules were based on the standard error (SE) of measurement: the test was terminated when the pre-specified SE value was reached or the item bank was exhausted. The SE at a trait level is defined as the reciprocal of the square root of the value of the test information function at that trait level (Magis and Raiche, 2012):

$$SE(\hat{\theta}) = \frac{1}{\sqrt{\sum_{j=1}^{n} I_j(\hat{\theta})}},$$

where n is the number of items the examinee has answered. The stopping rule ensures the precision of parameter estimation and makes the test result fair for each examinee. Several stopping rules were used in the CAT-Depression simulation: all items in the bank administered (None), SE (theta) ≤ 0.2, SE (theta) ≤ 0.3, SE (theta) ≤ 0.4, SE (theta) ≤ 0.5, and SE (theta) ≤ 0.6. R (Version 3.4.1; Coreteam, 2015) and the R package catR (Version 3.12; Magis and Barrada, 2017) were used for the above analysis.
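
The SE-based stopping rule amounts to a few lines of arithmetic, sketched below under the definition given above (SE as the reciprocal square root of the summed item information). The function names and the information values are hypothetical; the bank size of 68 is taken from the final item bank.

```python
from math import inf, sqrt


def standard_error(item_informations):
    """SE of theta: reciprocal square root of the summed information of administered items."""
    total = sum(item_informations)
    return inf if total <= 0 else 1.0 / sqrt(total)


def should_stop(item_informations, se_cutoff, bank_size):
    """Stop when the SE target is reached or every item in the bank has been administered."""
    reached_precision = standard_error(item_informations) <= se_cutoff
    bank_exhausted = len(item_informations) >= bank_size
    return reached_precision or bank_exhausted


# Hypothetical information values of five administered items at the current theta estimate.
infos = [1.7, 1.2, 1.5, 1.4, 1.3]
current_se = standard_error(infos)
print(round(current_se, 3), should_stop(infos, 0.4, 68), should_stop(infos, 0.3, 68))
```

Because SE = 0.2 corresponds to a total information of 25 whereas SE = 0.3 corresponds to about 11.1, tightening the cut-off from 0.3 to 0.2 more than doubles the information that must be accumulated before the test can end.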

Characteristics of CAT-depression

To explore the characteristics of CAT-Depression, several statistics were calculated: the mean and standard deviation (SD) of the number of items used, the mean SE of the theta estimates, the Pearson's correlation between the theta estimated in CAT-Depression and the theta estimated using the entire item bank, and the marginal reliability, that is, the mean reliability across all levels of theta (Smits et al., 2011). In a CAT framework, each individual's information can be obtained from the parameters of the administered items and his or her responses to these items. When the mean and SD of theta are fixed to 0 and 1, respectively, the corresponding reliability of each individual can be derived via the following formula (Samejima, 1994):

$$r_{xx}(\theta_i) = 1 - \frac{1}{I(\theta_i)},$$

where $I(\theta_i)$ is the test information for the i-th individual and $r_{xx}(\theta_i)$ is the corresponding IRT reliability for the i-th individual. The marginal reliability is then the mean of these reliabilities over all individuals. Furthermore, we plotted the number of selected items as a function of the final theta estimate, together with the test information curve, under several stopping rules. The test information reflects the measurement precision of CAT-Depression: the greater the value, the smaller the error of the theta estimate.
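
The marginal reliability computation can be sketched as below. The sketch assumes the common form in which, with the variance of theta fixed at 1, an individual's reliability is 1 − SE² = 1 − 1/I; this form is an assumption made here for illustration and should be checked against Samejima (1994). The function names and information values are hypothetical.

```python
def reliability_from_information(test_information):
    """Individual reliability, assuming theta variance 1: r = 1 - 1 / I = 1 - SE^2."""
    return 1.0 - 1.0 / test_information


def marginal_reliability(test_informations):
    """Mean of the individual reliabilities across all examinees."""
    values = [reliability_from_information(info) for info in test_informations]
    return sum(values) / len(values)


# Hypothetical final test information values for four simulated examinees.
informations = [25.0, 12.5, 10.0, 8.0]
print(round(marginal_reliability(informations), 3))
```

Under this form, a stopping rule of SE ≤ 0.3 guarantees an individual reliability of at least 0.91 for every examinee who reaches the target before the bank is exhausted.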

Criterion-related validity and predictive utility (sensitivity and specificity) of CAT-depression

To further investigate the criterion-related validity and predictive utility (sensitivity and specificity) of CAT-Depression, the CES-D, SDS, and PHQ-9 scales, which are widely used and well validated in diagnosing depression, were selected as criterion scales. The Pearson's correlations between the estimated theta in CAT-Depression and the standard score of the SDS and the sum scores of the CES-D and the PHQ-9 were calculated to address the criterion-related validity of CAT-Depression. Then, the area under the receiver operating characteristic (ROC) curve (AUC) was used as an additional criterion to investigate the predictive utility (sensitivity and specificity) of CAT-Depression (Smits et al., 2011). We used the CES-D, SDS, and PHQ-9, respectively, as the classification variable for depression, and the estimated theta in CAT-Depression was used as the continuous variable to plot the ROC curve under each stopping rule via the software SPSS 17.0. The AUC is a statistic used to evaluate the ROC curve, and its value ranges from 0.5 to 1. A larger AUC value indicates a better diagnostic effect (Kraemer and Kupfer, 2006). The predictive utility of the estimated theta for diagnosing depression is similar to random guessing when AUC = 0.5, while it is perfect when AUC = 1. AUC values from 0.5 to 0.7, from 0.7 to 0.9, and from 0.9 to 1 indicate small, moderate, and high predictive utility, respectively (Forkmann et al., 2013). The critical value was determined by maximizing the Youden index (YI = sensitivity + specificity − 1) (Schisterman et al., 2005). Here, sensitivity refers to the probability that a patient is correctly identified as having the disease, and specificity refers to the probability that a healthy individual is correctly identified as not having it; the larger the values of these two indicators, the better the diagnostic effect.
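
The predictive-utility statistics can be illustrated with a small sketch. It is not the SPSS procedure used in the study: it assumes that a theta estimate at or above the cut-off is classified as depressed, computes the AUC as the proportion of patient/non-patient pairs in which the patient has the higher theta (ties counted as one half), and searches the observed theta values for the cut-off that maximizes the Youden index. Function names and data are hypothetical.

```python
def sensitivity_specificity(scores, labels, cutoff):
    """Classify score >= cutoff as depressed; labels use 1 = depressed, 0 = not depressed."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def best_cutoff_by_youden(scores, labels):
    """Return (cutoff, Youden index, sensitivity, specificity) maximizing the Youden index."""
    best = None
    for cutoff in sorted(set(scores)):
        sens, spec = sensitivity_specificity(scores, labels, cutoff)
        youden = sens + spec - 1.0
        if best is None or youden > best[1]:
            best = (cutoff, youden, sens, spec)
    return best


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positive and negative cases."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical theta estimates and criterion classifications for six respondents.
thetas = [-1.2, -0.5, 0.1, 0.4, 0.9, 1.5]
depressed = [0, 0, 0, 1, 0, 1]
cutoff, youden, sens, spec = best_cutoff_by_youden(thetas, depressed)
area = auc(thetas, depressed)
print(cutoff, youden, sens, spec, area)
```

In this toy data set the Youden-optimal cut-off is theta = 0.4, which yields a sensitivity of 1.00 and a specificity of 0.75, and the AUC is 0.875.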

Results

Construction of the CAT-depression item bank

Unidimensionality and local independence

In the one-factor CFA of the initial CAT-Depression item bank of 117 items, 23 items were eliminated because their factor loadings were less than 0.3 or were not significant at the 0.05 level. After excluding these 23 items from the item bank, we re-ran the one-factor CFA on the remaining 94 items. The results showed acceptable model fit: CFI = 0.902, TLI = 0.900, and RMSEA = 0.051. Thus, the remaining item bank (94 items; see Table 2) met the assumption of unidimensionality. Table 2 also shows that 15 items were deemed to be locally dependent because their Yen's Q3 statistic was greater than 0.36; these items were removed from the item bank.

Table 2

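| Item | Assessed symptom | Factor loading (CFA1) | Factor loading (CFA2) | Excluded due to… | Item | Assessed symptom | Factor loading (CFA1) | Factor loading (CFA2) | Excluded due to… |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Mood | 0.71 | 0.71 | | 60 | Cognition | 0.54 | | CFA |
| 2 | Mood | 0.73 | 0.75 | Q3 | 61 | Cognition | 0.81 | 0.82 | |
| 3 | Mood | 0.51 | 0.52 | | 62 | Cognition | 0.33 | 0.37 | S-χ2 + Discrimination |
| 4 | Mood | 0.58 | 0.59 | Q3 | 63 | Mood | 0.77 | | CFA |
| 5 | Mood | 0.75 | 0.76 | Q3 + S-χ2 | 64 | Somatic | 0.47 | 0.50 | |
| 6 | Mood | 0.59 | 0.52 | Q3 | 65 | Behavior | 0.47 | | CFA |
| 7 | Mood | 0.67 | 0.59 | | 66 | Cognition | 0.61 | 0.63 | |
| 8 | Cognition | 0.57 | 0.59 | | 67 | Cognition | 0.58 | 0.59 | Q3 |
| 9 | Somatic | 0.43 | 0.44 | | 68 | Cognition | 0.73 | 0.74 | |
| 10 | Somatic | 0.22 | | CFA | 69 | Cognition | 0.78 | 0.79 | |
| 11 | Cognition | 0.21 | | CFA | 70 | Behavior | 0.60 | 0.61 | |
| 12 | Mood | 0.76 | | CFA | 71 | Behavior | 0.76 | 0.77 | |
| 13 | Mood | 0.67 | 0.69 | | 72 | Cognition | 0.77 | 0.78 | |
| 14 | Cognition | 0.66 | 0.67 | | 73 | Cognition | 0.75 | 0.76 | |
| 15 | Mood | 0.75 | 0.76 | | 74 | Cognition | 0.79 | 0.79 | |
| 16 | Mood | 0.60 | 0.62 | | 75 | Cognition | 0.72 | 0.73 | |
| 17 | Mood | 0.78 | | CFA | 76 | Cognition | 0.67 | | CFA |
| 18 | Cognition | 0.66 | 0.61 | | 77 | Mood | 0.76 | 0.78 | |
| 19 | Mood | 0.64 | 0.66 | Q3 + DIF | 78 | Cognition | 0.67 | 0.69 | |
| 20 | Mood | 0.60 | 0.62 | | 79 | Cognition | 0.71 | | CFA |
| 21 | Mood | 0.43 | 0.45 | | 80 | Cognition | 0.76 | 0.77 | |
| 22 | Mood | 0.80 | 0.80 | | 81 | Cognition | 0.64 | 0.65 | |
| 23 | Cognition | 0.63 | 0.65 | | 82 | Mood | 0.63 | 0.64 | |
| 24 | Mood | 0.72 | 0.74 | Q3 | 83 | Cognition | 0.83 | 0.83 | S-χ2 |
| 25 | Somatic | 0.32 | | CFA | 84 | Cognition | 0.82 | 0.82 | |
| 26 | Mood | 0.71 | 0.72 | S-χ2 | 85 | Cognition | 0.41 | | CFA |
| 27 | Mood | 0.48 | 0.48 | DIF | 86 | Cognition | 0.82 | 0.82 | |
| 28 | Mood | 0.67 | 0.59 | | 87 | Cognition | −0.40 | | CFA |
| 29 | Mood | 0.80 | 0.81 | | 88 | Suicide | 0.84 | 0.84 | Q3 |
| 30 | Mood | 0.43 | 0.38 | Discrimination | 89 | Suicide | 0.78 | 0.79 | |
| 31 | Mood | 0.68 | | CFA | 90 | Cognition | 0.61 | 0.63 | |
| 32 | Mood | 0.39 | 0.32 | Q3 + Discrimination | 91 | Suicide | 0.85 | 0.84 | Q3 |
| 33 | Behavior | 0.62 | 0.63 | | 92 | Suicide | 0.87 | 0.87 | Q3 |
| 34 | Behavior | 0.74 | 0.76 | | 93 | Mood | 0.49 | 0.51 | DIF |
| 35 | Somatic | 0.24 | | CFA | 94 | Cognition | 0.61 | | CFA |
| 36 | Behavior | 0.36 | 0.31 | Discrimination | 95 | Suicide | 0.78 | | CFA |
| 37 | Somatic | 0.53 | 0.55 | | 96 | Somatic | 0.68 | | CFA |
| 38 | Somatic | 0.65 | 0.56 | Q3 | 97 | Somatic | 0.41 | 0.41 | Q3 + Discrimination |
| 39 | Behavior | 0.69 | 0.69 | | 98 | Somatic | 0.48 | | CFA |
| 40 | Cognition | 0.74 | 0.75 | | 99 | Mood | 0.80 | 0.80 | |
| 41 | Cognition | 0.67 | 0.61 | | 100 | Somatic | 0.48 | 0.49 | |
| 42 | Somatic | 0.34 | 0.36 | Discrimination | 101 | Somatic | 0.12 | | CFA |
| 43 | Behavior | 0.68 | 0.70 | | 102 | Cognition | 0.76 | 0.78 | Q3 |
| 44 | Behavior | 0.60 | 0.62 | | 103 | Behavior | 0.59 | 0.61 | |
| 45 | Behavior | 0.57 | | CFA | 104 | Behavior | 0.79 | 0.80 | |
| 46 | Cognition | 0.82 | 0.82 | | 105 | Cognition | 0.73 | 0.75 | |
| 47 | Somatic | 0.50 | 0.52 | | 106 | Suicide | 0.78 | 0.78 | |
| 48 | Cognition | 0.43 | 0.39 | Discrimination | 107 | Mood | 0.72 | 0.73 | |
| 49 | Cognition | 0.57 | 0.60 | | 108 | Behavior | 0.29 | 0.32 | Discrimination |
| 50 | Cognition | 0.55 | 0.56 | | 109 | Suicide | 0.85 | 0.86 | |
| 51 | Behavior | 0.65 | 0.67 | | 110 | Cognition | 0.88 | 0.88 | |
| 52 | Mood | 0.71 | 0.72 | | 111 | Behavior | 0.82 | 0.83 | |
| 53 | Behavior | 0.72 | 0.74 | Q3 | 112 | Mood | 0.52 | | CFA |
| 54 | Cognition | 0.59 | 0.61 | | 113 | Cognition | 0.65 | 0.67 | |
| 55 | Cognition | 0.66 | 0.68 | | 114 | Behavior | 0.71 | 0.72 | |
| 56 | Cognition | 0.72 | 0.73 | | 115 | Behavior | 0.78 | 0.79 | |
| 57 | Mood | 0.71 | | CFA | 116 | Mood | 0.74 | | CFA |
| 58 | Cognition | 0.59 | 0.61 | | 117 | Suicide | 0.69 | 0.68 | S-χ2 |
| 59 | Cognition | 0.60 | 0.54 | | | | | | |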

Factor loading of the CFA1 and CFA2 and reasons for exclusion.

CFA1, the first CFA run of 117 items; CFA2, the second CFA run of 94 items.

Test fit and IRT model selection

Test-fit statistics of the GRM, the GPCM, the PCM, the RSM, and the NRM are reported in Table 3. For the GRM, −2LL = 179,190.50, AIC = 179,942.50, and BIC = 181,835.43. All three fit indices of the GRM were smaller than those of the other four IRT models, which suggested that the GRM fitted the data better than the others. Therefore, the GRM was applied in the subsequent IRT analyses and in CAT-Depression.

Table 3

| Model | −2LL | AIC | BIC |
|---|---|---|---|
| Graded Response Model | 179,190.50 | 179,942.50 | 181,835.43 |
| Generalized Partial Credit Model | 180,630.54 | 181,382.54 | 183,275.47 |
| Partial Credit Model | 185,706.78 | 186,272.78 | 187,697.52 |
| Rating Scale Model | 190,184.94 | 190,378.94 | 190,867.28 |
| Nominal Response Model | 179,792.25 | 180,920.25 | 183,759.64 |

Test-level model-fit for five polytomously-scored IRT models.

−2LL,−2Log-Likelihood; AIC, Akaike's information criterion; BIC, Bayesian information criterion.
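
The relation between the three indices can be checked with a short calculation, using the standard definitions AIC = −2LL + 2p and BIC = −2LL + p ln(N). The number of free parameters p is not reported in the text; the value p = 376 used below is an assumption (94 items, each with one discrimination and three threshold parameters under the GRM), and N = 1,135 is the sample size. The function names are hypothetical.

```python
from math import log


def aic(neg2ll, n_params):
    """Akaike's information criterion from -2 log-likelihood and parameter count."""
    return neg2ll + 2 * n_params


def bic(neg2ll, n_params, n_obs):
    """Bayesian information criterion from -2 log-likelihood, parameter count, and sample size."""
    return neg2ll + n_params * log(n_obs)


# GRM row of Table 3; p = 94 * (1 + 3) = 376 is an assumed parameter count.
neg2ll_grm = 179190.50
n_params_grm = 94 * (1 + 3)
n_respondents = 1135

aic_grm = aic(neg2ll_grm, n_params_grm)
bic_grm = bic(neg2ll_grm, n_params_grm, n_respondents)
print(round(aic_grm, 2), round(bic_grm, 2))
```

Under this assumed parameter count the calculation reproduces the AIC and BIC reported for the GRM in Table 3, which also shows how BIC penalizes each additional parameter more heavily than AIC at this sample size (ln 1,135 is about 7.03, compared with 2).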

Discrimination, item model-fit, and differential item functioning

Results of the item-fit and discrimination analyses indicated that five items did not fit the GRM and that the discriminations of eight items were less than 0.8 (see Table 2). Regarding DIF, no item showed DIF across region groups, whereas two items showed DIF across gender groups and one item showed DIF across age groups. For gender, the R2 changes were 0.03 and 0.04 for item 19 and item 93, respectively. For age, the R2 change was 0.02 for item 27. In total, 11 items that had low discrimination (less than 0.8), did not fit the GRM, or showed DIF were removed from further IRT analysis (see Table 2).

At this point, the final item bank for CAT-Depression comprised 68 items, after 49 items had been excluded for the above psychometric reasons. Table 2 displays the eliminated items and the reasons for elimination.

Table 4 displays the item parameters of the final item bank of CAT-Depression. The discrimination parameter of the final item bank ranged from 0.84 to 3.14 with an average value of 1.784, which clearly showed that the final item bank of CAT-Depression was of high quality.

Table 4

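| Item | Abbreviated item content | a | b1 | b2 | b3 |
|---|---|---|---|---|---|
| 1 | Pessimism | 1.68 | −0.67 | 1.40 | 2.49 |
| 2 | Happy as others | 1.03 | −1.13 | 0.84 | 2.27 |
| 3 | Happiness | 1.24 | −1.57 | 0.47 | 2.38 |
| 4 | Being reproached | 1.30 | −0.15 | 2.24 | 4.12 |
| 5 | Loss of appetite | 0.84 | 0.13 | 3.17 | 5.39 |
| 6 | Loss of interest | 1.62 | −0.99 | 1.97 | 2.97 |
| 7 | Being despised | 1.69 | −0.06 | 1.91 | 3.27 |
| 8 | Still depressed with others' help | 2.05 | 0.02 | 1.45 | 2.45 |
| 9 | Agitation | 1.39 | −1.17 | 1.48 | 3.16 |
| 10 | Good as others | 1.29 | −1.13 | 0.66 | 2.34 |
| 11 | Boring | 1.40 | −1.16 | 1.00 | 2.67 |
| 12 | Social withdrawal | 0.88 | −0.37 | 2.06 | 3.63 |
| 13 | Loss of pleasure | 2.32 | 0.36 | 1.62 | 2.50 |
| 14 | Concentration difficulty | 1.51 | −0.74 | 1.30 | 2.73 |
| 15 | Interested in everything around | 1.27 | −1.76 | 0.02 | 1.99 |
| 16 | Gloomy mood | 2.49 | −0.57 | 1.17 | 2.24 |
| 17 | So tired and unable to do anything | 1.47 | −0.51 | 1.75 | 3.00 |
| 18 | Everything is laborious | 2.04 | −0.31 | 1.39 | 2.52 |
| 19 | Sleep disorders | 1.12 | −1.26 | 2.04 | 3.33 |
| 20 | Feel like body had rotted away | 1.70 | 1.11 | 2.23 | 3.30 |
| 21 | Difficulties around | 2.05 | 0.02 | 1.65 | 2.73 |
| 22 | Future is promising | 1.34 | −1.11 | 0.52 | 2.18 |
| 23 | Hypodynamia | 1.65 | −0.47 | 1.49 | 2.67 |
| 24 | Difficult to start | 1.48 | −0.78 | 1.39 | 2.78 |
| 25 | Sense of failure | 2.62 | 0.00 | 1.31 | 2.12 |
| 26 | Rapid heart beat | 1.09 | 0.17 | 2.86 | 4.46 |
| 27 | Immersion in the past | 1.34 | −1.14 | 1.06 | 2.48 |
| 28 | Draw a blank | 1.22 | −1.20 | 1.13 | 2.98 |
| 29 | Tiredness or fatigue | 1.62 | −1.30 | 1.78 | 3.09 |
| 30 | Fear | 1.99 | −0.35 | 1.54 | 2.80 |
| 31 | Indecisiveness | 1.39 | −1.02 | 1.10 | 2.70 |
| 32 | Reasoning difficulty | 1.64 | −0.61 | 1.41 | 2.75 |
| 33 | Mind blank | 1.97 | −0.17 | 1.54 | 2.60 |
| 34 | Irresolution | 1.44 | −1.04 | 1.15 | 2.66 |
| 35 | Clear mind | 1.25 | −1.33 | 0.60 | 2.37 |
| 36 | Disappointment | 2.64 | −0.29 | 1.23 | 2.12 |
| 37 | Poor appetite or eating too much | 1.03 | −1.54 | 2.41 | 3.96 |
| 38 | Unattractiveness feelings | 1.50 | −0.83 | 1.33 | 2.63 |
| 39 | Worse than others | 1.99 | −0.46 | 1.26 | 2.49 |
| 40 | Self-assessment low | 2.40 | −0.10 | 1.38 | 2.38 |
| 41 | Talk less | 1.45 | −0.63 | 1.33 | 2.68 |
| 42 | Uncalm | 2.27 | −0.32 | 1.23 | 2.31 |
| 44 | Not needed feelings | 2.18 | −0.21 | 1.29 | 2.26 |
| 45 | Self-dislike | 2.29 | 0.40 | 1.65 | 2.53 |
| 46 | Loneliness | 1.99 | −0.48 | 1.12 | 2.03 |
| 47 | Feel like crying | 2.05 | 0.28 | 1.75 | 2.74 |
| 48 | Self-criticalness | 1.66 | −0.75 | 2.13 | 2.96 |
| 49 | Helplessness | 2.25 | −0.16 | 1.39 | 2.22 |
| 50 | Unfriendly treatment feelings | 1.55 | 0.19 | 2.34 | 3.60 |
| 51 | Irritability | 1.50 | −0.31 | 1.73 | 3.18 |
| 52 | Future is not appealing | 2.56 | 0.41 | 1.60 | 2.44 |
| 53 | Have no future | 2.59 | 0.27 | 1.55 | 2.45 |
| 54 | Suicidal thoughts | 1.87 | 0.68 | 1.89 | 2.73 |
| 55 | Concentration or memory difficulty | 1.41 | −1.01 | 2.21 | 3.49 |
| 56 | Sadness | 2.40 | −0.64 | 1.15 | 2.03 |
| 57 | Eating too much or little | 0.99 | 0.09 | 2.66 | 4.29 |
| 58 | Dilatory or intense behavior | 1.31 | −0.69 | 2.60 | 3.74 |
| 59 | Restlessness | 2.31 | −0.25 | 1.33 | 2.31 |
| 60 | Unpopularity | 2.10 | 0.12 | 1.73 | 2.79 |
| 61 | Others' life will be better without me | 1.96 | 1.01 | 2.14 | 3.01 |
| 62 | Smile less | 1.87 | −0.13 | 1.21 | 2.22 |
| 63 | No good things | 2.94 | 0.26 | 1.58 | 2.23 |
| 64 | Despair | 3.14 | 0.41 | 1.48 | 2.45 |
| 65 | Unable to continue daily work | 2.61 | 0.41 | 1.61 | 2.32 |
| 66 | Regret and upset | 1.71 | −1.16 | 1.19 | 2.41 |
| 67 | Unable to provide for oneself | 1.81 | 0.38 | 2.07 | 3.08 |
| 68 | Unable to restart | 2.27 | 0.42 | 1.65 | 2.39 |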

Item parameters of the final item bank with GRM.

a, Discrimination parameter; b, Threshold parameter.
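To illustrate how the a and b parameters in Table 4 translate into response probabilities and measurement precision, the following is a minimal sketch of the GRM category probabilities and item information. It assumes the plain logistic metric (no 1.7 scaling constant) and four response categories scored 0 to 3; the function names are illustrative and are not taken from the software used in the study.

```python
import math

def grm_category_probs(theta, a, b):
    """Category probabilities for one item under the graded response model.

    theta: latent trait value; a: discrimination; b: ordered thresholds.
    Returns len(b) + 1 probabilities (categories 0..len(b)).
    Assumption: logistic metric without the 1.7 scaling constant.
    """
    # Cumulative curves P*(X >= k), padded with 1 (lowest) and 0 (beyond highest).
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(b) + 1)]

def grm_item_information(theta, a, b):
    """Fisher information of one GRM item at theta: sum of P_k'^2 / P_k."""
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b] + [0.0]
    info = 0.0
    for k in range(len(b) + 1):
        p = cum[k] - cum[k + 1]
        # The derivative of each cumulative curve is a * P* * (1 - P*).
        d = a * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        if p > 0:
            info += d * d / p
    return info

# Parameters copied from Table 4: item 64 (Despair) and item 5 (Loss of appetite).
despair = (3.14, [0.41, 1.48, 2.45])
appetite = (0.84, [0.13, 3.17, 5.39])

probs = grm_category_probs(1.0, *despair)
info_despair = grm_item_information(1.0, *despair)
info_appetite = grm_item_information(1.0, *appetite)
```

Under these assumptions, the highly discriminating item 64 carries far more information at theta = 1 than item 5, which is why an adaptive algorithm would tend to administer such items early.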

CAT-depression simulation study

Characteristics of CAT-depression

Table 5 presents the CAT-Depression outcomes for the real response data under different stopping rules. On average, 27.61 items (SD = 16.17) were administered to estimate latent theta under the stopping rule SE (theta) ≤ 0.2. Relaxing the stopping rule beyond SE (theta) ≤ 0.2 led to considerable further item savings: a mean of 11.46 items (SD = 9.57) under SE (theta) ≤ 0.3, 6.48 items (SD = 5.77) under SE (theta) ≤ 0.4, and 4.36 items (SD = 3.53) under SE (theta) ≤ 0.5 (Table 5). Only 3.27 items on average (SD = 1.58) were needed under the stopping rule SE (theta) ≤ 0.6. The Pearson's correlation between the theta estimated by CAT-Depression and the theta estimated from the entire item bank ranged from 0.88 to 0.99 across the stopping rules, which indicated that the adaptive algorithm was efficient for CAT-Depression.

Table 5

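| Stopping rule | Number of items used (Mean) | Number of items used (SD) | Mean SE(theta) | Marginal reliability | r |
|---|---|---|---|---|---|
| None | 68 | 0 | 0.15 | 0.97 | 1 |
| SE (theta) ≤ 0.6 | 3.27 | 1.58 | 0.52 | 0.73 | 0.88 |
| SE (theta) ≤ 0.5 | 4.36 | 3.53 | 0.45 | 0.79 | 0.90 |
| SE (theta) ≤ 0.4 | 6.48 | 5.77 | 0.38 | 0.86 | 0.93 |
| SE (theta) ≤ 0.3 | 11.46 | 9.57 | 0.29 | 0.91 | 0.96 |
| SE (theta) ≤ 0.2 | 27.61 | 16.17 | 0.20 | 0.96 | 0.99 |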

Characteristics of the CAT-Depression under several stopping rules.

None, the entire item bank was administered; r, the Pearson's correlation between the estimated theta in the CAT-Depression and the estimated theta via the entire item bank.
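The stopping rules compared in Table 5 can be sketched as a small post-hoc CAT loop that replays responses a person has already given. This is only an illustration: the item selection rule (maximum Fisher information), the EAP estimator with a standard normal prior, the quadrature grid, and all function names are assumptions made for this sketch, and the four-item bank and response pattern are hypothetical examples built from Table 4 parameters rather than the study's data or software.

```python
import math

def grm_probs(theta, a, b):
    """GRM category probabilities (logistic metric, assumed)."""
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(b) + 1)]

def grm_info(theta, a, b):
    """Fisher information of one GRM item at theta."""
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b] + [0.0]
    total = 0.0
    for k in range(len(b) + 1):
        p = cum[k] - cum[k + 1]
        d = a * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        if p > 0:
            total += d * d / p
    return total

GRID = [-4.0 + 0.1 * i for i in range(81)]  # quadrature points on [-4, 4]

def eap(posterior):
    """Posterior mean and posterior SD of theta over GRID."""
    z = sum(posterior)
    mean = sum(t * w for t, w in zip(GRID, posterior)) / z
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, posterior)) / z
    return mean, math.sqrt(var)

def simulate_cat(bank, responses, se_stop):
    """Replay stored responses adaptively until SE(theta) <= se_stop.

    bank: {item_id: (a, [b1, b2, b3])}; responses: {item_id: category 0..3}.
    Returns (theta estimate, SE, list of administered item ids).
    """
    posterior = [math.exp(-0.5 * t * t) for t in GRID]  # standard normal prior
    theta, se = eap(posterior)
    used = []
    while len(used) < len(bank):
        remaining = [i for i in bank if i not in used]
        # Pick the unused item that is most informative at the current estimate.
        item = max(remaining, key=lambda i: grm_info(theta, *bank[i]))
        a, b = bank[item]
        x = responses[item]
        posterior = [w * grm_probs(t, a, b)[x] for t, w in zip(GRID, posterior)]
        used.append(item)
        theta, se = eap(posterior)
        if se <= se_stop:
            break
    return theta, se, used

# Hypothetical mini-bank (parameters from Table 4) and a hypothetical response pattern.
example_bank = {
    64: (3.14, [0.41, 1.48, 2.45]),
    63: (2.94, [0.26, 1.58, 2.23]),
    36: (2.64, [-0.29, 1.23, 2.12]),
    5: (0.84, [0.13, 3.17, 5.39]),
}
example_responses = {64: 1, 63: 1, 36: 2, 5: 0}

theta_hat, se_hat, administered = simulate_cat(example_bank, example_responses, se_stop=0.5)
```

Setting `se_stop` to 0 makes the loop administer every item, which corresponds to the "None" row of Table 5; larger values end the test earlier at the cost of a larger SE.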

Figure 1 displays the number of administered items along with the test information under each stopping rule. More items were administered to individuals with lower theta, for whom the test information was lower, whereas fewer items were administered to individuals with middle or high theta, for whom the test information was high. For example, under the stopping rule SE (theta) ≤ 0.2, (1) the test information was less than 10 for individuals whose theta ranged from −3 to −1.5, even when the entire item bank was administered to them, while (2) the test information was over 25 for individuals whose theta ranged from 0 to 2.5, with about 20 items administered.

Figure 1

The marginal reliability results for CAT-Depression are shown in Figure 2 and Table 5. As shown in Table 5, the marginal reliabilities under the different stopping rules ranged from 0.73 to 0.97, with an average of 0.87, which is generally acceptable for individual assessment. Figure 2 displays the reliability for each individual under each stopping rule. Under the two stopping rules SE (theta) ≤ 0.2 and SE (theta) ≤ 0.3, the reliabilities were above 0.9 for most individuals, and under the stopping rule SE (theta) ≤ 0.4 the reliabilities were above 0.85 for most individuals (with an average of 0.86; see Table 5). These results again indicated that CAT-Depression had high reliability for most individuals. In addition, for individuals with estimated theta above −2, reliability was highest under the stopping rule SE (theta) ≤ 0.2, whereas for individuals with estimated theta below −2, the reliabilities under the stopping rules SE (theta) ≤ 0.2, SE (theta) ≤ 0.3, and SE (theta) ≤ 0.4 were almost equal. Reliability was always lowest under the stopping rule SE (theta) ≤ 0.6, regardless of estimated theta.
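The link between the SE and reliability columns of Table 5 can be checked with a short calculation. The sketch below assumes that theta is scaled to unit variance, so that reliability is approximately 1 minus the squared SE, and it applies this to the mean SE per stopping rule rather than to individual SEs; the exact formula used in the study may differ, and the function name is illustrative.

```python
# Mean SE(theta) and reported marginal reliability, copied from Table 5.
table5 = {
    "None": (0.15, 0.97),
    "SE <= 0.6": (0.52, 0.73),
    "SE <= 0.5": (0.45, 0.79),
    "SE <= 0.4": (0.38, 0.86),
    "SE <= 0.3": (0.29, 0.91),
    "SE <= 0.2": (0.20, 0.96),
}

def reliability_from_se(se, theta_variance=1.0):
    """Approximate reliability as 1 - SE^2 / Var(theta) (unit variance assumed)."""
    return 1.0 - se ** 2 / theta_variance

approx = {rule: reliability_from_se(se) for rule, (se, _) in table5.items()}

# Largest gap between the approximation and the reported value.
max_gap = max(abs(approx[rule] - reported) for rule, (_, reported) in table5.items())

# Average of the six reported marginal reliabilities.
mean_reported = sum(reported for _, reported in table5.values()) / len(table5)
```

Under this approximation, each reported marginal reliability is reproduced to within 0.01 of the value implied by the mean SE, and the six reported values average 0.87.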

Figure 2

The content validity of CAT-depression

The items in the final item bank (N = 68) were evaluated by five psychiatrists who had been engaged in the diagnosis and treatment of depression for more than 10 years. According to their evaluation, 13 items measured behavior-related depressive symptoms, 31 items measured cognition-related depressive symptoms, 16 items measured mood-related depressive symptoms, 5 items measured somatic-related depressive symptoms, and 3 items measured the symptom of suicidal thoughts. All the symptoms in the ICD-10 were measured; therefore, the final item bank had good content validity.

The criterion-related validity of CAT-depression

Pearson's correlations between the theta estimated by CAT-Depression and three depression-related scales (i.e., CES-D, SDS, and PHQ-9) were calculated to explore the criterion-related validity of CAT-Depression. As documented in Table 6, the correlations with the CES-D ranged from 0.82 to 0.94 across stopping rules, while the correlations with the SDS and PHQ-9 ranged from 0.74 to 0.83 and from 0.64 to 0.74, respectively. These results indicated that CAT-Depression had acceptable and reasonable criterion-related validity regardless of the stopping rule used.

Table 6

| Stopping rule | CES-D (95% CI) | SDS (95% CI) | PHQ-9 (95% CI) |
|---|---|---|---|
| None | 0.941 (0.934–0.947) | 0.836 (0.817–0.852) | 0.740 (0.713–0.766) |
| SE (theta) ≤ 0.6 | 0.825 (0.806–0.843) | 0.742 (0.715–0.767) | 0.641 (0.605–0.674) |
| SE (theta) ≤ 0.5 | 0.838 (0.820–0.855) | 0.752 (0.725–0.776) | 0.663 (0.629–0.695) |
| SE (theta) ≤ 0.4 | 0.867 (0.851–0.880) | 0.771 (0.746–0.794) | 0.684 (0.651–0.714) |
| SE (theta) ≤ 0.3 | 0.895 (0.883–0.906) | 0.793 (0.770–0.814) | 0.689 (0.657–0.718) |
| SE (theta) ≤ 0.2 | 0.922 (0.913–0.930) | 0.814 (0.793–0.833) | 0.713 (0.683–0.740) |

Criterion-related validity of CAT-Depression with external criteria scales under different stopping rules.

95% CI, 95% confidence intervals; None, the entire item bank was administered.
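The confidence intervals in Table 6 are consistent with the standard Fisher z-transformation for a correlation coefficient. The sketch below assumes that this method was used and that the sample size was the 1,135 participants reported in the abstract; neither assumption is stated alongside the table, and the function name is illustrative.

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a Pearson correlation via the Fisher z-transformation."""
    z = math.atanh(r)                    # transform r to the z scale
    half = z_crit / math.sqrt(n - 3)     # half-width on the z scale
    return math.tanh(z - half), math.tanh(z + half)

# CES-D and PHQ-9 correlations with no stopping rule (Table 6), assuming n = 1135.
cesd_low, cesd_high = fisher_ci(0.941, 1135)
phq_low, phq_high = fisher_ci(0.740, 1135)
```

With these inputs the intervals come out close to the tabulated (0.934–0.947) and (0.713–0.766), which suggests, but does not prove, that the table was computed this way.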

The predictive utility (sensitivity and specificity) of CAT-depression

The ROC analysis for CAT-Depression is presented in Table 7. When no stopping rule was applied, the AUC values of CAT-Depression with the CES-D, SDS, and PHQ-9 as the classification criteria were 0.977 (sensitivity = 0.905, specificity = 0.930, cut-off = 0.025), 0.898 (sensitivity = 0.787, specificity = 0.866, cut-off = −0.465), and 0.886 (sensitivity = 0.764, specificity = 0.860, cut-off = −0.275), respectively. Under the stopping rule SE (theta) ≤ 0.6, the AUC values dropped to 0.916 (sensitivity = 0.855, specificity = 0.804, cut-off = −0.025), 0.844 (sensitivity = 0.707, specificity = 0.837, cut-off = −0.232), and 0.831 (sensitivity = 0.645, specificity = 0.873, cut-off = −0.024), respectively. Even so, the AUC values under all stopping rules remained higher than 0.7, the value commonly used as the lower bound for moderate predictive utility. Overall, the sensitivity and specificity of CAT-Depression were both acceptable. For example, with the CES-D as the classification criterion for depression, the sensitivity and specificity of CAT-Depression were 0.937 and 0.851, respectively, under the stopping rule SE (theta) ≤ 0.2.

Table 7

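| Stopping rule | CES-D AUC (95% CI) | CES-D Cut-off | CES-D Se | CES-D Sp | CES-D YI | SDS AUC (95% CI) | SDS Cut-off | SDS Se | SDS Sp | SDS YI | PHQ-9 AUC (95% CI) | PHQ-9 Cut-off | PHQ-9 Se | PHQ-9 Sp | PHQ-9 YI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| None | 0.977 (0.971–0.984) | 0.025 | 0.905 | 0.930 | 0.835 | 0.898 (0.878–0.918) | −0.465 | 0.787 | 0.866 | 0.653 | 0.886 (0.867–0.905) | −0.275 | 0.764 | 0.860 | 0.624 |
| SE (theta) ≤ 0.6 | 0.916 (0.901–0.932) | −0.025 | 0.855 | 0.804 | 0.659 | 0.844 (0.817–0.871) | −0.232 | 0.707 | 0.837 | 0.544 | 0.831 (0.806–0.856) | −0.024 | 0.645 | 0.873 | 0.518 |
| SE (theta) ≤ 0.5 | 0.919 (0.904–0.935) | −0.019 | 0.848 | 0.822 | 0.670 | 0.851 (0.826–0.876) | −0.537 | 0.822 | 0.733 | 0.555 | 0.833 (0.808–0.858) | −0.012 | 0.611 | 0.893 | 0.504 |
| SE (theta) ≤ 0.4 | 0.938 (0.925–0.951) | −0.105 | 0.926 | 0.805 | 0.731 | 0.856 (0.832–0.881) | −0.314 | 0.720 | 0.856 | 0.577 | 0.847 (0.824–0.871) | −0.020 | 0.651 | 0.890 | 0.54 |
| SE (theta) ≤ 0.3 | 0.951 (0.940–0.962) | −0.013 | 0.879 | 0.869 | 0.749 | 0.869 (0.846–0.892) | −0.306 | 0.732 | 0.861 | 0.593 | 0.852 (0.828–0.875) | −0.084 | 0.679 | 0.883 | 0.562 |
| SE (theta) ≤ 0.2 | 0.963 (0.954–0.972) | −0.094 | 0.937 | 0.851 | 0.788 | 0.886 (0.864–0.907) | −0.455 | 0.788 | 0.851 | 0.639 | 0.865 (0.843–0.886) | −0.125 | 0.682 | 0.886 | 0.568 |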

The predictive utility (sensitivity and specificity) of the CAT-Depression under different stopping rules.

95% CI, 95% confidence intervals; None, No stopping rule was applied; AUC, Area Under Curve; Se, Sensitivity; Sp, Specificity; YI, Youden-Index.
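The Youden index (YI) reported in Table 7 is sensitivity plus specificity minus 1, and a cut-off on theta is commonly chosen to maximise it. The sketch below shows both steps under stated assumptions: an individual is flagged as depressed when the theta estimate is at or above the cut-off, ties are resolved in favour of the lowest cut-off, and the scores and labels are a hypothetical toy example rather than the study's data.

```python
def youden_index(se, sp):
    """Youden index: sensitivity + specificity - 1."""
    return se + sp - 1.0

def best_cutoff(scores, labels):
    """Return (cutoff, se, sp, yi) maximising the Youden index.

    scores: theta estimates; labels: 1 = depressed by the criterion scale, 0 = not.
    A case is flagged positive when score >= cutoff (assumed direction).
    Ties keep the lowest cutoff because candidates are scanned in ascending order.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    best = None
    for c in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= c)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < c)
        se, sp = tp / pos, tn / neg
        yi = youden_index(se, sp)
        if best is None or yi > best[3]:
            best = (c, se, sp, yi)
    return best

# Check against the CES-D row with no stopping rule in Table 7 (Se = 0.905, Sp = 0.930).
yi_cesd_none = youden_index(0.905, 0.930)

# Hypothetical toy data: six theta estimates with criterion labels.
toy_scores = [-1.2, -0.5, 0.3, -0.1, 0.8, 1.5]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_best = best_cutoff(toy_scores, toy_labels)
```

Applied to the sensitivity and specificity pairs in Table 7, `youden_index` reproduces the tabulated YI values up to rounding.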

Discussion

In this study, CAT-Depression was developed using the GRM in a Chinese sample and then the characteristics, marginal reliability, criterion-related validity, predictive utility (sensitivity and specificity), and diagnostic performance of CAT-Depression were investigated.

To construct a high-quality item bank for CAT-Depression, items were carefully selected from ten widely used depression scales based on the symptom criteria for depression in ICD-10. Then, strict checks of unidimensionality and local independence were conducted to examine whether the assumptions of the IRT models were met. Moreover, five commonly used polytomous IRT models were compared on the real data to select the optimal model for CAT-Depression. Finally, analyses of item fit, discrimination, and DIF were carried out. Results showed that (1) the final unidimensional item bank included 68 items, which had local independence, good item fit, high discrimination, and no DIF, and each of which measured at least one symptom criterion of depression in ICD-10; (2) the mean IRT discrimination of the item bank reached 1.784, which showed that the final item bank of CAT-Depression was of high quality; and (3) about 11.46 items on average were used to estimate depression under the stopping rule SE (theta) ≤ 0.3, while only about 4.36 items were needed under the stopping rule SE (theta) ≤ 0.5, and fewer items were administered to individuals with middle or high theta. Additionally, the test information and reliability were high. The test information curves (Figure 1) revealed that the information of the depression item bank peaked on the right side of the depression theta scale, and that more, or even all, of the items were needed for individuals with lower theta. Therefore, small differences in depression are more easily detected among respondents with high depression scores than among respondents with low scores. This result is similar to that of previous studies (Smits et al., 2011). Reise and Waller (2009) suggested that psychopathology constructs may be unipolar, which could explain this result. Moreover, we believe this pattern is to be expected, as the main goal of CAT-Depression is to assess the severity of depression rather than to measure the health of respondents. Thus, it provides more information for persons with depression, allowing them to be diagnosed more precisely.

To further investigate the marginal reliability, criterion-related validity, and predictive utility (sensitivity and specificity) of CAT-Depression, a CAT simulation study with real data was carried out. The results revealed the following. (1) CAT-Depression had acceptable marginal reliability, ranging from 0.73 to 0.97 with an average of 0.87. (2) CAT-Depression had reasonable and acceptable criterion-related validities (ranging from 0.64 to 0.94) with the CES-D, SDS, and PHQ-9. The criterion-related validity and diagnostic performance were greatest when the CES-D was used as the criterion scale. This may be because the theta values of most patients in this study ranged from −2 to 2 (only 4.4% of the patients were outside this range), and Umegaki and Todo (2017) showed that the CES-D provided more information than the SDS and the PHQ-9 within the range of about −2 to 2 of depression severity. (3) The sensitivity and specificity of CAT-Depression were both acceptable; in particular, they were 0.937 and 0.851, respectively, under the stopping rule SE (theta) ≤ 0.2 when the CES-D was used as the classification criterion for depression. In addition, the ROC curve analysis indicated that relaxing the stopping rule had only a limited effect on the diagnostic performance of the CAT theta estimates [AUC(CES-D) = 0.916–0.977; AUC(SDS) = 0.844–0.898; AUC(PHQ-9) = 0.831–0.886]. CAT-Depression had good screening performance, and under all stopping rules the AUC values (0.831–0.977) were higher than the lower bound for moderate predictive utility. The sensitivity (0.611–0.937) and specificity (0.733–0.930) of CAT-Depression were both acceptable. The minimum probability that a depressed individual was correctly identified as depressed was 0.611, and the minimum probability that a non-depressed individual was correctly identified as not depressed was 0.733; both were higher than the chance level (0.5).

Although this study revealed that CAT-Depression had acceptable reliability, validity (including reasonable sensitivity and specificity), and good diagnostic performance, it had some limitations. First, the distribution of item content was uneven: in the final depression item bank, 31 items measured cognition-related depressive symptoms and 16 items measured mood-related depressive symptoms, but only 5 items measured somatic-related depressive symptoms and 3 items measured suicidal ideation. As a result, the accuracy and validity of assessment are slightly higher for individuals with cognition-related and mood-related depressive symptoms than for individuals with somatic-related depressive symptoms and suicidal thoughts. However, depression has an important predictive effect on the morbidity and mortality of somatic diseases (Bush et al., 2001; Di et al., 2006), and suicidal ideation indicates a very severe depressive state. Future studies should add items to CAT-Depression to address these issues. Second, a CAT simulation study with real response data was conducted in this article; a real CAT administration should be conducted in future research to explore the efficiency of CAT-Depression in more depth. Simulated and real CAT administrations may produce different results (Smits et al., 2011), because many factors affect individuals' responses in a real situation, such as answering time, individual mood, and test environment. Fortunately, a prior study showed that the results of a simulated CAT were in line with those of an actual CAT (Kocalevent et al., 2009). Consequently, the present study still has practical significance. Finally, the simulation study indicated that the two stopping rules SE (theta) ≤ 0.3 and SE (theta) ≤ 0.4 may offer the best balance of economy, precision, and validity, and may be the most suitable for actual administration of a depression CAT. Future studies may investigate the economy, precision, and validity of CAT-Depression in practice.

Statements

Author contributions

QT wrote the manuscript. YC and DT guided the writing and the data processing. QL and YZ processed the data.

Funding

This work was supported by the National Natural Science Foundation of China (31660278, 31760288).

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  • Aggen, S. H., Neale, M. C., and Kendler, K. S. (2005). DSM criteria for major depression: evaluating symptom patterns using latent-trait item response models. Psychol. Med. 35, 475–487. doi: 10.1017/S0033291704003563

  • Akaike, H. (1974). A new look at the statistical model identification. Automatic Control IEEE Trans. 19, 716–723. doi: 10.1109/TAC.1974.1100705

  • Almond, R. G., and Mislevy, R. J. (1998). Graphical models and computerized adaptive testing. Appl. Psychol. Meas. 23, 223–237. doi: 10.1177/01466219922031347

  • Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika 43, 561–573. doi: 10.1007/BF02293814

  • Baker, F. B. (1992). Item Response Theory: Parameter Estimation Techniques. New York, NY: Marcel Dekker.

  • Beauducel, A., and Herzberg, P. Y. (2006). On the performance of maximum likelihood versus means and variance adjusted weighted least squares estimation in CFA. Struct. Equation Model. Multidisc. J. 13, 186–203. doi: 10.1207/s15328007sem1302_2

  • Beck, A. T., and Steer, R. A. (1987). The Revised Beck Depression Inventory, Vol. 21. Washington, DC: American University.

  • Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika 37, 29–51. doi: 10.1007/BF02291411

  • Bock, R. D., and Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Appl. Psychol. Meas. 6, 431–444. doi: 10.1177/014662168200600405

  • Bush, D. E., Ziegelstein, R. C., Tayback, M., Richter, D., Stevens, S., Zahalsky, H., et al. (2001). Even minimal symptoms of depression increase mortality risk after acute myocardial infarction. Am. J. Cardiol. 88, 337–341. doi: 10.1016/S0002-9149(01)01675-7

  • Cai, L. (2017). flexMIRT® Version 3.51: Flexible Multilevel Multidimensional Item Analysis and Test Scoring [Computer software]. Chapel Hill, NC: Vector Psychometric Group.

  • Chalmers, R. P. (2012). mirt: a multidimensional item response theory package for the R environment. J. Stat. Softw. 48, 1–29. doi: 10.18637/jss.v048.i06

  • Choi, S. W. (2015). Lordif: Logistic Ordinal Regression Differential Item Functioning Using IRT.

  • R Core Team (2015). R: a language and environment for statistical computing. Computing 14, 12–21. doi: 10.1890/0012-9658(2002)083[3097:CFHIWS]2.0.CO;2

  • Crane, P. K., Gibbons, L. E., Jolley, L., and Van, B. G. (2006). Differential item functioning analysis with ordinal logistic regression techniques. DIFdetect Difwithpar. Med. Care 44, 115–123. doi: 10.1097/01.mlr.0000245183.28384.ed

  • Dennis, C. L., Brown, H. K., and Morrell, J. (2016). Interventions (Other Than Psychosocial, Psychological and Pharmacological) For Preventing Postpartum Depression. John Wiley & Sons, Ltd.

  • Dennis, C. L., and Hodnett, E. (2014). Psychosocial and psychological interventions for treating postpartum depression. Cochrane Database Syst. Rev. 89:92. doi: 10.1002/14651858.CD006116

  • Derogatis, L. (1994). SCL-90-R Symptom Checklist-90-R Administration, Scoring and Procedures Manual. Minneapolis, MN: National Computer Systems.

  • Di, B. M., Lindner, H., Hare, D. L., and Kent, S. (2006). Depression following acute coronary syndromes: a comparison between the Cardiac Depression Scale and the Beck Depression Inventory II. J. Psychosom. Res. 60, 13–20. doi: 10.1016/j.jpsychores.2005.06.003

  • Flens, G., Smits, N., Terwee, C. B., Dekker, J., Huijbrechts, I., and de Beurs, E. (2017). Development of a computer adaptive test for depression based on the Dutch-Flemish version of the PROMIS Item Bank. Eval. Health Prof. 40, 79–105. doi: 10.1177/0163278716684168

  • Fliege, H., Becker, J., Walter, O. B., Bjorner, J. B., Klapp, B. F., and Rose, M. (2005). Development of a computer-adaptive test for depression (D-CAT). Qual. Life Res. 14, 2277–2291. doi: 10.1007/s11136-005-6651-9

  • Forkmann, T., Boecker, M., Norra, C., Eberle, N., Kircher, T., Schauerte, P., et al. (2009). Development of an item bank for the assessment of depression in persons with mental illnesses and physical diseases using Rasch analysis. Rehabil. Psychol. 54, 186–197. doi: 10.1037/a0015612

  • Forkmann, T., Kroehne, U., Wirtz, M., Norra, C., Baumeister, H., Gauggel, S., et al. (2013). Adaptive screening for depression: recalibration of an item bank for the assessment of depression in persons with mental and somatic diseases and evaluation in a simulated computer-adaptive test environment. J. Psychosom. Res. 75, 437–443. doi: 10.1016/j.jpsychores.2013.08.022

  • Gardner, W., Shear, K., Kelleher, K. J., Pajer, K. A., Mammen, O., Buysse, D., et al. (2004). Computerized adaptive measurement of depression: a simulation study. BMC Psychiatry 4:13. doi: 10.1186/1471-244X-4-13

  • Gershon, R. C. (2005). Computer adaptive testing. J. Appl. Meas. 6, 109–127.

  • Gibbons, R. D., Weiss, D. J., Kupfer, D. J., Frank, E., Fagiolini, A., Grochocinski, V. J., et al. (2008). Using computerized adaptive testing to reduce the burden of mental health assessment. Psychiatric Serv. 59, 361–368. doi: 10.1176/ps.2008.59.4.361

  • Gibbons, R. D., Weiss, D. J., Pilkonis, P. A., Frank, E., Moore, T., Kim, J. B., et al. (2012). Development of a computerized adaptive test for depression. Arch. Gen. Psychiatry 69, 1104–1112. doi: 10.1001/archgenpsychiatry.2012.14

  • Hamilton, M. (1967). Development of a rating scale for primary depressive illness. Br. J. Soc. Clin. Psychol. 6, 278–296. doi: 10.1111/j.2044-8260.1967.tb00530.x

  • Hathaway, S. R., and McKinley, J. C. (1942). A multiphasic personality schedule (Minnesota): III. Meas. Symptom. Depress. J. Psychol. 14, 73–84. doi: 10.1080/00223980.1942.9917111

  • Hollon, S. D., and Kendall, P. C. (1980). Cognitive self-statements in depression: development of an automatic thoughts questionnaire. Cogn. Ther. Res. 4, 383–395. doi: 10.1007/BF01178214

  • Kang, T., and Chen, T. T. (2008). Performance of the generalized S-X2 item fit index for polytomous IRT models. J. Edu. Meas. 45, 391–406. doi: 10.1111/j.1745-3984.2008.00071.x

  • Kline, R. B. (2010). Principles and Practice of Structural Equation Modeling, 3rd Edn. New York, NY: Guilford Press.

  • Kocalevent, R. D., Rose, M., Becker, J., Walter, O. B., Fliege, H., Bjorner, J. B., et al. (2009). An evaluation of patient-reported outcomes found computerized adaptive testing was efficient in assessing stress perception. J. Clin. Epidemiol. 62, 278–287. doi: 10.1016/j.jclinepi.2008.03.003

  • Koenig, H. G., Cohen, H. J., Blazer, D. G., Meador, K. G., and Westlund, R. (1992). A brief depression scale for use in the medically ill. Int. J. Psychiatry Med. 22, 183–195. doi: 10.2190/M1F5-F40P-C4KD-YPA3

  • Kraemer, H. C., and Kupfer, D. J. (2006). Size of treatment effects and their importance to clinical research and practice. Biol. Psychiatry 59, 990–996. doi: 10.1016/j.biopsych.2005.09.014

  • Kreitzberg, C. B., and Jones, D. H. (1980). An empirical study of the broad range tailored test of verbal ability. ETS Res. Report 1980, i–232. doi: 10.1002/j.2333-8504.1980.tb01195.x

  • Kroenke, K., Spitzer, R. L., and Williams, J. B. (2001). The PHQ-9: validity of a brief depression severity measure. J. Gen. Intern. Med. 16, 606–613. doi: 10.1046/j.1525-1497.2001.016009606.x

  • Magis, D., and Barrada, J. R. (2017). Computerized Adaptive Testing with R: recent updates of the package catR. J. Stat. Softw. 76, 1–19. doi: 10.18637/jss.v076.c01

  • Magis, D., and Raiche, G. (2012). Random generation of response patterns under computerized adaptive testing with the R package catR. J. Stat. Softw. 48, 1–31. doi: 10.18637/jss.v048.i08

  • Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika 47, 149–174. doi: 10.1007/BF02296272

  • Meijer, R. R., and Nering, M. L. (1999). Computerized adaptive testing: overview and introduction. Appl. Psychol. Meas. 23, 187–194. doi: 10.1177/01466219922031310

  • Muraki, E. (1997). A Generalized Partial Credit Model. Appl. Psychol. Meas. 16, 153–164. doi: 10.1007/978-1-4757-2691-6_9

  • Muthén, L. K., and Muthén, B. O. (2012). Mplus Version 7 User's Guide. Los Angeles, CA: Muthén & Muthén.

  • National Health Family Commission of the People's Republic of China (2015). October 9, 2015 National Health and Family of the People's Republic of China Commission routine press conference written record. Available online at: http://www.nhfpc.gov.cn/zhuz/xwfb/201510/f9658b3c67aa437c92f0755fd901b638.shtml

  • Radloff, L. S. (1977). The CES-D Scale: a self-report depression scale for research in the general population. Appl. Psychol. Meas. 1, 385–401. doi: 10.1177/014662167700100306

  • Reeve, B. B., Hays, R. D., Bjorner, J. B., Cook, K. F., Crane, P. K., Teresi, J. A., et al. (2007). Psychometric evaluation and calibration of health-related quality of life item banks: plans for the Patient-Reported Outcomes Measurement Information System (PROMIS). Med. Care 45, S22–S31. doi: 10.1097/01.mlr.0000250483.85507.04

  • Reise, S. P., and Waller, N. G. (2009). Item response theory and clinical measurement. Annu. Rev. Clin. Psychol. 5, 27–48. doi: 10.1146/annurev.clinpsy.032408.153553

  • Resnik, L., Tian, F., Ni, P., and Jette, A. (2012). Computer-adaptive test to measure community reintegration of Veterans. J. Rehabil. Res. Dev. 49, 557–566. doi: 10.1682/JRRD.2011.04.0081

  • Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monogr. 17, 5–17. doi: 10.1007/BF03372160

  • Samejima, F. (1994). Estimation of reliability coefficients using the test information function and its modifications. Appl. Psychol. Meas. 18, 229–244. doi: 10.1177/014662169401800304

  • Schisterman, E. F., Perkins, N. J., Liu, A., and Bondell, H. (2005). Optimal cut-point and its corresponding Youden Index to discriminate individuals using pooled blood samples. Epidemiology 16, 73–81. doi: 10.1097/01.ede.0000147512.81966.ba

  • Schwarz, G. (1978). Estimating the Dimension of a Model. Annals Stat. 6, 461–464. doi: 10.1214/aos/1176345415

  • Smits, N., Cuijpers, P., and van Straten, A. (2011). Applying computerized adaptive testing to the CES-D scale: a simulation study. Psychiatry Res. 188, 147–155. doi: 10.1016/j.psychres.2010.12.001

  • Spiegelhalter, D. J., Best, N. G., and Carlin, B. P. (1998). Bayesian deviance, the effective number of parameters, and the comparison of arbitrarily complex models. Res. Rep. 98–009.

  • Stansbury, J. P., Ried, L. D., and Velozo, C. A. (2006). Unidimensionality and bandwidth in the Center for Epidemiologic Studies Depression (CES-D) Scale. J. Pers. Assess. 86, 10–22. doi: 10.1207/s15327752jpa8601_03

  • Umegaki, Y., and Todo, N. (2017). Psychometric properties of the Japanese CES-D, SDS, and PHQ-9 Depression Scales in University Students. Psychol. Assess. 29, 354–359. doi: 10.1037/pas0000351

  • Yen, W. M. (1993). Scaling performance assessments: strategies for managing local item dependence. J. Edu. Meas. 30, 187–213. doi: 10.1111/j.1745-3984.1993.tb00423.x

  • Zigmond, A. S., Snaith, R. P., and Kitamura, T. (1983). Hospital anxiety and depression scale. Br. J. Psychiatry 67, 361–370. doi: 10.1111/j.1600-0447.1983.tb09716.x

  • Zumbo, B. D. (1999). A Handbook on the Theory and Methods of Differential Item Functioning (DIF): Logistic Regression Modeling as a Unitary Framework for Binary and Likert-Type (Ordinal) Item Scores. Ottawa, ON: National Defense Headquarters.

  • Zung, W. W. K. (1965). A self-rating depression scale. Arch. Gen. Psychiatry 12, 63–70. doi: 10.1001/archpsyc.1965.01720310065008

Keywords

computerized adaptive testing, item response theory, depression assessment, IRT models, measurement

Citation

Tan Q, Cai Y, Li Q, Zhang Y and Tu D (2018) Development and Validation of an Item Bank for Depression Screening in the Chinese Population Using Computer Adaptive Testing: A Simulation Study. Front. Psychol. 9:1225. doi: 10.3389/fpsyg.2018.01225

Received

03 February 2018

Accepted

27 June 2018

Published

18 July 2018

Volume

9 - 2018

Edited by

Sergio Machado, Salgado de Oliveira University, Brazil

Reviewed by

Davide Marengo, Università degli Studi di Torino, Italy; Ioannis Tsaousis, University of Crete, Greece


*Correspondence: Dongbo Tu; Yan Cai

This article was submitted to Quantitative Psychology and Measurement, a section of the journal Frontiers in Psychology
