Are the PHQ-9 and GAD-7 Suitable for Use in India? A Psychometric Analysis

De Man, Jeroen; Absetz, Pilvikki; Sathish, Thirunavukkarasu; Desloge, Allissa; Haregu, Tilahun; Oldenburg, Brian; Johnson, Leslie C. M.; Thankappan, Kavumpurathu Raman; Williams, Emily D.

doi:10.3389/fpsyg.2021.676398

ORIGINAL RESEARCH article

Front. Psychol., 13 May 2021

Sec. Quantitative Psychology and Measurement

Volume 12 - 2021 | https://doi.org/10.3389/fpsyg.2021.676398


Jeroen De Man¹*

Pilvikki Absetz²,³

Thirunavukkarasu Sathish⁴

Allissa Desloge⁵

Tilahun Haregu⁵

Brian Oldenburg⁵

Leslie C. M. Johnson⁶,⁷

Kavumpurathu Raman Thankappan⁸

Emily D. Williams⁹

¹Department of Family Medicine and Population Health, University of Antwerp, Antwerp, Belgium
²Collaborative Care Systems Finland, Tampere University, Tampere, Finland
³University of Eastern Finland, Kuopio, Finland
⁴Population Health Research Institute, McMaster University, Hamilton, ON, Canada
⁵Melbourne School of Population and Global Health, University of Melbourne, Melbourne, VIC, Australia
⁶Department of Family and Preventive Medicine, School of Medicine, Emory University, Atlanta, GA, United States
⁷Hubert Department of Global Health, Rollins School of Public Health, Emory University, Atlanta, GA, United States
⁸Department of Public Health and Community Medicine, Central University of Kerala, Kasaragod, India
⁹School of Health Sciences, University of Surrey, Guildford, United Kingdom

Background: Cross-cultural evidence on the factorial structure and invariance of the PHQ-9 and the GAD-7 is lacking for South Asia. Recommendations on the use of unit-weighted scores of these scales (the sum of items’ scores) are not well-founded. This study aims to address these contextual and methodological gaps using data from a rural Indian population.

Methods: The study surveyed 1,209 participants of the Kerala Diabetes Prevention Program aged 30–60 years (n at risk of diabetes = 1,007 and n with diabetes = 202). 1,007 participants were surveyed over 2 years using the PHQ-9 and the GAD-7. Bifactor-(S – 1) modeling and multigroup confirmatory factor analysis were used.

Results: Factor analysis supported the existence of a somatic and cognitive/affective subcomponent for both scales, but less explicitly for the GAD-7. Hierarchical omega values were 0.72 for the PHQ-9 and 0.76 for the GAD-7. Both scales showed full scalar invariance and full or partial residual invariance across age, gender, education, status of diabetes and over time. Effect sizes between categories measured by unit-weighted scores versus latent means followed a similar trend but were systematically higher for the latent means. For both disorders, female gender and lower education were associated with higher symptom severity scores, which corresponds with regional and global trends.

Conclusions: For both scales, psychometric properties were comparable to studies in western settings. Distinct clinical profiles (somatic-cognitive) were supported for depression, and to a lesser extent for anxiety. Unit-weighted scores of the full scales should be used with caution, while scoring subscales is not recommended. The stability of these scales supports their use and allows for meaningful comparison across tested subgroups.

Clinical Trial Registration: Australia and New Zealand Clinical Trials Registry: ACTRN12611000262909

http://www.anzctr.org.au/Trial/Registration/TrialReview.aspx?id=336603&isReview=true.

Introduction

Depressive and anxiety disorders are among the ten leading causes of global disability (Vos et al., 2015), and more than 80% of people who have mental disorders reside in low- and middle-income countries (LMICs) (World Health Organization, 2008). In India, depressive and anxiety disorders have each been shown to have a crude prevalence of 3.3% and were responsible for 33.8% (29.5–38.5) and 19.0% (15.9–22.4), respectively, of disability-adjusted life years attributable to mental disorders (Sagar et al., 2020).

Self-reported measurement tools are crucial to estimate the burden of depressive and anxiety disorders at population level, to determine how this burden relates to subgroup characteristics (e.g., sociodemographic characteristics, other health conditions, etc.), and to measure the effect of public health interventions. At individual level, these tools can enhance the reliability of diagnoses and their ease of use makes them particularly useful in settings with poor mental health service provision and a lack of specialized staff (Mughal et al., 2020). However, most of the established self-reported measurement tools have been developed and evaluated in Europe and North America and may not perform in an equivalent way in different cultures or settings (Dere et al., 2015; Mughal et al., 2020).

The Patient Health Questionnaire (PHQ-9) and Generalized Anxiety Disorder (GAD-7) assessment can be used as screening tools as well as measures of symptom severity for depression (PHQ-9) and different types of anxiety (GAD-7) (Kroenke et al., 2001; Spitzer et al., 2006). Both tools are based on the Diagnostic and Statistical Manual of Mental Disorders criteria and have been found to be valid measures for detecting and monitoring depression or anxiety disorders in western countries (Kroenke et al., 2010). In South Asia, a region home to one quarter of the world’s population, assessment of these scales’ essential psychometric properties is lacking (Lamela et al., 2020). Below, we discuss why assessment of such properties is crucial and summarize the current level of evidence, focusing on: (1) the factorial structure; (2) the use of unit-weighted scores (i.e., the sum score of the item responses); and (3) invariance across subgroups.

The factorial structure of data from a specific population can provide empirical support for the potential existence of subdimensions of depression and anxiety disorders, which can differ across cultures and settings (Leong and Tak, 2003). For instance, studies have revealed that somatic symptoms are more common in Indian people with depression compared with western populations (Grover et al., 2010). For the PHQ-9 and the GAD-7, which were initially intended to be used as unidimensional scales, a variety of measurement models have been proposed based on confirmatory factor analysis (CFA) investigations (Doi et al., 2018a,b; Moreno et al., 2019; Lamela et al., 2020). In these studies, mainly conducted in western settings, researchers have found both scales to fit unidimensional and multidimensional models (Lamela et al., 2020), a second-order model specifically for the GAD-7 (Doi et al., 2018a) and a bifactor model specifically for the PHQ-9 (Doi et al., 2018b). A recurrent finding is the identification of a factor consisting of items reflecting a somatic aspect and a factor consisting of items reflecting a cognitive/affective aspect (Beard and Björgvinsson, 2014; Lamela et al., 2020). For both disorders, these subdimensions correspond with clinical representations and pathophysiologic insights suggesting different subtypes of these conditions (Portman et al., 2011; Duivis et al., 2013; Lamela et al., 2020). For depression in particular, distinguishing between these subtypes has been shown to be important with regard to treatment and prognosis (Lamela et al., 2020). However, to our knowledge, no studies have assessed the factorial structure of these scales in India or anywhere else in South Asia. To examine the cross-cultural validity of these subdimensions, evidence on the factorial structure of these scales based on data from different settings is needed.

A key question when using self-reported measurement tools is to what extent the unit-weighted scores (i.e., the sum score of all or a subset of item responses) can be interpreted as a unidimensional representation of a specific construct. For instance, to what extent does the sum of all scale item scores represent depressive or anxiety symptom severity as an overarching construct? Items of these scales may belong to different subdimensions, which may preclude the interpretation of the sum of their scores. In addition, clinicians or researchers may want to sum item responses belonging to one of these subdimensions and use this score as a reflection of that specific subdimension. Recent studies have defended scoring the total scale as well as its subdimensions for the PHQ-9 (Doi et al., 2018b; Lamela et al., 2020) and for the GAD-7 (Doi et al., 2018a), guided by the goodness of fit of CFA models. These recommendations are problematic for several reasons. First, Reise et al. (2013) argue that model selection based on CFA rarely informs researchers about the degree of multidimensionality in a way that would justify the use of unit-weighted scores of subscales or total scales. Second, methodologists have questioned whether reporting both total and subdimension scores of the same scale has any added value (Rodriguez et al., 2016a). Third, consensus is lacking on the most adequate CFA model for both the PHQ-9 and the GAD-7. To guide researchers on this matter, alternative analytic techniques have been proposed, such as the Haberman procedure (Haberman, 2008) and the use of model-based reliability indices (Rodriguez et al., 2016b). To our knowledge, these techniques have not yet been applied to either the PHQ-9 or the GAD-7. Application of these techniques to data collected from different settings may redress current inconsistencies in recommendations on the use of unit-weighted scores.

Measurement invariance of a scale is an essential psychometric property when studying scale scores over time or between subgroups of a population (Vandenberg and Lance, 2000). Measurement invariance across subgroups corresponds to a latent construct being represented by the scale items in a similar way and suggests that this construct has a similar meaning to these groups. This implies that assessing invariance is crucial for the comparison of subgroups. When measurement invariance is violated across subgroups, prevalence or severity of a disorder may be under- or overestimated across these groups. Finally, analysis of invariance can provide insight into how the interpretation of a scale and the perception of an illness may differ across subgroups. This may have consequences with regard to the diagnosis and how people cope with their illness. Lack of invariance can lead to substantial bias when comparing different subgroups, especially when using unit-weighted scores (Steinmetz, 2013). Moreover, in addition to scalar invariance, which is required to compare subgroups through a structural equation modeling (SEM) framework, the use of unit-weighted scores requires invariant indicator reliability (Vandenberg and Lance, 2000). Invariance of the PHQ-9 and GAD-7 scales has been supported by studies conducted in western settings for gender, ethnic and sociodemographic differences, but only a few studies have assessed invariance for age and over time (Hinz et al., 2017; Lamela et al., 2020). Moreover, other depression measures have shown non-invariance across different age groups (Estabrook et al., 2015). Only a handful of studies have tested invariance in LMICs, and these focused on specific populations such as college students or pregnant women (Miranda and Scoppetta, 2018). To our knowledge, no studies have assessed any form of invariance of the PHQ-9 and GAD-7 in South Asian populations.

In sum, essential psychometric properties of the PHQ-9 and the GAD-7 remain underexplored among LMIC populations, in particular in South Asia. Furthermore, recommendations on the use of unit-weighted scores of these scales have been poorly supported.

The aim of this study was to simultaneously address these contextual and methodological gaps based on a state-of-the-art analytic approach and using data from a rural Indian population. Specifically, we aimed to: (1) assess existence of subdimensions in this population by assessing the factorial structure of data collected through these scales; (2) determine how precisely total and subscale scores reflect their intended constructs; (3) assess invariance across subgroups of age, gender, level of education, status of diabetes (at risk of versus with type 2 diabetes [T2D]) and different measurement occasions; and (4) if invariance could be established, compare the results of unit-weighted scores with latent means analysis in the assessment of differences in symptom severity across the same subgroups. With this last objective, we sought to assess: (1) the accuracy of unit-weighted scores when using latent mean levels as a standard; and (2) whether differences across subgroups correspond with regional trends that would support the validity of our data.

Materials and Methods

Participants

The analysis was based on data collected from participants who took part in the baseline, cross-sectional, community-based survey of a diabetes prevention program in the state of Kerala in India: the Kerala Diabetes Prevention Program (K-DPP). A detailed description of the study design, participant screening and recruitment has been previously published (Sathish et al., 2013, 2017, 2019). Briefly, K-DPP was a cluster randomized controlled trial conducted in 60 randomly selected polling areas (electoral divisions) from a taluk (sub-district) in Trivandrum district of Kerala state. People aged 30–60 years were selected randomly from the electoral roll of the 60 polling areas and were approached at their households by trained data collectors. We screened 3,421 individuals for eligibility; those who had a history of diabetes or other major chronic conditions, were taking drugs influencing glucose tolerance (e.g., steroids), or were illiterate in the local language were excluded (n = 835). Potentially eligible individuals (n = 2,586) were screened with the Indian Diabetes Risk Score, and those with a score of ≥60 (n = 1,529) were invited to undergo a 2-h oral glucose tolerance test (OGTT) at community-based clinics. Of these, 1,209 attended the clinics, of whom 1,007 were at high risk of developing diabetes and 202 were diagnosed with diabetes. Participant screening and recruitment were completed between January and October 2013. The 1,007 individuals at high risk were followed up 1 and 2 years after enrollment. These follow-up points were used for the analysis of invariance at different measurement occasions. Mean age of participants was 46.0 (SD: 7.5), 45.8% were female, and 95% were married. Primary, secondary and higher education had been attended by 25.3, 51.3, and 23.4% of participants, respectively (Sathish et al., 2017).

Measures and Data Collection

Both the nine-item PHQ-9 and the seven-item GAD-7 use 4-point Likert-scaled items ranging from 0 (not at all) to 3 (nearly every day) (Kroenke et al., 2001; Spitzer et al., 2006). For the GAD-7, items 4, 5, and 6 have been found to reflect a somatic dimension (Rutter and Brown, 2017). For the PHQ-9, this was the case for items 3, 4, and 5 and in some studies items 7 and 8 (Lamela et al., 2020). The scales were translated to Malayalam and back-translated to English and were pilot tested. Interviews were administered by trained interviewers.
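As an illustration of what a unit-weighted score amounts to for these scales, the following minimal Python sketch sums item responses coded 0 to 3, either over the full scale or over the somatic items listed above. The function name, the constants and the example respondent are hypothetical and not part of the study materials.

```python
# 1-based item positions reported as somatic in the text above.
PHQ9_SOMATIC = (3, 4, 5)
GAD7_SOMATIC = (4, 5, 6)


def unit_weighted_score(responses, n_items, items=None):
    """Sum item responses coded 0-3; `items` selects a 1-based subset."""
    if len(responses) != n_items:
        raise ValueError(f"expected {n_items} responses, got {len(responses)}")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("responses must be integers from 0 to 3")
    selected = range(1, n_items + 1) if items is None else items
    return sum(responses[i - 1] for i in selected)


# Made-up PHQ-9 responses for a single respondent (illustration only).
phq9 = [1, 0, 2, 3, 1, 0, 0, 1, 0]
total = unit_weighted_score(phq9, 9)
somatic = unit_weighted_score(phq9, 9, PHQ9_SOMATIC)
print(total, somatic)
```

Every item contributes equally to such a score, which is why its interpretation depends on the dimensionality and invariance properties examined in the analyses that follow.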

Data-Analysis

Confirmatory Factor Analysis

To evaluate whether our data supported a two-dimensional model, a selection of models was tested with specifications based on theory and findings from previous studies. The aim of this analysis was to test whether a model with one or two factors would be acceptable, rather than to select a specific model solely on the basis of a better model fit. For the PHQ-9, we tested a 1-factor model and a correlated 2-factor model with items 3, 4, and 5 loading onto one factor, as proposed by others (Lamela et al., 2020). For the GAD-7, we tested a 1-factor model with and without correlated residuals for items 4, 5, and 6 and a correlated 2-factor model with items 4, 5, and 6 loading onto one factor (Beard and Björgvinsson, 2014; Rutter and Brown, 2017; Doi et al., 2018a). Assessment of these models was based on their χ²-values, the item indicators’ loadings and the following goodness-of-fit indices, sample-corrected for non-normal data (Brosseau-Liard et al., 2012), with target values as proposed by Hu and Bentler (1999): the comparative fit index (CFI) (≥0.95), the Tucker-Lewis index (TLI) (≥0.95), the root mean square error of approximation (RMSEA) (≤0.06), and the standardized root mean square residual (SRMR) (≤0.08). Since the items’ distributions departed from normality, we used maximum likelihood estimation with robust (Huber–White) standard errors and a scaled test statistic that is (asymptotically) equal to the Yuan–Bentler test statistic.
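The target values above can be applied mechanically once the fit indices have been obtained from the fitted model. The sketch below is a hypothetical helper in Python, not the authors' code; the index values in the example are invented.

```python
# Target values quoted in the text (Hu and Bentler, 1999).
CUTOFFS = {
    "cfi": (">=", 0.95),
    "tli": (">=", 0.95),
    "rmsea": ("<=", 0.06),
    "srmr": ("<=", 0.08),
}


def check_fit(indices):
    """Return {index_name: bool} telling whether each target value is met."""
    result = {}
    for name, (direction, cutoff) in CUTOFFS.items():
        value = indices[name]
        result[name] = value >= cutoff if direction == ">=" else value <= cutoff
    return result


# Invented fit indices for illustration; not taken from Table 1.
example = {"cfi": 0.96, "tli": 0.94, "rmsea": 0.05, "srmr": 0.03}
verdict = check_fit(example)
print(verdict)
print("all targets met:", all(verdict.values()))
```

A check of this kind only summarizes the cut-offs; as stated above, the χ²-values and item loadings were also considered when judging the models.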

Haberman Procedure

This procedure assesses whether the subscore provides a more accurate estimate of the construct it measures than the total score (Haberman, 2008). The proportional reduction of mean squared error (PRMSE) based on total scores was compared with the PRMSE based on subscale scores. If the latter is smaller, there is no psychometric justification for reporting the subscale scores. In addition, a hypothesis test was performed, in which reporting of subscale scores was considered justified if Olkin’s Z statistic was higher than 1.64 (Sinharay, 2019). However, this procedure does not test whether subscale scores provide meaningful information once the total score is taken into account (Reise et al., 2013). To assess this, we used the indices described in the next section.
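The comparison at the heart of this procedure can be sketched as follows. The sketch assumes the usual classical test theory expressions: the PRMSE of the observed subscore equals the subscore reliability, and the PRMSE of the observed total score equals the squared correlation between the true subscore and the true total score multiplied by the total score reliability. The function names and input values are hypothetical, the inputs are taken as already estimated, and the Olkin’s Z test is not covered.

```python
def prmse_subscore(subscore_reliability):
    """PRMSE when the true subscore is predicted from the observed subscore."""
    return subscore_reliability


def prmse_total(corr_true_sub_true_total, total_reliability):
    """PRMSE when the true subscore is predicted from the observed total score."""
    return corr_true_sub_true_total ** 2 * total_reliability


def subscore_has_added_value(subscore_reliability,
                             corr_true_sub_true_total,
                             total_reliability):
    """True if the subscore predicts its own true score better than the total."""
    return (prmse_subscore(subscore_reliability)
            > prmse_total(corr_true_sub_true_total, total_reliability))


# Invented inputs: a short subscale that overlaps strongly with the total.
print(subscore_has_added_value(0.70, 0.95, 0.85))
# Invented inputs: a subscale that is more distinct from the total.
print(subscore_has_added_value(0.70, 0.70, 0.85))
```

Short subscales tend to have low reliability, so when the true subscore correlates highly with the true total score, the total score can be the better predictor of the subscale construct.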

Model-Based Psychometric Indices

Rodriguez et al. (2016a,b) proposed the following indices to assess the degree to which total and subscale scores reflect their intended constructs. Total omega (omegaT) estimates the proportion of variance in the unit-weighted total score due to all common factors, including the general and group factors (Zinbarg et al., 2005). Hierarchical omega (omegaH) estimates the proportion of variance in the total score due to the general factor. Omega hierarchical subscale (omegaHS) estimates the proportion of variance in a subscale score due to a specific group factor after controlling for the general factor. Recommended minimum values have been described for omegaH (0.70) and for omegaHS (0.50) (Reise et al., 2013; Rodriguez et al., 2016a).
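To make these definitions concrete, the sketch below computes the three indices from standardized loadings of a model with one general factor and a single group factor. It assumes standardized items (so each residual variance is one minus the squared loadings), orthogonal factors and uncorrelated residuals. The function and the loadings are hypothetical illustrations, not the semTools implementation and not the study's estimates.

```python
def omega_indices(general, specific, group_items):
    """
    general:     standardized general-factor loadings, one per item.
    specific:    standardized group-factor loadings, one per item
                 (0.0 for items that do not load on the group factor).
    group_items: 0-based indices of the items forming the subscale.
    """
    # Residual variance of each standardized item.
    residual = [1.0 - g ** 2 - s ** 2 for g, s in zip(general, specific)]

    gen_var = sum(general) ** 2                            # general factor, total score
    grp_var = sum(specific[i] for i in group_items) ** 2   # group factor
    total_var = gen_var + grp_var + sum(residual)          # model-implied total variance

    # Variance components of the unit-weighted subscale score.
    sub_gen = sum(general[i] for i in group_items) ** 2
    sub_res = sum(residual[i] for i in group_items)

    return {
        "omegaT": (gen_var + grp_var) / total_var,
        "omegaH": gen_var / total_var,
        "omegaHS": grp_var / (sub_gen + grp_var + sub_res),
    }


# Invented loadings for a 4-item scale whose last two items share a group factor.
indices = omega_indices(
    general=[0.7, 0.7, 0.6, 0.6],
    specific=[0.0, 0.0, 0.5, 0.5],
    group_items=[2, 3],
)
print({name: round(value, 3) for name, value in indices.items()})
```

With several group factors, the numerator of omegaT and the total variance would include one squared sum of loadings per group factor.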

The indices were calculated based on a confirmatory bifactor modeling approach (Reise et al., 2013) using the semTools package in R (Jorgensen et al., 2020). A bifactor model was deemed appropriate for calculating these indices because it is a less restrictive model (than, e.g., a hierarchical model) and because the structure of the response data was assumed to be consistent with a bifactor structure: i.e., a single general trait reflecting the target construct and the presence of subdomain constructs due to clusters of similar items (Rodriguez et al., 2016a). Bifactor models are specified with all items loading onto a general factor as well as onto group factors representing the subdomains. In addition, group factors are assumed to be uncorrelated with other group factors as well as with the general factor. However, since only two group factors were present, these bifactor models are not identified, which implies that an infinite number of hierarchical factor models can be found for the same covariance matrix (Zinbarg et al., 2007). To address this problem, we used a modified version of the traditional bifactor model: the bifactor-(S − 1) model described by Eid et al. (2016). The name of this model refers to the number of specific group factors being one fewer than the number of the scale’s subdimensions being considered. This modification makes the items of the discarded group factor a reference for the general factor and solves the identification problem (Eid et al., 2016).
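For a scale with two subdimensions, the bifactor-(S − 1) structure can be written out as below. This is a schematic sketch only: the notation (G for the general factor, S for the single remaining specific factor, τ for intercepts, ε for residuals) is introduced here for illustration, and which subdimension serves as the reference is left open because it is not stated in this section.

```latex
\begin{align}
Y_i &= \tau_i + \lambda_{G,i}\, G + \varepsilon_i,
  && i \in \text{reference items},\\
Y_j &= \tau_j + \lambda_{G,j}\, G + \lambda_{S,j}\, S + \varepsilon_j,
  && j \in \text{non-reference items},\\
\operatorname{Cov}(G, S) &= 0 .
\end{align}
```

Because the reference items load only on G, the general factor is anchored to the reference subdimension, and S captures what the non-reference items share beyond it.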

Invariance

Invariance of the indicator reliabilities of unit-weighted scores across groups requires residual (or strict) invariance, which was tested using multigroup confirmatory factor analysis (MGCFA) (Vandenberg and Lance, 2000). The following levels of invariance were assessed: equal form (i.e., configural invariance), equality of factor loadings (i.e., metric invariance), equality of indicator intercepts (i.e., scalar invariance), and equality of residuals (i.e., residual invariance) (Vandenberg and Lance, 2000). However, for residual invariance to be analogous to indicator reliability invariance, the last step requires invariance of factor variances, which was tested first (Vandenberg and Lance, 2000). Invariance between nested models was accepted when the decrease in CFI was smaller than 0.01 and the increase in RMSEA was smaller than 0.015, or when the scaled χ² difference test was non-significant (Putnick and Bornstein, 2016).
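The decision rule for comparing two nested invariance models can be sketched as a small function. The thresholds are read here as a drop in CFI of less than 0.01 together with a rise in RMSEA of less than 0.015; the significance level of 0.05 for the scaled χ² difference test is an assumption, and the function name and example values are hypothetical.

```python
def invariance_supported(cfi_constrained, cfi_free,
                         rmsea_constrained, rmsea_free,
                         chisq_diff_p=None, alpha=0.05):
    """Compare a more constrained model with the less constrained one."""
    delta_cfi = cfi_constrained - cfi_free        # negative when fit worsens
    delta_rmsea = rmsea_constrained - rmsea_free  # positive when fit worsens
    by_fit_indices = delta_cfi > -0.01 and delta_rmsea < 0.015
    # alpha = 0.05 is an assumed significance level, not stated in the text.
    by_chisq = chisq_diff_p is not None and chisq_diff_p >= alpha
    return by_fit_indices or by_chisq


# Invented fit values for a metric model versus a configural model.
print(invariance_supported(0.958, 0.962, 0.051, 0.049))
# Invented values with a large drop in fit and no chi-square test supplied.
print(invariance_supported(0.940, 0.962, 0.070, 0.049))
```

The same comparison would be repeated at each step of the sequence, from configural through to residual invariance.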

Group Differences

For the unit-weighted scores, standardized effect sizes between different subgroups were calculated using robust regression of the unit-weighted scores based on MM-estimation (i.e., an extension of the maximum likelihood estimate method). Standardized effect sizes of unit-weighted scores were compared with the standardized effect sizes of the difference in latent mean levels estimated through multigroup structural equation modeling. Data were analyzed using R software with the packages “lavaan” (Rosseel, 2012) and “semTools” (Jorgensen et al., 2020).

Missing Data

Missing data occurred in 0.4% of the GAD-7 data, in 3.4% of the PHQ-9 data and in 0.0% of the demographic variables (sex, education, and age). It was deemed implausible that the probability of missing data would differ significantly across specific groups or cases; the data were therefore assumed to be missing completely at random (MCAR). This assumption was supported by Little’s test, whose null hypothesis was not rejected for a subset of the GAD-7 and demographic variables (p = 0.30) or for the PHQ-9 and demographic variables (p = 0.07). For these reasons, complete case analysis was preferred.

Ethical Approval

The study was approved by the Institutional Ethics Committee of the Sree Chitra Tirunal Institute for Medical Sciences and Technology, Trivandrum, Kerala, and by the Human Research Ethics Committees of Monash University, Australia and the University of Melbourne, Australia. The study was also approved by the Health Ministry Screening Committee of the Government of India.

Results

Factor Structure

As mentioned previously, the aim of this analysis was to assess the existence of a somatic and a cognitive subdimension in the response data. For this purpose, we assessed whether model fit criteria of 1- and 2-factor models were acceptable (see Figures 1, 2 and Table 1). CFA of the models proposed for the PHQ-9 revealed an acceptable fit for a 2-factor model, but not for a 1-factor model (see Table 1). The factors of the 2-factor model were highly correlated (r = 0.77). Factor loading estimates revealed that the indicators were strongly related to their purported factors (range λ = 0.45–0.73) with p-values below 0.001 (see Figure 1).


Figure 1. Standardized factor loadings and error variances for a 1- and a 2-factor model of the PHQ-9. DEP, depression; COG, cognitive; SOM, somatic. (n = 1207).


Figure 2. Standardized factor loadings and error variances for a 1- and a 2-factor model, and a model with correlated residuals of the GAD-7. GAD, generalized anxiety disorder; COG, cognitive; SOM, somatic. (n = 1205).


Table 1. Model fit of the PHQ-9 (n = 1207) and the GAD-7 (n = 1205).

CFA of the models proposed for the GAD-7 revealed an acceptable fit for a 1-factor model with correlated residuals between items 4, 5, and 6, and for a 2-factor model (see Table 1). The correlation between the cognitive and somatic factors of the 2-factor model was high (r = 0.84). Factor loading estimates revealed that the indicators were strongly related to their purported factors (range λ = 0.58–0.88), except for item 6 (λ = 0.27) (see Figure 2). For all parameters, p-values were below 0.001 except for the correlation between the residuals of items 5 and 6 (p-value = 0.05).