Bayesian Analysis of a Quantile Multilevel Item Response Theory Model

Zhu, Hongyue; Gao, Wei; Zhang, Xue

doi:10.3389/fpsyg.2020.607731

ORIGINAL RESEARCH article

Front. Psychol., 08 January 2021

Sec. Quantitative Psychology and Measurement

Volume 11 - 2020 | https://doi.org/10.3389/fpsyg.2020.607731

Bayesian Analysis of a Quantile Multilevel Item Response Theory Model

$\r\nHongyue Zhu$ Hongyue Zhu¹

Wei Gao¹

Xue Zhang^2*

¹School of Mathematics and Statistics, Northeast Normal University, Changchun, China
²China Institute of Rural Education Development, Northeast Normal University, Changchun, China

Multilevel item response theory (MLIRT) models are used widely in educational and psychological research. This type of modeling has two or more levels, including an item response theory model as the measurement part and a linear-regression model as the structural part, the aim being to investigate the relation between explanatory variables and latent variables. However, the linear-regression structural model focuses on the relation between explanatory variables and latent variables, which is only from the perspective of the average tendency. When we need to explore the relationship between variables at various locations along the response distribution, quantile regression is more appropriate. To this end, a quantile-regression-type structural model named as the quantile MLIRT (Q-MLIRT) model is introduced under the MLIRT framework. The parameters of the proposed model are estimated using the Gibbs sampling algorithm, and comparison with the original (i.e., linear-regression-type) MLIRT model is conducted via a simulation study. The results show that the parameters of the Q-MLIRT model could be recovered well under different quantiles. Finally, a subset of data from PISA 2018 is analyzed to illustrate the application of the proposed model.

Introduction

Multilevel or hierarchical linear models are used widely in educational and psychological researches (e.g., Raudenbush, 1988; Goldstein, 1995; Snijders and Bosker, 1999). These models allow data to be collected at different levels; for example, test data are obtained from students, students are nested within schools, and so on. Item response theory (IRT) models can be plugged in multilevel models, known as multilevel IRT (MLIRT) models. In educational research and educational assessment, the MLIRT model is widely applied, where the main objective is to investigate the relationship between covariates (e.g., student’s social background, school financial resources) and outcome variables (e.g., student’s ability; Fox, 2010).

Adams et al. (1997) noted that a two-level IRT model could be seen as a multilevel perspective on item response modeling. In this formulation, we can divide the two-level IRT model in two components: the measurement part with the respective IRT measurement model and the structural part with the respective regression model. Furthermore, Kamata (2001) and Fox and Glas (2001) extended the two-level IRT model to three levels, with both studies proposing a two-level regression model (i.e., student-level and group-level covariates were analyzed in different levels) on the ability parameters as the structural model. Because the explanatory variables might not always be measured accurately, Fox and Glas (2003) showed how to model latent explanatory variables with measurement errors within the MLIRT model. Additionally, MLIRT models can also be applied to handle more-complicated factors such as longitudinal response data and response times. For example, Huang (2015); Schmidt et al. (2016) introduced MLIRT models for longitudinal data to assess the changes in students’ abilities over time. Also, when jointly modeling the responses and response times, the ability and speed parameters can be considered as outcome variables of a multivariate multilevel model for various analyses (e.g., Entink et al., 2009; Fox, 2010).

Regarding the estimation of MLIRT models, studies have shown that all the model parameters can be estimated simultaneously using a fully Bayesian method, regardless of whether the IRT model is a two-parameter model for dichotomous response data or a graded response model for polytomous response data (Fox, 2005; Natesan et al., 2010). Compared with the traditional two-stage estimation procedure, this estimation procedure in the MLIRT models leads to a proper treatment of the measurement error associated with the ability parameter (Adams et al., 1997; Fox and Glas, 2001, 2003; Fox, 2004, 2005, 2010). See Fox (2010) for more advantages of the MLIRT modeling framework.

One fact that needs to be noted is that the distribution of latent variables in the IRT model is often assumed to be normal (e.g., Bock and Lieberman, 1970; Bock and Aitkin, 1981; Wirth and Edwards, 2007). This assumption may be reasonably when students follow a normally distributed population, but in many cases, it is not satisfied (e.g., Woods, 2006, 2015; Woods and Thissen, 2006; Sass et al., 2008). It has been shown that when a non-normal latent distribution is assumed to be normal, estimates of IRT model parameters are biased (e.g., Stone, 1992; Sass et al., 2008; Woods, 2015). This issue has also been explored regarding multidimensional and graded response models (e.g., Finch, 2010, 2011; Svetina et al., 2017; Wang et al., 2018). However, there has been little attention to the non-normality of latent variables in the MLIRT framework.

On the other hand, the structural models in the MLIRT models are always linear regression (LR) approaches. Herein, LR generally assumes that the errors are normally distributed and should be homoscedastic, which are not often in accordance with practice, because of non-normality of latent variables or heterogeneity of errors (Harvey, 1976; Koenker and Bassett, 1978, 1982; Jarque and Bera, 1980; Tsionas, 2003). It is also well known that a LR model mainly focuses on the relation between explanatory variables and the conditional mean of latent variables, however, it is quite possible that the relationship between variables may vary at different points along the response distribution, researchers also might be interested in the relationships between variables on various locations of the distribution, such as on one or both of the tails of the distribution.

These problems can be addressed by the quantile regression (QR) (Koenker and Bassett, 1978), which is a valuable and robust tool for analyzing the conditional quantile functions of latent variables. When considering the relationship between explanatory variables and latent variables, QR allows to compute several different regression curves corresponding to the various percentage points of the distribution and thus obtain a more complete relationship between them. In addition, QR is regarded as a robust method in regression because the estimation results are insensitive to outliers or the non-normality of response distributions. Overview of QR methods, one can refer to Yu et al. (2003); Koenker and Ng (2005). As such, QR has been used by many researchers in the field of education (e.g., Wößmann, 2005; Haile and Nguyen, 2008; Chen and Chalhoub-Deville, 2014; Giambona and Porcu, 2015; Gürsakal et al., 2016; Costanzo and Desimoni, 2017). The aim of most studies has been to scrutinize inequality of opportunity in education through investigating the determinants of student achievement, or to analyze individual and/or family background determinants of student achievement with respect to test results.

In recent years, QR has also been applied successfully in latent variable models (e.g., Dunson et al., 2003; Allen and Powell, 2011; Burgette and Reiter, 2012; Wang et al., 2016; Feng et al., 2017; Belloni et al., 2019; Liu et al., 2020). Burgette and Reiter (2012) considered QR in a factor analysis model to analyze the effects of latent variables on the lower quantiles of the response distribution. Wang et al. (2016) introduced QR into a structural equation model (SEM) to assess the conditional quantile of the outcome latent variable given the explanatory latent variables and covariates. Liu et al. (2020) proposed a quantile hidden Markov model to examine the entire conditional distribution of the response given the hidden state and potential covariates. Factor analysis models can be converted to IRT models under possible parameterizations (Kamata and Bauer, 2008; Wang and Zhang, 2019). Therefore, it is reasonable and implementable to apply QR to MLIRT models.

This study is motivated by the MLIRT models focus largely on the relationship between variables at the mean level, which may result in neglecting potential differences across the response distribution. Herein, taking a two-level (i.e., item level and student level) IRT model as a demonstration, we embed the QR approach in a MLIRT model to obtain what we refer to as the quantile MLIRT (Q-MLIRT) model. This approach enables us to investigate the relationship between explanatory variables and latent variables at various locations of the response distribution, as well as having minimal assumptions on error terms, which is a flexible and applicable approach in general. Moreover, the Q-MLIRT model can be estimated in a Bayesian framework with a Markov chain Monte Carlo (MCMC) algorithm. It should be emphasized that the Q-MLIRT model is not a replacement but a supplement to the conventional MLIRT model. If the relationship between explanatory variables and latent variables on the entire distribution is of interest, Q-MLIRT is a good recommendation.

The rest of this paper is organized as follows. The proposed Q-MLIRT model is introduced in Section “Model Description,” the MCMC algorithm is presented in Section “Bayesian Estimation,” and a simulation study conducted to evaluate the empirical performance of the proposed method is reported in Section “Simulation Study.” A real data study to demonstrate the use of Q-MLIRT models is reported in Section “Empirical Example,” and finally a discussion and suggestions for further research are presented in Section “Discussion.”

Model Description

Before describing the Q-MLIRT model considered here, we revisit MLIRT models briefly. We consider only a two-level IRT model, namely an item level and a student level. As presented by Adams et al. (1997), the two-level IRT model consists of two components: the measurement part—an IRT measurement model—which describes the probability of observed responses conditional upon the latent variable, and the structural part—a LR model—which describes the between-student variation in the latent variable. The LR model is on ability parameters, where the ability parameter is dependent on the covariates:

θ_{i} = X_{i}^{T} β + δ_{i}, (1)

where θ_i is the latent trait (i.e., ability) of student i (i = 1,…,n), X_i is a q × 1 vector of known covariates (e.g., gender, socioeconomic status, and major) for student i, and β is the corresponding q × 1 vector of regression coefficients. We assume that δ_i are independently and identically normally distributed with mean zero and variance α². In this form, a student’s achievement or performance is represented as ability parameter θ_i, and the relationship between students’ abilities and the involved explanatory variables is reflected in the regression coefficients β. Note that the coefficients of all covariates are treated as fixed effects here.

LR analysis measures the relationship between explanatory variables and students’ abilities by modeling a conditional mean function of θ_i. Given explanatory variables X_i, then:

E (θ_{i} | X_{i}) = X_{i}^{T} β . (2)

That is, the conditional distribution of θ_i is assumed to be a linear combination of covariates X_i with normally distributed errors. In usual LR approaches, the data are used to find out a single regression line that minimizes the sum of squared errors (least squared estimation) to estimate the relationship. Consequently, the focus is on average performance of response variables about covariates. However, as mentioned above, the errors may be non-normally distributed or even heteroscedastic in reality, and when a distribution is asymmetric (e.g., heavy-tailed or skewed), the mean is not the center of the distribution, while the median is likely to be a more appropriate measure of central tendency than the mean (Edgeworth, 1888; Fox, 1997; Koenker and Ng, 2005; Hao and Naiman, 2007; Trafimow et al., 2018). What is more, the relation between explanatory variables and latent variables may be different across the entire distribution. As a result, the conditional mean fails to meet the research needs in many cases. Therefore, we must provide an elaborate or specific description of the inter-relationship among latent variables and explanatory variables.

A more appropriate analysis can be achieved from conditional quantiles of θ_i. QR provides a suitable method for modeling conditional quantile functions (Koenker and Hallock, 2001). Under a number of different quantiles τ ∈ (0,1), we say the τth quantile of δ_i is the value q_τ, for which p(δ_i < q_τ) = τ, that is Q_τ(δ_i) = q_τ. The τth conditional quantile of θ_i can be then expressed as:

Q_{τ} (θ_{i} | X_{i}) = X_{i}^{T} β_{τ}, (3)

where β_τ is a q × 1 vector of QR coefficients depend on τ, and it reflects the relationship between X_i and the τth conditional quantile of θ_i, which means the marginal effects of the explanatory variables may differ over quantiles of the distribution of θ_i. Since numerous quantiles (e.g., 25%, 50%, 75%) are modeled in QR, it is possible to understand the relationship between variables at various locations of the response distribution. For example, when τ = 0.5, $X_{i}^{T} β_{τ}$ is the conditional median of θ_i, and β_τ reflects the relationship between explanatory variables and the conditional median of θ_i. In QR approaches, the classical inference of β_τ is through a minimization of the weighted sum of absolute residuals for all the observations.

Therefore, θ_i can be assessed in the following QR-type structural model, by the τth conditional quantile of θ_i plus an error term:

θ_{i} = X_{i}^{T} β_{τ} + δ_{i} . (4)

Unlike in the LR model [i.e., Eq. (1)], here the distribution of δ_i is not specified. The only assumption is that the τth quantile of δ_i is zero, that is q_τ = 0, so that we have Eq. (3). In other words, the QR structural model does not rely on any parametric specification of the conditional distribution of θ_i. Note that when the error term δ_i in Eq. (4) is normally or symmetrically distributed, results of LR and QR at the median (τ = 0.5) are consistent, in other words, the MLIRT model of Eq. (1) is a special version of the Q-MLIRT model with normality of δ_i and τ = 0.5.

Therefore, the two-level Q-MLIRT model can be defined by combining Eq. (4) with an IRT model, namely:

P (Y_{i k} = 1 | θ_{i}, a_{k}, b_{k}) = Φ (a_{k} (θ_{i} - b_{k})), θ_{i} = X_{i}^{T} β_{τ} + δ_{i}, (5)

where we adopt a two-parameter normal ogive (2PNO) model in the first level. In this article, only dichotomously scored items are considered. Here, Φ denotes the cumulative standard normal distribution function, Y_ik = 1 means a correct response of student i on item k(k = 1,…K), and a_k and b_k are the discrimination and difficulty parameters, respectively, of item k. The second level is QR on the ability parameters with an arbitrary quantile τ ∈ (0,1), and the parameters have the same meanings as those in Eq. (4).

Herein, we fix the item parameters of one item as known (i.e., a_1=1 and b_1=0) to identify the Q-MLIRT model as did Fox and Glas (2001). In the subsequent analysis, we show that the proposed model offers a more complete view of the relationship among explanatory variables and latent variables at various locations of the response distribution, including the central tendency, upper tail, and lower tail. In many cases, the error terms are non-normally distributed or heterogeneous, which fails to meet the assumption of LR, but nevertheless QR is also suitable for such cases.

Bayesian Estimation

We adopt the Bayesian estimation approach in the QR structural model (e.g., Yu and Moyeed, 2001; Lee and Neocleous, 2010; Reich et al., 2011; Wang et al., 2016). Combined with the Gibbs sampling method for 2PNO models (Albert, 1992), a fully Bayesian approach is presented to estimate the parameters of the Q-MLIRT model. Similar to Wang et al. (2016), we adopt the asymmetric Laplace distribution1 (ALD¹; Yu and Moyeed, 2001) to approximate the distribution of error terms in the QR structural model. More details about the ALD and the Bayesian QR method are presented in Appendix. Specifically, we assume δ_i∼ALD(0,ω,τ), then θ_i follows $A L D (X_{i}^{T} β_{τ}, ω, τ)$ . As shown by Reed and Yu (2009); Kozumi and Kobayashi (2011), if a random variable follows ALD (μ,ω,τ) then it can be represented as a normal distribution, so we obtain:

θ_{i} | β_{τ}, ω, e_{i} \overset{i n d}{\sim} N (X_{i}^{T} β_{τ} + k_{1} e_{i}, k_{2} ω e_{i}), (6)

where $k_{1} = \frac{1 - 2 τ}{τ (1 - τ)}$ , $k_{2} = \frac{2}{τ (1 - τ)}$ , and e = {e₁,e₂,⋯,e_n} follows an exponential distribution with scale parameter ω. That is to say, the parameter space is augmented by a latent variable e. This mixed representation of $A L D (X_{i}^{T} β_{τ}, ω, τ)$ enables Bayesian inference based on the Gibbs sampling algorithm to estimate the parameters of the QR structural model.

Gibbs Sampling

From Eq. (5), the observations consist of the item responses Y in the first level and the explanatory variables X in the second level. As a result, the full posterior distribution of the parameters given the observed data is:

p (Z, θ, a, b, β_{τ}, e, ω | Y, X) \propto \prod_{i = 1}^{n} ((\prod_{k = 1}^{K} p (z_{i k} | θ_{i}, a_{k}, b_{k}, y_{i k})) p (θ_{i} | β_{τ}, e_{i}, ω, X_{i})) p (e | ω) p (a) p (b) p (β_{τ}) p (ω), (7)

where the augmented data z_ik follow:

p (z_{i k} | θ_{i}, a_{k}, b_{k}, y_{i k}) \propto φ (z_{i k}; a_{k} (θ_{i} - b_{k}), 1) [I (z_{i k} > 0) I (y_{i k} = 1) + I (z_{i k} \leq 0) I (y_{i k} = 0)],

where φ is the normal probability density function (PDF) and I(⋅) is an indicator function.

The following conjugate prior distributions of the model parameters are used:

a_{k} \sim N (μ_{a}, σ_{a}^{2}) I (a_{k} > 0),

b_{k} \sim N (μ_{b}, σ_{b}^{2}),

β_{τ} \sim N_{q} (Λ_{0 β}, H_{0 β}),

ω^{- 1} \sim G a m m a (α_{0 ω}, β_{0 ω}),

where μ_a, σ_a, μ_b, σ_b, Λ_0β, H_0β, α_0ω, and β_0ω are hyperparameters. The detailed procedure of the Gibbs sampling algorithm is summarized below.

Step 1: Sampling z_ik. Let ξ be the vector of all item parameters, given the parameters θ and ξ, the variables z_ik are independent. For i = 1,…,n and k = 1,…,K, the fully conditional posterior distribution of z_ik can be written as:

(z_{i k} | θ, ξ, y) \sim {\begin{matrix} N (a_{k} (θ_{i} - b_{k}), 1) truncated at the left by 0 if y_{i k} = 1 \\ N (a_{k} (θ_{i} - b_{k}), 1) truncated at the right by 0 if y_{i k} = 0 \end{matrix} . (8)

Step 2: Sampling a_k. According to the principle of conjugate distribution and combined with the prior of a, the fully conditional posterior density of a_k follows:

(a_{k} | z_{k}, θ, b_{k}) \sim N (\frac{\frac{μ_{a}}{σ_{a}^{2}} + \sum_{i = 1}^{n} [z_{i k} (θ_{i} - b_{k})]}{\frac{1}{σ_{a}^{2}} + \sum_{i = 1}^{n} {(θ_{i} - b_{k})}^{2}}, \frac{1}{\frac{1}{σ_{a}^{2}} + \sum_{i = 1}^{n} {(θ_{i} - b_{k})}^{2}}) . (9)

Step 3: Sampling b_k. The fully conditional posterior density of b_k is:

(b_{k} | z_{k}, θ, a_{k}) \sim N (\frac{\frac{μ_{b}}{σ_{b}^{2}} - \sum_{i = 1}^{n} (z_{i k} - a_{k} θ_{i}) a_{k}}{\frac{1}{σ_{b}^{2}} + n a_{k}^{2}}, \frac{1}{\frac{1}{σ_{b}^{2}} + n a_{k}^{2}}) . (10)

Step 4: Sampling θ_i. The ability parameters are independent given other parameters. Using Eqs. (6) and (8), it follows that:

f (θ_{i} | z_{i}, ξ, β_{τ}, ω, e_{i}) \propto f (z_{i} | θ_{i}, ξ) f (θ_{i} | β_{τ}, ω, e_{i}), (11)

Eq. (11) is a normal model, where θ_i has a normal prior parameterized by β_τ,e_i, and ω. So the fully conditional posterior density of θ_i is given by:

(θ_{i} | z_{i}, ξ, β_{τ}, ω, e_{i}) \sim N (\frac{\frac{μ_{θ}}{σ_{θ}^{2}} + \sum_{k = 1}^{K} (z_{i k} + a_{k} b_{k}) a_{k}}{\frac{1}{σ_{θ}^{2}} + \sum_{k = 1}^{K} a_{k}^{2}}, \frac{1}{\frac{1}{σ_{θ}^{2}} + \sum_{k = 1}^{K} a_{k}^{2}}), (12)

where $μ_{θ} = X_{i}^{T} β_{τ} + k_{1} e_{i}$ , $σ_{θ}^{2} = k_{2} ω e_{i}$ .

Step 5: Sampling ω. Letting v = ω⁻¹ and using the gamma conjugate prior density for v and the PDF of $A L D (X_{i}^{T} β_{τ τ}, ω, τ)$ , the fully conditional posterior distribution of v is given by:

p (v | θ, β_{τ}) \propto v^{n + α_{0 ω} - 1} e x p {- [β_{0 ω} + \sum_{i = 1}^{n} ρ_{τ} (θ_{i} - X_{i}^{T} β_{τ})] v},

from which we have:

(ω^{- 1} | θ, β_{τ}) \sim G a m m a (n + α_{0 ω}, β_{0 ω} + \sum_{i = 1}^{n} ρ_{τ} (θ_{i} - X_{i}^{T} β_{τ})) . (13)

Step 6: Sampling e_i. From Eq. (6) together with an exponential distribution of e_i with scale parameter ω, the fully conditional posterior distribution of $e_{i}^{- 1}$ follows an inverse Gaussian distribution.

(e_{i}^{- 1} | θ_{i}, β_{τ}, ω) \sim I n v e r s e - G a u s s i a n (\frac{\sqrt{2 k_{2} + k_{1}^{2}}}{| θ_{i} - X_{i}^{T} β_{τ} |}, \frac{2 k_{2} + k_{1}^{2}}{k_{2} ω}) . (14)

Step 7: Sampling β_τ . Because Eq. (6) is a normal distribution conditionally on e_i, it is straightforward to derive the fully conditional posterior distribution of β_τ, which is given by:

(β_{τ} | θ, e, ω) \sim N (μ_{β}, Σ_{β}^{- 1}), (15)

where $μ_{β} = \sum_{β}^{- 1} (H_{0 β}^{- 1} Λ_{0 β} + \sum_{i = 1}^{n} \frac{(θ_{i} - k_{1} e_{i}) X_{i}}{k_{2} ω e_{i}})$ and $\sum_{β} = H_{0 β}^{- 1} + \sum_{i = 1}^{n} \frac{X_{i} X_{i}^{T}}{k_{2} ω e_{i}}$ .

Simulation Study

In this section, a simulation study is carried out to evaluate the performance of the Q-MLIRT model. We evaluate the parameter recovery with the Gibbs sampler. The Gibbs sampling algorithm is implemented in MATLAB (MathWorks, 2016), and the source code is available to readers upon request.

Simulation Design

Two covariates X₁ and X₂ are considered, which are independent standard normal variables. The latent variable θ_i(i = 1,…,n) is generated from three different models:

Case 1: θ_i = β₁X_i1 + β₂X_i2 + δ_i, δ_i∼N(0, 0.5)

Case 2: θ_i = β₁X_i1 + β₂X_i2−ρ_τ|β₁X_i1 + β₂X_i2| + |β₁X_i1 + β₂X_i2|δ_i, δ_i∼N(0, 1)

Case 3:θ_i = β₁X_i1 + β₂X_i2 + δ_i. δ_i∼Gamma(0.5, 1)−0.5

Parameters β₁ and β₂ are set to 0.5. Case 1 represents a normal and homoscedastic error model, the assumption of which conforms to that of LR models, case 2 represents a heteroscedastic error model, and case 3 represents a skewed error model. Of these, cases 2 and 3 evaluate the Q-MLIRT model with heavy-tailed and non-normal latent variables, respectively. For cases 1 and 3, the conditional τth quantile of θ takes the form Q_τ(θ|X₁,X₂) = β_0τ + β_1τX₁ + β_2τX₂, where β_0τ = ρ_τ, β_1τ = 0.5, β_2τ = 0.5 for τ ∈ (0,1), and ρ_τ is the τth quantile of δ_i. For case 2, the conditional quantile function is Q_τ(θ|X₁,X₂) = β₁X₁ + β₂X₂−ρ_τ|β₁X₁ + β₂X₂| + ρ_τ|β₁X₁ + βX₂| = β_1τX₁ + β_2τX₂, where β_1τ = 0.5 and β_2τ = 0.5 for τ ∈ (0,1).

In order to estimate the QR coefficients β_1τ and β_2τ under different quantiles, and compare with their theoretical values 0.5, three quantile levels (i.e., τ = 0.25, 0.5, and 0.75) are chosen, which represent the lower, central, and upper tail of the response distribution, respectively. We consider three sample sizes (i.e., n = 500, 1000, 2000) and two test lengths (i.e., K = 20, 40). Overall, there are 54 conditions (3 structural models × 3 quantiles × 3 sample sizes × 2 test lengths). Response patterns are generated by the 2PNO model, with the discrimination parameters generated from U(0.5, 1.5) and the difficulty parameters generated from N(0,0.5).

For the conjugate normal prior of β_τ, we set the prior mean Λ_0β = (0, 0, 0) and the covariance matrix H_0β is 100 times an identity matrix; which is a type of weakly informative prior. For the conjugate inverse gamma prior of ω, we set α_0ω = 28 and β_0ω = 4. For the prior of item parameters a and b, we set μ_a = 0, σ_a = 200 and μ_b = 0, σ_b = 100, respectively.

The Gibbs sampling estimation procedure of the Q-MLIRT model was iterated 10,000 times. A burn-in period of 5,000 iterations was used, and the parameter estimate was the mean of the posterior distribution of the parameter. The convergence of the Gibbs sampler was checked by the Gelman–Rubin method (Gelman and Rubin, 1992), and the values of the potential scale reduction factor (PSRF) in the burn-in period were less than 1.10 for all parameters. Overall, 100 replications were conducted across all simulation conditions.

A related issue must be considered in the application of Q-MLIRT, namely that the estimates of item parameters differ for different quantiles because of the fact that the fully conditional posterior distribution of item parameters depends on τ. As can be seen from Eq. (6), the location and scale of θ_i are changed in the estimation process of item parameters under different quantiles. In this case, bias may not make sense, a large bias does not mean poor estimates, and the results of different quantiles are also not comparable. We expect the estimates of item parameters under different quantiles are close to each other after some transformation. To this end, we introduce a new measurement that is actually suitable for situations in which item parameters are estimated on different IRT scales, to further illustrate the similarity and comparability between them.

It is known that if θ_i have a transformation relationship on two scales, namely:

θ_{i}^{*} = p θ_{i} + t,

where p and t are constants, then the relationships between the item parameters can be formulated as:

a_{k}^{*} = \frac{a_{k}}{p},

b_{k}^{*} = p b_{k} + t .

In this case, we can use similarity functions, such as the cosine function, to measure the similarity of item parameters at different scales. Two vectors are defined to be similar if the distance between them is small, which is measured by the cosine of the angle between them. Let a = (a₁,…,a_k) and $a^{*} = (a_{1}^{*}, \dots, a_{k}^{*})$ be the discrimination parameter vectors on two different scales, then the cosine similarity between the two vectors is defined as:

\cos (a, a^{*}) = \frac{\sum_{k} a_{k} a_{k}^{*}}{\sqrt{\sum_{k} {(a_{k})}^{2}} \sqrt{\sum_{k} {(a_{k}^{*})}^{2}}} . (16)

If $a_{k}^{*} = \frac{a_{k}}{p}$ , then cos⁡(a,a^*) = 1. Similarly, the cosine similarity between two difficulty parameter vectors b = (b₁,…,b_k) and $b^{*} = (b_{1}^{*}, \dots, b_{k}^{*})$ can be defined as:

\cos (b, b^{*}) = \frac{\sum_{k} (b_{k} - E b) (b_{k}^{*} - E b^{*})}{\sqrt{\sum_{k} {(b_{k} - E b)}^{2}} \sqrt{\sum_{k} {(b_{k}^{*} - E b^{*})}^{2}}}, (17)

where Eb and Eb* are the means of the elements in vectors b and b^∗, respectively. If $b_{k}^{*} = p b_{k} + t$ , then cos⁡(b,b^*) = 1.

In this sense, if the cosine of the angle between the true-parameter vector and the estimated-parameter vector is close to 1, it can be concluded that these two vectors are very similar and close, and the estimates can also be called “good estimates;” conversely, if the cosine is away from 1, it indicates that these two vectors are very different, and the estimates can also be called “bad estimates.” The higher the cosine similarity, the closer they are. Note that the “good” and “bad” estimates are just defined in the context of the transformation relationship that we considered as above, and accuracy of all the item parameters’ estimations is measured by the cosine similarity. The upshot is, when the estimated results of item parameters under each quantile are close to each other and are both good (i.e., the calculated cosine similarity of item parameters under each quantile is close to 1), we say that the item parameters are well recovered and that any group of estimates can be selected. Otherwise, the recovery of item parameters is poor and the results of different quantiles are not comparable to one another to some extent.

Given that there is no existing work on Q-MLIRT in the literature, we compare the estimated results with the two-level structural IRT model denoted in Eq. (1), where the structural model is a LR on θ_i. Hereafter, we refer to it as the mean regression multilevel IRT (M-MLIRT) model. We estimate it through a fully Bayesian estimation procedure that is easily extracted from the estimation procedure in Fox and Glas (2001). We use the average root-mean-square error (RMSE), the average bias, and the average cosine similarity to evaluate parameter recovery.

Simulation Results

Tables 1, 2 summarize the parameter recovery of the item parameters for the Q-MLIRT model in comparison to the M-MLIRT model for small and large test lengths respectively. In general, the item parameters are estimated accurately in all the cases (i.e., cases 1, 2, and 3). For each sample size, estimates of the item parameters of the Q-MLIRT model under different quantiles are close to the true parameter values, reflected mainly in the facts that (i) the cosine similarities between the true-parameter vector and the estimated-parameter vectors under different quantiles are all very close to 1 and (ii) the RMSEs are all small. In addition, when the sample size increases from 500 to 2,000, the cosine similarities increase, the RMSE and the bias in almost cases decrease for the estimates of item parameters. Results show that with increasing test length, the accuracy of almost all item parameters’ estimations also has some improvement, this could because we can get more accurate information from students as they answer more items.

TABLE 1

Table 1. Summary of parameter recovery of item parameters when K = 20.

TABLE 2

Table 2. Summary of parameter recovery of item parameters when K = 40.

From Tables 1, 2, it is also indicated that, as for the estimation of item parameters, the item parameters’ estimations of the M-MLIRT model and the Q-MLIRT model are close in each case, mainly from the reflect of the cosine similarities and RMSEs. It should be noted that although for case 2 and case 3, the distribution of ability is non-normal, it seems not to have effects on the estimation of item parameters for the MLIRT model. Specifically, for case 3, the bias of the difficulty parameters for the M-MLIRT model are always larger than the Q-MLIRT model, but from the cosine similarity and RMSE points of view, the estimates of the difficulty parameters for the M-MLIRT model and Q-MLIRT model are close. The relationship between the cosine similarity and the bias, and how they behave in more different situations, are worthy of further study and discussion.

Tables 3, 4 give the estimated results of the structural regression coefficients for the Q-MLIRT model in comparison with those of the M-MLIRT model. First, it can be concluded that the regression coefficients are estimated accurately in all the cases (cases 1, 2, and 3) for the Q-MLIRT model. When the sample size increases from 500 to 2,000, the RMSEs decrease with increasing sample size, which means that the estimates are increasingly stable. The bias of regression coefficients also decreases obviously with increasing sample size, especially when τ = 0.5. The trends are less obvious for quantiles 0.25 and 0.75, mainly because there are few samples at the upper and lower tails (τ = 0.75 and 0.25, respectively) of the conditional distribution of θ_i. Nevertheless, the directions of the bias are increasingly consistent with increasing sample size, especially when n = 2,000, which means that the estimates tend to true and stable values. From Table 4, the results show that as larger test length leads to more accurate estimations of the structural regression coefficients, mainly reflecting in the decreased biases and RMSEs in almost all conditions. Together with the estimation of item parameters, it can be concluded that the increase of test length will improve the accuracy of parameter estimation of the Q-MLIRT model, especially for the structural regression coefficients.

TABLE 3

Table 3. Bayesian estimates of regression coefficients when K = 20.

TABLE 4

Table 4. Bayesian estimates of regression coefficients when K = 40.

When comparing the estimation of regression coefficients of the M-MLIRT model and the Q-MLIRT model, we find that: first, M-MLIRT slightly outperforms Q-MLIRT in case 1. This is because that the error distribution in case 1 is normal, which meets the assumption of the LR model, so we expect the M-MLIRT model to outperform in that case; second, for case 2, the regression coefficients estimated by the M-MLIRT model are worse than those by the Q-MLIRT model to some extent. Specifically, the results are similar at τ = 0.5 when the sample size is small, but with increasing sample size, there are larger bias and RMSE for the estimates of the M-MLIRT model than those of the Q-MLIRT model. This is mainly because the error is heteroscedastic and the distribution of θ_i is heavy-tailed, and LR models are sensitive to heavy-tailed distributions and extreme outliers. For case 3, the results are close to each other, although the error term has a skewed distribution. It is perhaps that the degree of skewness of the distribution in case 3 is insufficient to influence the estimation of the regression coefficients in MLIRT.

Empirical Example

The main research question in this section is how the explanatory variables of interest regarding individual characteristics and family background are related to students’ achievement in mathematics. It is quite possible that the relation between explanatory variables and students’ achievement may differ across different levels of students’ achievement, while LR only provides us the information about the conditional mean of students’ achievement and ignores the characteristics of the entire distribution of students’ achievement. Thus, the proposed Q-MLIRT model is expected to achieve a more appropriate analysis.

Data Source

In this section, a subset of the PISA 2018 mathematics data was analyzed. There were 24 test forms/booklets that contained clusters of items from the mathematics domain. Each test form included four clusters, two clusters of items from the reading domain and one or two clusters of items from the mathematics domain. We chose a group of students from the United States who were administered test forms 1 to 6 for analysis. Each of these forms contained two mathematics clusters. Therefore, there were six clusters (i.e., M01-M06) in which contained 64 mathematical items. Because each student received one of the forms from his or her assigned group, and each form contained different items, so each student answered about 20 items and each item was answered about by 375 students. The sample size was n = 840 after eliminating students with missing data, and there were K = 64 dichotomous items. More information on the PISA 2018 data can be found at www.oecd.org/pisa/data/2018database/.

Three individual and family background variables, that have been used in most studies (e.g., Adams et al., 1997; Fox and Glas, 2001; Fox, 2005; Lu et al., 2018), were chosen as explanatory variables of interest. The student-grade variable (X₁) equaled −1, 0, and 1, which represented grade 9, grade 10, grade 11, respectively. The corresponding number of students in each grade was 59, 630, and 150. Since only one student was in grade 12, so her was removed, hence the final sample size was n = 839. The student-gender variable (X₂) equaled 0 (female) or 1 (male), the corresponding number of students was 421 and 418. The ESCS (economic, social, and cultural status of parents) variable (X₃) was normally standardized. Information of these explanatory variables are summarized in Table 5.

TABLE 5

Table 5. Summary of explanatory variables.

Method

The PISA 2018 technical report stated that the item parameters were calibrated by the two-parameter logistic model (2PLM) or the generalized partial credit model (GPCM). As only dichotomously scored items are considered, we use the 2PNO model here, and we fix the first discrimination parameter to one and the first difficulty parameter to zero to identify the model. To assess the effects of the explanatory variables above on students with different ability levels, the QR model is given by:

Q_{τ} (θ_{i}) = X_{i}^{T} β_{τ},

where i = 1,…,839, β_τ = (β_0τ,β_1τ,β_2τ,β_3τ)T is a 4 × 1 vector of unknown QR coefficients under the τth quantile of the ability distribution, X_i = (1,X_1i,X_2i,X_3i)T is a 4 × 1 vector that represents the explanatory variables of student i, the distribution of the error term is assumed as AL(0,ω,τ). The estimation was conducted for the quantiles of τ = 5%, 10%, 25%, 50%, 75%, 90%, and 95% to obtain a more thorough overview of the relationship between these explanatory variables and students’ mathematical achievement.

In the Bayesian analysis, the hyperparameters of the prior distributions were given as assigned in the simulation study. In the implementation of the Gibbs sampling algorithm, 10,000 iterations were done to estimate the parameters with an initial burn-in of 5,000 iterations. Convergence of the chains was checked by PSRF, and the PSRF values were less than 1.10 in the burn-in phase for all parameters under each quantile.

The M-MLIRT model was also used to fit the data, and compared with the Q-MLIRT model using deviance information criterion (DIC, Spiegelhalter et al., 2002). The model with smaller DIC was recommended as a better-fitting one. The DIC calculated using the joint likelihood conditioned on parameters at all levels. For the Q-MLIRT model, the joint likelihood is:

f (θ, Z | ξ, β_{τ}, ω, e) = f (Z | θ, ξ) f (θ | β_{τ}, ω, e), (18)

and the joint DIC for the τth quantile of the Q-MLIRT model is defined as:

D I C_{τ} = - 2 l o g {f (θ, Z | ξ, β_{τ}, ω, e)} . (19)

For the M-MLIRT model, the joint likelihood is:

f (θ, Z | ξ, β, α^{2}) = f (Z | θ, ξ) f (θ | β, α^{2}), (20)

and the joint DIC is defined as:

D I C = - 2 l o g {f (θ, Z | ξ, β, α^{2})} . (21)

Interested readers can refer to Celeux et al. (2006); Wang et al. (2013), Zhang et al. (2019) for more information of the joint DIC.

Results

The parameter estimates of the Q-MLIRT model under various quantiles and the M-MLIRT model (reported in column “M”) are presented in Tables 6, 7. For the limited space, we only present the results of the item parameters of cluster 1 from form 1, there are 9 items. From Table 6, it is showed that the estimates of item parameters obtained based on the M-MLIRT model are close to the results obtained from the Q-MLIRT model, especially at 50% quantile. We calculated the cosine similarity between each two discrimination-parameter vectors and each two difficulty-parameter vectors estimated under different quantiles, and these values were all very close to 1, that means they are very similar to each other, and we can choose any group of them as the estimations. Figure 1 shows item parameter estimates of the M-MLIRT model and three selected groups of item parameters estimated under τ = 25%, 50%, and 75% of the Q-MLIRT model, which represent the corresponding results of the lower, middle, and higher quantiles, respectively. As can be seen, all the estimates of item parameters under different quantiles are close and of the same size order.

TABLE 6

Table 6. Item parameter estimates of PISA data.

TABLE 7

Table 7. Structural regression coefficient estimates of PISA data.

FIGURE 1

Figure 1. Item parameter estimates of the M-MLIRT(M) model and the Q-MLIRT model under different quantiles (τ = 25, 50, and 75%).

The relationship between students’ mathematical achievement and the predictors can be deduced from the structural regression coefficients, the estimates of the QR and LR coefficients associate with the posterior standard deviations (SD) are showed in Table 7. The results of the M-MLIRT model show that, on average, the student grade and family background (i.e., ESCS) variables have significantly positive correlations with students’ mathematical achievement. Specifically, students in higher grades perform better, and students whose parents have higher economic, social, and cultural status have better mathematical achievement. But for gender, results of M-MLIRT show that there is a weakly positive correlation between gender and mathematical achievement, the performance of male students is slightly better than female students.

When the results of M-MLIRT and Q-MLIRT are compared, it is argued that there are several differences between these two models. The results of the Q-MLIRT model show that relationships between these explanatory variables and students’ mathematical achievement are different across the achievement distribution. First, we find positive correlations between those explanatory variables and mathematical achievement at various locations along the distribution, and the results of Q-MLIRT at 50% quantile are as close as the M-MLIRT. For student grade and family background variables, it is found that the positive correlations between them and mathematical achievement are more significant at lower and higher quantiles of the distribution, that is the relationships between them and mathematical achievement are stronger in students with high-ability and low-ability. Regarding gender, we find that the QR structural coefficient of gender increases from 5% quantile towards the 95% quantile of the achievement distribution, which means that male students perform better than female students in mathematics, and the difference is exacerbated among high-ability students. This finding is consistent with González de San Román and De La Rica (2012). They analyzed gender gaps in PISA test scores in their study, and showed that in most PISA participating countries, male students outperformed than female students in mathematical achievement, moreover, gender gap in test scores differed significantly in different parts of the test score distribution, where the gap increased from the lower tail to the upper tail. The structural regression coefficients across the various quantiles are illustrated further in Figure 2.

FIGURE 2

Figure 2. Regression coefficients of Q-MLIRT model across different quantiles in the analysis of PISA data, along with the corresponding 50 and 95% credible intervals.

In addition, Table 8 shows a comparison of the Q-MLIRT model and the M-MLIRT model based on the joint DIC. Results indicate that for the Q-MLIRT model, 25% quantile is the best-fitting, followed by quantiles at 75%, 50%, 10%, 5%, 90% and 95% ones. And the fitting situation of M-MLIRT is nearly equally to the Q-MLIRT model when the quantile is 50% or 75%, this is consistent with what is found in the estimation of parameters, that the parameter estimation results of M-MLIRT is mostly close to the results of Q-MLIRT at 50% and 75% quantiles.

TABLE 8

Table 8. Comparing the Q-MLIRT and M-MLIRT using joint DIC for PISA data.

To sum up, the analysis results make it clear that the Q-MLIRT model offers a more comprehensive picture of the relationship between students’ mathematical achievement and the student-background variables. The results suggest that the relationships between students’ mathematical achievement and its predictors differ across different levels of student ability, which can’t be achieved with MLIRT. We find that for students with lower and higher mathematical performances in PISA, their achievements are related more closely to grade, and the socioeconomic status of their family, and the gender difference is more pronounced among high-level students. In this respect, Q-MLIRT is a useful research tool to better understand the relationship between students’ individual characteristics or family background factors and educational achievement. Note that the results, together with some educational knowledge, would provide a deeper insight for related educators on mathematical learning, and these findings will also help them to develop more-targeted policies for students of different ability levels.

Discussion

In this article, a quantile-regression-type structural model was introduced to MLIRT models. The main advantage of this form is that it offers a richer analysis of the relationship between covariates and latent variables. Meanwhile, Q-MLIRT is more robust and is applicable to various distribution of latent variables. In other words, it is more flexible and applicable in general. We also proposed a fully Bayesian estimation algorithm for the Q-MLIRT model so that the parameters in the model could be estimated simultaneously. In addition, we proposed a new evaluation index based on the cosine similarity function to evaluate further the accuracy of the item parameter estimates under different quantiles.

In the simulation study, we used Gibbs sampling to estimate the Q-MLIRT model. The results showed that the parameters could be recovered well in almost all conditions. The Q-MLIRT model was applied to analyze a PISA data set, and the results provided more specific and in-depth insights into the relationship between the explanatory variables of interest and students’ achievement. Also, note that the item parameter estimates differ for different quantiles, although we know they are very close to each other according to the cosine similarity evaluation. An alternative and easy way of applying the Q-MLIRT model in practical problems is to obtain the item parameters at a certain quantile first, such as τ = 0.5 of the model, and then, for analysis with a different value of τ, the item parameters are pre-fixed (Wang et al., 2016).

This study was the first attempt at combining the QR approach with the IRT model in a multilevel modeling framework. However, this study focused only on item response data nested within students. We intend to extend the model to a higher level (school level) and estimate how the school level affects students’ abilities. Moreover, we considered herein only dichotomous response data and one-dimensional ability parameters, whereas Q-MLIRT could be applied to multidimensional and/or polytomous data. In addition, the explanatory variables that we have considered are known covariates, whereas it would make sense to consider latent explanatory variables measured by observations with measurement errors. Finally, variable selection would be of interest applied in Q-MLIRT, especially when some explanatory variables are either insignificant or significant only under some quantiles but not others.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://www.oecd.org/pisa/data/2018database.

Author Contributions

HZ did the simulation study and analysis on real data, and completed the writing of the article. WG provided key theoretical and technical support. XZ provided original thoughts and article revisions. All authors contributed to the article and approved the submitted version.

Funding

This research was partly supported by the National Natural Science Foundation of China (Grant 12001092) and the Key Laboratory of Applied Statistics of MOE.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Adams, R. J., Wilson, M., and Wu, M. (1997). Multilevel item response models: an approach to errors in variable regression. J. Educ. Behav. Stat. 22, 47–76. doi: 10.2307/1165238