METHODS article

Front. Psychol., 26 March 2025

Sec. Quantitative Psychology and Measurement

Volume 16 - 2025 | https://doi.org/10.3389/fpsyg.2025.1562305

Corrected goodness-of-fit index in latent variable modeling using non-parametric bootstrapping

  • 1ICCTR, Boston Children’s Hospital, Harvard Medical School, Boston, MA, United States
  • 2Department of Self-Development Skills, King Saud University, Riyadh, Saudi Arabia

Latent variable modeling (LVM) is a powerful framework for validating instruments and measurements in the social sciences. One of the main challenges in this method is the evaluation of model fit, which is traditionally assessed using omnibus inferential statistical criteria, descriptive fit indices, and residual statistics, all of which are, to some extent, affected by sample size and model complexity. In the present study, an R function was created to assess fit indices after employing non-parametric bootstrapping. Furthermore, the newly proposed corrected goodness-of-fit index (CGFI) is presented as a means to overcome the abovementioned limitations. Using data from the Programme for International Student Assessment (PISA) 2022 and the Progress in International Reading Literacy Study (PIRLS) 2021, analyses of instructional leadership and of the construct of bullying revealed differential decision-making when using the present function compared to relying solely on sample estimates. It is suggested that the CGFIboot function may provide useful information toward improving our evaluative criteria in LVMs.

1 Introduction

Latent variable modeling (LVM) has become the primary means of testing the validity of measurements whose items form latent constructs (Shevlin and Miles, 1998). Modeling latent variables at the measurement level involves model specification, parameter estimation, and model evaluation based on how well the hypothesized model explains the variances and covariances of the observed variables. A critical discussion within the LVM literature concerns identifying the appropriate evaluative criteria for judging model fit (Pan and Liu, 2024; Sathyanarayana and Mohanasundaram, 2024; Urban and Bauer, 2021). Among these criteria are the omnibus chi-squared test and a large number of descriptive fit indices, such as the root mean squared error of approximation (RMSEA; Steiger and Lind, 1980), the standardized root mean squared residual (SRMR; Jöreskog and Sörbom, 1981), the comparative fit index (CFI; Bentler, 1990), and the goodness-of-fit index (GFI; Jöreskog and Sörbom, 1982), among others. These indices are essential tools for researchers to assess how well their hypothesized models correspond to the empirical data (Thompson and Wang, 1999; Lei and Lomax, 2005); however, there is no consensus on cutoff values. Earlier recommendations suggested that fit indices should exceed 0.900, while more recent guidelines recommend values above 0.950. The adoption of more stringent criteria followed seminal simulation studies that marked a shift from more lenient thresholds (Doğan and Özdamar, 2017; Hu and Bentler, 1999; Kline, 2015; Schermelleh-Engel et al., 2003) and that aimed at distinguishing “exact” from “close” fit, as noted by MacCallum et al. (1996). Other recommendations point to two-index strategies combining information from descriptive fit indices and residual statistics (e.g., Hu and Bentler, 1999): one index must be the SRMR, with a value ≤0.08, and the second can be any one of the descriptive fit indices, with a value ≥0.950. Additionally, other recommendations strongly advocated the use of 95% confidence intervals (CIs; e.g., Browne and Cudeck, 1993). What confounds the picture further, however, is not just the lack of consensus but also the complexities arising from small sample sizes, actual misfit, model complexity, the presence of missing data, and so on (Marsh et al., 2004). For example, small sample sizes have been reported to be a salient problem, as a majority of studies engage fewer than 200, or even 100, participants (see MacCallum et al., 1999; Mundfrom et al., 2005), with well-documented problems derived from using small samples (Hogarty et al., 2005). In the present study, we aim to assist this process by providing an R function that includes a promising corrective procedure for the GFI and incorporates bootstrapping from the original data so that potential incidences of bias are also examined. The function is freely accessible at https://github.com/GS1968/CFAtools/blob/main/CGFIboot.R.
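
For readers who wish to experiment directly, the function can be loaded into an R session from the repository. A minimal sketch, assuming GitHub’s standard raw-content URL pattern for the repository linked above:

# Source the CGFIboot function from GitHub (raw-content URL assumed from the repository link above).
source("https://raw.githubusercontent.com/GS1968/CFAtools/main/CGFIboot.R")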

The GFI, initially introduced by Jöreskog and Sörbom (1982), has gained widespread acceptance in LVM due to its intuitive interpretation as a measure of how well the model reproduces the observed variance–covariance matrix. Its downside, however, is that, for severely misspecified models, the index tends to provide inflated estimates (Bollen and Stine, 1993; Gerbing and Anderson, 1992; Marsh et al., 1988). Some researchers consider inflation a more serious problem than deflation because inflated estimates jeopardize parsimony and theoretical coherence, and they therefore favor a more conservative approach (Bentler and Bonett, 1980). Concerns about distorted estimates as a function of sample size have also been reported (Gerbing and Anderson, 1992; Marsh et al., 1988). Thus, the extreme sensitivity of the GFI renders it invalid, especially in small-sample studies or nested model comparisons.

Early attempts to mitigate the limitations of the GFI led to the development of adjusted indices, such as the adjusted goodness-of-fit index (AGFI; Jöreskog and Sörbom, 1984) and the parsimony goodness-of-fit index (PGFI; Mulaik et al., 1989), which incorporate penalties for model complexity by adjusting for the degrees of freedom and the number of estimated parameters. Despite these advancements, subsequent studies pointed out that the adjusted indices are not immune to biases, particularly when models are misspecified (Fan and Sivo, 2005; Hu and Bentler, 1998) or sample sizes are small (Shevlin and Miles, 1998). The persistent need for a more robust fit index that can effectively address the biases associated with sample size and model complexity remains a critical issue in the field. Our contribution provides an easy-to-implement correction, proposed by Wang et al. (2020), via an R function that incorporates non-parametric bootstrapping to mitigate the effects of small sample sizes.

1.1 A proposal for a corrective procedure

Recently, Wang et al. (2020) proposed a corrective procedure for the GFI to account for both model complexity and sample size, in light of the known downward trend observed in simulation studies. Using a systematic Monte Carlo simulation that varied sample size, estimation method, model misspecification, and model complexity, Wang et al. (2020) demonstrated that the CGFI outperformed both the GFI and its adjusted counterpart (i.e., the AGFI). The CGFI was found to be more stable across varying sample sizes and more sensitive in detecting model misspecification, thereby overcoming the shortcomings of current fit indices (Kenny and McCoach, 2003; Shevlin and Miles, 1998). This makes the CGFI a notable improvement over past corrections, as it considers both the small-sample downward bias in fit indices and the effects of model complexity in a more systematic way. Whereas the AGFI merely imposes a penalty for the number of degrees of freedom used in the model, and the PGFI has been shown to have low sensitivity to model misspecification, the CGFI has been recommended as a better and more interpretable measure of model fit and misspecification than either, particularly when used with bootstrapping techniques such as the ones employed in our analysis. Not only does this rationale further justify the need for the CGFI method, but it also clearly delineates its improvement over past GFI correction efforts, all of which supports the contributions offered in this study via an easy-to-implement R function. Given that the function is built on the Lavaan package, all estimators available in Lavaan can be applied with the current CGFI function in R.

For the corrected index, the authors propose a cutoff of 0.90 as indicative of acceptable model fit, citing the earlier cutoff criteria. However, given that they benchmark the CGFI against the CFI and the NNFI (Non-Normed Fit Index), the revised criteria (Hu and Bentler, 1999) should likely be applied to the CGFI as well.

1.2 The CGFIboot function in R: description and application

To adjust for the downward bias in fit indices, Wang et al. (2020) proposed adding to the GFI a sample-size correction factor, 1/N, weighted by a model complexity adjusting factor, k(k + 1)/p. Consequently, the corrected version of the GFI (CGFI) is defined as shown in Equation 1:

CGFI = GFI + [k(k + 1)/p] × (1/N)    (1)

where k is the number of observed variables, p is the number of free parameters, and N is the sample size. In LVM, the degrees of freedom of the tested (theoretical) model, dfT, and the number of free parameters (p) have a known functional relationship, shown in Equation 2:

dfT = k(k + 1)/2 − p    (2)

Thus, an alternative form of CGFI is as shown in Equation 3:

CGFI = GFI + {2/[1 − 2dfT/(k(k + 1))]} × (1/N)    (3)

The corrective factor for sample size, 1/N, compensates for the potential downward bias introduced by small samples: when N is small, 1/N is large, resulting in a larger compensation; when N is large, the correction is negligible, as large samples provide robust, population-like estimates. Furthermore, the correction for model complexity, through the factor k(k + 1)/p, accounts for the number of observed variables relative to the number of estimated parameters. Thus, if a model is complex (estimating many parameters relative to the observed variables), the corrective procedure adjusts for model inflation; if a model is simple, the correction is small so as to maintain model integrity in the estimation process.
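
As a minimal illustration of Equation 1, the correction can be computed in a few lines of R. The helper below is ours and purely expository (it is not part of the CGFIboot function), and the input values are hypothetical:

# Expository helper implementing Equation 1 (not part of CGFIboot).
cgfi <- function(gfi, k, p, N) {
  # k = number of observed variables, p = number of free parameters, N = sample size
  gfi + (k * (k + 1) / p) * (1 / N)
}

# Hypothetical one-factor model with 7 indicators (k = 7, p = 14 free parameters).
cgfi(gfi = 0.92, k = 7, p = 14, N = 150)   # correction = 56/(14*150), approximately 0.027
cgfi(gfi = 0.92, k = 7, p = 14, N = 1500)  # correction shrinks to approximately 0.003 as N grows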

The present function builds on the Lavaan package and estimates additional descriptive fit indices, namely the CFI, the TLI (Tucker-Lewis Index), the GFI, the AGFI, the RMSEA, and the SRMR. Additionally, it provides 95% confidence intervals from simulated population distributions using the non-parametric bootstrapping described next. The function accommodates continuous, ordered categorical, and dichotomous data. Missing data are coded with the text “NA,” as the function has been designed to read comma-delimited data (.csv). The user can modify the number of replicated samples as desired. The version of Lavaan we used allowed us to apply the following estimators: ML (Maximum Likelihood), MLR (Robust Maximum Likelihood), MLM (Maximum Likelihood with Mean-Adjusted test statistic), WLSMV (Weighted Least Squares Mean and Variance Adjusted), DWLS (Diagonally Weighted Least Squares), GLS (Generalized Least Squares), ULS (Unweighted Least Squares), and ULSMV (Unweighted Least Squares Mean and Variance Adjusted). The availability of estimators depends on the version of Lavaan used and is thus not limited to those mentioned here.
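
Because the function reads comma-delimited files, the only preparation typically needed is ensuring that missing values appear as the text “NA”. A minimal sketch (the file name is hypothetical; na.strings is shown explicitly, although it is read.csv’s default):

# Read a comma-delimited file; "NA" entries are treated as missing.
data <- read.csv("mydata.csv", header = TRUE, na.strings = "NA")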

The non-parametric bootstrapping method involves resampling with replacement from the original sample data to form a number of pseudo-samples (bootstrap samples). These samples simulate the population distribution of a statistic, in this case the fit indices (e.g., the CGFI, CFI, TLI, and RMSEA). Non-parametric bootstrapping is a particularly powerful type of resampling because, unlike parametric bootstrapping, it does not presume any distribution for the data (Efron and Tibshirani, 1994), which makes the method robust for approximating sampling distributions when the underlying population distribution does not follow a known form. Thus, the advantages of the procedure are its distribution-free nature, its flexibility, and its accuracy with small samples (Davison and Hinkley, 1997). Its disadvantages are computational intensity; sampling bias, whereby sample-based idiosyncrasies are amplified, creating biased approximations of the population distribution; and difficulties of interpretation when distributions are severely skewed or multimodal (Chernick, 2008). For example, with a small sample, some observations (e.g., outliers) may be replicated more frequently, reducing variability in the bootstrap samples and thus affecting the validity of the estimated parameters. Given that uncertainty is captured in the estimates of the 95% confidence intervals, non-parametric bootstrapping with small samples may result in overly optimistic (narrow) confidence intervals. Future updates of the function will include alternatives to non-parametric bootstrapping.
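
To make the resampling logic concrete, the sketch below shows the core of a non-parametric bootstrap of fit indices using the Lavaan package. The function name and the set of returned indices are illustrative and simplified relative to the actual CGFIboot internals (e.g., non-converged replications are not handled here):

# Minimal sketch of a non-parametric bootstrap of fit indices (simplified; not the CGFIboot internals).
library(lavaan)

boot_fit_indices <- function(data, model, R = 1000, estimator = "MLR") {
  replicate(R, {
    idx <- sample(nrow(data), replace = TRUE)  # resample cases with replacement
    fit <- cfa(model, data = data[idx, ], estimator = estimator)
    fitMeasures(fit, c("chisq", "gfi", "agfi", "cfi", "tli", "rmsea", "srmr"))
  })
}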

The output displays not only the original sample estimates (the lengthy Lavaan output) but also the averaged estimates from the bootstrap distributions based on a user-defined number of replicated samples (1,000 in the present instances). Discrepancies between the mean bootstrap estimates and those from the sample are likely indicative of elevated sampling variability, which may limit generalization to the true population. For that purpose, standard deviations of the bootstrap distributions were added. Thus, our goal in displaying both the original sample estimates and those from the bootstrap distributions is to highlight potential concerns about the generalizability of findings. As MacCallum et al. (1996) suggested, the inherent value of bootstrapping lies in the use of the 95% confidence intervals of the bootstrap distribution, in which the upper bound (for the RMSEA) and the lower bound (for the fit indices) may point to a model with unacceptable fit, given the uncertainty introduced by the simulated population distribution.
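
Given the matrix returned by the sketch above, the reported summaries for any single index follow directly; the interval below is the simple percentile interval, which we use here purely for illustration:

# Summarize the bootstrap distribution of, e.g., the GFI (mean, SD, and percentile 95% CI).
res <- boot_fit_indices(data, model)            # matrix of indices by replication (see sketch above)
gfi_boot <- res["gfi", ]
c(mean = mean(gfi_boot), sd = sd(gfi_boot))
quantile(gfi_boot, probs = c(0.025, 0.975))     # a low lower bound flags questionable fit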

1.2.1 Example 1: evaluating the instructional leadership scale in PISA 2022

Using the data from the Programme for International Student Assessment (PISA) 2022, the CGFIboot function was applied to the measurement of instructional leadership using a sample of 193 principals from Saudi Arabia (data are available at: https://www.oecd.org/en/about/programmes/pisa.html#data). The instrument collects data from principals, who are asked how often they or another member of their school management team participated in or conducted teaching related to instructional leadership during the past 12 months. The scale includes seven items, with sample content as follows: “How often do you collaborate with teachers to solve classroom discipline problems?” or “How often do you work on a professional development plan?” Responses were given on five ordered categories ranging from “Never or almost never” to “Every day or almost every day.” The analysis was run with the maximum likelihood estimator with robust standard errors, which has been recommended for Likert-type data with at least five response categories, given its adjustments to the test statistic for non-normality (Beauducel and Herzberg, 2006; Muthén and Asparouhov, 2002; Rhemtulla et al., 2012). Internal consistency reliability was 0.846 using Cronbach’s alpha and 0.850 using McDonald’s omega. Using a principal components analysis (PCA), a single component was extracted using the eigenvalue >1 criterion, explaining 52.287% of the items’ variance. The Kaiser–Meyer–Olkin (KMO) test of sampling adequacy was equal to 0.854, a good value for the amount of common variance, making the data appropriate for factor analysis. Furthermore, Bartlett’s test of sphericity was significant, suggesting that the item-factor correlations are substantial for use with the factor model [χ2(21) = 491.874, p < 0.001]. After applying the CGFIboot function, the results indicated rejection of model fit by the omnibus chi-squared test and, similarly, by the descriptive fit indices under the more recent criterion of 0.950 (Table 1). Densities of the fit indices, along with 95% confidence bands, are shown in Figure 1. Interestingly, the GFI was equal to 0.945 (rejected using strict standards), but the corrective procedure resulted in a CGFI estimate of 0.950. Thus, in this example, the CGFI would indicate a good fit of the data to the model, compared to the other fit indices, demonstrating the adjustment for small sample sizes and the correction for the downward trend observed earlier (Wang et al., 2020). Additional information provided by the confidence intervals and the estimated standard deviation of the bootstrap distribution suggested that the sample may be idiosyncratic and highly variable, and may not generalize to a coherent population. This variability is evident in the mean chi-squared estimate from the bootstrap distribution, which increased from 39 to 59 units. The uncertainty of the sample estimates is also evident in the 95% confidence intervals, which clearly include lower-bound estimates of fit that are not acceptable.

Table 1. CGFIboot function indicators for original and bootstrapped estimates in confirmatory factor analysis for the measurement of leadership in the Programme for International Student Assessment (PISA) 2022.

Figure 1. Densities of bootstrap distributions for the chi-squared statistic and the following fit indices: CFI, TLI, GFI, AGFI, RMSEA, SRMR, and CGFI, along with 95% confidence intervals. Densities are based on 1,000 replicated samples in the measurement of instructional leadership.

1.2.2 Example 2: examining the psychometrics of bullying in PIRLS 2021

Using the data from the Progress in International Reading Literacy Study (PIRLS) 2021, the bullying scale was selected for its solid psychometric properties across countries (data are available at: https://www.iea.nl/data-tools/repository/pirls). In the present study, we utilized data from 4,540 fourth graders from Saudi Arabia who were part of the 2021 cohort. The bullying scale comprises 14 items employing a frequency-based response scale ranging between “never” and “at least once a week.” Example items were “They said mean things about me” or “Spread lies about me.” The internal consistency of the scale was 0.895 using both Cronbach’s alpha and McDonald’s omega coefficients. Using the exploratory PCA procedure and the eigenvalue >1 criterion, a single component was extracted, explaining 43.503% of the items’ variance. The KMO test of sampling adequacy was equal to 0.946, a very high value for the amount of common variance, making the data appropriate for factor analysis. Furthermore, Bartlett’s test of sphericity was significant, suggesting that the item-factor correlations are substantial for use with the factor model [χ2(91) = 23190.428, p < 0.001].

The CGFIboot function was applied with the data declared as ordered and using the WLSMV estimator. Densities of the fit indices, along with 95% confidence bands, are shown in Figure 2 for the measurement of bullying. As shown in Table 2, the sample-based data fit the model adequately. Given the large sample size, as expected, the omnibus chi-squared test of exact fit was significant [χ2(77) = 590.326, p < 0.001]; therefore, preference was given to the fit indices. All of them were >0.950, indicating an excellent fit of the data to the model. When viewing the bootstrap estimates, the descriptive fit indices were virtually identical to the sample-based values, with only minor deviations observed in the residual-based indices, all within sampling error. Bootstrap confidence intervals were narrow and closely aligned with their point estimates. Therefore, for the measurement of bullying, all of the information collectively suggests good model fit and confidence that the sample-based estimates accurately reflect the population values. This finding is rather intuitive given that sample stratification and randomization procedures are closely monitored in international studies such as PIRLS.
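
For ordered data, the indicators can be declared as ordered factors before fitting, the standard Lavaan convention preceding WLSMV estimation; the column names below are hypothetical:

# Declare the 14 bullying indicators as ordered categorical before model fitting.
data[paste0("y", 1:14)] <- lapply(data[paste0("y", 1:14)], ordered)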

Figure 2. Densities of bootstrap distributions for the chi-squared statistic and the following fit indices: CFI, TLI, GFI, AGFI, RMSEA, SRMR, and CGFI, along with 95% confidence intervals. Densities are based on 1,000 replicated samples in the measurement of bullying.

Table 2. CGFIboot function indicators for original and bootstrapped estimates in confirmatory factor analysis for the measurement of bullying in PIRLS 2021.

2 Conclusion and future directions

Overall, the CGFIboot function represents a valuable solution to some of the limitations inherent in the use of classic fit indices within the LVM framework, especially with small samples and complex models. Through the use of a correction factor and non-parametric bootstrapping to calculate bias-corrected confidence intervals (CIs) for the CGFI, its application to both small- and large-sample datasets demonstrated the robustness of the CGFI.

While the CGFI represents a significant improvement over traditional goodness-of-fit measures, particularly in addressing sample size sensitivity and model complexity, it is not without limitations. Although the CGFI introduces a correction factor for model complexity, its performance in very large models with high-dimensional structures needs further empirical validation. In high-dimensional SEM models (e.g., those involving multiple latent constructs, hierarchical structures, or many free parameters), the correction factor applied in the CGFI may not fully compensate for the distortions caused by model complexity. In simulation studies, Wang et al. (2020) demonstrated that the CGFI performed well across a range of model complexities. However, for models with a very high number of parameters relative to the degrees of freedom, the index might not penalize complexity as strongly as expected, leading to inflated fit estimates. Furthermore, the index has not been tested with longitudinal designs and nested data, where additional complexity is introduced by variability in intraclass correlation coefficients and the presence of model constraints. Therefore, we recommend, as is customary, that the CGFI not be used in isolation but rather be combined with other descriptive fit indices, such as the CFI, RMSEA, and SRMR, which may better capture aspects of model fit that the CGFI alone cannot detect. For example, in models with high correlations among latent variables, the CGFI might overestimate model fit, while the RMSEA and SRMR might indicate significant model misfit.

While the results highlight the function’s potential to enhance our “conclusion validity,” further testing with diverse datasets and across a wider range of contexts, including different types of data, model complexities, and sample sizes, is needed to assess its robustness and generalizability. Comparative studies in which the CGFI is evaluated against other emerging fit indices under a variety of conditions could offer important information about its strengths and weaknesses. In addition, extending the scope of this function to longitudinal and multi-group LVM models might open new avenues for fit evaluation in more complex research designs. Finally, improving method transparency and replicability in LVM will require more user-friendly interfaces and packages that enable researchers to easily implement procedures such as the present R function, which is a step in that direction.

Data availability statement

Publicly available datasets were analyzed in this study. This data can be found at: https://www.iea.nl/data-tools/repository/pirls and https://www.oecd.org/en/about/programmes/pisa.html#data.

Ethics statement

Ethical approval was not required for the study involving humans in accordance with the local legislation and institutional requirements. Written informed consent to participate in this study was not required from the participants or the participants’ legal guardians/next of kin in accordance with the national legislation and the institutional requirements.

Author contributions

GS: Conceptualization, Formal analysis, Investigation, Software, Writing – original draft, Writing – review & editing. MA: Data curation, Funding acquisition, Resources, Software, Validation, Writing – review & editing.

Funding

The author(s) declare that financial support was received for the research and/or publication of this article. This study was funded by Researchers Supporting Project number (RSPD2025R601), King Saud University, Riyadh, Saudi Arabia.

Acknowledgments

We appreciate the support of the Researchers Supporting Project.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The authors declare that no Gen AI was used in the creation of this manuscript.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Beauducel, A., and Herzberg, P. Y. (2006). On the performance of maximum likelihood versus means and variance adjusted weighted least squares estimation in CFA. Struct. Equ. Model. Multidiscip. J. 13, 186–203. doi: 10.1207/s15328007sem1302_2

Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychol. Bull. 107, 238–246. doi: 10.1037/0033-2909.107.2.238

Bentler, P. M., and Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychol. Bull. 88, 588–606. doi: 10.1037/0033-2909.88.3.588

Bollen, K. A., and Stine, R. A. (1993). Bootstrapping goodness-of-fit measures in structural equation models. Sociol. Methods Res. 21, 205–229. doi: 10.1177/0049124192021002004

Browne, M. W., and Cudeck, R. (1993). “Alternative ways of assessing model fit” in Testing structural equation models. eds. K. A. Bollen and J. S. Long (London: SAGE Publications), 136–162.

Chernick, M. R. (2008). Bootstrap methods: A guide for practitioners and researchers. Hoboken, NJ: John Wiley & Sons.

Davison, A. C., and Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.

Doğan, I., and Özdamar, K. (2017). The effect of different data structures, sample sizes on model fit measures. Commun. Stat. Simulation Comput. 46, 7525–7533. doi: 10.1080/03610918.2016.1241409

Efron, B., and Tibshirani, R. J. (1994). An introduction to the bootstrap. London: CRC Press.

Fan, X., and Sivo, S. A. (2005). Sensitivity of fit indices to misspecified structural or measurement model components: rationale of two-index strategy revisited. Struct. Equ. Model. 12, 343–367. doi: 10.1207/s15328007sem1203_1

Gerbing, D. W., and Anderson, J. C. (1992). Monte Carlo evaluations of goodness of fit indices for structural equation models. Sociol. Methods Res. 21, 132–160. doi: 10.1177/0049124192021002002

Hogarty, K. Y., Hines, C. V., Kromrey, J. D., Ferron, J. M., and Mumford, K. R. (2005). The quality of factor solutions in exploratory factor analysis: the influence of sample size, communality, and over determination. Educ. Psychol. Meas. 65, 202–226. doi: 10.1177/0013164404267287

Hu, L. T., and Bentler, P. M. (1998). Fit indices in covariance structure modeling: sensitivity to under parameterized model misspecification. Psychol. Methods 3, 424–453. doi: 10.1037/1082-989X.3.4.424

Hu, L. T., and Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: conventional criteria versus new alternatives. Struct. Equ. Model. Multidiscip. J. 6, 1–55. doi: 10.1080/10705519909540118

Jöreskog, K. G., and Sörbom, D. (1981). LISREL V: analysis of linear structural relationships by the method of maximum likelihood. Chicago, IL: National Educational Resources.

Jöreskog, K. G., and Sörbom, D. (1982). Recent developments in structural equation modeling. J. Mark. Res. 19, 404–416. doi: 10.1177/002224378201900402

Jöreskog, K. G., and Sörbom, D. (1984). LISREL VI: analysis of linear structural relationships by the method of maximum likelihood. Chicago, IL: National Educational Resources.

Kenny, D. A., and McCoach, D. B. (2003). Effect of the number of variables on measures of fit in structural equation modeling. Struct. Equ. Model. 10, 333–351. doi: 10.1207/S15328007SEM1003_1

Kline, R. B. (2015). Principles and practice of structural equation modeling. 4th Edn. London: Guilford Press.

Lei, M., and Lomax, R. G. (2005). The effect of varying degrees of non-normality in structural equation modeling. Struct. Equ. Model. 12, 1–27. doi: 10.1207/s15328007sem1201_1

MacCallum, R. C., Browne, M. W., and Sugawara, H. M. (1996). Power analysis and determination of sample size for covariance structure modeling. Psychol. Methods 1, 130–149. doi: 10.1037/1082-989X.1.2.130

MacCallum, R. C., Widaman, K. F., Zhang, S., and Hong, S. (1999). Sample size in factor analysis. Psychol. Methods 4, 84–99. doi: 10.1037/1082-989X.4.1.84

Marsh, H. W., Balla, J. R., and McDonald, R. P. (1988). Goodness-of-fit indexes in confirmatory factor analysis: the effect of sample size. Psychol. Bull. 103, 391–410. doi: 10.1037/0033-2909.103.3.391

Marsh, H. W., Hau, K. T., and Wen, Z. (2004). In search of golden rules: comment on hypothesis-testing approaches to setting cutoff values for fit indexes and dangers in overgeneralizing Hu and Bentler’s (1999) findings. Struct. Equ. Model. Multidiscip. J. 11, 320–341. doi: 10.1207/s15328007sem1103_2

Mulaik, S. A., James, L. R., Van Alstine, J., Bennett, N., Lind, S., and Stilwell, C. D. (1989). Evaluation of goodness-of-fit indices for structural equation models. Psychol. Bull. 105, 430–445. doi: 10.1037/0033-2909.105.3.430

Mundfrom, D. J., Shaw, D. G., and Ke, T. L. (2005). Minimum sample size recommendations for conducting factor analyses. Int. J. Test. 5, 159–168. doi: 10.1207/s15327574ijt0502_4

Muthén, B., and Asparouhov, T. (2002). Latent variable analysis with categorical outcomes: multiple-group and growth modeling in Mplus. Mplus Web Notes 4, 1–22.

Pan, F., and Liu, Q. (2024). Evaluating fit indices in multilevel latent growth models with unbalanced designs. Front. Psychol. 15:1366850. doi: 10.3389/fpsyg.2024.1366850

Rhemtulla, M., Brosseau-Liard, P. É., and Savalei, V. (2012). When can ordinal variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods. Psychol. Methods 17, 354–373. doi: 10.1037/a0029315

Sathyanarayana, S., and Mohanasundaram, T. (2024). Fit indices in structural equation modeling and confirmatory factor analysis: reporting guidelines. Asian J. Econ. Bus. Account. 24, 561–577. doi: 10.9734/ajeba/2024/v24i71430

Schermelleh-Engel, K., Moosbrugger, H., and Müller, H. (2003). Evaluating the fit of structural equation models: tests of significance and descriptive goodness-of-fit measures. Methods Psychol. Res. Online 8, 23–74.

Shevlin, M., and Miles, J. N. V. (1998). Effects of sample size, model specification, and factor loadings on the GFI in confirmatory factor analysis. Personal. Individ. Differ. 25, 85–90. doi: 10.1016/S0191-8869(98)00055-5

Steiger, J. H., and Lind, J. C. (1980). “Statistically based tests for the number of common factors,” in Paper Presented at the Annual Meeting of the Psychometric Society, Iowa City, IA.

Thompson, B., and Wang, L. (1999). Effects of sample size, estimation methods, and model specification on structural equation modeling fit indexes. Struct. Equ. Model. 6, 56–83. doi: 10.1080/10705519909540119

Urban, C. J., and Bauer, D. J. (2021). Deep learning-based estimation and goodness-of-fit for large-scale confirmatory item factor analysis. Psychometrika 86, 873–900. doi: 10.1007/s11336-021-09800-0

Wang, K., Xu, Y., Wang, C., Tan, M., and Chen, P. (2020). A corrected goodness-of-fit index (CGFI) for model evaluation in structural equation modeling. Struct. Equ. Model. 27, 735–749. doi: 10.1080/10705511.2019.1695213

Appendix A

Examples using the CGFI function in R

Example 1: Instructional leadership

# Load the data.
data <- read.csv("data.csv", header = TRUE)

# Define your LVM model in Lavaan.
model <- 'F1 =~ item1 + item2 + item3 + item4 + item5 + item6 + item7'

# Call the function.
perform_cfa_with_bootstrap(data, model, estimator = "MLR", data_type = "continuous", n_bootstrap = 1000)

Example 2: Bullying

# Load the data.
data <- read.csv("data.csv")

# Define your LVM model in Lavaan (factor variance fixed to 1 so that all loadings are freely estimated).
model <- 'F1 =~ NA*y1 + y2 + y3 + y4 + y5 + y6 + y7 + y8 + y9 + y10 + y11 + y12 + y13 + y14
          F1 ~~ 1*F1'

# Call the function.
perform_cfa_with_bootstrap(data, model, estimator = "WLSMV", data_type = "ordered", n_bootstrap = 1000)

Keywords: structural equation modeling, factor analysis, descriptive fit indices, model fit, small sample sizes, bootstrapping, R function

Citation: Sideridis G and Alghamdi M (2025) Corrected goodness-of-fit index in latent variable modeling using non-parametric bootstrapping. Front. Psychol. 16:1562305. doi: 10.3389/fpsyg.2025.1562305

Received: 17 January 2025; Accepted: 03 March 2025;
Published: 26 March 2025.

Edited by:

Gudberg K. Jonsson, University of Iceland, Iceland

Reviewed by:

Shaojie Wang, Guangdong University of Education, China
Mónica Gabriela Gutiérrez-Hernández, Technological Institute of Ciudad Juárez, Mexico

Copyright © 2025 Sideridis and Alghamdi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Georgios Sideridis, georgios.sideridis@childrens.harvard.edu
