AUTHOR=Rutledge Douglas N., Roger Jean-Michel, Lesnoff Matthieu

TITLE=Different Methods for Determining the Dimensionality of Multivariate Models

JOURNAL=Frontiers in Analytical Science

VOLUME=Volume 1 - 2021

YEAR=2021

URL=https://www.frontiersin.org/journals/analytical-science/articles/10.3389/frans.2021.754447

DOI=10.3389/frans.2021.754447

ISSN=2673-9283

ABSTRACT=A tricky aspect in the use of all multivariate analysis methods is the choice of the number of Latent Variables to use in the model, whether in the case of exploratory methods such as Principal Components Analysis (PCA) and Independent Components Analysis (ICA), or predictive methods such as Principal Components Regression (PCR), Partial Least Squares regression (PLS) or PLS Discriminant Analysis (PLS-DA). For exploratory methods, we want to know which Latent Variables deserve to be selected for interpretation and which contain only noise. For predictive methods, we want to ensure that we include all the variability of interest for the prediction, without introducing variability that would lead to a reduction in the quality of the predictions for samples other than those used to create the multivariate model.
In the case of predictive methods, the most common procedure for determining the number of Latent Variables to use in the model is Cross Validation, which is based on the difference between the vector of observed values, y, and the vector of predicted values, ŷ.
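As an illustration of how Cross Validation compares y and ŷ, the following minimal sketch computes a root mean squared error of cross validation (RMSECV) for each candidate number of Latent Variables and keeps the smallest. It assumes the cross-validated predictions have already been produced by some model; the function names `rmsecv` and `select_n_lv` and all numbers are hypothetical and are not taken from the article.

```python
import math

def rmsecv(y_obs, y_pred):
    """Root mean squared error between observed and cross-validated values."""
    if len(y_obs) != len(y_pred):
        raise ValueError("y_obs and y_pred must have the same length")
    # PRESS: predicted residual error sum of squares
    press = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    return math.sqrt(press / len(y_obs))

def select_n_lv(y_obs, cv_predictions):
    """Pick the number of Latent Variables with the lowest RMSECV.

    cv_predictions maps a number of Latent Variables to the vector of
    cross-validated predictions obtained with that many Latent Variables.
    """
    errors = {k: rmsecv(y_obs, pred) for k, pred in cv_predictions.items()}
    # Iterating over sorted keys means ties go to the smaller model.
    best = min(sorted(errors), key=lambda k: errors[k])
    return best, errors

# Made-up values, for illustration only
y = [1.0, 2.0, 3.0, 4.0]
cv_pred = {
    1: [1.5, 2.5, 2.5, 3.5],
    2: [1.1, 1.9, 3.1, 3.9],
    3: [1.2, 1.8, 3.2, 3.8],
}
best_k, errs = select_n_lv(y, cv_pred)
```

Taking the plain minimum is only one possible decision rule; in practice the RMSECV curve is often flat near its minimum, which is one reason complementary criteria are useful.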
In this article, we will first present this procedure and its extensions, and then other methods based on entirely different principles. Many of these methods may also apply to exploratory methods.
These alternatives to Cross Validation include methods based on the characteristics of the regression coefficient vectors, such as the Durbin-Watson Criterion, the Morphological Factor, the Variance or Norm, and the repeatability of the vectors calculated on random subsets of the individuals. Another group of methods is based on characterizing the structure of the X matrices after successive deflations.
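To give a sense of what a criterion based on the shape of a regression coefficient vector looks like, the sketch below computes a Durbin-Watson type statistic. The abstract does not state the formula, so the form used here (sum of squared differences between successive coefficients divided by the sum of squared coefficients) is an assumption, and the function name `durbin_watson` is hypothetical.

```python
def durbin_watson(b):
    """Durbin-Watson type statistic for a sequence of coefficients.

    Assumed form: sum of squared successive differences divided by the
    sum of squares. Smooth, structured vectors give values near 0;
    vectors dominated by uncorrelated noise give values near 2.
    """
    if len(b) < 2:
        raise ValueError("at least two coefficients are required")
    numerator = sum((b[i] - b[i - 1]) ** 2 for i in range(1, len(b)))
    denominator = sum(v * v for v in b)
    if denominator == 0:
        raise ValueError("coefficient vector is identically zero")
    return numerator / denominator

# Made-up vectors, for illustration only
smooth = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]   # slowly varying coefficients
jagged = [1, -1, 1, -1]                   # sign flips at every variable
dw_smooth = durbin_watson(smooth)
dw_jagged = durbin_watson(jagged)
```

Under this assumption, a marked rise in the statistic as Latent Variables are added would suggest that the coefficient vector is starting to incorporate noise.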
A multitude of indicators is available, since no single criterion (even the classical Cross Validation) works perfectly in all cases. This article proposes an empirical method to facilitate the final choice of the number of Latent Variables. A set of indicators is chosen, and their evolution as a function of the number of Latent Variables extracted is summarized by a Principal Components Analysis. The set of criteria chosen here is not exhaustive, and the efficacy of the method could be improved by including others.
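The idea of summarizing several indicators with a Principal Components Analysis can be sketched as follows. This is not the authors' implementation: it assumes a table with one row per number of Latent Variables and one column per indicator, standardizes each indicator, and extracts only the first principal component by power iteration. The function names and all indicator values are hypothetical.

```python
import math

def standardize(columns):
    """Centre each indicator and scale it to unit variance."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
        if sd == 0:
            raise ValueError("a constant indicator cannot be standardized")
        out.append([(v - mean) / sd for v in col])
    return out

def first_pc(columns, n_iter=500):
    """Scores and loadings of the first principal component.

    Uses power iteration on the cross-product matrix of the standardized
    indicators. Convergence can fail if the start vector happens to be
    orthogonal to the leading direction; this sketch does not handle that.
    """
    z = standardize(columns)
    n_cols, n_rows = len(z), len(z[0])
    cross = [[sum(z[a][i] * z[b][i] for i in range(n_rows))
              for b in range(n_cols)] for a in range(n_cols)]
    # Non-uniform start vector, normalized to unit length
    p = [float(a + 1) for a in range(n_cols)]
    norm = math.sqrt(sum(v * v for v in p))
    p = [v / norm for v in p]
    for _ in range(n_iter):
        q = [sum(cross[a][b] * p[b] for b in range(n_cols))
             for a in range(n_cols)]
        norm = math.sqrt(sum(v * v for v in q))
        if norm == 0:
            raise ValueError("power iteration collapsed to the zero vector")
        p = [v / norm for v in q]
    scores = [sum(z[a][i] * p[a] for a in range(n_cols))
              for i in range(n_rows)]
    return scores, p

# Made-up indicator values for 1 to 5 Latent Variables, illustration only
indicators = [
    [0.90, 0.50, 0.30, 0.32, 0.35],   # e.g. a cross-validation error
    [0.10, 0.20, 0.30, 1.20, 1.80],   # e.g. a Durbin-Watson type value
    [1.0, 2.0, 3.0, 8.0, 15.0],       # e.g. a coefficient vector norm
]
scores, loadings = first_pc(indicators)
```

Each score then places one candidate number of Latent Variables on a single axis that combines all indicators, so an abrupt change along that axis can be inspected instead of several separate curves. The sign of the component is arbitrary.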