AUTHOR=Chao Yi-Sheng, Wu Hsing-Chien, Wu Chao-Jung, Chen Wei-Chih

TITLE=Principal Component Approximation and Interpretation in Health Survey and Biobank Data

JOURNAL=Frontiers in Digital Humanities

VOLUME=Volume 5 - 2018

YEAR=2018

URL=https://www.frontiersin.org/journals/digital-humanities/articles/10.3389/fdigh.2018.00011

DOI=10.3389/fdigh.2018.00011

ISSN=2297-2668

ABSTRACT=<p><bold>Background:</bold> Increasing numbers of variables are being created in surveys and administrative databases. Principal component analysis (PCA) is important for summarizing data and reducing dimensionality. However, one disadvantage of PCA is the limited interpretability of the principal components (PCs), especially in high-dimensional databases. By analyzing the variance distribution according to PCA loadings and approximating PCs with input variables, we aim to demonstrate the importance of variables based on the proportions of total variance contributed or explained by input variables.</p><p><bold>Methods:</bold> Five data sets of various sizes were used to assess the performance of PC approximation: Hitters, the SF-12v2 subset of the 2004–2011 Medical Expenditure Panel Survey (MEPS), the full set of 1996–2011 MEPS data, and two data sets derived from the Canadian Health Measures Survey (CHMS): a spirometry subset with the measures from the first trial of spirometry and a full data set that contained non-redundant variables. The variables in each data set were centered and scaled before PCA. PCs were approximated through two approaches. First, the PC loadings were squared to estimate the variance contributed by each variable to the PCs. Second, forward-stepwise regression was used to approximate PCs with all input variables.</p><p><bold>Results:</bold> The first few PCs had large variances in each data set. Approximating PCs using stepwise regression identified the input variables that explain large portions of PC variances more efficiently than approximating according to PCA loadings.
Fewer variables were required to explain more than 80% of the PC variances through stepwise regression.</p><p><bold>Conclusion:</bold> Approximating and interpreting PCs with stepwise regression is highly feasible. PC approximation is useful to (1) interpret PCs with input variables, (2) understand the major sources of variance in data sets, (3) select unique sources of information, and (4) search and rank input variables according to the proportions of PC variance explained. This approach can be used to systematically understand databases and to search for the variables that are important to them.</p>
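<p>The two approximation approaches named in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the synthetic data, and the use of the 80% figure as a stopping threshold are assumptions made for illustration, and only NumPy is used.</p>

```python
import numpy as np

def standardize(X):
    """Center and scale each column (assumed to match 'centered and scaled')."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pca(Z):
    """PCA of standardized data via SVD: returns loadings, scores, PC variances."""
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt.T                      # one unit-norm column per PC
    scores = Z @ loadings                # PC scores
    variances = s ** 2 / (Z.shape[0] - 1)
    return loadings, scores, variances

def rank_by_squared_loadings(loadings, pc, threshold=0.8):
    """Variables whose squared loadings on one PC cumulatively reach threshold."""
    sq = loadings[:, pc] ** 2            # squared loadings of a PC sum to 1
    order = np.argsort(sq)[::-1]
    cum = np.cumsum(sq[order])
    k = int(np.searchsorted(cum, threshold)) + 1
    return order[:k].tolist()

def r_squared(Z_sub, y):
    """R^2 of a least-squares fit; no intercept because Z and y are centered."""
    coef, *_ = np.linalg.lstsq(Z_sub, y, rcond=None)
    resid = y - Z_sub @ coef
    return 1.0 - (resid @ resid) / (y @ y)

def forward_stepwise(Z, y, threshold=0.8):
    """Greedily add the variable that most increases R^2 until threshold is met."""
    selected, remaining, best_r2 = [], list(range(Z.shape[1])), 0.0
    while remaining and best_r2 < threshold:
        best_r2, j = max((r_squared(Z[:, selected + [j]], y), j) for j in remaining)
        selected.append(j)
        remaining.remove(j)
    return selected, best_r2

# Hypothetical synthetic data: six correlated variables driven by two latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(200, 6))

Z = standardize(X)
loadings, scores, variances = pca(Z)
by_loadings = rank_by_squared_loadings(loadings, pc=0)
by_stepwise, r2 = forward_stepwise(Z, scores[:, 0])
print("variables chosen by squared loadings:", by_loadings)
print("variables chosen by stepwise regression:", by_stepwise, "R^2 =", round(r2, 3))
```

<p>Because each PC is an exact linear combination of the standardized inputs, the stepwise fit reaches an R² of 1 once every variable has been added; the sketch simply stops at the first subset that passes the chosen threshold.</p>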