Gene-based Analysis of Genetic Association with HDL: Application of Variance Inflation Factors and Ridge Regression
This paper reports a gene-based analysis of genetic association with high-density lipoprotein (HDL). Low HDL is a risk factor for cardiovascular disease in the general population. The study has two objectives: (1) gene-based testing using multiple linear regression on multiple SNPs within a gene, and (2) assessing the full multiple regression model using variance inflation factors (VIFs) and ridge regression. The gene of interest is CETP, because it has a known association with HDL in the general population and has been reported in a previous analysis of the type 1 diabetes population (Teslovich et al., 2010; Yoo et al., 2017). When SNPs within a gene are correlated, multiple linear regression may suffer from multicollinearity. To assess multicollinearity, a VIF was calculated for each SNP, and SNPs with large VIFs were removed from the full model, which included all CETP SNPs as covariates, to obtain a reduced model. Ridge regression is another way to address multicollinearity: it penalizes the size of the regression coefficients and thereby stabilizes the estimates for correlated SNPs. Using cross-validation in R, we selected the tuning parameter lambda that minimized the mean squared error (MSE). For the VIF analysis, the global tests of association in the full and reduced models for CETP were not very different, and inference from the full model (i.e., the global test) appears to be unaffected by the level of multicollinearity among the CETP SNPs. For ridge regression, the MSE at the selected lambda was approximately the same as the MSE of the full model, which implies that the full model was unaffected by multicollinearity.
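To make the two diagnostics concrete, the following is a minimal sketch and not the study's analysis, which was carried out in R. It uses only the Python standard library and simulated genotypes; the sample size, allele frequency, effect size, lambda grid, and all function names are illustrative assumptions. It computes VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing SNP j on the other SNPs, and compares the cross-validated MSE of ridge fits over a small grid of lambda values, with lambda = 0 corresponding to the full least-squares model.

```python
import random

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ridge_fit(X, y, lam):
    """Minimize RSS + lam * sum(beta_j^2); the intercept is unpenalized. lam=0 is OLS."""
    n, p = len(X), len(X[0])
    xm = [sum(r[j] for r in X) / n for j in range(p)]
    ym = sum(y) / n
    Xc = [[r[j] - xm[j] for j in range(p)] for r in X]  # centered covariates
    yc = [v - ym for v in y]
    A = [[sum(r[j] * r[k] for r in Xc) + (lam if j == k else 0.0)
          for k in range(p)] for j in range(p)]
    b = [sum(r[j] * v for r, v in zip(Xc, yc)) for j in range(p)]
    beta = solve(A, b)
    b0 = ym - sum(bj * m for bj, m in zip(beta, xm))
    return b0, beta

def predict(X, b0, beta):
    return [b0 + sum(bj * xj for bj, xj in zip(beta, r)) for r in X]

def vif(X):
    """VIF_j = 1 / (1 - R_j^2) = SS_tot / SS_res from regressing column j on the rest."""
    out = []
    for j in range(len(X[0])):
        target = [r[j] for r in X]
        others = [[v for k, v in enumerate(r) if k != j] for r in X]
        b0, beta = ridge_fit(others, target, 0.0)
        fitted = predict(others, b0, beta)
        mean = sum(target) / len(target)
        ss_res = sum((t - f) ** 2 for t, f in zip(target, fitted))
        ss_tot = sum((t - mean) ** 2 for t in target)
        out.append(ss_tot / ss_res)
    return out

def cv_mse(X, y, lam, k=5):
    """k-fold cross-validated mean squared prediction error for one lambda."""
    n = len(X)
    sq_err = 0.0
    for fold in range(k):
        test = set(range(fold, n, k))
        X_tr = [X[i] for i in range(n) if i not in test]
        y_tr = [y[i] for i in range(n) if i not in test]
        b0, beta = ridge_fit(X_tr, y_tr, lam)
        for i in test:
            sq_err += (y[i] - predict([X[i]], b0, beta)[0]) ** 2
    return sq_err / n

# Simulated data (assumption): three SNPs coded 0/1/2; SNP 2 is in strong LD with SNP 1.
rng = random.Random(0)

def genotype(maf=0.3):
    return sum(rng.random() < maf for _ in range(2))

X, y = [], []
for _ in range(300):
    g1 = genotype()
    g2 = g1 if rng.random() < 0.9 else genotype()
    g3 = genotype()
    X.append([float(g1), float(g2), float(g3)])
    y.append(1.2 + 0.1 * g1 + rng.gauss(0.0, 0.3))  # hypothetical HDL-like trait

vifs = vif(X)
lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
mse_by_lambda = {lam: cv_mse(X, y, lam) for lam in lambdas}
best_lam = min(mse_by_lambda, key=mse_by_lambda.get)

print("VIFs:", [round(v, 2) for v in vifs])
print("CV MSE by lambda:", {lam: round(m, 4) for lam, m in mse_by_lambda.items()})
print("lambda with minimum CV MSE:", best_lam)
```

In this sketch, a minimum cross-validated MSE close to the MSE at lambda = 0 would mirror the comparison described above: shrinkage adds little, so the unpenalized full model is not being noticeably degraded by the correlation among SNPs. The lambda values here are on the scale of the unstandardized penalty in `ridge_fit` and are not directly comparable to lambda values reported by R packages, which may scale the objective differently.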