Linear regression methods serve as a frequent launching point for those venturing into predictive analytics. Their roots lie in classical statistics (Francis Galton, cousin of the famed Charles Darwin, wrote of “regression towards mediocrity” as far back as 1886), making them among the most heavily investigated and well-understood statistical methods. Their ease of interpretation makes them incredibly powerful in the context of controlled experiments, and their inclusion in nearly all modern statistical computation packages renders them approachable to even novice statistical modelers.
Unfortunately, linear regression methods come with a list of caveats, many of which are no different from the caveats associated with most methods hailing from classical statistics. More patient people than us have outlined these assumptions at length (no intro stats curriculum would be complete without a discourse on the assumptions required for the proper application of linear regression methods). Consequently, this post will focus on a problem that, in our opinion, is far too frequently overlooked in academic settings: the problem of predictor collinearity.
Standard linear regression methods are known to fail (or, at least, perform sub-optimally) in the presence of highly correlated predictors. If the ultimate goal of the analysis is prediction (as opposed to interpretation of specific predictor-outcome relationships), some additional processing may be needed in order to produce a viable predictive model. In no particular order, we present six ways to deal with highly correlated data when developing a linear regression model. It should be noted that the recommendations below apply specifically to continuous outcome models, i.e., models in which the dependent variable is a real-valued number.
*Note: This is, by no means, meant to provide a thorough, technical overview of each topic. Instead, our goal is to identify some of the potential solutions to the collinearity problem in linear regression, spark conversation amongst practitioners and enthusiasts, serve as a starting point for those venturing into the realm of predictive analytics, and provide links to some of the relevant additional reading on each topic.*
1) Manual Variable Selection
Highly correlated predictors contain redundant information. Consequently, removing individual features that are highly correlated with other predictors may produce viable predictive models with little loss in predictive power. The Variance Inflation Factor (VIF) of a predictor can be used to identify and eliminate potentially redundant features. This method allows for a high degree of user input, but can prove tedious for datasets containing large numbers of potential predictors. Alternatively, univariate correlations can be used to identify candidate predictors. This approach, however, necessarily ignores multivariate relationships in the data. Multicollinearity (i.e., predictors that are related to linear combinations of other predictors) may still prove an issue after univariate correlations are considered. This is especially true in large, high-dimensional data sets.
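As a sketch of the VIF route: the factor for predictor j can be computed directly from its definition, VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing predictor j on all the other predictors. The Python snippet below (our own NumPy-based illustration, on synthetic data of our own invention) flags a nearly duplicated column:

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of X.

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all remaining columns (plus an intercept).
    """
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # add intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic example: x2 is nearly a copy of x1; x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])
print(vif(X))  # x1 and x2 show large VIFs; x3 stays near 1
```

A common (though arbitrary) rule of thumb is to scrutinize predictors with VIFs above 5 or 10.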
2) Tree-based Automatic Variable Selection
Decision trees provide a sort of automatic variable selection, as tree-based methods only include features that provide a legitimate contribution to the model’s overall performance. These methods are typically easy to implement and interpret, with feature selection resulting as a necessary part of the tree-building process. More advanced tree-based methods (such as bagged or boosted trees), however, require additional tuning parameters that must be manually specified by the user. Although a variety of techniques exist to augment this selection process, it adds a level of complexity to overall model development. Additionally, tree-based regression methods provide viable predictive models in and of themselves, so using them merely to select inputs for a linear regression model may be unnecessary.
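A minimal sketch of the idea, using scikit-learn’s `RandomForestRegressor` on synthetic data of our own invention (the 0.05 importance cutoff is an arbitrary illustrative choice, not a standard):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: y depends on columns 0 and 1; column 2 is pure noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Keep only features whose importance clears a (user-chosen) threshold;
# the surviving columns could then feed a linear regression.
keep = np.flatnonzero(forest.feature_importances_ > 0.05)
print(keep)
```

Note that with highly correlated predictors, tree importances tend to split credit across the correlated copies, so the cutoff deserves care.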
3) Regression-based Automatic Variable Selection
Stepwise regression methods, including forward and backward elimination methods, use various statistical criteria to iteratively add and remove potential features. The result is an automatically produced final model with (typically) decent statistical properties. The downside, however, is that the user has very little control over which variables are ultimately selected. When all available predictors are considered as candidates for model inclusion, these methods may result in models that fail to generalize beyond the training set. Additionally, these methods do not explicitly address predictor collinearity, and additional processing may be required after the final model is produced.
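Classical stepwise routines select on statistical criteria such as p-values or AIC (R’s `step()` function, for instance). As a rough Python analogue, scikit-learn’s `SequentialFeatureSelector` performs forward or backward selection driven by cross-validated fit instead; a sketch on synthetic data of our own invention:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: only the first two of five predictors matter.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Forward selection: greedily add the feature that most improves
# cross-validated performance, until two features are chosen.
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward"
).fit(X, y)
print(selector.get_support())  # boolean mask of the selected columns
```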
4) Variable Reduction via Principal Components Analysis (PCA)
PCA consumes entire chapters in academic texts (and, in fact, entire texts themselves), precluding a comprehensive overview in the current post. (Very, very, VERY) Briefly, PCA attempts to find linear combinations of predictors that are a) uncorrelated with each other, and b) explain as much of the variance in the feature-set as possible. In the context of correlated predictors, PCA can be used to create a set of predictors that are completely uncorrelated with each other. These predictors can then be used as inputs to any subsequent regression model (a procedure commonly referred to as “principal components regression”). This is an incredibly simplified explanation of the process. We’ve found this explanation to be a decent starting point for those interested in additional reading.
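A compact sketch of principal components regression using scikit-learn, on synthetic, deliberately collinear data of our own invention (in practice, predictors are usually standardized before PCA, which we omit here for brevity):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic data: the second column is nearly a copy of the first.
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=300),
                     rng.normal(size=300)])
y = 2.0 * x1 + rng.normal(scale=0.1, size=300)

# Principal components regression: rotate to uncorrelated components,
# keep the top k, then fit ordinary least squares on those scores.
pcr = make_pipeline(PCA(n_components=2), LinearRegression()).fit(X, y)
print(pcr.score(X, y))  # R^2 on the training data
```

The number of retained components is a tuning choice, typically selected by cross-validation.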
5) Variable Reduction via Partial Least Squares (PLS)
PCA (as described above) creates uncorrelated predictors without accounting for these predictors’ relationships with the outcome of interest. This can prove problematic in cases where the principal components (the uncorrelated predictors, so to speak) are unrelated to the outcome of interest. PLS, by comparison, takes a slightly different approach, accounting for the predictor-outcome relationship while reducing the number of candidate predictors. The result is a reduced feature-set that has been selected based on its relationship with the outcome of interest.
6) Parameter Estimation via “Shrinkage” Methods
The cost function in conventional linear regression minimizes the sum of squared differences between the observed data points and the data points predicted by the linear regression model. In the presence of correlated predictors, minimizing this cost function can result in inordinately large regression coefficients as the method has difficulty quantifying the relationship between an outcome and any number of highly correlated predictors. To account for this, “shrinkage” methods add an additional penalty term to the cost function. This penalty term keeps the estimated value of the regression coefficients small, thereby reducing the inflation often seen in the presence of correlated data.
Three types of penalties are commonly used to rein in inflated coefficients resulting from collinearity. Ridge (or L2) regression adds a penalty based on the sum of the squared regression coefficients, resulting in estimates that are artificially shrunk towards zero. Lasso (or L1) regression penalizes the sum of the absolute values of the regression coefficients, usually resulting in several zero-valued coefficients and effectively serving as a variable reduction technique. Finally, the elastic net uses both penalty terms, allowing the user to specify which penalty plays a larger role in reining in the coefficients.
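The three penalties can be compared side by side with scikit-learn (synthetic near-duplicate predictors of our own invention; the alpha values are purely illustrative and would normally be tuned by cross-validation):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

# Two nearly identical predictors: plain OLS coefficients become unstable,
# while the penalized fits stay tame.
rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 0.001 * rng.normal(size=200)])
y = x1 + rng.normal(scale=0.1, size=200)

for name, model in [
    ("OLS", LinearRegression()),
    ("Ridge (L2)", Ridge(alpha=1.0)),
    ("Lasso (L1)", Lasso(alpha=0.01, max_iter=10_000)),
    ("Elastic net", ElasticNet(alpha=0.01, l1_ratio=0.5, max_iter=10_000)),
]:
    coefs = model.fit(X, y).coef_
    print(f"{name:12s} {coefs}")
```

Ridge typically splits the effect roughly evenly across the duplicated columns, while the lasso tends to zero one of them out entirely.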
Max Kuhn and Kjell Johnson provide an excellent overview of these methods in their book “Applied Predictive Modeling”. (This post was, in fact, largely inspired by my first pass through the book.) It’s one of the better references I’ve found when it comes to applied predictive modeling, complete with R code to augment the in-text examples.