
March 24, 2017

Extraordinary Least Squares



Everyone has had to use linear regression at one point or another during their working (or even everyday) life, or has at least come across the concept at university. But what if you could make this mundane tool much more exciting and effective? Below is a list of things that data scientists put on top of Ordinary Least Squares to make it a bit more extraordinary.

Variable (feature) selection

If you get a data set from the real world to analyse, it will probably have a few variables that are correlated with each other, or simply too many variables in total, most of which might not add much explanatory power to the model. One way to deal with this is to choose a subset of the variables (aka features), but testing every combination gets tedious (and infeasible once the number of variables grows too large). Luckily, there are a few algorithms that can help with this problem, such as best-subset selection and forward- and backward-stepwise selection.
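
To make the idea concrete, here is a minimal sketch of forward stepwise selection using only numpy; the data, the helper name `forward_stepwise`, and the stopping rule (a fixed number of features) are made up for illustration:

```python
import numpy as np

def forward_stepwise(X, y, n_features):
    """Greedily add the column that most reduces the residual sum of squares."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(n_features):
        best_rss, best_j = None, None
        for j in remaining:
            cols = selected + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ beta) ** 2)
            if best_rss is None or rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Toy data: y depends only on columns 0 and 2; columns 1 and 3 are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=100)
print(forward_stepwise(X, y, 2))  # picks out the two informative columns
```

In practice you would stop adding features based on a validation score or an information criterion rather than a fixed count.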

The figure below shows a simulation of the methods above on a sample dataset.

Penalty functions aka Shrinkage

Another way to improve your model is to reduce its variance in exchange for a small increase in bias: minimise not just the residual sum of squares but also a penalty on the size of the coefficients. There are a few ways to do this, the best known being Ridge (a penalty on the sum of squared coefficients) and Lasso (a penalty on the sum of their absolute values).
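
Ridge even has a closed-form solution, so the shrinkage effect is easy to demonstrate with a small numpy sketch (the data and penalty values here are made up for illustration):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution: minimise ||y - X b||^2 + lam * ||b||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 3.0]) + rng.normal(size=50)

for lam in (0.0, 10.0, 1000.0):
    b = ridge(X, y, lam)
    print(lam, np.round(np.linalg.norm(b), 3))  # the norm shrinks as lam grows
```

With lam = 0 this reduces to ordinary least squares; as lam grows, the coefficient vector is pulled towards zero.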

We can visualise the difference between Ridge and Lasso as a constrained minimisation problem in two coefficients, beta1 and beta2. The feasible region for the coefficients under the Ridge penalty is a disc, whereas for Lasso it is a diamond. The solution occurs where the contours of the regression estimate first touch the outer boundary of the penalty region.

From this illustration we can see that Lasso is more likely than Ridge to produce a "corner" solution, with one of the betas exactly zero.
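
The corner behaviour can be reproduced numerically with a small coordinate-descent Lasso sketch (numpy only; the data, penalty level, and helper names are made up for illustration):

```python
import numpy as np

def soft_threshold(z, g):
    return np.sign(z) * np.maximum(np.abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for (1/2)||y - X b||^2 + lam * ||b||_1."""
    b = np.zeros(X.shape[1])
    col_ss = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r = y - X @ b + X[:, j] * b[j]  # partial residual excluding column j
            b[j] = soft_threshold(X[:, j] @ r, lam) / col_ss[j]
    return b

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 5))
y = 2.0 * X[:, 0] + 0.05 * rng.normal(size=80)  # only column 0 matters
b = lasso_cd(X, y, lam=20.0)
print(np.round(b, 3))  # the noise coefficients land exactly at zero
```

The soft-threshold step is exactly what creates the corners: any coefficient whose contribution falls below the penalty is set to zero rather than merely shrunk.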

We can see that the evolution of the coefficients under LAR and Lasso is very similar, except when a coefficient's value approaches zero: it is then dropped from the Lasso estimate but continues smoothly in the case of LAR.

As a result, shrinkage affects most strongly the coefficients of the directions with the smallest variance. This makes sense: since we don't expect much variation in the data along such a principal component, its coefficient might as well be reduced to decrease the variance of the model.
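
This per-direction shrinkage can be read off the singular value decomposition of X: Ridge scales the fitted component along the j-th principal direction by d_j^2 / (d_j^2 + lambda), where d_j is the j-th singular value (see section 3.4.1 of the book cited below). A small numpy sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
# Data with one high-variance and one low-variance principal direction.
X = rng.normal(size=(200, 2)) @ np.diag([5.0, 0.5])
X -= X.mean(axis=0)
U, d, Vt = np.linalg.svd(X, full_matrices=False)

lam = 50.0
shrink = d ** 2 / (d ** 2 + lam)  # per-component shrinkage factor
print(np.round(shrink, 3))  # the low-variance direction is shrunk far more
```

The high-variance direction keeps a factor close to 1, while the low-variance direction is shrunk heavily, matching the intuition above.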

Bonus!

A few more illustrations of how Ridge, Lasso and Elastic Net affect the coefficients compared to simple linear regression (based on the wine quality data set that can be found here).

Ridge smoothly decreases the coefficients towards (but not exactly to) zero at extreme levels of the penalty.

Lasso has a more kinked, piecewise-linear coefficient path and ends up with a lot of coefficients zeroed out.

Elastic Net combines the Ridge and Lasso penalties, and its properties depend on the weight you put on each element of the penalty function (the example below uses 90% Ridge, 10% Lasso).
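
A minimal coordinate-descent sketch of the Elastic Net, extending the Lasso update with a Ridge term in the denominator (numpy only; the data, penalty split, and helper name `enet_cd` are made up for illustration):

```python
import numpy as np

def enet_cd(X, y, lam1, lam2, n_iter=200):
    """Coordinate descent for (1/2)||y - X b||^2 + lam1*||b||_1 + (lam2/2)*||b||^2."""
    b = np.zeros(X.shape[1])
    col_ss = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            r = y - X @ b + X[:, j] * b[j]  # partial residual excluding column j
            z = X[:, j] @ r
            # Lasso part soft-thresholds; Ridge part inflates the denominator.
            b[j] = np.sign(z) * max(abs(z) - lam1, 0.0) / (col_ss[j] + lam2)
    return b

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 4))
y = 2.0 * X[:, 0] + 0.05 * rng.normal(size=80)

A = 40.0  # total penalty, split 90% Ridge / 10% Lasso as in the example above
b = enet_cd(X, y, lam1=0.1 * A, lam2=0.9 * A)
print(np.round(b, 3))
```

With lam1 = 0 this reduces to Ridge, and with lam2 = 0 to Lasso; the 90/10 split here shrinks all coefficients while still pushing near-irrelevant ones towards zero.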


The source of the beautiful graphs and all the extraordinary information in this post is the book "The Elements of Statistical Learning" by Trevor Hastie, Robert Tibshirani and Jerome Friedman (2nd edition, February 2009).