Thursday, July 7, 2016

Variance-Bias Trade-Offs in Model Building - Notes from ISLR Chapter 2

Emotional commitment is a key to achieving mastery

0. How do we measure the quality of a fit (e.g. in the case of regression)?

For a given data set, we need some way to measure how well a model's predictions match the observed data. In the regression setting, the most commonly used measure is the mean squared error (MSE). The MSE will be small if the predicted responses are very close to the true responses, and large if they differ substantially.
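As a concrete illustration, here is a minimal sketch of computing the MSE; the function name and the toy numbers are my own, invented for this example:

```python
# Hypothetical toy example: mean squared error between observed and
# predicted responses (average of the squared differences).
def mse(y_true, y_pred):
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n

y_observed = [3.0, 5.0, 7.0]
y_predicted = [2.5, 5.5, 7.0]
print(mse(y_observed, y_predicted))  # (0.25 + 0.25 + 0.0) / 3
```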


1. What are the training MSE and the test MSE, and how are they related to the flexibility of the statistical method?

The mean squared error computed using the training data that was used to fit the model is called the training MSE. However, our real interest is in how well the predictions hold up on unseen test data that was never used to train the statistical learning method. The mean squared error computed on the test data is called the test MSE. We want to select the statistical method for which the test MSE is smallest. It is important to know that there is no guarantee that the method with the lowest training MSE will also have the lowest test MSE.

Let's understand how the training MSE and the test MSE are related to the flexibility of the statistical method.

Figure 2.9 illustrates this phenomenon with a simple example. In the left-hand panel of Figure 2.9, the black curve shows the true function f, along with the simulated observations. The orange, blue and green curves illustrate three possible estimates of f obtained using methods with increasing levels of flexibility. The orange line is the linear regression fit, which is relatively inflexible. The blue and green curves were produced using smoothing splines with different levels of smoothness. It is clear that as the level of flexibility increases, the curves fit the observed data more closely. The green curve is the most flexible and matches the data very well; however, it fits the true f (shown in black) poorly because it is too wiggly.


On the right-hand panel of Figure 2.9, the grey curve displays the average training MSE as a function of flexibility, or more formally the degrees of freedom, for a number of smoothing splines. The degrees of freedom is a quantity that summarizes the flexibility of a curve. The orange, blue and green squares indicate the MSEs associated with the corresponding curves in the left-hand panel. A more restricted and hence smoother curve has fewer degrees of freedom than a wiggly curve. Linear regression is at the most restrictive end, with two degrees of freedom. The training MSE declines monotonically as flexibility increases. In this example the true f is non-linear, and so the orange linear fit is not flexible enough to estimate f well. The green curve has the lowest training MSE of all three methods, since it corresponds to the most flexible of the three curves fit in the left-hand panel.



In this example, we know the true function f, and so we can also compute the test MSE over a very large test set, as a function of flexibility. The test MSE is displayed using the red curve in the right-hand panel of Figure 2.9. As with the training MSE, the test MSE initially declines as the level of flexibility increases. However, at some point the test MSE levels off and then starts to increase again. Consequently, the orange and green curves both have high test MSE. The blue curve minimizes the test MSE, which should not be surprising given that visually it appears to estimate f the best in the left-hand panel of Figure 2.9. 

In the right-hand panel of Figure 2.9, as the flexibility of the statistical learning method increases, we observe a monotone decrease in the training MSE and a U-shape in the test MSE. This is a fundamental property of statistical learning that holds regardless of the particular data set at hand and regardless of the statistical method being used. As model flexibility increases, training MSE will decrease, but the test MSE may not.

When a given method yields a small training MSE but a large test MSE, we are said to be overfitting the data. This happens because our statistical learning procedure is working too hard to find patterns in the training data, and may be picking up some patterns that are just caused by random chance rather than by true properties of the unknown function f. When we overfit the training data, the test MSE will be very large because the supposed patterns that the method found in the training data simply don't exist in the test data. Note that regardless of whether or not overfitting has occurred, we almost always expect the training MSE to be smaller than the test MSE because most statistical learning methods either directly or indirectly seek to minimize the training MSE. Overfitting refers specifically to the case in which a less flexible model would have yielded a smaller test MSE.
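This monotone-training / U-shaped-test behavior is easy to reproduce numerically. The sketch below is my own synthetic setup (a sine-curve f with Gaussian noise, and polynomial degree standing in for "flexibility"), not the example from the book:

```python
# Synthetic illustration: training MSE falls monotonically with flexibility,
# while test MSE follows a U-shape. All choices (f, noise, degrees) are
# assumptions made for this sketch.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)                     # the "unknown" true function
x_train = rng.uniform(0, 3, 50)
y_train = f(x_train) + rng.normal(0, 0.3, 50)   # noisy training data
x_test = rng.uniform(0, 3, 1000)
y_test = f(x_test) + rng.normal(0, 0.3, 1000)   # large unseen test set

def fit_mse(degree):
    # Fit a polynomial of the given degree; higher degree = more flexible.
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for d in [1, 3, 10, 20]:
    tr, te = fit_mse(d)
    print(f"degree {d:2d}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

The degree-1 fit is too inflexible for a non-linear f (high test MSE), a moderate degree does well, and very high degrees drive the training MSE down while the test MSE climbs back up.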


In Fig. 2.10 the true f is approximately linear, so linear regression fits well: the orange and blue curves estimate f closely and their test MSEs are small, while the green curve overfits the data and has a larger test MSE.

In Fig. 2.11 the true f is highly non-linear, so linear regression fits poorly and both its training and test MSE are large, whereas the more flexible blue and green splines fit f well and achieve a smaller test MSE.

The above behavior in all three cases can be understood from the equation for the expected test MSE. The U-shape observed in the test MSE curves (Figures 2.9–2.11) turns out to be the result of two competing properties of statistical learning methods. The expected test MSE, for a given value x0, can always be decomposed into the sum of three fundamental quantities: the variance of f̂(x0), the squared bias of f̂(x0) and the variance of the error term ε. That is,

E(y0 − f̂(x0))² = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε)    (Equation 2.7)
Equation 2.7 tells us that in order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves low variance and low bias. Note that variance is inherently a nonnegative quantity, and squared bias is also nonnegative. Hence, the expected test MSE can never lie below Var(ε), the irreducible error. The three effects are shown separately for the three cases in Figs. 2.9 to 2.11. The meaning of bias and variance, and how each depends on model flexibility, is discussed below.
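The decomposition can also be checked numerically at a single point x0: repeatedly draw training sets, refit, and measure the spread and the average offset of the predictions. The setup below (true f, noise level, polynomial degrees, the point x0) is invented for illustration:

```python
# Numerical sketch of Equation 2.7 at one point x0. Across many simulated
# training sets, the variance of the predictions estimates Var(f_hat(x0)),
# and the squared offset of their mean from f(x0) estimates Bias^2.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * x)
sigma = 0.3          # sd of the irreducible noise eps, so Var(eps) = sigma**2
x0 = 1.5             # point at which we evaluate the decomposition

def simulate(degree, n_sets=2000, n_train=50):
    preds = np.empty(n_sets)
    for i in range(n_sets):
        x = rng.uniform(0, 3, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        preds[i] = np.polyval(np.polyfit(x, y, degree), x0)
    variance = preds.var()                       # Var(f_hat(x0))
    bias_sq = (preds.mean() - f(x0)) ** 2        # [Bias(f_hat(x0))]^2
    return variance, bias_sq

for d in [1, 5]:
    var, b2 = simulate(d)
    print(f"degree {d}: Var={var:.4f}, Bias^2={b2:.4f}, "
          f"Var(eps)={sigma**2:.4f}, expected test MSE={var + b2 + sigma**2:.4f}")
```

The inflexible degree-1 fit shows large squared bias and small variance; the flexible degree-5 fit shows the reverse, and in both cases the expected test MSE stays above Var(ε).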



2. What is the meaning of Variance of a statistical method and how is it related to flexibility of the statistical method?

Variance refers to the amount by which the estimate f̂ would change if we estimated it using a different training data set. Since the training data are used to fit the statistical learning method, different training sets will result in a different f̂. Ideally, however, the estimate of f should not vary too much between training sets.

In general, more flexible statistical methods have higher variance. For a highly flexible method that follows the data points closely, a small change in any one of the data points can cause the estimate f̂ to change drastically, whereas a relatively inflexible method such as linear regression changes little when a single observation moves. Therefore flexible statistical methods have higher variance.

3. What is the meaning of bias of a statistical method and how is it related to flexibility of the statistical method?

Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model. For example, linear regression assumes that there is a linear relationship between the response (dependent variable) and the predictors (features / independent variables). It is unlikely that any real-life problem truly has such a simple linear relationship, so performing linear regression will undoubtedly introduce some bias into the estimate of f.

Hence, it is clear that more flexible statistical methods have less bias.
