Bias-variance decomposition

In machine learning and statistics, predictive models are typically not perfect. But what does ‘not perfect’ mean? For example, take the case of weather forecasting, and imagine a weatherman who is simply overly optimistic, and too often predicts sunny weather and high temperatures. On the other hand, one could also think of a weatherman who, for some complicated reasons, on some days grossly overestimates the temperature, while on other days grossly underestimating it, such that on average he is closer to the true temperature than the first weatherman. In absolute terms, the errors in degrees Celsius/Fahrenheit made by both weathermen could be comparable. However, the first weathermen may be considered more reliable because he is consistent, but the second may be considered more reliable because on average he is closer to the truth. This is analogous to the bias-variance trade-off in machine learning and statistics: if we see a model making errors, is this generally the result of bias (e.g. consistently predicting high temperatures), of variance (e.g. wildly varying temperature predictions across days), of noise (weather is just unpredictable…), or of a combination of all?

It turns out that mathematically speaking, the error made by a model can indeed be decomposed into two terms corresponding to bias and variance, plus one additional term, representing noise (e.g. daily fluctuations in temperature we cannot account for). To prove this is not difficult. However, as I found many proofs online somewhat lacking in detail, I have written my own, which will consequently be one of the longer proofs you will find online:

A nice example for demonstrating bias and variance, is the estimation of population variance from sample variance. For example, suppose that we want to estimate the variance in height of all American men, from knowing the heights of only 3 American men? In other words, we want to use the sample variance (of 3 American men) as an approximation of the population variance (of all American men). In principle, for points $x_1, x_2, \dots, x_n$, the variance is computed as follows:

$\frac{1}{n}\sum_{1}^{n}\left(x_i - \bar{x}\right)^2$

However, it turns out that if you want to use the sample variance as an approximation of the population variance, the above calculation is biased, and you need what’s called Bessel’s correction in calculating the variance:

$\frac{1}{n-1}\sum_{1}^{n}\left(x_i - \bar{x}\right)^2$

The above calculation is not biased. Many proofs of this can be found online, three of them on Wikipedia. I will give a demonstration of this bias by simulation: 10000 times, a sample of size 3 is drawn from the standard normal distribution (zero mean and unit variance). For each of the 10000 samples, the variance is calculated with and without Bessel’s correction. The results are summarized in the density plot below.

The plot above confirms that in calculating the variance using Bessel’s correction, the average of the 10000 approximations of the population variance is very close to the true value (1), much closer in fact than when not using Bessel’s correction. However, its spread is larger compared to not using Bessel’s correction. In other words, Bessel’s correction leads to lower bias but higher variance. Therefore, Bessel’s correction is weatherman no. 2.

A slightly more involved example of the bias-variance trade-off is the following. Imagine (“population”) data that was generated using the sine function on the domain $[-\pi, \pi]$, and adding some Gaussian noise. A complete cycle could look like this:

Now, suppose we would be given three training data points, generated in the way described above:

1. Randomly sample three points $x_1, x_2, x_3$ from the domain $[-\pi, \pi]$.
2. Generate Gaussian noise $\epsilon_1, \epsilon_2, \epsilon_3$ , one for each of the $x_i$.
3. Compute $y_i = sin(x_i) + \epsilon_i$

I should emphasize that we do not actually know the three data points were generated using the sine function, we just see the three data points. Based on these three points $(x_1, y_1), (x_2, y_1), (x_3, y_3)$, we want to fit two models, a very simple one (a constant, or 0th-order polynomial), and a more complex one (a line, or 1st-order polynomial). Suppose furthermore that we would repeat the sampling and fitting 1000 times, and each time we measure the error with respect to the “population” data. For fitting the constant, the results may look like this:

For fitting the line, the results may look like this:

It can already be seen that fitting a line seems to better capture the overall increasing trend of the data than does fitting a constant. However, this comes at the expense of high variance in the line fits compared to the constant fits.

This bias-variance trade-off across 1000 fits is even more clearly visualized by summarizing the empirical errors across the 1000 fits, with the empirical error decomposed into bias and variance, respectively. For fitting constants, the decomposition looks like this:

For fitting a line, the decomposition looks as follows:

It is clear that although the complex model (the line) on average has a better fit (i.e. low bias), this comes at the expense of much larger variance between the individual fits. Hence, in statistical and machine learning modeling, the impact of model complexity on bias and variance should always be considered carefully.