Nested cross-validation

In machine learning and statistics, we can assess how well a model generalizes to a new dataset by splitting our data into training and test data:

1. Split data into training and test data.
2. Fit a model to the training data.
3. Assess the performance of the model on the test data.

The disadvantage of the above approach, is that not all data is used for training. A more robust assessment of generalization can be done using cross-validation:

1. Split data into k parts (folds)
2. For i in 1…k:
1. Fit a model to all data but the k-th fold.
2. Assess the performance of the model on the k-th fold.

This approach uses all data for training, and gives a more robust estimate of generalization performance.

Because cross-validation can be used for assessing the performance of a model, it can also be used for comparing the performance of different models (model selection). In the next two sections, I will give examples of both cross-validation for performance estimation and cross-validation for model selection. In the last two section, I will explain why you need to do something called nested cross-validation if you want to simultaneously do performance estimation and model selection.

Cross-validation for estimating generalization performance

For example, suppose we fit a linear-kernel SVM with C = 1 to the breast cancer data from the R-package mlbench. Then we could estimate how well such an SVM generalizes by doing a 5-fold cross-validation as follows:

# Load some packages
library(kernlab)
library(caret)
library(mlbench)
library(PRROC)

# Set the random seed for reproducibility
set.seed(111)

# Load and prepare the breast cancer data
data(BreastCancer)
data <- BreastCancer[!is.na(BreastCancer$Bare.nuclei), -1] # For simplicity, restrict both classes to 200 samples data <- data[ c( sample(which(data$Class == levels(data$Class)[1]), 200), sample(which(data$Class == levels(data$Class)[2]), 200) ), ] y <- data$Class
X <- data
X$Class <- NULL # Test the performance of a linear SVM with C = 1 folds <- createFolds(data$Class, k = 5)
# For each fold ...
auc <- sapply(folds, function(fold) {
# Train an SVM, excluding the fold
fit <- ksvm(
Class ~ .,
data = data[-fold,],
kpar = list(),
C = 1,
prob.model = TRUE,
Class.weights = 1 / table(data$Class[-fold]) ) # Predict the fold yh <- predict(fit, newdata = data[fold,], type = "probabilities") # Compare the predictions to the labels posneg <- split(yh[,1], data$Class[fold])
# Return the AUC under the ROC
roc.curve(posneg[[1]], posneg[[2]])$auc })  The results look as follows (average AUC of about 0.976): Cross-validation for model selection An example of using cross-validation for choosing between models, such as when optimizing hyperparameters, looks as follows: # Function for one round of training and validating an SVM train_and_validate <- function( data, fold, C ) { # Train an SVM, excluding the fold fit <- ksvm( Class ~ ., data = data[-fold,], kernel = "vanilladot", kpar = list(), C = C, prob.model = TRUE, Class.weights = 1 / table(data$Class[-fold])
)
# Predict the fold
yh <- predict(fit, newdata = data[fold,], type = "probabilities")
# Compare the predictions to the labels
posneg <- split(yh[,1], data$Class[fold]) # Return the AUC under the ROC roc.curve(posneg[[1]], posneg[[2]])$auc
}

# Function for doing a k-fold cross-validation for each C in CC
cv <- function(
data,
k,
CC,
seed = NULL
) {
# Set the seed, if given
if (!is.null(seed)) {
set.seed(seed)
}
# For each value of the hyperparameter C ...
auc <- lapply(CC, function(C) {
folds <- createFolds(data$Class, k = k) # For each fold ... sapply(folds, function(fold) { # Train an SVM, and validate on the fold train_and_validate( data, fold, C ) }) }) auc } # Do the cross-validation for each C in CC auc <- cv( data = data, k = 5, CC = 2^seq(log2(.01), log2(10), length.out = 21), seed = 111 )  In the figure above, we can see that performance remains relatively constant until C $\approx$ 0.5, and then seems to decline, so we had better take C not too large in order to achieve good generalization. Technical note: In the code above, we specify a different set of folds for each C. We could also have chosen to use the same fold set for all C. However, in that case, results might have been strongly dependent on how the specific fold set was actually chosen. While this can also be the case when choosing different fold sets, at least with different fold sets we get an indication of the variance in AUC caused by choosing different fold sets, just by looking at the plot. Nevertheless, choosing the same fold set for different models can make much sense in some cases, for example when directly comparing two models. In this case the difference between the models could be assessed by pairwise comparing the different folds, for example using a paired Wilcoxon signed-rank test. Can we simultaneously do model selection and performance estimation? Looking at the last plot above, you might be wondering the following: Can we conclude that the expected generalization AUC is about 0.991? (the maximum value of the red line) More specifically, is taking the maximum AUC a valid approach for estimating the generalization performance of the best C? That is an interesting question. How could we check that? Well, let’s check if randomized data results in a generalization AUC of about 0.5! More concretely: if we would randomize the class labels in the breast cancer dataset, we would expect a classifier not to do any better than random. Therefore, from the following procedure we would expect a generalization AUC of about 0.5. 1. Randomize the class labels of the breast cancer dataset. 2. Do a cross-validation of all values of C. 3. Choose the best C by choosing the maximum value of the AUC. Let’s repeat the above 25 times, and see what happens: set.seed(111) auc <- replicate(25, { # Randomize the class labels. This should result in # random performance, i.e. AUC ~= 0.5 data$Class <- sample(data$Class) # Cross-validate for each C in CC, and take the # average of the folds as the performance estimate auc <- sapply(cv( data = data, k = 5, CC = 2^seq(log2(.01), log2(10), length.out = 11) ), mean) # Take the max AUC across the different Cs max(auc) })  In the above plot, you can see that with ~0.54 the average observed AUC is quite a bit higher than the ~0.5 you would expect from a random classifier! More specifically, you can see that 24 out of 25 AUCs are higher than 0.5, which can be verified as statistically highly significant, for example using a one-sample Wilcoxon signed rank test, or a binomial test. Coming back to the question we posed at the beginning of this section: No, taking the maximum AUC is not a valid approach for estimating the generalization performance of the best C. Why is this? Well, in the randomized example, essentially what we’re doing (25 times) is randomly sampling 11 AUCs (probably with some covariance structure, but still random). We would expect these 11 AUCs to average out at around 0.5. Among these 11, we then take the maximum, which we would expect to be higher than 0.5. Therefore, after repeating the above 25 times, we would definitely expect the average of the 25 maximum AUCs to also be higher than 0.5! Summarizing, simultaneously doing model selection and performance estimation leads to a positively biased performance estimate. Nested cross-validation So how should we go about doing model selection and performance estimation? In principle, what we could do, is set aside some test data to test the performance once we’ve optimized C: 1. Set aside some test data, not to be used in the cross-validation. 2. Optimize C using cross-validation on all but the test data. 3. Select the best C (the one with the highest AUC). 4. Test the performance of the highest C on the test data. This is a perfectly valid approach, except that it has the disadvantage of not having used all data for training, at some point. This argument should sound familiar: While it is perfectly valid to split our data into training and test data, we will get a more robust estimate of our performance by doing a cross-validation, as such iteratively using all data for training. Analogously, it it perfectly valid to split our data into (1) training data for training the models, (2) validation data for selecting the best model and (3) test data for assessing the performance of the best model. However, we will get a more robust performance estimate by iteratively using all data for training. This can be done by something called nested cross-validation, and can be done as follows: • Split data into k1 folds. • For i in 1…k1: 1. For all Cs: 1. Do a k2-fold cross-validation on all data but the k1-th fold. 2. Return the performance averaged across the k2 folds. 2. Select the C with the best average performance. 3. Assess the performance of this model on the k1-th fold. We can see why the above is called nested cross-validation: an inner cross-validation loop for model selection is nested within an outer cross-validation loop for performance estimation. As such, nested cross-validation tries to estimate the expected performance of a model where the hyperparameter C is optimized using cross-validation. Here’s a function for doing nested cross-validation: ncv <- function( data, k, CC, seed = NULL ) { if (!is.null(seed)) { set.seed(seed) } folds <- createFolds(data$Class, k = k)
# For each fold ...
auc <- sapply(folds, function(fold) {
# Do a cross-validation for each C
auc <- cv(
data[-fold,],
k,
CC,
seed = seed
)
# Select the C with the highest AUC
C <- CC[which.max(sapply(auc, mean))]
C  1, sample(C, 1), C)
# Test this C on the test data
train_and_validate(
data,
fold = fold,
C = C
)
})
auc
}


Let's first check whether using this function we indeed get an AUC of ~0.5 on the randomized data:

set.seed(111)
auc <- replicate(25, {
cat(".")
# Randomize the class labels. This should result in
# random performance, i.e. AUC ~= 0.5
data$Class <- sample(data$Class)

# This returns k scores
auc <- ncv(
data = data,
k = 5,
CC = 2^seq(log2(.01), log2(10), length.out = 11)
)
# Take the average as the performance estimate
mean(auc)
})


Indeed, we get random performance! Now let’s try it out on the breast cancer data:

# test on BreastCancer data
auc <- ncv(
data = data,
k = 5,
CC = 2^seq(log2(.01), log2(10), length.out = 21),
seed = 111
)


We can see that nested cross-validation gives an expected performance ~0.987, which is indeed lower than the positively biased estimate of ~0.991 we found initially. This may seem like a small difference but compared to 0.991, 0.987 is ~40% further away from the perfect classifier with AUC = 1, making it quite a substantial difference.

Effect size and statistical significance

Are statistically significant results always relevant? Let’s have a look at a simple hypothetical example. Suppose we have two groups of 2500 men. All men in group 1 have a beard, and all men in group 2 do not have a beard. Moreover, we know the height of all men, and it turns out that the bearded men are statistically significantly taller than the beardless men (t-test, p < 0.05). For example:

As mentioned above, in our example, the difference in height is significant. However, arguably the more interesting question is: What can we do with this result? Is it practically relevant? For example, we could ask ourselves: Given the height of a man we have not seen yet, can we predict whether he has a beard?

Well, we could, but we would do very poorly: only slightly better than random. For example, an optimal classifier would statistically be expected to put the decision boundary at the average of 1.84 and 1.835, thus showing a misclassification rate of about 0.49. For example, among 100 men, we would be expected to correctly predict only one more than by mere random guessing. Why is this? Because the effect size is so small: The difference in average height between the two groups is just so small that it can hardly be used for prediction. While it cannot be denied that there is a difference between the two groups, it is of little practical relevance, and we had better look for something that better predicts beardedness.

So, in reporting results, we should not only look at statistical significance, but also at effect size. Nonetheless, in practice, cases where effect size is under-reported are no exception. An interesting example is this article, on “how intelligence, population density, and friendship affect modern happiness”. It received quite some attention in the media. One of the main results in the paper was that there is an “interaction effect between frequency of socialization with friends and intelligence on life satisfaction”, such that “more intelligent individuals were actually less satisfied with life if they socialized with their friends more frequently”. This was summarized in the following graph:

Indeed, people with higher IQs seem unhappier when they have more social interactions, and Li and Kanazawa showed that these results were significant (p = 0.016). So far so good. However, look at the y-axis. The article states that life satisfaction was reported on a scale from 1 to 5, but the figure only spans a tiny fraction of the entire range, from 4.10 to 4.16. Moreover, only mean life satisfaction is reported, and no indication whatsoever is given of the spread in life satisfaction scores: Most likely, the large majority of the individual scores are either larger than 4.16 or smaller than 4.10, and therefore lie outside the range of the y-axis. To get a proper idea of how small the differences actually are, look at the same data, mean life satisfaction, but now with a y-axis ranging from 1 to 5:

To get a feeling for the effect size of this difference, we might ask a question similar to the one in the toy example we started with: Would you be able to predict whether someone has a high IQ just by knowing whether he/she socializes frequently and how happy he/she is with his/her life? Most likely you would do very poorly, close to random in fact, as the Cohen’s d statistics of 0.05 and -0.03 reported in the article suggest. With a large sample size of 15197, as reported in the article, even very small effects can be identified as statistically significant.

Concluding: Is there an effect? Yes, there is. Is it relevant? Very questionable, considering the small effect size.

Bias-variance decomposition

In machine learning and statistics, predictive models are typically not perfect. But what does ‘not perfect’ mean? For example, take the case of weather forecasting, and imagine a weatherman who is simply overly optimistic, and too often predicts sunny weather and high temperatures. On the other hand, one could also think of a weatherman who, for some complicated reasons, on some days grossly overestimates the temperature, while on other days grossly underestimating it, such that on average he is closer to the true temperature than the first weatherman. In absolute terms, the errors in degrees Celsius/Fahrenheit made by both weathermen could be comparable. However, the first weathermen may be considered more reliable because he is consistent, but the second may be considered more reliable because on average he is closer to the truth. This is analogous to the bias-variance trade-off in machine learning and statistics: if we see a model making errors, is this generally the result of bias (e.g. consistently predicting high temperatures), of variance (e.g. wildly varying temperature predictions across days), of noise (weather is just unpredictable…), or of a combination of all?

It turns out that mathematically speaking, the error made by a model can indeed be decomposed into two terms corresponding to bias and variance, plus one additional term, representing noise (e.g. daily fluctuations in temperature we cannot account for). To prove this is not difficult. However, as I found many proofs online somewhat lacking in detail, I have written my own, which will consequently be one of the longer proofs you will find online:

A nice example for demonstrating bias and variance, is the estimation of population variance from sample variance. For example, suppose that we want to estimate the variance in height of all American men, from knowing the heights of only 3 American men? In other words, we want to use the sample variance (of 3 American men) as an approximation of the population variance (of all American men). In principle, for points $x_1, x_2, \dots, x_n$, the variance is computed as follows:

$\frac{1}{n}\sum_{1}^{n}\left(x_i - \bar{x}\right)^2$

However, it turns out that if you want to use the sample variance as an approximation of the population variance, the above calculation is biased, and you need what’s called Bessel’s correction in calculating the variance:

$\frac{1}{n-1}\sum_{1}^{n}\left(x_i - \bar{x}\right)^2$

The above calculation is not biased. Many proofs of this can be found online, three of them on Wikipedia. I will give a demonstration of this bias by simulation: 10000 times, a sample of size 3 is drawn from the standard normal distribution (zero mean and unit variance). For each of the 10000 samples, the variance is calculated with and without Bessel’s correction. The results are summarized in the density plot below.

The plot above confirms that in calculating the variance using Bessel’s correction, the average of the 10000 approximations of the population variance is very close to the true value (1), much closer in fact than when not using Bessel’s correction. However, its spread is larger compared to not using Bessel’s correction. In other words, Bessel’s correction leads to lower bias but higher variance. Therefore, Bessel’s correction is weatherman no. 2.

A slightly more involved example of the bias-variance trade-off is the following. Imagine (“population”) data that was generated using the sine function on the domain $[-\pi, \pi]$, and adding some Gaussian noise. A complete cycle could look like this:

Now, suppose we would be given three training data points, generated in the way described above:

1. Randomly sample three points $x_1, x_2, x_3$ from the domain $[-\pi, \pi]$.
2. Generate Gaussian noise $\epsilon_1, \epsilon_2, \epsilon_3$ , one for each of the $x_i$.
3. Compute $y_i = sin(x_i) + \epsilon_i$

I should emphasize that we do not actually know the three data points were generated using the sine function, we just see the three data points. Based on these three points $(x_1, y_1), (x_2, y_1), (x_3, y_3)$, we want to fit two models, a very simple one (a constant, or 0th-order polynomial), and a more complex one (a line, or 1st-order polynomial). Suppose furthermore that we would repeat the sampling and fitting 1000 times, and each time we measure the error with respect to the “population” data. For fitting the constant, the results may look like this:

For fitting the line, the results may look like this:

It can already be seen that fitting a line seems to better capture the overall increasing trend of the data than does fitting a constant. However, this comes at the expense of high variance in the line fits compared to the constant fits.

This bias-variance trade-off across 1000 fits is even more clearly visualized by summarizing the empirical errors across the 1000 fits, with the empirical error decomposed into bias and variance, respectively. For fitting constants, the decomposition looks like this:

For fitting a line, the decomposition looks as follows:

It is clear that although the complex model (the line) on average has a better fit (i.e. low bias), this comes at the expense of much larger variance between the individual fits. Hence, in statistical and machine learning modeling, the impact of model complexity on bias and variance should always be considered carefully.

Piano competitions and successful careers: the statistics

In my first post about piano competitions, I explained how I collected data from a website describing more than 11000 results from about 3000 piano competitions. By doing some elementary analysis in the statistical programming language R, among other things I confirmed my suspicion that Italians do very well in piano competitions, in terms of number of prizes won per million inhabitants of a country.

However, doing well in piano competitions should not be an end in itself, and there are quite some examples of pianists winning top prizes at major piano competitions without going on to having a successful career. Therefore, one might wonder how predictive doing well in piano competitions is of achieving this ultimate goal: a successful career. To try and answer this question, we first need to think about the two things we want to relate:

1. What defines “doing well in piano competitions”? We only have data for pianists who actually reached the finals of any given competition, not for those not reaching the finals. Therefore, “doing well” is determined by the ranking of the pianists within each of the finals, i.e. 1st prize, 2nd prize, 3rd prize, etc., and possibly finalists without a prize.
2. What defines a “successful career”? Obviously, one could think of measures such as “number of concerts per year” or “yearly income”. Needless to say, these data are pretty hard to come by. Therefore, I decided to go with the relatively pragmatic “number of Google hits on the rainy Saturday afternoon of November 28, 2015”, as measured using the Custom Search API from the Google Developers Console for doing the following search: <“first_name last_name” piano>. In other words, the assumption is: the more Google hits, the more success.

So, we will try and establish whether knowing the prize won in a competition allows us to predict the number of Google hits. We will call this the “prize effect”, i.e. the effect of prize on number of Google hits. For example, we can take the Queen Elisabeth Competition, and plot the prize won by each finalist against his/her number of Google hits:

We can see that 1st prize winners generally have more Google hits than finalists without a prize, so there indeed seems to be a weak trend. (Statistically, this observation is confirmed by the Kendall correlation coefficient of -0.3 and corresponding p-value of 0.0055)

OK, simple enough, right? Well…., maybe not. These are only results for the Queen Elisabeth Competition. What if we want to assess the same trend, but based on all competitions simultaneously? Now we have a problem, because some competitions are more prestigious than others. In other words,  you might expect someone coming in 3rd at the Van Cliburn Competition to have better chances of having a successful career (and thus more Google hits) than someone coming in 3rd at the Premio de Piano Frechilla-Zuloaga, simply because the Van Cliburn Competition is a more prestigious competition. We will call this the “competition effect”. Also, it is not unlikely that the number of Google hits in November 2015 is influenced by the year in which a competition was won. We will call this the “year effect”. So what we want to do is to determine the “prize effect” on the number of Google hits, while correcting for the “competition effect” and the “year effect”. (Now we’ll get into some technical details, but feel free to skip these and head directly over to the conclusion to learn whether prize indeed predicts a successful career) Fortunately, there is a class of statistical models called mixed models that can help us out here. More specifically, we’ll use the lme4 R-package to construct a mixed model predicting number of Google hits from a fixed effect “prize”, a fixed effect “year”, and a random effect “competition”. Suppose that data.frame K0 contains our data, namely columns:

1. nhits: number of Google hits
2. prize: the prize won
3. year: the year in which a prize was won
4. competition: the name of the competition

Then one way of determining whether the prize effect is significant when taking into account the competition and year effects is the following:


# Log-transform the number of Google hits, to make it a bit better
# behaved in terms of distribution. To make the log-transform work,
# we first need to add a 'pseudocount' of 1, so as to avoid taking
# the logarithm of 0.
K0$nhits <- log10(K0$nhits+1)
# Make "year" into a factor, such that it will be treated as a
# categorical variable.
K0$year <- factor(K0$year)
# Train the null model, predicting nhits only from competition.
fit.null <- lmer(
nhits ~ year + (1|competition),
data = K0,
REML = FALSE
)
# Train the full model, predicting nhits from prize and competition.
fit.full <- lmer(
nhits ~ prize + year + (1|competition),
data = K0,
REML = FALSE
)
# compare the null model with the full model by performing a
# likelihood ratio test using the anova() function
anova(fit.full,fit.null)



Note that year is treated as a categorical variable. This is because it is likely that the effect of year on nhits is nonlinear. Indeed, I observed this for the Queen Elisabeth competition, with relatively many Google hits for the 2003 and 2013 competitions, and fewer for the 2007 and 2010 competition. This could be explained by the fact that 2013 laureates get more hits due to winning a prize, and that 2003 laureates get more hits because they have had 10 years to establish a career. This nonlinearity makes it more difficult to deal with in linear models. However, the number of years is limited, and moreover we are not interested in assessing the magnitude of the year effect, but only in removing it. Therefore, we can treat it as a categorical variable. The above approach gives p < 2.2e-16 for the difference between the two models, thus indicating that, across competitions in general, prize indeed contributes to the number of Google hits. We can try to visualize the prize effect by plotting prize against the residuals of both the null model and the full model:

This demonstrates that there is a significant correlation of prize with residual nhits in the null model that is removed when including prize as a predictor variable in the full model. This indeed indicates a trend across competitions and years for mildly more Google hits with winning top prizes. Also evident from this plot is that the residuals may not be normally distributed, thus violating one of the assumptions of mixed models. This is even more clearly seen in a Q-Q plot, below for residual nhits of the full model:

If the residuals were normally distributed, they would more closely follow the dashed line. Thus, the mixed model as described above may not be entirely appropriate for our current purpose. Therefore, in order to determine the significance of the prize effect, we may want to replace the likelihood ratio test with a non-parametric test, such as a permutation-based test. Doing this, as it turns out, also gives a significant result, namely p < 0.001 using 1000 permutations. Thus, prize can predict number of Google hits, at least to a certain extent. This is also indicated by the highly significant non-parametric Kendall correlation of prize with residual nhits in the null model. However, at -0.081 the magnitude of this correlation is fairly small. Note of caution: strictly speaking we have not established that winning a top prize actually causes a higher number of Google hits; we have only established an undirected association between the two. Nonetheless, considering that all competition results preceded the retrieval date of the number of Google hits, typically by 5 to 10 years, this is by far the most likely interpretation.

The above observations lead to the following conclusion: if you are a finalist in a piano competition and win a top prize, you do seem to have better chances of having a successful career than a finalist not winning a top prize, at least in terms of number of Google hits. However, this “prize effect” is remarkably small when observed for piano competitions in general.

Which nationalities do well in piano competitions?

Yesterday, I was listening to Guido Agosti’s awesome piano transcription of Strawinsky’s The Firebird, in a great performance by the Italian pianist Francesco Piemontesi. I remembered Francesco Piemontesi from coming in third at the 2007 Queen Elisabeth Competition in Brussels, and I was thinking: how come it seems that Italian pianists always do so well in international competitions, compared to many other European nationalities? It struck me that this was actually an assumption, as I had never seen any concrete numbers on this, at least not across many different competitions world-wide.

Curious as I am, I paid a visit to a website collecting this type of data, to see if I could find any numbers on this. Unfortunately, the website itself provided only very limited functionality for mining the data, by far not enough to answer the question I was interested in. However, being a computational scientist / data scientist (and classically trained pianist), I could not resist: Using Firefox’s Network Monitor, I tracked down the precise HTTP GET request that was sent to query the database for any individual competition. Then, I constructed my own raw HTTP GET requests, and used the computer networking tool netcat to send these in batch to the server. This allowed me to retrieve the results in HTML for about 3000 international piano competitions. I parsed this HTML data mainly using the XML and stringr R-packages (tricky…), downloaded some more data regarding population size by country from Wikipedia, and then did some elementary analysis in R.

Behold, the first results of analyzing this data: the number of prizes per million inhabitants of a country.

The results are pretty interesting. Indeed, Italy ranks highest of all Western European countries! Another suspicion confirmed is that Belgium scores substantially better than the Netherlands (my home country). Somewhat surprising is that the charts are heavily dominated by Eastern Europe (top 7), and more specifically by former Soviet states, excluding Russia itself (top 5!). The top Eastern Asian country is South Korea. China ranks very low, so the large numbers of Chinese pianists may currently still be explained by China’s huge population.

Although the results are quite interesting, some caution regarding the interpretation particularly of small differences is in order:

1. Population sizes are from 2014-2015; competition results are from 2000-2015.
2. Some countries do not exist anymore. For example, prizes won by Yugoslavian pianists are not counted towards any of the currently existing states that previously made up Yugoslavia.
3. The number of prizes in a country may somewhat depend on the number of international competitions in that country.
4. Some number are very low. For example, Mongolia ranks a bit higher than China (!). However, this is based on only three prizes won by Mongolian pianists.

I will probably have another, more sophisticated, look at this data in the near future. To be continued…..