In machine learning and statistics, *data leakage* is the problem of using information in your test samples for training your model. The problem with data leakage is that it inflates performance estimates. The simplest case, directly using test samples for training, is easily avoided. However, it does not take much for data leakage to become quite a bit more difficult to detect, as is illustrated by the following example combining feature selection with cross-validation.

A common problem in the analysis of high-throughput biological data is that, very often, the number of features is much larger than the number of samples. For example, a typical RNA-seq experiment will result in tens of thousands of features, whereas the number of samples will be orders of magnitude smaller. Building statistical or machine learning models based on such data poses a number of problems. For example, the large number of features typically make a resulting model more difficult to interpret. Also, it may take longer to train the model. However, the most important problem may be that training models on many features is more prone to overfitting, and as such increases the generalization error.

For this reason, a first step that is often taken in training models based on data with many more features than samples, is selecting only a limited number of relevant features to be used for training the model. One of the simplest ways of selecting relevant features is applying a filter. A filter selects features on an individual basis, without considering the learning algorithm that will eventually be applied to the set of selected features. An example would be checking the correlation of each feature with the response variable, and selecting only those for which the correlation exceeds a certain threshold.

For our example combining feature selection with cross-validation, we will use a simple t-test filter. Here’s an R function that performs simultaneous t-tests on the rows (features) of a feature matrix *X* given a class variable *y*, and for each feature returns a p-value:

# fast row t-test for equal sample sizes and equal variance # the hypothesis test is two-sided row_ttest <- function(X, y) { library(matrixStats) X1 <- X[, y == levels(y)[1]] X2 <- X[, y == levels(y)[2]] spool <- sqrt( (rowVars(X1) + rowVars(X2)) / 2 ) tstat <- (rowMeans(X1) - rowMeans(X2)) / spool / sqrt(4 / ncol(X)) df <- ncol(X) - 2 p <- pt(tstat, df) pmin(p, 1 - p) * 2 }

The data that we will use for the example is random data with 100 samples times 100000 features. The samples are randomly divided in two classes, 50 samples for each class. Here’s a function that generates the random data:

generate_random_data <- function( # number of samples in the random data nsamples = 1e2, # number of features in the random data nfeatures = 1e5 ) { # the features X <- matrix( rnorm(nsamples * nfeatures), nrow = nfeatures, ncol = nsamples ) # the class labels y <- gl(2, nsamples / 2) list( X = X, y = y ) }

Now, the problem at hand is to train a classifier that will predict the class label of the above data. Because of the large number of features, we want to do some initial feature selection, and might consider the following approach:

- Reduce the number of features by applying a t-test filter to the individual features.
- Train a support vector machine (SVM) and estimate the misclassification rate by cross-validation.

Here’s the code for that approach, which includes calls to the previously defined functions:

# load some libraries library(caret) library(e1071) # set the seed set.seed(123) # generate some data Xy <- generate_random_data() X <- Xy$X y <- Xy$y # apply the t-test filter selected <- row_ttest(X, y) < 0.1 # Train an SVM to predict the class label, # and estimate the misclassification rate by # cross-validation. folds <- createFolds(y, k = 5) cv <- unlist(lapply(folds, function(fold) { # train and test a model fit <- svm(x = t(X[selected, -fold]), y = y[-fold]) pred <- predict(fit, newdata = t(X[selected, fold])) ct <- table(pred, y[fold]) # get the misclassification rate 1 - sum(diag(ct)) / sum(ct) })) barplot( cv, ylim = c(0, 1), las = 2, ylab = "misclassification rate" )

The result of running the code above looks as follows:

We get perfect performance! But why? The data was random, so we should not be able to predict the class label? Let’s have a closer look at the approach we took:

- Generate random data: 100 samples times 100000 features.
- Reduce the number of features by applying a t-test filter to the individual features.
- Train a support vector machine (SVM) by cross-validation, i.e. for each fold:
- Train an SVM on the selected features on the training data (all data but the fold).
- Calculate the misclassification rate on the test data (the fold).

If you look carefully, you will realize that in step 2. we selected features based on *all* samples, so including the samples we would use for testing in step 3. However, the initial feature selection partly determines how the models in the cross-validation loop are trained. Hence, our test samples partly determined the training of our models! This is an example of *data leakage* or *data snooping*: using information from your test samples to train your model. As we have seen, the problem with data leakage is that it leads to inflated performance estimates. So how should you do it without data leakage? Well, move the feature selection to *within* the cross-validation loop to only apply it on the training data:

- Generate random data: 100 samples times 100000 features.
~~Reduce the number of features by applying a t-test filter to the individual features.~~- Train a support vector machine (SVM) by cross-validation, i.e. for each fold:
**Reduce the number of features by applying a t-test filter to the individual features,***using only the training data*(all data but the fold).- Train an SVM on the selected features on the training data (all data but the fold).
- Calculate the misclassification rate on the test data (the fold).

Here’s the code:

# load some libraries library(caret) library(e1071) # set the seed set.seed(123) # generate some data Xy <- generate_random_data() X <- Xy$X y <- Xy$y # Train an SVM to predict the class label, # and estimate the misclassification rate by # cross-validation. folds <- createFolds(y, k = 5) cv <- unlist(lapply(folds, function(fold) { # apply the t-test filter within the cross-validation loop! selected <- row_ttest(X[,-fold], y[-fold]) < 0.1 # train and test a model fit <- svm(x = t(X[selected, -fold]), y = y[-fold]) pred <- predict(fit, newdata = t(X[selected, fold])) ct <- table(pred, y[fold]) # get the misclassification rate 1 - sum(diag(ct)) / sum(ct) })) barplot( cv, ylim = c(0, 1), las = 2, ylab = "misclassification rate" )

Running the code above demonstrates that performance is indeed random:

The example outlined in this post illustrates that data leakage encompasses much more than simply using your test samples for training your model. While directly using test samples for training is trivially avoided, even only slightly more complex data analysis strategies can easily lead to data leakage that can be much more difficult to detect.