# Transformers for multivariate time series

Transformer neural networks are a deep learning architecture that uses attention to model sequence data, and therefore, unlike recurrent neural networks, they do not require the input to be processed sequentially. Motivated by the recent successes of transformers in machine translation, I was curious to experiment with them for modeling multivariate time series, and more specifically multivariate longitudinal clinical patient data. For this purpose, I have now implemented transformers as part of a method that I previously published (paper, code). The method addresses the problem of clustering multivariate time series with potentially many missing values. It uses a variational autoencoder with a Gaussian mixture prior, extended with LSTMs (or GRUs) for modeling multivariate time series, as well as implicit imputation and loss re-weighting for directly dealing with (potentially many) missing values. In addition to LSTMs and GRUs, the package now also supports transformers. Besides variational autoencoders with Gaussian mixture priors, the code allows training ordinary variational LSTM/GRU/transformer autoencoders (with a multivariate Gaussian prior) and ordinary LSTM/GRU/transformer autoencoders (without a prior). I will probably report on some experiments comparing LSTMs and transformers for multivariate longitudinal clinical patient data in the near future.

# Nested cross-validation

In machine learning and statistics, we can assess how well a model generalizes to a new dataset by splitting our data into training and test data:

1. Split data into training and test data.
2. Fit a model to the training data.
3. Assess the performance of the model on the test data.

The disadvantage of the above approach is that not all data is used for training. A more robust assessment of generalization can be obtained using cross-validation:

1. Split the data into k parts (folds).
2. For i in 1…k:
   1. Fit a model to all data but the i-th fold.
   2. Assess the performance of the model on the i-th fold.

This approach uses all data for training, and gives a more robust estimate of generalization performance.
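The cross-validation loop above can be sketched in a few lines of base R. Here `fit_fn` and `score_fn` are hypothetical placeholders for any model-fitting and scoring functions; the examples below use caret's `createFolds` instead of this manual fold assignment:

```r
# Minimal k-fold cross-validation sketch in base R.
# fit_fn(train) and score_fn(model, test) are hypothetical
# placeholders for an arbitrary model and metric.
kfold_cv <- function(data, k, fit_fn, score_fn) {
  # Randomly assign each row to one of k folds
  fold_id <- sample(rep(1:k, length.out = nrow(data)))
  sapply(1:k, function(i) {
    model <- fit_fn(data[fold_id != i, ])       # fit on all but fold i
    score_fn(model, data[fold_id == i, ])       # score on fold i
  })
}

# Example: cross-validated mean squared error of a linear model
set.seed(1)
df <- data.frame(x = rnorm(100))
df$y <- 2 * df$x + rnorm(100)
mse <- kfold_cv(
  df, k = 5,
  fit_fn = function(train) lm(y ~ x, data = train),
  score_fn = function(m, test) mean((predict(m, newdata = test) - test$y)^2)
)
```

This returns one score per fold; averaging them gives the cross-validated estimate.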

Because cross-validation can be used for assessing the performance of a model, it can also be used for comparing the performance of different models (model selection). In the next two sections, I will give examples of both cross-validation for performance estimation and cross-validation for model selection. In the last two sections, I will explain why you need something called nested cross-validation if you want to do performance estimation and model selection simultaneously.

#### Cross-validation for estimating generalization performance

For example, suppose we fit a linear-kernel SVM with C = 1 to the breast cancer data from the R package mlbench. We could then estimate how well such an SVM generalizes by doing a 5-fold cross-validation as follows:

```r
# Load some packages
library(kernlab)
library(caret)
library(mlbench)
library(PRROC)

# Set the random seed for reproducibility
set.seed(111)

# Load and prepare the breast cancer data
data(BreastCancer)
data <- BreastCancer[!is.na(BreastCancer$Bare.nuclei), -1]

# For simplicity, restrict both classes to 200 samples
data <- data[c(
  sample(which(data$Class == levels(data$Class)[1]), 200),
  sample(which(data$Class == levels(data$Class)[2]), 200)
), ]
y <- data$Class
X <- data
X$Class <- NULL

# Test the performance of a linear SVM with C = 1
folds <- createFolds(data$Class, k = 5)
# For each fold ...
auc <- sapply(folds, function(fold) {
  # Train an SVM, excluding the fold
  fit <- ksvm(
    Class ~ .,
    data = data[-fold,],
    kernel = "vanilladot",
    kpar = list(),
    C = 1,
    prob.model = TRUE,
    class.weights = 1 / table(data$Class[-fold])
  )
  # Predict the fold
  yh <- predict(fit, newdata = data[fold,], type = "probabilities")
  # Compare the predictions to the labels
  posneg <- split(yh[,1], data$Class[fold])
  # Return the AUC under the ROC
  roc.curve(posneg[[1]], posneg[[2]])$auc
})
```

The results look as follows (average AUC of about 0.976):

#### Cross-validation for model selection

An example of using cross-validation for choosing between models, such as when optimizing hyperparameters, looks as follows:

```r
# Function for one round of training and validating an SVM
train_and_validate <- function(data, fold, C) {
  # Train an SVM, excluding the fold
  fit <- ksvm(
    Class ~ .,
    data = data[-fold,],
    kernel = "vanilladot",
    kpar = list(),
    C = C,
    prob.model = TRUE,
    class.weights = 1 / table(data$Class[-fold])
  )
  # Predict the fold
  yh <- predict(fit, newdata = data[fold,], type = "probabilities")
  # Compare the predictions to the labels
  posneg <- split(yh[,1], data$Class[fold])
  # Return the AUC under the ROC
  roc.curve(posneg[[1]], posneg[[2]])$auc
}
```

```r
# Function for doing a k-fold cross-validation for each C in CC
cv <- function(
  data,
  k,
  CC,
  seed = NULL
) {
  # Set the seed, if given
  if (!is.null(seed)) {
    set.seed(seed)
  }
  # For each value of the hyperparameter C ...
  auc <- lapply(CC, function(C) {
    folds <- createFolds(data$Class, k = k)
    # For each fold ...
    sapply(folds, function(fold) {
      # Train an SVM, and validate on the fold
      train_and_validate(data, fold, C)
    })
  })
  auc
}

# Do the cross-validation for each C in CC
auc <- cv(
  data = data,
  k = 5,
  CC = 2^seq(log2(.01), log2(10), length.out = 21),
  seed = 111
)
```

In the figure above, we can see that performance remains relatively constant until C $\approx$ 0.5 and then seems to decline, so we had better not take C too large if we want good generalization.

Technical note: in the code above, we sample a different set of folds for each C. We could also have used the same fold set for all C. In that case, however, the results might have depended strongly on how that specific fold set happened to be chosen. While this can also be the case with different fold sets, at least with different fold sets the plot gives an indication of the variance in AUC caused by the choice of folds. Nevertheless, using the same fold set for different models can make much sense in some cases, for example when directly comparing two models. The difference between the models could then be assessed by comparing the folds pairwise, for example using a paired Wilcoxon signed-rank test.

#### Can we simultaneously do model selection and performance estimation?

Looking at the last plot above, you might be wondering the following: can we conclude that the expected generalization AUC is about 0.991 (the maximum value of the red line)? More specifically, is taking the maximum AUC a valid approach for estimating the generalization performance of the best C? That is an interesting question. How could we check it? Well, let's check whether randomized data results in a generalization AUC of about 0.5! More concretely: if we randomize the class labels in the breast cancer dataset, we would expect a classifier to do no better than random.
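Before running the full experiment, the core issue can be previewed with a toy simulation (illustrative numbers only, unrelated to the breast cancer data): the maximum of several noisy-but-unbiased scores is itself biased upward.

```r
# The maximum of several unbiased-but-noisy estimates is biased upward.
# Simulate 11 'AUCs' that are pure noise around 0.5, take the max
# (i.e. 'select the best model'), and repeat many times.
set.seed(42)
max_aucs <- replicate(1000, {
  aucs <- rnorm(11, mean = 0.5, sd = 0.03)  # 11 random 'AUCs'
  max(aucs)                                 # model selection by max
})
mean(max_aucs)  # clearly above 0.5: the selected score is optimistic
```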
Therefore, from the following procedure we would expect a generalization AUC of about 0.5:

1. Randomize the class labels of the breast cancer dataset.
2. Do a cross-validation for all values of C.
3. Choose the best C by taking the maximum value of the AUC.

Let's repeat the above 25 times, and see what happens:

```r
set.seed(111)
auc <- replicate(25, {
  # Randomize the class labels. This should result in
  # random performance, i.e. AUC ~= 0.5
  data$Class <- sample(data$Class)
  # Cross-validate for each C in CC, and take the
  # average of the folds as the performance estimate
  auc <- sapply(cv(
    data = data,
    k = 5,
    CC = 2^seq(log2(.01), log2(10), length.out = 11)
  ), mean)
  # Take the max AUC across the different Cs
  max(auc)
})
```

In the plot above, you can see that with ~0.54, the average observed AUC is quite a bit higher than the ~0.5 you would expect from a random classifier! More specifically, 24 out of 25 AUCs are higher than 0.5, which can be verified to be statistically highly significant, for example using a one-sample Wilcoxon signed-rank test or a binomial test.

Coming back to the question we posed at the beginning of this section: no, taking the maximum AUC is not a valid approach for estimating the generalization performance of the best C. Why is this? Well, in the randomized example, what we are essentially doing (25 times) is randomly sampling 11 AUCs (probably with some covariance structure, but still random). We would expect these 11 AUCs to average out at around 0.5. Among these 11, we then take the maximum, which we would expect to be higher than 0.5. Therefore, after repeating this 25 times, we would definitely expect the average of the 25 maximum AUCs to also be higher than 0.5! Summarizing: simultaneously doing model selection and performance estimation leads to a positively biased performance estimate.

#### Nested cross-validation

So how should we go about doing model selection and performance estimation?
In principle, we could set aside some test data to test the performance once we have optimized C:

1. Set aside some test data, not to be used in the cross-validation.
2. Optimize C using cross-validation on all but the test data.
3. Select the best C (the one with the highest AUC).
4. Test the performance of the best C on the test data.

This is a perfectly valid approach, except that it has the disadvantage of never having used all data for training. This argument should sound familiar: while it is perfectly valid to split our data into training and test data, we get a more robust estimate of our performance by doing a cross-validation, thereby iteratively using all data for training. Analogously, it is perfectly valid to split our data into (1) training data for training the models, (2) validation data for selecting the best model, and (3) test data for assessing the performance of the best model. However, we will get a more robust performance estimate by iteratively using all data for training. This can be done using something called nested cross-validation, as follows:

• Split the data into k1 folds.
• For i in 1…k1:
  1. For each C:
     1. Do a k2-fold cross-validation on all data but the i-th fold.
     2. Return the performance averaged across the k2 folds.
  2. Select the C with the best average performance.
  3. Assess the performance of this model on the i-th fold.

We can now see why this is called nested cross-validation: an inner cross-validation loop for model selection is nested within an outer cross-validation loop for performance estimation. As such, nested cross-validation estimates the expected performance of a model whose hyperparameter C is optimized using cross-validation. Here's a function for doing nested cross-validation:

```r
ncv <- function(
  data,
  k,
  CC,
  seed = NULL
) {
  if (!is.null(seed)) {
    set.seed(seed)
  }
  folds <- createFolds(data$Class, k = k)
  # For each fold ...
  auc <- sapply(folds, function(fold) {
    # Do a cross-validation for each C
    auc <- cv(
      data[-fold,],
      k,
      CC,
      seed = seed
    )
    # Select the C with the highest AUC,
    # breaking any ties randomly
    C <- CC[which.max(sapply(auc, mean))]
    C <- ifelse(length(C) > 1, sample(C, 1), C)
    # Test this C on the fold
    train_and_validate(
      data,
      fold = fold,
      C = C
    )
  })
  auc
}
```


Let's first check whether using this function we indeed get an AUC of ~0.5 on the randomized data:

```r
set.seed(111)
auc <- replicate(25, {
  cat(".")
  # Randomize the class labels. This should result in
  # random performance, i.e. AUC ~= 0.5
  data$Class <- sample(data$Class)
  # This returns k scores
  auc <- ncv(
    data = data,
    k = 5,
    CC = 2^seq(log2(.01), log2(10), length.out = 11)
  )
  # Take the average as the performance estimate
  mean(auc)
})
```


Indeed, we get random performance! Now let’s try it out on the breast cancer data:

```r
# Test on the BreastCancer data
auc <- ncv(
  data = data,
  k = 5,
  CC = 2^seq(log2(.01), log2(10), length.out = 21),
  seed = 111
)
```


We can see that nested cross-validation gives an expected performance of ~0.987, which is indeed lower than the positively biased estimate of ~0.991 we found initially. This may seem like a small difference, but compared to 0.991, 0.987 is ~40% further away from the perfect classifier with AUC = 1, making it quite a substantial difference.
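The "~40% further away" figure follows from a quick back-of-the-envelope computation on the two distances to a perfect AUC of 1:

```r
# Distance from a perfect classifier (AUC = 1) for both estimates
biased <- 1 - 0.991  # biased estimate: distance 0.009
nested <- 1 - 0.987  # nested CV estimate: distance 0.013
nested / biased      # ~1.44: roughly 40-45% further from perfect
```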

# Transfer learning: domain adaptation by instance-reweighting

In supervised learning, we typically train a model on labeled data (we know “the truth”) and eventually apply the model to unlabeled data (we do not know “the truth”). For example, a spam filtering model might be trained on a public email database, with emails clearly labeled as “spam” or “non-spam”. However, the model will eventually be applied to a personal inbox, where emails are not labeled. An interesting example from the life sciences is training a classifier for predicting protein interactions in a species for which biologically validated interactions are known, and applying this classifier to other species, for which no such validated interactions exist.

But what if, in addition to missing labels, the data we apply our model to (the “target data”) is just very different from our training data (the “source data”)? For example, in a personal inbox, both spam and non-spam emails may have very different characteristics compared to the emails in the public email database. Likewise, in protein interaction prediction, it could be important to consider that species can have very different proteomes, and therefore also different protein interactions.

In cases such as the two outlined above, what we would like to do, is make sure that our model performs well on the target data, while still training on the source data. How can we do that?

#### The covariate shift assumption

In cases where the target data is very different from the source data, we need to think about domain adaptation. Domain adaptation can be seen as a specific case of transfer learning, and can be applied in situations where there is only a covariate shift between source and target data:

$P_S(X) \neq P_T(X) \textrm{, but } P_S(Y|X=x) = P_T(Y|X=x)$

Here,

• $P_S(X)$ represents the marginal covariate distribution of source instances.
• $P_T(X)$ represents the marginal covariate distribution of target instances.
• $P_S(Y|X=x)$ represents the conditional class distribution given source instance $x$.
• $P_T(Y|X=x)$ represents the conditional class distribution given target instance $x$.

In words, the first part ($P_S(X) \neq P_T(X)$) simply means that in general you find different emails in the public email database than in your own inbox: the target data is distributionally different from the source data. The second part ($P_S(Y|X=x) = P_T(Y|X=x)$) means that the class probability of an instance $x$ is independent of whether $x$ originated from the source or target distribution. In the example of spam filtering: if you have a specific email, then the probability of this email being spam stays the same, regardless of whether it originated from the public email database or from your personal inbox.
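As a minimal numeric sketch of covariate shift (with made-up distributions), we can sample source and target covariates from different marginals while sharing a single conditional class distribution:

```r
# Toy covariate shift: same conditional P(Y | X = x), different marginals.
set.seed(7)
p_class_given_x <- function(x) plogis(2 * x)  # shared P(Y = 1 | X = x)

x_source <- rnorm(1e4, mean = -1)  # P_S(X): centered at -1
x_target <- rnorm(1e4, mean =  1)  # P_T(X): centered at +1

y_source <- rbinom(1e4, 1, p_class_given_x(x_source))
y_target <- rbinom(1e4, 1, p_class_given_x(x_target))

# The marginals clearly differ ...
c(mean(x_source), mean(x_target))
# ... but at comparable x, the class probability is the same,
# e.g. near x = 0 both are close to plogis(0) = 0.5:
mean(y_source[abs(x_source) < 0.1])
mean(y_target[abs(x_target) < 0.1])
```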

#### Model misspecification

Now you might think: if we train a classifier, why would we care whether the source and target covariate distributions are different? We are only interested in the conditional class distribution $P_T(Y|X=x)$, and because the assumption is that $P_S(Y|X=x) = P_T(Y|X=x)$, we can simply train a classifier on the source data to obtain optimal performance on the target data. Well, ideally, yes. However, it has been shown that under model misspecification, covariate shift can in fact be a problem, and the thing is that models typically are misspecified: we do not know what function generated our data, but most likely it was not precisely of the form that we used for fitting the data. For example, fitting a line (e.g. using logistic regression) to separate the classes in the following case would be a clear case of model misspecification:
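A sketch of such a case, assuming a corner-shaped class region similar to the simulated data used later in this post:

```r
# Model misspecification sketch: the true class boundary is a corner
# (class1 iff x1 < 0 AND x2 > 0), which no single line can separate.
set.seed(3)
n  <- 2000
x1 <- runif(n, -1, 1)
x2 <- runif(n, -1, 1)
y  <- as.integer(x1 < 0 & x2 > 0)  # nonlinear, corner-shaped truth

# A logistic regression (linear decision boundary) is misspecified here:
fit  <- glm(y ~ x1 + x2, family = binomial)
pred <- as.integer(predict(fit, type = "response") > 0.5)
acc  <- mean(pred == y)
acc  # well below 1: the line cannot capture the corner
```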

#### Model misspecification in a transfer learning setting

Back to transfer learning. Remember that in our transfer learning setting, we are training on labeled source data, and will apply the resulting classifier to unlabeled target data. Moreover, the unlabeled target data is distributionally different from the source data. Let’s extend the above example, and separate the data into source and target data:

You can see that the target data is differently distributed compared to the source data: it tends to have higher x1 values, implying that $P_S(X) \neq P_T(X)$. Furthermore, target class labels are unknown. Therefore, in training a classifier separating class1 from class2, the only thing that we can do is train on the labeled source data. Training a logistic regression classifier on the source data gives the decision boundary in the left two plots:

The decision boundary indeed seems optimal for the source data (left plot). However, it is far from optimal for the target data (middle plot). In fact, the optimal decision boundary of the target data is much steeper (right plot). In this transfer learning setting, the model misspecification implies that it is not possible to find a logistic regression parameterization $\theta$, such that $P_S(Y|X=x, \theta) = P_T(Y|X=x, \theta)$ for all $x$. In other words, the optimal model for the source data is different from the optimal model for the target data. This brings us to the following question: Is there a way to train a classifier on the source data, while trying to optimize for performance on the target data?

#### Re-weighted empirical risk minimization

It turns out: yes, we can train on the source data while optimizing for performance on the target data. Let’s first go through some math to show how (or skip ahead to the R implementation if you are not interested). Recall that true risk minimization finds a parameterization $\theta = \theta^\ast$ such that the expected value of the loss function $l(x,y,\theta)$ under the true joint distribution $P(x,y)$ over $X$ and $Y$ is minimized:

$\theta^\ast = \underset{\theta}{\textrm{argmin}} \; \mathbb{E}_{(x,y) \sim P(x,y)}\left[l(x,y,\theta)\right]$

Empirical risk minimization approximates true risk minimization by using the empirical joint distribution over $X$ and $Y$, because the true joint distribution $P(x,y)$ is unknown:

$\theta^\ast \approx \underset{\theta}{\textrm{argmin}} \; \frac{1}{N} \sum_{i=1}^{N} l(x_i,y_i,\theta)$

Note that in the above, $(x_i, y_i) \sim P(x,y)$.

In our domain adaptation problem, we have two joint distributions: the source distribution $P_S(x,y)$ and the target distribution $P_T(x,y)$. While training on the empirical source distribution, we want to optimize for performance on the target distribution. To do this, we use our previous assumption ($P_S(Y|X=x) = P_T(Y|X=x)$ for all $x$), and apply the following trick for transferring knowledge from our source domain to our target domain:

$\begin{aligned} \theta^\ast &= \underset{\theta}{\textrm{argmin}} \; \mathbb{E}_{(x,y) \sim P_T(x,y)}\left[l(x,y,\theta)\right] \\ &= \underset{\theta}{\textrm{argmin}} \; \mathbb{E}_{(x,y) \sim P_S(x,y)}\left[\frac{P_T(x,y)}{P_S(x,y)}\, l(x,y,\theta)\right] \\ &= \underset{\theta}{\textrm{argmin}} \; \mathbb{E}_{(x,y) \sim P_S(x,y)}\left[\frac{P_T(x)}{P_S(x)}\, l(x,y,\theta)\right] \\ &\approx \underset{\theta}{\textrm{argmin}} \; \frac{1}{N_S} \sum_{i=1}^{N_S} \frac{P_T(x_i)}{P_S(x_i)}\, l(x_i,y_i,\theta) \end{aligned}$

where the third step factorizes $P(x,y) = P(x)\,P(y|x)$ and uses the assumption that the conditional class distributions of source and target are equal.

Note that in the above, $(x_i, y_i) \sim P_S(x,y)$. So we started with the normal formulation of true risk minimization under the target distribution, and showed that we can approximate this by re-weighting each source instance $(x_i, y_i)$ in an empirical risk minimization under the source distribution! More specifically, each instance $(x_i, y_i)$ needs to be re-weighted by the ratio of the marginal covariate probabilities $\frac{P_T(x_i)}{P_S(x_i)}$. Interestingly, the above suggests that doing re-weighted empirical risk minimization is essentially the same as performing importance sampling for computing the expected value of the loss function under the target joint distribution, with the additional assumption that the conditional class probabilities between the source and target data are the same.
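The importance-sampling view can be checked on a toy problem (made-up Gaussian source and target distributions): the density-ratio-weighted average over source samples recovers the target expectation.

```r
# Importance sampling sketch: estimate E_T[f(x)] using only source
# samples, re-weighted by the density ratio p_T(x) / p_S(x).
set.seed(11)
f   <- function(x) x^2        # stand-in for the loss
x_s <- rnorm(1e5, 0, 1)       # source samples: N(0, 1)

# Target is N(1, 1), so the true E_T[x^2] = Var + mean^2 = 2
w <- dnorm(x_s, 1, 1) / dnorm(x_s, 0, 1)  # density ratio weights

naive    <- mean(f(x_s))      # ~1: expectation under the source
weighted <- mean(w * f(x_s))  # ~2: expectation under the target
c(naive = naive, weighted = weighted)
```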

#### How to estimate the marginal probability ratio?

The problem with the above result is that $P_T(x)$ and $P_S(x)$ are difficult to determine. However, we can avoid computing these probabilities directly by interpreting the marginal probability ratio as another probability ratio: the ratio of the probabilities that $x_i$ comes from the target data and from the source data, weighted by the ratio of the source data size $N_S$ and the target data size $N_T$:

$\frac{P_T(x_i)}{P_S(x_i)} \approx \frac{N_S}{N_T} \cdot \frac{P(x_i \textrm{ comes from the target data})}{P(x_i \textrm{ comes from the source data})}$

Why is this? Well, here’s an argument for the discrete case. Suppose we independently draw two random samples, one of size $N_S$ from the source distribution, and one of size $N_T$ from the target distribution. We merge these two random samples, and from this sample of size $N_S + N_T$ we draw a single instance $x_i$. What is the probability that $x_i$ originated from the target distribution? If $n^S_i$ is the number of occurrences of $x_i$ in the random source sample of size $N_S$, and $n^T_i$ is the number of occurrences of $x_i$ in the random target sample of size $N_T$, then the following represents the probability that $x_i$ originated from the target distribution:

$P(x_i \textrm{ comes from the target data}) = \frac{n^T_i}{n^T_i+n^S_i}$

Similarly for the source data:

$P(x_i \textrm{ comes from the source data}) = \frac{n^S_i}{n^S_i+n^T_i}$

Now what is the expected value of their ratio? Dividing the two probabilities gives $\frac{n^T_i}{n^S_i}$, and since the expected counts in the two random samples are $\mathbb{E}[n^T_i] = N_T P_T(x_i)$ and $\mathbb{E}[n^S_i] = N_S P_S(x_i)$, we have, approximately,

$\frac{P(x_i \textrm{ comes from the target data})}{P(x_i \textrm{ comes from the source data})} = \frac{n^T_i}{n^S_i} \approx \frac{N_T\, P_T(x_i)}{N_S\, P_S(x_i)}$

which, after multiplying by $\frac{N_S}{N_T}$, gives exactly the marginal probability ratio $\frac{P_T(x_i)}{P_S(x_i)}$ we were after. So all we need to do is estimate, for each source instance, the probability that it originated from the target data. How do we do that? One straightforward way of estimating these probabilities is to train a naturally probabilistic classifier, such as a logistic regression classifier.

#### A simple way of implementing re-weighting

We now have everything in place to train a classifier on the source data, while optimizing for performance on the target data:

1. Compute the source instance weights:
   1. Train a logistic regression classifier separating the source data from the target data.
   2. Apply the classifier to each source instance $x^S_i$, thus computing $p_i = P(x^S_i \textrm{ comes from the target data})$.
   3. For each source instance $x^S_i$, compute the instance weight $w_i$ as $w_i = \frac{p_i}{1-p_i}$.
2. Train a logistic regression classifier on the source data, separating class1 from class2, while re-weighting each source instance $x^S_i$ by $w_i$.

(The constant factor $\frac{N_S}{N_T}$ can be dropped from the weights, since rescaling all instance weights by the same constant does not change the minimizer.)

In R, this could look as follows. First define some functions:


```r
# Function to generate random data.
generate_data <- function(n) {

  range_x1 <- 1
  range_x2 <- 1

  # The features.
  x1 <- runif(n, -range_x1, range_x1)
  x2 <- runif(n, -range_x2, range_x2)

  # Generate class labels.
  y <- (x1 < 0 & x2 > 0) + 1

  # Generate source and target labels.
  prob <- (x1 + range_x1) / range_x1 / 2
  s <- 1:n %in% sample(n, n/2, prob = prob^5) + 1

  data.frame(
    x1 = x1,
    x2 = x2,
    y = factor(c("class1", "class2")[y]),
    s = factor(c("source", "target")[s])
  )
}

# Function to fit a logistic regression classifier,
# possibly weighted.
fitLRG <- function(df, weights = rep(1, nrow(df))) {
  # Compute the class weights.
  tab <- 1 / table(df$y)
  # Multiply by the instance weights.
  weights <- as.numeric(weights * tab[match(df$y, names(tab))])
  # Fit a logistic regression model on the class label.
  fit <- coef(glmnet(
    x = as.matrix(df[, c("x1", "x2")]),
    y = df$y,
    lambda = seq(1, 0, -0.01),
    weights = weights,
    family = "binomial"
  ))
  fit[, ncol(fit)]
}

# Function to compute instance weights.
compute_instance_weights <- function(df) {
  # Fit a logistic regression model on the
  # source/target indicator.
  fit <- glmnet(
    x = as.matrix(df[, c("x1", "x2")]),
    y = df$s,
    lambda = seq(1, 0, -0.01),
    family = "binomial"
  )
  # For each instance, compute the probability
  # that it came from the target data.
  p <- predict(
    fit,
    newx = as.matrix(df[, c("x1", "x2")]),
    type = "response"
  )
  p <- p[, ncol(p)]
  p / (1 - p)
}
```



Now let’s do some transfer learning:


```r
# Load a package for fitting logistic regression models.
library(glmnet)

# Set the seed for reproducibility.
set.seed(1)

# Generate some random data.
df <- generate_data(1e3)

# Train an unweighted classifier.
fit_unweighted <- fitLRG(df[df$s == "source",])

# Train a re-weighted classifier:
# 1. Compute the instance weights
weights <- compute_instance_weights(df)
# 2. Train a weighted classifier
fit_reweighted <- fitLRG(
  df[df$s == "source",],
  weights = weights[df$s == "source"]
)
```

The results confirm that instance re-weighting indeed leads to a decision boundary that is much closer to the optimal decision boundary for the target data:

# Feature selection, cross-validation and data leakage

In machine learning and statistics, data leakage is the problem of using information from your test samples for training your model. The problem with data leakage is that it inflates performance estimates. The simplest case, directly using test samples for training, is easily avoided. However, it does not take much for data leakage to become quite a bit more difficult to detect, as illustrated by the following example combining feature selection with cross-validation.

A common problem in the analysis of high-throughput biological data is that, very often, the number of features is much larger than the number of samples. For example, a typical RNA-seq experiment will result in tens of thousands of features, whereas the number of samples will be orders of magnitude smaller. Building statistical or machine learning models based on such data poses a number of problems. For example, the large number of features typically makes a resulting model more difficult to interpret. Also, it may take longer to train the model. However, the most important problem may be that training models on many features is more prone to overfitting, and as such increases the generalization error. For this reason, a first step often taken in training models on data with many more features than samples is selecting only a limited number of relevant features to be used for training. One of the simplest ways of selecting relevant features is applying a filter. A filter selects features on an individual basis, without considering the learning algorithm that will eventually be applied to the set of selected features.
An example would be checking the correlation of each feature with the response variable, and selecting only those for which the correlation exceeds a certain threshold. For our example combining feature selection with cross-validation, we will use a simple t-test filter. Here's an R function that performs simultaneous t-tests on the rows (features) of a feature matrix X given a class variable y, and returns a p-value for each feature:

```r
# Fast row t-test for equal sample sizes and equal variance;
# the hypothesis test is two-sided.
row_ttest <- function(X, y) {
  library(matrixStats)
  X1 <- X[, y == levels(y)[1]]
  X2 <- X[, y == levels(y)[2]]
  spool <- sqrt((rowVars(X1) + rowVars(X2)) / 2)
  tstat <- (rowMeans(X1) - rowMeans(X2)) / spool / sqrt(4 / ncol(X))
  df <- ncol(X) - 2
  p <- pt(tstat, df)
  pmin(p, 1 - p) * 2
}
```

The data we will use for the example is random data with 100 samples times 100,000 features. The samples are randomly divided into two classes, 50 samples for each class. Here's a function that generates the random data:

```r
generate_random_data <- function(
  # number of samples in the random data
  nsamples = 1e2,
  # number of features in the random data
  nfeatures = 1e5
) {
  # the features
  X <- matrix(
    rnorm(nsamples * nfeatures),
    nrow = nfeatures,
    ncol = nsamples
  )
  # the class labels
  y <- gl(2, nsamples / 2)
  list(
    X = X,
    y = y
  )
}
```

Now, the problem at hand is to train a classifier that predicts the class label of the above data. Because of the large number of features, we want to do some initial feature selection, and might consider the following approach:

1. Reduce the number of features by applying a t-test filter to the individual features.
2. Train a support vector machine (SVM) and estimate the misclassification rate by cross-validation.
Here's the code for that approach, which includes calls to the previously defined functions:

```r
# load some libraries
library(caret)
library(e1071)

# set the seed
set.seed(123)

# generate some data
Xy <- generate_random_data()
X <- Xy$X
y <- Xy$y

# apply the t-test filter, using all samples
selected <- row_ttest(X, y) < 0.1

# Train an SVM to predict the class label,
# and estimate the misclassification rate by
# cross-validation.
folds <- createFolds(y, k = 5)
cv <- unlist(lapply(folds, function(fold) {
  # train and test a model
  fit <- svm(x = t(X[selected, -fold]), y = y[-fold])
  pred <- predict(fit, newdata = t(X[selected, fold]))
  ct <- table(pred, y[fold])
  # get the misclassification rate
  1 - sum(diag(ct)) / sum(ct)
}))
barplot(
  cv, ylim = c(0, 1), las = 2,
  ylab = "misclassification rate"
)
```

The result of running the code above looks as follows:

We get perfect performance! But why? The data was random, so we should not be able to predict the class label. Let's have a closer look at the approach we took:

1. Generate random data: 100 samples times 100,000 features.
2. Reduce the number of features by applying a t-test filter to the individual features.
3. Train a support vector machine (SVM) by cross-validation, i.e. for each fold:
   1. Train an SVM on the selected features, using the training data (all data but the fold).
   2. Calculate the misclassification rate on the test data (the fold).

If you look carefully, you will notice that in step 2 we selected features based on all samples, including the samples we would use for testing in step 3. However, the initial feature selection partly determines how the models in the cross-validation loop are trained. Hence, our test samples partly determined the training of our models! This is an example of data leakage (or data snooping): using information from your test samples to train your model. As we have seen, the problem with data leakage is that it leads to inflated performance estimates. So how should you do it without data leakage? Move the feature selection to within the cross-validation loop, so that it is applied only to the training data:

1. Generate random data: 100 samples times 100,000 features.
2. Train a support vector machine (SVM) by cross-validation, i.e. for each fold:
   1. Reduce the number of features by applying a t-test filter to the individual features, using only the training data (all data but the fold).
   2. Train an SVM on the selected features, using the training data (all data but the fold).
   3. Calculate the misclassification rate on the test data (the fold).

Here's the code:

```r
# load some libraries
library(caret)
library(e1071)

# set the seed
set.seed(123)

# generate some data
Xy <- generate_random_data()
X <- Xy$X
y <- Xy$y

# Train an SVM to predict the class label,
# and estimate the misclassification rate by
# cross-validation.
folds <- createFolds(y, k = 5)
cv <- unlist(lapply(folds, function(fold) {
  # apply the t-test filter within the cross-validation loop!
  selected <- row_ttest(X[,-fold], y[-fold]) < 0.1
  # train and test a model
  fit <- svm(x = t(X[selected, -fold]), y = y[-fold])
  pred <- predict(fit, newdata = t(X[selected, fold]))
  ct <- table(pred, y[fold])
  # get the misclassification rate
  1 - sum(diag(ct)) / sum(ct)
}))
barplot(
  cv, ylim = c(0, 1), las = 2,
  ylab = "misclassification rate"
)
```



Running the code above demonstrates that performance is indeed random:

The example outlined in this post illustrates that data leakage encompasses much more than simply using your test samples for training your model. While directly using test samples for training is trivially avoided, even only slightly more complex data analysis strategies can easily lead to data leakage that can be much more difficult to detect.

# Order of appearance and final ranking in the Queen Elisabeth Competition

The degree to which competitions can be objectively judged depends on the type of competition. For example, athletes competing in a 100m sprint can be objectively ranked by the time they need to cross the finish line. Objective judgement, on the other hand, is virtually impossible in music competitions, because so much depends on highly subjective personal opinions, which can differ widely across members of a jury. A famous example is the 1980 Chopin Competition in Warsaw, where Ivo Pogorelich was eliminated in the 3rd round. This prompted Martha Argerich, declaring Pogorelich a ‘genius’, to resign from the jury in protest. It is also important to note that, since personal opinions play such an important role in judging music competitions, there is an increased risk of biases resulting from non-music-related (i.e. confounding) factors. For example, it is often said that the order of appearance affects the final ranking of the candidates. In addition to the jury members as a potential cause of such biases, the contestants themselves may also be involved in generating them, for example when a specific position within the program of a competition has a certain psychological effect on a contestant.

The Queen Elisabeth Competition is one of the most prestigious music competitions in the world, and while watching the final round of this year’s edition in May, I was wondering to what extent the order of appearance has influenced the final rankings over the years. To assess this influence, I collected the data on order of appearance and ranking in all Queen Elisabeth Competition finals since 1952 from the website of the competition, in total 216 results for 18 piano competitions. In the final round of the Queen Elisabeth Competition, 12 pianists give their performances spread across 6 days, where each pianist is randomly (!) allocated a specific time slot on one of the 6 days, either before or after the intermission. Hence, one can imagine the final ranking being influenced (‘biased’) by the day of performance (1 – 6), as well as by the order of performance on a given day (1 or 2). This is visualized in the boxplot below, where prizes won by finalists over the years are divided into categories based on the day of performance, and the order of performance on a given day.

Note that for all competitions since 1995, only prizes 1 – 6 have been awarded, and the remaining finalists are unranked. In the plot above, these unranked finalists have been awarded the ‘9.5th prize’ (the average of 7 and 12).

Quite strikingly, the boxplot shows that final ranking is indeed substantially influenced by day and order of appearance. For example, if you’re the first finalist on the first day, this may mean you have lost before you have played even a single note: the 2nd finalist on the last day was typically ranked 5 to 6 places higher than the 1st finalist on the first day! (as measured by the difference in median prize, i.e. the thick black line within each of the gray boxes). At least equally striking is how prize depends on the order of performance on a given day. For example, as a finalist you may be happy about being selected to play on the last day. However, it may not do you any good, unless you are scheduled to play after the intermission: the 2nd finalist on the last day was typically ranked 5 places higher than the 1st finalist on that same day! More generally, the 1st finalist on a given day typically did quite a bit worse than the 2nd finalist on the same day. Moreover, a 1st finalist on days 4, 5 or 6 typically did worse than a 2nd finalist on days 1, 2 or 3.

The above observations imply that if we would want to place bets regarding the final ranking, without having any prior knowledge of music, we may actually be able to do quite a bit better than random guessing. For example, suppose we would want to predict whether a finalist receives a prize or is unranked (or, for competitions before 1995: whether a finalist ranks between 1 and 6, or between 7 and 12). In this case, by random guessing we would expect to classify 6 out of 12 finalists correctly; guessing more than 6 correctly is better than expected. (Some technicalities follow, but feel free to skip to the last paragraph of this post.) One way of trying to beat the expected number of 6 correct guesses is to use a Random Forest classifier. Doing this in the statistical programming language R, using the randomForest package, gives a specificity of 0.56 and a sensitivity of 0.71, as determined from the predicted classes of the input samples based on out-of-bag samples (p ≈ 0.0001 using Fisher’s exact test). Together, these give a generalization error rate of 37%.

Hence, using such a classifier, one would expect to classify 100% – 37% = 63% of finalists correctly as receiving a prize or not, purely based on day and order of performance of the finalists. Note that 63% of finalists amounts to 7 – 8 finalists, which is 1 – 2 more than expected by random guessing. This again demonstrates the substantial bias in the final ranking of finalists in the Queen Elisabeth Competition, induced by day of performance, and order of performance on a given day.
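To make the arithmetic above concrete, here is a minimal sketch in R that recovers an error rate of roughly 37% from an out-of-bag confusion matrix. The counts below are illustrative values chosen to be consistent with the reported sensitivity (0.71), specificity (0.56), and 216 finalists; they are not the actual competition data.

```r
# hypothetical OOB confusion matrix: rows = true class,
# columns = predicted class (illustrative counts only)
ct <- matrix(
  c(60, 48,   # true 'unranked': 60 predicted correctly
    31, 77),  # true 'prize':    77 predicted correctly
  nrow = 2, byrow = TRUE,
  dimnames = list(
    true      = c("unranked", "prize"),
    predicted = c("unranked", "prize")
  )
)

specificity <- ct[1, 1] / sum(ct[1, ])      # 60 / 108, about 0.56
sensitivity <- ct[2, 2] / sum(ct[2, ])      # 77 / 108, about 0.71
error_rate  <- 1 - sum(diag(ct)) / sum(ct)  # 79 / 216, about 0.37

# association between prediction and truth
p_value <- fisher.test(ct)$p.value
```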

# SVM with recursive feature elimination in R

The support vector machine (SVM) is a very powerful classifier due to its inherent regularization properties as well as its ability to handle decision boundaries of arbitrary complexity by its formulation as a kernel method. However, suppose that we’re not so much interested in the decision boundary itself, but more in the relative importance of the features in specifying this decision boundary. For example, given two sets (i.e. ‘classes’) of samples, one from tumor tissue and one from normal tissue, and associated RNA-seq gene expression data, one may be interested in ranking genes (i.e. ‘features’) according to their importance in specifying the difference (i.e. ‘decision boundary’) between normal and tumor tissue. This represents an important problem in the field of machine learning, and is called feature ranking. Guyon et al. proposed combining SVMs with recursive feature elimination (SVM-RFE) for feature ranking. Recursive feature elimination in its simplest formulation starts with the complete set of features, and then repeats the following three steps until no more features are left:

1. Train a model (in the present case, an SVM).
2. Compute a ranking of features.
3. Remove the feature with the worst rank.

For step 2, ranking the features, Guyon et al. proposed to take the coefficients to the SVM primal problem (i.e. the coefficients to the SVM hyperplane) as feature weights. Since SVMs are computed by solving the dual problem using quadratic programming, these hyperplane coefficients first need to be computed from the coefficients to the dual problem. In R, for a matrix X with samples in the rows and features in the columns, and a factor y containing the classes, this looks as follows:


# single-step SVM-RFE
svm_rfe__single_step <- function(X, y, ...) {

  library(kernlab)

  # keep track of the iteration during which
  # a feature was eliminated
  ii <- rep(NA, ncol(X))
  i <- 0
  while (any(is.na(ii))) {
    # indices of remaining features
    not_elim_yet <- which(is.na(ii))
    # train the classifier on the remaining features
    fit <- ksvm(X[, not_elim_yet], y, scaled = FALSE, ...)
    # compute the primal problem coefficients from the dual
    # problem coefficients
    sv_i <- alphaindex(fit)[[1]]
    w <- t(coef(fit)[[1]]) %*% X[sv_i, not_elim_yet]
    # eliminate the feature with the smallest squared weight
    to_elim <- not_elim_yet[head(order(w * w), 1)]
    ii[to_elim] <- i
    i <- i + 1
  }
  # convert iterations into ranks
  i - ii

}



The above algorithm eliminates one feature (i.e. gene) per iteration. Typical RNA-seq expression datasets have hundreds to tens of thousands of features, so a call to the function above also needs to train an SVM hundreds to tens of thousands of times. Needless to say, this can take quite a while. Note that we’re most likely mainly interested in the top-ranking genes, and less so in the lower-ranking genes. Therefore, we could also eliminate relatively many features during early iterations, and then decrease the number of features to be eliminated as the iteration progresses. Taking the approach of exponential decrease (i.e. eliminating a fixed fraction of the remaining features), this looks as follows:


# multi-step SVM-RFE
# default elim_frac corresponds to single-step
svm_rfe <- function(X, y, elim_frac = 1 / ncol(X), ...) {

  library(kernlab)

  # keep track of the iteration during which
  # a feature was eliminated
  ii <- rep(NA, ncol(X))
  i <- 0
  while (any(is.na(ii))) {
    # indices of remaining features
    not_elim_yet <- which(is.na(ii))
    # number of features to eliminate
    n_to_elim <- ceiling(elim_frac * length(not_elim_yet))
    # train the classifier on the remaining features
    fit <- ksvm(X[, not_elim_yet], y, scaled = FALSE, ...)
    # compute the primal problem coefficients from the dual
    # problem coefficients
    sv_i <- alphaindex(fit)[[1]]
    w <- t(coef(fit)[[1]]) %*% X[sv_i, not_elim_yet]
    # eliminate the features with the smallest squared weights
    to_elim <- not_elim_yet[head(order(w * w), n_to_elim)]
    ii[to_elim] <- i
    i <- i + 1
  }
  # convert iterations into ranks
  i - ii

}



Let’s see how this works in practice. For this, we’ll use the ‘cervical’ data from the Bioconductor package MLSeq. This dataset contains the expression levels, as measured by RNA-seq, of 714 miRNAs in 29 tumor and 29 normal human cervical samples. We can load and process the data, and run the above functions, using the following code:


library(MLSeq)

data(cervical)

# preprocess
y <- factor(grepl("^T[0-9].*", colnames(cervical)))
levels(y) <- c("normal", "tumor")
X <- scale(t(data.matrix(cervical)))

# SVM-RFE single-step
system.time(I <- svm_rfe(X, y))
# SVM-RFE multi-step
system.time(J <- svm_rfe(X, y, 0.1))



In the example above, using the default values removes one feature at a time and thus trains 714 SVMs, one for each feature. On the other hand, setting elim_frac to 0.1 (i.e. at each iteration, 10% of the remaining features are eliminated) trains only 47 SVMs. Although setting elim_frac to 0.1 trains roughly 15 times fewer SVMs, it runs approximately 30 times as fast (~1 second vs. ~33 seconds on my laptop). This is because the multi-step approach mostly cuts the SVMs trained on many features, which are the most expensive to fit, while the cheap SVMs trained on few features are reduced less. So, we lose some resolution in the relatively uninteresting lower-ranking features, but gain a lot in terms of execution time. This gain in execution time is visualized in the following plot.
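The number of SVMs trained for a given elim_frac can be computed without fitting anything, by simply replaying the elimination schedule. The helper below is not part of the functions above, just a small sketch of that calculation:

```r
# count how many SVMs svm_rfe() would train for a given
# elimination fraction, by replaying the elimination schedule
n_svms <- function(n_features, elim_frac) {
  n <- 0
  while (n_features > 0) {
    n_features <- n_features - ceiling(elim_frac * n_features)
    n <- n + 1
  }
  n
}

n_svms(714, 1 / 714)  # 714: single-step, one SVM per feature
n_svms(714, 0.1)      # 47: multi-step with 10% elimination
```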