# Transformers for multivariate time series

Transformer neural networks represent a deep learning architecture exploiting attention for modeling sequence data, thus not requiring input data to actually be processed in sequence, unlike recurrent neural networks. Because of the recent successes of transformers in machine translation, I was curious to experiment with them for their use in modeling multivariate time series, and more specifically multivariate longitudinal clinical patient data. For this purpose, I have now implemented transformers as part of a method that I previously published (paper, code). The method addresses the problem of clustering multivariate time series with potentially many missing values, and uses a variational autoencoder with a Gaussian mixture prior, extended with LSTMs (or GRUs) for modeling multivariate time series, as well as implicit imputation and loss re-weighting for directly dealing with (potentially many) missing values. In addition to LSTMs and GRUs, I have now implemented transformers as part of the package. In addition to variational autoencoders with Gaussian mixture priors, the code allows to train ordinary variational LSTM/GRU/Transformer autoencoders (multivariate gaussian prior) and ordinary LSTM/GRU/Transformer autoencoders (without prior). I will probably report on some experiments comparing LSTMs and transformers for multivariate longitudinal clinical patient data in the near future.

# Hamming distance in R using Rcpp

After all my posts on efficiently computing the Hamming distance in R, two questions remain for me. First, can we compute the Hamming distance even faster when implementing it in a low-level programming language such as C/C++, and then use the great Rcpp package to directly call this low-level C/C++ function from R? Second, how much would parallelization help? I will answer the second question in a separate post, and address the first question here.

I implemented two C/C++ functions for computing the Hamming distance between all pairs of columns of a binary matrix X:

1. A straightforward approach that I called “hamming__C_simple”: this function simply iterates through all combinations of columns of a matrix X to compare all elements between each pair of columns.
2. A more involved approach that I called “hamming__C_bitwise”: This function exploits the ability one has in C to directly manipulate individual bits in memory, which should typically be very fast. The function first interprets the R binary vectors as sequences of unsigned integers, to then perform bitwise operations on these integers.

Note that focusing on binary matrices does not limit the applicability of the C/C++ functions, because multi-value Hamming distances can be easily computed by adding multiple binary Hamming distances.

The code for both implementations is shown below.


#include "Rcpp.h"
#include "math.h"
#include "stdlib.h"
#include "stdio.h"
#include "math.h"
#include "algorithm"
#include "climits"
using namespace Rcpp;

unsigned long long binvec2int(IntegerMatrix x) {
unsigned long long u = 0;
int n = x.nrow();
for (int i = 0; i < n; i++) {
u += x(i, 0) * (unsigned long long) pow(2, n - 1 - i);
}
return u;
}

// [[Rcpp::export]]
NumericMatrix hamming__C_bitwise(IntegerMatrix X) {
int i, j, k, from, to, h;
int stepsize = sizeof(unsigned long long) * CHAR_BIT;
int n = ceil(((double) X.nrow()) / stepsize);
IntegerMatrix x;
NumericMatrix H(X.ncol(), X.ncol());
// convert X to unsigned long long
unsigned long long *xx =
(unsigned long long *) malloc(n * X.ncol() * sizeof(unsigned long long));
i = 0;
while ((from = i * stepsize) < X.nrow()) {
for (j = 0; j < X.ncol(); j++) {
to = std::min(X.nrow(), (i + 1) * stepsize);
x = X(Range(from, to - 1), Range(j, j));
xx[i + j * n] = binvec2int(x);
}
i++;
}

for (i = 0; i < X.ncol(); i++) {
for (j = i + 1; j < X.ncol(); j++) {
for (k = 0; k < n; k++) {
h = __builtin_popcountll(xx[k + i * n] ^ xx[k + j * n]);
H(i, j) += h;
H(j, i) += h;
}
}
}
free(xx);
return H;
}

// [[Rcpp::export]]
NumericMatrix hamming__C_simple(IntegerMatrix X) {
int i, j, k, h;
NumericMatrix H(X.ncol(), X.ncol());
for (i = 0; i < X.ncol(); i++) {
for (j = i + 1; j < X.ncol(); j++) {
for (k = 0; k < n; k++) {
h = (X(k, i) != X(k, j));
H(i, j) += h;
H(j, i) += h;
}
}
}
return H;
}



I compared the performance of the two C/C++ functions to a very fast pure R function based on matrix multiplication that I wrote about before:

Interestingly, we can see that the straightforward C++ implementation (C_simple) is not faster than the pure R function (pure_R). In other words, efficient use of pure R can easily be faster than fairly regular use of C++ within R. We can see that if we actually want to be faster than the pure R function, one way of doing that is exploiting C’s bitwise operations (C_bitwise).

# Fast Hamming distance in R using covariance

Over the last years, I’ve written number of posts on efficiently computing the Hamming distance in base R using matrix multiplication. e.g. for large binary matrices, multi-level matrices and feature-weighted matrices, and shown that a multi-level Hamming distance matrix can be computed by adding multiple binary Hamming distance matrices. I feel that the Hamming distance is a great use case for efficient use of R, and many other high-level programming languages as well since the ideas I presented are generally applicable. Hence, I never stop wondering whether it can be computed even faster, using only base R and avoiding extensive use of R-level looping constructs. While the function in my original post was already extremely fast, I have actually found another approach that is even (albeit slightly) faster when the number of features is relatively large. It is based on the realization that the matrix multiplication in my original approach is highly reminiscent of the way a covariance matrix is computed, and therefore I start the derivation with the definition of the covariance between two vectors. Recall that analogous to the definition of the the population covariance between two random variables $X$ and $Y$

the covariance between two binary vectors $\mathbf{x}$ and $\mathbf{y}$ of size $n$ is defined as:

where $\mathbf{\overline{xy}}$, $\mathbf{\bar{x}}$ and $\mathbf{\bar{x}}$ denote the averages of $\mathbf{xy}$, $\mathbf{x}$ and $\mathbf{y}$ respectively.

Now, proceeding from this definition

The above also implies that

Now recall from my previous post that the hamming distance $h(\mathbf{x},\mathbf{y})$ between two binary vectors $\mathbf{x}$ and $\mathbf{y}$ is computed as

Then a bit more algebra and realizing that $q_{\mathbf{x},(1 - \mathbf{y})} = q_{\mathbf{x},\mathbf{y}}$ gives

One way of interpreting this is that, up to a term dependent only on the fractions of 1’s in the individual vectors, the Hamming distance between two vectors $\mathbf{x}$ and $\mathbf{y}$ is proportional to the negative covariance between $\mathbf{x}$ and $\mathbf{y}$. The above formula leads us to another way computing the Hamming distance between all columns of a binary matrix X, again without using any R-level looping:

hamming_binary_varcov <- function(X) {
n <- ncol(X)
Xm <- rowMeans(X)
XM <- Xm * array(1, dim = c(nrow(X), nrow(X)))
tXM <- t(XM)
n * (XM + tXM - 2 * XM * tXM) - 2 * tcrossprod(X - Xm)
}


Now, naturally we want to compare the performance of this new function to the one from my previous post:

hamming_binary_matmul <- function(X) {
D <- (1 - X) %*% t(X)
D + t(D)
}


We can see that indeed, when the number of features is relatively large (left plot), the covariance-based computation of the Hamming distance in this post seems even slightly faster than the one in my previous post, at least on my computer:

# Fast weighted Hamming distance in R

Upon request, I am posting a row-weighted version of the fast Hamming distance function in my previous posts (here and here).

The fast row-weighted binary Hamming distance:


hamming_binary <- function(X, w = rep(1, nrow(X))) {
w <- sqrt(w)
D <- t(w * (1 - X)) %*% (w * X)
D + t(D)
}


The fast row-weighted “multi-level” Hamming distance:


hamming <- function(X, w = rep(1, nrow(X))) {
uniqs <- unique(as.vector(X))
H <- hamming_binary(X == uniqs[1], w)
for ( uniq in uniqs[-1] ) {
H <- H + hamming_binary(X == uniq, w)
}
H / 2
}


Note that the last function is not dependent on the mode and number of unique elements of the matrix. In other words, the matrix can consist of any number of different characters, integers, and any other type of element you can store in an R matrix. Also, because it avoids R-level looping, it is extremely fast.

# Deep learning for clustering of multivariate short time series with potentially many missing values

Time to plug some of my recent work: a method for clustering multivariate short time series with potentially many missing values, a setting commonly encountered in the analysis of longitudinal clinical data, but generally still poorly addressed in the literature. The method is based on a variational autoencoder with a Gaussian mixture prior (with a latent loss as described in Jiang et al., 2017), extended with LSTMs for modeling multivariate time series, as well as implicit imputation and loss re-weighting for directly dealing with (potentially many) missing values.

I will write some more on the method at a later date, but for now an implementation in Tensorflow is already available on github, and you can read the paper here.

# Nested cross-validation

In machine learning and statistics, we can assess how well a model generalizes to a new dataset by splitting our data into training and test data:

1. Split data into training and test data.
2. Fit a model to the training data.
3. Assess the performance of the model on the test data.

The disadvantage of the above approach, is that not all data is used for training. A more robust assessment of generalization can be done using cross-validation:

1. Split data into k parts (folds)
2. For i in 1…k:
1. Fit a model to all data but the k-th fold.
2. Assess the performance of the model on the k-th fold.

This approach uses all data for training, and gives a more robust estimate of generalization performance.

Because cross-validation can be used for assessing the performance of a model, it can also be used for comparing the performance of different models (model selection). In the next two sections, I will give examples of both cross-validation for performance estimation and cross-validation for model selection. In the last two section, I will explain why you need to do something called nested cross-validation if you want to simultaneously do performance estimation and model selection.

#### Cross-validation for estimating generalization performance

For example, suppose we fit a linear-kernel SVM with C = 1 to the breast cancer data from the R-package mlbench. Then we could estimate how well such an SVM generalizes by doing a 5-fold cross-validation as follows:

# Load some packages
library(kernlab)
library(caret)
library(mlbench)
library(PRROC)

# Set the random seed for reproducibility
set.seed(111)

# Load and prepare the breast cancer data
data(BreastCancer)
data <- BreastCancer[!is.na(BreastCancer$Bare.nuclei), -1] # For simplicity, restrict both classes to 200 samples data <- data[ c( sample(which(data$Class == levels(data$Class)[1]), 200), sample(which(data$Class == levels(data$Class)[2]), 200) ), ] y <- data$Class
X <- data
X$Class <- NULL # Test the performance of a linear SVM with C = 1 folds <- createFolds(data$Class, k = 5)
# For each fold ...
auc <- sapply(folds, function(fold) {
# Train an SVM, excluding the fold
fit <- ksvm(
Class ~ .,
data = data[-fold,],
kpar = list(),
C = 1,
prob.model = TRUE,
Class.weights = 1 / table(data$Class[-fold]) ) # Predict the fold yh <- predict(fit, newdata = data[fold,], type = "probabilities") # Compare the predictions to the labels posneg <- split(yh[,1], data$Class[fold])
# Return the AUC under the ROC
roc.curve(posneg[[1]], posneg[[2]])$auc })  The results look as follows (average AUC of about 0.976): #### Cross-validation for model selection An example of using cross-validation for choosing between models, such as when optimizing hyperparameters, looks as follows: # Function for one round of training and validating an SVM train_and_validate <- function( data, fold, C ) { # Train an SVM, excluding the fold fit <- ksvm( Class ~ ., data = data[-fold,], kernel = "vanilladot", kpar = list(), C = C, prob.model = TRUE, Class.weights = 1 / table(data$Class[-fold])
)
# Predict the fold
yh <- predict(fit, newdata = data[fold,], type = "probabilities")
# Compare the predictions to the labels
posneg <- split(yh[,1], data$Class[fold]) # Return the AUC under the ROC roc.curve(posneg[[1]], posneg[[2]])$auc
}

# Function for doing a k-fold cross-validation for each C in CC
cv <- function(
data,
k,
CC,
seed = NULL
) {
# Set the seed, if given
if (!is.null(seed)) {
set.seed(seed)
}
# For each value of the hyperparameter C ...
auc <- lapply(CC, function(C) {
folds <- createFolds(data$Class, k = k) # For each fold ... sapply(folds, function(fold) { # Train an SVM, and validate on the fold train_and_validate( data, fold, C ) }) }) auc } # Do the cross-validation for each C in CC auc <- cv( data = data, k = 5, CC = 2^seq(log2(.01), log2(10), length.out = 21), seed = 111 )  In the figure above, we can see that performance remains relatively constant until C $\approx$ 0.5, and then seems to decline, so we had better take C not too large in order to achieve good generalization. Technical note: In the code above, we specify a different set of folds for each C. We could also have chosen to use the same fold set for all C. However, in that case, results might have been strongly dependent on how the specific fold set was actually chosen. While this can also be the case when choosing different fold sets, at least with different fold sets we get an indication of the variance in AUC caused by choosing different fold sets, just by looking at the plot. Nevertheless, choosing the same fold set for different models can make much sense in some cases, for example when directly comparing two models. In this case the difference between the models could be assessed by pairwise comparing the different folds, for example using a paired Wilcoxon signed-rank test. #### Can we simultaneously do model selection and performance estimation? Looking at the last plot above, you might be wondering the following: Can we conclude that the expected generalization AUC is about 0.991? (the maximum value of the red line) More specifically, is taking the maximum AUC a valid approach for estimating the generalization performance of the best C? That is an interesting question. How could we check that? Well, let’s check if randomized data results in a generalization AUC of about 0.5! More concretely: if we would randomize the class labels in the breast cancer dataset, we would expect a classifier not to do any better than random. Therefore, from the following procedure we would expect a generalization AUC of about 0.5. 1. Randomize the class labels of the breast cancer dataset. 2. Do a cross-validation of all values of C. 3. Choose the best C by choosing the maximum value of the AUC. Let’s repeat the above 25 times, and see what happens: set.seed(111) auc <- replicate(25, { # Randomize the class labels. This should result in # random performance, i.e. AUC ~= 0.5 data$Class <- sample(data$Class) # Cross-validate for each C in CC, and take the # average of the folds as the performance estimate auc <- sapply(cv( data = data, k = 5, CC = 2^seq(log2(.01), log2(10), length.out = 11) ), mean) # Take the max AUC across the different Cs max(auc) })  In the above plot, you can see that with ~0.54 the average observed AUC is quite a bit higher than the ~0.5 you would expect from a random classifier! More specifically, you can see that 24 out of 25 AUCs are higher than 0.5, which can be verified as statistically highly significant, for example using a one-sample Wilcoxon signed rank test, or a binomial test. Coming back to the question we posed at the beginning of this section: No, taking the maximum AUC is not a valid approach for estimating the generalization performance of the best C. Why is this? Well, in the randomized example, essentially what we’re doing (25 times) is randomly sampling 11 AUCs (probably with some covariance structure, but still random). We would expect these 11 AUCs to average out at around 0.5. Among these 11, we then take the maximum, which we would expect to be higher than 0.5. Therefore, after repeating the above 25 times, we would definitely expect the average of the 25 maximum AUCs to also be higher than 0.5! Summarizing, simultaneously doing model selection and performance estimation leads to a positively biased performance estimate. #### Nested cross-validation So how should we go about doing model selection and performance estimation? In principle, what we could do, is set aside some test data to test the performance once we’ve optimized C: 1. Set aside some test data, not to be used in the cross-validation. 2. Optimize C using cross-validation on all but the test data. 3. Select the best C (the one with the highest AUC). 4. Test the performance of the highest C on the test data. This is a perfectly valid approach, except that it has the disadvantage of not having used all data for training, at some point. This argument should sound familiar: While it is perfectly valid to split our data into training and test data, we will get a more robust estimate of our performance by doing a cross-validation, as such iteratively using all data for training. Analogously, it it perfectly valid to split our data into (1) training data for training the models, (2) validation data for selecting the best model and (3) test data for assessing the performance of the best model. However, we will get a more robust performance estimate by iteratively using all data for training. This can be done by something called nested cross-validation, and can be done as follows: • Split data into k1 folds. • For i in 1…k1: 1. For all Cs: 1. Do a k2-fold cross-validation on all data but the k1-th fold. 2. Return the performance averaged across the k2 folds. 2. Select the C with the best average performance. 3. Assess the performance of this model on the k1-th fold. We can see why the above is called nested cross-validation: an inner cross-validation loop for model selection is nested within an outer cross-validation loop for performance estimation. As such, nested cross-validation tries to estimate the expected performance of a model where the hyperparameter C is optimized using cross-validation. Here’s a function for doing nested cross-validation: ncv <- function( data, k, CC, seed = NULL ) { if (!is.null(seed)) { set.seed(seed) } folds <- createFolds(data$Class, k = k)
# For each fold ...
auc <- sapply(folds, function(fold) {
# Do a cross-validation for each C
auc <- cv(
data[-fold,],
k,
CC,
seed = seed
)
# Select the C with the highest AUC
C <- CC[which.max(sapply(auc, mean))]
C  1, sample(C, 1), C)
# Test this C on the test data
train_and_validate(
data,
fold = fold,
C = C
)
})
auc
}


Let's first check whether using this function we indeed get an AUC of ~0.5 on the randomized data:

set.seed(111)
auc <- replicate(25, {
cat(".")
# Randomize the class labels. This should result in
# random performance, i.e. AUC ~= 0.5
data$Class <- sample(data$Class)

# This returns k scores
auc <- ncv(
data = data,
k = 5,
CC = 2^seq(log2(.01), log2(10), length.out = 11)
)
# Take the average as the performance estimate
mean(auc)
})


Indeed, we get random performance! Now let’s try it out on the breast cancer data:

# test on BreastCancer data
auc <- ncv(
data = data,
k = 5,
CC = 2^seq(log2(.01), log2(10), length.out = 21),
seed = 111
)


We can see that nested cross-validation gives an expected performance ~0.987, which is indeed lower than the positively biased estimate of ~0.991 we found initially. This may seem like a small difference but compared to 0.991, 0.987 is ~40% further away from the perfect classifier with AUC = 1, making it quite a substantial difference.

# Effect size and statistical significance

Are statistically significant results always relevant? Let’s have a look at a simple hypothetical example. Suppose we have two groups of 2500 men. All men in group 1 have a beard, and all men in group 2 do not have a beard. Moreover, we know the height of all men, and it turns out that the bearded men are statistically significantly taller than the beardless men (t-test, p < 0.05). For example:

As mentioned above, in our example, the difference in height is significant. However, arguably the more interesting question is: What can we do with this result? Is it practically relevant? For example, we could ask ourselves: Given the height of a man we have not seen yet, can we predict whether he has a beard?

Well, we could, but we would do very poorly: only slightly better than random. For example, an optimal classifier would statistically be expected to put the decision boundary at the average of 1.84 and 1.835, thus showing a misclassification rate of about 0.49. For example, among 100 men, we would be expected to correctly predict only one more than by mere random guessing. Why is this? Because the effect size is so small: The difference in average height between the two groups is just so small that it can hardly be used for prediction. While it cannot be denied that there is a difference between the two groups, it is of little practical relevance, and we had better look for something that better predicts beardedness.

So, in reporting results, we should not only look at statistical significance, but also at effect size. Nonetheless, in practice, cases where effect size is under-reported are no exception. An interesting example is this article, on “how intelligence, population density, and friendship affect modern happiness”. It received quite some attention in the media. One of the main results in the paper was that there is an “interaction effect between frequency of socialization with friends and intelligence on life satisfaction”, such that “more intelligent individuals were actually less satisfied with life if they socialized with their friends more frequently”. This was summarized in the following graph:

Indeed, people with higher IQs seem unhappier when they have more social interactions, and Li and Kanazawa showed that these results were significant (p = 0.016). So far so good. However, look at the y-axis. The article states that life satisfaction was reported on a scale from 1 to 5, but the figure only spans a tiny fraction of the entire range, from 4.10 to 4.16. Moreover, only mean life satisfaction is reported, and no indication whatsoever is given of the spread in life satisfaction scores: Most likely, the large majority of the individual scores are either larger than 4.16 or smaller than 4.10, and therefore lie outside the range of the y-axis. To get a proper idea of how small the differences actually are, look at the same data, mean life satisfaction, but now with a y-axis ranging from 1 to 5:

To get a feeling for the effect size of this difference, we might ask a question similar to the one in the toy example we started with: Would you be able to predict whether someone has a high IQ just by knowing whether he/she socializes frequently and how happy he/she is with his/her life? Most likely you would do very poorly, close to random in fact, as the Cohen’s d statistics of 0.05 and -0.03 reported in the article suggest. With a large sample size of 15197, as reported in the article, even very small effects can be identified as statistically significant.

Concluding: Is there an effect? Yes, there is. Is it relevant? Very questionable, considering the small effect size.

# Transfer learning: domain adaptation by instance-reweighting

In supervised learning, we typically train a model on labeled data (we know “the truth”) and eventually apply the model to unlabeled data (we do not know “the truth”). For example, a spam filtering model might be trained on a public email database, with emails clearly labeled as “spam” or “non-spam”. However, the model will eventually be applied to a personal inbox, where emails are not labeled. An interesting example from the life sciences is training a classifier for predicting protein interactions in some species for which biologically validated interactions are known, and applying this classifier to other species, for which no such data validated interactions exist.

But what if, in addition to missing labels, the data we apply our model to (the “target data”) is just very different from our training data (the “source data”)? For example, in a personal inbox both spam and non-spam emails may have very different characteristics compared to the emails in the public email database. Also, in protein interaction prediction, it could important to consider that species can have very different proteomes, and therefore also different protein interactions.

In cases such as the two outlined above, what we would like to do, is make sure that our model performs well on the target data, while still training on the source data. How can we do that?

#### The covariate shift assumption

In cases where the target data is very different from the source data, we need to think about domain adaptation. Domain adaptation can be seen as a specific case of transfer learning, and can be applied in situations where there is only a covariate shift between source and target data:

$P_S(X) \neq P_T(X) \textrm{, but } P_S(Y|X=x) = P_T(Y|X=x)$

Here,

• $P_S(X)$ represents the marginal covariate distribution of source instances.
• $P_T(X)$ represents the marginal covariate distribution of target instances.
• $P_S(Y|X=x)$ represents the conditional class distribution given source instance $x$.
• $P_S(Y|X=x)$ represents the conditional class distribution given target instance $x$.

In words, the first part ($P_S(X) \neq P_T(X)$) simply means that in general you find different emails in the public email database than in your own inbox: the target data is distributionally different from the source data. The second part ($P_S(Y|X=x) = P_T(Y|X=x)$) means that the class probability of an instance $x$ is independent of whether $x$ originated from the source or target distribution. In the example of spam filtering: if you have a specific email, then the probability of this email being spam stays the same, regardless of whether it originated from the public email database or from your personal inbox.

#### Model misspecification

Now you might think: If we train a classifier, we do not care about whether the source and target covariate distributions are different? We are only interested in the conditional class distribution $P_T(Y|X=x)$, and because the assumption is that $P_S(Y|X=x) = P_T(Y|X=x)$, we can simply train a classifier on the source data to obtain optimal performance on the target data? Well, ideally, yes.  However, it was shown that under model misspecification, covariate shift can in fact be a problem, and the thing is that typically, models are indeed misspecified: We do not know what function generated our data, but most likely it was not precisely of the form that we used for fitting the data. For example, fitting a line (e.g. using logistic regression) to separate the classes in the following case would be a clear case of model misspecification:

#### Model misspecification in a transfer learning setting

Back to transfer learning. Remember that in our transfer learning setting, we are training on labeled source data, and will apply the resulting classifier to unlabeled target data. Moreover, the unlabeled target data is distributionally different from the source data. Let’s extend the above example, and separate the data into source and target data:

You can see that the target data is differently distributed compared to the source data: it tends to have higher x1 values, implying that $P_S(X) \neq P_T(X)$. Furthermore, target class labels are unknown. Therefore, in training a classifier separating class1 from class2, the only thing that we can do is train on the labeled source data. Training a logistic regression classifier on the source data gives the decision boundary in the left two plots:

The decision boundary indeed seems optimal for the source data (left plot). However, it is far from optimal for the target data (middle plot). In fact, the optimal decision boundary of the target data is much steeper (right plot). In this transfer learning setting, the model misspecification implies that it is not possible to find a logistic regression parameterization $\theta$, such that $P_S(Y|X=x, \theta) = P_T(Y|X=x, \theta)$ for all $x$. In other words, the optimal model for the source data is different from the optimal model for the target data. This brings us to the following question: Is there a way to train a classifier on the source data, while trying to optimize for performance on the target data?

#### Re-weighted empirical risk minimization

It turns out, yes: We can train on the source data while optimizing for performance on the target data. Let’s first go through some math to show how. (or skip to an implementation using R if you are not interested) Recall that true risk minimization finds a parameterization $\theta = \theta^\ast$, such that the expected value of the loss function $l(x,y,\theta)$ under the true joint distribution $P(x,y)$ over $X$ and $Y$ is minimized:

Empirical risk minimization approximates true risk minimization by using the empirical joint distribution over $X$ and $Y$, because the true joint distribution $P(x,y)$ is unknown:

Note that in the above, $(x_i, y_i) \sim P(x,y)$.

In our domain adaptation problem, we have two joint distributions, namely the source distribution $P_S(x,y)$ and the target distribution $P_T(x,y)$. In training on the empirical source distribution, we want optimize for performance on the target distribution. To do this, we use our previous assumption ($P_S(Y|X=x)=P_T(Y|X=x)$ for all $x$), and apply the following trick for transferring knowledge from our source domain to our target domain:

Note that in the above, $(x_i, y_i) \sim P_S(x,y)$. So we started with the normal formulation of true risk minimization under the target distribution, and showed that we can approximate this by re-weighting each source instance $(x_i, y_i)$ in an empirical risk minimization under the source distribution! More specifically, each instance $(x_i, y_i)$ needs to be re-weighted by the ratio of the marginal covariate probabilities $\frac{P_T(x_i)}{P_S(x_i)}$. Interestingly, the above suggests that doing re-weighted empirical risk minimization is essentially the same as performing importance sampling for computing the expected value of the loss function under the target joint distribution, with the additional assumption that the conditional class probabilities between the source and target data are the same.

#### How to estimate the marginal probability ratio?

The problem with the above result, is that $P_T(x)$ and $P_S(x)$ are difficult to determine. However, we can avoid computing these probabilities directly, by interpreting this marginal probability ratio as another probability ratio: the ratio of probabilities that $x_i$ comes from the target data and from the source data, weighted by the ratio of the source data size $N_S$ and target data size $N_T$:

$\frac{P_T(x_i)}{P_S(x_i)} \approx \frac{N_S}{N_T} \frac{P(x_i \textrm{comes from the target data})}{P(x_i \textrm{comes from the source data})}$

Why is this? Well, here’s an argument for the discrete case. Suppose we independently draw two random samples, one of size $N_S$ from the source distribution, and one of size $N_T$ from the target distribution. We merge these two random samples, and from this sample of size $N_S + N_T$ we draw a single instance $x_i$. What is the probability that $x_i$ originated from the target distribution? If $n^S_i$ is the number of occurrences of $x_i$ in the random source sample of size $N_S$, and $n^T_i$ is the number of occurrences of $x_i$ in the random target sample of size $N_T$, then the following represents the probability that $x_i$ originated from the target distribution:

$P(x_i \textrm{comes from the target data}) = \frac{n^T_i}{n^T_i+n^S_i}$

Similar for the source data:

$P(x_j \textrm{comes from the source data}) = \frac{n^S_i}{n^S_i+n^T_i}$

Now what’s the expected value for their ratio?

So all we need to do, is estimate for each source instance the probability that it originated from the target class. How do we do that? One straightforward way of estimating these probabilities is to train a naturally probabilistic classifier, such as a logistic regression classifier.

#### A simple way of implementing re-weighting

We now have everything in place to train a classifier on the source data, while optimizing for performance on the target data:

1. Compute the source instance weights:
1. Train a logistic regression classifier separating source data from target data.
2. Apply the classifier to each source instance $x^S_i$, thus computing $p_i = P(x^S_i\textrm{ comes from the target data}).$
3. For each source instance $x^S_i$ compute the instance weight $w_i$ as $w_i = \frac{p_i}{1-p_i}$.
2. Train a logistic regression classifier on the source data, separating class1 from class2, while re-weighting each source instance $x^S_i$ by $w_i$.

In R, this could look as follows. First define some functions:


# Function to generate random data.
generate_data <- function(n) {

range_x1 <- 1
range_x2 <- 1

# The features.
x1 <- runif(n, -range_x1, range_x1)
x2 <- runif(n, -range_x2, range_x2)

# Generate class labels.
y <- (x1 < 0 & x2 > 0) + 1

# Generate source and target labels.
prob <- (x1 + range_x1) / range_x1 / 2
s <- 1:n %in% sample(n, n/2, prob = prob^5) + 1

data.frame(
x1 = x1,
x2 = x2,
y = factor(c("class1", "class2")[y]),
s = factor(c("source", "target")[s])
)
}

# Function to fit a logistic regression classifier,
# possibly weighted.
fitLRG <- function(df, weights = rep(1, nrow(df))) {
# Compute the class weights.
tab <- 1 / table(df$y) # Multiply by the instance weights weights <- as.numeric(weights * tab[match(df$y, names(tab))])
# Fit a logistic regression model on the
# source class label.
fit <- coef(glmnet(
x = as.matrix(df[, c("x1", "x2")]),
y = df$y, lambda = seq(1, 0, -0.01), weights = weights, family = "binomial" )) fit[, ncol(fit)] } # Function to compute instance weights compute_instance_weights <- function(df) { # Fit a logistic regression model on the # source/target indicator. fit <- glmnet( x = as.matrix(df[, c("x1", "x2")]), y = df$s,
lambda = seq(1, 0, -0.01),
family = "binomial"
)
# For each instance, compute the probability
# that it came from the target data
p <- predict(
fit,
newx = as.matrix(df[,c("x1", "x2")]),
type = "response"
)
p <- p[, ncol(p)]
p / (1 - p)
}



Now let’s do some transfer learning:


# Load a package for fitting logistic regression models.
library(glmnet)

# Set the seed for reproducibility.
set.seed(1)

# Generate some random data.
df <- generate_data(1e3)

# Train an unweighted classifier.
fit_unweighted <- fitLRG(df[df$s == "source",]) # Train a re-weighted classifier: # 1. Compute the instance weights weights <- compute_instance_weights(df) # 2. Train a weighted classifier fit_reweighted <- fitLRG( df[df$s == "source",],
weights = weights[df$s == "source"] )  The results confirm that instance re-weighting indeed leads to a decision boundary that is much closer to the optimal decision boundary for the target data: # The pseudocount in the analysis of genome-wide sequencing profiles. A commonly used approach for normalizing a binned genome-wide sequencing profile with a control, is the following: $x_i^{norm} = log_2\frac{x_i + 1}{cy_i + 1}$ Here, $x_i^{norm}$ is the normalized signal in genomic bin $i$ , $x_i$ represents the number of signal reads in bin $i$ , $y_i$ represents the number of control reads in bin $i$. $c$ is the normalization constant used to make signal and control quantitatively comparable. Often, the ratio of sequencing depths is used. However, alternative and more sophisticated methods have been developed, for example the autocorrelation-based method presented in Chapter 5 of my PhD thesis. To account for bins with no reads, a pseudocount of 1 is often added, as such avoiding division by zero. However, adding a pseudocount can give surprising, and questionable, results. As an example, consider the following R-code for randomly generating an artificial ChIP-seq signal and control: set.seed(1) signal <- rpois(1e2, lambda = 1) control <- signal * 2 normalized <- log2((signal + 1) / (control + 1))  In the resulting data, on each position, the control has exactly twice as many reads as the signal: Because of the factor 2 difference, adding a pseudocount of 1 in the normalization (taking $c = 1$ for now) will result in a profile that is negatively correlated with both signal and control! The reason for the negative correlation is that the relative contribution of the constant 1 to the signal is much higher than to the control, because the “sequencing depth” of the control is twice as high as that of the signal. In the artificial example given above, the problem could have easily been solved by setting the constant $c$ to the ratio of sequencing depths between signal and control. This would have resulted in a normalized profile that was completely flat. However, in practice this will most often not solve the problem, one important reason being that genome-wide read count distributions typically differ between signal and control. While the above may seem like a contrived example, these effects can be observed when analyzing actual ChIP-seq data, as the example of normalizing a ChIP-seq profile of the transcription factor Nanog given in Chapter 6 of my PhD thesis demonstrates. To alleviate these problems, more sophisticated normalization methods are needed, such as the autocorrelation-based method presented in Chapter 5 of my PhD thesis. # Feature selection, cross-validation and data leakage In machine learning and statistics, data leakage is the problem of using information in your test samples for training your model. The problem with data leakage is that it inflates performance estimates. The simplest case, directly using test samples for training, is easily avoided. However, it does not take much for data leakage to become quite a bit more difficult to detect, as is illustrated by the following example combining feature selection with cross-validation. A common problem in the analysis of high-throughput biological data is that, very often, the number of features is much larger than the number of samples. For example, a typical RNA-seq experiment will result in tens of thousands of features, whereas the number of samples will be orders of magnitude smaller. Building statistical or machine learning models based on such data poses a number of problems. For example, the large number of features typically make a resulting model more difficult to interpret. Also, it may take longer to train the model. However, the most important problem may be that training models on many features is more prone to overfitting, and as such increases the generalization error. For this reason, a first step that is often taken in training models based on data with many more features than samples, is selecting only a limited number of relevant features to be used for training the model. One of the simplest ways of selecting relevant features is applying a filter. A filter selects features on an individual basis, without considering the learning algorithm that will eventually be applied to the set of selected features. An example would be checking the correlation of each feature with the response variable, and selecting only those for which the correlation exceeds a certain threshold. For our example combining feature selection with cross-validation, we will use a simple t-test filter. Here’s an R function that performs simultaneous t-tests on the rows (features) of a feature matrix X given a class variable y, and for each feature returns a p-value:  # fast row t-test for equal sample sizes and equal variance # the hypothesis test is two-sided row_ttest <- function(X, y) { library(matrixStats) X1 <- X[, y == levels(y)[1]] X2 <- X[, y == levels(y)[2]] spool <- sqrt( (rowVars(X1) + rowVars(X2)) / 2 ) tstat <- (rowMeans(X1) - rowMeans(X2)) / spool / sqrt(4 / ncol(X)) df <- ncol(X) - 2 p <- pt(tstat, df) pmin(p, 1 - p) * 2 }  The data that we will use for the example is random data with 100 samples times 100000 features. The samples are randomly divided in two classes, 50 samples for each class. Here’s a function that generates the random data:  generate_random_data <- function( # number of samples in the random data nsamples = 1e2, # number of features in the random data nfeatures = 1e5 ) { # the features X <- matrix( rnorm(nsamples * nfeatures), nrow = nfeatures, ncol = nsamples ) # the class labels y <- gl(2, nsamples / 2) list( X = X, y = y ) }  Now, the problem at hand is to train a classifier that will predict the class label of the above data. Because of the large number of features, we want to do some initial feature selection, and might consider the following approach: 1. Reduce the number of features by applying a t-test filter to the individual features. 2. Train a support vector machine (SVM) and estimate the misclassification rate by cross-validation. Here’s the code for that approach, which includes calls to the previously defined functions:  # load some libraries library(caret) library(e1071) # set the seed set.seed(123) # generate some data Xy <- generate_random_data() X <- Xy$X
y <- Xy$y # apply the t-test filter selected <- row_ttest(X, y) < 0.1 # Train an SVM to predict the class label, # and estimate the misclassification rate by # cross-validation. folds <- createFolds(y, k = 5) cv <- unlist(lapply(folds, function(fold) { # train and test a model fit <- svm(x = t(X[selected, -fold]), y = y[-fold]) pred <- predict(fit, newdata = t(X[selected, fold])) ct <- table(pred, y[fold]) # get the misclassification rate 1 - sum(diag(ct)) / sum(ct) })) barplot( cv, ylim = c(0, 1), las = 2, ylab = "misclassification rate" )  The result of running the code above looks as follows: We get perfect performance! But why? The data was random, so we should not be able to predict the class label? Let’s have a closer look at the approach we took: 1. Generate random data: 100 samples times 100000 features. 2. Reduce the number of features by applying a t-test filter to the individual features. 3. Train a support vector machine (SVM) by cross-validation, i.e. for each fold: 1. Train an SVM on the selected features on the training data (all data but the fold). 2. Calculate the misclassification rate on the test data (the fold). If you look carefully, you will realize that in step 2. we selected features based on all samples, so including the samples we would use for testing in step 3. However, the initial feature selection partly determines how the models in the cross-validation loop are trained. Hence, our test samples partly determined the training of our models! This is an example of data leakage or data snooping: using information from your test samples to train your model. As we have seen, the problem with data leakage is that it leads to inflated performance estimates. So how should you do it without data leakage? Well, move the feature selection to within the cross-validation loop to only apply it on the training data: 1. Generate random data: 100 samples times 100000 features. 2. Reduce the number of features by applying a t-test filter to the individual features. 3. Train a support vector machine (SVM) by cross-validation, i.e. for each fold: 1. Reduce the number of features by applying a t-test filter to the individual features, using only the training data (all data but the fold). 2. Train an SVM on the selected features on the training data (all data but the fold). 3. Calculate the misclassification rate on the test data (the fold). Here’s the code:  # load some libraries library(caret) library(e1071) # set the seed set.seed(123) # generate some data Xy <- generate_random_data() X <- Xy$X
y <- Xy\$y

# Train an SVM to predict the class label,
# and estimate the misclassification rate by
# cross-validation.
folds <- createFolds(y, k = 5)
cv <- unlist(lapply(folds, function(fold) {
# apply the t-test filter within the cross-validation loop!
selected <- row_ttest(X[,-fold], y[-fold]) < 0.1
# train and test a model
fit <- svm(x = t(X[selected, -fold]), y = y[-fold])
pred <- predict(fit, newdata = t(X[selected, fold]))
ct <- table(pred, y[fold])
# get the misclassification rate
1 - sum(diag(ct)) / sum(ct)
}))
barplot(
cv, ylim = c(0, 1), las = 2,
ylab = "misclassification rate"
)



Running the code above demonstrates that performance is indeed random:

The example outlined in this post illustrates that data leakage encompasses much more than simply using your test samples for training your model. While directly using test samples for training is trivially avoided, even only slightly more complex data analysis strategies can easily lead to data leakage that can be much more difficult to detect.