The pseudocount in the analysis of genome-wide sequencing profiles

A commonly used approach for normalizing a binned genome-wide sequencing profile with a control is the following:

x_i^{\mathrm{norm}} = \log_2 \frac{x_i + 1}{c \, y_i + 1}

Here, x_i^{norm} is the normalized signal in genomic bin i, x_i is the number of signal reads in bin i, y_i is the number of control reads in bin i, and c is a normalization constant that makes signal and control quantitatively comparable.
Often, the ratio of sequencing depths is used for c. However, alternative and more sophisticated methods have been developed, for example the autocorrelation-based method presented in Chapter 5 of my PhD thesis. To account for bins with no reads, a pseudocount of 1 is often added, thereby avoiding division by zero. However, adding a pseudocount can give surprising, and questionable, results. As an example, consider the following R code for randomly generating an artificial ChIP-seq signal and control:

set.seed(1)
# artificial ChIP-seq signal: Poisson-distributed read counts in 100 bins
signal <- rpois(1e2, lambda = 1)
# control with exactly twice as many reads as the signal in every bin
control <- signal * 2
# normalization with a pseudocount of 1, taking c = 1
normalized <- log2((signal + 1) / (control + 1))

In the resulting data, at each position the control has exactly twice as many reads as the signal:

[Figure: artificial ChIP-seq signal and control, where control = 2 * signal.]

Because of this factor-2 difference, adding a pseudocount of 1 in the normalization (taking c = 1 for now) results in a profile that is negatively correlated with both signal and control!

[Figure: signal, control, and normalized signal (taking c = 1 for now). The normalized signal is negatively correlated with both signal and control!]
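We can confirm the negative correlation numerically by continuing the snippet above (the exact values depend on the random seed, but both should be clearly negative):

cor(normalized, signal)   # clearly negative
cor(normalized, control)  # clearly negative, since control is just 2 * signal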

The reason for the negative correlation is that the relative contribution of the pseudocount of 1 to the signal is much higher than its contribution to the control, because the “sequencing depth” of the control is twice that of the signal. Concretely, a bin with zero reads gets a normalized value of log2(1/1) = 0, whereas a bin with many reads approaches log2(1/2) = -1, so the normalized value decreases as the read count increases. In the artificial example given above, the problem could easily have been solved by setting the constant c to the ratio of sequencing depths between signal and control, which would have resulted in a completely flat normalized profile (see the quick check below). However, in practice this will most often not solve the problem, one important reason being that genome-wide read count distributions typically differ between signal and control. While the above may seem like a contrived example, the same effects can be observed when analyzing actual ChIP-seq data, as the example of normalizing a ChIP-seq profile of the transcription factor Nanog in Chapter 6 of my PhD thesis demonstrates. To alleviate these problems, more sophisticated normalization methods are needed, such as the autocorrelation-based method presented in Chapter 5 of my PhD thesis.
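Here is the quick check, continuing the artificial example (the name c_depth is mine, chosen to avoid masking R's built-in c). Since control = 2 * signal, the ratio of sequencing depths is exactly 0.5, and using it as the normalization constant gives a profile that is identically zero:

# normalization constant: ratio of sequencing depths between signal and control
c_depth <- sum(signal) / sum(control)  # = 0.5 in this example
normalized_flat <- log2((signal + 1) / (c_depth * control + 1))
all(normalized_flat == 0)  # TRUE: a completely flat profile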

Feature selection, cross-validation and data leakage

In machine learning and statistics, data leakage is the problem of using information from your test samples to train your model. Data leakage is a problem because it inflates performance estimates. The simplest case, directly using test samples for training, is easily avoided. However, it does not take much for data leakage to become quite a bit more difficult to detect, as is illustrated by the following example combining feature selection with cross-validation.

A common problem in the analysis of high-throughput biological data is that, very often, the number of features is much larger than the number of samples. For example, a typical RNA-seq experiment will result in tens of thousands of features, whereas the number of samples will be orders of magnitude smaller. Building statistical or machine learning models on such data poses a number of problems. For example, the large number of features typically makes the resulting model more difficult to interpret. Also, training the model may take longer. However, the most important problem may be that training models on many features is more prone to overfitting, thereby increasing the generalization error.

For this reason, a first step that is often taken when training models on data with many more features than samples is to select only a limited number of relevant features for training the model. One of the simplest ways of selecting relevant features is applying a filter. A filter selects features on an individual basis, without considering the learning algorithm that will eventually be applied to the set of selected features. An example would be computing the correlation of each feature with the response variable, and selecting only those features for which the correlation exceeds a certain threshold.
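As a minimal sketch of such a correlation filter (the toy data, variable names, and the threshold of 0.3 below are arbitrary choices of mine, not part of the example that follows):

# toy data: 200 features (rows) x 20 samples (columns), numeric response
set.seed(1)
X_demo <- matrix(rnorm(200 * 20), nrow = 200)
y_demo <- rnorm(20)
# keep only features whose absolute correlation with the response exceeds 0.3
keep <- abs(cor(t(X_demo), y_demo))[, 1] > 0.3
X_filtered <- X_demo[keep, , drop = FALSE]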

For our example combining feature selection with cross-validation, we will use a simple t-test filter. Here’s an R function that performs simultaneous t-tests on the rows (features) of a feature matrix X given a class variable y, and for each feature returns a p-value:


# fast row t-test for equal sample sizes and equal variance
# the hypothesis test is two-sided
row_ttest <- function(X, y) {
  library(matrixStats)
  # split the columns (samples) by class
  X1 <- X[, y == levels(y)[1]]
  X2 <- X[, y == levels(y)[2]]
  # pooled standard deviation per feature (equal group sizes)
  spool <- sqrt((rowVars(X1) + rowVars(X2)) / 2)
  # t-statistic: difference in means divided by its standard error
  tstat <- (rowMeans(X1) - rowMeans(X2)) / spool / sqrt(4 / ncol(X))
  df <- ncol(X) - 2
  # two-sided p-value
  p <- pt(tstat, df)
  pmin(p, 1 - p) * 2
}
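As a quick sanity check on a small toy matrix (X_toy and y_toy below are made up for this purpose), the fast implementation should agree with R's built-in t.test with var.equal = TRUE up to numerical precision:

# compare row_ttest with t.test on 20 features and two classes of 5 samples
set.seed(42)
X_toy <- matrix(rnorm(20 * 10), nrow = 20, ncol = 10)
y_toy <- gl(2, 5)
p_fast <- row_ttest(X_toy, y_toy)
p_ref <- apply(X_toy, 1, function(row) t.test(row ~ y_toy, var.equal = TRUE)$p.value)
all.equal(p_fast, p_ref)  # should be TRUE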

The data that we will use for the example is random data with 100 samples times 100000 features. The samples are divided into two classes of 50 samples each. Here's a function that generates the random data:


generate_random_data <- function(
  # number of samples in the random data
  nsamples = 1e2,
  # number of features in the random data
  nfeatures = 1e5
) {
  # the features
  X <- matrix(
    rnorm(nsamples * nfeatures),
    nrow = nfeatures,
    ncol = nsamples
  )
  # the class labels
  y <- gl(2, nsamples / 2)
  list(
    X = X,
    y = y
  )
}
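A quick look at what the generator returns, using small dimensions purely for illustration:

dat <- generate_random_data(nsamples = 6, nfeatures = 10)
dim(dat$X)    # 10 features (rows) by 6 samples (columns)
table(dat$y)  # 3 samples in each of the two classes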

Now, the problem at hand is to train a classifier that will predict the class label of the above data. Because of the large number of features, we want to do some initial feature selection, and might consider the following approach:

  1. Reduce the number of features by applying a t-test filter to the individual features.
  2. Train a support vector machine (SVM) and estimate the misclassification rate by cross-validation.

Here’s the code for that approach, which includes calls to the previously defined functions:


# load some libraries
library(caret)
library(e1071)

# set the seed
set.seed(123)

# generate some data
Xy <- generate_random_data()
X <- Xy$X
y <- Xy$y

# apply the t-test filter
selected <- row_ttest(X, y) < 0.1

# Train an SVM to predict the class label,
# and estimate the misclassification rate by
# cross-validation.
folds <- createFolds(y, k = 5)
cv <- unlist(lapply(folds, function(fold) {
  # train and test a model
  fit <- svm(x = t(X[selected, -fold]), y = y[-fold])
  pred <- predict(fit, newdata = t(X[selected, fold]))
  ct <- table(pred, y[fold])
  # get the misclassification rate
  1 - sum(diag(ct)) / sum(ct)
}))
barplot(
  cv, ylim = c(0, 1), las = 2,
  ylab = "misclassification rate"
)

The result of running the code above looks as follows:

[Figure: misclassification rate of 0 across the five folds: we can perfectly predict random data! Can that be correct?]

We get perfect performance! But why? The data was random, so we should not be able to predict the class label. Let's have a closer look at the approach we took:

  1. Generate random data: 100 samples times 100000 features.
  2. Reduce the number of features by applying a t-test filter to the individual features.
  3. Train a support vector machine (SVM) by cross-validation, i.e. for each fold:
    1. Train an SVM on the selected features on the training data (all data but the fold).
    2. Calculate the misclassification rate on the test data (the fold).
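Before pinpointing the problem, it is worth checking how many features survive the filter in the approach above. The data is pure noise, so the p-values are roughly uniformly distributed, and the cutoff of 0.1 keeps about 10% of the 100000 features purely by chance (the exact count depends on the seed):

# continuing from the code above: how many features passed the filter?
sum(selected)   # roughly 10000
mean(selected)  # roughly 0.1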

If you look carefully, you will realize that in step 2 we selected features based on all samples, including the samples we would later use for testing in step 3. The initial feature selection partly determines how the models in the cross-validation loop are trained; hence, our test samples partly determined the training of our models! The roughly 10000 features that pass the filter are precisely those that, by chance, show a difference between the two classes across all samples, including the test folds, and the SVM can then exploit these spurious differences on the test folds as well. This is an example of data leakage or data snooping: using information from your test samples to train your model. As we have seen, the problem with data leakage is that it leads to inflated performance estimates. So how should you do it without data leakage? Move the feature selection inside the cross-validation loop, so that it is applied to the training data only:

  1. Generate random data: 100 samples times 100000 features.
  2. Train a support vector machine (SVM) by cross-validation, i.e. for each fold:
    1. Reduce the number of features by applying a t-test filter to the individual features, using only the training data (all data but the fold).
    2. Train an SVM on the selected features on the training data (all data but the fold).
    3. Calculate the misclassification rate on the test data (the fold).

Here’s the code:


# load some libraries
library(caret)
library(e1071)

# set the seed
set.seed(123)

# generate some data
Xy <- generate_random_data()
X <- Xy$X
y <- Xy$y

# Train an SVM to predict the class label,
# and estimate the misclassification rate by
# cross-validation.
folds <- createFolds(y, k = 5)
cv <- unlist(lapply(folds, function(fold) {
  # apply the t-test filter within the cross-validation loop!
  selected <- row_ttest(X[,-fold], y[-fold]) < 0.1
  # train and test a model
  fit <- svm(x = t(X[selected, -fold]), y = y[-fold])
  pred <- predict(fit, newdata = t(X[selected, fold]))
  ct <- table(pred, y[fold])
  # get the misclassification rate
  1 - sum(diag(ct)) / sum(ct)
}))
barplot(
  cv, ylim = c(0, 1), las = 2,
  ylab = "misclassification rate"
)

Running the code above demonstrates that performance is indeed random:

[Figure: misclassification rate around 0.5 across the five folds: we get random performance, as we should!]

The example outlined in this post illustrates that data leakage encompasses much more than simply using your test samples for training your model. While directly using test samples for training is trivially avoided, even slightly more complex data analysis strategies can easily lead to data leakage that is much more difficult to detect.