# The pseudocount in the analysis of genome-wide sequencing profiles

A commonly used approach for normalizing a binned genome-wide sequencing profile with a control is the following:

$x_i^{\mathrm{norm}} = \log_2\frac{x_i + 1}{c\,y_i + 1}$

Here, $x_i^{\mathrm{norm}}$ is the normalized signal in genomic bin $i$, $x_i$ is the number of signal reads in bin $i$, and $y_i$ is the number of control reads in bin $i$. The constant $c$ is used to make signal and control quantitatively comparable.
Often, the ratio of sequencing depths is used for $c$. However, alternative and more sophisticated methods have been developed, for example the autocorrelation-based method presented in Chapter 5 of my PhD thesis. To account for bins with no reads, a pseudocount of 1 is often added, thereby avoiding division by zero. However, adding a pseudocount can give surprising, and questionable, results. As an example, consider the following R code for randomly generating an artificial ChIP-seq signal and control:

```r
set.seed(1)
signal <- rpois(1e2, lambda = 1)
control <- signal * 2
normalized <- log2((signal + 1) / (control + 1))
```


In the resulting data, at each position the control has exactly twice as many reads as the signal.

Because of the factor 2 difference, adding a pseudocount of 1 in the normalization (taking $c = 1$ for now) will result in a profile that is negatively correlated with both signal and control!

The reason for the negative correlation is that the relative contribution of the constant 1 to the signal is much higher than to the control, because the “sequencing depth” of the control is twice as high as that of the signal. In the artificial example given above, the problem could have easily been solved by setting the constant $c$ to the ratio of sequencing depths between signal and control. This would have resulted in a normalized profile that was completely flat. However, in practice this will most often not solve the problem, one important reason being that genome-wide read count distributions typically differ between signal and control. While the above may seem like a contrived example, these effects can be observed when analyzing actual ChIP-seq data, as the example of normalizing a ChIP-seq profile of the transcription factor Nanog given in Chapter 6 of my PhD thesis demonstrates. To alleviate these problems, more sophisticated normalization methods are needed, such as the autocorrelation-based method presented in Chapter 5 of my PhD thesis.
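Continuing the simulated example above, both claims can be checked directly: the pseudocount with $c = 1$ yields a profile that is anti-correlated with the signal, while setting $c$ to the ratio of sequencing depths (here 0.5) yields an exactly flat profile:

```r
set.seed(1)
signal <- rpois(1e2, lambda = 1)
control <- signal * 2

# pseudocount with c = 1: negatively correlated with the signal
normalized <- log2((signal + 1) / (control + 1))
cor(normalized, signal)  # negative

# c set to the ratio of sequencing depths:
# c * control == signal, so the profile is exactly flat
c_depth <- sum(signal) / sum(control)  # 0.5
flat <- log2((signal + 1) / (c_depth * control + 1))
all(flat == 0)  # TRUE
```

The flatness is exact here because the control is a noiseless multiple of the signal; with real data, differing read count distributions prevent this simple fix from working.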

# SVM with recursive feature elimination in R

The support vector machine (SVM) is a very powerful classifier due to its inherent regularization properties as well as its ability to handle decision boundaries of arbitrary complexity by its formulation as a kernel method. However, suppose that we’re not so much interested in the decision boundary itself, but more in the relative importance of the features in specifying this decision boundary. For example, given two sets (i.e. ‘classes’) of samples, one from tumor tissue and one from normal tissue, and associated RNA-seq gene expression data, one may be interested in ranking genes (i.e. ‘features’) according to their importance in specifying the difference (i.e. ‘decision boundary’) between normal and tumor tissue. This represents an important problem in the field of machine learning, and is called feature ranking. Guyon et al. proposed combining SVMs with recursive feature elimination (SVM-RFE) for feature ranking. Recursive feature elimination in its simplest formulation starts with the complete set of features, and then repeats the following three steps until no more features are left:

1. Train a model (in the present case, an SVM).
2. Compute a ranking of features.
3. Remove the feature with the worst rank.

For step 2, ranking the features, Guyon et al. proposed to take the coefficients to the SVM primal problem (i.e. the coefficients to the SVM hyperplane) as feature weights. Since SVMs are computed by solving the dual problem using quadratic programming, these hyperplane coefficients first need to be computed from the coefficients to the dual problem. In R, for a matrix X with samples in the rows and features in the columns, and a factor y containing the classes, this looks as follows:


```r
# single-step SVM-RFE
svm_rfe__single_step <- function(X, y, ...) {
  library(kernlab)
  # keep track of the iteration during which
  # a feature was eliminated
  ii <- rep(NA, ncol(X))
  i <- 0
  while (any(is.na(ii))) {
    # indices of remaining features
    not_elim_yet <- which(is.na(ii))
    # train the classifier on the remaining features
    fit <- ksvm(X[, not_elim_yet], y, scaled = FALSE, ...)
    # compute the primal problem coefficients from the dual
    # problem coefficients
    sv_i <- alphaindex(fit)[[1]]
    w <- t(coef(fit)[[1]]) %*% X[sv_i, not_elim_yet]
    # eliminate the feature with the smallest squared weight
    to_elim <- not_elim_yet[head(order(w * w), 1)]
    ii[to_elim] <- i
    i <- i + 1
  }
  # convert iterations into ranks
  i - ii
}
```
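To make the dual-to-primal step in the function above explicit: for a linear kernel, the hyperplane weights can be written in terms of the dual solution as

$w = \sum_{i \in \mathrm{SV}} y_i \alpha_i x_i$

where the sum runs over the support vectors. In kernlab, `alphaindex(fit)[[1]]` returns the support vector indices and `coef(fit)[[1]]` returns the products $y_i \alpha_i$, so the matrix product `t(coef(fit)[[1]]) %*% X[sv_i, not_elim_yet]` evaluates exactly this sum, restricted to the features not yet eliminated.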



The above algorithm eliminates one feature (i.e. gene) per iteration. Typical RNA-seq expression datasets have hundreds to tens of thousands of features, so a call to the function above also needs to train an SVM hundreds to tens of thousands of times. Needless to say, this can take quite a while. Note that we’re most likely mainly interested in the top-ranking genes, and less so in the lower-ranking ones. Therefore, we could eliminate relatively many features during early iterations, and then decrease the number of features to be eliminated as the iteration progresses. Taking the approach of exponential decrease (i.e. eliminating a fixed fraction of the remaining features), this looks as follows:


```r
# multi-step SVM-RFE
# default elim_frac corresponds to single-step
svm_rfe <- function(X, y, elim_frac = 1 / ncol(X), ...) {
  library(kernlab)
  # keep track of the iteration during which
  # a feature was eliminated
  ii <- rep(NA, ncol(X))
  i <- 0
  while (any(is.na(ii))) {
    # indices of remaining features
    not_elim_yet <- which(is.na(ii))
    # number of features to eliminate
    n_to_elim <- ceiling(elim_frac * length(not_elim_yet))
    # train the classifier on the remaining features
    fit <- ksvm(X[, not_elim_yet], y, scaled = FALSE, ...)
    # compute the primal problem coefficients from the dual
    # problem coefficients
    sv_i <- alphaindex(fit)[[1]]
    w <- t(coef(fit)[[1]]) %*% X[sv_i, not_elim_yet]
    # eliminate the features with the smallest squared weights
    to_elim <- not_elim_yet[head(order(w * w), n_to_elim)]
    ii[to_elim] <- i
    i <- i + 1
  }
  # convert iterations into ranks
  i - ii
}
```



Let’s see how this works in practice. For this, we’ll use the ‘cervical’ data from the Bioconductor package MLSeq. This dataset contains RNA-seq expression levels of 714 miRNAs in 29 tumor and 29 normal human cervical samples. We can load and preprocess the data, and run the above functions, using the following code:


```r
library(MLSeq)

data(cervical)

# preprocess: samples in rows, features in columns
y <- factor(grepl("^T[0-9].*", colnames(cervical)))
levels(y) <- c("normal", "tumor")
X <- scale(t(data.matrix(cervical)))

# SVM-RFE single-step
system.time(I <- svm_rfe(X, y))
# SVM-RFE multi-step
system.time(J <- svm_rfe(X, y, 0.1))
```



In the example above, using the default value removes one feature at a time and thus trains 714 SVMs, one per feature. Setting elim_frac to 0.1 (i.e. eliminating 10% of the remaining features at each iteration) trains only 47 SVMs. Although this is roughly 15 times fewer SVMs, it runs approximately 30 times faster (~1 second vs. ~33 seconds on my laptop). The reason is that the multi-step approach mostly saves the early, expensive SVMs trained on many features, while the cheap SVMs trained on few features are largely retained. So, we lose some resolution among the relatively uninteresting lower-ranking features, but gain a lot in terms of execution time. This gain in execution time is visualized in the following plot.
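The number of SVMs each setting trains can be verified without fitting any models, by mirroring the elimination schedule of svm_rfe (count_svms is a small hypothetical helper, not part of the functions above):

```r
# count how many SVMs a given elim_frac would train,
# by replaying the elimination schedule of svm_rfe
count_svms <- function(n_features, elim_frac) {
  n <- n_features
  n_svms <- 0
  while (n > 0) {
    n_svms <- n_svms + 1
    # same rule as in svm_rfe: eliminate a fixed fraction,
    # rounded up, of the remaining features
    n <- n - ceiling(elim_frac * n)
  }
  n_svms
}

count_svms(714, 1 / 714)  # 714: single-step, one SVM per feature
count_svms(714, 0.1)      # 47: far fewer with multi-step
```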