The pseudocount in the analysis of genome-wide sequencing profiles.

A commonly used approach for normalizing a binned genome-wide sequencing profile with a control, is the following:

x_i^{norm} = log_2\frac{x_i + 1}{cy_i + 1}

Here, x_i^{norm} is the normalized signal in genomic bin i , x_i represents the number of signal reads in bin i , y_i represents the number of control reads in bin i. c is the normalization constant used to make signal and control quantitatively comparable.
Often, the ratio of sequencing depths is used. However, alternative and more sophisticated methods have been developed, for example the autocorrelation-based method presented in Chapter 5 of my PhD thesis. To account for bins with no reads, a pseudocount of 1 is often added, as such avoiding division by zero. However, adding a pseudocount can give surprising, and questionable, results. As an example, consider the following R-code for randomly generating an artificial ChIP-seq signal and control:

signal <- rpois(1e2, lambda = 1)
control <- signal * 2
normalized <- log2((signal + 1) / (control + 1))

In the resulting data, on each position, the control has exactly twice as many reads as the signal:

Artificial ChIP-seq signal and control, where control = 2 * signal.

Because of the factor 2 difference, adding a pseudocount of 1 in the normalization (taking c = 1 for now) will result in a profile that is negatively correlated with both signal and control!

Signal, control, and normalized signal (taking c = 1 for now). The normalized signal is negatively correlated with both signal and control!

The reason for the negative correlation is that the relative contribution of the constant 1 to the signal is much higher than to the control, because the “sequencing depth” of the control is twice as high as that of the signal. In the artificial example given above, the problem could have easily been solved by setting the constant c to the ratio of sequencing depths between signal and control. This would have resulted in a normalized profile that was completely flat. However, in practice this will most often not solve the problem, one important reason being that genome-wide read count distributions typically differ between signal and control. While the above may seem like a contrived example, these effects can be observed when analyzing actual ChIP-seq data, as the example of normalizing a ChIP-seq profile of the transcription factor Nanog given in Chapter 6 of my PhD thesis demonstrates. To alleviate these problems, more sophisticated normalization methods are needed, such as the autocorrelation-based method presented in Chapter 5 of my PhD thesis.