Fast Hamming distance in R using covariance

Over the last few years, I have written a number of posts on efficiently computing the Hamming distance in base R using matrix multiplication, e.g. for large binary matrices, multi-level matrices and feature-weighted matrices, and I have shown that a multi-level Hamming distance matrix can be computed by adding multiple binary Hamming distance matrices. I feel that the Hamming distance is a great use case for efficient use of R, and of many other high-level programming languages as well, since the ideas presented are generally applicable. Hence, I keep wondering whether it can be computed even faster, using only base R and avoiding extensive use of R-level looping constructs.

While the function in my original post was already extremely fast, I have found another approach that is even (albeit slightly) faster when the number of features is relatively large. It is based on the realization that the matrix multiplication in my original approach is highly reminiscent of the way a covariance matrix is computed, so I start the derivation from the definition of the covariance between two vectors. Recall that, analogous to the definition of the population covariance between two random variables X and Y,

\operatorname{cov}(X,Y) = \operatorname{E}[XY] - \operatorname{E}[X]\operatorname{E}[Y],

the covariance between two binary vectors \mathbf{x} and \mathbf{y} of size n is defined as:

\operatorname{cov}(\mathbf{x},\mathbf{y}) = \mathbf{\overline{xy}} - \mathbf{\bar{x}}\mathbf{\bar{y}},

where \mathbf{\overline{xy}}, \mathbf{\bar{x}} and \mathbf{\bar{y}} denote the averages of the element-wise product \mathbf{xy}, of \mathbf{x} and of \mathbf{y}, respectively.
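As a quick sanity check of this definition (on some arbitrary example vectors), it can be compared against R's built-in cov(), keeping in mind that cov() computes the sample covariance (denominator n - 1) rather than the population covariance (denominator n):

# Population covariance of two binary vectors versus R's sample covariance;
# the example vectors are arbitrary.
set.seed(1)
n <- 1000
x <- rbinom(n, 1, 0.3)
y <- rbinom(n, 1, 0.6)
all.equal(mean(x * y) - mean(x) * mean(y), cov(x, y) * (n - 1) / n) # TRUE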

Now, proceeding from this definition, and writing q_{\mathbf{x},\mathbf{y}} = \mathbf{x}^\top\mathbf{y} for the number of positions at which both \mathbf{x} and \mathbf{y} equal 1, we have

\operatorname{cov}(\mathbf{x},\mathbf{y}) = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \mathbf{\bar{x}}\mathbf{\bar{y}} = \frac{q_{\mathbf{x},\mathbf{y}}}{n} - \mathbf{\bar{x}}\mathbf{\bar{y}}.

The above also implies that

q_{\mathbf{x},\mathbf{y}} = n \left( \operatorname{cov}(\mathbf{x},\mathbf{y}) + \mathbf{\bar{x}}\mathbf{\bar{y}} \right).

Now recall from my previous post that the Hamming distance h(\mathbf{x},\mathbf{y}) between two binary vectors \mathbf{x} and \mathbf{y} is computed as

h(\mathbf{x},\mathbf{y}) = \mathbf{x}^\top (\mathbf{1}-\mathbf{y}) + (\mathbf{1}-\mathbf{x})^\top \mathbf{y} = q_{\mathbf{x},(1-\mathbf{y})} + q_{(1-\mathbf{x}),\mathbf{y}}.

Then a bit more algebra, realizing that q_{\mathbf{x},(1-\mathbf{y})} = n\mathbf{\bar{x}} - q_{\mathbf{x},\mathbf{y}} and q_{(1-\mathbf{x}),\mathbf{y}} = n\mathbf{\bar{y}} - q_{\mathbf{x},\mathbf{y}}, gives

h(\mathbf{x},\mathbf{y}) = n \left( \mathbf{\bar{x}} + \mathbf{\bar{y}} - 2\mathbf{\bar{x}}\mathbf{\bar{y}} \right) - 2n \operatorname{cov}(\mathbf{x},\mathbf{y}).

One way of interpreting this is that, up to a term that depends only on the fractions of 1’s in the individual vectors, the Hamming distance between two vectors \mathbf{x} and \mathbf{y} is proportional to the negative covariance between \mathbf{x} and \mathbf{y}. The above formula leads us to another way of computing the Hamming distance between all rows of a binary matrix X, again without using any R-level looping:

hamming_binary_varcov <- function(X) {
  n <- ncol(X)      # number of features (columns)
  Xm <- rowMeans(X) # fraction of 1's in each row
  # XM[i, j] = mean of row i; tXM[i, j] = mean of row j
  XM <- Xm * array(1, dim = c(nrow(X), nrow(X)))
  tXM <- t(XM)
  # entry (i, j): n * (xbar_i + xbar_j - 2 * xbar_i * xbar_j)
  #               - 2 * sum((x_i - xbar_i) * (x_j - xbar_j))
  n * (XM + tXM - 2 * XM * tXM) - 2 * tcrossprod(X - Xm)
}
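As a quick correctness check (on an arbitrary random matrix; this is not part of the benchmark below), the function can be compared against a naive pairwise computation of the Hamming distance between rows:

# Compare hamming_binary_varcov() to a direct (slow) computation over all
# pairs of rows; the matrix dimensions are arbitrary.
set.seed(42)
X <- matrix(rbinom(20 * 50, 1, 0.5), nrow = 20)
H_fast <- hamming_binary_varcov(X)
H_slow <- outer(
  seq_len(nrow(X)), seq_len(nrow(X)),
  Vectorize(function(i, j) sum(X[i, ] != X[j, ]))
)
all.equal(H_fast, H_slow, check.attributes = FALSE) # TRUE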

Now, naturally we want to compare the performance of this new function to the one from my previous post:

hamming_binary_matmul <- function(X) {
	# D[i, j] counts positions where row i is 0 and row j is 1;
	# adding t(D) gives the full Hamming distance between rows i and j
	D <- (1 - X) %*% t(X)
	D + t(D)
}
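The exact benchmark code is not shown here, but a minimal sketch of such a comparison using system.time() could look as follows (the matrix dimensions are arbitrary):

# Rough timing sketch: 1000 objects (rows), 10000 binary features (columns).
set.seed(1)
X <- matrix(rbinom(1000 * 10000, 1, 0.5), nrow = 1000)
system.time(H1 <- hamming_binary_varcov(X))
system.time(H2 <- hamming_binary_matmul(X))
all.equal(H1, H2, check.attributes = FALSE) # identical distances up to numerical error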

We can see that indeed, when the number of features is relatively large (left plot), the covariance-based computation of the Hamming distance presented in this post is slightly faster than the matrix multiplication approach from my previous post, at least on my computer:

[Figure: Comparing execution times of Hamming distance computation using covariance and direct matrix multiplication.]


Fast weighted Hamming distance in R

Upon request, I am posting row-weighted versions of the fast Hamming distance functions from my previous posts (here and here).

The fast row-weighted binary Hamming distance:


hamming_binary <- function(X, w = rep(1, nrow(X))) {
	# Take the square root so that each row weight w[i] enters the
	# cross product exactly once per mismatching row i
	w <- sqrt(w)
	D <- t(w * (1 - X)) %*% (w * X)
	D + t(D)
}
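As a small illustration (with arbitrary example data), the result can be checked against a direct weighted comparison of two columns:

# Check the row-weighted binary Hamming distance between two columns
# against a direct weighted computation; the example data are arbitrary.
set.seed(7)
X <- matrix(rbinom(100 * 5, 1, 0.5), nrow = 100)
w <- runif(100)
H <- hamming_binary(X, w)
all.equal(H[2, 4], sum(w * (X[, 2] != X[, 4]))) # TRUE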

The fast row-weighted “multi-level” Hamming distance:


hamming <- function(X, w = rep(1, nrow(X))) {
	uniqs <- unique(as.vector(X))
	# One binary Hamming distance matrix per unique value in X
	H <- hamming_binary(X == uniqs[1], w)
	for ( uniq in uniqs[-1] ) {
		H <- H + hamming_binary(X == uniq, w)
	}
	# Each differing position is counted twice (once for each of the
	# two differing values), hence the division by 2
	H / 2
}

Note that the last function does not depend on the mode or the number of unique elements of the matrix. In other words, the matrix can consist of any number of different characters, integers, or any other type of element you can store in an R matrix. Also, because the only R-level loop is over the (typically few) unique values, it is still extremely fast.
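For example (an arbitrary illustration), the function can be applied directly to a character matrix:

# The multi-level Hamming distance on a character matrix; for any pair of
# columns, the result equals the number of rows at which they differ.
set.seed(3)
X <- matrix(sample(c("A", "C", "G", "T"), 100 * 4, replace = TRUE), nrow = 100)
H <- hamming(X)
all.equal(H[1, 2], sum(X[, 1] != X[, 2])) # TRUE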

Deep learning for clustering of multivariate short time series with potentially many missing values

Time to plug some of my recent work: a method for clustering multivariate short time series with potentially many missing values, a setting commonly encountered in the analysis of longitudinal clinical data, but generally still poorly addressed in the literature. The method is based on a variational autoencoder with a Gaussian mixture prior (with a latent loss as described in Jiang et al., 2017), extended with LSTMs for modeling multivariate time series, as well as implicit imputation and loss re-weighting for directly dealing with (potentially many) missing values.

I will write some more on the method at a later date, but for now an implementation in TensorFlow is already available on GitHub, and you can read the paper here.

R: are *apply loops faster than for loops?

In a high-level programming language such as R or Python, it is typically a good idea to avoid loops whenever possible, since looping in high-level programming languages can be very time-consuming.

However, sometimes loops cannot be avoided. In R, in addition to traditional looping constructs such as the for and while loop, the *apply() functions provide looping functionality using a single function call. It is often suggested to avoid traditional looping constructs in favor of these *apply() functions, because a single call to an *apply() function would be faster than running through a for or while loop. To investigate this claim, I will run a small benchmark: timing a (trivial) loop that returns the vector of integers 1 to n, for different values of n, using four different variants of running a loop in R (each variant is shown in stripped-down form right after this list):

  • Using a for loop: the return vector is dynamically extended while iterating through the loop. This variant I call FOR_LOOP_DYN_ALLOC.
  • Using a for loop: the memory for the return vector is pre-allocated, before iterating through the loop. This variant I call FOR_LOOP_PRE_ALLOC.
  • Using R’s sapply() function.
  • Using R’s lapply() function, followed by a call to unlist() to return a vector instead of a list.
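Stripped of the timing code used below, the four variants look like this (with n the size of the loop):

# The four looping variants in their bare form; n is the loop size.
n <- 10L

# FOR_LOOP_DYN_ALLOC: grow the result vector element by element
x1 <- integer(0)
for (i in 1:n) x1 <- c(x1, i)

# FOR_LOOP_PRE_ALLOC: pre-allocate the full result vector
x2 <- integer(n)
for (i in 1:n) x2[i] <- i

# SAPPLY
x3 <- sapply(1:n, function(i) i)

# UNLIST_LAPPLY
x4 <- unlist(lapply(1:n, function(i) i))

identical(x1, x2) && identical(x2, x3) && identical(x3, x4) # TRUE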

Below, you can find an R function for timing these loops, for a vector of different loop sizes N and a number of replicates nrep to average the execution time over.


r_loops <- function(
    N, # The different loop sizes to try
    nrep # The number of replicates to average the execution time over
) {
    T <- 1e3 * data.frame(
        FOR_LOOP_DYN_ALLOC = rowMeans(replicate(nrep, {
            unlist(lapply(N, function(n) {
                system.time({
                    x <- integer(0)
                    for (i in 1:n) {
                        x <- c(x, i)
                    }
                })[3]
            }))
        })),
        FOR_LOOP_PRE_ALLOC = rowMeans(replicate(nrep, {
            unlist(lapply(N, function(n) {
                system.time({
                    x <- integer(n)
                    for (i in 1:n) {
                        x[i] <- i
                    }
                })[3]
            }))
        })),
        SAPPLY = rowMeans(replicate(nrep, {
            unlist(lapply(N, function(n) {
                system.time({
                    x <- sapply(1:n, function(i) {
                        i
                    })
                })[3]
            }))
        })),
        UNLIST_LAPPLY = rowMeans(replicate(nrep, {
            unlist(lapply(N, function(n) {
                system.time({
                    x <- unlist(lapply(1:n, function(i) {
                        i
                    }))
                })[3]
            }))
        }))
    )    

    matplot(
        N,
        T,
        type = "l",
        lty = 1,
        lwd = 2,
        col = rainbow(ncol(T)),
        log = "xy",
        xlab = "size of loop",
        ylab = "execution time (milliseconds)",
        main = sprintf("Average of %i replicates", nrep)
    )
    legend(
        "topleft",
        lty = 1,
        lwd = 2,
        col = rainbow(ncol(T)),
        legend = colnames(T)
    )

}

Let’s run this for relatively short loops and 10 replicates:


png("short_r_loops.png")
r_loops(2^(7:15), 10)
dev.off()

This is what the result looks like:

[Figure: Short loops: execution time in milliseconds of different types of looping in R, as a function of the size of the loop.]

It seems that for these relatively short loops, using for is actually quite comparable to using sapply(). The lapply() function seems a bit faster than sapply(), probably because sapply() is essentially a wrapper for lapply() and thus includes some extra overhead. The for loop with dynamically extended return vector (FOR_LOOP_DYN_ALLOC) is by far the worst performing.
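That sapply() is indeed a wrapper around lapply() can be seen directly from its definition in base R:

# The body of sapply() shows a call to lapply() followed by
# simplification of the result.
body(sapply)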

Let’s try some longer loops (omitting FOR_LOOP_DYN_ALLOC, because of the poor performance; using only three replicates):

[Figure: Long loops: execution time in milliseconds of different types of looping in R, as a function of the size of the loop.]

Now this is interesting: it seems that for long loops, intelligent use of for (by pre-allocating memory) is actually faster than both of the *apply() functions! My guess is that, when calling an *apply() function, the size of the return object is dynamically increased (i.e. during the iteration). For long loops and larger amounts of memory, this will take more time compared to pre-allocation using for.

In any case, the above two plots show that *apply() is not necessarily faster than for. Therefore, an important reason for using *apply() functions may instead be that they fit the functional programming paradigm better, where everything is done using function calls and side effects are reduced: the *apply() functions are essentially 'functionals' that take a certain function f as an input argument. The scope of the variables defined within f is limited to f, and variables defined outside f cannot be modified inside f (except with the special scoping assignment operator <<-). This is what makes calls to *apply() functions very different from running for loops.
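A small example of this difference in scoping behaviour (using a running total as an arbitrary side effect):

# A for loop assigns directly in the current environment...
total <- 0
for (i in 1:3) total <- total + i
total # 6

# ...whereas ordinary assignment inside the function passed to sapply()
# only creates a local variable; the outer 'total' is left untouched
total <- 0
invisible(sapply(1:3, function(i) total <- total + i))
total # still 0

# Only the scoping assignment operator <<- modifies the outer variable
invisible(sapply(1:3, function(i) total <<- total + i))
total # 6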