Fast Hamming distance in R using covariance

Over the last few years, I have written a number of posts on efficiently computing the Hamming distance in base R using matrix multiplication, e.g. for large binary matrices, multi-level matrices and feature-weighted matrices, and I have shown that a multi-level Hamming distance matrix can be computed by adding multiple binary Hamming distance matrices. I feel that the Hamming distance is a great use case for efficient use of R, and of many other high-level programming languages as well, since the ideas I presented are generally applicable. Hence, I never stop wondering whether it can be computed even faster, using only base R and avoiding extensive use of R-level looping constructs. While the function in my original post was already extremely fast, I have found another approach that is even (albeit slightly) faster when the number of features is relatively large. It is based on the realization that the matrix multiplication in my original approach is highly reminiscent of the way a covariance matrix is computed, and therefore I start the derivation with the definition of the covariance between two vectors. Recall that, analogous to the definition of the population covariance between two random variables X and Y,

\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y]

the covariance between two binary vectors \mathbf{x} and \mathbf{y} of size n is defined as:

q_{\mathbf{x},\mathbf{y}} = \mathbf{\overline{xy}} - \mathbf{\bar{x}}\mathbf{\bar{y}}

where \mathbf{\overline{xy}}, \mathbf{\bar{x}} and \mathbf{\bar{y}} denote the averages of \mathbf{xy} (the element-wise product), \mathbf{x} and \mathbf{y}, respectively.
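As a quick numeric sanity check (on hypothetical random data), this definition can be verified in base R; note that cov() uses the unbiased n - 1 denominator, so it differs from q_{\mathbf{x},\mathbf{y}} by a factor of (n - 1)/n:

```r
set.seed(1)
n <- 100
x <- sample(0:1, n, replace = TRUE)  # random binary vectors
y <- sample(0:1, n, replace = TRUE)

# q as defined above: average of the element-wise product minus product of averages
q <- mean(x * y) - mean(x) * mean(y)

# base R's cov() divides by n - 1 instead of n
all.equal(q, cov(x, y) * (n - 1) / n)  # TRUE
```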

Now, proceeding from this definition

q_{\mathbf{x},\mathbf{y}} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \mathbf{\bar{x}}\mathbf{\bar{y}} = \frac{1}{n}\mathbf{x}^T\mathbf{y} - \mathbf{\bar{x}}\mathbf{\bar{y}}

The above also implies that

\mathbf{x}^T\mathbf{y} = n\left(q_{\mathbf{x},\mathbf{y}} + \mathbf{\bar{x}}\mathbf{\bar{y}}\right)

Now recall from my previous post that the Hamming distance h(\mathbf{x},\mathbf{y}) between two binary vectors \mathbf{x} and \mathbf{y} is computed as

h(\mathbf{x},\mathbf{y}) = \mathbf{x}^T(\mathbf{1}-\mathbf{y}) + (\mathbf{1}-\mathbf{x})^T\mathbf{y}
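To see that this expression really counts mismatching positions, here is a small check on hypothetical random binary vectors:

```r
set.seed(2)
x <- sample(0:1, 50, replace = TRUE)
y <- sample(0:1, 50, replace = TRUE)

# x'(1 - y) counts positions with x = 1, y = 0; (1 - x)'y counts the reverse
h <- sum(x * (1 - y)) + sum((1 - x) * y)
identical(h, sum(x != y))  # TRUE
```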

Then a bit more algebra and realizing that q_{\mathbf{x},(\mathbf{1} - \mathbf{y})} = -q_{\mathbf{x},\mathbf{y}} gives

h(\mathbf{x},\mathbf{y}) = n\left(q_{\mathbf{x},\mathbf{1}-\mathbf{y}} + \mathbf{\bar{x}}(1-\mathbf{\bar{y}})\right) + n\left(q_{\mathbf{1}-\mathbf{x},\mathbf{y}} + (1-\mathbf{\bar{x}})\mathbf{\bar{y}}\right) = n\left(\mathbf{\bar{x}} + \mathbf{\bar{y}} - 2\mathbf{\bar{x}}\mathbf{\bar{y}}\right) - 2n\,q_{\mathbf{x},\mathbf{y}}
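The final identity can again be checked numerically on hypothetical random data:

```r
set.seed(3)
n <- 200
x <- sample(0:1, n, replace = TRUE)
y <- sample(0:1, n, replace = TRUE)

# covariance q and the closed-form expression for the Hamming distance
q <- mean(x * y) - mean(x) * mean(y)
h <- n * (mean(x) + mean(y) - 2 * mean(x) * mean(y)) - 2 * n * q

all.equal(h, sum(x != y))  # TRUE up to floating-point tolerance
```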

One way of interpreting this is that, up to a term dependent only on the fractions of 1’s in the individual vectors, the Hamming distance between two vectors \mathbf{x} and \mathbf{y} is proportional to the negative covariance between \mathbf{x} and \mathbf{y}. The above formula leads us to another way of computing the Hamming distance between all rows of a binary matrix X, again without using any R-level looping:

hamming_binary_varcov <- function(X) {
  n <- ncol(X)                                     # number of features per (row) vector
  Xm <- rowMeans(X)                                # fraction of 1's in each row
  XM <- Xm * array(1, dim = c(nrow(X), nrow(X)))   # XM[i, j] = Xm[i]
  tXM <- t(XM)                                     # tXM[i, j] = Xm[j]
  # n * (xbar_i + xbar_j - 2 * xbar_i * xbar_j) - 2 * n * q_ij, where
  # n * q_ij is the cross product of the row-centered matrix
  n * (XM + tXM - 2 * XM * tXM) - 2 * tcrossprod(X - Xm)
}

Now, naturally we want to compare the performance of this new function to the one from my previous post:

hamming_binary_matmul <- function(X) {
  # D[i, j] counts positions where row i is 0 and row j is 1;
  # adding the transpose adds the mismatches in the other direction
  D <- (1 - X) %*% t(X)
  D + t(D)
}

We can see that indeed, when the number of features is relatively large (left plot), the covariance-based computation of the Hamming distance in this post is even slightly faster than the matrix-multiplication approach from my previous post, at least on my computer:

[Figure: Comparing execution times of Hamming distance computation using covariance and direct matrix multiplication.]

