Fast Hamming distance in R using covariance

Over the last years, I’ve written number of posts on efficiently computing the Hamming distance in base R using matrix multiplication. e.g. for large binary matrices, multi-level matrices and feature-weighted matrices, and shown that a multi-level Hamming distance matrix can be computed by adding multiple binary Hamming distance matrices. I feel that the Hamming distance is a great use case for efficient use of R, and many other high-level programming languages as well since the ideas I presented are generally applicable. Hence, I never stop wondering whether it can be computed even faster, using only base R and avoiding extensive use of R-level looping constructs. While the function in my original post was already extremely fast, I have actually found another approach that is even (albeit slightly) faster when the number of features is relatively large. It is based on the realization that the matrix multiplication in my original approach is highly reminiscent of the way a covariance matrix is computed, and therefore I start the derivation with the definition of the covariance between two vectors. Recall that analogous to the definition of the the population covariance between two random variables $X$ and $Y$

the covariance between two binary vectors $\mathbf{x}$ and $\mathbf{y}$ of size $n$ is defined as:

where $\mathbf{\overline{xy}}$, $\mathbf{\bar{x}}$ and $\mathbf{\bar{x}}$ denote the averages of $\mathbf{xy}$, $\mathbf{x}$ and $\mathbf{y}$ respectively.

Now, proceeding from this definition

The above also implies that

Now recall from my previous post that the hamming distance $h(\mathbf{x},\mathbf{y})$ between two binary vectors $\mathbf{x}$ and $\mathbf{y}$ is computed as

Then a bit more algebra and realizing that $q_{\mathbf{x},(1 - \mathbf{y})} = q_{\mathbf{x},\mathbf{y}}$ gives

One way of interpreting this is that, up to a term dependent only on the fractions of 1’s in the individual vectors, the Hamming distance between two vectors $\mathbf{x}$ and $\mathbf{y}$ is proportional to the negative covariance between $\mathbf{x}$ and $\mathbf{y}$. The above formula leads us to another way computing the Hamming distance between all columns of a binary matrix X, again without using any R-level looping:

hamming_binary_varcov <- function(X) {
n <- ncol(X)
Xm <- rowMeans(X)
XM <- Xm * array(1, dim = c(nrow(X), nrow(X)))
tXM <- t(XM)
n * (XM + tXM - 2 * XM * tXM) - 2 * tcrossprod(X - Xm)
}


Now, naturally we want to compare the performance of this new function to the one from my previous post:

hamming_binary_matmul <- function(X) {
D <- (1 - X) %*% t(X)
D + t(D)
}


We can see that indeed, when the number of features is relatively large (left plot), the covariance-based computation of the Hamming distance in this post seems even slightly faster than the one in my previous post, at least on my computer:

Fast weighted Hamming distance in R

Upon request, I am posting a row-weighted version of the fast Hamming distance function in my previous posts (here and here).

The fast row-weighted binary Hamming distance:


hamming_binary <- function(X, w = rep(1, nrow(X))) {
w <- sqrt(w)
D <- t(w * (1 - X)) %*% (w * X)
D + t(D)
}


The fast row-weighted “multi-level” Hamming distance:


hamming <- function(X, w = rep(1, nrow(X))) {
uniqs <- unique(as.vector(X))
H <- hamming_binary(X == uniqs[1], w)
for ( uniq in uniqs[-1] ) {
H <- H + hamming_binary(X == uniq, w)
}
H / 2
}


Note that the last function is not dependent on the mode and number of unique elements of the matrix. In other words, the matrix can consist of any number of different characters, integers, and any other type of element you can store in an R matrix. Also, because it avoids R-level looping, it is extremely fast.

Deep learning for clustering of multivariate short time series with potentially many missing values

Time to plug some of my recent work: a method for clustering multivariate short time series with potentially many missing values, a setting commonly encountered in the analysis of longitudinal clinical data, but generally still poorly addressed in the literature. The method is based on a variational autoencoder with a Gaussian mixture prior (with a latent loss as described in Jiang et al., 2017), extended with LSTMs for modeling multivariate time series, as well as implicit imputation and loss re-weighting for directly dealing with (potentially many) missing values.

I will write some more on the method at a later date, but for now an implementation in Tensorflow is already available on github, and you can read the paper here.