Hamming distance in R using Rcpp

After all my posts on efficiently computing the Hamming distance in R, two questions remain for me. First, can we compute the Hamming distance even faster when implementing it in a low-level programming language such as C/C++, and then use the great Rcpp package to directly call this low-level C/C++ function from R? Second, how much would parallelization help? I will answer the second question in a separate post, and address the first question here.

I implemented two C/C++ functions for computing the Hamming distance between all pairs of columns of a binary matrix X:

  1. A straightforward approach that I called “hamming__C_simple”: this function simply iterates through all combinations of columns of a matrix X to compare all elements between each pair of columns.
  2. A more involved approach that I called “hamming__C_bitwise”: This function exploits the ability one has in C to directly manipulate individual bits in memory, which should typically be very fast. The function first interprets the R binary vectors as sequences of unsigned integers, to then perform bitwise operations on these integers.

Note that focusing on binary matrices does not limit the applicability of the C/C++ functions, because multi-value Hamming distances can be easily computed by adding multiple binary Hamming distances.

The code for both implementations is shown below.


#include "Rcpp.h"
#include "math.h"
#include "stdlib.h"
#include "stdio.h"
#include "math.h"
#include "algorithm"
#include "climits"
using namespace Rcpp;

unsigned long long binvec2int(IntegerMatrix x) {
unsigned long long u = 0;
int n = x.nrow();
for (int i = 0; i < n; i++) {
u += x(i, 0) * (unsigned long long) pow(2, n - 1 - i);
}
return u;
}

// [[Rcpp::export]]
NumericMatrix hamming__C_bitwise(IntegerMatrix X) {
int i, j, k, from, to, h;
int stepsize = sizeof(unsigned long long) * CHAR_BIT;
int n = ceil(((double) X.nrow()) / stepsize);
IntegerMatrix x;
NumericMatrix H(X.ncol(), X.ncol());
// convert X to unsigned long long
unsigned long long *xx =
(unsigned long long *) malloc(n * X.ncol() * sizeof(unsigned long long));
i = 0;
while ((from = i * stepsize) < X.nrow()) {
for (j = 0; j < X.ncol(); j++) {
to = std::min(X.nrow(), (i + 1) * stepsize);
x = X(Range(from, to - 1), Range(j, j));
xx[i + j * n] = binvec2int(x);
}
i++;
}

for (i = 0; i < X.ncol(); i++) {
for (j = i + 1; j < X.ncol(); j++) {
for (k = 0; k < n; k++) {
h = __builtin_popcountll(xx[k + i * n] ^ xx[k + j * n]);
H(i, j) += h;
H(j, i) += h;
}
}
}
free(xx);
return H;
}

// [[Rcpp::export]]
NumericMatrix hamming__C_simple(IntegerMatrix X) {
int i, j, k, h;
NumericMatrix H(X.ncol(), X.ncol());
for (i = 0; i < X.ncol(); i++) {
for (j = i + 1; j < X.ncol(); j++) {
for (k = 0; k < n; k++) {
h = (X(k, i) != X(k, j));
H(i, j) += h;
H(j, i) += h;
}
}
}
return H;
}

I compared the performance of the two C/C++ functions to a very fast pure R function based on matrix multiplication that I wrote about before:

performance_C
C++ vs. pure R in computing the Hamming distance.

Interestingly, we can see that the straightforward C++ implementation (C_simple) is not faster than the pure R function (pure_R). In other words, efficient use of pure R can easily be faster than fairly regular use of C++ within R. We can see that if we actually want to be faster than the pure R function, one way of doing that is exploiting C’s bitwise operations (C_bitwise).

Fast Hamming distance in R using covariance

Over the last years, I’ve written number of posts on efficiently computing the Hamming distance in base R using matrix multiplication. e.g. for large binary matrices, multi-level matrices and feature-weighted matrices, and shown that a multi-level Hamming distance matrix can be computed by adding multiple binary Hamming distance matrices. I feel that the Hamming distance is a great use case for efficient use of R, and many other high-level programming languages as well since the ideas I presented are generally applicable. Hence, I never stop wondering whether it can be computed even faster, using only base R and avoiding extensive use of R-level looping constructs. While the function in my original post was already extremely fast, I have actually found another approach that is even (albeit slightly) faster when the number of features is relatively large. It is based on the realization that the matrix multiplication in my original approach is highly reminiscent of the way a covariance matrix is computed, and therefore I start the derivation with the definition of the covariance between two vectors. Recall that analogous to the definition of the the population covariance between two random variables X and Y

f1

the covariance between two binary vectors \mathbf{x} and \mathbf{y} of size n is defined as:

f2

where \mathbf{\overline{xy}}, \mathbf{\bar{x}} and \mathbf{\bar{x}} denote the averages of \mathbf{xy}, \mathbf{x} and \mathbf{y} respectively.

Now, proceeding from this definition

f3

The above also implies that

f4

Now recall from my previous post that the hamming distance h(\mathbf{x},\mathbf{y}) between two binary vectors \mathbf{x} and \mathbf{y} is computed as

f5

Then a bit more algebra and realizing that q_{\mathbf{x},(1 - \mathbf{y})} = q_{\mathbf{x},\mathbf{y}} gives

f6

One way of interpreting this is that, up to a term dependent only on the fractions of 1’s in the individual vectors, the Hamming distance between two vectors \mathbf{x} and \mathbf{y} is proportional to the negative covariance between \mathbf{x} and \mathbf{y}. The above formula leads us to another way computing the Hamming distance between all columns of a binary matrix X, again without using any R-level looping:

hamming_binary_varcov <- function(X) {
  n <- ncol(X)
  Xm <- rowMeans(X)
  XM <- Xm * array(1, dim = c(nrow(X), nrow(X)))
  tXM <- t(XM)
  n * (XM + tXM - 2 * XM * tXM) - 2 * tcrossprod(X - Xm)
}

Now, naturally we want to compare the performance of this new function to the one from my previous post:

hamming_binary_matmul <- function(X) {
	D <- (1 - X) %*% t(X)
	D + t(D)
}

We can see that indeed, when the number of features is relatively large (left plot), the covariance-based computation of the Hamming distance in this post seems even slightly faster than the one in my previous post, at least on my computer:

Performance
Comparing execution times of Hamming distance computation using covariance and direct matrix multiplication.

 

Fast weighted Hamming distance in R

Upon request, I am posting a row-weighted version of the fast Hamming distance function in my previous posts (here and here).

The fast row-weighted binary Hamming distance:


hamming_binary <- function(X, w = rep(1, nrow(X))) {
	w <- sqrt(w)
	D <- t(w * (1 - X)) %*% (w * X)
	D + t(D)
}

The fast row-weighted “multi-level” Hamming distance:


hamming <- function(X, w = rep(1, nrow(X))) {
	uniqs <- unique(as.vector(X))
	H <- hamming_binary(X == uniqs[1], w)
	for ( uniq in uniqs[-1] ) {
		H <- H + hamming_binary(X == uniq, w)
	}
	H / 2
}

Note that the last function is not dependent on the mode and number of unique elements of the matrix. In other words, the matrix can consist of any number of different characters, integers, and any other type of element you can store in an R matrix. Also, because it avoids R-level looping, it is extremely fast.

Faster Hamming distance in R (2)

In my last post, I proposed a very fast and simple matrix multiplication approach for calculating the Hamming distance between all columns of a matrix X. This approach was restricted to binary matrices:


hamming_binary <- function(X, Y = NULL) {
	if (is.null(Y)) {
		D <- t(1 - X) %*% X
		D + t(D)
	} else {
		t(1 - X) %*% Y + t(X) %*% (1 - Y)
	}
}

Since it only relies on low-level matrix algebra routines, the above function is extremely fast, much faster for example than the hamming.distance function from the e1071 package, which is implemented as an R-level nested loop and is also restricted to binary matrices. However, what if we would like to calculate the Hamming distance between strings of characters, such as DNA sequences, which are composed of As, Cs, Gs and Ts? The nice thing is, that matrix multiplication can also be used for matrices with more than two possible values, and different types of elements. Furthermore, the approach below is generalized to calculating the Hamming distances between all pairs of columns of two matrices X and Y:


hamming <- function(X, Y = NULL) {
	if (is.null(Y)) {
		uniqs <- unique(as.vector(X))
		U <- X == uniqs[1]
		H <- t(U) %*% U
		for ( uniq in uniqs[-1] ) {
			U <- X == uniq
			H <- H + t(U) %*% U
		}
	} else {
		uniqs <- union(X, Y)
		H <- t(X == uniqs[1]) %*% (Y == uniqs[1])
		for ( uniq in uniqs[-1] ) {
			H <- H + t(X == uniq) %*% (Y == uniq)
		}
	}
	nrow(X) - H
}

The function above calculates the Hamming distance between all columns of a matrix X, or two matrices X and Y. Again matrix multiplication is used, this time for counting, between two columns x and y, the number of cases in which corresponding elements have the same value (e.g. A, C, G or T). This counting is done for each of the possible values individually, while iteratively adding the results. The end result of the iterative adding is the sum of all corresponding elements that are the same, i.e. the inverse of the Hamming distance. Therefore, the last step is to subtract this end result H from the maximum possible distance, which is the number of rows of matrix X.

Interestingly, such a "multi-level" Hamming distance can also be quite elegantly expressed as a sum of multiple binary Hamming distances:


hamming <- function(X, Y = NULL) {
	if (is.null(Y)) {
		uniqs <- unique(as.vector(X))
		H <- hamming_binary(X == uniqs[1])
		for ( uniq in uniqs[-1] ) {
			H <- H + hamming_binary(X == uniq)
		}
	} else {
		uniqs <- union(X, Y)
		H <- hamming_binary(X == uniqs[1], Y == uniqs[1])
		for ( uniq in uniqs[-1] ) {
			H <- H + hamming_binary(X == uniq, Y == uniq)
		}
	}
	H / 2
}

Note that at the end, we need to divide by two, because by summing the individual binary Hamming distances, we are counting each element twice.

Also note that the function is not dependent on the mode of the matrix. In other words, the matrix can consist of characters, integers, and any other type of element you can store in an R matrix.

Fast Hamming distance in R using matrix multiplication


This post is strictly about binary matrices. For matrices of any data type, have a look at this post.

The hamming distance is perhaps the most frequently used distance measure for binary vectors. It is defined on two vectors of equal length as the number of corresponding elements that differ between the two vectors. A special case of the Hamming distance, is the Hamming distance between two binary vectors, which is often used e.g. in telecommunication to count the number of bit flips in a signal, and in bioinformatics as a similarity measure for hierarchical clustering. While this binary Hamming distance calculation between only two vectors is simple and computationally cheap, calculating all pairwise distances between the rows of a matrix quickly becomes prohibitive, as the computation time scales quadratically with the number of rows in a matrix. The following R function represents a basic implementation of calculating the Hamming distance between all rows of a matrix, using a nested for-loop:


hamming_loop <- function(X) {
	d <- matrix(0, nrow = nrow(X), ncol = nrow(X))
	for ( i in 1:(nrow(X) - 1) ) {
		for ( j in (i + 1):nrow(X) ) {
			d[i,j] <- d[j,i] <- sum(X[i,] != X[j,])
		}
	}
	d
}

 

However, in R, the Hamming distance between the rows of a binary matrix can be computed much faster, by exploiting the fact that the dot product of two binary vectors x and (1 – y) counts the corresponding elements where x has a 1 and y has a 0:


hamming <- function(X) {
	D <- (1 - X) %*% t(X)
	D + t(D)
}

To demonstrate how much faster the matrix multiplication function is than the nested for-loop function, I ran some simulations. In addition to the above two functions, I included the function hamming.distance from the R-package e1071:

Hamming distance computation time in seconds, as a function of number of rows, while keeping the number of columns at 100.
Hamming distance computation time in seconds, as a function of number of rows, while keeping the number of columns at 100.

The relatively naive nested for-loop approach is faster than the hamming.distance function from the e1071 package. This is surprising, since the e1071 hamming.distance function is also implemented as a nested loop. Strikingly, the matrix multiplication-based function is by far the fastest function for computing the Hamming distance, the reason being that it does not use any expensive looping at R-level, only efficient lower level linear algebra routines. Note that the above approach is restricted to binary matrices. However, quite conveniently, matrix multiplication can also be used for computing the Hamming distance for matrices with more than two possible values, and other types of elements (character, numeric, …), as explained in this post.