Johann de Jong

Algorithms, Hamming distance, Information theory

johanndejong

September 23, 2015

Fast Hamming distance in R

This post is strictly about binary matrices. For matrices of any data type, have a look at this post. The hamming distance is perhaps the most frequently used distance measure for binary vectors. It is defined on two vectors of equal length as the number of corresponding elements that differ between the two vectors. A special case…

This post is strictly about binary matrices. For matrices of any data type, have a look at this post.

The hamming distance is perhaps the most frequently used distance measure for binary vectors. It is defined on two vectors of equal length as the number of corresponding elements that differ between the two vectors. A special case of the Hamming distance, is the Hamming distance between two binary vectors, which is often used e.g. in telecommunication to count the number of bit flips in a signal, and in bioinformatics as a similarity measure for hierarchical clustering. While this binary Hamming distance calculation between only two vectors is simple and computationally cheap, calculating all pairwise distances between the rows of a matrix quickly becomes prohibitive, as the computation time scales quadratically with the number of rows in a matrix. The following R function represents a basic implementation of calculating the Hamming distance between all rows of a matrix, using a nested for-loop:

hamming_loop <- function(X) {
	d <- matrix(0, nrow = nrow(X), ncol = nrow(X))
	for ( i in 1:(nrow(X) - 1) ) {
		for ( j in (i + 1):nrow(X) ) {
			d[i,j] <- d[j,i] <- sum(X[i,] != X[j,])
		}
	}
	d
}

However, in R, the Hamming distance between the rows of a binary matrix can be computed much faster, by exploiting the fact that the dot product of two binary vectors x and (1 – y) counts the corresponding elements where x has a 1 and y has a 0:

hamming <- function(X) {
	D <- (1 - X) %*% t(X)
	D + t(D)
}

To demonstrate how much faster the matrix multiplication function is than the nested for-loop function, I ran some simulations. In addition to the above two functions, I included the function hamming.distance from the R-package e1071:

Hamming distance computation time in seconds, as a function of number of rows, while keeping the number of columns at 100.

The relatively naive nested for-loop approach is faster than the hamming.distance function from the e1071 package. This is surprising, since the e1071 hamming.distance function is also implemented as a nested loop. Strikingly, the matrix multiplication-based function is by far the fastest function for computing the Hamming distance, the reason being that it does not use any expensive looping at R-level, only efficient lower level linear algebra routines. Note that the above approach is restricted to binary matrices. However, quite conveniently, matrix multiplication can also be used for computing the Hamming distance for matrices with more than two possible values, and other types of elements (character, numeric, …), as explained in this post.

Tags:

Responses to “Fast Hamming distance in R”

Faster Hamming distance in R (2) | Johann de Jong

October 2, 2015

[…] Faster Hamming distance in R using matrix multiplication […]

LikeLike

Reply
Anonymous

July 13, 2018

If your matricies are small

LikeLike

Reply
1. johanndejong
  
  July 13, 2018
  
  Not sure what you mean, but it works for matrices of any size that R allows.
  
  LikeLike
  
  Reply
  1. Anonymous
    
    July 13, 2018
    
    If you have a 200000 by 52 matrix then you won’t have enough memory to store the matrix
    
    LikeLike
  2. johanndejong
    
    July 13, 2018
    
    Unfortunately that problem is caused by the machine you’re running the algorithm on, not by the matrix multiplication. Regardless of the algorithm, the result matrix will remain 200000 x 200000. If you’re having memory issues you could try (1) having a look at sparse matrices (if your data is sparse), or (2) the bigalgebra package in R, or (3) processing your matrix in batches and write the results to a file.
    
    LikeLike
  3. Anonymous
    
    July 13, 2018
    
    Thank you for the reply – the sparse matrix might be a nifty idea
    
    LikeLike
Anonymous

October 11, 2018

“the dot product of two binary vectors x and (1 – y) counts the corresponding elements that are different between x and y” Is it correct? Say two vectors x = (1, 1, 0) and y = (0, 0, 1), all 3 elements are different, but the dot product between either x * (1 – y) or (1 – x) * y is 3.

LikeLike

Reply
1. johanndejong
  
  October 11, 2018
  
  Hmm, I might be missing your point: didn’t you just give an example that supports that statement? The dot product is 3, and the number of corresponding elements that are different is also 3.
  
  LikeLike
  
  Reply
  1. Anonymous
    
    October 11, 2018
    
    Oh, sorry, I meant to say neither x * (1 – y) nor (1-x) * y is 3.
    
    LikeLike
johanndejong

October 11, 2018

Ah, now I see what you mean, and also see that the phrasing might not be entirely clear (well spotted; updated now). The dot products do in fact count the number of corresponding elements that are different between x and y. For each element in the distance matrix, you could see it as adding two dot products, one that is explicitly computed, and one that is implicitly computed. More specifically, the following is an equivalent way of computing distance matrix: (1 – X) %*% t(X) + X %*% t(1 – X). Note however that X %*% t(1 – X) == t((1 – X) %*% t(X)). Hence, for efficiency reasons, I choose to compute it only once, and then add the transpose, i.e. D + t(D). So, one dot product that computes all distances.

LikeLike

Reply
1. Anonymous
  
  October 12, 2018
  
  Thanks a lot for the clarification!
  
  LikeLike
  
  Reply
2. Anonymous
  
  September 6, 2019
  
  What the heck is X?
  
  LikeLike
  
  Reply
  1. johanndejong
    
    September 6, 2019
    
    Stated in the post: “… in R, the Hamming distance between the rows of a binary matrix …”
    
    LikeLike
Efficient Pairwise Distance for Binary Data – Mallory Analytics

March 5, 2019

[…] can save you seconds or even minutes off the runtime of the distance calculations. As outlined here, the Euclidean distance between binary data points can be quickly calculated using a matrix cross […]

LikeLike

Reply
Fast weighted Hamming distance in R – Johann de Jong

September 28, 2019

[…] I am posting a row-weighted version of the fast Hamming distance function in my previous posts (here and […]

LikeLike

Reply
Fast Hamming distance in R using covariance – Johann de Jong

September 30, 2019

[…] on efficiently computing the Hamming distance in base R using matrix multiplication. e.g. for large binary matrices, multi-level matrices and feature-weighted matrices, and shown that a multi-level Hamming distance […]

LikeLike

Reply
Hamming distance in R using Rcpp – Johann de Jong

October 13, 2019

[…] I compared the performance of the two C/C++ functions to a very fast pure R function based on matrix multiplication that I wrote about before: […]

LikeLike

Reply
Fast Hamming distance in Python – Johann de Jong

November 15, 2019

[…] In my previous post, I have shown how matrix multiplication can be used to highly efficiently calculate the pairwise Hamming distances between all columns of a matrix X (or between all columns of matrices X and Y). Naturally, this trick can also be applied to the programming language Python. Matrix multiplication avoids Python-level looping, and should hence be very fast. I have provided no Python benchmark here, but you can check out my previous benchmark in R. […]

LikeLike

Reply

Fast Hamming distance in R

Share this:

Responses to “Fast Hamming distance in R”

Leave a comment Cancel reply