Johann de Jong

Algorithms, Hamming distance, Information theory

johanndejong

October 2, 2015

Faster Hamming distance in R (2)

In my last post, I proposed a very fast and simple matrix multiplication approach for calculating the Hamming distance between all columns of a matrix X. This approach was restricted to binary matrices: Since it only relies on low-level matrix algebra routines, the above function is extremely fast, much faster for example than the hamming.distance function from…

hamming_binary <- function(X, Y = NULL) {
	if (is.null(Y)) {
		D <- t(1 - X) %*% X
		D + t(D)
	} else {
		t(1 - X) %*% Y + t(X) %*% (1 - Y)
	}
}

Since it only relies on low-level matrix algebra routines, the above function is extremely fast, much faster for example than the hamming.distance function from the e1071 package, which is implemented as an R-level nested loop and is also restricted to binary matrices. However, what if we would like to calculate the Hamming distance between strings of characters, such as DNA sequences, which are composed of As, Cs, Gs and Ts? The nice thing is, that matrix multiplication can also be used for matrices with more than two possible values, and different types of elements. Furthermore, the approach below is generalized to calculating the Hamming distances between all pairs of columns of two matrices X and Y:

hamming <- function(X, Y = NULL) {
	if (is.null(Y)) {
		uniqs <- unique(as.vector(X))
		U <- X == uniqs[1]
		H <- t(U) %*% U
		for ( uniq in uniqs[-1] ) {
			U <- X == uniq
			H <- H + t(U) %*% U
		}
	} else {
		uniqs <- union(X, Y)
		H <- t(X == uniqs[1]) %*% (Y == uniqs[1])
		for ( uniq in uniqs[-1] ) {
			H <- H + t(X == uniq) %*% (Y == uniq)
		}
	}
	nrow(X) - H
}

The function above calculates the Hamming distance between all columns of a matrix X, or two matrices X and Y. Again matrix multiplication is used, this time for counting, between two columns x and y, the number of cases in which corresponding elements have the same value (e.g. A, C, G or T). This counting is done for each of the possible values individually, while iteratively adding the results. The end result of the iterative adding is the sum of all corresponding elements that are the same, i.e. the inverse of the Hamming distance. Therefore, the last step is to subtract this end result H from the maximum possible distance, which is the number of rows of matrix X.

Interestingly, such a “multi-level” Hamming distance can also be quite elegantly expressed as a sum of multiple binary Hamming distances:

hamming <- function(X, Y = NULL) {
	if (is.null(Y)) {
		uniqs <- unique(as.vector(X))
		H <- hamming_binary(X == uniqs[1])
		for ( uniq in uniqs[-1] ) {
			H <- H + hamming_binary(X == uniq)
		}
	} else {
		uniqs <- union(X, Y)
		H <- hamming_binary(X == uniqs[1], Y == uniqs[1])
		for ( uniq in uniqs[-1] ) {
			H <- H + hamming_binary(X == uniq, Y == uniq)
		}
	}
	H / 2
}

Note that at the end, we need to divide by two, because by summing the individual binary Hamming distances, we are counting each element twice.

Also note that the function is not dependent on the mode of the matrix. In other words, the matrix can consist of characters, integers, and any other type of element you can store in an R matrix.

Tags:

Responses to “Faster Hamming distance in R (2)”

Fast Hamming distance in R using matrix multiplication – Johann de Jong

January 18, 2016

[…] The relatively naive nested for-loop approach is faster than the hamming.distance function from the e1071 package. This is surprising, since the e1071 hamming.distance function is also implemented as a nested loop. Strikingly, the matrix multiplication-based function is by far the fastest function for computing the Hamming distance, the reason being that it does not use any expensive looping at R-level, only efficient lower level linear algebra routines. Note that the above approach is restricted to binary matrices. However, quite conveniently, matrix multiplication can also be used for computing the hamming distance for matrices with more than two possible values, and other types of elements (character, numeric, …), as explained in this post. […]

LikeLike

Reply
Anonymous

March 22, 2018

It seems your “hamming” function only works on DNA sequences, but not on amino acid sequences where there are 20 different letters?

LikeLike

Reply
1. Anonymous
  
  March 22, 2018
  
  Sorry, forgot to say that they were great posts!
  
  LikeLike
  
  Reply
  1. johanndejong
    
    March 22, 2018
    
    Thanks! Glad you like them.
    
    For me, the hamming function above does in fact work for any type of element (and number of distinct values) you can store in a matrix. For example, try running the following code:
    
    n <- 1000
    m <- 100
    X <- matrix(sample(LETTERS, n * m, replace = TRUE), nrow = n, ncol = m)
    Y <- matrix(sample(LETTERS, n * m, replace = TRUE), nrow = n, ncol = m)
    H <- hamming(X, Y)
    
    # check a random example
    i <- sample(m, 1)
    j <- sample(m, 1)
    sum(X[, i] != Y[, j]) == H[i, j]
    
    LikeLike
Anonymous

March 23, 2018

Very nice! My bad. Didn’t read the post carefully. My sequences in my input matrix are arranged by rows. After transposing it, everything works!

Thanks!

LikeLike

Reply
chris2016

September 21, 2019

Can you add an example to get weighted hamming? Where the weights are the number of levels per feature then each distance is divided by this?

LikeLike

Reply
1. johanndejong
  
  September 23, 2019
  
  Assuming that by “feature” you mean “row” in the matrices X and Y above (not really clear from your question), feature-wise weights are implemented as follows:
  
  hamming_binary <- function(X, w = rep(1, nrow(X))) {
  w <- sqrt(w)
  D <- t(w * (1 – X)) %*% (w * X)
  D + t(D)
  }
  
  This is easily extended to the more complex hamming() function. I might write a separate post on this some time.
  
  LikeLike
  
  Reply
  1. chris2016
    
    September 23, 2019
    
    Yeah that’s it. Thanks. Nice work.
    
    LikeLike
  2. chris2016
    
    September 28, 2019
    
    Sorry, but what if I want to get the feature weights for more levels, i.e. categories. I have from your other code this function I need to adapt for my package, so how to change this?
    
    hamming <- function(X) {
    uniqs <- unique(as.vector(X))
    U <- X == uniqs[1]
    H <- t(U) %*% U
    for ( uniq in uniqs[-1] ) {
    U <- X == uniq
    H <- H + t(U) %*% U
    }
    nrow(X) – H
    }
    
    LikeLike
Fast weighted Hamming distance in R – Johann de Jong

September 28, 2019

[…] Upon request, I am posting a row-weighted version of the fast Hamming distance function in my previous posts (here and here). […]

LikeLike

Reply
johanndejong

September 28, 2019

Hi Chris, You can do that by expressing the “multi-level” function above in terms of the binary Hamming distance. I updated this post, and wrote a quick new post (https://johanndejong.wordpress.com/2019/09/28/fast-weighted-hamming-distance-in-r/) to answer your question. Feel free to use the code in your package, but please cite my blog. Also, please make sure the code does exactly what you intend.

LikeLike

Reply
1. chris2016
  
  September 28, 2019
  
  Ok this is good. I’ll let you know when this progresses and certainly can cite blog.
  
  LikeLiked by 1 person
  
  Reply
Fast Hamming distance in R using covariance – Johann de Jong

September 30, 2019

[…] the Hamming distance in base R using matrix multiplication. e.g. for large binary matrices, multi-level matrices and feature-weighted matrices, and shown that a multi-level Hamming distance matrix can be computed […]

LikeLike

Reply
Hamming distance in R using Rcpp – Johann de Jong

October 13, 2019

[…] Note that focusing on binary matrices does not limit the applicability of the C/C++ functions, because multi-value Hamming distances can be easily computed by adding multiple binary Hamming distances. […]

LikeLike

Reply
Fast Hamming distance in Python – Johann de Jong

November 15, 2019

[…] my previous post, I have shown how matrix multiplication can be used to highly efficiently calculate the […]

LikeLike

Reply
Anonymous

January 5, 2024

This has incredible speed for DNA sequence comparisons! Thank you so much for sharing it.

LikeLiked by 1 person

Reply

Faster Hamming distance in R (2)

Share this:

Responses to “Faster Hamming distance in R (2)”

Leave a comment Cancel reply