Faster Hamming distance in R (2)

In my last post, I proposed a very fast and simple matrix multiplication approach for calculating the Hamming distance between all columns of a matrix X. This approach was restricted to binary matrices:


hamming_binary <- function(X, Y = NULL) {
	if (is.null(Y)) {
		D <- t(1 - X) %*% X
		D + t(D)
	} else {
		t(1 - X) %*% Y + t(X) %*% (1 - Y)
	}
}

Since it only relies on low-level matrix algebra routines, the above function is extremely fast, much faster for example than the hamming.distance function from the e1071 package, which is implemented as an R-level nested loop and is also restricted to binary matrices. However, what if we would like to calculate the Hamming distance between strings of characters, such as DNA sequences, which are composed of As, Cs, Gs and Ts? The nice thing is, that matrix multiplication can also be used for matrices with more than two possible values, and different types of elements. Furthermore, the approach below is generalized to calculating the Hamming distances between all pairs of columns of two matrices X and Y:


hamming <- function(X, Y = NULL) {
	if (is.null(Y)) {
		uniqs <- unique(as.vector(X))
		U <- X == uniqs[1]
		H <- t(U) %*% U
		for ( uniq in uniqs[-1] ) {
			U <- X == uniq
			H <- H + t(U) %*% U
		}
	} else {
		uniqs <- union(X, Y)
		H <- t(X == uniqs[1]) %*% (Y == uniqs[1])
		for ( uniq in uniqs[-1] ) {
			H <- H + t(X == uniq) %*% (Y == uniq)
		}
	}
	nrow(X) - H
}

The function above calculates the Hamming distance between all columns of a matrix X, or two matrices X and Y. Again matrix multiplication is used, this time for counting, between two columns x and y, the number of cases in which corresponding elements have the same value (e.g. A, C, G or T). This counting is done for each of the possible values individually, while iteratively adding the results. The end result of the iterative adding is the sum of all corresponding elements that are the same, i.e. the inverse of the Hamming distance. Therefore, the last step is to subtract this end result H from the maximum possible distance, which is the number of rows of matrix X.

Interestingly, such a "multi-level" Hamming distance can also be quite elegantly expressed as a sum of multiple binary Hamming distances:


hamming <- function(X, Y = NULL) {
	if (is.null(Y)) {
		uniqs <- unique(as.vector(X))
		H <- hamming_binary(X == uniqs[1])
		for ( uniq in uniqs[-1] ) {
			H <- H + hamming_binary(X == uniq)
		}
	} else {
		uniqs <- union(X, Y)
		H <- hamming_binary(X == uniqs[1], Y == uniqs[1])
		for ( uniq in uniqs[-1] ) {
			H <- H + hamming_binary(X == uniq, Y == uniq)
		}
	}
	H / 2
}

Note that at the end, we need to divide by two, because by summing the individual binary Hamming distances, we are counting each element twice.

Also note that the function is not dependent on the mode of the matrix. In other words, the matrix can consist of characters, integers, and any other type of element you can store in an R matrix.