Fast Hamming distance in R using matrix multiplication


This post is strictly about binary matrices. For matrices of any data type, have a look at this post.

The hamming distance is perhaps the most frequently used distance measure for binary vectors. It is defined on two vectors of equal length as the number of corresponding elements that differ between the two vectors. A special case of the Hamming distance, is the Hamming distance between two binary vectors, which is often used e.g. in telecommunication to count the number of bit flips in a signal, and in bioinformatics as a similarity measure for hierarchical clustering. While this binary Hamming distance calculation between only two vectors is simple and computationally cheap, calculating all pairwise distances between the rows of a matrix quickly becomes prohibitive, as the computation time scales quadratically with the number of rows in a matrix. The following R function represents a basic implementation of calculating the Hamming distance between all rows of a matrix, using a nested for-loop:


hamming_loop <- function(X) {
	d <- matrix(0, nrow = nrow(X), ncol = nrow(X))
	for ( i in 1:(nrow(X) - 1) ) {
		for ( j in (i + 1):nrow(X) ) {
			d[i,j] <- d[j,i] <- sum(X[i,] != X[j,])
		}
	}
	d
}

 

However, in R, the Hamming distance between the rows of a binary matrix can be computed much faster, by exploiting the fact that the dot product of two binary vectors x and (1 – y) counts the corresponding elements where x has a 1 and y has a 0:


hamming <- function(X) {
	D <- (1 - X) %*% t(X)
	D + t(D)
}

To demonstrate how much faster the matrix multiplication function is than the nested for-loop function, I ran some simulations. In addition to the above two functions, I included the function hamming.distance from the R-package e1071:

Hamming distance computation time in seconds, as a function of number of rows, while keeping the number of columns at 100.
Hamming distance computation time in seconds, as a function of number of rows, while keeping the number of columns at 100.

The relatively naive nested for-loop approach is faster than the hamming.distance function from the e1071 package. This is surprising, since the e1071 hamming.distance function is also implemented as a nested loop. Strikingly, the matrix multiplication-based function is by far the fastest function for computing the Hamming distance, the reason being that it does not use any expensive looping at R-level, only efficient lower level linear algebra routines. Note that the above approach is restricted to binary matrices. However, quite conveniently, matrix multiplication can also be used for computing the Hamming distance for matrices with more than two possible values, and other types of elements (character, numeric, …), as explained in this post.

Which nationalities do well in piano competitions?

Yesterday, I was listening to Guido Agosti’s awesome piano transcription of Strawinsky’s The Firebird, in a great performance by the Italian pianist Francesco Piemontesi. I remembered Francesco Piemontesi from coming in third at the 2007 Queen Elisabeth Competition in Brussels, and I was thinking: how come it seems that Italian pianists always do so well in international competitions, compared to many other European nationalities? It struck me that this was actually an assumption, as I had never seen any concrete numbers on this, at least not across many different competitions world-wide.

Curious as I am, I paid a visit to a website collecting this type of data, to see if I could find any numbers on this. Unfortunately, the website itself provided only very limited functionality for mining the data, by far not enough to answer the question I was interested in. However, being a computational scientist / data scientist (and classically trained pianist), I could not resist: Using Firefox’s Network Monitor, I tracked down the precise HTTP GET request that was sent to query the database for any individual competition. Then, I constructed my own raw HTTP GET requests, and used the computer networking tool netcat to send these in batch to the server. This allowed me to retrieve the results in HTML for about 3000 international piano competitions. I parsed this HTML data mainly using the XML and stringr R-packages (tricky…), downloaded some more data regarding population size by country from Wikipedia, and then did some elementary analysis in R.

Behold, the first results of analyzing this data: the number of prizes per million inhabitants of a country.

Number of prizes in international piano competitions, per million inhabitants, by country
Number of prizes in international piano competitions, per million inhabitants, by country

The results are pretty interesting. Indeed, Italy ranks highest of all Western European countries! Another suspicion confirmed is that Belgium scores substantially better than the Netherlands (my home country). Somewhat surprising is that the charts are heavily dominated by Eastern Europe (top 7), and more specifically by former Soviet states, excluding Russia itself (top 5!). The top Eastern Asian country is South Korea. China ranks very low, so the large numbers of Chinese pianists may currently still be explained by China’s huge population.

Although the results are quite interesting, some caution regarding the interpretation particularly of small differences is in order:

  1. Population sizes are from 2014-2015; competition results are from 2000-2015.
  2. Some countries do not exist anymore. For example, prizes won by Yugoslavian pianists are not counted towards any of the currently existing states that previously made up Yugoslavia.
  3. The number of prizes in a country may somewhat depend on the number of international competitions in that country.
  4. Some number are very low. For example, Mongolia ranks a bit higher than China (!). However, this is based on only three prizes won by Mongolian pianists.

I will probably have another, more sophisticated, look at this data in the near future. To be continued…..