Hamming distance in R using Rcpp

After all my posts on efficiently computing the Hamming distance in R, two questions remain for me. First, can we compute the Hamming distance even faster when implementing it in a low-level programming language such as C/C++, and then use the great Rcpp package to directly call this low-level C/C++ function from R? Second, how much would parallelization help? I will answer the second question in a separate post, and address the first question here.

I implemented two C/C++ functions for computing the Hamming distance between all pairs of columns of a binary matrix X:

  1. A straightforward approach that I called “hamming__C_simple”: this function simply iterates through all combinations of columns of a matrix X to compare all elements between each pair of columns.
  2. A more involved approach that I called “hamming__C_bitwise”: This function exploits the ability one has in C to directly manipulate individual bits in memory, which should typically be very fast. The function first interprets the R binary vectors as sequences of unsigned integers, to then perform bitwise operations on these integers.

Note that focusing on binary matrices does not limit the applicability of the C/C++ functions, because multi-value Hamming distances can be easily computed by adding multiple binary Hamming distances.

The code for both implementations is shown below.


#include "Rcpp.h"
#include "math.h"
#include "stdlib.h"
#include "stdio.h"
#include "math.h"
#include "algorithm"
#include "climits"
using namespace Rcpp;

unsigned long long binvec2int(IntegerMatrix x) {
unsigned long long u = 0;
int n = x.nrow();
for (int i = 0; i < n; i++) {
u += x(i, 0) * (unsigned long long) pow(2, n - 1 - i);
}
return u;
}

// [[Rcpp::export]]
NumericMatrix hamming__C_bitwise(IntegerMatrix X) {
int i, j, k, from, to, h;
int stepsize = sizeof(unsigned long long) * CHAR_BIT;
int n = ceil(((double) X.nrow()) / stepsize);
IntegerMatrix x;
NumericMatrix H(X.ncol(), X.ncol());
// convert X to unsigned long long
unsigned long long *xx =
(unsigned long long *) malloc(n * X.ncol() * sizeof(unsigned long long));
i = 0;
while ((from = i * stepsize) < X.nrow()) {
for (j = 0; j < X.ncol(); j++) {
to = std::min(X.nrow(), (i + 1) * stepsize);
x = X(Range(from, to - 1), Range(j, j));
xx[i + j * n] = binvec2int(x);
}
i++;
}

for (i = 0; i < X.ncol(); i++) {
for (j = i + 1; j < X.ncol(); j++) {
for (k = 0; k < n; k++) {
h = __builtin_popcountll(xx[k + i * n] ^ xx[k + j * n]);
H(i, j) += h;
H(j, i) += h;
}
}
}
free(xx);
return H;
}

// [[Rcpp::export]]
NumericMatrix hamming__C_simple(IntegerMatrix X) {
int i, j, k, h;
NumericMatrix H(X.ncol(), X.ncol());
for (i = 0; i < X.ncol(); i++) {
for (j = i + 1; j < X.ncol(); j++) {
for (k = 0; k < n; k++) {
h = (X(k, i) != X(k, j));
H(i, j) += h;
H(j, i) += h;
}
}
}
return H;
}

I compared the performance of the two C/C++ functions to a very fast pure R function based on matrix multiplication that I wrote about before:

performance_C
C++ vs. pure R in computing the Hamming distance.

Interestingly, we can see that the straightforward C++ implementation (C_simple) is not faster than the pure R function (pure_R). In other words, efficient use of pure R can easily be faster than fairly regular use of C++ within R. We can see that if we actually want to be faster than the pure R function, one way of doing that is exploiting C’s bitwise operations (C_bitwise).

Rcpp to integrate C with R

If my R script needs a speed boost, I normally use R’s .C() function to call a function written in C from within R. However, after having read many good things about the Rcpp R-package lately, I decided to give this package a try. In principle, Rcpp was developed to integrate C++ with R. However, C++ is almost a superset of C, and therefore integrating Rcpp with C instead of C++ is mostly just a matter of knowing the Rcpp interface and writing some C code.

As a first step in getting to know the Rcpp interface, I set myself the goal of designing a function myOrder() that returns the ordering of the elements of a given numeric vector. In R, this would normally be done by calling the function order(). For example, ordering the vector [2.1, 4.5, 1] would return [3, 1, 2]. I wrote two functions for doing this type of ordering and stored these in a text file myOrder.cpp, with the extension .cpp for C++:


#include <Rcpp.h>
using namespace Rcpp;

int pointer_compare(
    const void* a,
    const void* b
) {
    /* Two double-pointers p1 and p2 */
    double* p1 = *(double**)a;
    double* p2 = *(double**)b;

    /* Compare pointers based on the value they point to */
    if ( *p1 < *p2 ) {
        return -1;
    } else if ( *p2 < *p1 ) {
        return 1;
    }
    return 0;
}

// [[Rcpp::export]]
IntegerVector myOrder(
    DoubleVector x
) {

    int i, n = x.size();
    IntegerVector ret(n); 

    double** p = (double**) malloc(n * sizeof(double*));

    /*
    Retrieve the addresses of the elements in x,
    and store these in the pointer array p.
    */
    for ( i = 0; i < n; i++ ) {
        p[i] = &x[i];
    }

    /*
    Sort the pointers based on their de-referenced
    values. (i.e. the values of the elements in x)
    */
    qsort(p, n, sizeof(double*), pointer_compare);

    /*
    'Normalize' the pointers to a range of [1...n]
    by subtracting the address of the first element
    in x.
    */
    for ( i = 0; i < n; i++ ) {
        ret[i] = p[i] - &x[0] + 1; // add 1 for R
    }

    free(p);

    return ret;
}

The above functions are C code, except for only a few lines:

  • Line 2: defines a namespace for the code, such that anything defined in the header file Rcpp.h that is used in the above code, does not need to be prepended with “Rcpp::”. For example, type DoubleVector is defined in Rcpp.h, and the DoubleVector parameter of function myOrder() would normally need to be declared as “Rcpp::DoubleVector x”. However, by using the namespace, it is sufficient to write “DoubleVector x”.
  • Line 21: specifies that function myOrder() needs to be callable from within R.
  • Lines 22, 23, 27: IntegerVector and DoubleVector are Rcpp-specific C++ type definitions, corresponding to integer and numeric vectors in R, respectively.
  • Line 26: size() is a method for a vector object, which returns the length of that vector.

The above functions can be compiled, linked and called by running the following R code:


# load the library
library(Rcpp)
# compile, link and load the myOrder() function
sourceCpp("myOrder.cpp")
# generate some data
x <- rnorm(10)
# return the ordering of the data
myOrder(x)

Note that by calling the sourceCpp() function, the Rcpp package takes care of compiling and linking, and after that the C function can conveniently be called from within R.

It is interesting to compare the performance of R’s internal order() function with this custom-made Rcpp function.

order
Comparing the running time of R’s internal order() function to the myOrder() function for ordering numeric vectors.

You can see that the Rcpp function is quite a bit faster than R’s internal function. Obviously, R’s internal order() function is more generic than myOrder(). Also, myOrder() does not deal with NAs and only allows for non-decreasing ordering, to name just two things. Still, the above figure shows that, with relatively little effort and quite painlessly compared to R’s .C interface, it is possible to speed up R code substantially, even when considering low-level implemented R functions such as order().