Hamming distance in R using Rcpp

After all my posts on efficiently computing the Hamming distance in R, two questions remain for me. First, can we compute the Hamming distance even faster when implementing it in a low-level programming language such as C/C++, and then use the great Rcpp package to directly call this low-level C/C++ function from R? Second, how much would parallelization help? I will answer the second question in a separate post, and address the first question here.

I implemented two C/C++ functions for computing the Hamming distance between all pairs of columns of a binary matrix X:

  1. A straightforward approach that I called “hamming__C_simple”: this function simply iterates through all combinations of columns of a matrix X to compare all elements between each pair of columns.
  2. A more involved approach that I called “hamming__C_bitwise”: This function exploits the ability one has in C to directly manipulate individual bits in memory, which should typically be very fast. The function first interprets the R binary vectors as sequences of unsigned integers, to then perform bitwise operations on these integers.

Note that focusing on binary matrices does not limit the applicability of the C/C++ functions, because multi-value Hamming distances can be easily computed by adding multiple binary Hamming distances.
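
For example, for a DNA matrix with values A, C, G and T, one could sum the binary Hamming distances of the four indicator matrices and divide by two. Here is a minimal sketch, assuming the hamming__C_bitwise() function defined below (the helper name hamming_multi is mine):


# Sketch: multi-value Hamming distance as a sum of binary Hamming distances.
hamming_multi <- function(X) {
  H <- Reduce(`+`, lapply(unique(as.vector(X)), function(v) {
    hamming__C_bitwise((X == v) * 1L)
  }))
  # Each mismatch is counted once for each of the two symbols involved.
  H / 2
}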

The code for both implementations is shown below.


#include <Rcpp.h>
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <algorithm>
#include <climits>
using namespace Rcpp;

// Interpret a binary column vector as an unsigned integer,
// treating the first element as the most significant bit.
unsigned long long binvec2int(IntegerMatrix x) {
    unsigned long long u = 0;
    int n = x.nrow();
    for (int i = 0; i < n; i++) {
        u += x(i, 0) * (unsigned long long) pow(2, n - 1 - i);
    }
    return u;
}

// [[Rcpp::export]]
NumericMatrix hamming__C_bitwise(IntegerMatrix X) {
int i, j, k, from, to, h;
int stepsize = sizeof(unsigned long long) * CHAR_BIT;
int n = ceil(((double) X.nrow()) / stepsize);
IntegerMatrix x;
NumericMatrix H(X.ncol(), X.ncol());
// convert X to unsigned long long
unsigned long long *xx =
(unsigned long long *) malloc(n * X.ncol() * sizeof(unsigned long long));
i = 0;
while ((from = i * stepsize) < X.nrow()) {
for (j = 0; j < X.ncol(); j++) {
to = std::min(X.nrow(), (i + 1) * stepsize);
x = X(Range(from, to - 1), Range(j, j));
xx[i + j * n] = binvec2int(x);
}
i++;
}

for (i = 0; i < X.ncol(); i++) {
for (j = i + 1; j < X.ncol(); j++) {
for (k = 0; k < n; k++) {
h = __builtin_popcountll(xx[k + i * n] ^ xx[k + j * n]);
H(i, j) += h;
H(j, i) += h;
}
}
}
free(xx);
return H;
}

// [[Rcpp::export]]
NumericMatrix hamming__C_simple(IntegerMatrix X) {
int i, j, k, h;
NumericMatrix H(X.ncol(), X.ncol());
for (i = 0; i < X.ncol(); i++) {
for (j = i + 1; j < X.ncol(); j++) {
for (k = 0; k < n; k++) {
h = (X(k, i) != X(k, j));
H(i, j) += h;
H(j, i) += h;
}
}
}
return H;
}
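
Assuming the above code is saved in a file called, say, hamming.cpp (the file name is arbitrary), a minimal way to compile and call both functions from R could look like this:


library(Rcpp)

# Compile, link and load the two functions.
sourceCpp("hamming.cpp")

# A small random binary matrix; columns are the vectors to compare.
X <- matrix(rbinom(100 * 10, 1, 0.5), nrow = 100, ncol = 10)

# Both implementations should return the same distance matrix.
all(hamming__C_bitwise(X) == hamming__C_simple(X))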

I compared the performance of the two C/C++ functions to a very fast pure R function based on matrix multiplication that I wrote about before:

C++ vs. pure R in computing the Hamming distance.

Interestingly, the straightforward C++ implementation (C_simple) is not faster than the pure R function (pure_R). In other words, efficient use of pure R can easily outperform a fairly plain use of C++ within R. If we actually want to beat the pure R function, one way of doing so is to exploit C’s bitwise operations (C_bitwise).

Transfer learning: domain adaptation by instance-reweighting

In supervised learning, we typically train a model on labeled data (we know “the truth”) and eventually apply the model to unlabeled data (we do not know “the truth”). For example, a spam filtering model might be trained on a public email database, with emails clearly labeled as “spam” or “non-spam”. However, the model will eventually be applied to a personal inbox, where emails are not labeled. An interesting example from the life sciences is training a classifier for predicting protein interactions in some species for which biologically validated interactions are known, and applying this classifier to other species, for which no such validated interaction data exist.

But what if, in addition to missing labels, the data we apply our model to (the “target data”) is just very different from our training data (the “source data”)? For example, in a personal inbox both spam and non-spam emails may have very different characteristics compared to the emails in the public email database. Also, in protein interaction prediction, it could be important to consider that species can have very different proteomes, and therefore also different protein interactions.

In cases such as the two outlined above, what we would like to do is make sure that our model performs well on the target data, while still training on the source data. How can we do that?

The covariate shift assumption

In cases where the target data is very different from the source data, we need to think about domain adaptation. Domain adaptation can be seen as a specific case of transfer learning, and can be applied in situations where there is only a covariate shift between source and target data:

P_S(X) \neq P_T(X) \textrm{, but } P_S(Y|X=x) = P_T(Y|X=x)

Here,

  • P_S(X) represents the marginal covariate distribution of source instances.
  • P_T(X) represents the marginal covariate distribution of target instances.
  • P_S(Y|X=x) represents the conditional class distribution given source instance x.
  • P_T(Y|X=x) represents the conditional class distribution given target instance x.

In words, the first part (P_S(X) \neq P_T(X)) simply means that in general you find different emails in the public email database than in your own inbox: the target data is distributionally different from the source data. The second part (P_S(Y|X=x) = P_T(Y|X=x)) means that the class probability of an instance x is independent of whether x originated from the source or target distribution. In the example of spam filtering: if you have a specific email, then the probability of this email being spam stays the same, regardless of whether it originated from the public email database or from your personal inbox.

Model misspecification

Now you might think: if we train a classifier, why would we care about whether the source and target covariate distributions are different? We are only interested in the conditional class distribution P_T(Y|X=x), and because the assumption is that P_S(Y|X=x) = P_T(Y|X=x), we can simply train a classifier on the source data to obtain optimal performance on the target data. Well, ideally, yes. However, it has been shown that under model misspecification, covariate shift can in fact be a problem, and models typically are misspecified: we do not know what function generated our data, but most likely it was not precisely of the form that we used for fitting the data. For example, fitting a line (e.g. using logistic regression) to separate the classes in the following case would be a clear case of model misspecification:

Our data: two features (x1 and x2) and two classes (class1 and class2). It is obvious what function generated this data (points belong to class2 if x1 < 0 and x2 > 0). Therefore, fitting a line is a clear case of model misspecification.

Model misspecification in a transfer learning setting

Back to transfer learning. Remember that in our transfer learning setting, we are training on labeled source data, and will apply the resulting classifier to unlabeled target data. Moreover, the unlabeled target data is distributionally different from the source data. Let’s extend the above example, and separate the data into source and target data:

Labeled source data, and unlabeled target data: target samples generally have higher x1 values than source samples, but target class labels are unknown.

You can see that the target data is differently distributed compared to the source data: it tends to have higher x1 values, implying that P_S(X) \neq P_T(X). Furthermore, target class labels are unknown. Therefore, in training a classifier separating class1 from class2, the only thing that we can do is train on the labeled source data. Training a logistic regression classifier on the source data gives the decision boundary in the left two plots:

Left: Optimal decision boundary for the source data. Middle: The same source data decision boundary is clearly not optimal for the target data. Right: The optimal decision boundary is in fact much steeper.

The decision boundary indeed seems optimal for the source data (left plot). However, it is far from optimal for the target data (middle plot). In fact, the optimal decision boundary of the target data is much steeper (right plot). In this transfer learning setting, the model misspecification implies that it is not possible to find a logistic regression parameterization \theta, such that P_S(Y|X=x, \theta) = P_T(Y|X=x, \theta) for all x. In other words, the optimal model for the source data is different from the optimal model for the target data. This brings us to the following question: Is there a way to train a classifier on the source data, while trying to optimize for performance on the target data?

Re-weighted empirical risk minimization

It turns out, yes: we can train on the source data while optimizing for performance on the target data. Let’s first go through some math to show how (or skip ahead to the implementation in R if you are not interested). Recall that true risk minimization finds a parameterization \theta = \theta^\ast, such that the expected value of the loss function l(x,y,\theta) under the true joint distribution P(x,y) over X and Y is minimized:

\theta^\ast = \underset{\theta}{\arg\min} \; \mathbb{E}_{(x,y) \sim P(x,y)} \left[ l(x,y,\theta) \right]

Empirical risk minimization approximates true risk minimization by using the empirical joint distribution over X and Y, because the true joint distribution P(x,y) is unknown:

\theta^\ast \approx \underset{\theta}{\arg\min} \; \frac{1}{N} \sum_{i=1}^{N} l(x_i, y_i, \theta)

Note that in the above, (x_i, y_i) \sim P(x,y).

In our domain adaptation problem, we have two joint distributions, namely the source distribution P_S(x,y) and the target distribution P_T(x,y). In training on the empirical source distribution, we want to optimize for performance on the target distribution. To do this, we use our previous assumption (P_S(Y|X=x) = P_T(Y|X=x) for all x), and apply the following trick for transferring knowledge from our source domain to our target domain:

\begin{aligned} \theta^\ast_T &= \underset{\theta}{\arg\min} \; \mathbb{E}_{(x,y) \sim P_T(x,y)} \left[ l(x,y,\theta) \right] \\ &= \underset{\theta}{\arg\min} \; \mathbb{E}_{(x,y) \sim P_S(x,y)} \left[ \frac{P_T(x,y)}{P_S(x,y)} \, l(x,y,\theta) \right] \\ &= \underset{\theta}{\arg\min} \; \mathbb{E}_{(x,y) \sim P_S(x,y)} \left[ \frac{P_T(x)}{P_S(x)} \, l(x,y,\theta) \right] \\ &\approx \underset{\theta}{\arg\min} \; \frac{1}{N_S} \sum_{i=1}^{N_S} \frac{P_T(x_i)}{P_S(x_i)} \, l(x_i, y_i, \theta) \end{aligned}

Here, the third line follows by factoring the joint distributions as P(x,y) = P(x) P(y|x) and canceling the conditional class distributions using the covariate shift assumption.

Note that in the above, (x_i, y_i) \sim P_S(x,y). So we started with the normal formulation of true risk minimization under the target distribution, and showed that we can approximate this by re-weighting each source instance (x_i, y_i) in an empirical risk minimization under the source distribution! More specifically, each instance (x_i, y_i) needs to be re-weighted by the ratio of the marginal covariate probabilities \frac{P_T(x_i)}{P_S(x_i)}. Interestingly, the above suggests that doing re-weighted empirical risk minimization is essentially the same as performing importance sampling for computing the expected value of the loss function under the target joint distribution, with the additional assumption that the conditional class probabilities between the source and target data are the same.

How to estimate the marginal probability ratio?

The problem with the above result is that P_T(x) and P_S(x) are difficult to determine. However, we can avoid computing these probabilities directly, by interpreting the marginal probability ratio as another probability ratio: the ratio of the probabilities that x_i comes from the target data and from the source data, weighted by the ratio of the source data size N_S and the target data size N_T:

\frac{P_T(x_i)}{P_S(x_i)} \approx \frac{N_S}{N_T} \frac{P(x_i \textrm{ comes from the target data})}{P(x_i \textrm{ comes from the source data})}

Why is this? Well, here’s an argument for the discrete case. Suppose we independently draw two random samples, one of size N_S from the source distribution, and one of size N_T from the target distribution. We merge these two random samples, and from this sample of size N_S + N_T we draw a single instance x_i. What is the probability that x_i originated from the target distribution? If n^S_i is the number of occurrences of x_i in the random source sample of size N_S, and n^T_i is the number of occurrences of x_i in the random target sample of size N_T, then the following represents the probability that x_i originated from the target distribution:

P(x_i \textrm{ comes from the target data}) = \frac{n^T_i}{n^T_i+n^S_i}

Similarly, for the source data:

P(x_i \textrm{ comes from the source data}) = \frac{n^S_i}{n^S_i+n^T_i}

Now what’s the expected value for their ratio?

\frac{P(x_i \textrm{ comes from the target data})}{P(x_i \textrm{ comes from the source data})} = \frac{n^T_i}{n^S_i} \approx \frac{\mathbb{E}\left[n^T_i\right]}{\mathbb{E}\left[n^S_i\right]} = \frac{N_T P_T(x_i)}{N_S P_S(x_i)}

Multiplying this ratio by \frac{N_S}{N_T} thus recovers the marginal probability ratio \frac{P_T(x_i)}{P_S(x_i)} that we are after.

So all we need to do is estimate, for each source instance, the probability that it originated from the target data. How do we do that? One straightforward way of estimating these probabilities is to train a naturally probabilistic classifier, such as a logistic regression classifier, to separate source instances from target instances.

A simple way of implementing re-weighting

We now have everything in place to train a classifier on the source data, while optimizing for performance on the target data:

  1. Compute the source instance weights:
    1. Train a logistic regression classifier separating source data from target data.
    2. Apply the classifier to each source instance x^S_i, thus computing p_i = P(x^S_i \textrm{ comes from the target data}).
    3. For each source instance x^S_i compute the instance weight w_i as w_i = \frac{p_i}{1-p_i}.
  2. Train a logistic regression classifier on the source data, separating class1 from class2, while re-weighting each source instance x^S_i by w_i.

In R, this could look as follows. First define some functions:


# Function to generate random data.
generate_data <- function(n) {

  range_x1 <- 1
  range_x2 <- 1

  # The features.
  x1 <- runif(n, -range_x1, range_x1)
  x2 <- runif(n, -range_x2, range_x2)

  # Generate class labels.
  y <- (x1 < 0 & x2 > 0) + 1

  # Generate source and target labels.
  prob <- (x1 + range_x1) / range_x1 / 2
  s <- 1:n %in% sample(n, n/2, prob = prob^5) + 1

  data.frame(
    x1 = x1,
    x2 = x2,
    y = factor(c("class1", "class2")[y]),
    s = factor(c("source", "target")[s])
  )
}

# Function to fit a logistic regression classifier,
# possibly weighted.
fitLRG <- function(df, weights = rep(1, nrow(df))) {
  # Compute the class weights.
  tab <- 1 / table(df$y)
  # Multiply by the instance weights
  weights <- as.numeric(weights * tab[match(df$y, names(tab))])
  # Fit a logistic regression model on the
  # source class label.
  fit <- coef(glmnet(
    x = as.matrix(df[, c("x1", "x2")]),
    y = df$y,
    lambda = seq(1, 0, -0.01),
    weights = weights,
    family = "binomial"
  ))
  fit[, ncol(fit)]
}

# Function to compute instance weights
compute_instance_weights <- function(df) {
  # Fit a logistic regression model on the
  # source/target indicator.
  fit <- glmnet(
    x = as.matrix(df[, c("x1", "x2")]),
    y = df$s,
    lambda = seq(1, 0, -0.01),
    family = "binomial"
  )
  # For each instance, compute the probability
  # that it came from the target data
  p <- predict(
    fit,
    newx = as.matrix(df[,c("x1", "x2")]),
    type = "response"
  )
  p <- p[, ncol(p)]
  p / (1 - p)
}

Now let’s do some transfer learning:


# Load a package for fitting logistic regression models.
library(glmnet)

# Set the seed for reproducibility.
set.seed(1)

# Generate some random data.
df <- generate_data(1e3)

# Train an unweighted classifier.
fit_unweighted <- fitLRG(df[df$s == "source",])

# Train a re-weighted classifier:
# 1. Compute the instance weights
weights <- compute_instance_weights(df)
# 2. Train a weighted classifier
fit_reweighted <- fitLRG(
  df[df$s == "source",],
  weights = weights[df$s == "source"]
)
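
To see what the two classifiers do, we can overlay their decision boundaries on the target data. The following is a sketch, assuming the coefficient vectors returned by fitLRG() above (glmnet returns the intercept as the first coefficient); the helper plot_boundary is mine:


# Plot the target data, colored by class.
with(df[df$s == "target", ], plot(x1, x2, col = y, pch = 20))

# Draw the decision boundary implied by a coefficient vector
# (intercept, b1, b2), i.e. solve intercept + b1*x1 + b2*x2 = 0 for x2.
plot_boundary <- function(fit, ...) {
  abline(a = -fit[1] / fit[3], b = -fit[2] / fit[3], ...)
}

plot_boundary(fit_unweighted, lty = 2)  # unweighted: too shallow
plot_boundary(fit_reweighted, lty = 1)  # re-weighted: much steeper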

The results confirm that instance re-weighting indeed leads to a decision boundary that is much closer to the optimal decision boundary for the target data:

Unweighted training on the source data is clearly not optimal for the target data (left plot). Re-weighted training on the source data separates the target data much better (middle plot), and in fact results in a decision boundary that lies very close to the optimal decision boundary (right plot).
R: are *apply loops faster than for loops?

In a high-level programming language such as R or Python, it is typically a good idea to avoid loops whenever possible, since looping in high-level programming languages can be very time-consuming.

However, sometimes loops cannot be avoided. In R, in addition to traditional looping constructs such as the for and while loop, the *apply() functions provide looping functionality using a single function call. It is often suggested to avoid traditional looping constructs in favor of these *apply() functions, because a single call to an *apply() function would be faster than running through a for or while loop. To investigate this claim, I will run a small benchmark: I will time a (trivial) loop that returns a vector of integers 1 to n, for different values of n, using four different variants of running a loop in R:

  • Using a for loop: the return vector is dynamically extended while iterating through the loop. This variant I call FOR_LOOP_DYN_ALLOC.
  • Using a for loop: the memory for the return vector is pre-allocated, before iterating through the loop. This variant I call FOR_LOOP_PRE_ALLOC.
  • Using R’s sapply() function.
  • Using R’s lapply() function, followed by a call to unlist() to return a vector instead of a list.

Below, you can find an R function for timing these loops, for a vector of different loop sizes N, and a number of replicates nrep to average the execution time over.


r_loops <- function(
    N, # The different loop sizes to try
    nrep # The number of replicates to average the execution time over
) {
    T <- 1e3 * data.frame(
        FOR_LOOP_DYN_ALLOC = rowMeans(replicate(nrep, {
            unlist(lapply(N, function(n) {
                system.time({
                    x <- integer(0)
                    for (i in 1:n) {
                        x <- c(x, i)
                    }
                })[3]
            }))
        })),
        FOR_LOOP_PRE_ALLOC = rowMeans(replicate(nrep, {
            unlist(lapply(N, function(n) {
                system.time({
                    x <- integer(n)
                    for (i in 1:n) {
                        x[i] <- i
                    }
                })[3]
            }))
        })),
        SAPPLY = rowMeans(replicate(nrep, {
            unlist(lapply(N, function(n) {
                system.time({
                    x <- sapply(1:n, function(i) {
                        i
                    })
                })[3]
            }))
        })),
        UNLIST_LAPPLY = rowMeans(replicate(nrep, {
            unlist(lapply(N, function(n) {
                system.time({
                    x <- unlist(lapply(1:n, function(i) {
                        i
                    }))
                })[3]
            }))
        }))
    )    

    matplot(
        N,
        T,
        type = "l",
        lty = 1,
        lwd = 2,
        col = rainbow(ncol(T)),
        log = "xy",
        xlab = "size of loop",
        ylab = "execution time (milliseconds)",
        main = sprintf("Average of %i replicates", nrep)
    )
    legend(
        "topleft",
        lty = 1,
        lwd = 2,
        col = rainbow(ncol(T)),
        legend = colnames(T)
    )

}

Let’s run this for relatively short loops and 10 replicates:


png("short_r_loops.png")
r_loops(2^(7:15), 10)
dev.off()

This is what the result looks like:

Short loops: execution time in milliseconds of different types of looping in R, as a function of the size of the loop.

It seems that for these relatively short loops, using for is actually quite comparable to using sapply(). The lapply() function seems a bit faster than sapply(), probably because sapply() is essentially a wrapper for lapply() and thus includes some extra overhead. The for loop with dynamically extended return vector (FOR_LOOP_DYN_ALLOC) is by far the worst performing.

Let’s try some longer loops (omitting FOR_LOOP_DYN_ALLOC, because of the poor performance; using only three replicates):

Long loops: execution time in milliseconds of different types of looping in R, as a function of the size of the loop.

Now this is interesting: it seems that for long loops, intelligent use of for (by pre-allocating memory) is actually faster than both of the *apply() functions! My guess is that, when calling an *apply() function, the size of the return object is dynamically increased (i.e. during the iteration). For long loops and larger amounts of memory, this will take more time compared to pre-allocation using for.

In any case, the above two plots show that *apply() is not necessarily faster than for. Therefore, an important reason for using *apply() functions may instead be that they fit the functional programming paradigm better, where everything is done using function calls and side effects are reduced: the *apply() functions are essentially ‘functionals’ that take a certain function f as an input argument. The scope of the variables defined within f is limited to f, and variables defined outside f cannot be modified inside f (except using the special scoping assignment operator <<-). This is what makes calls to *apply() functions very different from running for loops, as illustrated by the toy example below.
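
x <- 0

# Assignment inside the function passed to lapply() modifies
# a local copy of x; the outer x remains untouched.
invisible(lapply(1:3, function(i) {
    x <- x + i
}))
print(x)  # still 0

# Assignment inside a for loop modifies x directly.
for (i in 1:3) {
    x <- x + i
}
print(x)  # now 6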

SVM with recursive feature elimination in R

The support vector machine (SVM) is a very powerful classifier due to its inherent regularization properties as well as its ability to handle decision boundaries of arbitrary complexity by its formulation as a kernel method. However, suppose that we’re not so much interested in the decision boundary itself, but more in the relative importance of the features in specifying this decision boundary. For example, given two sets (i.e. ‘classes’) of samples, one from tumor tissue and one from normal tissue, and associated RNA-seq gene expression data, one may be interested in ranking genes (i.e. ‘features’) according to their importance in specifying the difference (i.e. ‘decision boundary’) between normal and tumor tissue. This represents an important problem in the field of machine learning, and is called feature ranking. Guyon et al. proposed combining SVMs with recursive feature elimination (SVM-RFE) for feature ranking. Recursive feature elimination in its simplest formulation starts with the complete set of features, and then repeats the following three steps until no more features are left:

  1. Train a model (in the present case, an SVM).
  2. Compute a ranking of features.
  3. Remove the feature with the worst rank.

For step 2, ranking the features, Guyon et al. proposed to take the coefficients to the SVM primal problem (i.e. the coefficients to the SVM hyperplane) as feature weights. Since SVMs are computed by solving the dual problem using quadratic programming, these hyperplane coefficients first need to be computed from the coefficients to the dual problem. In R, for a matrix X with samples in the rows and features in the columns, and a factor y containing the classes, this looks as follows:


# single-step SVM-RFE
svm_rfe__single_step <- function(X, y, ...) {

    library(kernlab)

    # keep track of the iteration during which
    # a feature was eliminated
    ii <- rep(NA, ncol(X))
    i <- 0
    while ( any(is.na(ii)) ) {
        # indices of remaining features
        not_elim_yet <- which(is.na(ii))
        # train the classifier on the remaining features
        fit <- ksvm(X[,not_elim_yet], y, scaled = FALSE, ...)
        # compute the primal problem coefficients from the dual
        # problem coefficients
        sv_i <- alphaindex(fit)[[1]]
        w <- t( coef(fit)[[1]] ) %*% X[ sv_i, not_elim_yet ]
        # eliminate the feature with the smallest squared weight
        to_elim <- not_elim_yet[ head(order( w * w ), 1) ]
        ii[to_elim] <- i
        i <- i + 1
    }
    # convert iterations into ranks
    i - ii

}

The above algorithm eliminates one feature (i.e. gene) per iteration. Typical RNA-seq expression datasets have hundreds to tens of thousands of features, so a call to the function above also needs to train an SVM hundreds to tens of thousands of times. Needless to say, this can take quite a while. Note that we’re most likely mainly interested in the top-ranking genes, and less so in the lower-ranking genes. Therefore, we could also eliminate relatively many features during early iterations, and then decrease the number of features to be eliminated as the iteration progresses. Taking the approach of exponential decrease (i.e. eliminating a fixed fraction of the remaining features), this looks as follows:


# multi-step SVM-RFE
# default elim_frac corresponds to single-step
svm_rfe <- function(X, y, elim_frac = 1 / ncol(X), ...) {

    library(kernlab)

    # keep track of the iteration during which
    # a feature was eliminated
    ii <- rep(NA, ncol(X))
    i <- 0
    while ( any(is.na(ii)) ) {
        # indices of remaining features
        not_elim_yet <- which(is.na(ii))
        # number of features to eliminate
        n_to_elim <- ceiling ( elim_frac * length(not_elim_yet) )
        # train the classifier on the remaining features
        fit <- ksvm(X[,not_elim_yet], y, scaled = FALSE, ...)
        # compute the primal problem coefficients from the dual
        # problem coefficients
        sv_i <- alphaindex(fit)[[1]]
        w <- t( coef(fit)[[1]] ) %*% X[ sv_i, not_elim_yet ]
        # eliminate the features with the smallest squared weights
        to_elim <- not_elim_yet[ head(order( w * w ), n_to_elim) ]
        ii[to_elim] <- i
        i <- i + 1
    }
    # convert iterations into ranks
    i - ii

}

Let’s see how this works in practice. For this, we’ll use the ‘cervical’ data from the Bioconductor package MLSeq. This dataset describes the expression levels as measured by RNA-seq of 714 miRNAs in 29 tumor and 29 normal human cervical samples. We can load and process the data, and run the above functions, using the following code:


library(MLSeq)

# load RNA-seq expression data
data(cervical)

# preprocess
y <- factor(grepl("^T[0-9].*", colnames(cervical)))
levels(y) <- c("normal", "tumor")
X <- scale(t(data.matrix(cervical)))

# SVM-RFE single-step
system.time(I <- svm_rfe(X, y))
# SVM-RFE multi-step
system.time(J <- svm_rfe(X, y, 0.1))
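
The returned vector contains one value per feature, with higher values for features that were eliminated later (see the ‘convert iterations into ranks’ step above). Under that interpretation, the top-ranked miRNAs could be listed as follows:


# The 10 features that survived elimination the longest.
head(colnames(X)[order(J, decreasing = TRUE)], 10)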

In the example above, using default values removes one feature at a time and thus trains 714 SVMs, one for each feature. On the other hand, setting elim_frac to 0.1 (i.e. at each iteration, 10% of the remaining features are eliminated) requires training only 47 SVMs. Although setting elim_frac to 0.1 trains roughly 15 times fewer SVMs, it runs approximately 30 times as fast (~1 second vs. ~33 seconds on my laptop). This is because the multi-step approach saves relatively more of the expensive SVMs trained on many features than of the cheaper SVMs trained on few features. So, we lose some resolution in the relatively uninteresting lower-ranking features, but gain a lot in terms of execution time. This gain in execution time is visualized in the following plot.

Execution time of SVM-RFE in seconds (left) and number of SVMs trained (right), as a function of the fraction of features eliminated per iteration.

Rcpp to integrate C with R

If my R script needs a speed boost, I normally use R’s .C() function to call a function written in C from within R. However, after having read many good things about the Rcpp R-package lately, I decided to give this package a try. In principle, Rcpp was developed to integrate C++ with R. However, C++ is almost a superset of C, and therefore integrating Rcpp with C instead of C++ is mostly just a matter of knowing the Rcpp interface and writing some C code.

As a first step in getting to know the Rcpp interface, I set myself the goal of designing a function myOrder() that returns the ordering of the elements of a given numeric vector. In R, this would normally be done by calling the function order(). For example, ordering the vector [2.1, 4.5, 1] would return [3, 1, 2]. I wrote two functions for doing this type of ordering and stored these in a text file myOrder.cpp, with the extension .cpp for C++:


#include <Rcpp.h>
using namespace Rcpp;

int pointer_compare(
    const void* a,
    const void* b
) {
    /* Two double-pointers p1 and p2 */
    double* p1 = *(double**)a;
    double* p2 = *(double**)b;

    /* Compare pointers based on the value they point to */
    if ( *p1 < *p2 ) {
        return -1;
    } else if ( *p2 < *p1 ) {
        return 1;
    }
    return 0;
}

// [[Rcpp::export]]
IntegerVector myOrder(
    DoubleVector x
) {

    int i, n = x.size();
    IntegerVector ret(n); 

    double** p = (double**) malloc(n * sizeof(double*));

    /*
    Retrieve the addresses of the elements in x,
    and store these in the pointer array p.
    */
    for ( i = 0; i < n; i++ ) {
        p[i] = &x[i];
    }

    /*
    Sort the pointers based on their de-referenced
    values. (i.e. the values of the elements in x)
    */
    qsort(p, n, sizeof(double*), pointer_compare);

    /*
    'Normalize' the pointers to a range of [1...n]
    by subtracting the address of the first element
    in x.
    */
    for ( i = 0; i < n; i++ ) {
        ret[i] = p[i] - &x[0] + 1; // add 1 for R
    }

    free(p);

    return ret;
}

The above functions are C code, except for only a few lines:

  • The directive using namespace Rcpp; defines a namespace for the code, such that anything used from the header file Rcpp.h does not need to be prepended with “Rcpp::”. For example, the type DoubleVector is defined in Rcpp.h, and the DoubleVector parameter of function myOrder() would normally need to be declared as “Rcpp::DoubleVector x”. By using the namespace, it is sufficient to write “DoubleVector x”.
  • The attribute // [[Rcpp::export]] specifies that the function myOrder() needs to be callable from within R.
  • IntegerVector and DoubleVector are Rcpp-specific C++ type definitions, corresponding to integer and numeric vectors in R, respectively.
  • size() is a method of Rcpp vector objects, which returns the length of the vector.

The above functions can be compiled, linked and called by running the following R code:


# load the library
library(Rcpp)
# compile, link and load the myOrder() function
sourceCpp("myOrder.cpp")
# generate some data
x <- rnorm(10)
# return the ordering of the data
myOrder(x)

Note that by calling the sourceCpp() function, the Rcpp package takes care of compiling and linking, and after that the C function can conveniently be called from within R.
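
As a quick sanity check, the result can be compared against R’s own order(); this sketch assumes a vector without NAs (and ties are vanishingly unlikely for rnorm() draws):


x <- rnorm(1e5)
all(myOrder(x) == order(x))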

It is interesting to compare the performance of R’s internal order() function with this custom-made Rcpp function.

Comparing the running time of R’s internal order() function to the myOrder() function for ordering numeric vectors.

You can see that the Rcpp function is quite a bit faster than R’s internal function. Obviously, R’s internal order() function is more generic than myOrder(). Also, myOrder() does not deal with NAs and only allows for non-decreasing ordering, to name just two things. Still, the above figure shows that, with relatively little effort and quite painlessly compared to R’s .C interface, it is possible to speed up R code substantially, even when considering low-level implemented R functions such as order().

 

Comparing for-loops in R and Python

Python and R are two of the most commonly used high-level programming languages for data analysis. An excellent infographic that compares these languages for data analysis can be found here.

It is sometimes suggested that Python is faster than R, often illustrated using relatively complex examples. However, increasing complexity usually also means an increasing number of ways of optimizing code for better performance, making a direct comparison between the two languages more convoluted. Therefore I was wondering: How do Python and R compare in something as simple as running through a for-loop? To make things as simple as possible, I will compare Python and R in running through an empty for-loop. More specifically, within the body of the loop, nothing will happen at all, so all that’s measured is the actual looping time. In R, for a loop of 10 million iterations, repeated 10 times, this looks as follows:


N <- 1:1E7
nrep <- 10

cat(replicate(nrep, system.time({
	for ( n in N ) { ; }
})[3]), sep = "\n")

In Python, the repeated empty for-loop looks as follows:


import time

N = xrange(0,int(1E7))
nrep = xrange(0,10)

for rep in nrep:
   t0 = time.time()
   for n in N:
      pass
   print time.time() - t0

The results of running these empty for-loops are shown below.

Comparing execution times of for-loops of 10 million iterations, repeated 10 times.

Interestingly, R is quite a bit faster than Python! Note of caution: I have only tested this on OS X 10.9.5, using R 3.1.2 and Python 2.7.11, and I cannot exclude the possibility that results would look different under different OSs using different versions of R and Python. Also, to a certain extent, the question of which of these two languages is faster is not very relevant. More specifically, while in general it is important for programming languages to be fast, the primary importance of these two languages is that they are convenient for data analysis, and when it comes to data analysis both languages are great, but just serve slightly different purposes. (Again, for more details, have another look at the infographic.)

Piano competitions and successful careers: the statistics

In my first post about piano competitions, I explained how I collected data from a website describing more than 11000 results from about 3000 piano competitions. By doing some elementary analysis in the statistical programming language R, among other things I confirmed my suspicion that Italians do very well in piano competitions, in terms of number of prizes won per million inhabitants of a country.

However, doing well in piano competitions should not be an end in itself, and there are quite a few examples of pianists winning top prizes at major piano competitions without going on to have a successful career. Therefore, one might wonder how predictive doing well in piano competitions is of achieving this ultimate goal: a successful career. To try and answer this question, we first need to think about the two things we want to relate:

  1. What defines “doing well in piano competitions”? We only have data for pianists who actually reached the finals of any given competition, not for those not reaching the finals. Therefore, “doing well” is determined by the ranking of the pianists within each of the finals, i.e. 1st prize, 2nd prize, 3rd prize, etc., and possibly finalists without a prize.
  2. What defines a “successful career”? Obviously, one could think of measures such as “number of concerts per year” or “yearly income”. Needless to say, these data are pretty hard to come by. Therefore, I decided to go with the relatively pragmatic “number of Google hits on the rainy Saturday afternoon of November 28, 2015”, as measured using the Custom Search API from the Google Developers Console for doing the following search: <“first_name last_name” piano>. In other words, the assumption is: the more Google hits, the more success.

So, we will try and establish whether knowing the prize won in a competition allows us to predict the number of Google hits. We will call this the “prize effect”, i.e. the effect of prize on number of Google hits. For example, we can take the Queen Elisabeth Competition, and plot the prize won by each finalist against his/her number of Google hits:

For the Queen Elisabeth Competition (years 2003, 2007, 2010, 2013), the prize won by each finalist against his/her number of Google hits. Note that in a Queen Elisabeth Competition finals, there are 12 finalists, only 6 of whom are actually awarded a prize (prizes 1 to 6). The remaining 6 do not receive a prize, but here I artificially award them with prize “7”.

We can see that 1st prize winners generally have more Google hits than finalists without a prize, so there indeed seems to be a weak trend. (Statistically, this observation is confirmed by the Kendall correlation coefficient of -0.3 and corresponding p-value of 0.0055)

OK, simple enough, right? Well… maybe not. These are only results for the Queen Elisabeth Competition. What if we want to assess the same trend, but based on all competitions simultaneously? Now we have a problem, because some competitions are more prestigious than others. In other words, you might expect someone coming in 3rd at the Van Cliburn Competition to have better chances of having a successful career (and thus more Google hits) than someone coming in 3rd at the Premio de Piano Frechilla-Zuloaga, simply because the Van Cliburn Competition is a more prestigious competition. We will call this the “competition effect”. Also, it is not unlikely that the number of Google hits in November 2015 is influenced by the year in which a competition was won. We will call this the “year effect”. So what we want to do is determine the “prize effect” on the number of Google hits, while correcting for the “competition effect” and the “year effect”. (Now we’ll get into some technical details, but feel free to skip these and head directly to the conclusion to learn whether prize indeed predicts a successful career.) Fortunately, there is a class of statistical models called mixed models that can help us out here. More specifically, we’ll use the lme4 R-package to construct a mixed model predicting number of Google hits from a fixed effect “prize”, a fixed effect “year”, and a random effect “competition”. Suppose that data.frame K0 contains our data, namely columns:

  1. nhits: number of Google hits
  2. prize: the prize won
  3. year: the year in which a prize was won
  4. competition: the name of the competition

Then one way of determining whether the prize effect is significant when taking into account the competition and year effects is the following:


# Load the package for fitting mixed models.
library(lme4)

# Log-transform the number of Google hits, to make it a bit better
# behaved in terms of distribution. To make the log-transform work,
# we first need to add a 'pseudocount' of 1, so as to avoid taking
# the logarithm of 0.
K0$nhits <- log10(K0$nhits + 1)
# Make "year" into a factor, such that it will be treated as a 
# categorical variable. 
K0$year <- factor(K0$year)
# Train the null model, predicting nhits only from competition.
fit.null <- lmer(
	nhits ~ year + (1|competition),
	data = K0,
	REML = FALSE
)
# Train the full model, predicting nhits from prize and competition.
fit.full <- lmer(
	nhits ~ prize + year + (1|competition),
	data = K0,
	REML = FALSE
)
# compare the null model with the full model by performing a 
# likelihood ratio test using the anova() function
anova(fit.full,fit.null)

Note that year is treated as a categorical variable. This is because it is likely that the effect of year on nhits is nonlinear. Indeed, I observed this for the Queen Elisabeth competition, with relatively many Google hits for the 2003 and 2013 competitions, and fewer for the 2007 and 2010 competitions. This could be explained by the fact that 2013 laureates get more hits due to recently winning a prize, and that 2003 laureates get more hits because they have had 10 years to establish a career. This nonlinearity makes the year effect more difficult to deal with in linear models. However, the number of years is limited, and moreover we are not interested in assessing the magnitude of the year effect, but only in removing it. Therefore, we can treat it as a categorical variable. The above approach gives p < 2.2e-16 for the difference between the two models, thus indicating that, across competitions in general, prize indeed contributes to the number of Google hits. We can try to visualize the prize effect by plotting prize against the residuals of both the null model and the full model:

Prize against the residual log10 number of Google hits (“nhits”), for the null model as well as the full model.

This demonstrates that there is a significant correlation of prize with residual nhits in the null model that is removed when including prize as a predictor variable in the full model. This indeed indicates a trend across competitions and years for mildly more Google hits with winning top prizes. Also evident from this plot is that the residuals may not be normally distributed, thus violating one of the assumptions of mixed models. This is even more clearly seen in a Q-Q plot, below for residual nhits of the full model:

Q-Q plot of the log10 residual number of Google hits of 1st prize winners, according to the full model.

If the residuals were normally distributed, they would more closely follow the dashed line. Thus, the mixed model as described above may not be entirely appropriate for our current purpose. Therefore, in order to determine the significance of the prize effect, we may want to replace the likelihood ratio test with a non-parametric test, such as a permutation-based test. Doing this, as it turns out, also gives a significant result, namely p < 0.001 using 1000 permutations. Thus, prize can predict number of Google hits, at least to a certain extent. This is also indicated by the highly significant non-parametric Kendall correlation of prize with residual nhits in the null model. However, at -0.081 the magnitude of this correlation is fairly small. Note of caution: strictly speaking we have not established that winning a top prize actually causes a higher number of Google hits; we have only established an undirected association between the two. Nonetheless, considering that all competition results preceded the retrieval date of the number of Google hits, typically by 5 to 10 years, this is by far the most likely interpretation.
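
For reference, here is a minimal sketch of the permutation-based test mentioned above (assuming the data.frame K0 and the model formulas defined earlier): permute the prize column, refit both models, and compare the resulting likelihood ratio statistics to the observed one.


# Likelihood ratio statistic for the prize effect.
lr_stat <- function(K) {
	fit.null <- lmer(nhits ~ year + (1|competition), data = K, REML = FALSE)
	fit.full <- lmer(nhits ~ prize + year + (1|competition), data = K, REML = FALSE)
	as.numeric(2 * (logLik(fit.full) - logLik(fit.null)))
}

observed <- lr_stat(K0)

# Null distribution of the statistic under permuted prizes.
permuted <- replicate(1000, {
	K <- K0
	K$prize <- sample(K$prize)
	lr_stat(K)
})

# Permutation-based p-value.
(sum(permuted >= observed) + 1) / (length(permuted) + 1)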

The above observations lead to the following conclusion: if you are a finalist in a piano competition and win a top prize, you do seem to have better chances of having a successful career than a finalist not winning a top prize, at least in terms of number of Google hits. However, this “prize effect” is remarkably small when observed for piano competitions in general.

Faster Hamming distance in R (2)

In my last post, I proposed a very fast and simple matrix multiplication approach for calculating the Hamming distance between all columns of a matrix X. This approach was restricted to binary matrices:


hamming_binary <- function(X, Y = NULL) {
	if (is.null(Y)) {
		D <- t(1 - X) %*% X
		D + t(D)
	} else {
		t(1 - X) %*% Y + t(X) %*% (1 - Y)
	}
}

Since it only relies on low-level matrix algebra routines, the above function is extremely fast, much faster for example than the hamming.distance function from the e1071 package, which is implemented as an R-level nested loop and is also restricted to binary matrices. However, what if we would like to calculate the Hamming distance between strings of characters, such as DNA sequences, which are composed of As, Cs, Gs and Ts? The nice thing is that matrix multiplication can also be used for matrices with more than two possible values, and with different types of elements. Furthermore, the approach below is generalized to calculating the Hamming distances between all pairs of columns of two matrices X and Y:


hamming <- function(X, Y = NULL) {
	if (is.null(Y)) {
		uniqs <- unique(as.vector(X))
		U <- X == uniqs[1]
		H <- t(U) %*% U
		for ( uniq in uniqs[-1] ) {
			U <- X == uniq
			H <- H + t(U) %*% U
		}
	} else {
		uniqs <- union(X, Y)
		H <- t(X == uniqs[1]) %*% (Y == uniqs[1])
		for ( uniq in uniqs[-1] ) {
			H <- H + t(X == uniq) %*% (Y == uniq)
		}
	}
	nrow(X) - H
}

The function above calculates the Hamming distance between all columns of a matrix X, or of two matrices X and Y. Again matrix multiplication is used, this time for counting, between two columns x and y, the number of cases in which corresponding elements have the same value (e.g. A, C, G or T). This counting is done for each of the possible values individually, while iteratively adding the results. The end result of the iterative adding is the number of corresponding elements that are the same, i.e. the complement of the Hamming distance. Therefore, the last step is to subtract this end result H from the maximum possible distance, which is the number of rows of matrix X.
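
For example, for three short DNA sequences stored as the columns of a character matrix:


X <- matrix(c("A", "C", "G", "T",
              "A", "C", "G", "A",
              "T", "C", "G", "A"), nrow = 4)

# Pairwise Hamming distances between the three sequences.
hamming(X)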

Interestingly, such a "multi-level" Hamming distance can also be quite elegantly expressed as a sum of multiple binary Hamming distances:


hamming <- function(X, Y = NULL) {
	if (is.null(Y)) {
		uniqs <- unique(as.vector(X))
		H <- hamming_binary(X == uniqs[1])
		for ( uniq in uniqs[-1] ) {
			H <- H + hamming_binary(X == uniq)
		}
	} else {
		uniqs <- union(X, Y)
		H <- hamming_binary(X == uniqs[1], Y == uniqs[1])
		for ( uniq in uniqs[-1] ) {
			H <- H + hamming_binary(X == uniq, Y == uniq)
		}
	}
	H / 2
}

Note that at the end, we need to divide by two: a position where two columns hold different symbols, say A and T, contributes 1 to the binary Hamming distance of the A-indicator matrix and 1 to that of the T-indicator matrix, so summing the individual binary Hamming distances counts each mismatch exactly twice.

Also note that the function is not dependent on the mode of the matrix. In other words, the matrix can consist of characters, integers, and any other type of element you can store in an R matrix.

Fast Hamming distance in R using matrix multiplication


This post is strictly about binary matrices. For matrices of any data type, have a look at this post.

The Hamming distance is perhaps the most frequently used distance measure for comparing two vectors. It is defined on two vectors of equal length as the number of corresponding elements that differ between the two vectors. A special case is the Hamming distance between two binary vectors, which is often used e.g. in telecommunication to count the number of bit flips in a signal, and in bioinformatics as a similarity measure for hierarchical clustering. While this binary Hamming distance calculation between only two vectors is simple and computationally cheap, calculating all pairwise distances between the rows of a matrix quickly becomes prohibitive, as the computation time scales quadratically with the number of rows in the matrix. The following R function represents a basic implementation of calculating the Hamming distance between all rows of a matrix, using a nested for-loop:


hamming_loop <- function(X) {
	d <- matrix(0, nrow = nrow(X), ncol = nrow(X))
	for ( i in 1:(nrow(X) - 1) ) {
		for ( j in (i + 1):nrow(X) ) {
			d[i,j] <- d[j,i] <- sum(X[i,] != X[j,])
		}
	}
	d
}


However, in R, the Hamming distance between the rows of a binary matrix can be computed much faster, by exploiting the fact that the dot product of two binary vectors x and (1 - y) counts the corresponding elements where x has a 1 and y has a 0; adding the symmetric term gives the Hamming distance: d(x, y) = x^T (1 - y) + (1 - x)^T y.


hamming <- function(X) {
	D <- (1 - X) %*% t(X)
	D + t(D)
}
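
As a quick check, the two implementations agree on a small random binary matrix, but their execution times already differ substantially:


set.seed(1)
X <- matrix(rbinom(500 * 100, 1, 0.5), nrow = 500)

# Same distances...
all(hamming(X) == hamming_loop(X))

# ...very different execution times.
system.time(hamming(X))
system.time(hamming_loop(X))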

To demonstrate how much faster the matrix multiplication function is than the nested for-loop function, I ran some simulations. In addition to the above two functions, I included the function hamming.distance from the R-package e1071:

Hamming distance computation time in seconds, as a function of number of rows, while keeping the number of columns at 100.

The relatively naive nested for-loop approach is faster than the hamming.distance function from the e1071 package. This is surprising, since the e1071 hamming.distance function is also implemented as a nested loop. Strikingly, the matrix multiplication-based function is by far the fastest function for computing the Hamming distance, the reason being that it does not use any expensive looping at R-level, only efficient lower level linear algebra routines. Note that the above approach is restricted to binary matrices. However, quite conveniently, matrix multiplication can also be used for computing the Hamming distance for matrices with more than two possible values, and other types of elements (character, numeric, …), as explained in this post.

Which nationalities do well in piano competitions?

Yesterday, I was listening to Guido Agosti’s awesome piano transcription of Stravinsky’s The Firebird, in a great performance by the Italian pianist Francesco Piemontesi. I remembered Francesco Piemontesi from coming in third at the 2007 Queen Elisabeth Competition in Brussels, and I was thinking: how come it seems that Italian pianists always do so well in international competitions, compared to many other European nationalities? It struck me that this was actually an assumption, as I had never seen any concrete numbers on this, at least not across many different competitions world-wide.

Curious as I am, I paid a visit to a website collecting this type of data, to see if I could find any numbers on this. Unfortunately, the website itself provided only very limited functionality for mining the data, by far not enough to answer the question I was interested in. However, being a computational scientist / data scientist (and classically trained pianist), I could not resist: Using Firefox’s Network Monitor, I tracked down the precise HTTP GET request that was sent to query the database for any individual competition. Then, I constructed my own raw HTTP GET requests, and used the computer networking tool netcat to send these in batch to the server. This allowed me to retrieve the results in HTML for about 3000 international piano competitions. I parsed this HTML data mainly using the XML and stringr R-packages (tricky…), downloaded some more data regarding population size by country from Wikipedia, and then did some elementary analysis in R.

Behold, the first results of analyzing this data: the number of prizes per million inhabitants of a country.

Number of prizes in international piano competitions, per million inhabitants, by country.

The results are pretty interesting. Indeed, Italy ranks highest of all Western European countries! Another suspicion confirmed is that Belgium scores substantially better than the Netherlands (my home country). Somewhat surprising is that the charts are heavily dominated by Eastern Europe (top 7), and more specifically by former Soviet states, excluding Russia itself (top 5!). The top Eastern Asian country is South Korea. China ranks very low, so the large numbers of Chinese pianists may currently still be explained by China’s huge population.

Although the results are quite interesting, some caution regarding the interpretation particularly of small differences is in order:

  1. Population sizes are from 2014-2015; competition results are from 2000-2015.
  2. Some countries do not exist anymore. For example, prizes won by Yugoslavian pianists are not counted towards any of the currently existing states that previously made up Yugoslavia.
  3. The number of prizes in a country may somewhat depend on the number of international competitions in that country.
  4. Some numbers are very low. For example, Mongolia ranks a bit higher than China (!). However, this is based on only three prizes won by Mongolian pianists.

I will probably have another, more sophisticated, look at this data in the near future. To be continued…