Which nationalities do well in piano competitions? (2)

In my first post on piano competition statistics, I made a few interesting observations relating countries to number of prizes won (per capita) in piano competitions:

  1. Of all Western European countries, Italy wins most prizes in piano competitions.
  2. Belgium scores substantially better than the Netherlands (my home country).
  3. Many of the top-scoring countries are Eastern European (in fact, the entire top 7), and more specifically former Soviet states, excluding Russia itself.
  4. The top Eastern Asian country is South Korea.
  5. China ranks very low, so the large numbers of Chinese pianists may currently still be explained by China’s huge population.

These are observations aggregated across the years 2000 – 2015. But what about trends across 2000 – 2015? Are some countries up-and-coming, winning more and more prizes each year? Are other countries in decline? The above observations do not say anything about this. To gain more insight into trends, we need to look at the prizes won in each individual year, without aggregating across all years. As an example, have a look at the prizes won by American contestants across the years 2000 – 2015, visualized as a fraction of the total number of prizes awarded each year:

united-states__prizes_per_year
The United States have been winning more and more prizes between 2000 and 2015. Fraction of total number of prizes awarded each year won by American pianists. The trend is summarized using the Pearson correlation coefficient r (here: r = 0.62), which varies from -1 (clear trend of fewer prizes in recent years) to 0 (no trend) to 1 (clear trend of more prizes in recent years). The low p-value (p = 0.001) shows that the trend is significant, i.e. cannot be explained by random chance. It can thus be considered a ‘real’ trend.

You can see that the general trend is that Americans are winning increasingly many prizes. This increasing trend can be summarized in a single number, the correlation coefficient, which we will denote by r. The correlation coefficient r can take on any value between -1 and 1, with

  • r = -1: meaning a clear trend of fewer prizes in recent years,
  • r = 0: meaning no discernible trend,
  • r = 1: meaning a clear trend of more prizes in recent years.

In the above plot, r = 0.62, indicating a pretty strong positive trend of winning more and more prizes. Moreover, this trend cannot be explained by random chance, as demonstrated by the low p-value (p = 0.001). It can thus be considered a ‘real’ trend. (Technical note, feel free to skip: the figure uses the Mann-Kendall test for monotonic trend detection.)
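For those curious how such a trend summary could be computed in R, below is a minimal sketch for a single country. The data frame d with columns year and frac (the fraction of prizes won in that year) is an illustrative assumption, not the actual analysis code.


# summarize the trend for one country; 'd' is an assumed data frame with
# columns 'year' and 'frac' (fraction of prizes won in that year)
r <- cor(d$year, d$frac)  # Pearson correlation coefficient summarizing the trend
# Mann-Kendall test for a monotonic trend, i.e. a Kendall correlation test
# against time
p <- cor.test(d$year, d$frac, method = "kendall")$p.value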

Now you are probably curious to see this number r for all individual countries, and not just for the United States, so that you can see which countries are up-and-coming, and which are on the decline. This is precisely what the barplot below shows: for each country, the trend in prize-winning between 2000 and 2015 is summarized in a single correlation coefficient r, and a bar is colored red if the trend is significant (p < 0.05), i.e. can be considered a ‘real’ trend.

prizes_per_year_and_country
Which countries have been winning increasingly many prizes between 2000 and 2015? For each country, the trend in number of prizes won across the years 2000 – 2015, summarized using the Pearson correlation coefficient. For bars > 0, generally there were more prizes in recent years. For bars < 0, generally there were fewer prizes in recent years. For the red bars, these trends could not be explained by random chance.

How does the above plot compare to the five observations stated at the beginning of this post?

  1. Top Western European prize-winner Italy is on the decline. The original observation was that, when aggregating across the years 2000 – 2015, Italy has won most prizes per capita. The above barplot additionally shows that, when looking at the trend across the years 2000 – 2015, Italians have however been winning fewer prizes in recent years (r < 0; p < 0.05).
  2. Are the Netherlands overtaking Belgium? The original observation was that, when aggregating across the years 2000 – 2015, Belgium has won more prizes per capita than the Netherlands. The above barplot additionally suggests that, when looking at the trend across the years 2000 – 2015, the Netherlands may however have been improving a bit throughout the years, although the improvement is not significant (r > 0; p > 0.05).
  3. Top prize-winner Estonia is on the decline. The original observation was that, when aggregating across the years 2000 – 2015, of any country Estonia has won most prizes (per capita). The above barplot additionally shows that, when looking at the trend across the years 2000 – 2015, Estonia is however on the decline and has been winning fewer prizes in recent years (r < 0; p < 0.05).
  4. South Korea is the strongest up-and-coming country. The original observation was that, when aggregating across the years 2000 – 2015, of any Eastern Asian country South Korea has won most prizes per capita. The above barplot additionally shows that, when looking at the trend across the years 2000 – 2015, South Korea can be identified as the strongest up-and-coming country, winning more and more prizes each year (r > 0; p < 0.05).
  5. Beware of China with its huge population. The original observation was that, when aggregating across the years 2000 – 2015, China ranked very low due to its large population. The above barplot additionally makes clear that, when looking at the trend across the years 2000 – 2015, the Chinese are however winning more and more prizes (r > 0; p < 0.05).

Finally, a minor note of caution with respect to the interpretation of the above results: It may be that in some cases a small part of the increase in number of prizes can be explained by population growth, which was not taken into account. On a related note: Population size itself is not relevant in this analysis, as we are only looking at the increase (or decrease) in number of prizes, regardless of the population size.

Bias-variance decomposition

In machine learning and statistics, predictive models are typically not perfect. But what does ‘not perfect’ mean? For example, take the case of weather forecasting, and imagine a weatherman who is simply overly optimistic, and too often predicts sunny weather and high temperatures. On the other hand, one could also think of a weatherman who, for some complicated reasons, on some days grossly overestimates the temperature, while on other days grossly underestimates it, such that on average he is closer to the true temperature than the first weatherman. In absolute terms, the errors in degrees Celsius/Fahrenheit made by both weathermen could be comparable. However, the first weatherman may be considered more reliable because he is consistent, while the second may be considered more reliable because on average he is closer to the truth. This is analogous to the bias-variance trade-off in machine learning and statistics: if we see a model making errors, is this generally the result of bias (e.g. consistently predicting high temperatures), of variance (e.g. wildly varying temperature predictions across days), of noise (weather is just unpredictable…), or of a combination of all three?

It turns out that mathematically speaking, the error made by a model can indeed be decomposed into two terms corresponding to bias and variance, plus one additional term, representing noise (e.g. daily fluctuations in temperature we cannot account for). To prove this is not difficult. However, as I found many proofs online somewhat lacking in detail, I have written my own, which will consequently be one of the longer proofs you will find online:
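For reference, and in slightly simplified notation (the proof spells out the precise expectations involved), the identity being proved is the standard decomposition of the expected squared error of a prediction \hat{y} of an observation y = f(x) + \epsilon, where \epsilon is zero-mean noise with variance \sigma^2:

E\left[\left(y - \hat{y}\right)^2\right] = \left(f(x) - E[\hat{y}]\right)^2 + E\left[\left(\hat{y} - E[\hat{y}]\right)^2\right] + \sigma^2

with the three terms corresponding to squared bias, variance and noise, respectively.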


bias_variance_noise_decomposition

A nice example for demonstrating bias and variance is the estimation of population variance from sample variance. For example, suppose that we want to estimate the variance in height of all American men from knowing the heights of only 3 American men. In other words, we want to use the sample variance (of 3 American men) as an approximation of the population variance (of all American men). In principle, for points x_1, x_2, \dots, x_n, the variance is computed as follows:

\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2


However, it turns out that if you want to use the sample variance as an approximation of the population variance, the above calculation is biased, and you need what’s called Bessel’s correction in calculating the variance:

\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2


The above calculation is not biased. Many proofs of this can be found online, three of them on Wikipedia. I will give a demonstration of this bias by simulation: 10000 times, a sample of size 3 is drawn from the standard normal distribution (zero mean and unit variance). For each of the 10000 samples, the variance is calculated with and without Bessel’s correction. The results are summarized in the density plot below.
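A minimal sketch of this simulation in R (the density plot itself is drawn separately from these values):


# draw 10000 samples of size 3 from the standard normal distribution
set.seed(1)
n <- 3
nsim <- 10000
samples <- matrix(rnorm(nsim * n), nrow = n)
# variance with Bessel's correction (R's var() divides by n - 1)
v_bessel <- apply(samples, 2, var)
# variance without Bessel's correction (divide by n instead of n - 1)
v_plain <- v_bessel * (n - 1) / n
mean(v_bessel)  # close to the true value 1
mean(v_plain)   # biased downwards, around (n - 1) / n = 2/3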

bessel_correction
A sample of size 3 is drawn from the standard normal distribution, and the variance is calculated, both with and without Bessel’s correction. This is repeated 10000 times.

The plot above confirms that in calculating the variance using Bessel’s correction, the average of the 10000 approximations of the population variance is very close to the true value (1), much closer in fact than when not using Bessel’s correction. However, its spread is larger compared to not using Bessel’s correction. In other words, Bessel’s correction leads to lower bias but higher variance. Therefore, Bessel’s correction is weatherman no. 2.

A slightly more involved example of the bias-variance trade-off is the following. Imagine (“population”) data generated by taking the sine function on the domain [-\pi, \pi] and adding some Gaussian noise. A complete cycle could look like this:

data
A complete cycle with and without Gaussian noise.

Now, suppose we would be given three training data points, generated in the way described above:

  1. Randomly sample three points x_1, x_2, x_3 from the domain [-\pi, \pi].
  2. Generate Gaussian noise \epsilon_1, \epsilon_2, \epsilon_3 , one for each of the x_i.
  3. Compute y_i = \sin(x_i) + \epsilon_i, as sketched in the code below.

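A minimal sketch of this generation step in R (the standard deviation of the noise is an assumed value, not one taken from the original experiment):


x <- runif(3, min = -pi, max = pi)   # step 1: sample three points from [-pi, pi]
eps <- rnorm(3, mean = 0, sd = 0.2)  # step 2: Gaussian noise (sd = 0.2 is assumed)
y <- sin(x) + eps                    # step 3: noisy observations of the sine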
I should emphasize that we do not actually know that the three data points were generated using the sine function; we just see the three data points. Based on these three points (x_1, y_1), (x_2, y_2), (x_3, y_3), we want to fit two models, a very simple one (a constant, or 0th-order polynomial), and a more complex one (a line, or 1st-order polynomial). Suppose furthermore that we would repeat the sampling and fitting 1000 times, and each time we measure the error with respect to the “population” data. For fitting the constant, the results may look like this:

model_fits
Fitting a constant to three points, randomly generated by the sine function and adding noise; repeated 1000 times.

For fitting the line, the results may look like this:

model_fits
Fitting a line to three points, randomly generated by the sine function and adding noise; repeated 1000 times.

It can already be seen that fitting a line seems to better capture the overall increasing trend of the data than does fitting a constant. However, this comes at the expense of high variance in the line fits compared to the constant fits.

This bias-variance trade-off across the 1000 fits is even more clearly visualized by summarizing the empirical errors across the 1000 fits, with the empirical error decomposed into bias and variance. For fitting constants, the decomposition looks like this:

bias_variance_noise
Bias-variance decomposition of the errors observed when fitting a constant 1000 times.

For fitting a line, the decomposition looks as follows:

bias_variance_noise
Bias-variance decomposition of the errors observed when fitting a line 1000 times.

It is clear that although the complex model (the line) on average has a better fit (i.e. low bias), this comes at the expense of much larger variance between the individual fits. Hence, in statistical and machine learning modeling, the impact of model complexity on bias and variance should always be considered carefully.

Efficient recommending with the arules package

The arules package is a great R package for inferring association rules using the Apriori and Eclat algorithms, and can for example be used for recommending items to users, based on known purchases of these items by the same, or possibly different, users.

In this package, one can use the apriori() and eclat() functions to fit a model consisting of association rules. The arules documentation gives the following example:


library(arules)
data(Adult)
imat <- as(Adult, "itemMatrix")

rules <- apriori(
  imat,
  parameter = list(
    supp = 0.5,
    conf = 0.9,
    target = "rules"
  )
)

However, one thing I’ve personally missed in the arules package is a general recommend() function, analogous to a predict() function that often accompanies a fit() function in R packages, to make predictions based on a fitted model and new data. A recommend() function would predict for each combination of user and item whether that particular item would be recommended to that particular user. Such a function can be written as follows:


recommend <- function(
  rules, # object of class rules
  newdata # object of class itemMatrix
) {

  # Which transactions in newdata match
  # the LHS of which rule?
  lhs_match <- is.superset(
    newdata,
    lhs(rules),
    sparse = TRUE
  )

  # For which (item, transaction) pairs is
  # the RHS also matched?
  rec <- lhs_match %*% t(rhs(rules)@data)

  # Make sure the row/column names
  # are the same
  rownames(rec) <- rownames(newdata)
  colnames(rec) <- colnames(newdata)

  # return an itemMatrix
  as(t(rec), "itemMatrix")
}

Building on the example above, recommendations are then done as follows:


rec <- recommend(rules, imat)
# If desired, remove known values
rec@data[imat@data] <- FALSE

Note that the above recommend() function is quite efficient, as it does not do any looping, but instead uses matrix multiplication after the call to is.superset() to combine matches to the left-hand side of rules with matches to the right-hand side of rules.

R: are *apply loops faster than for loops?

In a high-level programming language such as R or Python, it is typically a good idea to avoid loops whenever possible, since looping in high-level programming languages can be very time-consuming.

However, sometimes loops cannot be avoided. In R, in addition to traditional looping constructs such as the for and while loop, the *apply() functions provide looping functionality using a single function call. It is often suggested to avoid traditional looping constructs in favor of these *apply() functions, because a single call to an *apply() function would be faster than running through a for or while loop. To investigate this claim, I will run a small benchmark: I will time a (trivial) loop that returns a vector of integers 1 to n, for different values of n, using four different variants of running a loop in R:

  • Using a for loop: the return vector is dynamically extended while iterating through the loop. This variant I call FOR_LOOP_DYN_ALLOC.
  • Using a for loop: the memory for the return vector is pre-allocated, before iterating through the loop. This variant I call FOR_LOOP_PRE_ALLOC.
  • Using R’s sapply() function.
  • Using R’s lapply() function, followed by a call to unlist() to return a vector instead of a list.

Below, you can find an R function for timing these loops, for a vector of different loop sizes N, and a number of replicates nrep to average the execution time over.


r_loops <- function(
    N, # The different loop sizes to try
    nrep # The number of replicates to average the execution time over
) {
    T <- 1e3 * data.frame(
        FOR_LOOP_DYN_ALLOC = rowMeans(replicate(nrep, {
            unlist(lapply(N, function(n) {
                system.time({
                    x <- integer(0)
                    for (i in 1:n) {
                        x <- c(x, i)
                    }
                })[3]
            }))
        })),
        FOR_LOOP_PRE_ALLOC = rowMeans(replicate(nrep, {
            unlist(lapply(N, function(n) {
                system.time({
                    x <- integer(n)
                    for (i in 1:n) {
                        x[i] <- i
                    }
                })[3]
            }))
        })),
        SAPPLY = rowMeans(replicate(nrep, {
            unlist(lapply(N, function(n) {
                system.time({
                    x <- sapply(1:n, function(i) {
                        i
                    })
                })[3]
            }))
        })),
        UNLIST_LAPPLY = rowMeans(replicate(nrep, {
            unlist(lapply(N, function(n) {
                system.time({
                    x <- unlist(lapply(1:n, function(i) {
                        i
                    }))
                })[3]
            }))
        }))
    )    

    matplot(
        N,
        T,
        type = "l",
        lty = 1,
        lwd = 2,
        col = rainbow(ncol(T)),
        log = "xy",
        xlab = "size of loop",
        ylab = "execution time (milliseconds)",
        main = sprintf("Average of %i replicates", nrep)
    )
    legend(
        "topleft",
        lty = 1,
        lwd = 2,
        col = rainbow(ncol(T)),
        legend = colnames(T)
    )

}

Let’s run this for relatively short loops and 10 replicates:


png("short_r_loops.png")
r_loops(2^(7:15), 10)
dev.off()

This is what the result looks like:

short_r_loops
Short loops: execution time in milliseconds of different types of looping in R, as a function of the size of the loop.

It seems that for these relatively short loops, using for is actually quite comparable to using sapply(). The lapply() function seems a bit faster than sapply(), probably because sapply() is essentially a wrapper for lapply() and thus includes some extra overhead. The for loop with dynamically extended return vector (FOR_LOOP_DYN_ALLOC) is by far the worst performing.

Let’s try some longer loops (omitting FOR_LOOP_DYN_ALLOC, because of the poor performance; using only three replicates):

long_r_loops
Long loops: execution time in milliseconds of different types of looping in R, as a function of the size of the loop.

Now this is interesting: it seems that for long loops, intelligent use of for (by pre-allocating memory) is actually faster than both of the *apply() functions! My guess is that, when calling an *apply() function, the size of the return object is dynamically increased (i.e. during the iteration). For long loops and larger amounts of memory, this will take more time compared to pre-allocation using for.

In any case, the above two plots show that *apply() is not necessarily faster than for. Therefore, an important reason for using *apply() functions may instead be that they fit the functional programming paradigm better, where everything is done using function calls and side effects are reduced: the *apply() functions are essentially ‘functionals’ that take a certain function f as an input argument. The scope of the variables defined within f is limited to f, and variables defined outside f cannot be modified inside f (except using the special scoping assignment operator <<-). This is what makes calls to *apply() functions very different from running for loops.
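A small toy illustration of this scoping difference:


total <- 0
invisible(lapply(1:3, function(i) {
  total <- total + i  # assigns to a local 'total' inside the function
}))
total  # still 0: the outer 'total' was not modified

for (i in 1:3) {
  total <- total + i  # modifies 'total' in the enclosing environment
}
total  # 6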

Order of appearance and final ranking in the Queen Elisabeth Competition

The degree to which competitions can be objectively judged depends on the type of competition. For example, athletes competing in a 100m sprint can be objectively ranked by the time they need to cross the finish line. On the other hand, objective judgement is virtually impossible in music competitions, because so much depends on highly subjective personal opinions, which can differ widely across members of a jury. A famous example is given by the 1980 Chopin Competition in Warsaw, where Ivo Pogorelich was eliminated in the 3rd round. This prompted Martha Argerich, declaring Pogorelich a ‘genius’, to resign from the jury in protest. It is also important to note that, since personal opinions play such an important role in judging music competitions, there is an increased risk of biases resulting from non-music related (i.e. confounding) factors. For example, it is often said that the order of appearance affects the final ranking of the candidates. In addition to jury members as a potential cause of these biases, the contestants themselves may also be involved in generating these biases, for example when a specific positioning within the program of a competition has a certain psychological effect on a contestant.

The Queen Elisabeth Competition is one of the most prestigious music competitions in the world, and while watching the final round of this year’s edition in May, I was wondering to what extent the order of appearance has influenced the final rankings over the years. To assess this influence, I collected the data on order of appearance and ranking in all Queen Elisabeth Competition finals since 1952 from the website of the competition, in total 216 results for 18 piano competitions. In the final round of the Queen Elisabeth Competition, 12 pianists give their performances spread across 6 days, where each pianist is randomly (!) allocated a specific time slot on one of the 6 days, either before or after the intermission. Hence, one can imagine the final ranking being influenced (‘biased’) by the day of performance (1 – 6), as well as by the order of performance on a given day (1 or 2). This is visualized in the boxplot below, where prizes won by finalists over the years are divided into categories based on the day of performance, and the order of performance on a given day.

prize_from_day_times_program
How does the prize won by a Queen Elisabeth Competition finalist depend on the day and order of performance? Each red dot represents a prize won by a candidate. For each combination of day and order of performance there is a gray box that summarizes the prizes won by candidates in that slot: the thick black line within each of the gray boxes represents the typical prize for that particular slot. More specifically, it is the median prize, i.e. 50% of prizes are higher and 50% of prizes are lower than the black line.

Note that for all competitions since 1995, only prizes 1 – 6 have been awarded, and the remaining finalists are unranked. In the plot above, these unranked finalists have been awarded the ‘9.5th prize’ (the average of 7 and 12).

Quite strikingly, the boxplot shows that final ranking is indeed substantially influenced by day and order of appearance. For example, if you’re the first finalist on the first day, this may mean you have lost before you have played even a single note: the 2nd finalist on the last day was typically ranked 5 to 6 places higher than the 1st finalist on the first day! (as measured by the difference in median prize, i.e. the thick black line within each of the gray boxes). At least equally striking is how prize depends on the order of performance on a given day. For example, as a finalist you may be happy about being selected to play on the last day. However, it may not do you any good, unless you are scheduled to play after the intermission: the 2nd finalist on the last day was typically ranked 5 places higher than the 1st finalist on that same day! More generally, the 1st finalist on a given day typically did quite a bit worse than the 2nd finalist on the same day. Moreover, a 1st finalist on days 4, 5 or 6 typically did worse than a 2nd finalist on days 1, 2 or 3.

The above observations imply that if we would want to place bets regarding the final ranking, without having any prior knowledge of music, we may actually be able to do quite a bit better than random guessing. For example, suppose we would want to predict whether a finalist receives a prize or is unranked (or, for competitions before 1995: whether a finalist ranks between 1 and 6, or between 7 and 12). In this case, by random guessing we would expect to classify 6 out of 12 finalists correctly. Guessing more than 6 correctly as receiving a prize or not is better than expected. (Some technicalities follow, but feel free to skip to the last paragraph of this post.) One way of trying to do better than the expected number of 6 correct guesses is to use a Random Forest classifier. Doing this in the statistical programming language R, using the randomForest package, gives a specificity of 0.56 and a sensitivity of 0.71, as determined using the predicted classes of the input samples based on out-of-bag samples (p ~= 0.0001 using Fisher’s exact test). Together, these give a generalization error rate of 37%.
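A minimal sketch of how such a classifier could be set up (the data frame and column names here are hypothetical, not the actual analysis code):


library(randomForest)

# 'finals' is a hypothetical data frame with one row per finalist and columns:
#   day   - day of performance (1-6)
#   slot  - order of performance on that day (1 or 2)
#   prize - final ranking (1-12, unranked finalists coded as 9.5)
finals$top6 <- factor(finals$prize <= 6)
fit <- randomForest(top6 ~ day + slot, data = finals)

# out-of-bag confusion matrix, from which sensitivity, specificity and the
# generalization error rate can be read off
fit$confusion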

Hence, using a classifier it is expected that one would classify 100% – 37% = 63% of finalists correctly as receiving a prize or not, purely based on day and order of performance of the finalists. Note that 63% of finalists amounts to 7 – 8 finalists, which is 1 – 2 more than expected by random guessing. This again demonstrates the substantial bias in the final ranking of finalists in the Queen Elisabeth Competition, induced by day of performance, and order of performance on a given day.


SVM with recursive feature elimination in R

The support vector machine (SVM) is a very powerful classifier due to its inherent regularization properties as well as its ability to handle decision boundaries of arbitrary complexity by its formulation as a kernel method. However, suppose that we’re not so much interested in the decision boundary itself, but more in the relative importance of the features in specifying this decision boundary. For example, given two sets (i.e. ‘classes’) of samples, one from tumor tissue and one from normal tissue, and associated RNA-seq gene expression data, one may be interested in ranking genes (i.e. ‘features’) according to their importance in specifying the difference (i.e. ‘decision boundary’) between normal and tumor tissue. This represents an important problem in the field of machine learning, and is called feature ranking. Guyon et al. proposed combining SVMs with recursive feature elimination (SVM-RFE) for feature ranking. Recursive feature elimination in its simplest formulation starts with the complete set of features, and then repeats the following three steps until no more features are left:

  1. Train a model (in the present case, an SVM).
  2. Compute a ranking of features.
  3. Remove the feature with the worst rank.

For step 2, ranking the features, Guyon et al. proposed to take the coefficients to the SVM primal problem (i.e. the coefficients to the SVM hyperplane) as feature weights. Since SVMs are computed by solving the dual problem using quadratic programming, these hyperplane coefficients first need to be computed from the coefficients to the dual problem. In R, for a matrix X with samples in the rows and features in the columns, and a factor y containing the classes, this looks as follows:


# single-step SVM-RFE
svm_rfe__single_step <- function(X, y, ...) {

    library(kernlab)

    # keep track of the iteration during which
    # a feature was eliminated
    ii <- rep(NA, ncol(X))
    i <- 0
    while ( any(is.na(ii)) ) {
        # indices of remaining features
        not_elim_yet <- which(is.na(ii))
        # train the classifier on the remaining features
        fit <- ksvm(X[,not_elim_yet], y, scaled = FALSE, ...)
        # compute the primal problem coefficients from the dual
        # problem coefficients
        sv_i <- alphaindex(fit)[[1]]
        w <- t( coef(fit)[[1]] ) %*% X[ sv_i, not_elim_yet ]
        # eliminate the feature with the smallest squared weight
        to_elim <- not_elim_yet[ head(order( w * w ), 1) ]
        ii[to_elim] <- i
        i <- i + 1
    }
    # convert iterations into ranks
    i - ii

}

The above algorithm eliminates one feature (i.e. gene) per iteration. Typical RNA-seq expression datasets have hundreds to tens of thousands of features, so a call to the function above also needs to train an SVM hundreds to tens of thousands of times. Needless to say, this can take quite a while. Note that we’re most likely mainly interested in the top-ranking genes, and less so in lower-ranking genes. Therefore, we could also eliminate relatively many features during early iterations, and then decrease the number of features to be eliminated as the iteration progresses. Taking the approach of exponential decrease (i.e. eliminating a fixed fraction of the remaining features each iteration), this looks as follows:


# multi-step SVM-RFE
# default elim_frac corresponds to single-step
svm_rfe <- function(X, y, elim_frac = 1 / ncol(X), ...) {

    library(kernlab)

    # keep track of the iteration during which
    # a feature was eliminated
    ii <- rep(NA, ncol(X))
    i <- 0
    while ( any(is.na(ii)) ) {
        # indices of remaining features
        not_elim_yet <- which(is.na(ii))
        # number of features to eliminate
        n_to_elim <- ceiling ( elim_frac * length(not_elim_yet) )
        # train the classifier on the remaining features
        fit <- ksvm(X[,not_elim_yet], y, scaled = FALSE, ...)
        # compute the primal problem coefficients from the dual
        # problem coefficients
        sv_i <- alphaindex(fit)[[1]]
        w <- t( coef(fit)[[1]] ) %*% X[ sv_i, not_elim_yet ]
        # eliminate the features with the smallest squared weights
        to_elim <- not_elim_yet[ head(order( w * w ), n_to_elim) ]
        ii[to_elim] <- i
        i <- i + 1
    }
    # convert iterations into ranks
    i - ii

}

Let’s see how this works in practice. For this, we’ll use the ‘cervical’ data from the Bioconductor package MLSeq. This dataset describes the expression levels, as measured by RNA-seq, of 714 miRNAs in 29 tumor and 29 normal human cervical samples. We can load and process the data, and run the above functions, using the following code:


library(MLSeq)

# load RNA-seq expression data
data(cervical)

# preprocess
y <- factor(grepl("^T[0-9].*", colnames(cervical)))
levels(y) <- c("normal", "tumor")
X <- scale(t(data.matrix(cervical)))

# SVM-RFE single-step
system.time(I <- svm_rfe(X, y))
# SVM-RFE multi-step
system.time(J <- svm_rfe(X, y, 0.1))

In the example above, using the default value removes one feature at a time and thus trains 714 SVMs, one for each feature. On the other hand, setting elim_frac to 0.1 (i.e. at each iteration, 10% of the remaining features are eliminated) needs to train only 47 SVMs. Although setting elim_frac to 0.1 trains roughly 15 times fewer SVMs, it runs approximately 30 times as fast (~1 second vs. ~33 seconds on my laptop). This is because the multi-step approach mostly cuts down on the SVMs trained on many features, which are the most expensive ones to train. So, we lose some resolution in the relatively uninteresting lower-ranking features, but gain a lot in terms of execution time. This gain in execution time is visualized in the following plot.
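A sketch of the timing experiment behind the figure (the particular set of elimination fractions tried here is an assumption):


# time the multi-step SVM-RFE for a few elimination fractions
fracs <- c(1 / ncol(X), 0.01, 0.05, 0.1, 0.2)
timings <- sapply(fracs, function(f) {
    system.time(svm_rfe(X, y, elim_frac = f))["elapsed"]
})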

Execution time of SVM-RFE in seconds (left) and number of SVMs trained (right); as a function of the fraction of features eliminated per iteration.

Rcpp to integrate C with R

If my R script needs a speed boost, I normally use R’s .C() function to call a function written in C from within R. However, after having read many good things about the Rcpp R-package lately, I decided to give this package a try. In principle, Rcpp was developed to integrate C++ with R. However, C++ is almost a superset of C, and therefore integrating Rcpp with C instead of C++ is mostly just a matter of knowing the Rcpp interface and writing some C code.

As a first step in getting to know the Rcpp interface, I set myself the goal of designing a function myOrder() that returns the ordering of the elements of a given numeric vector. In R, this would normally be done by calling the function order(). For example, ordering the vector [2.1, 4.5, 1] would return [3, 1, 2]. I wrote two functions for doing this type of ordering and stored these in a text file myOrder.cpp, with the extension .cpp for C++:


#include <Rcpp.h>
using namespace Rcpp;

int pointer_compare(
    const void* a,
    const void* b
) {
    /* Two double-pointers p1 and p2 */
    double* p1 = *(double**)a;
    double* p2 = *(double**)b;

    /* Compare pointers based on the value they point to */
    if ( *p1 < *p2 ) {
        return -1;
    } else if ( *p2 < *p1 ) {
        return 1;
    }
    return 0;
}

// [[Rcpp::export]]
IntegerVector myOrder(
    DoubleVector x
) {

    int i, n = x.size();
    IntegerVector ret(n); 

    double** p = (double**) malloc(n * sizeof(double*));

    /*
    Retrieve the addresses of the elements in x,
    and store these in the pointer array p.
    */
    for ( i = 0; i < n; i++ ) {
        p[i] = &x[i];
    }

    /*
    Sort the pointers based on their de-referenced
    values. (i.e. the values of the elements in x)
    */
    qsort(p, n, sizeof(double*), pointer_compare);

    /*
    'Normalize' the pointers to a range of [1...n]
    by subtracting the address of the first element
    in x.
    */
    for ( i = 0; i < n; i++ ) {
        ret[i] = p[i] - &x[0] + 1; // add 1 for R
    }

    free(p);

    return ret;
}

The above functions are C code, except for only a few lines:

  • Line 2: brings the Rcpp namespace into scope, such that anything defined in the header file Rcpp.h and used in the above code does not need to be prepended with “Rcpp::”. For example, the type DoubleVector is defined in Rcpp.h, and the DoubleVector parameter of function myOrder() would normally need to be declared as “Rcpp::DoubleVector x”. However, by using the namespace, it is sufficient to write “DoubleVector x”.
  • Line 21: specifies that function myOrder() needs to be callable from within R.
  • Lines 22, 23, 27: IntegerVector and DoubleVector are Rcpp-specific C++ type definitions, corresponding to integer and numeric vectors in R, respectively.
  • Line 26: size() is a method for a vector object, which returns the length of that vector.

The above functions can be compiled, linked and called by running the following R code:


# load the library
library(Rcpp)
# compile, link and load the myOrder() function
sourceCpp("myOrder.cpp")
# generate some data
x <- rnorm(10)
# return the ordering of the data
myOrder(x)

Note that by calling the sourceCpp() function, the Rcpp package takes care of compiling and linking, and after that the C function can conveniently be called from within R.

It is interesting to compare the performance of R’s internal order() function with this custom-made Rcpp function.
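A quick way to check correctness and get a rough timing comparison (a sketch; the figure below was produced with a more systematic benchmark over a range of vector sizes):


x <- rnorm(1e6)
# the two functions should agree for data without ties or NAs
identical(order(x), myOrder(x))
# rough timing comparison
system.time(for (i in 1:10) order(x))
system.time(for (i in 1:10) myOrder(x))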

order
Comparing the running time of R’s internal order() function to the myOrder() function for ordering numeric vectors.

You can see that the Rcpp function is quite a bit faster than R’s internal function. Obviously, R’s internal order() function is more generic than myOrder(). Also, myOrder() does not deal with NAs and only allows for non-decreasing ordering, to name just two things. Still, the above figure shows that, with relatively little effort and quite painlessly compared to R’s .C interface, it is possible to speed up R code substantially, even when considering low-level implemented R functions such as order().


Comparing for-loops in R and Python

Python and R are two of the most commonly used high-level programming languages for data analysis. An excellent infographic that compares these languages for data analysis can be found here.

It is sometimes suggested that Python is faster than R, often illustrated using relatively complex examples. However, increasing complexity usually also means an increasing number of ways of optimizing code for better performance, making a direct comparison between the two languages more convoluted. Therefore I was wondering: How do Python and R compare in something as simple as running through a for-loop? To make things as simple as possible, I will compare Python and R in running through an empty for-loop. More specifically, within the body of the loop, nothing will happen at all, so all that’s measured is the actual looping time. In R, for a loop of 10 million iterations, repeated 10 times, this looks as follows:


N <- 1:1E7
nrep <- 10

cat(replicate(nrep, system.time({
	for ( n in N ) { ; }
})[3]), sep = "\n")

In Python, the repeated empty for-loop looks as follows:


import time

N = xrange(0,int(1E7))
nrep = xrange(0,10)

for rep in nrep:
   t0 = time.time()
   for n in N:
      pass
   print time.time() - t0

The results of running these empty for-loops are shown below.

R_vs_Python
Comparing execution times of for-loops of 10 million iterations, repeated 10 times.

Interestingly, R is quite a bit faster than Python! Note of caution: I have only tested this on OS X 10.9.5, using R 3.1.2 and Python 2.7.11, and I cannot exclude the possibility that results would look different under different OSs using different versions of R and Python. Also, to a certain extent, the question of which of these two languages is faster is not very relevant. More specifically, while in general it is important for programming languages to be fast, the primary importance of these two languages is that they are convenient for data analysis, and when it comes to data analysis both languages are great, but just serve slightly different purposes. (Again, for more details, have another look at the infographic.)


Piano competitions and successful careers: the statistics

In my first post about piano competitions, I explained how I collected data from a website describing more than 11000 results from about 3000 piano competitions. By doing some elementary analysis in the statistical programming language R, among other things I confirmed my suspicion that Italians do very well in piano competitions, in terms of number of prizes won per million inhabitants of a country.

However, doing well in piano competitions should not be an end in itself, and there are quite a few examples of pianists winning top prizes at major piano competitions without going on to have a successful career. Therefore, one might wonder how predictive doing well in piano competitions is of achieving this ultimate goal: a successful career. To try and answer this question, we first need to think about the two things we want to relate:

  1. What defines “doing well in piano competitions”? We only have data for pianists who actually reached the finals of any given competition, not for those not reaching the finals. Therefore, “doing well” is determined by the ranking of the pianists within each of the finals, i.e. 1st prize, 2nd prize, 3rd prize, etc., and possibly finalists without a prize.
  2. What defines a “successful career”? Obviously, one could think of measures such as “number of concerts per year” or “yearly income”. Needless to say, these data are pretty hard to come by. Therefore, I decided to go with the relatively pragmatic “number of Google hits on the rainy Saturday afternoon of November 28, 2015”, as measured using the Custom Search API from the Google Developers Console for doing the following search: <“first_name last_name” piano>. In other words, the assumption is: the more Google hits, the more success.

So, we will try and establish whether knowing the prize won in a competition allows us to predict the number of Google hits. We will call this the “prize effect”, i.e. the effect of prize on number of Google hits. For example, we can take the Queen Elisabeth Competition, and plot the prize won by each finalist against his/her number of Google hits:

Queen_Elisabeth_Competition
For the Queen Elisabeth Competition (years 2003, 2007, 2010, 2013), the prize won by each finalist against his/her number of Google hits. Note that in a Queen Elisabeth Competition finals, there are 12 finalists, only 6 of whom are actually awarded a prize (prizes 1 to 6). The remaining 6 do not receive a prize, but here I artificially award them with prize “7”.

We can see that 1st prize winners generally have more Google hits than finalists without a prize, so there indeed seems to be a weak trend. (Statistically, this observation is confirmed by a Kendall correlation coefficient of -0.3 and a corresponding p-value of 0.0055.)
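A sketch of this check, assuming vectors prize and nhits holding the prize and the number of Google hits for each Queen Elisabeth finalist (illustrative names, not the actual analysis code):


# Kendall correlation between prize and number of Google hits
cor.test(prize, nhits, method = "kendall")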

OK, simple enough, right? Well…, maybe not. These are only results for the Queen Elisabeth Competition. What if we want to assess the same trend, but based on all competitions simultaneously? Now we have a problem, because some competitions are more prestigious than others. In other words, you might expect someone coming in 3rd at the Van Cliburn Competition to have better chances of having a successful career (and thus more Google hits) than someone coming in 3rd at the Premio de Piano Frechilla-Zuloaga, simply because the Van Cliburn Competition is a more prestigious competition. We will call this the “competition effect”. Also, it is not unlikely that the number of Google hits in November 2015 is influenced by the year in which a competition was won. We will call this the “year effect”. So what we want to do is to determine the “prize effect” on the number of Google hits, while correcting for the “competition effect” and the “year effect”.

(Now we’ll get into some technical details, but feel free to skip these and head directly over to the conclusion to learn whether prize indeed predicts a successful career.) Fortunately, there is a class of statistical models called mixed models that can help us out here. More specifically, we’ll use the lme4 R-package to construct a mixed model predicting number of Google hits from a fixed effect “prize”, a fixed effect “year”, and a random effect “competition”. Suppose that data.frame K0 contains our data, namely columns:

  1. nhits: number of Google hits
  2. prize: the prize won
  3. year: the year in which a prize was won
  4. competition: the name of the competition

Then one way of determining whether the prize effect is significant when taking into account the competition and year effects is the following:


library(lme4)

# Log-transform the number of Google hits, to make it a bit better
# behaved in terms of distribution. To make the log-transform work, 
# we first need to add a 'pseudocount' of 1, so as to avoid taking 
# the logarithm of 0.
K0$nhits <- log10(K0$nhits+1)
# Make "year" into a factor, such that it will be treated as a 
# categorical variable. 
K0$year <- factor(K0$year)
# Train the null model, predicting nhits only from competition.
fit.null <- lmer(
	nhits ~ year + (1|competition),
	data = K0,
	REML = FALSE
)
# Train the full model, predicting nhits from prize and competition.
fit.full <- lmer(
	nhits ~ prize + year + (1|competition),
	data = K0,
	REML = FALSE
)
# compare the null model with the full model by performing a 
# likelihood ratio test using the anova() function
anova(fit.full,fit.null)

Note that year is treated as a categorical variable. This is because it is likely that the effect of year on nhits is nonlinear. Indeed, I observed this for the Queen Elisabeth Competition, with relatively many Google hits for the 2003 and 2013 competitions, and fewer for the 2007 and 2010 competitions. This could be explained by the fact that 2013 laureates get more hits because their prize is recent, while 2003 laureates get more hits because they have had 10 years to establish a career. This nonlinearity makes the year effect more difficult to deal with in linear models. However, the number of years is limited, and moreover we are not interested in assessing the magnitude of the year effect, but only in removing it. Therefore, we can treat it as a categorical variable. The above approach gives p < 2.2e-16 for the difference between the two models, thus indicating that, across competitions in general, prize indeed contributes to the number of Google hits. We can try to visualize the prize effect by plotting prize against the residuals of both the null model and the full model:

mixed_model_residuals
Prize against the residual log10 number of Google hits (“nhits”), for the null model as well as the full model.

This demonstrates that there is a significant correlation of prize with residual nhits in the null model that is removed when including prize as a predictor variable in the full model. This indeed indicates a trend across competitions and years for mildly more Google hits with winning top prizes. Also evident from this plot is that the residuals may not be normally distributed, thus violating one of the assumptions of mixed models. This is even more clearly seen in a Q-Q plot, below for residual nhits of the full model:

QQ_plot
Q-Q plot of the log10 residual number of Google hits of 1st prize winners, according to the full model.

If the residuals were normally distributed, they would more closely follow the dashed line. Thus, the mixed model as described above may not be entirely appropriate for our current purpose. Therefore, in order to determine the significance of the prize effect, we may want to replace the likelihood ratio test with a non-parametric test, such as a permutation-based test. Doing this, as it turns out, also gives a significant result, namely p < 0.001 using 1000 permutations. Thus, prize can predict number of Google hits, at least to a certain extent. This is also indicated by the highly significant non-parametric Kendall correlation of prize with residual nhits in the null model. However, at -0.081 the magnitude of this correlation is fairly small. Note of caution: strictly speaking we have not established that winning a top prize actually causes a higher number of Google hits; we have only established an undirected association between the two. Nonetheless, considering that all competition results preceded the retrieval date of the number of Google hits, typically by 5 to 10 years, this is by far the most likely interpretation.
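As an aside, one way such a permutation test could be set up is sketched below. This is an illustrative sketch building on the objects defined in the code above, not necessarily the exact procedure used here; a stricter version might, for example, permute prize within each competition.


set.seed(1)
# observed likelihood ratio statistic for the full vs. null model
obs_stat <- anova(fit.full, fit.null)$Chisq[2]
# recompute the statistic on data with permuted prizes
perm_stat <- replicate(1000, {
	K1 <- K0
	K1$prize <- sample(K1$prize)
	f0 <- lmer(nhits ~ year + (1|competition), data = K1, REML = FALSE)
	f1 <- lmer(nhits ~ prize + year + (1|competition), data = K1, REML = FALSE)
	anova(f1, f0)$Chisq[2]
})
# permutation p-value
mean(perm_stat >= obs_stat)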

The above observations lead to the following conclusion: if you are a finalist in a piano competition and win a top prize, you do seem to have better chances of having a successful career than a finalist not winning a top prize, at least in terms of number of Google hits. However, this “prize effect” is remarkably small when observed for piano competitions in general.

Faster Hamming distance in R (2)

In my last post, I proposed a very fast and simple matrix multiplication approach for calculating the Hamming distance between all columns of a matrix X. This approach was restricted to binary matrices:


hamming_binary <- function(X, Y = NULL) {
	if (is.null(Y)) {
		D <- t(1 - X) %*% X
		D + t(D)
	} else {
		t(1 - X) %*% Y + t(X) %*% (1 - Y)
	}
}

Since it only relies on low-level matrix algebra routines, the above function is extremely fast, much faster for example than the hamming.distance function from the e1071 package, which is implemented as an R-level nested loop and is also restricted to binary matrices. However, what if we would like to calculate the Hamming distance between strings of characters, such as DNA sequences, which are composed of As, Cs, Gs and Ts? The nice thing is that matrix multiplication can also be used for matrices with more than two possible values, and with different types of elements. Furthermore, the approach below is generalized to calculating the Hamming distances between all pairs of columns of two matrices X and Y:


hamming <- function(X, Y = NULL) {
	if (is.null(Y)) {
		uniqs <- unique(as.vector(X))
		U <- X == uniqs[1]
		H <- t(U) %*% U
		for ( uniq in uniqs[-1] ) {
			U <- X == uniq
			H <- H + t(U) %*% U
		}
	} else {
		uniqs <- union(X, Y)
		H <- t(X == uniqs[1]) %*% (Y == uniqs[1])
		for ( uniq in uniqs[-1] ) {
			H <- H + t(X == uniq) %*% (Y == uniq)
		}
	}
	nrow(X) - H
}

The function above calculates the Hamming distance between all columns of a matrix X, or of two matrices X and Y. Again matrix multiplication is used, this time for counting, between two columns x and y, the number of positions at which the corresponding elements have the same value (e.g. A, C, G or T). This counting is done for each of the possible values individually, while iteratively adding the results. The end result of the iterative adding is the number of corresponding elements that are the same, i.e. the complement of the Hamming distance. Therefore, the last step is to subtract this end result H from the maximum possible distance, which is the number of rows of matrix X.

Interestingly, such a "multi-level" Hamming distance can also be quite elegantly expressed as a sum of multiple binary Hamming distances:


hamming <- function(X, Y = NULL) {
	if (is.null(Y)) {
		uniqs <- unique(as.vector(X))
		H <- hamming_binary(X == uniqs[1])
		for ( uniq in uniqs[-1] ) {
			H <- H + hamming_binary(X == uniq)
		}
	} else {
		uniqs <- union(X, Y)
		H <- hamming_binary(X == uniqs[1], Y == uniqs[1])
		for ( uniq in uniqs[-1] ) {
			H <- H + hamming_binary(X == uniq, Y == uniq)
		}
	}
	H / 2
}

Note that at the end, we need to divide by two, because by summing the individual binary Hamming distances, we are counting each element twice.

Also note that the function is not dependent on the mode of the matrix. In other words, the matrix can consist of characters, integers, and any other type of element you can store in an R matrix.
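As a quick illustration, here is a toy example with three DNA-like sequences stored as the columns of a character matrix (made-up data):


# three sequences of length 4, one per column
X <- matrix(c("A", "C", "G", "T",
              "A", "C", "C", "T",
              "T", "C", "G", "A"), nrow = 4)
hamming(X)
# columns 1 and 2 differ at 1 position, columns 1 and 3 at 2,
# and columns 2 and 3 at 3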