# Piano competitions and successful careers: the statistics

In my first post about piano competitions, I explained how I collected data from a website describing more than 11000 results from about 3000 piano competitions. By doing some elementary analysis in the statistical programming language R, among other things I confirmed my suspicion that Italians do very well in piano competitions, in terms of number of prizes won per million inhabitants of a country.

However, doing well in piano competitions should not be an end in itself, and there are quite some examples of pianists winning top prizes at major piano competitions without going on to having a successful career. Therefore, one might wonder how predictive doing well in piano competitions is of achieving this ultimate goal: a successful career. To try and answer this question, we first need to think about the two things we want to relate:

1. What defines “doing well in piano competitions”? We only have data for pianists who actually reached the finals of any given competition, not for those not reaching the finals. Therefore, “doing well” is determined by the ranking of the pianists within each of the finals, i.e. 1st prize, 2nd prize, 3rd prize, etc., and possibly finalists without a prize.
2. What defines a “successful career”? Obviously, one could think of measures such as “number of concerts per year” or “yearly income”. Needless to say, these data are pretty hard to come by. Therefore, I decided to go with the relatively pragmatic “number of Google hits on the rainy Saturday afternoon of November 28, 2015”, as measured using the Custom Search API from the Google Developers Console for doing the following search: <“first_name last_name” piano>. In other words, the assumption is: the more Google hits, the more success.

So, we will try and establish whether knowing the prize won in a competition allows us to predict the number of Google hits. We will call this the “prize effect”, i.e. the effect of prize on number of Google hits. For example, we can take the Queen Elisabeth Competition, and plot the prize won by each finalist against his/her number of Google hits:

We can see that 1st prize winners generally have more Google hits than finalists without a prize, so there indeed seems to be a weak trend. (Statistically, this observation is confirmed by the Kendall correlation coefficient of -0.3 and corresponding p-value of 0.0055)

OK, simple enough, right? Well…., maybe not. These are only results for the Queen Elisabeth Competition. What if we want to assess the same trend, but based on all competitions simultaneously? Now we have a problem, because some competitions are more prestigious than others. In other words,  you might expect someone coming in 3rd at the Van Cliburn Competition to have better chances of having a successful career (and thus more Google hits) than someone coming in 3rd at the Premio de Piano Frechilla-Zuloaga, simply because the Van Cliburn Competition is a more prestigious competition. We will call this the “competition effect”. Also, it is not unlikely that the number of Google hits in November 2015 is influenced by the year in which a competition was won. We will call this the “year effect”. So what we want to do is to determine the “prize effect” on the number of Google hits, while correcting for the “competition effect” and the “year effect”. (Now we’ll get into some technical details, but feel free to skip these and head directly over to the conclusion to learn whether prize indeed predicts a successful career) Fortunately, there is a class of statistical models called mixed models that can help us out here. More specifically, we’ll use the lme4 R-package to construct a mixed model predicting number of Google hits from a fixed effect “prize”, a fixed effect “year”, and a random effect “competition”. Suppose that data.frame K0 contains our data, namely columns:

1. nhits: number of Google hits
2. prize: the prize won
3. year: the year in which a prize was won
4. competition: the name of the competition

Then one way of determining whether the prize effect is significant when taking into account the competition and year effects is the following:

```
# Log-transform the number of Google hits, to make it a bit better
# behaved in terms of distribution. To make the log-transform work,
# we first need to add a 'pseudocount' of 1, so as to avoid taking
# the logarithm of 0.
K0\$nhits <- log10(K0\$nhits+1)
# Make "year" into a factor, such that it will be treated as a
# categorical variable.
K0\$year <- factor(K0\$year)
# Train the null model, predicting nhits only from competition.
fit.null <- lmer(
nhits ~ year + (1|competition),
data = K0,
REML = FALSE
)
# Train the full model, predicting nhits from prize and competition.
fit.full <- lmer(
nhits ~ prize + year + (1|competition),
data = K0,
REML = FALSE
)
# compare the null model with the full model by performing a
# likelihood ratio test using the anova() function
anova(fit.full,fit.null)

```

Note that year is treated as a categorical variable. This is because it is likely that the effect of year on nhits is nonlinear. Indeed, I observed this for the Queen Elisabeth competition, with relatively many Google hits for the 2003 and 2013 competitions, and fewer for the 2007 and 2010 competition. This could be explained by the fact that 2013 laureates get more hits due to winning a prize, and that 2003 laureates get more hits because they have had 10 years to establish a career. This nonlinearity makes it more difficult to deal with in linear models. However, the number of years is limited, and moreover we are not interested in assessing the magnitude of the year effect, but only in removing it. Therefore, we can treat it as a categorical variable. The above approach gives p < 2.2e-16 for the difference between the two models, thus indicating that, across competitions in general, prize indeed contributes to the number of Google hits. We can try to visualize the prize effect by plotting prize against the residuals of both the null model and the full model:

This demonstrates that there is a significant correlation of prize with residual nhits in the null model that is removed when including prize as a predictor variable in the full model. This indeed indicates a trend across competitions and years for mildly more Google hits with winning top prizes. Also evident from this plot is that the residuals may not be normally distributed, thus violating one of the assumptions of mixed models. This is even more clearly seen in a Q-Q plot, below for residual nhits of the full model:

If the residuals were normally distributed, they would more closely follow the dashed line. Thus, the mixed model as described above may not be entirely appropriate for our current purpose. Therefore, in order to determine the significance of the prize effect, we may want to replace the likelihood ratio test with a non-parametric test, such as a permutation-based test. Doing this, as it turns out, also gives a significant result, namely p < 0.001 using 1000 permutations. Thus, prize can predict number of Google hits, at least to a certain extent. This is also indicated by the highly significant non-parametric Kendall correlation of prize with residual nhits in the null model. However, at -0.081 the magnitude of this correlation is fairly small. Note of caution: strictly speaking we have not established that winning a top prize actually causes a higher number of Google hits; we have only established an undirected association between the two. Nonetheless, considering that all competition results preceded the retrieval date of the number of Google hits, typically by 5 to 10 years, this is by far the most likely interpretation.

The above observations lead to the following conclusion: if you are a finalist in a piano competition and win a top prize, you do seem to have better chances of having a successful career than a finalist not winning a top prize, at least in terms of number of Google hits. However, this “prize effect” is remarkably small when observed for piano competitions in general.