Are statistically significant results always relevant? Let’s look at a simple hypothetical example. Suppose we have two groups of 2500 men. All men in group 1 have a beard, and none of the men in group 2 do. We also know the height of every man, and it turns out that the bearded men are statistically significantly taller than the beardless men (t-test, p < 0.05). For example, the bearded men might average 1.84 m and the beardless men 1.835 m.
As mentioned above, in our example, the difference in height is significant. However, arguably the more interesting question is: What can we do with this result? Is it practically relevant? For example, we could ask ourselves: Given the height of a man we have not seen yet, can we predict whether he has a beard?
Well, we could, but we would do very poorly: only slightly better than random guessing. An optimal classifier would be expected to put the decision boundary midway between the two group means of 1.84 and 1.835, that is, at 1.8375, and would have a misclassification rate of about 0.49. In other words, among 100 men, we would expect to classify only about one more correctly than by random guessing. Why is this? Because the effect size is so small: the difference in average height between the two groups is tiny compared with the variation in height within each group, so it can hardly be used for prediction. While it cannot be denied that there is a difference between the two groups, it is of little practical relevance, and we had better look for something that predicts beardedness more reliably.
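To make the arithmetic concrete, here is a minimal sketch of how a misclassification rate near 0.49 can coexist with p < 0.05. It assumes that heights in both groups are normally distributed with the same standard deviation, that the two groups are equally likely, and that each group holds 2500 men. The standard deviation of 0.07 m is an assumed value chosen for illustration, not one given above, and the function names are hypothetical. The p-value uses a normal approximation to the t-test, which is reasonable at this sample size.

```python
from math import sqrt
from statistics import NormalDist


def misclassification_rate(mean_a, mean_b, sd):
    """Error rate of the midpoint threshold for two equally likely,
    equal-variance normal groups."""
    d = abs(mean_a - mean_b) / sd  # standardized difference (Cohen's d)
    return NormalDist().cdf(-d / 2)


def two_sided_p(mean_a, mean_b, sd, n_per_group):
    """Two-sided p-value for the difference in means, using a normal
    approximation to the two-sample t-test."""
    z = abs(mean_a - mean_b) / (sd * sqrt(2 / n_per_group))
    return 2 * (1 - NormalDist().cdf(z))


ASSUMED_SD = 0.07  # metres; an assumption, not a value from the text

error = misclassification_rate(1.84, 1.835, ASSUMED_SD)
p = two_sided_p(1.84, 1.835, ASSUMED_SD, 2500)

print(f"misclassification rate: {error:.3f}")
print(f"approximate p-value:    {p:.3f}")
```

Under these assumptions the error rate stays just below 0.5 while the p-value falls below 0.05. A different assumed standard deviation shifts both numbers, but the pattern of a significant yet nearly useless difference remains as long as the standardized difference is this small.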
So, in reporting results, we should look not only at statistical significance, but also at effect size. Nonetheless, in practice, it is not unusual for effect size to be under-reported. An interesting example is this article, on “how intelligence, population density, and friendship affect modern happiness”. It received quite a bit of attention in the media. One of the main results in the paper was that there is an “interaction effect between frequency of socialization with friends and intelligence on life satisfaction”, such that “more intelligent individuals were actually less satisfied with life if they socialized with their friends more frequently”. This was summarized in the following graph:
Indeed, people with higher IQs seem unhappier when they have more social interactions, and Li and Kanazawa showed that these results were significant (p = 0.016). So far so good. However, look at the y-axis. The article states that life satisfaction was reported on a scale from 1 to 5, but the figure only spans a tiny fraction of the entire range, from 4.10 to 4.16. Moreover, only mean life satisfaction is reported, and no indication whatsoever is given of the spread in life satisfaction scores: Most likely, the large majority of the individual scores are either larger than 4.16 or smaller than 4.10, and therefore lie outside the range of the y-axis. To get a proper idea of how small the differences actually are, look at the same data, mean life satisfaction, but now with a y-axis ranging from 1 to 5:
To get a feeling for the effect size of this difference, we might ask a question similar to the one in the toy example we started with: Would you be able to predict whether someone has a high IQ just by knowing whether they socialize frequently and how satisfied they are with their life? Most likely you would do very poorly, close to random in fact, as the Cohen’s d values of 0.05 and -0.03 reported in the article suggest. With a sample size as large as the 15197 reported in the article, even very small effects can be identified as statistically significant.
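As a rough illustration of both points, the sketch below converts a Cohen’s d value into the best accuracy a threshold rule could achieve, and shows how the p-value for a fixed d of 0.05 shrinks as the sample grows. It treats d as a plain difference between two equally likely, equal-variance normal groups, which is a simplification of the interaction analysis in the article. The group sizes in the second loop are hypothetical, and the p-values use a normal approximation.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def best_case_accuracy(d):
    """Accuracy of the optimal threshold rule for two equally likely,
    equal-variance normal groups separated by Cohen's d."""
    return STD_NORMAL.cdf(abs(d) / 2)


def approx_p_value(d, n_per_group):
    """Two-sided p-value for an observed standardized difference d
    between two groups of equal size (normal approximation)."""
    z = abs(d) * sqrt(n_per_group / 2)
    return 2 * (1 - STD_NORMAL.cdf(z))


# Effect sizes reported in the article.
for d in (0.05, -0.03):
    print(f"d = {d:+.2f}: best-case accuracy {best_case_accuracy(d):.3f}")

# Hypothetical group sizes, chosen only to show the trend.
for n in (100, 1_000, 10_000):
    print(f"d = 0.05, n = {n} per group: p = {approx_p_value(0.05, n):.4f}")
```

The accuracy depends only on d, whereas the p-value depends on both d and the sample size, which is why a large study can report a significant result for an effect that barely improves prediction.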
In conclusion: Is there an effect? Yes, there is. Is it relevant? That is very questionable, considering the small effect size.