We turn again to our investigations of ZwiftPower race data. In the second of the recent Cruiser Sunday posts I discussed briefly whether the observed difference between cat A and cat C with regard to relative effort levels among top contenders was statistically significant. Now we will try to analyze race data properly, with a third approach.
An Explanatory Sidetrack
We will start with a little loop before we get back on track. Imagine you have kids and that you recently moved to a new area. There are two nearby schools to put your kids in and you have the choice between either and want to choose the one where the students have the highest grades. Is there a difference at all, and if there is, can we somehow determine whether that difference is not just random?
Or let’s make it really simple. You and a friend throw dice. You roll a die 100 times each. The objective is to score the highest total. If the dice are fair, then there should be no difference between your results, right? Or rather, there will be a difference but only a small one. One of you had a streak of luck resulting in a slightly higher total. Do it all again and it might be reversed.
But if it turns out your friend’s total is 516 and yours is only 321, is that just luck? Well, in theory it could be. It’s just not very likely that you will see such a large difference. He would have to have rolled a large number of 6’s to get to that total score. It could happen once in a blue moon, sure, but at the same time it wouldn’t be unreasonable to suspect a loaded die. Or?
A better approach here would be to not begin with trying to decide whether the difference is random or not, because right now we don’t know, but rather to start with determining how likely such an extreme random difference would be. Maybe the difference isn’t that big after all when it comes to probabilities?
Fortunately, there are ways to determine this likelihood for various scenarios. In the case of the schools or the dice you can use a fairly simple statistical test called the Mann-Whitney U-test. If the test score is extreme enough, it indicates that a difference in dice totals this large would be very unlikely to come about by chance alone.
You typically set a limit beforehand as a decision rule. In smaller studies where the results aren’t life critical, a 5% limit, a so-called 5% significance level, is standard. So if we were to do the 100 dice rolls over and over with fair dice and we would see differences of the magnitude of 516 vs 321 in less than 5% of the trials, then we have decided that it is so unlikely that we are better off looking for other explanations than just chance. I.e. we would rather suspect that your friend is cheating.
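The "do it over and over" idea can be tried out directly. Below is a minimal sketch, not part of the original analysis, that plays many games with two fair dice and counts how often the gap between the totals is at least as large as 516 vs 321. The function names, the number of trials and the seed are my own choices for illustration.

```python
import random


def total_of_rolls(rng, n_rolls=100):
    """Sum of n_rolls throws of a fair six-sided die."""
    return sum(rng.randint(1, 6) for _ in range(n_rolls))


def share_of_extreme_gaps(observed_gap, n_trials=5000, seed=1):
    """Fraction of simulated fair games whose gap is at least observed_gap."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_trials):
        # One game: you and your friend each roll 100 times.
        gap = abs(total_of_rolls(rng) - total_of_rolls(rng))
        if gap >= observed_gap:
            extreme += 1
    return extreme / n_trials


observed_gap = 516 - 321  # the gap from the dice example
share = share_of_extreme_gaps(observed_gap)
print(f"Share of fair games with a gap of {observed_gap} or more: {share:.4f}")
print("Below the 5% limit" if share < 0.05 else "Not below the 5% limit")
```

With fair dice the gap between two 100-roll totals is typically a few dozen points, so a gap of 195 should essentially never show up in the simulation. That is the intuition behind the decision rule: the result is far rarer than the 5% limit, so chance is a poor explanation.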
We will use this same method when looking at the race results on ZwiftPower next.
We will look at HR distribution graphs on Zwift.com among the top 3 in 100 consecutive races in the recent past, in both cat A and cat C.
If a rider spends the best part of his time in the race in a higher HR zone than the other two, visibly so, then that rider has worked harder. The HR graphs aren’t a perfect description of everyone’s fitness, especially when HR zones aren’t tuned to an individual, but on average they should be reasonably representative, and we are looking at 300 riders in each category. Individual quirks will likely average out.
If the winner of a race has worked harder than the rest of the podium, then we will score that race as 0, meaning nobody worked harder than him. If exactly one of the other two has worked harder than the winner, then we will score the race as 1, meaning one guy worked harder than the winner. If both of the other riders worked harder than the winner, then we will score the race as 2, meaning two others worked harder than the winner.
If there is no HR data available for someone on the podium, we will skip that rider and instead look at the next guy on the results list. It is not uncommon that HR data is missing and the typical reason is that the rider’s Zwift profile is set to private. So if the winner has no HR data, then we will compare the no 2 guy to the no 3 and no 4 guy instead. And if the no 3 guy has no HR data, we will compare the winner to the no 2 and no 4 guy instead. The reason we do this is that the display of all recent races on ZwiftPower is somewhat limited and we need to make sure we get a sample size big enough, 100 races. And it should really make no difference when it comes to our assumptions, or our hypothesis in this study. More about that below.
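The scoring rule, including the fallback for missing HR data, can be sketched as a small function. Note that in this study the comparison was made by eye from the HR graphs. The numeric "effort" values below are hypothetical stand-ins (think of something like the share of race time spent in the top HR zones), and treating equal values as "did not work harder" is my own assumption.

```python
def score_race(efforts_in_finish_order):
    """Score one race as 0, 1 or 2.

    efforts_in_finish_order has one entry per rider, winner first. Each entry
    is a hypothetical number summarising how hard the rider worked, or None
    when no HR data is available for that rider.
    """
    # Skip riders without HR data and move down the results list instead.
    compared = [e for e in efforts_in_finish_order if e is not None][:3]
    if len(compared) < 3:
        raise ValueError("need three riders with HR data")
    leader, others = compared[0], compared[1:]
    # Count how many of the two following riders worked harder than the leader.
    return sum(1 for effort in others if effort > leader)


print(score_race([0.62, 0.55, 0.48]))        # winner worked hardest
print(score_race([0.40, 0.55, 0.38]))        # one rider worked harder
print(score_race([None, 0.40, 0.55, 0.60]))  # winner has no HR data
```

In the last call the winner is skipped, so the no 2 guy is compared to the no 3 and no 4 guy, just as described above.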
Once we have scored 100 races in cat A and cat C, we will then compare the results using the Mann-Whitney U-test. If there is a difference big enough to be statistically significant (remember the 5% rule here), then and only then will we draw uncomfortable conclusions.
Assume we are with the ZP team and we LOVE the W/kg category system. We firmly believe it is fair and reasonable. Every sport should be categorized with W/kg, we think. There is no better option. We just need to get rid of those pesky sandbaggers first somehow…
Then what do we expect in a race with regard to relative effort levels among the top contenders? Perhaps there are two possibilities here. We could for example assume that the strength and prowess among the top contenders are roughly the same. So why does someone come out on top? Because he works harder than the others. All else equal, on average, someone working harder than the others will win. So we expect the winner to have worked the hardest (score 0).
Or we could assume that winning a race isn’t just about working hard, even if you are as fit as other top contenders. It is also about random events in the race, such as splits and breakaways and powerups and whatnot. Maybe those random events, a.k.a. luck, play such a large part in a race that we can’t separate the podium places with differences in effort levels. So instead we assume that the relative effort among the top 3 will be roughly the same. Obviously, the top 3 will be more fit and potentially also work harder than the ones coming in last in a big race, but among the top 3, we assume that the effort of each respective rider will be about the same, if not in every race then at least on average in 100 races. Thus what we will not see is a tendency for score 2 in a lot of races. Rather, races will converge around score 1.
And what do we expect when comparing cat A with cat C? We expect to see no difference in relative efforts in the two categories. Cat A riders might be used to working harder but when comparing the top 3 in a cat A race, there should be no greater differences among them than among the top 3 in a cat C race. There may or may not be a difference in overall effort between cat A and cat C, but the pattern of who works hardest within the podium should look the same in both categories.
Possibly, since we make no distinction between A and A+ riders, and since it is not uncommon that a cat A race is won by an A+, followed by two A riders, we might find a slight tendency for cat A winners to work a little less hard than the rest of the podium. We do not, however, expect to see this in cat C. Because cat C is fair and the W/kg system is appropriate in Zwift, or so we claim.
The “Oh Shit!” Scenario
Now, if we were to find that there is a tendency for cat C winners to work less hard than the rest of the podium, and that there is less of that tendency in cat A, then that would scare us. Because it is unintuitive. Why should races be won by people who work less hard than others, especially when there is an upper limit to performance (W/kg) in a category? We wouldn’t like that. It goes against the nature and ethics of the sport and would distance us from outdoor cycling too.
And it may also indicate that the phenomenon of cruising is a real issue in the lower categories, i.e. that some riders exploit the W/kg system on ZwiftPower by staying behind in a category they are too strong for, making sure they don’t go over W/kg limits, and thus get an unfair advantage in races over riders who couldn’t go over limits due to fitness and who would have to (and will) work extremely hard to finish anywhere near the top.
100 races were sampled starting Fri 7 Aug 2020 and forward in cat A and cat C. According to the scoring method described above, cat A got a total score of 80 whereas cat C got a total score of 106.
In 43 races in cat A, the top 1 guy worked harder than the following two. In 34 races in cat A, one following rider worked harder than the top 1 guy. In 23 races in cat A, both following riders worked harder than the top 1 guy.
In 29 races in cat C, the top 1 guy worked harder than the following two. In 36 races in cat C, one following rider worked harder than the top 1 guy. In 35 races in cat C, both following riders worked harder than the top 1 guy.
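As a quick sanity check, the totals of 80 and 106 follow directly from the race counts per score. The short snippet below only redoes that arithmetic. The variable names are mine.

```python
# Races per score (0, 1, 2) as reported for each category.
counts = {"A": [43, 34, 23], "C": [29, 36, 35]}

totals = {}
for cat, per_score in counts.items():
    races = sum(per_score)
    # Each race contributes its score (0, 1 or 2) to the category total.
    totals[cat] = sum(score * n for score, n in enumerate(per_score))
    mean = totals[cat] / races
    print(f"cat {cat}: {races} races, total score {totals[cat]}, mean {mean:.2f}")
```

So the average cat A race had 0.80 podium riders working harder than the winner, against 1.06 in cat C.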
The Mann-Whitney U-test gives a test score of -2.15, which translates into a probability, a p-value, of 0.032 (3.2%) of seeing a difference at least this large by chance alone. This is lower than the 5% limit we set. There is indeed a statistically significant difference between the categories, and it goes in a direction we did not expect. We expected either no significant difference between the two categories or, if there was one, that it would lean the other way, towards winners in cat A working less hard relative to the rest of the podium than winners in cat C. Hence we have to draw the conclusion that we cannot refute the “Oh shit!” scenario.
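For anyone who wants to redo the test from the counts, here is a minimal sketch of the Mann-Whitney U-test using the normal approximation. I am assuming the plain textbook formula without a correction for ties, since that appears to reproduce the test score of -2.15. The function name and the optional `tie_correction` switch are my own additions and say nothing about which tool produced the numbers in this post.

```python
import math


def mann_whitney_z(counts_x, counts_y, tie_correction=False):
    """Normal-approximation Mann-Whitney test for scores 0, 1, 2, ...

    counts_x[s] and counts_y[s] are the number of races with score s.
    Returns (U for x, z, two-sided p-value).
    """
    n_x, n_y = sum(counts_x), sum(counts_y)
    # U counts the pairs where x outscores y; a tie counts as half a pair.
    u = 0.0
    for sx, cx in enumerate(counts_x):
        for sy, cy in enumerate(counts_y):
            if sx > sy:
                u += cx * cy
            elif sx == sy:
                u += 0.5 * cx * cy
    n = n_x + n_y
    mean_u = n_x * n_y / 2
    var_u = n_x * n_y * (n + 1) / 12
    if tie_correction:
        # Many tied scores shrink the variance of U.
        group_sizes = [cx + cy for cx, cy in zip(counts_x, counts_y)]
        ties = sum(t**3 - t for t in group_sizes)
        var_u -= n_x * n_y * ties / (12 * n * (n - 1))
    z = (u - mean_u) / math.sqrt(var_u)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return u, z, p


u, z, p = mann_whitney_z([43, 34, 23], [29, 36, 35])
print(f"U = {u:.0f}, z = {z:.2f}, p = {p:.3f}")
```

Run like this, the sketch gives a z of about -2.15 and a p-value just above 3%, in line with the figures above. With only three possible scores there are a lot of ties, and switching on `tie_correction=True` makes the z-score somewhat more extreme, so the conclusion stays the same either way.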
The “Oh shit!” scenario is real. We do not live in the best of all Watopias. We live in a Watopia where it pays off to work hard in cat A but apparently not so much so in cat C. We live in a Watopia where the category system makes us behave weirdly in races in the lower categories B-D. We live in a Watopia where you can get away with cruising, even on ZwiftPower.
Now we have a choice. We can either accept that racing is inherently unfair in the lower categories and just live with it. Or we can, inspired by other working and efficient category systems in real-life sports, find a new category system that would prevent not only sandbagging but also weird discrepancies such as the one we just looked at, a system that would also unchain racers in all categories and prevent cruising.
Your choice. I have made up my mind already.