Primer – How do Elo models work?

Short answer: with a lot of time spent in Excel and Google Sheets.

Long answer: It depends on what you want to do.

I introduced the Elo rating system in a previous primer. Now it’s time to put it to work.

I think most sports fans would agree with the following definitions of form and class –

  • Form – Short-term performance, related to luck, match fitness and weather
  • Class – Long-term performance, related to the structural competence of the team in question

Something like “Form wins games, class wins premierships” seems appropriate.

The 2003 Panthers, 2005 Tigers and 2014 Rabbitohs all experienced extended periods of sizzling form, taking premierships not long after spending time in the doldrums, but they returned to more “normal” levels soon enough.

On the other hand, the Broncos of the late ’90s, the Roosters of the early ’10s and the post-’07 Storm (salary cap breaches, I know) demonstrated real class. They didn’t win everything but they were always there or thereabouts. Yes, it does seem that money buys class.

So what do you want to do? Form or class? Which one is more important?

¿Por qué no los dos? (Why not both?)

There are a number of different ways to undertake Elo modelling. For instance, there are two ways to calculate ratings:

  1. By result. I call this “winner takes all” (WTA) because the winner of the match takes the rating points and their rating always improves. This is how the (W − We) part of the soccer Elo ratings works.
  2. By margin. Here, we calculate the margin we expect between the two teams based on their relative ratings. Generally, the winner’s rating goes up, but if a lowly rated team manages a close loss to a highly fancied team, their rating may improve at the expense of the higher-rated team. This is how The Arc’s ratings work. Both update rules are sketched in code below.
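
To make the difference concrete, here’s a minimal sketch of both update rules in Python. Everything in it is illustrative: the K values, the margin-to-rating conversion factor and the function names are assumptions for the example, not the exact parameters The Arc or my systems use. The (W − We) term follows the standard Elo update.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score (We) for team A, on a 400-point logistic curve."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_wta(rating_a, rating_b, a_won, k=32):
    """Winner-takes-all: ratings move by K * (W - We), so the winner always gains."""
    w = 1.0 if a_won else 0.0
    delta = k * (w - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

def update_margin(rating_a, rating_b, actual_margin, k=0.5, points_per_elo=0.04):
    """Margin-based: compare the actual margin (team A's score minus team B's)
    to the margin implied by the rating gap. A narrow loss to a much stronger
    team can still gain points, because the actual margin beats expectation."""
    expected_margin = (rating_a - rating_b) * points_per_elo
    delta = k * (actual_margin - expected_margin)
    return rating_a + delta, rating_b - delta
```

With these toy numbers, a 1400-rated team that loses to a 1600-rated team by only two points gains about 3 rating points under the margin rule, but drops about 8 under WTA.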

Then, we need to pick a starting point for modelling. This is the point where each team starts on an even basis. There are two schools on this as well:

  1. Discrete – at the start of each season, each team is given a rating of 1500.
  2. Continuous – the team’s rating at the start of the season is related to their rating at the end of the previous season.

In the continuous school, when ratings roll over from season to season, a discount is applied to each team’s advantage (or disadvantage) so that their rating reverts somewhat to the mean. This is because stringing together back-to-back seasons of imperious (or terrible) form is rare, and most teams become a little more average over the summer break. For example, a team that finishes on 1400 with a 30% discount will start the next season on 1430.
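
As a sketch, assuming a league-average rating of 1500 (the same even basis the discrete school uses):

```python
MEAN_RATING = 1500  # assumed league-average rating

def season_rollover(rating, discount=0.30):
    """Continuous-school rollover: claw back a fraction of the gap to the
    mean over the off-season. A 30% discount on 1400 gives 1430."""
    return rating + discount * (MEAN_RATING - rating)

print(season_rollover(1400))  # 1430.0
print(season_rollover(1650))  # 1605.0
```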

Finally, we need to settle on some values for the weighting, K, of each game. A low value of K means that ratings change slowly in response to individual match results, while a high value of K makes them respond faster. Typically, a higher weighting is given to games of more importance (e.g. grand finals compared to regular season games), but K can also be used to account for the rhythm of the season (e.g. the games around Origin weighted less than the remainder of the season).
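
In practice this can be as simple as a lookup keyed on match type. The numbers below are placeholders to show the shape of the idea, not the weights my systems actually use:

```python
# Illustrative K values only: bigger games move ratings more,
# and games around the Origin period move them less.
K_BY_MATCH_TYPE = {
    "regular": 32,
    "origin_period": 24,
    "finals": 40,
    "grand_final": 48,
}

def k_for(match_type):
    return K_BY_MATCH_TYPE.get(match_type, 32)
```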

My driver when I first started was to correctly predict the outcomes of individual games as often as possible. This basically means I was looking for a way to use Elo ratings to model form. Not knowing what I was doing, I created four separate systems:

[Table: the four systems – Euclid, Thales, Hipparchus and Archimedes – and their settings]

And of course I named them after Greek mathematicians, because if Pythagoras gets a system, why not Euclid? Incidentally, I had originally named them after halfbacks (Hunt, Cronk, Thurston and Johnson) but figured people might get pissy that their favourite halfback wasn’t included.

I used information from the first five seasons of the NRL, 1998 to 2002, to calculate variables like the home field advantage, and then tested the performance of the systems based on the results from 2003 to 2016. This is yet another primer waiting to happen.
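
For reference, the “prediction rate” quoted below is just the share of games in the test window where the team favoured by the ratings actually won. A rough sketch, assuming each result is a (home_rating, away_rating, home_won) tuple and a flat home-advantage bonus (the 60-point figure is a placeholder, not my fitted value):

```python
def prediction_rate(results, home_advantage=60):
    """Share of games where the ratings-favoured team won.
    `results` is an iterable of (home_rating, away_rating, home_won) tuples."""
    correct = 0
    total = 0
    for home_rating, away_rating, home_won in results:
        picked_home = (home_rating + home_advantage) >= away_rating
        correct += picked_home == home_won
        total += 1
    return correct / total
```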

The results were not stunning but not terrible either. Euclid came out on top with a prediction rate of 61.8% from 2003 to 2016. This is about 10% less than The Arc managed for AFL over a similar period, but about 10% better than just picking the home team. Thales was just behind on 61.4%. The WTA pairing of Hipparchus and Archimedes came in at 59.5% and 58.0% respectively. And look, there is some evidence to suggest that the NRL is a bit less predictable than other sports.

I also developed a system, Eratosthenes, to measure class. Its K values are low, it’s winner-takes-all, and it’s tuned so that a team’s rating is roughly equivalent to their win rate over the previous three years. Eratos’ predictive power is relatively poor (57.6%) but it’s better suited to answering questions like “Which was the best team not to make the finals?” (A: any team that copped a big points deduction for salary cap breaches) or “Who had the highest average rating from 2007 to 2016?” (A: Melbourne, 1629).

Eratosthenes and Euclid are obviously the way to go for class and form respectively, but they have their drawbacks. Eratos moves very slowly in response to games: it takes a couple of seasons to shift a rating by 100 points. Euclid probably overreacts to individual results, and that same 100-point shift can happen off the back of a single massive upset.

Euclid can also toss up strange results. The 11th-placed Auckland Warriors were the top-rated Euclid team in 1999 due to a quirk of how their final half dozen games played out. And, unlike in WTA systems, a rating can go up even when the team loses, which is not intuitive. WTA systems also carry a lower discount from season to season, which makes for a better set of starting ratings. So what I do with Thales, Hipparchus and Archimedes remains to be seen.

See also: What variables do Elo models use?