How it all works

I appreciate that it’s difficult to keep up with the Pythago NRL Expanded Universe™ of metrics and ratings. Not only are they generally more complicated than standard stats, but I also frequently tweak them based on what I learn using them.

Here’s a short reference guide to what it all means.

Pythagorean expectation: There is a relationship between points for and against and winning percentage, which is expressed by the Pythagorean expectation formula. This formula is used to estimate the number of wins a team should have based on their points scored and conceded. Generally, if a team outperforms their Pythagorean expectation (that is, wins more games than predicted by the formula) they will win fewer games in the following season, and vice versa. Pythagorean expectation becomes increasingly accurate at estimating win percentage over longer time periods.

Strengths: identifying teams whose record is inflated compared to their skill. Weaknesses: almost every season there is a team or two that greatly outperforms their expectation and it can be difficult to separate legitimate contenders from pretenders.
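As a rough illustration of the formula, here is a minimal sketch. The exponent of 2 is the classic default from baseball and is only an assumption here; the value actually fitted for rugby league is not stated above, and the function names are made up for the example.

```python
def pythagorean_win_pct(points_for, points_against, exponent=2.0):
    """Estimated win percentage from points scored and conceded.

    exponent=2.0 is an assumed placeholder, not a fitted rugby league value.
    """
    pf = points_for ** exponent
    pa = points_against ** exponent
    return pf / (pf + pa)


def expected_wins(points_for, points_against, games, exponent=2.0):
    """First-order wins: games played times the estimated win percentage."""
    return games * pythagorean_win_pct(points_for, points_against, exponent)


# Hypothetical team: 600 points for, 400 against, over 24 games.
wins = expected_wins(600, 400, 24)
```

A team that actually won noticeably more games than `wins` would be a candidate for falling back the following season.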

SCWP: Should-a Would-a Could-a Points uses the principle of Pythagorean expectation but, in lieu of actual points scored and conceded, it utilises metres gained and conceded and line breaks gained and conceded to estimate the number of points that the team should have scored and conceded. The results are called “second-order wins”, distinct from “first-order wins” calculated by standard Pythagorean expectation and actual wins (sometimes “zeroth-order”).

Strengths: utilises repeatable statistics that are less prone to randomness to estimate team quality; second-order wins are a marginally better predictor of future performance than first or zeroth order wins. Weaknesses: some of the improved accuracy comes from teams being regressed to the mean; system is increasingly divorced from on-field results.
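The idea can be sketched as follows, assuming a simple linear conversion from metres and line breaks to points. That conversion, its coefficients and the exponent are all placeholders for illustration; the real SCWP mapping is not given here.

```python
def estimated_points(metres, line_breaks, pts_per_metre, pts_per_line_break):
    """Points a team 'should' have scored; a linear form is an assumption."""
    return metres * pts_per_metre + line_breaks * pts_per_line_break


def second_order_wins(gained, conceded, games, coeffs, exponent=2.0):
    """gained and conceded are (metres, line_breaks) tuples.

    coeffs is (pts_per_metre, pts_per_line_break), supplied by the caller.
    """
    est_for = estimated_points(*gained, *coeffs)
    est_against = estimated_points(*conceded, *coeffs)
    share = est_for ** exponent / (est_for ** exponent + est_against ** exponent)
    return games * share


# Hypothetical season totals and made-up coefficients.
coeffs = (0.01, 2.0)
wins_2nd = second_order_wins((36000, 100), (34000, 90), 24, coeffs)
```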

Elo ratings: Elo ratings were developed by Arpad Elo to rank chess players and are now used in FIFA’s official world rankings. My Elo systems were formerly known by the names of Greek philosophers, and I’ve been using them since 2017 to assess the quality of rugby league teams. I massively expanded the number of systems during the 2019-20 off-season to cover most of the major men’s leagues in the world.

The variables behind each system and league are different but typically, the average rating is 1500 and a higher rating reflects a better team. We can use the difference in Elo ratings to calculate the winning probability of two teams. I maintain two systems for each league –

Form ratings are designed to reflect short term performance and move quickly to reflect recent results (I typically say about six to eight weeks in the NRL). The system variables are optimised to maximise head-to-head tipping success. When two teams match up, an expected margin is estimated from their respective ratings. If a team beats the expected margin, even if they lose the match, their rating goes up by exactly the same amount as the other team’s rating goes down. Form ratings only track regular season performance.

Strengths: cuts through individual results to rate the quality of a team’s wins by score and quality of opponent; optimised for best head-to-head predictive power. Weaknesses: ratings can be more noise than signal when blowout wins or big upsets inflate ratings; can be counter-intuitive when a heavily favoured team has a narrow win and their rating goes down.
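A minimal sketch of a margin-based, zero-sum update like the one described for form ratings. The conversion rate between rating points and expected margin (`points_per_rating_point`) and the update weight `k` are hypothetical values, not the tuned system variables.

```python
def expected_margin(rating_a, rating_b, points_per_rating_point=0.1):
    """Expected winning margin for team A, in points (assumed linear in rating gap)."""
    return (rating_a - rating_b) * points_per_rating_point


def update_form(rating_a, rating_b, actual_margin, k=1.0, points_per_rating_point=0.1):
    """actual_margin is team A's score minus team B's score.

    Team A gains exactly what team B loses, so the league average is preserved.
    """
    surprise = actual_margin - expected_margin(rating_a, rating_b, points_per_rating_point)
    shift = k * surprise
    return rating_a + shift, rating_b - shift


# A 1600-rated favourite is expected to beat a 1500-rated side by 10 here,
# so a 4-point win still costs the favourite rating points.
new_a, new_b = update_form(1600, 1500, actual_margin=4)
```

This shows the counter-intuitive case from the weaknesses above: the favourite wins but falls short of the expected margin, so their rating drops.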

Class ratings are slower moving than form ratings and take multiple seasons to change significantly. Unlike form ratings, class ratings go up only when you win. They go up more for winning finals games and more still for winning grand finals. Class ratings reflect a team’s innate quality and act as a handbrake on reading too much into the last couple of matches. For example, a team wrecked by Origin selections may have a poor form rating at the start of August but will maintain a high class rating.

Strengths: makes a good basis for historical comparisons of teams; ratings always go up with wins. Weaknesses: very slow moving; poor predictive power; does not factor in scale of wins.
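For comparison, a win-based update in the style of class ratings might look like the sketch below. The win probability is the standard Elo expected-score formula; the 400-point scale, the small `k` and the match weights are assumptions chosen for illustration, and draws are ignored.

```python
def win_probability(rating_a, rating_b, scale=400.0):
    """Standard Elo expected score for team A (scale=400 is the chess convention)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Hypothetical multipliers: finals count for more, grand finals more still.
MATCH_WEIGHTS = {"regular": 1.0, "final": 1.5, "grand_final": 2.0}


def update_class(winner, loser, match_type="regular", k=4.0):
    """Only the result matters, not the margin; the winner always gains."""
    shift = k * MATCH_WEIGHTS[match_type] * (1.0 - win_probability(winner, loser))
    return winner + shift, loser - shift


regular = update_class(1500, 1500)
grand_final = update_class(1500, 1500, "grand_final")
```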

Historical ratings are available for:

Disappointment line: The number of wins we expect for each team based on their class rating compared to the system average. An average team will be expected to win half of their games, an above average team will be expected to win more than half and vice versa. Failing to reach the number of wins indicates a disappointing season and the greater the miss, the more disappointing the season has been.

Strengths: works as a good proxy for fan expectations. Weaknesses: does not factor in changes the team makes pre-season and only looks at previous results.
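One way to sketch the disappointment line is to treat every opponent as a system-average team and apply the standard Elo win probability. The 1500 average comes from the Elo section above; the 400-point scale and the function name are assumptions.

```python
def disappointment_line(class_rating, games, system_average=1500.0, scale=400.0):
    """Expected wins against a schedule of system-average opponents."""
    p_win = 1.0 / (1.0 + 10 ** ((system_average - class_rating) / scale))
    return games * p_win


# An average team is expected to win half of a 24-game season.
average_line = disappointment_line(1500, 24)
# A hypothetical 1600-rated team that wins 13 games has fallen short.
shortfall = 13 - disappointment_line(1600, 24)
```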

Taylor Player Ratings: Taylors (Ty) are the units for measuring production, the sum of valuable work done on the rugby league field, as measured by counting statistics that correlate with winning. Taylors by themselves can be misleading, so we have:

  • TPR: Taylor Player Rating. This compares the amount of production done by the player to the average player in that position, adjusting for time spent on the field. An average player has a rating of approximately .100, with fringe first graders sitting at .060 and top players nearing .180. To qualify for TPR, a player must play at least five games in a regular season.
  • WARG: Wins Above Reserve Grade. This compares a player’s production to that of a typical fringe first grader in that position to estimate the number of additional wins the player’s team gains by having him on the roster. We explored the concept in What makes a million dollar NRL player? and Rugby league’s replacement player. James Tedesco is the NRL’s career leader in WARG, with 10.5 since 2013 to the end of 2020.
  • Projection: Estimating the performance of a player for the next season, measured in TPR, based on previous seasons and reversion to mean per The Art of Projection.
  • Composition/Experience: we break the players into four categories. For composition, this is based on their projected TPR, with stars above .130, fringe below .070 and the average set at .100. For experience, we classify a rookie as a player with no TPR in the previous three seasons, a sophomore has registered one in the previous three, a senior, two in three, and a veteran has a TPR for all three of the previous seasons. Players without a TPR in the previous season are not included in roster composition considerations. To make comparisons a little easier, we have the Simple Score, in which we multiply the number of players in the top category by 4, the next down by 3, sophomores/below average by 2 and players in the lowest category by 1, and sum over the roster.
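The Simple Score arithmetic can be sketched like this. The thresholds and weights follow the description above; the label "above average" and the handling of a TPR exactly on a boundary are assumptions, and filtering out ineligible players is left to the caller.

```python
WEIGHTS = {
    "star": 4, "above average": 3, "below average": 2, "fringe": 1,
    "veteran": 4, "senior": 3, "sophomore": 2, "rookie": 1,
}


def composition_category(projected_tpr):
    """Classify by projected TPR (boundary handling is an assumption)."""
    if projected_tpr > 0.130:
        return "star"
    if projected_tpr >= 0.100:
        return "above average"
    if projected_tpr >= 0.070:
        return "below average"
    return "fringe"


def experience_category(seasons_with_tpr):
    """seasons_with_tpr: how many of the previous three seasons have a TPR (0 to 3)."""
    return ("rookie", "sophomore", "senior", "veteran")[seasons_with_tpr]


def simple_score(categories):
    """Sum the category weights over the roster."""
    return sum(WEIGHTS[c] for c in categories)


# Hypothetical four-player roster.
projections = [0.150, 0.110, 0.090, 0.060]
composition_score = simple_score(composition_category(t) for t in projections)
experience_score = simple_score(experience_category(n) for n in [3, 3, 1, 0])
```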

The Super League equivalent to Taylors is Rismans, which uses the same principles to measure slightly different stuff off a worse dataset.

Strengths: generally identifies which players are making significant contributions to their team’s performance; allows for the quantification of roster changes. Weaknesses: stat-padders look good if they can rack up metres or fall on enough tries; does not track defensive capability of individual players well.

Historical ratings are available for:

Coaching: There are two coaching metrics, which have dubious predictive power but seemingly good descriptive power –

  1. Coach Factor compares the pre-season player TPR projections against the players’ actual performance. The ratio between the actual TPR and the projection is then taken as an average across the team (weighted by the number of games played by each player), halved (arbitrarily to account for luck, intangibles, etc), adjusted for production inflation and attributed to the coach for that season. A positive factor indicates that the coach is a net positive influence on player production compared to the league average and a negative factor indicates that the coach is a net drag on production.
  2. Career Rating Points looks at the change in the team’s class Elo rating during the coach’s tenure. A good coach should leave the club with the rating higher than when they started and this should be demonstrable over sufficiently long tenures. We explored this concept in The coaches that fucked up your club.
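A rough sketch of both coaching metrics follows. For Coach Factor, centring the ratio on zero by subtracting 1 before halving is my assumption about how a "positive" or "negative" factor arises, and the production inflation adjustment is left out because its form is not described above.

```python
def coach_factor(players):
    """players: list of (projected_tpr, actual_tpr, games) tuples for one season.

    Returns half of the games-weighted average over/under-performance.
    The inflation adjustment is omitted in this sketch.
    """
    total_games = sum(games for _, _, games in players)
    weighted_ratio = sum(
        (actual / projected) * games for projected, actual, games in players
    ) / total_games
    return (weighted_ratio - 1.0) / 2.0


def career_rating_points(class_rating_at_start, class_rating_at_end):
    """Change in the team's class Elo rating over the coach's tenure."""
    return class_rating_at_end - class_rating_at_start


# Hypothetical squad: one player beats his projection, one falls short.
factor = coach_factor([(0.100, 0.120, 20), (0.100, 0.090, 10)])
tenure = career_rating_points(1480, 1545)
```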

Strengths: attempts to quantify coaching performance. Weaknesses: changes in total rating points can be influenced by the team’s state at the time of the coach’s start; coaching factor can be undermined if the projections happen to set a bar too high after an outperforming season; factor only compares coaches within the league that year, making absolute comparisons over time problematic.

Simulations: Formerly the Stocky but just called Sims today, these are Monte Carlo simulations of rugby league football matches. The simulations are used to determine the probabilistic outcomes of the season. Simulations use input ratings based on Elo, Pythagorean/SCWP or Taylor systems to calculate winning probabilities of individual matches. A random number determines the result of each match, and this is repeated ten, fifty or a hundred thousand times through the season. The outcomes give insight into how we expect the season to unfold, based on what we know right now, by exploring which pathways are more likely than others.

Strengths: creates interesting looking forecasts that account for the draw and the current state of the team. Weaknesses: only as accurate as the input ratings; cannot account for significant future changes in performance.
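The general shape of a season simulation can be sketched as below, using Elo-style inputs. The three-team league, the ratings, the 400-point scale and the random tie-break for first place are all made up for the example; a real ladder would break ties on points difference, and draws are ignored.

```python
import random
from collections import Counter


def win_probability(rating_a, rating_b, scale=400.0):
    """Standard Elo expected score for team A."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def simulate_season(ratings, fixtures, rng):
    """Play every fixture once, letting a random number decide each result."""
    wins = Counter({team: 0 for team in ratings})
    for home, away in fixtures:
        p_home = win_probability(ratings[home], ratings[away])
        winner = home if rng.random() < p_home else away
        wins[winner] += 1
    return wins


def first_place_odds(ratings, fixtures, n_sims=10_000, seed=0):
    """Share of simulated seasons in which each team finishes with the most wins."""
    rng = random.Random(seed)
    firsts = Counter()
    for _ in range(n_sims):
        wins = simulate_season(ratings, fixtures, rng)
        best = max(wins.values())
        leaders = [team for team, w in wins.items() if w == best]
        firsts[rng.choice(leaders)] += 1  # assumed tie-break: pick at random
    return {team: firsts[team] / n_sims for team in ratings}


ratings = {"A": 1600, "B": 1500, "C": 1400}
fixtures = [(h, a) for h in ratings for a in ratings if h != a]  # double round robin
odds = first_place_odds(ratings, fixtures, n_sims=2000)
```

Swapping the Elo inputs for Pythagorean or Taylor-based win probabilities only changes `win_probability`; the rest of the loop stays the same.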

WCL: The Worm Chess Lathe graph shows the in-game probability of a team winning the game, based on the past results of clubs at that moment with that margin in that competition.
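A minimal sketch of the lookup behind this kind of graph, assuming past matches have been reduced to (minute, margin, eventual result) observations. That data layout and the function name are hypothetical.

```python
from collections import defaultdict


def build_wcl_table(history):
    """history: iterable of (minute, margin, team_won) observations from past matches.

    Returns the share of past teams that went on to win from each
    (minute, margin) game state.
    """
    counts = defaultdict(lambda: [0, 0])  # (minute, margin) -> [wins, observations]
    for minute, margin, team_won in history:
        cell = counts[(minute, margin)]
        cell[0] += int(team_won)
        cell[1] += 1
    return {state: wins / total for state, (wins, total) in counts.items()}


# Made-up observations: three teams led by 6 at half time, two went on to win.
history = [(40, 6, True), (40, 6, True), (40, 6, False), (40, -6, False)]
table = build_wcl_table(history)
```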

Discontinued systems: