Opponent-Adjusted Performance Score: An Alternative to Speaker Points as the First Tiebreaker

Jan 12, 2015

Introduction

Speaker points suck - we all know it, and we all complain about it. Their problems are why more and more tournaments have started minimizing their influence by breaking all 4-2's or 5-2's.

In this article, I will propose a way to make the use of speaker points as a tiebreaker better - hopefully significantly better. I call it Opponent-Adjusted Performance Score (OAPS), and it looks like this:

OAPS = TPS + (sum of each opponent's TPS) / (number of prelim rounds)

I promise, it's easier than it looks. Please bear with me. :)

The Problem with Speaker Points (If you already know this you can skip to the next section)

1. Inconsistent scales

Anyone who has ever looked at a tournament result packet or read a few judge paradigms realizes that the scales judges use are entirely inconsistent. Some judges regularly assign double 30’s, while others go seasons without assigning a single one. While some judges will not give below a 27 unless something horrific takes place in the round, other judges regularly drop as low as 24 or 25. Because debaters will likely be judged by very different sets of judges, this inconsistency makes speaker points (SP) largely useless as a metric for comparing the quality of debaters in a tournament.

Some tournaments have attempted to address this problem by explicitly endorsing certain scales. This strategy has always failed. Judges either fail to effectively implement the change, slide back into old habits, or completely ignore the suggestions.

2. Point Inflation

A related but distinct problem is that average speaker points have been steadily rising over time. Inflation has many potential causes, but I think the most commonly cited is the desire of judges to make debaters happy. If a judge has to make a decision in a close round, has an RFD they are not confident in, or wants to keep a debater’s coach happy, they will give higher than normal points. Some judges also feel compelled to give high points to good debaters even in rounds where they perform poorly, because giving a 28 might keep someone otherwise deserving from breaking. These incentives also create upward pressure on the scales of all other judges. As points rise, suddenly anything below a 29 can “screw” a debater, causing them not to break, so other judges adjust their scales upward to avoid being the judge that ruined a debater’s tournament. Point inflation makes SP less useful as a tiebreaker because it compresses the scale (meaning there is less room to distinguish debaters of different seeds), so SP cease to be indicators of a debater’s skill.

3. Fails to Account for Strength of Opponent

SP are intended to measure the strength of a debater compared to the rest of the field at a tournament, based on their in-round performance. They do not take into account the strength of a debater's opponents. This is a big problem. A debater who goes 4-2 while only losing to debaters who finish undefeated is clearly more deserving of breaking than a 4-2 who loses to average debaters. While both have the same number of wins, the losses are not comparable. It seems wrong that the second debater should break while the first does not.

Potential Solutions

1. Opponent Wins

On the surface, it appears that Opp Wins (OW) at the very least account better for the strength of opponents than SP, so perhaps OW are a good candidate to be the new first tiebreaker. However, while OW do account for opponent strength, they are often completely out of the control of debaters. If a debater happens to be paired against two really good debaters in presets, they could be advantaged over another debater who, through no fault of their own, hit easier opponents in presets.

2. OW + SP

A few tournaments have used OW + SP to break ties - MBA comes to mind, for example. While this does a better job taking into account strength of opponent than SP alone, and a better job taking into account performance than OW, the faults of both still exist.

3. Judge Variance

I think in an ideal world Judge Variance is the best alternative to SP, other than OAPS. For those who do not know, Judge Variance (JV) is a measure of how many more (or fewer) speaker points a debater receives from a judge than the average of that judge’s speaker point distribution over the course of a tournament. So, if all of a debater's judges collectively average 28.5 and the debater receives an average of 29, their JV would be +0.5.
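To make the arithmetic concrete, here is a minimal sketch of the JV calculation as described above. The function name and the per-round numbers are my own assumptions for illustration; this is not how any particular tab program implements it.

```python
from statistics import mean


def judge_variance(points_received, judge_averages):
    """Average points a debater received minus the average of
    their judges' tournament-wide averages (one entry per round)."""
    return mean(points_received) - mean(judge_averages)


# Hypothetical three-round example: the judges collectively average 28.5,
# and the debater averages 29.0 in front of them.
received = [29.0, 29.5, 28.5]
judge_avgs = [28.5, 29.0, 28.0]
print(judge_variance(received, judge_avgs))  # 0.5
```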

While JV seems like a reasonable solution to the tiebreaker problem in theory, there are still a few problems. For instance, JV is meaningless without a large sample size. Some judges only judge a few rounds at a tournament and if those rounds are unusually good or bad, JV scores could be completely skewed.

Some have suggested creating a database that would allow us to use all the rounds a judge has judged in a season (or ever?) to calculate JV. Beyond the obvious logistical difficulties of this solution, a database still would not solve the problem for judges who judge very infrequently, or for first-year-out judges.

Even if these difficulties could be overcome, problems remain. First, point inflation creates a major problem for JV. As scales become compressed, differences in JV become compressed as well. Second, and most importantly, JV does not take into account opponent strength at all.

So, at least on its own, JV is likely not the solution to the SP problem. However, it is possible that if the sample size problem could be overcome, JV could be incorporated into OAPS as a way of more effectively solving the problem of inconsistent scales. This could be done by substituting both debaters' JV scores in each round for their SP.

Introducing Performance Score

1. Explanation

Performance Score (PS) is defined as a debater’s speaker points (SP) in a given round minus the speaker points of their opponent (OSP) in that round:

PS = SP - OSP

For example, if in round 1 Debater A receives a 29 and their opponent, Debater B, receives a 28, then Debater A has a round 1 PS of +1, and Debater B has a PS of -1.

PS differs in a number of ways from raw SP. Instead of measuring the skill of a debater in an absolute sense, PS is only a measurement of the difference between two debaters in a given round. Basically, PS is a measure of the “margin of victory” or the “skill gap” demonstrated by debaters in a given round. The difference between PS and SP is the same as the difference between saying the Minnesota Vikings scored 42 points, and saying the Minnesota Vikings beat the Green Bay Packers by 3 touchdowns (21 points).

Over the course of a tournament a debater’s PS for each round can be added to find their Total Performance Score (TPS):

TPS = PS(round 1) + PS(round 2) + ... + PS(round n), where n is the number of prelim rounds

Below are hypothetical tournament results for a debater. The results show what TPS would look like in practice:

Round   1      2      3      4      5      6      Total
SP      28.5   29     30     29     29.5   28     174
OSP     27     28.5   29     30     27.5   29     171
PS      1.5    0.5    1      -1     2      -1     3
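As a quick check of the hypothetical results above, this sketch recomputes each round's PS and the TPS from the SP and OSP rows. The function name is an assumption of mine, chosen only for illustration.

```python
def performance_scores(sp, osp):
    """Per-round PS: own speaker points minus the opponent's."""
    return [own - opp for own, opp in zip(sp, osp)]


# SP and OSP rows from the hypothetical six-round tournament above.
sp = [28.5, 29.0, 30.0, 29.0, 29.5, 28.0]
osp = [27.0, 28.5, 29.0, 30.0, 27.5, 29.0]

ps = performance_scores(sp, osp)
tps = sum(ps)  # Total Performance Score

print(ps)   # [1.5, 0.5, 1.0, -1.0, 2.0, -1.0]
print(tps)  # 3.0
```

Note that the TPS (3) is also just total SP minus total OSP (174 - 171), which is an easy way to sanity-check a result sheet by hand.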

2. Advantages of PS

PS limits the impact of inconsistent judge scales. As explained above, using SP as a tiebreaker requires us to treat SP from different judges the same, even though we know that the scales used by judges are completely inconsistent. PS allows us to use SP assigned by different judges while diminishing the impact of those inconsistent scales. A PS of +1 can be achieved when Debater A receives 30 SP while Debater B receives 29 SP, or when Debater A receives 28 SP and Debater B receives 27 SP. I believe that the difference in SP is more consistent across judges than the absolute SP they give. The main exceptions are judges who assign both debaters high speaks in an effort to make everyone happy, which brings me to the second advantage of PS.
PS removes the incentive to inflate speaker points. In a world where PS is used as the first tiebreaker instead of SP, the incentives that drive point inflation are much less powerful. The desire to make both debaters happy, or to cover an uncertain/bad decision with good speaks, no longer makes sense. PS is zero-sum, so by assigning one debater higher SP than they deserve, a judge is punishing the better debater. PS would mean that in most cases double 30’s would no longer be a cause of celebration for both debaters, unless of course both debaters end up with very positive TPS numbers. In a case where both debaters are actually very strong, the double 30 could be slightly positive for both.
PS better measures the skill demonstrated in a given round. Absolute speaks do not fully account for a debater’s performance in the round. Often decisions made by one debater, like which positions to run, can affect the performance of the other debater. If a debater knows their opponent is bad at framework debate, then choosing a philosophy-heavy position can cause their opponent to debate worse than they otherwise would. Even the persuasive skills of a debater can negatively affect their opponent's debating by convincing them they are behind in certain parts of the debate. Thus, point differential is a better way of grading debaters - it takes into account both debaters' performance and their mutual influence on each other’s performance.
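The first two advantages can be seen in a tiny sketch. It only shows that PS ignores where a judge's scale sits (a constant offset) and that a double 30 nets out to zero; it does not address judges whose ranges are wider or narrower. The function and variable names are assumptions made for illustration.

```python
def performance_score(sp, osp):
    """PS for one round: own speaker points minus the opponent's."""
    return sp - osp


# Same one-point gap, seen by a judge with a high scale and one with a low scale.
generous_judge = performance_score(30.0, 29.0)
strict_judge = performance_score(28.0, 27.0)

# A double 30 gives neither debater any PS advantage.
double_thirty = performance_score(30.0, 30.0)

print(generous_judge, strict_judge, double_thirty)  # 1.0 1.0 0.0
```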

3. Remaining Issues

PS largely fails to account for the strength of opponents. Like SP, PS still fails to adequately address differences in the strength of opponents. If Debater A debates significantly worse debaters in rounds 1 and 2, their PS might be artificially inflated. In fact, it is possible that PS would be more skewed by opponent strength than SP.

Refining Performance Score: Opponent-Adjusted PS

1. Explanation

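Opponent-Adjusted Performance Score (OAPS) attempts to adjust for the strength of opponents by using the average of the Total PS scores of a debater’s opponents:

OAPS = TPS + [TPS(opponent 1) + TPS(opponent 2) + ... + TPS(opponent n)] / n, where n is the number of prelim rounds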

While the above equation might look complicated, the concept breaks down simply: you take a debater’s Round 1 opponent’s TPS, add it to the Round 2 opponent’s TPS … until you add the TPS score of all opponents that a debater hits in prelims of a tournament. Then you divide by the total number of rounds (take the average).
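The steps just described can be sketched in code. This is a rough illustration, not what TRPC or Tabroom.com does: the ballot format, the debater names, the points, and both function names are hypothetical, and the adjustment is added to the debater's own TPS, which is my reading of the worked examples in this article.

```python
from collections import defaultdict

# Hypothetical prelim ballots: (debater, points, opponent, opponent's points).
ballots = [
    ("A", 29.0, "B", 28.0),
    ("C", 28.5, "D", 28.0),
    ("A", 29.5, "C", 29.0),
    ("B", 28.0, "D", 28.5),
    ("A", 28.0, "D", 28.5),
    ("B", 29.0, "C", 28.5),
]


def total_performance_scores(ballots):
    """TPS per debater: sum of (own points - opponent's points) over all rounds."""
    tps = defaultdict(float)
    for a, points_a, b, points_b in ballots:
        tps[a] += points_a - points_b
        tps[b] += points_b - points_a
    return dict(tps)


def opponent_adjusted_scores(ballots):
    """OAPS per debater: own TPS plus the average TPS of the opponents faced."""
    tps = total_performance_scores(ballots)
    opponents = defaultdict(list)
    for a, _, b, _ in ballots:
        opponents[a].append(b)
        opponents[b].append(a)
    return {
        debater: tps[debater] + sum(tps[o] for o in faced) / len(faced)
        for debater, faced in opponents.items()
    }


tps = total_performance_scores(ballots)
oaps = opponent_adjusted_scores(ballots)
for debater in sorted(oaps):
    print(debater, tps[debater], round(oaps[debater], 2))
```

Because PS is zero-sum, the TPS values across the whole field always add up to zero, which is a handy sanity check on any implementation.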

This process takes into account a couple of things. First, if you sum an opponent’s PS scores, you take into account whether, in general, that opponent performs well or poorly. An opponent that performs well might consistently get +1 or +1.5 as a PS score in rounds - the sum of their PS over prelims would thus be a positive number. Similarly, an opponent that performs poorly might get a lower positive number or a negative number as a result of summing their PS’s across prelims. Second, averaging the totals of a debater’s opponents’ PS scores tells us whether, on average, that debater faced “good” or “bad” opponents, and adjusts the debater’s score accordingly.

So, if a debater has a TPS of +5 and the average of their opponents' TPS is -5, then that debater has performed exactly as expected and would earn an OAPS of 0. If, on the other hand, a debater has a TPS of +5 but their opponents have an average TPS of +1, then that debater would receive an OAPS of 6, to account for their above-average competition.
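Both examples can be checked with a few lines of code. The individual opponent TPS values below are made up so that they average -5 and +1 respectively; only the averages come from the examples above, and the function name is an assumption.

```python
def oaps(tps, opponent_tps):
    """Own TPS plus the average TPS of the opponents faced in prelims."""
    return tps + sum(opponent_tps) / len(opponent_tps)


# Hypothetical six-round schedules.
weak_schedule = [-7.0, -3.0, -5.0, -6.0, -4.0, -5.0]   # averages -5
strong_schedule = [3.0, -1.0, 2.0, 0.0, 1.0, 1.0]      # averages +1

print(oaps(5.0, weak_schedule))    # 0.0
print(oaps(5.0, strong_schedule))  # 6.0
```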

2. OAPS takes into account strength of opponent without the problems of OW.

While OW are hindered by the randomness of presets, and unfairly reward debaters for hitting good opponents regardless of their performance, OAPS avoids both problems. Debaters are only rewarded for performing better than average against a particular debater and only punished for performing below average. If a debater hits a great opponent, who averages a PS of +3, but debates them well enough to earn a PS of -1, they are rewarded for debating better than the average debater against that opponent. On the other hand, if a debater hits that same opponent and receives a PS of -4, they are punished for debating below average relative to that opponent. Unlike OW, they do not receive a bonus because they happened to debate a talented opponent.
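One way to see this round by round: the average of the opponents' TPS is the same as adding up each opponent's average PS per round, so OAPS can be read as the sum, over rounds, of the PS a debater earned plus that opponent's average PS. The sketch below uses that reading with the numbers from this section. It assumes the adjustment is added to the debater's own score, and the function name is hypothetical.

```python
def adjusted_round_result(ps, opponent_average_ps):
    """PS earned in one round plus that opponent's average PS per round."""
    return ps + opponent_average_ps


# Against an opponent who averages +3 per round:
strong_showing = adjusted_round_result(-1.0, 3.0)  # lost the point battle by 1
weak_showing = adjusted_round_result(-4.0, 3.0)    # lost the point battle by 4

print(strong_showing, weak_showing)  # 2.0 -1.0
```

The debater who kept it close comes out ahead (+2), while the one who was beaten by more than that opponent's usual margin comes out behind (-1). Simply being paired against the strong opponent earns nothing by itself.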

3. Potential Objections

"This seems too complicated." -- Eh, not really. While the formula may seem complicated, it is really not more difficult to calculate than JV, something tab software already does. While TRPC and Tabroom.com do not currently calculate OAPS, I suspect that it would not take much to add that to the software.
"That is not what speaker points are for." -- Some might object that PS doesn't make sense because SP are not about relative skill, but rather speaking ability or something else entirely. Sure, that's fair, but it really doesn't make a huge difference. We already primarily use SP to break ties, so even if we believe that speaking ability is how ties ought to be broken, all of the same reasoning offered above still applies.
"This doesn't solve for inconsistent judge scales. If one judge has a bigger range than another, doesn't that mess things up?" -- This was sort of addressed above, but it is a big enough point to address again. It is true that nothing can really be done to solve this issue completely. However, OAPS does a few things to address it. First, OAPS creates an incentive for judges to use more of the scale. Some of the same reasons speaks have become inflated could work in favor of OAPS. Judges who use a compressed scale would end up punishing good debaters, so over time judges would adapt in order to protect the best debaters and to avoid making people angry. Second, OAPS should be more consistent between judges than SP. I believe that there is less inconsistency in the separation judges create between debaters than in the absolute SP they assign.
"I don't know how, but it seems like low-point wins would mess everything up, right?" -- Not at all. Under the current system, a low-point win is a signal that even though a debater lost this specific round, the judge believes they are "better" (at whatever set of skills a judge believes is relevant for SP), and if the two debaters are tied at the end of the tournament, the losing debater should be favored. The same would be true if we chose to switch to OAPS.
"OAPS? That name sucks!" -- I agree. Any suggestions for a better name? If this is going to stick, we need something catchy.

Conclusion

I believe that our current method of determining debater seeding is broken; tournaments should use OAPS instead. SP are too flawed to be the first tiebreaker, and OAPS provides an alternative that solves most, if not all, of the problems presented by SP.

I would love to hear thoughts from readers on how to improve OAPS, additional objections, or alternative metrics in the comments section.

Chris Theis is the owner and Co-Director of Victory Briefs. He won the 2008 and 2009 TOC and currently coaches at Peninsula High School (CA) and Apple Valley High School (MN).