SSBMRank 2022
Notes on SSBMRank 2022
Somehow, I have ended up with the role of data lead for SSBMRank 2022. This meant I was responsible for things like assembling the final list, dealing with outlier votes, normalizing the ballots, and so on. Since I have access to the ballots, I'm going to post some more nuanced information about the list beyond just the point values used to determine the placements. In particular, I'll talk about a few things people keep asking about with respect to this year's rankings.
Standard Deviations, and What They Mean
Probably the single ask with the most important implications is for standard deviations to be published alongside the scores. Below I've provided that information alongside the list, without commentary.
Rank | Tag | Score¹ | Std Dev |
---|---|---|---|
1 | Zain | 10.00 | 0.000 |
2 | aMSa | 9.90 | 0.001 |
3 | Mang0 | 9.89 | 0.002 |
4 | iBDW | 9.77 | 0.005 |
5 | Hungrybox | 9.68 | 0.008 |
6 | Jmook | 9.60 | 0.009 |
7 | Leffen | 9.51 | 0.011 |
8 | Plup | 9.42 | 0.013 |
9 | SluG | 9.27 | 0.019 |
10 | Axe | 9.24 | 0.018 |
11 | lloD | 9.18 | 0.021 |
12 | KoDoRiN | 9.09 | 0.017 |
13 | moky | 8.97 | 0.019 |
14 | Fiction | 8.96 | 0.027 |
15 | S2J | 8.84 | 0.021 |
16 | Aklo | 8.66 | 0.036 |
17 | Wizzrobe | 8.61 | 0.020 |
18 | n0ne | 8.60 | 0.034 |
19 | Joshman | 8.59 | 0.022 |
20 | Pipsqueak | 8.53 | 0.026 |
21 | Ginger | 8.43 | 0.037 |
22 | Soonsay | 8.41 | 0.068 |
23 | Polish | 8.19 | 0.032 |
24 | Krudo | 8.17 | 0.041 |
25 | Magi | 8.15 | 0.025 |
26 | SFOP | 8.01 | 0.035 |
27 | SFAT | 7.97 | 0.027 |
28 | Lucky | 7.80 | 0.035 |
29 | Spark | 7.78 | 0.065 |
30 | null | 7.69 | 0.094 |
31 | Salt | 7.62 | 0.067 |
32 | Zamu | 7.28 | 0.160 |
33 | TheSWOOPER | 7.27 | 0.186 |
34 | bobby big ballz | 7.24 | 0.066 |
35 | Zuppy | 7.18 | 0.094 |
36 | Mekk | 7.13 | 0.134 |
37 | Jflex | 7.11 | 0.117 |
38 | Trif | 7.06 | 0.222 |
39 | Skerzo | 7.01 | 0.082 |
40 | Swift | 7.00 | 0.181 |
41 | 2saint | 7.00 | 0.166 |
42 | Medz | 6.98 | 0.107 |
43 | Rishi | 6.87 | 0.102 |
44 | Smashdaddy | 6.70 | 0.235 |
45 | Franz | 6.54 | 0.265 |
46 | Aura | 6.50 | 0.174 |
47 | Frenzy | 6.38 | 0.232 |
48 | Lunar Dusk | 6.31 | 0.165 |
49 | Professor Pro | 6.10 | 0.136 |
50 | Azel | 5.98 | 0.157 |
51 | FatGoku | 5.96 | 0.284 |
52 | Eddy Mexico | 5.90 | 0.544 |
53 | Kalamazhu | 5.77 | 0.311 |
54 | Panda | 5.76 | 0.118 |
55 | Ralph | 5.70 | 0.398 |
56 | Colbol | 5.66 | 0.183 |
57 | Kürv | 5.65 | 0.227 |
58 | Bbatts | 5.63 | 0.318 |
59 | Ben | 5.55 | 0.146 |
60 | Wally | 5.48 | 0.162 |
61 | Spud | 5.45 | 0.297 |
62 | SDJ | 5.42 | 0.240 |
63 | ChuDat | 5.40 | 0.690 |
64 | Logan | 5.37 | 0.182 |
65 | KJH | 5.29 | 0.377 |
66 | Chem | 5.21 | 0.184 |
67 | Grab | 5.21 | 0.175 |
68 | Mot$ | 5.12 | 0.269 |
69 | Suf | 5.08 | 0.468 |
70 | DrLobster | 4.92 | 0.352 |
71 | Dawson | 4.86 | 0.206 |
72 | Khryke | 4.83 | 0.138 |
73 | Faceroll | 4.82 | 0.165 |
74 | Gahtzu | 4.77 | 0.250 |
75 | Panko | 4.54 | 0.263 |
76 | Drephen | 4.50 | 0.209 |
77 | Palpa | 4.44 | 0.204 |
78 | Chape | 4.29 | 0.250 |
79 | Sirmeris | 4.19 | 0.286 |
80 | Kalvar | 4.13 | 0.246 |
81 | JJM | 3.88 | 0.354 |
82 | essy | 3.86 | 0.204 |
83 | Mad Tyro | 3.81 | 0.234 |
84 | Nickemwit | 3.70 | 0.206 |
85 | Wevans | 3.64 | 0.158 |
86 | Khalid | 3.59 | 0.238 |
87 | TheRealThing | 3.48 | 0.244 |
88 | Umarth | 3.36 | 0.302 |
89 | 404cray | 3.34 | 0.282 |
90 | Slowking | 3.28 | 0.196 |
91 | Eggy | 3.19 | 0.302 |
92 | Kevin Maples | 3.15 | 0.213 |
93 | Voo | 3.14 | 0.236 |
94 | Free Palestine | 2.96 | 0.224 |
95 | Logos | 2.96 | 0.254 |
96 | JustJoe | 2.94 | 0.259 |
97 | Abbe | 2.94 | 0.328 |
98 | Rocket | 2.92 | 0.277 |
99 | nut | 2.81 | 0.145 |
100 | Matteo | 2.79 | 0.183 |
101 | shabo | 2.72 | 0.525 |
How can we interpret this? Standard deviation is a measure of how "wide" the spread of the data is. In our case, we can think of it as a measure of how much the panelists tended to "agree" on a specific player. A high standard deviation means there was a lot of disagreement and the player was controversial; a low standard deviation means there was mostly agreement, and most panelists probably put them around the same spot.
Roughly speaking, there are two effects you can notice. The first is that the problem is heteroscedastic: the further down the list you go, the higher the variance tends to be. The second is that players who attend fewer events tend to have much higher variance. This is easiest to see with ChuDat, who had extremely sparse attendance and whom panelists put all over the place.
In cases where the distribution of votes looks like a nice normal distribution (which is the case for most, but not all, of the players), you can further extrapolate that roughly 68% of the data falls within one standard deviation, 95% within two standard deviations, and 99.7% within three. If we zero in on, for example, Chape, a player with a score of 4.29 and a standard deviation of 0.250, we can say that roughly 68% of panelists have him between 4.04 and 4.54.
It's important to appreciate what this tells us about roughly where these players get voted: 4.04 would move Chape down to 80 (+2), while 4.54 would move him up to 75 (-3). The number matters a lot to individual competitors and to their fans, but in appreciating the data it's helpful to remember that the final number is roughly a "center" value within a range that player seems to fall into.
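If you want to sanity-check that arithmetic yourself, here's a minimal sketch in Python. The scores are copied from the table above; the `hypothetical_rank` helper is purely illustrative and only looks at the small slice of players around Chape.

```python
# One-sigma band for Chape, using the score and std dev from the table above.
score, std = 4.29, 0.250
low, high = score - std, score + std
print(f"~68% of normalized votes fall in [{low:.2f}, {high:.2f}]")  # [4.04, 4.54]

# Scores of the surrounding players (ranks 74-81), Chape himself excluded.
neighbors = {74: 4.77, 75: 4.54, 76: 4.50, 77: 4.44,
             79: 4.19, 80: 4.13, 81: 3.88}

def hypothetical_rank(s):
    # Where a hypothetical score would land: the best rank in this slice,
    # plus one for every neighbor who scores strictly higher.
    return min(neighbors) + sum(1 for v in neighbors.values() if v > s)

print(hypothetical_rank(low))   # 80 (two spots down)
print(hypothetical_rank(high))  # 75 (three spots up, tied with Panko)
```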
Point Estimates for Irregular Distributions
That "center" value is actually a somewhat trickier thing to determine than it might appear at first. In fact, sometimes it's simply impossible.
There's certainly a temptation to just take the list of votes, compute the mean, and use that as "the panel's score". What we're essentially after is a single point estimate of the voting distribution, with the eventual aim of sorting the point estimates to generate a final list. If you have normally distributed data, the mean is a great point estimate. In fact, the vast majority of players have distributions which are either normal, or close enough to normal that the mean works just fine.
However, this problem becomes annoying when a player's voting distribution is non-normal. There's a specific nuance here: if you have a player who is hard to rate, and the panelists disagree on where they should go, you still usually get a normal distribution, just a very wide one. You get non-normal distributions when there are groups of panelists who agree internally, but the groups disagree with each other. For example: a player who does very well at locals but very poorly at majors, where "major fan" panelists will rank them low and "local enjoyer" panelists will rank them high.
Our job is to take a distribution which looks like this and find some sort of point estimate which accurately represents the votes cast by the panel. The figure² above gives an example of how the different measures of center simply answer different questions: the mean answers a question like "how much are people paid here" whereas the median answers "what is the typical person paid here". One is a statement about the total amount of money in the dataset; the other is a statement about the typical worker. You can't really generate a point estimate which properly conveys everything you need; a point estimate here necessarily answers a specific question instead.
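To make that tension concrete with numbers, here's a tiny illustrative sketch. The votes are made up and aren't anyone's actual ballot scores; they just mimic the "two camps" situation described above.

```python
import statistics

# Made-up votes for one hypothetical player: two camps that each agree
# internally but disagree with each other.
major_fans     = [4.8, 4.9, 5.0, 5.0, 5.1, 5.2, 5.3]   # "only majors count"
local_enjoyers = [7.8, 7.9, 8.0, 8.0, 8.1, 8.2]        # "locals count too"
votes = major_fans + local_enjoyers

print(statistics.mean(votes))    # ~6.41: a score almost nobody actually gave
print(statistics.median(votes))  # 5.3: the "typical" ballot, ignoring the smaller camp
```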
What we want, roughly, is "who is better", which is vague and poorly defined! We have to confront all sorts of confusing thought experiments to answer this question. If we have a truly bimodal distribution, and taking the mean yields a value nobody is happy with, is that "the will of the panel"? If two players have equal medians, representing their performances at majors, but one player has a pool of votes reflecting their superior performances at smaller tournaments, shouldn't we reward that player instead of ignoring those votes altogether? There are arguments for and against each measure of center, and "just picking one" will screw over some subset of players.
We ended up running with an unweighted average of each player's mean score and median score. Doing this allowed us to exclude the fewest ballots as "outliers" while not overly punishing any particular player purely based on which measure of center we used. Overall, this entire problem does very little to the list: we made this decision largely without looking at the list output to avoid bias, and we found that it only really affected a handful of players. In fact, most players had means and medians within 1, or at most 2, spots of each other. But for a couple of players, this avoided them getting punished based purely on what we chose.
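In code, the measure we landed on is only a few lines; here's a sketch. The toy ballots are made up, and the real pipeline also normalizes votes and removes outliers before this step.

```python
import statistics

def point_estimate(votes):
    # The measure described above: an unweighted average of the mean and
    # the median of one player's normalized, outlier-filtered votes.
    return (statistics.mean(votes) + statistics.median(votes)) / 2

# Toy example with made-up ballots: the final ordering is just the players
# sorted by their point estimate, highest first.
ballots = {
    "Player A": [9.1, 9.2, 9.0, 9.3, 9.1],
    "Player B": [8.0, 9.5, 9.4, 8.1, 9.5],   # two camps, so mean and median disagree
}
ranking = sorted(ballots, key=lambda tag: point_estimate(ballots[tag]), reverse=True)
print(ranking)
```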
Maybe a detail which could have just been ignored. But that's not really how I do things! A project for the summer rank is to run experiments with a weighted average to get something which better approximates how people think a panel should behave in these situations.
Violin Plots
Something I did about 5 years ago was generate violin plots for KayB's West Coast Bias in SSBMRank and Why it Doesn't Exist. These were pretty popular at the time, and I think people appreciated the insight they provided into how the panel voted for specific players. Since I have access to the ballots again, I'll do the same as I did then (with outlier votes removed so nobody gets too upset).
Notable about these is that Zain's "violin" is invisible. This is because he was number 1 on every ballot, and had a standard deviation of 0 since the normalization moved every rank-1 vote to the same value. Not much of a violin to draw there.
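If you want to make this sort of chart from your own data, here's a minimal seaborn sketch. The long-format DataFrame layout below (one row per panelist-player vote) is an assumption for illustration, not the actual ballot format, and the numbers are made up.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical long-format data: one row per (panelist, player) vote.
votes = pd.DataFrame({
    "player": ["Player A"] * 5 + ["Player B"] * 5,
    "score":  [9.1, 9.2, 9.0, 9.3, 9.1, 8.0, 9.5, 9.4, 8.1, 9.5],
})

# One violin per player, trimmed to the observed range of votes.
ax = sns.violinplot(data=votes, x="player", y="score", cut=0)
ax.set_title("Distribution of panel votes per player")
plt.show()
```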
How Close is Close
I recognize most of you are probably here about Mango vs aMSa, and I have tricked you into scrolling past a bunch of charts and numbers before getting here. Please understand this is for my own sanity, as I would prefer discussion around this topic to be had among people who are willing to look at lots of charts and numbers before engaging.
First, without any commentary, let's look at a zoomed in violin plot of Mango and aMSa.
I hope people can appreciate, looking at this, that it truly could have gone either way. Fortunately for me, the measure-of-center issue I talked about earlier didn't actually adjust the order here at all (aMSa leads all measures), but I hope people can simply look at this image and understand that there's a clear argument for either player to be rank 2.
You might look at the number attached to the playercard, with aMSa 0.01 above Mango, and conclude that all 29 panelists agreed that aMSa was 0.01 over Mango. This is not only not the case (I personally voted Mango 2nd, for whatever that's worth), but it's extremely far from the case. In fact, I'll go so far as to reveal how many panelists voted each way:
- Mango Higher: 12
- aMSa Higher: 15
- Ties: 2
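For what it's worth, this kind of split is trivial to compute once you have the ballots in hand. A sketch, assuming each ballot is a dict mapping tag to that panelist's score (an assumption about the data layout, not the actual format):

```python
def head_to_head(ballots, a, b):
    # Count how many panelists scored player `a` above, below, or equal
    # to player `b`.
    higher = lower = ties = 0
    for ballot in ballots:
        if ballot[a] > ballot[b]:
            higher += 1
        elif ballot[a] < ballot[b]:
            lower += 1
        else:
            ties += 1
    return higher, lower, ties

# Toy usage with three made-up ballots:
ballots = [{"Mang0": 9.9, "aMSa": 9.8},
           {"Mang0": 9.8, "aMSa": 9.9},
           {"Mang0": 9.9, "aMSa": 9.9}]
print(head_to_head(ballots, "aMSa", "Mang0"))   # (1, 1, 1)
```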
I'm not privy to the voting distributions of any of the previous top 100 lists, so what I am about to say is complete conjecture: I truly do not know if there has ever been a top 10 spot as contested as 2022's 2nd spot. There's so much discourse about X tournament counting, or Y tournament not counting, or what would have happened if Z had gone differently. I truly believe this to be emblematic of how electrifying 2022 was, despite a huge chunk of it being ripped away by Omicron. The top of the rankings is the most interesting it has ever been - maybe the most it will ever be. If everybody agreed on everything, it would be boring.
Closing Thoughts
There's a lot I didn't go into in this writeup (normalization, outlier suppression, etc.), but these were just some thoughts I found interesting enough to share with others. With more time, I'm confident this process will grow more sophisticated. However, I'm really just glad we got the list out, and that people generally seemed to like the result. It meant a lot to me to be trusted to fill this important sort of role, and I'm thrilled that the death of Panda was not, in turn, the death of a community-agreed ranking of the top players. We got it all done under a severe time crunch (and I had covid during my time working on all of this), so if it went as well as it did under those conditions, I'm sure it will only get better from here.
Footnotes:
1. Score is displayed to two decimal places for readability.