The Great Valuation System Test: The Results
Yesterday, I shared the exciting news that my project partner Jason Bulay and I completed The Great Valuation System Test, which covered a whopping 13 fantasy baseball player valuation systems. As usual, I feel like I could have done a slightly better job of explaining our process and goal. Essentially, we wanted to determine which valuation system most accurately converts a player’s statistical line (accounting for his position or ignoring it) into a dollar value. I eagerly awaited all the data so I could run the correlations, fingers crossed that the system I use, the REP method, would perform well, if not best. So it’s time to unveil the results.
The Results
The first set of results represents the correlations between the total dollar values earned by each team’s players and the total standings points that team achieved.
| System | Correlation | R-squared |
| --- | --- | --- |
| SGP – Winning Fantasy Baseball Denoms | 0.9697 | 0.9403 |
| Todd Zola REP | 0.9672 | 0.9355 |
| Jason’s Z-scores | 0.9670 | 0.9350 |
| ESPN | 0.9663 | 0.9337 |
| Razzball – Old System | 0.9633 | 0.9279 |
| Razzball – 0% Pos Adjustment | 0.9633 | 0.9279 |
| SGP – Tony Fox Denoms | 0.9630 | 0.9273 |
| Zach’s Z-scores | 0.9628 | 0.9271 |
| Last Player Picked | 0.9616 | 0.9246 |
| Razzball – 25% Pos Adjustment | 0.9615 | 0.9244 |
| Razzball – 50% Pos Adjustment | 0.9579 | 0.9175 |
| Razzball – 75% Pos Adjustment | 0.9524 | 0.9070 |
| Razzball – 100% Pos Adjustment | 0.9453 | 0.8936 |
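For anyone curious about the mechanics, here is a minimal sketch of the test calculation, with made-up numbers standing in for the real team totals: pair every team’s dollars earned (per one valuation system) with its standings points, compute the Pearson correlation, and square it for R-squared.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical totals for a handful of teams (not real test data).
dollars_earned = [312, 287, 265, 240, 221]
standings_pts = [98.5, 91.0, 84.5, 77.0, 70.5]

r = pearson_r(dollars_earned, standings_pts)
print(f"correlation = {r:.4f}, R-squared = {r * r:.4f}")
```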
Let’s begin interpreting these results by looking at the big picture. With correlations in the mid-to-high 0.90s, the systems all do a fairly strong job of valuing players accurately. That’s a good thing. But perhaps it’s a bit of a surprise that the correlations aren’t even higher. It suggests there is still work to do to improve player valuations and inch closer to a perfect 1.0 correlation. That, of course, is unlikely to ever happen, but I don’t see why we couldn’t push the correlation up into the 0.98-0.99 range.
The next overarching observation is that these valuation systems all do a remarkably similar job of valuing players. The top 10 systems all sport correlations between 0.9615 and 0.9697, a rather small gap. We see more separation between systems when looking at the R-squared, though. The grouping there is much more clear-cut, suggesting that the valuation system you choose to employ does actually matter, which is not the impression you would come away with from the correlations alone.
Our winner is crowned! And sure enough, it’s actually the SGP method that I abandoned over 10 years ago. The R-squared also suggests that it performed like the Mike Trout of valuation systems, with a meaningful gap between it and the next system. I was still satisfied to see the Zola REP system finish second, since I was nervous about its performance given that my auction style is driven by my dollar values and I stick closely to them.
There’s something very interesting about the SGP results though. If you recall, we actually tested two different sets of denominators. The top system used the denominators from the book Winning Fantasy Baseball and those denoms weren’t even derived from leagues using the same format (the format included a second catcher). But then we find the second set of SGP results using Tony Fox’s denoms from his league sitting in a lowly seventh place. What this tells us shouldn’t be a surprise — the denominators you choose when employing the SGP method are extremely important and values could change dramatically depending on what you settle on.
I knew Tony’s denoms were from a different format (a much smaller roster size) and looked far different from the other set. He convinced me to test those values anyway, and I’m glad I did. We now know that the SGP method does indeed work very well, but you need to make sure your denominators come either from your own league’s history or from another set of leagues with the exact same format as yours.
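For readers who haven’t worked with SGP before, here is a rough sketch of the core idea for counting categories only, using invented denominators (not the Winning Fantasy Baseball or Tony Fox denoms). Rate stats like AVG need an extra step that measures a player’s impact on a baseline team’s rate, which this sketch skips.

```python
# Hypothetical denominators: the amount of each stat needed to climb
# one spot in that category's standings. These are NOT the tested values.
DENOMS = {"R": 19.0, "HR": 8.5, "RBI": 19.5, "SB": 7.5}

def sgp(stat_line: dict[str, float], denoms: dict[str, float]) -> float:
    """Standings Gain Points: stats scaled by per-point denominators."""
    return sum(stat_line[cat] / denoms[cat] for cat in denoms)

player = {"R": 95, "HR": 30, "RBI": 90, "SB": 12}
print(f"SGP = {sgp(player, DENOMS):.2f}")
```

Dollar values then follow from each player’s share of the pool’s total SGP above replacement, scaled to the league’s auction budget. You can see why the denominators matter so much: change them and every player’s relative share shifts.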
In the caveats section of yesterday’s article, I mentioned that two users calculating values with the same system may still get different results, and specifically noted that I knew this was the case for the two z-score methods tested. We see from the correlation table that Jason’s z-scores ranked third, while Zach’s ranked eighth. I can’t be sure that Jason followed the exact same procedure as Zach did (he must not have), but the fact that Jason’s values performed pretty well does validate the method as a legitimate option, though perhaps not the best one. Well, at least the version of the method Jason used.
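To illustrate where two z-score implementations can diverge, here is one plausible recipe with a made-up player pool. Decisions like which players to include in the pool, whether to re-standardize after removing below-replacement players, and how to weight rate stats are exactly the spots where two versions of the method could differ.

```python
from statistics import mean, pstdev

def z_score_values(pool: list[dict[str, float]], cats: list[str]) -> list[float]:
    """Sum of per-category z-scores for every player in the pool."""
    means = {c: mean(p[c] for p in pool) for c in cats}
    sds = {c: pstdev([p[c] for p in pool]) for c in cats}
    return [sum((p[c] - means[c]) / sds[c] for c in cats) for p in pool]

# Hypothetical three-player pool, two categories.
pool = [
    {"HR": 35, "SB": 5},
    {"HR": 20, "SB": 25},
    {"HR": 12, "SB": 10},
]
print(z_score_values(pool, ["HR", "SB"]))
```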
I was quite surprised by the poor performance of the Last Player Picked calculator. I used to compare my values to theirs to make sure I wasn’t totally off, and their values always seemed to match up pretty well with mine, better than other sets of values did.
Aside from correlating the dollars earned to the standings points achieved, I also correlated each team’s league rank in dollars earned with its place in the standings. So the correlation might pair, say, a second-place finish in the standings with a dollars earned total that ranked third. I initially didn’t feel this would yield results as meaningful as the original test above, but Jason convinced me to run it anyway. Unfortunately, the results were head-scratching. Rather than share the same correlation table as above, below is a table that compares each system’s rank when correlating dollars earned to standings points (the Rank – Points column, which matches the ordering of systems in the table above) with its rank when correlating dollars earned rank to standings rank (the Rank – Rank column).
| System | Rank – Points | Rank – Rank |
| --- | --- | --- |
| SGP – Winning Fantasy Baseball Denoms | 1 | 1 |
| Todd Zola REP | 2 | 7 |
| Jason’s Z-scores | 3 | 2 |
| ESPN | 4 | 11 |
| Razzball – Old System | 5 | 4 |
| Razzball – 0% Pos Adjustment | 6 | 5 |
| SGP – Tony Fox Denoms | 7 | 6 |
| Zach’s Z-scores | 8 | 8 |
| Last Player Picked | 9 | 9 |
| Razzball – 25% Pos Adjustment | 10 | 3 |
| Razzball – 50% Pos Adjustment | 11 | 10 |
| Razzball – 75% Pos Adjustment | 12 | 12 |
| Razzball – 100% Pos Adjustment | 13 | 13 |
For the most part, each system’s two rankings match exactly or are off by just one spot. SGP remains the king in both correlation tests. But three systems moved dramatically in the rank correlations: Zola’s REP method dropped from second to seventh, ESPN fell from fourth to 11th, and Razzball – 25% Pos Adjustment climbed from 10th to third. Jason and I couldn’t figure out what could possibly be causing these discrepancies. The only explanation I can come up with is the very concern that discouraged me from calculating these correlations in the first place: by looking only at rankings, we completely ignore the gaps between teams, which can cause funny things to happen in the correlation results. The overall correlations were also lower, which makes sense for the same reason.
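A toy example (invented numbers) makes the concern concrete: with two near-ties, the raw-value correlation stays sky high because the big gaps dominate, while converting to ranks turns a trivial flip into a full one-rank error.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs: list[float], ys: list[float]) -> float:
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sqrt(sum((x - mx) ** 2 for x in xs))
                  * sqrt(sum((y - my) ** 2 for y in ys)))

def ranks(xs: list[float]) -> list[int]:
    """1 = best (highest value)."""
    order = sorted(range(len(xs)), key=lambda i: -xs[i])
    r = [0] * len(xs)
    for place, i in enumerate(order, start=1):
        r[i] = place
    return r

points = [98.0, 97.5, 70.0, 69.5]   # teams 1-2 and 3-4 nearly tied
dollars = [310, 305, 180, 182]      # one tiny flip at the bottom

print(pearson_r(dollars, points))                # ~0.9996: gaps dominate
print(pearson_r(ranks(dollars), ranks(points)))  # 0.8: flip costs a full rank
```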
The Next Steps
This test proved that the valuation system you choose does indeed matter, though admittedly your projections remain far more important. But it isn’t enough to know that, on the whole, one valuation system performed better than another. Perhaps most crucial is getting individual player types right. When projection systems are evaluated, tests usually involve an RMSE (root-mean-square error) calculation. I’m no expert on that type of calculation, but as I understand it, RMSE penalizes every individual miss, so a projection system won’t perform well in an accuracy test if its individual player forecasts are wrong, even if those errors cancel out and the average of the entire projected population turns out to be correct.
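To make that concrete, here is a quick RMSE sketch with invented numbers: two projection sets share the same population average, but only one is accurate player by player, and RMSE separates them immediately.

```python
from math import sqrt
from statistics import mean

def rmse(projected: list[float], actual: list[float]) -> float:
    """Root-mean-square error: squaring the misses means errors in
    opposite directions never cancel the way they do in an average."""
    return sqrt(mean((p - a) ** 2 for p, a in zip(projected, actual)))

actual = [25, 25, 25, 25]   # hypothetical HR totals
proj_a = [24, 26, 25, 25]   # small individual misses
proj_b = [15, 35, 10, 40]   # wild misses that happen to average out

print(mean(proj_a), mean(proj_b))  # both average 25 -- "correct" overall
print(rmse(proj_a, actual))        # ~0.71  -- accurate player by player
print(rmse(proj_b, actual))        # ~12.75 -- terrible player by player
```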
Unfortunately, there’s no way that I’m aware of to test which system values each individual player most accurately. Does one system overvalue a certain position or category relative to the others, only for it to get washed out in the overall results because it undervalues another position or category? It’s very possible. I would love to test this. Any suggestions?
Tomorrow I’ll take a look at some players with the biggest range in dollar values between the systems. We probably can’t determine which system is right, but it will be interesting to discuss and see if we can discover any trends.
Mike Podhorzer is the 2015 Fantasy Sports Writers Association Baseball Writer of the Year and three-time Tout Wars champion. He is the author of the eBook Projecting X 2.0: How to Forecast Baseball Player Performance, which teaches you how to project players yourself. Follow Mike on X @MikePodhorzer and contact him via email.
Comments

Excellent, Mike. You wondered yesterday if you used a big enough sample size for this experiment. I have nothing to add to that except more questions. Is it possible you would get different results in five years? Or with a smaller sample of three or four years of data? Any off-the-cuff thoughts on this?
Since this is the first test I’ve seen, I’m as unsure as you. Sure, it’s possible, but maybe the test was good and the results stabilize quickly? I really don’t know.
We looked at some preliminary results after the first 10 leagues, and they were very consistent with the final results. So I’m pretty sure we have a good sample size at 50 leagues, and with that many teams, individual players have been combined in pretty much every way you can imagine. There could be bias from using just one season of stats, though. If we do a follow-up, we’ll use a different season.