Exodus 16/Numbers 11 Smackdown

More matter with less art, we hear some spoilsports telling us. We came for data; give us data. Fair enough, and here you go. The thing is, though, you never know. Sometimes you complain about the menu and you wind up with manna. Other times, you complain, and you get so much of what you’re after that, as God so trenchantly put it, “it come out of your nostrils and it be loathsome unto you.” Keep us posted as to which it is; we know you won’t be bashful about it.

We don’t have to tell you that there’s been a Second Revolution in sabermetric analysis. The first one was led by Bill James and ratified by the publication of Moneyball in 2003. The second one is going on right now, and is probably better exemplified by Fangraphs than by anything else. Take the progression in the way that people have approached BABIP. When the First Revolution began, nobody was thinking about Batting Average on Balls in Play. The fundamental insight about it—one of the things that marked the beginning of the Second Revolution—was that an abnormally high or low BABIP is hard for either a hitter or a pitcher to maintain from season to season. Expected BABIP reflects and tries to adjust for that difficulty. When, early in the Second Revolution, people started to calculate xBABIP, they just used actual BABIP to do so. Then they started factoring in the type of batted ball (line drive, ground ball, fly ball). Now, thanks to Jeff Zimmerman’s work on Fangraphs, they’re also considering whether each of those batted balls entailed soft, medium, or hard contact. In the same vein, we now know not only about balls and strikes, or first-pitch strikes, but also swinging strikes at specific pitches of specific velocities with specific vertical and horizontal movements. We have heat maps of pitched and batted balls and of zone contact, and probably heat maps of heat, reflecting temperature differences in different portions of the batter’s box.
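If you like to see the arithmetic, here’s a minimal sketch of the batted-ball version of xBABIP. The league rates below are rough stand-ins we’ve plugged in for illustration—they are not Zimmerman’s actual model, which also folds in contact quality:

```python
# A minimal sketch of the batted-ball xBABIP idea described above.
# The league rates are ILLUSTRATIVE ASSUMPTIONS, not anyone's actual model;
# real versions also account for soft/medium/hard contact.

LEAGUE_BABIP_BY_TYPE = {"LD": 0.685, "GB": 0.239, "FB": 0.121}  # assumed rough rates

def xbabip(ld_pct: float, gb_pct: float, fb_pct: float) -> float:
    """Expected BABIP from a hitter's batted-ball mix (fractions summing to ~1)."""
    return (ld_pct * LEAGUE_BABIP_BY_TYPE["LD"]
            + gb_pct * LEAGUE_BABIP_BY_TYPE["GB"]
            + fb_pct * LEAGUE_BABIP_BY_TYPE["FB"])

print(round(xbabip(0.21, 0.45, 0.34), 3))  # a typical mix -> 0.293
```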

We’d say that this Second Revolution entails two connected but distinct features. First, and more obviously, it’s more granular. The smallest unit of analysis available in the first wave was, generally, the plate appearance. No one knew where individual pitches were winding up; no one knew how hard batted balls were hit; no one knew, with precision, the consequences of having a particular count on a particular batter. That’s because no one was out there gathering these data, or at least no one was doing it who was willing to tell the public about it. That’s changed.

The second feature is that the analysis is less moored to actual on-field outcomes—the zero-sum ball-vs.-strike, on-base-vs.-out, run-vs.-no-run, and win-vs.-loss fabric of ordinary fandom. It’s thus less moored to fantasy categories and, as far as anyone knows, fantasy success. No one’s using, say, Swinging Strikes Induced on Outside Sliders Between 85 and 90 MPH as a category in a Fantasy league. It’s possible—even likely—that the new-wave stats are being used retrospectively in contract negotiations or arbitration. But for our purposes, these stats matter only if they’re predictive of the things we care about, and then only if they’re more predictive than the stats that are visible to the naked eye. And we just don’t know—yet—whether all the additional information we’ve got enables people (including you and us) who predict fantasy-relevant baseball performance to predict it better. If you’d do about the same as the Zimmermans of the Fantasy universe using, say, a weighted average of the plain-vanilla stats of a player’s last three seasons, why would you bother with something else?
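For the record, that plain-vanilla baseline might look something like the following sketch. The 5/4/3 weights are the classic Marcel choice; we’ve left out Marcel’s regression-to-the-mean and aging adjustments to keep the illustration short:

```python
# A sketch of the "weighted average of the last three seasons" baseline.
# Weights 5/4/3 echo Tom Tango's Marcel system; regression to the mean and
# aging adjustments are omitted here for brevity.

def naive_forecast(stat_t1: float, stat_t2: float, stat_t3: float) -> float:
    """Project next season's stat from the three most recent seasons (t1 = most recent)."""
    weights = (5, 4, 3)
    seasons = (stat_t1, stat_t2, stat_t3)
    return sum(w * s for w, s in zip(weights, seasons)) / sum(weights)

# e.g., a hitter with 28, 24, and 19 HR over his last three seasons:
print(round(naive_forecast(28, 24, 19), 1))  # -> 24.4
```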

So the question is, does any of the new stuff do any good? And the answer seems to be that it does. Since 2010, the estimable Wil Larson has gathered the prognostications of whichever experts are willing to go on record with their predictions, computed their accuracy, and determined for each year who did best. In 2010 there were just 6 forecasters in Larson’s sample; in 2011 there were 12; in 2014 he was up to 19. We wondered: with the explosion in stats, have these guys’ predictions collectively gotten more accurate? We wanted to compare the same forecasters over as long a period as possible so that they had a chance to incorporate the new knowledge produced by the new-wave stats into their forecast models, and we also wanted enough forecasters to provide a meaningful sample. So we chose the forecast years of 2011 and 2014 for the 10 forecasters who submitted predictions to Larson for both years for the 10 standard Rotisserie categories. (Actually, there were only 9 for hitters, because Steamer didn’t start with hitters until 2012.)
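If you want to replicate the selection, the idea is just an intersection: keep only the forecasters who show up in both years. Here’s a hypothetical pandas sketch (the file and column names are ours, not Larson’s):

```python
# Sketch of the sample selection: keep only forecasters who submitted
# projections in BOTH 2011 and 2014. File and column names are assumptions.
import pandas as pd

forecasts = pd.read_csv("larson_forecasts.csv")  # hypothetical file
by_year = forecasts.groupby("forecaster")["year"].agg(set)
keepers = by_year[by_year.apply(lambda yrs: {2011, 2014} <= yrs)].index
sample = forecasts[forecasts["forecaster"].isin(keepers)
                   & forecasts["year"].isin([2011, 2014])]
```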

We begin by reporting how the forecasters did across the 5 hitting categories. We use Root Mean Square Error, or RMSE, which measures the typical gap between predicted and actual outcomes, as our measure of accuracy. The lower RMSE (signifying the better forecast) for each forecaster/category pair is in bold.
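For the curious, here’s a minimal sketch of the computation, with made-up numbers:

```python
# RMSE as we use it here: for each forecaster/category, the root of the
# mean squared gap between players' projected and actual numbers.
import math

def rmse(projected: list[float], actual: list[float]) -> float:
    """Root Mean Square Error across a set of players."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(projected, actual))
                     / len(projected))

# e.g., HR projections for three made-up hitters:
print(round(rmse([25, 30, 12], [22, 35, 15]), 3))  # -> 3.786
```

Here are the hitting forecasts: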

| Forecaster | Year | Runs | HR | RBI | BA | SB |
|---|---|---|---|---|---|---|
| AGG PRO | 2011 | 25.729 | 8.377 | 26.829 | 0.04 | 6.889 |
| | 2014 | **22.64** | **8.23** | **23.34** | **0.024** | **6.42** |
| Cairo | 2011 | 25.609 | 8.488 | 25.755 | 0.037 | 6.929 |
| | 2014 | **21.55** | **7.87** | **22.53** | **0.025** | **6.3** |
| CBS SPORTS | 2011 | 28.261 | **9.387** | 28.345 | 0.041 | 7.794 |
| | 2014 | **26.28** | 9.94 | **26.9** | **0.027** | **7.14** |
| ESPN | 2011 | 27.575 | **9.455** | 28.568 | 0.04 | 7.257 |
| | 2014 | **25.88** | 9.88 | **27.25** | **0.028** | **7.04** |
| Fangraphs Fans | 2011 | 30.001 | **8.918** | 32.646 | 0.041 | **7.532** |
| | 2014 | **27.2** | 9.24 | **28.98** | **0.029** | 7.62 |
| Marcel | 2011 | 22.958 | 8.108 | 24.186 | 0.039 | 7.054 |
| | 2014 | **21.67** | **7.62** | **22.76** | **0.027** | **7.04** |
| Razzball | 2011 | 26.766 | 9.331 | 28.791 | 0.041 | 7.961 |
| | 2014 | **24.57** | **8.9** | **27.45** | **0.027** | **7.14** |
| Rotochamp | 2011 | 29.162 | 9.032 | 27.048 | 0.04 | 7.746 |
| | 2014 | **21.73** | **8.49** | **24.6** | **0.026** | **6.93** |
| Wil Larson | 2011 | **24.474** | **8.066** | 25.113 | 0.042 | **6.729** |
| | 2014 | 24.88 | 8.75 | **24.37** | **0.029** | 7.08 |
| 2014 RMSE < 2011 RMSE? | YES | 8 | 5 | 9 | 9 | 7 |
| | NO | 1 | 4 | 0 | 0 | 2 |
| Probability of RMSE change distribution | | 3.52% | 49.22% | 0.4% | 0.4% | 14% |
| Matched-pair t-statistic (significance level)* | | -3.70 (.006) | -0.154 (.880) | -5.78 (.000) | -28.44 (.000) | -2.50 (.036) |

*Wilcoxon signed-rank tests yielded similar significance levels.

So: pretty much everyone’s predictions got better in pretty much every hitting category from 2011 to 2014. The exception is home runs, which is something you might actually have expected, because home runs (as opposed to “power”) in a single season are kind of random. Say you project a player to hit 25 home runs. As Ron Shandler has noted, if the guy ends up with 22 or 28 home runs, it may just be a matter of a few gusts of wind. The higher-volume counting stats (runs, RBI, stolen bases) are less subject to that kind of randomness, and they all show significant improvement. We did what’s called a t-test—a statistical technique designed to determine whether apparently significant results might instead be random. Nine forecasters is a small sample, but the results (embodied in the bottom row; a negative t-stat indicates a smaller RMSE, and thus a better prediction, in 2014) make randomness a pretty unlikely explanation. Even the fact that offense in general was down over the four years, and thus that the category mean and mean forecast errors might be suppressed, can’t explain away the improvement.
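If you’d like to check our arithmetic, the bottom rows can be reproduced from the table itself. Here’s a scipy sketch using the Runs column:

```python
# Re-running the Runs column from the table above: a paired t-test on each
# forecaster's 2011 vs. 2014 RMSE, plus the Wilcoxon check from the footnote.
from scipy import stats

runs_rmse_2011 = [25.729, 25.609, 28.261, 27.575, 30.001, 22.958, 26.766, 29.162, 24.474]
runs_rmse_2014 = [22.64, 21.55, 26.28, 25.88, 27.2, 21.67, 24.57, 21.73, 24.88]

t, p = stats.ttest_rel(runs_rmse_2014, runs_rmse_2011)
print(f"t = {t:.2f}, p = {p:.3f}")            # -> t = -3.70, p = 0.006
print(stats.wilcoxon(runs_rmse_2014, runs_rmse_2011))  # similar significance level
```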

Now for the pitching predictions, in the four standard roto categories other than saves that Wil Larson tracks. (We’re assuming it’s self-evident why he omits them.)

| Forecaster | Year | Wins | ERA | WHIP | K |
|---|---|---|---|---|---|
| AGG PRO | 2011 | **4.67** | 1.14 | 0.17 | **44.19** |
| | 2014 | 4.94 | **0.992** | **0.144** | 59.18 |
| Cairo | 2011 | **4.29** | 1.11 | **0.16** | **45.03** |
| | 2014 | 4.96 | **1.022** | 0.17 | 58.76 |
| CBS SPORTS | 2011 | **5.19** | 1.14 | 0.17 | **46.55** |
| | 2014 | 5.47 | **0.995** | **0.143** | 67.18 |
| ESPN | 2011 | **4.78** | 1.16 | 0.17 | **43.76** |
| | 2014 | 5.4 | **0.994** | **0.141** | 63.31 |
| Fangraphs Fans | 2011 | **4.5** | 1.13 | 0.17 | **45** |
| | 2014 | 5.56 | **1.005** | **0.141** | 64.09 |
| Marcel | 2011 | **4.71** | 1.14 | 0.16 | **43.58** |
| | 2014 | 4.9 | **1.003** | **0.143** | 57.93 |
| Razzball | 2011 | **3.97** | 1.12 | 0.17 | **44.06** |
| | 2014 | 5.25 | **0.99** | **0.149** | 62.89 |
| Rotochamp | 2011 | **4.09** | 1.11 | 0.17 | **40.96** |
| | 2014 | 5.04 | **0.989** | **0.145** | 64.18 |
| Steamer | 2011 | **3.97** | 1.13 | 0.17 | **40.08** |
| | 2014 | 4.94 | **1.006** | **0.15** | 57.89 |
| Wil Larson | 2011 | **4.35** | 1.14 | 0.17 | **43.84** |
| | 2014 | 4.77 | **0.992** | **0.148** | 56.62 |
| 2014 RMSE < 2011 RMSE? | YES | 0 | 10 | 9 | 0 |
| | NO | 10 | 0 | 1 | 10 |
| Probability of RMSE change distribution | | 0.2% | 0.2% | 1.96% | 0.2% |
| Matched-pair t-statistic (significance level)* | | 5.58 (.001) | -19.92 (.001) | -5.68 (.001) | 16.29 (.001) |

*Wilcoxon signed-rank tests yielded similar significance levels.

Hmmm. Kind of puzzling. The rate-stat predictions (ERA and WHIP) are clearly improving, but the counting stats (Ws and Ks) aren’t. Why might this be? Well, it could be that the forecasting just isn’t getting better. We considered a different explanation: that the characteristics of the pitchers chosen for Larson’s sample have changed. If the starter/reliever ratio in the sample was higher in 2014 than in 2011, the mean wins and strikeouts would have been higher, and the RMSE for those counting stats would have increased with them. We checked; that’s not it, either. We’re partial to the explanation offered by Fangraphs Rookie of the Year Alex Chamberlain: projection systems aren’t designed to predict pitcher injuries, which tend to be catastrophic and can produce enormous counting-stat forecast errors. Pitcher injuries have increased over the last few years. More injuries equals less accurate prediction. Hitter injuries have also increased, but (1) they’re generally less devastating, and (2) they’re arguably more foreseeable. Larson could test this hypothesis by converting K projections to K% or K/9—that is, by converting a counting stat to a rate stat. If he does, and if the rate projections turn out to have improved, the Chamberlain hypothesis proves correct, and Chamberlain gets the sabermetric Nobel Prize, if not canonization by the sabermetric Vatican.
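The conversion itself is trivial. Here’s a sketch, with a made-up stat line:

```python
# Turning a strikeout projection (a counting stat) into K/9 and K% (rate stats).
# TBF = total batters faced. The stat line below is invented for illustration.

def k_per_9(strikeouts: float, innings: float) -> float:
    """Strikeouts per nine innings pitched."""
    return 9 * strikeouts / innings

def k_pct(strikeouts: float, batters_faced: float) -> float:
    """Strikeouts as a share of batters faced."""
    return strikeouts / batters_faced

# e.g., 180 K in 190 IP against roughly 780 batters faced:
print(round(k_per_9(180, 190), 2))  # -> 8.53
print(round(k_pct(180, 780), 3))    # -> 0.231
```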

So what have we told you that you didn’t know nine paragraphs ago? That we (not just Fangraphs nobility, but also us serfs) here in Fantasyland seem to be getting better at forecasting. One way to help pin this down would be to pick, for each forecaster, a reasonably sized sample of hitters and pitchers with, say, at least 400 ABs or 150 IPs both projected and actual in 2011, and then another same-sized sample in 2014, and see whether, once injuries are corrected for, the predictions improve. If you’ve got time to do this, be our guests.
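If you do, the filtering step might look something like this hypothetical pandas sketch (the column names and thresholds are our invention):

```python
# A sketch of the injury-control check proposed above: restrict each year's
# sample to players with real playing time both projected and actual.
import pandas as pd

def full_season_hitters(df: pd.DataFrame, year: int, min_ab: int = 400) -> pd.DataFrame:
    """Hitters whose projected AND actual at-bats cleared the bar in `year`."""
    in_year = df[df["year"] == year]
    return in_year[(in_year["proj_ab"] >= min_ab) & (in_year["actual_ab"] >= min_ab)]

# Compare RMSE on full_season_hitters(df, 2011) vs. full_season_hitters(df, 2014)
# to see whether the improvement survives once injuries are (roughly) removed.
```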

We know that we’ve been assuming something that hasn’t been proved: that the new-wave stats play a part in the improved accuracy of the predictions. We don’t know what the guys whose predictions Larson looked at were using to generate those predictions. Perhaps they were using new-wave stats to create proprietary algorithms of brain-cambering complexity; perhaps they were reading the entrails of bulls. We also realize that what is arguably the most interesting question here is one we haven’t explored: assuming the new-wave stats are helping, which one or ones are helping most? Maybe xBABIP is the way, the truth, the light, whereas xFIP is a snare and a delusion. All we know is that something is enabling these particular forecasters to improve their numbers, and that it probably has more to do with xBABIP than with bull entrails, though, to be entirely fair, unlike bull entrails xBABIP isn’t enhanced if you add to it a bit of star anise, cinnamon, and grated lemon rind.

One last thing we’re sure you’re saying to yourselves: why didn’t the Birchwood Brothers use R²? Those charlatans! Here, verbatim, is the explanation for this, offered by the BB who’s chiefly responsible for the present installment: “To momentarily digress, we did not look at R², as the logic is that the regression (Actual = a + b1*forecast + error) eliminates biases (i.e., the coefficient b1 is different than 1) in the forecasting system. Our problem here is that if we knew the ex post biases ex ante – or even had some unbiased estimate of them – we could adjust the forecasts. Wil Larson has suggested that some systems show systematic biases and hence RMSE understates their forecast value. If these systems show a systematic ex ante bias, however, one should be able to improve their forecasting (as measured by RMSE) with an unbiased estimate of coefficient b1 – perhaps by using last year’s b1 or some moving average of previous estimated b1.”
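For the curious, here’s a small numpy sketch of that adjustment, on made-up numbers: estimate last season’s a and b1, then rescale this season’s forecasts before scoring them:

```python
# A sketch of the bias correction described in the quote: fit last season's
# Actual = a + b1 * Forecast, then use those coefficients to de-bias this
# season's forecasts. All numbers below are invented for illustration.
import numpy as np

last_forecast = np.array([20.0, 25.0, 30.0, 15.0, 35.0])
last_actual   = np.array([17.0, 24.0, 26.0, 14.0, 30.0])

# np.polyfit returns [slope, intercept] for a degree-1 fit.
b1, a = np.polyfit(last_forecast, last_actual, 1)

this_forecast = np.array([22.0, 28.0, 18.0])
adjusted = a + b1 * this_forecast  # de-biased forecasts, to be scored by RMSE
print(round(b1, 3), round(a, 3), adjusted.round(1))
```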

So that’s settled. Glad you asked. See you next time.

The Birchwood Brothers are two guys with the improbable surname of Smirlock. Michael, the younger brother, brings his skills as a former Professor of Economics to bear on baseball statistics. Dan, the older brother, brings his skills as a former college English professor and recently-retired lawyer to bear on his brother's delphic mutterings. They seek to delight and instruct. They tweet when the spirit moves them @birchwoodbroth2.