Hitter xBABIP v2.0: A Long-Needed Update
by Alex Chamberlain
July 13, 2016

Purposes of this post:

- Glass half-full: To update a year-old xBABIP equation that estimates a hitter’s batting average on balls in play (BABIP) based on his batted ball data;
- Glass half-empty: To pay restitution to readers and beautiful human Mike Podhorzer for damages incommensurable, wrought by the careless oversight of the initial version of the equation;
- Regardless, a glass with some amount of water in it: To provide, for those lacking attention or care, the updated version of equation aforementioned, found here for one’s immediate gratification sans linguistic obstruction.

An Explanation

Last year, when FanGraphs obtained Baseball Info Solutions (BIS) batted ball data, I was, in a word, jazzed. Kind of like the Statcast revolution now (but not nearly as popular or flashy), FanGraphs in conjunction with BIS bestowed upon the sabermetric community more great tools to describe and predict player performance.

I developed an equation to estimate a hitter’s expected batting average on balls in play, or xBABIP. This was not an original idea (other iterations of xBABIP already existed), and I freely admitted then, as I do now, that my equation was not necessarily superior to any other. My equation simply offered new value by (1) using the new batted ball data and (2) incorporating metrics from a single, easy-to-locate source to make the equation easier to calculate.

Unfortunately, the equation has long been overdue for a makeover, right from the very start, basically. See, in my attempt to create a relatively simple equation, I overlooked a critical element of the data, and that oversight has produced imperfect xBABIP estimates.

In an attempt to make this a learning experience for anyone who cares, let’s take a look at batted ball data by season at the league level.
For the statistically disinclined (a categorization of Homo sapiens into which I might snugly fit at this point, honestly), look at hard-hit rate (Hard%) and notice how much it varies over the years. It climbs almost nine percentage points between 2002 (the first year of BIS batted ball data available on FanGraphs) and 2016.

For the purposes of econometric or statistical modeling, one would argue it’s important to hold constant that kind of year-to-year variation, especially if there exist measurement errors or changes to BIS’s measurement methodology over time. These are distinct possibilities given (1) we are not privy to the nature of BIS’s data collection/coding processes and (2) the glut of technological advancements achieved in the last decade and a half.

Which gets to the heart of the matter: a reader commented on Mike’s post yesterday, remarking that my equation overestimates individual-hitter and league-wide xBABIPs across the board for 2016. It shook me a little. What the hell is going on? I asked myself at work while definitely not crying into my bowl of lukewarm borscht.

Ah, yes: the equation needed to control for the aforementioned year-to-year variation in batted ball metrics. Hard% fluctuates the most, but other metrics, such as line drive rate (LD%) and infield fly ball rate (IFFB%), waver a bit over time as well. Accounting for what we call year fixed effects addresses this issue, but it also adds a layer of complexity I hoped to prevent from pervading my equation. I’ll address this momentarily.

I also discovered, while digging through the data, that the way I constructed my sample inadvertently created a selection bias. I originally limited the sample to only hitters who qualified for the batting title in a given season. Funny how things work; I learned yesterday (and maybe you’re learning now), through some manipulation of the data, that qualified hitters inherently hit for higher BABIPs than non-qualified hitters.
BABIP by Plate Appearance Increments

| PA | BABIP | ΔBABIP |
|---|---|---|
| 150-249 | .2924 | n/a |
| 250-349 | .2956 | +.0032 |
| 350-449 | .2981 | +.0025 |
| 450-549 | .2988 | +.0007 |
| 550-649 | .3034 | +.0046 |
| 650+ | .3094 | +.0060 |

Calculated as averages of BABIPs across all players and seasons from the sample

There’s a pretty distinct trend here: players who accrue more playing time generally hit for higher BABIPs.

(Quick aside: Despite the evidence, I did not include PAs as an independent variable in the model. I tested it, and it very slightly improves the explanatory power of the model. But from a theoretical standpoint, the best one could do to justify including PAs as a variable is to treat it as a proxy for talent or as a way to capture selection bias by managers and front offices, things that a player’s talent should generally facilitate on its own. Otherwise, including PAs essentially posits that BABIP correlates with playing time, a hypothesis that, in a vacuum, seems fundamentally untrue. With that said, roughly every 60 PAs correlates with a 1-point increase in xBABIP. In other words, player Y’s xBABIP would be expected to clock in five points worse if he accrued only 300 PAs instead of 600 PAs.)

This makes sense, in a way: full-time players typically are more talented from a batted-ball standpoint relative to their benchwarming peers. Also, there are always instances every year of managers quitting on prospects and rookie hitters because they suffer BABIP-fueled slumps upon their debuts. It’s Major League Baseball’s natural selection process. However, by accidentally conforming to it via excluding hitters with less playing time (aka arguably less-talented players) from my sample, I artificially inflated xBABIP calculations.

All of this, well… it’s a shame. In a way, I feel like I let down the small portion of the sabermetric community who cares with a suboptimal metric. I know it has been cited elsewhere, so it’s embarrassing to know I led people ever-so-slightly astray.
It’s probably not as bad as I’m making it out to be. The mismeasurements likely didn’t make a huge difference on a micro, player-level scale. And, in the grand scheme of things, the tool was made to identify large disparities between certain players’ BABIPs and xBABIPs, something the equation could do even if it was a few points off.

Regardless, I’m here to make reparations with an updated xBABIP equation. The catch is this: from now on, you’ll need to incorporate an additional constant to control for the yearly batted ball environment. I’ll list those constants below accordingly.

HERE’S THE NEW EQUATION:

xBABIP = .1770 - .3085*(True IFFB%) - .1285*(True FB%) + .3684*LD% + .0798*Oppo% + .0045*Spd + .2287*Hard% + Year Constant

Year Constants:

- 2002: 0
- 2003: -.00867
- 2004: .00050
- 2005: -.00701
- 2006: .00786
- 2007: .00077
- 2008: .00184
- 2009: .00813
- 2010: .00199
- 2011: .00541
- 2012: -.01034
- 2013: -.01131
- 2014: -.00755
- 2015: -.00828
- 2016: -.01015**

**See disclaimer below.

General notes:

- The sample now consists of 5,771 player-seasons of at least 150 plate appearances from the 2002 through 2016 seasons.
- Fixed effects reduce the adjusted R-squared a tad from .456 to .424. I’m not very concerned by this result, mostly because of the increased variance in the player sample and additional variables in the model specification (each year constant serves as its own variable).
- Plug into the xBABIP equation the year constant term from the list above that corresponds with the season of the data you’re using.
- You have to calculate True IFFB% and True FB% by hand, so to speak, prior to inserting them into the equation:
  True IFFB% = FB% * IFFB%
  True FB% = FB% * (1 - IFFB%)
- Input Spd as-is: for example, 2.5, rather than converting it to a percentage (.025).
- If you manually crunch the numbers for league-wide xBABIP again, you’ll find that it’s still a few ticks too high for 2016 (in the .305 to .306 range, compared to .301-ish actual).
It’s almost annoying that it isn’t closer after all this. But with the year fixed effects in place, the model has had its say. Hey, maybe the league and its improved offensive state do deserve a higher BABIP than what it has produced thus far.

Disclaimer: As the 2016 season unfolds, its batted ball data will very slowly evolve. By the end of the season, the year constant for 2016 may have changed, resulting in what should be slightly different estimations than what you see above. (This, in a single point, highlights my reluctance to include year fixed effects: it makes everything more complicated and requires frequently updating the inputs.) Fortunately, it’s likely that the composition of the 2016 season’s batted ball data will not change much from now through the end of the season, ensuring that estimates in September will, at most, be only minimally different from how they look now. Just know that when 2016’s year constant changes, all of them will; and when 2017 data is introduced, all of the year constants will change again, if just slightly; and the cycle continues. In light of this, I’ll post periodic updates of the equation to keep pace.

Epilogue

Thanks, all, for being patient and understanding. (I’m ignoring those of you who don’t possess either trait.) If there’s a silver lining to this, it’s that open-source data and processing triumph once again. I’m a champion of open-source programming, and I’m glad procedural transparency enabled others to catch a mistake, ultimately for the benefit of the community.

Feedback is appreciated as always. If you have questions about the xBABIP of specific players… use the equation, silly!
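
For anyone who would rather not punch the v2.0 equation into a calculator, here is a minimal Python sketch of it. The coefficients and year constants are copied from the equation in this post; everything else is an assumption made for illustration: the function and variable names (`true_rates`, `xbabip`, `YEAR_CONSTANTS`) are hypothetical, the rate stats (FB%, IFFB%, LD%, Oppo%, Hard%) are assumed to be entered as decimals (0.35 for 35%) while Spd goes in as-is, and the example hitter is invented rather than a real player.

```python
# Intercept and coefficients copied from the v2.0 equation in this post.
INTERCEPT = 0.1770
COEF_TRUE_IFFB = -0.3085
COEF_TRUE_FB = -0.1285
COEF_LD = 0.3684
COEF_OPPO = 0.0798
COEF_SPD = 0.0045
COEF_HARD = 0.2287

# Year constants as listed in this post (the 2016 value may change, per the disclaimer).
YEAR_CONSTANTS = {
    2002: 0.0, 2003: -0.00867, 2004: 0.00050, 2005: -0.00701,
    2006: 0.00786, 2007: 0.00077, 2008: 0.00184, 2009: 0.00813,
    2010: 0.00199, 2011: 0.00541, 2012: -0.01034, 2013: -0.01131,
    2014: -0.00755, 2015: -0.00828, 2016: -0.01015,
}


def true_rates(fb_pct, iffb_pct):
    """Return (True IFFB%, True FB%) from FB% and IFFB%, both given as decimals."""
    true_iffb = fb_pct * iffb_pct
    true_fb = fb_pct * (1 - iffb_pct)
    return true_iffb, true_fb


def xbabip(fb_pct, iffb_pct, ld_pct, oppo_pct, hard_pct, spd, year):
    """Estimate xBABIP; rates are decimals (assumption), Spd is entered as-is."""
    if year not in YEAR_CONSTANTS:
        raise KeyError(f"no year constant listed for {year}")
    true_iffb, true_fb = true_rates(fb_pct, iffb_pct)
    return (
        INTERCEPT
        + COEF_TRUE_IFFB * true_iffb
        + COEF_TRUE_FB * true_fb
        + COEF_LD * ld_pct
        + COEF_OPPO * oppo_pct
        + COEF_SPD * spd
        + COEF_HARD * hard_pct
        + YEAR_CONSTANTS[year]
    )


# Invented hitter for illustration only, not a real player's stat line.
example = xbabip(fb_pct=0.35, iffb_pct=0.10, ld_pct=0.21,
                 oppo_pct=0.25, hard_pct=0.32, spd=4.5, year=2016)
print(f"{example:.3f}")  # prints 0.306
```

Keeping the year constants in a lookup table means that when they get re-estimated, only that table needs to change.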