You may have noticed that FanGraphs now feeds batted ball data, courtesy of Baseball Info Solutions, into its leaderboards. The day the data appeared, my mind buzzed with ways they could be useful in improving our understanding of a hitter’s batting average on balls in play (BABIP).
Mike Podhorzer already augmented previous attempts at devising an equation for expected batting average on balls in play (xBABIP) for hitters by incorporating elements of a hitter’s power, speed, plate discipline and batted ball tendencies. So, with fresh numbers in hand, I embarked on a journey to further improve the ever-evolving xBABIP. However, I sought to do so by using only batted ball data. Basically, I intended to develop a convenient xBABIP equation, one that can be computed using variables found almost entirely on the same page.
What I ultimately developed is a hitter xBABIP that is more a complement to Mike’s xBABIP than a substitute; that is, it is arguably no better than Mike’s equation, and no worse, but it is different. I will explain what I mean in due time.
I chose batted ball variables that I thought would correlate well with BABIP (obviously):
- LD%, True FB% and True IFFB%: In his introduction to the new batted ball data, Tony Blengino demonstrated the frequencies with which certain types of batted balls turn into hits. Thus, optimal proportions of certain batted ball types could maximize BABIP. True IFFB% represents infield fly balls as a percentage of all balls in play, not just fly balls; it is calculated by multiplying IFFB% and FB%. True FB% denotes all fly balls minus infield flies, again as a percentage of all balls in play. Econometric note: Because the fractions of every batted ball type sum to one (100 percent), including all of them alongside the intercept would make them perfectly collinear, so I must omit one of them or else the regression will do it for me. That is why you do not see ground ball rate here.
- Hard%: Hard%, one of the new statistics, indicates how often a player hit a ball hard. (Who knew?) Eno Sarris was as surprised as I was to find that there is “virtually no correlation” between line drive rate and hard-hit percentage.
- Oppo%: Oppo%, another new statistic, indicates how often a player hit to the opposite field. One could argue in favor of using Pull%, the percentage of balls pulled, which would likely be negatively correlated with BABIP, especially for guys who encounter a ton of infield shifts. But righties experience shifts less often, so Pull% might not adequately capture the effect we might seek.
- Spd: I originally wanted to include infield hit rate (IFH%) so the equation would consist entirely of batted-ball variables. The idea was to capture (skilled) hits by bunts and (lucky) hits by dinkers and dribblers. However, per Jeff Zimmerman’s insight, I reconsidered including it because infield hits are not necessarily exogenous to BABIP. A hitter’s speed score (Spd), on the other hand, is independent of infield hits; in other words, infield hits are a function of speed, not the other way around.
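For anyone rebuilding the two "True" rates from a leaderboard export, the definitions above can be sketched in a few lines of Python. This is only an illustration: the function name and the sample rates are made up, and it assumes every rate is expressed as a fraction (0.35, not 35).

```python
def true_batted_ball_rates(fb_pct, iffb_pct):
    """Split FB% into True IFFB% and True FB%, both as shares of balls in play.

    fb_pct   -- fly balls / balls in play (fraction, e.g. 0.35)
    iffb_pct -- infield flies / fly balls (fraction, e.g. 0.10)
    """
    true_iffb = iffb_pct * fb_pct  # infield flies as a share of all balls in play
    true_fb = fb_pct - true_iffb   # fly balls that are not infield flies
    return true_iffb, true_fb


# Hypothetical hitter: 44% grounders, 21% liners, 35% flies, 10% of flies on the infield.
gb_pct, ld_pct, fb_pct, iffb_pct = 0.44, 0.21, 0.35, 0.10
true_iffb, true_fb = true_batted_ball_rates(fb_pct, iffb_pct)

# The four shares add up to one, which is why one of them (GB%) has to be
# left out of a regression that also includes an intercept.
total_share = gb_pct + ld_pct + true_fb + true_iffb
print(f"True IFFB% = {true_iffb:.3f}, True FB% = {true_fb:.3f}, total = {total_share:.3f}")
```

The last check is the econometric note in miniature: once three of the shares are known, the fourth carries no new information.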
This model specification could be considered an expansion on Jeff’s work regarding hitter analytics in which he uses the aforementioned Hard% and Spd to generate expected BABIP values.
I limited the sample to all qualified hitters from 2002 through 2014, good for 1,971 observations. What follows are the results from the OLS regression:
xBABIP = .1975 - .4383*(True IFFB%) - .0914*(True FB%) + .2594*LD% + .1822*Hard% + .1198*Oppo% + .0042*Spd
Adjusted R-squared = .456
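To plug a hitter into the equation, a small Python function is enough. The coefficients below are copied from the regression above; the function name and the sample hitter are hypothetical. Judging by the size of the coefficients, the sketch assumes the batted ball rates enter as fractions (0.21 rather than 21) and that Spd is on its usual FanGraphs scale.

```python
def xbabip(true_iffb, true_fb, ld, hard, oppo, spd):
    """Expected BABIP from the regression coefficients quoted in the article.

    All rates are assumed to be fractions of balls in play; spd is the speed score.
    """
    return (
        0.1975
        - 0.4383 * true_iffb
        - 0.0914 * true_fb
        + 0.2594 * ld
        + 0.1822 * hard
        + 0.1198 * oppo
        + 0.0042 * spd
    )


# Hypothetical hitter, not a real player line.
example = xbabip(true_iffb=0.035, true_fb=0.315, ld=0.21, hard=0.30, oppo=0.25, spd=4.5)
print(f"xBABIP = {example:.3f}")  # prints "xBABIP = 0.311"
```

Reading the signs: infield flies drag the estimate down hardest, while line drives, hard contact and opposite-field contact push it up.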
In light of Mike’s adjusted R-squared of .424, it’s clear that more granular batted ball data, on their own, barely improve our understanding of BABIP, let alone significantly. (I’ll interject and say it’s unwise to judge a model strictly by its R-squared, as there are a variety of statistical tests one can perform to test a model’s validity. But, alas, it is commonly used and more easily understood.) Even the improvement in year-to-year correlation is only the slightest upgrade over Mike’s results:
Y1 BABIP to Y2 BABIP: .4072
Y1 xBABIP to Y2 BABIP: .4712
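A year-to-year check like the one above can be reproduced with a plain correlation on paired seasons. The sketch below assumes the quoted figures are Pearson correlation coefficients, which the article does not state outright; the function name is hypothetical and the four-hitter sample is invented purely to show the mechanics.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented numbers: each position is the same hitter in consecutive seasons.
y1_babip = [0.310, 0.285, 0.340, 0.295]
y1_xbabip = [0.305, 0.292, 0.322, 0.300]
y2_babip = [0.301, 0.290, 0.318, 0.305]

r_babip = pearson_r(y1_babip, y2_babip)    # Y1 BABIP to Y2 BABIP
r_xbabip = pearson_r(y1_xbabip, y2_babip)  # Y1 xBABIP to Y2 BABIP
print(f"BABIP: {r_babip:.3f}, xBABIP: {r_xbabip:.3f}")
```

With real data, each list would hold one entry per hitter who qualified in both seasons.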
So what can we conclude? For one, there isn’t necessarily a “correct” or “better” way to approach xBABIP, at least not yet. I’m sure if we threw the kitchen sink at the problem, everything would fall into place. But the sink would probably break and it would be really messy and no one would want to clean it up and that’s why we can’t have nice things. From what I observe, the new spray statistics (Pull%, Cent%, Oppo%) replace, rather than augment, absolute average angle, as used by Mike and provided by Baseball Heat Maps, and isolated power (ISO) serves as a proxy for the various degrees of contact quality as represented by Hard%, Med% and Soft%. Ultimately, judging by my attempt here, which is far from the be-all and end-all, more precise batted ball data have not brought us much closer to explaining away the luck component of BABIP.
While my equation appears to be “better” at first glance, we won’t know for sure until the xBABIPs from Mike’s and my equations are compared side by side (in the form of, say, minimizing root mean squared error, or RMSE). Until then, indulge in the xBABIPs of 2015’s qualified hitters provided below. “Diff” represents the difference between xBABIP and BABIP; I conditionally formatted the cells so that blue indicates an overachiever and red an underachiever.
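The side-by-side RMSE comparison mentioned above would look something like the following in Python. It is a sketch under stated assumptions, not a result: the function name is hypothetical, the two prediction lists stand in for the two equations' outputs, and every number is a placeholder.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predictions and observed values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Placeholder values only: observed BABIP and two competing xBABIP estimates
# for the same three hitters.
observed_babip = [0.301, 0.290, 0.318]
xbabip_model_a = [0.305, 0.292, 0.322]
xbabip_model_b = [0.296, 0.300, 0.310]

rmse_a = rmse(xbabip_model_a, observed_babip)
rmse_b = rmse(xbabip_model_b, observed_babip)
print(f"model A RMSE = {rmse_a:.4f}, model B RMSE = {rmse_b:.4f}")
```

The equation with the lower RMSE on the same set of hitters misses by less on average, which is a more direct comparison than lining up two R-squared values from different samples.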