Introducing the Newest Hitter xBABIP
by Mike Podhorzer
February 10, 2022

It’s been a looooooooong journey toward understanding what underlying skills drive a hitter’s BABIP ability. No matter how much understanding we have gained over the years, it has been a struggle to develop an equation that produces an R-squared much over 0.50. That’s not terrible, but when my hitter xHR/FB equation spits out an impressive 0.826 R-squared, I continue to strive for better.

I shared my last hitter xBABIP equation almost exactly five years ago, and since then, I have yet to see a better one. After the 2021 season, I went back to the drawing board, as I usually do for my xMetrics, to figure out if there was something I was missing. Or perhaps there was some time-consuming data gathering task I could do that would improve the equation, maybe one I had been too lazy to do in the past.

Sure enough, I suddenly had several epiphanies and excitedly pulled the data I needed and ran a regression. Then I stared at a lower R-squared than my 2017 equation and my excitement waned. Huh?

I was about to give up and just stick with my latest equation until I remembered something. Statcast already calculates a hitter’s expected batting average, or xBA! They do so by using the holy grail: individual batted ball data. Rather than use season totals and averages like all my xMetrics, Statcast’s xBA calculates a hit probability for every single batted ball and then totals those probabilities up for an expected hit total, which is used to calculate xBA. That’s a far superior method.

So why reinvent the wheel if I could just drop the idea of creating a better xBABIP and simply go with what Statcast is selling? Well, Statcast’s xBA metric only accounts for exit velocity, launch angle, and as its glossary states “on certain types of batted balls, Sprint Speed.” What’s missing here and makes an obvious difference is horizontal direction, as in pull, center, or opposite.
Statcast includes a filter for that in its search, but doesn’t incorporate that data into its calculation. A side effect of the missing variable is that xBA also does not account for defensive shifts, which we know have had a dramatic effect on some hitters as the strategy has become more common over the years.

Knowing xBA’s shortcomings, I figured it might be prudent to attempt to improve upon the metric by incorporating what’s missing. First though, I needed to calculate an implied Statcast xBABIP, because Statcast doesn’t calculate it for us. How, you ask? Let’s go over the steps:

1. Go to this custom leaderboard for all the stats you need. Let’s take Marcus Semien as an example.
2. In Excel, or in your math ninja head, multiply Semien’s xBA of 0.245 by his AB of 652 to calculate his expected hits, or xH. So, xH = Statcast xBA * AB. In this case, xH = 159.74, versus 173 actual hits.
3. Then, Statcast xBABIP = (Statcast xH - HR) / (AB - HR - SO + SF). For Semien, Statcast xBABIP = (159.74 - 45) / (652 - 45 - 146 + 3) = 0.247. That compares to an actual BABIP of .276, if you were curious.

Just like that, you have officially calculated an implied Statcast xBABIP!

Once I calculated Statcast’s xBABIP, I wanted to compare how well it correlated with actual BABIP during the entire Statcast era from 2015-2021. Shockingly, its R-squared was just 0.462. If you recall from the intro, I mentioned developing an equation with an R-squared of just over 0.50, as my 2017 xBABIP sported an R-squared of 0.538. So Statcast’s xBABIP, which uses individual batted ball data, explained actual BABIP worse than my equation that used season totals and averages.

That was confirmation that Statcast’s xBABIP was ripe for improvement. The hope then became that I could develop a version of xBABIP that included Statcast xBABIP as a variable and that it would produce a better R-squared than 0.538, and hopefully much better.
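The implied Statcast xBABIP steps described earlier can be sketched in a few lines of Python. This is only a minimal illustration of the arithmetic, not anything Statcast publishes; the function name implied_statcast_xbabip is made up for this example, and the inputs are the Semien figures quoted in the steps.

```python
def implied_statcast_xbabip(xba, ab, hr, so, sf):
    """Back out an implied xBABIP from Statcast xBA and standard counting stats."""
    # Expected hits: xH = Statcast xBA * AB
    xh = xba * ab
    # Balls in play exclude home runs and strikeouts, and add back sacrifice flies.
    bip = ab - hr - so + sf
    if bip <= 0:
        raise ValueError("hitter has no balls in play")
    # Remove home runs from expected hits, then divide by balls in play.
    return (xh - hr) / bip


# Marcus Semien's 2021 line as quoted in the article.
semien = implied_statcast_xbabip(xba=0.245, ab=652, hr=45, so=146, sf=3)
print(f"{semien:.3f}")  # 0.247
```

The same function can be applied row by row to a downloaded leaderboard once the xBA, AB, HR, SO, and SF columns are loaded.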
I decided to do some horizontal direction research by using Statcast search to find out the league average BABIP on various combinations of batted ball variables. Below are the results:

Batted Ball Type BABIP
Batted Ball Type                              BABIP
Opposite Shift IF Alignment GB                0.527
Opposite GB                                   0.376
Opposite Standard IF Alignment GB             0.356
Opposite Strategic IF Alignment GB            0.353
Straightaway Strategic IF Alignment GB        0.264
Strategic IF Alignment GB                     0.256
Standard IF Alignment GB                      0.252
Ground Ball                                   0.247
Straightaway GB                               0.241
Straightaway Standard IF Alignment GB         0.239
Straightaway Shift IF Alignment GB            0.236
Pull Standard IF Alignment GB                 0.233
Pull Strategic IF Alignment GB                0.215
Pull GB                                       0.214
Shift IF Alignment GB                         0.213
Pull Shift IF Alignment GB by R               0.166
Pull Shift IF Alignment GB                    0.127
Pull Shift IF Alignment GB by L               0.109

The Ground Ball row in the middle of the table is the control. The league average BABIP on all grounders has been .247. I then focused on Opposite GB, or grounders hit to the opposite field, and the three buckets of pulled grounders into the shift that appear at the bottom. I included the BABIP on all pulled grounders while the infield alignment was shifted, as well as that same BABIP broken out by batter handedness. You’ll notice that left-handed hitters have been hurt more than right-handed hitters when pulling grounders into the shift.

This was an aha! moment. I knew immediately that Opposite GB% would become a variable and, given the stark BABIP difference between right-handed and left-handed batters on pulled grounders into the shift, I would include both Pull Shift IF Alignment GB As R% and Pull Shift IF Alignment GB As L%.

But I wasn’t done quite yet. Despite the Statcast glossary page quoted above that mentions accounting for Sprint Speed, I have continued to find that it’s not accounted for enough.
When I have sorted hitters by BABIP - xBABIP differential, the underperforming group had a significantly higher Sprint Speed than the overperforming group. That discovery confirmed that I still needed to add a speed variable. My preference was to add HP to 1B, for obvious reasons. Unfortunately, it’s not available for every player, and I wasn’t going to use a different xBABIP equation for hitters with no HP to 1B data. So I settled on using Sprint Speed, as I had no other choice. That became the fourth variable I would add to Statcast xBABIP for my new and (hopefully) improved Pod xBABIP.

After running my regression, I was thrilled: R-squared improved and even jumped over my 2017 equation. However, there was a problem when looking at individual seasons, as my 2015 to 2019 league xBABIP marks came close to actual league BABIP marks, but my 2020 and 2021 marks did not. Perhaps you could figure out why from this table:

League BA vs Statcast xBA
Season   BA      Statcast xBA   Diff
2015     0.259   0.244          0.015
2016     0.259   0.248          0.012
2017     0.259   0.251          0.009
2018     0.252   0.244          0.008
2019     0.256   0.248          0.009
2020     0.246   0.245          0.001
2021     0.248   0.246          0.002

From 2015 to 2019, Statcast’s xBA consistently sat well below actual BA. That’s pretty odd, as an xMetric should come pretty close to the actual metric for the entire league during a season. On an individual player basis, it’s going to be all over the map, but it shouldn’t be for the league in aggregate. But then beginning in 2020 and continuing in 2021, Statcast’s xBA was suddenly very close to actual BA.

So I reached out to Mike Petriello to find out if he had an explanation, and this was our short Twitter convo:

today of all days, Mike

I don't know the precise answer off-hand but my guess (emphasis on guess here) would be the switch to much better tracking hardware starting in 2020 limited the number of fill-in-the-gaps that had to be done.
Mike Petriello (@mike_petriello), December 2, 2021

So it would seem as if there was an actual change that led to improved Statcast xBA calculations beginning in 2020. No wonder my 2020 and 2021 calculations were off! Those seasons were using the same equation as the 2015 to 2019 seasons, when xBA was further away from actual BA and needed to be corrected. But the 2020 and 2021 xBA marks closely matched actual BA and did not need to be corrected.

So I decided to create two separate regression equations: one to use for 2015 to 2019 and another for 2020 to 2021 and all future seasons. That solved the issue and I was back in business.

It’s now time to reveal the equations:

Pod xBABIP 2015-2019 = -0.01876
    + (Statcast xBABIP * 0.84139)
    + (Sprint Speed * 0.00276)
    + (Pull Shift IF Alignment GB As R% * -0.08450)
    + (Pull Shift IF Alignment GB As L% * -0.12089)
    + (Opposite GB% * 0.14197)

Pod xBABIP 2020 & Beyond = -0.02373
    + (Statcast xBABIP * 0.93377)
    + (Sprint Speed * 0.00175)
    + (Pull Shift IF Alignment GB As R% * -0.11485)
    + (Pull Shift IF Alignment GB As L% * -0.11195)
    + (Opposite GB% * 0.11621)

Here is a table of adjusted R-squared comparisons:

Comparison of Adjusted R-Squared With BABIP
Seasons     Pod xBABIP   Statcast xBABIP
2015-2019   0.538        0.459
2020-2021   0.593        0.542
Overall     0.551        0.462

Those are big improvements for Pod xBABIP over Statcast xBABIP. The 2020-2021 marks are higher because the sample size of player seasons was much smaller.

Note that while these marks aren’t significantly higher than my 2017 equation, it’s not a true apples-to-apples comparison. I used a minimum of 200 non-home run balls in play for my equations this time, versus 400 at-bats back then. If you can believe it, the fewest at-bats a hitter recorded while still putting 200 non-homers in play was 224.
Not only does using a balls in play minimum versus an at-bat minimum make far more sense for an xBABIP equation, but the smaller sample size of balls in play I used for this new equation puts it at a disadvantage versus the 2017 version. If I used a 400 at-bat minimum for my 2015-2019 equation instead of the 200 balls in play minimum, my adjusted R-squared would rise to 0.564, from the 0.538 in the above table. So, it’s another confirmation that this latest Pod xBABIP is superior to my 2017 version.

Now for a quick explanation on pulling the data for the four variables used with Statcast xBABIP in the equations:

Sprint Speed
1. Found on the Statcast leaderboards.
2. Click the Download CSV button to the right of the top filters.

Pull Shift IF Alignment GB As R% & Pull Shift IF Alignment GB As L%
1. Perform a Statcast search by filtering Batted Ball Direction = Pull, IF Alignment = Shift, Batted Ball Type = Ground Ball, and Batter Handedness = Right or Left.
2. Click the disk button at the top right of the search results that when hovered over says “Download Results Comma Separated Values File”.
3. Column A, “pitches”, is your value. It’s the total number of pulled ground balls hit into a shifted infield.
4. Perform the same search, but switch the Batter Handedness filter to the other hand to ensure you download the data for each handedness.

Opposite GB%
1. Perform a Statcast search by filtering Batted Ball Direction = Opposite and Batted Ball Type = Ground Ball.
2. Click the disk button at the top right of the search results that when hovered over says “Download Results Comma Separated Values File”.
3. Column A, “pitches”, is your value. It’s the total number of opposite field ground balls hit.

Once you have these totals, you will need to calculate what percentage of balls in play (which excludes home runs) these batted ball buckets represent.
Calculate balls in play (BIP) using the stats you have already downloaded as:

BIP = AB - HR - SO + SF

Now simply divide each of the three batted ball bucket totals by BIP and you have your percentages to use in the Pod xBABIP equation.

That’s a wrap for today. We’ll dive into the fun part next, looking at hitters whose Pod xBABIP most differs from Statcast xBABIP, underperformers and overperformers, and perhaps some leaderboards in each of the batted ball bucket rates.
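For readers who would rather script the whole calculation than work in Excel, here is a rough Python sketch of the BIP rates and the two Pod xBABIP equations. Treat it as an illustration under stated assumptions rather than my official implementation: the function and key names are made up for the example, the two pulled-shift terms are assumed to be the right-handed rate followed by the left-handed rate, the three bucket rates are assumed to be fractions of BIP (0.05 rather than 5), and Sprint Speed is in feet per second as on the Statcast leaderboard. The sample hitter at the bottom is hypothetical, not a real player line.

```python
# Coefficients copied from the two Pod xBABIP equations in the article.
# ASSUMPTION: the two pulled-shift terms are the right-handed rate followed by
# the left-handed rate, and all three rates are fractions of BIP (0.05, not 5).
COEFFICIENTS = {
    "2015-2019": {
        "intercept": -0.01876, "statcast_xbabip": 0.84139, "sprint_speed": 0.00276,
        "pull_shift_gb_r": -0.08450, "pull_shift_gb_l": -0.12089, "oppo_gb": 0.14197,
    },
    "2020+": {
        "intercept": -0.02373, "statcast_xbabip": 0.93377, "sprint_speed": 0.00175,
        "pull_shift_gb_r": -0.11485, "pull_shift_gb_l": -0.11195, "oppo_gb": 0.11621,
    },
}


def balls_in_play(ab, hr, so, sf):
    """BIP = AB - HR - SO + SF (home runs are excluded)."""
    return ab - hr - so + sf


def pod_xbabip(season, statcast_xbabip, sprint_speed,
               pull_shift_gb_r, pull_shift_gb_l, oppo_gb, bip):
    """Apply the era-appropriate equation; the three GB inputs are raw counts."""
    if bip <= 0:
        raise ValueError("bip must be positive")
    c = COEFFICIENTS["2015-2019" if season <= 2019 else "2020+"]
    return (
        c["intercept"]
        + c["statcast_xbabip"] * statcast_xbabip
        + c["sprint_speed"] * sprint_speed
        + c["pull_shift_gb_r"] * pull_shift_gb_r / bip  # count -> share of BIP
        + c["pull_shift_gb_l"] * pull_shift_gb_l / bip
        + c["oppo_gb"] * oppo_gb / bip
    )


# Hypothetical right-handed hitter (made-up numbers, for illustration only).
example = pod_xbabip(season=2021, statcast_xbabip=0.247, sprint_speed=28.0,
                     pull_shift_gb_r=20, pull_shift_gb_l=0, oppo_gb=16, bip=400)
print(round(example, 3))
```

Passing raw bucket counts plus BIP, rather than precomputed percentages, keeps the division in one place and mirrors the way the Statcast search downloads report totals.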