Linear Modeling for BB%

The experiment from last week’s post on modeling for strikeout rate continues this week with a look at walk rate:

  1. I’ve limited the sample to hitters with at least 120 PAs because that is roughly where hitter BB% stabilizes.
  2. I’m using 2017-2019 as a training set and then deploying my model on 2021 data to look for differences between model predictions and actuals.
  3. My model only tells us what should be expected from a hitter who accumulates at least 120 plate appearances in a season based on what other players have done in the same situation from 2017-2019. 2020 is excluded. The predictions of this model should not be confused with expectations.

If you scroll down to the comment section of last week’s post, you’ll notice a few readers who were concerned about the impact of multicollinearity on the reliability of the model. If you’re modeling a hitter’s walk rate and you’re inputting Swing%, O-Swing%, and Z-Swing%, then the estimated effect of each variable becomes distorted because the inputs are so highly correlated with one another. I’ve taken that into account with this model by removing variables that are highly correlated with other variables (a correlation greater than .8). That left me with the following inputs:

O-Contact%, Z-Contact%, Zone%, F-Strike%, O-Swing%, Z-Swing%
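
To make the correlation filter concrete, here is a minimal sketch of one way to do it. The greedy keep-or-skip rule, the function names, and the toy numbers are all assumptions for illustration; the post does not say which variable of each correlated pair was dropped or what tool was used.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def drop_correlated(columns, threshold=0.8):
    """Walk the columns in order and keep one only if its absolute
    correlation with every already-kept column is at or below threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy values, not real player data.
toy = {
    "O-Swing%": [20.0, 25.0, 30.0, 35.0, 28.0],
    "Swing%":   [40.0, 44.5, 49.0, 55.0, 47.0],  # tracks O-Swing% almost perfectly
    "Zone%":    [44.0, 40.0, 43.0, 41.0, 45.0],
}
kept = drop_correlated(toy)
print(kept)
```

With these toy values, Swing% is skipped because it is almost perfectly correlated with O-Swing%, while Zone% survives. Note that the result of a greedy filter depends on column order, so a different ordering could keep Swing% and drop O-Swing% instead.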

I then trained a multiple linear regression model on data from 2017-2019 and applied the model’s equation to each hitter’s actual 2021 plate discipline stats to predict how often he should have walked. With fewer inputs, the equation (Y = B0 + B1*X1 + B2*X2 + … + Bi*Xi) becomes more legible, and the coefficients are below:

Model Coefficients
Input         Coefficient
O-Contact%    .02
Z-Contact%    -.10
Zone%         -.43
F-Strike%     -.06
O-Swing%      -.43
Z-Swing%      .01
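
For readers curious how coefficients like these are estimated, the following is a minimal sketch of ordinary least squares solved through the normal equations in plain Python. It is not the code behind the table above: the function name, the two-feature setup, and the toy data (generated from made-up coefficients) are assumptions for illustration only.

```python
def fit_ols(rows, targets):
    """Fit y = b0 + b1*x1 + ... + bk*xk by solving (X'X) b = X'y."""
    design = [[1.0] + list(r) for r in rows]  # leading 1.0 is the intercept column
    k = len(design[0])
    # Augmented matrix [X'X | X'y].
    aug = [
        [sum(d[i] * d[j] for d in design) for j in range(k)]
        + [sum(d[i] * y for d, y in zip(design, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_val = aug[col][col]
        aug[col] = [v / pivot_val for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]  # [intercept, b1, ..., bk]

# Hypothetical (O-Swing%, Zone%) pairs with BB% built from made-up coefficients,
# so the fit should recover them: 30 - 0.4 * O-Swing% - 0.2 * Zone%.
rows = [(20.0, 44.0), (25.0, 40.0), (30.0, 43.0), (35.0, 41.0), (28.0, 45.0), (22.0, 42.0)]
targets = [30.0 - 0.4 * o - 0.2 * z for o, z in rows]

intercept, b_oswing, b_zone = fit_ols(rows, targets)
print(round(intercept, 3), round(b_oswing, 3), round(b_zone, 3))
```

In practice a statistics library would handle this fit and report standard errors too; the hand-rolled solver is only here to show what the equation's B values are.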

Since the model was never shown 2021’s data, we can use it to predict each player’s BB% and then compare the predictions with what actually happened. Here’s what that looks like when I split actual BB% into deciles and compare the predictions with the actuals:

You can see that hitters with very high walk rates show big gaps between actuals and predictions, which is to be expected when regressing to the mean. I did not treat those hitters as outliers and chose to keep them in the training data. More on that later.
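
The decile comparison described above can be sketched roughly as follows. The function name and the toy data are hypothetical; in particular, the toy predictions are deliberately shrunk halfway toward 10% to mimic the regression-to-the-mean pattern, which is an assumption and not the model's real output.

```python
def decile_summary(actuals, preds):
    """Sort hitters by actual BB%, split them into ten groups, and return
    (decile, mean actual BB%, mean predicted BB%) for each non-empty group."""
    pairs = sorted(zip(actuals, preds))
    n = len(pairs)
    summary = []
    for d in range(10):
        chunk = pairs[d * n // 10:(d + 1) * n // 10]
        if not chunk:
            continue
        mean_actual = sum(a for a, _ in chunk) / len(chunk)
        mean_pred = sum(p for _, p in chunk) / len(chunk)
        summary.append((d + 1, mean_actual, mean_pred))
    return summary

# Hypothetical data: actual BB% from 4% to 23%, predictions pulled toward 10%.
actuals = [4.0 + i for i in range(20)]
preds = [10.0 + 0.5 * (a - 10.0) for a in actuals]

summary = decile_summary(actuals, preds)
for decile, mean_actual, mean_pred in summary:
    print(decile, mean_actual, mean_pred, round(mean_pred - mean_actual, 2))
```

In this toy setup the top decile shows the largest gap between prediction and actual, the same shape of result the paragraph above describes for high-walk hitters.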

Now, let’s find a player to evaluate based on the discrepancy between what the model suggested and what actually happened. Jake Fraley was a deep-league flier for some fantasy managers in 2021 given his 67th-percentile Statcast sprint speed, above-average walk rate, and 109 wRC+ in 265 plate appearances. While his path to playing time has a few potholes in it, Steamer projects him for a 100 wRC+ and an 11.2% walk rate over 405 plate appearances in 2022. Let’s look at the rate at which the model says he should have walked in 2021. Below, I’ve multiplied each model coefficient by Fraley’s 2021 stat to get the Result column:

An Example of Calculation – Jake Fraley
Stat          Coefficient       2021 Fraley    Result
O-Contact%    .02          x    57.9           .998
Z-Contact%    -.10         x    81.3           -8.10
Zone%         -.43         x    41.5           -17.6
F-Strike%     -.06         x    54.7           -3.05
O-Swing%      -.43         x    21.4           -9.22
Z-Swing%      .01          x    65.7           0.42

Taking the sum of the ‘Result’ column (-36.61) and adding it to the model’s intercept (50.54), we can see that the model predicted Fraley to walk 13.9% of the time when he actually walked 17.4% of the time. In that sense, perhaps we need to temper our walk expectations for Fraley in 2022, and Steamer does. Compared with the training-data averages (2017-2019), Fraley had a lower O-Contact% (57.9% vs. a 62.7% average) and a very low O-Swing% (21.4% vs. a 30.8% average), and that O-Swing% really moves the needle. According to the model, for every one-point increase in O-Swing%, we can expect BB% to decrease by .43 percentage points with everything else staying the same. Can Fraley continue to avoid the chase? Regression to the mean says no, but we’ll have to wait and see.
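
The Fraley calculation can be reproduced with a few lines of code. This sketch uses the intercept and the two-decimal coefficients printed in this post, so the individual products differ a little from the Result column (which appears to come from unrounded coefficients), but the total lands on the same 13.9%. The function and variable names are my own for illustration.

```python
INTERCEPT = 50.54
COEFFICIENTS = {  # rounded values as printed in the coefficient table
    "O-Contact%": 0.02,
    "Z-Contact%": -0.10,
    "Zone%": -0.43,
    "F-Strike%": -0.06,
    "O-Swing%": -0.43,
    "Z-Swing%": 0.01,
}

def predict_bb(stats):
    """Apply Y = B0 + B1*X1 + ... + Bi*Xi to one hitter's plate discipline stats."""
    return INTERCEPT + sum(COEFFICIENTS[name] * stats[name] for name in COEFFICIENTS)

fraley_2021 = {
    "O-Contact%": 57.9,
    "Z-Contact%": 81.3,
    "Zone%": 41.5,
    "F-Strike%": 54.7,
    "O-Swing%": 21.4,
    "Z-Swing%": 65.7,
}

pred = predict_bb(fraley_2021)
print(round(pred, 1))         # 13.9
print(round(pred - 17.4, 1))  # -3.5, the BB_diff against his actual 17.4%

# One more point of chase, everything else unchanged, costs .43 points of BB%.
more_chase = dict(fraley_2021, **{"O-Swing%": fraley_2021["O-Swing%"] + 1.0})
print(round(predict_bb(more_chase) - pred, 2))  # -0.43
```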

The table below basically shows us a bunch of outliers. Some of these players we can confidently say have a unique skill set that allows them to walk at a rate higher than what should be expected, but others may be surprising. I bet you didn’t expect Bryce Harper and Travis Jankowski to be on the same list:

Largest Negative BB%/Pred BB% Diff
Name               O-Swing%  Z-Swing%  O-Contact%  Z-Contact%  Zone%  F-Strike%  BB%   Preds  BB_diff
Yasmani Grandal    18.7      51.1      63.2        84.7        40.7   48.5       23.2  15.4   -7.8
Juan Soto          15.1      62.8      62.6        88.7        41.6   53.7       22.2  16.0   -6.2
Bryce Harper       29.4      74.5      54.2        78.9        39.4   55.3       16.7  11.6   -5.1
Mike Trout         22.1      63.4      61.5        79.1        42.1   56.2       18.5  13.6   -4.9
Luis Guillorme     25.5      57.8      80.0        92.8        43.7   62.2       14.7  10.0   -4.7
Abraham Almonte    27.4      64.8      68.0        85.9        41.7   62.9       14.9  10.5   -4.4
Ben Gamel          27.2      65.7      51.2        87.6        43.9   60.5       12.8  9.3    -3.5
Jake Fraley        21.4      65.7      57.9        81.3        41.5   54.7       17.4  13.9   -3.5
Daniel Vogelbach   20.7      49.5      69.5        89.8        42.2   53.5       16.7  13.3   -3.4
Travis Jankowski   21.2      59.8      76.9        88.5        47.3   61.8       14.0  10.7   -3.3

Below are the hitters the model expected to walk more often than they actually did. Who stands out to you?

Largest Positive BB%/Pred BB% Diff
Name               O-Swing%  Z-Swing%  O-Contact%  Z-Contact%  Zone%  F-Strike%  BB%   Preds  BB_diff
Joe Panik          23.7      68.3      82.6        89.7        44.9   60.3       6.6   10.8   4.2
Jake Cave          25.5      68.1      45.9        82.5        46.6   59.6       5.6   9.4    3.8
Jason Martin       29.1      73.0      58.3        79.3        45.7   58.4       5.2   8.9    3.7
Jonah Heim         31.9      75.6      71.4        86.5        41.4   60.0       5.3   8.9    3.6
Jake Meyers        31.0      68.3      57.4        86.0        40.0   61.3       6.1   9.6    3.5
Austin Nola        21.6      72.4      75.3        93.9        46.0   60.8       7.2   10.7   3.5
Luis Rengifo       36.3      70.2      66.7        80.7        39.8   61.1       4.7   8.1    3.4
Byron Buxton       34.5      80.0      60.0        79.7        40.5   65.4       5.1   8.4    3.3
Ryan O’Hearn       33.8      75.4      59.8        82.9        41.2   60.6       5.1   8.3    3.2
John Nogowski      18.9      68.9      71.9        92.9        46.9   58.0       8.4   11.6   3.2
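
Both leaderboards come from the same calculation. Judging by the table values, BB_diff is the prediction minus the actual BB%, so a negative number means the hitter walked more than the model expected. Here is a small sketch using a handful of rows copied from the two tables; the helper name and the choice of rows are mine.

```python
# (name, actual BB%, predicted BB%) copied from the two tables in this post.
hitters = [
    ("Yasmani Grandal", 23.2, 15.4),
    ("Juan Soto", 22.2, 16.0),
    ("Jake Fraley", 17.4, 13.9),
    ("Joe Panik", 6.6, 10.8),
    ("Jake Cave", 5.6, 9.4),
    ("Byron Buxton", 5.1, 8.4),
]

def bb_diff(actual, pred):
    """Prediction minus actual, rounded to one decimal like the tables."""
    return round(pred - actual, 1)

# Sort from most negative (walked more than predicted) to most positive.
ranked = sorted(
    ((name, bb_diff(actual, pred)) for name, actual, pred in hitters),
    key=lambda item: item[1],
)
for name, diff in ranked:
    print(f"{name:<18}{diff:+.1f}")
```

Taking the top of this sorted list gives the first table's ordering, and taking the bottom gives the second.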

In OBP leagues, this analysis of plate discipline can be useful for finding value in players for 2022, but realistically you want a player who can get on base both by drawing walks and by putting the ball in play well. This deep dive allows us to see which skills translate most directly to BB%, and given all the 2022 Baseball HQ Forecaster pictures I’ve seen on Twitter lately, I can tell the fantasy baseball community values skills. A deep dive into skills is what we’re all about. But, on the other hand, maybe it’s just fun to see that Austin Nola’s 2021 Z-Contact% was 93.9% and to know you have a cool baseball statistic to pull out while sitting with friends and family around the holiday fire.




