Linear Modeling for Hitter K%

Experiment alert! Prepare yourself to digest a very simple linear model that looks at plate discipline data. I’ll do some explaining of the model along the way, but here are a few points to cleanse your already superb palate before sampling the charcuterie:

  1. I’ve limited the sample to players with at least 60 PAs because that is roughly where hitter K% begins to stabilize.
  2. I’m using 2017-2019 as a training set and then deploying my model on 2021 data to look for differences between model predictions and actuals.
  3. My model only tells us what should be expected from a hitter who accumulates at least 60 plate appearances in a season based on what other players have done in the same situation from 2017-2019. 2020 is excluded. The predictions of this model should not be confused with expectations.
  4. Hasn’t this been done before? Probably.

To start, many readers will have seen how one metric correlates with another. In this case, we see a strong negative correlation between how often a hitter strikes out (K%) and how often he makes contact on pitches outside the zone (O-Contact%). As O-Contact% increases, K% decreases; in other words, hitters who make a ton of contact on pitches outside the zone strike out less often. The graph below shows that relationship from 2017-2019.

Take, for example, the cult classic Willians Astudillo, who owned the lowest 2021 K% among hitters with at least 60 PAs at 5.6%! Along with that low K%, he also owned the highest O-Contact% among the same group of hitters at 89.1%. Here’s a link to a video that represents a perfect example of how this combination doesn’t always produce results. But this is probably too intuitive to need graphing, plotting, and calculating. Hit the ball, don’t strike out. It makes sense.

What can be a little more informative is a multiple linear regression model that tries to make sense of how each input, or independent variable, affects the dependent variable. So, that’s what I did. I went over to our plate discipline leaderboards, did a little bit of this and a little bit of that, and created a model. Trained on 2017-2019, the model used what it learned from those seasons to predict a hitter’s K% based on his plate discipline metrics. Here’s how the model predicted versus what actually occurred in 2021:

Not too shabby for a quick first run. Since I’m using a multiple linear regression, the coefficients of the model’s inputs can be used to see how the predicted K% is calculated. Remember, “the coefficient tells you how much the dependent variable [K%] is expected to increase when that independent variable [Input] increases by one, holding all the other independent variables constant.”

Model Coefficients
Input Coefficient
O-Swing% -0.64
Z-Swing% -0.43
Swing% 1.19
O-Contact% -0.08
Z-Contact% -0.27
Contact% -0.59
Zone% -0.35
F-Strike% -0.02
SwStr% 0.28
CStr% 0.75

Using the definition from above, we can see that for every one-percentage-point increase in Swing%, a player’s predicted K% should increase by roughly 1.2 percentage points, holding the other inputs constant. Let’s use this equation on Keibert Ruiz, the young Nationals catcher who came over from the Dodgers at the trade deadline (I can’t remember who the Dodgers got for him in the deal, but you can look it up) to see how well it predicted his 2021 K%. Rather than trying to type out a long multiple linear regression equation (Y = B0 + B1*X1 + B2*X2 + … + Bi*Xi), I’ll present it in a spreadsheet-style table. Each of Ruiz’s 2021 statistics is multiplied by the corresponding model coefficient to get the ‘Result’ column:

An Example of Calculation – Keibert Ruiz
Stat Coefficient 2021 Ruiz Result
O-Swing% -0.64 x 41.4 -26.36
Z-Swing% -0.43 x 67.8 -29.25
Swing% 1.19 x 51.8 61.57
O-Contact% -0.08 x 87 -6.74
Z-Contact% -0.27 x 89 -23.73
Contact% -0.59 x 88.1 -51.97
Zone% -0.35 x 39.4 -13.96
F-Strike% -0.02 x 66.7 -1.30
SwStr% 0.28 x 6.2 1.74
CStr% 0.75 x 16.6 12.51
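The table's arithmetic can be reproduced in a few lines of Python. This is only a sketch: it uses the rounded coefficients published in the coefficient table and the model intercept of 89.75 quoted just below, so its output lands a little under the figure the unrounded model produces. The names `COEFFICIENTS`, `RUIZ_2021`, and `predict_k_pct` are made up for illustration and are not taken from the original analysis.

```python
# Rounded coefficients and intercept as published in the article.
COEFFICIENTS = {
    "O-Swing%": -0.64,
    "Z-Swing%": -0.43,
    "Swing%": 1.19,
    "O-Contact%": -0.08,
    "Z-Contact%": -0.27,
    "Contact%": -0.59,
    "Zone%": -0.35,
    "F-Strike%": -0.02,
    "SwStr%": 0.28,
    "CStr%": 0.75,
}
INTERCEPT = 89.75

# Keibert Ruiz's 2021 plate discipline line from the table above.
RUIZ_2021 = {
    "O-Swing%": 41.4,
    "Z-Swing%": 67.8,
    "Swing%": 51.8,
    "O-Contact%": 87.0,
    "Z-Contact%": 89.0,
    "Contact%": 88.1,
    "Zone%": 39.4,
    "F-Strike%": 66.7,
    "SwStr%": 6.2,
    "CStr%": 16.6,
}


def predict_k_pct(stats, coefficients=COEFFICIENTS, intercept=INTERCEPT):
    """Y = B0 + B1*X1 + ... + Bi*Xi, with every input in percentage points."""
    return intercept + sum(coefficients[name] * stats[name] for name in coefficients)


ruiz_pred = predict_k_pct(RUIZ_2021)

# Raise Swing% by one point while holding every other input constant.
bumped = dict(RUIZ_2021)
bumped["Swing%"] += 1.0
swing_effect = predict_k_pct(bumped) - ruiz_pred

print(f"Predicted K%: {ruiz_pred:.1f}")
print(f"Change from +1 point of Swing%: {swing_effect:+.2f}")
```

With the rounded inputs the prediction comes out near 11.8%, and the one-point bump in Swing% moves it by the Swing% coefficient, which is exactly what "holding all the other independent variables constant" means.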

Taking the sum of the ‘Result’ column (-77.47) and adding it to the model’s intercept (89.75), we can see that the model predicted Ruiz to strike out 12.3% of the time when he really struck out 9.4% of the time. While there is no doubt that more than just plate discipline data should be used to explain how often a player strikes out, this does a nice job of isolating K% on a handful of predictors. The best way to use these results is to check what the model predicted for 2021 hitters versus what the hitters actually did and pick apart the differences. I’ve limited the tables below to the inputs with the two highest and two lowest coefficients for simplicity. Let’s take a look at the 10 largest discrepancies in each direction:

Largest Negative K%/Pred K% Diff
Name Team O-Swing% Swing% Contact% CStr% K% Pred K_Diff
Andrew Romine CHC 38.4 52.1 73.9 12.8 37.5 23.3 -14.2
Jahmai Jones BAL 23.4 42.8 78.8 21.0 36.1 24.0 -12.1
Ender Inciarte ATL 31.8 49.1 83.8 15.3 24.7 15.0 -9.7
Jake Lamb – – – 27.4 44.4 78.2 17.7 30.0 21.6 -8.4
Chad Wallach MIA 36.0 48.6 59.9 17.1 48.5 40.3 -8.2
Jarren Duran BOS 38.6 51.7 69.4 13.5 35.7 28.2 -7.5
Will Craig 크레익 PIT 35.9 51.0 75.4 18.6 33.8 26.4 -7.4
Leody Taveras TEX 29.8 47.7 75.1 16.4 32.4 25.2 -7.2
Kyle Isbel KCR 39.8 48.6 77.6 17.1 27.7 20.5 -7.2
Max Schrock CIN 41.4 55.2 85.4 11.3 17.9 10.8 -7.1

In this first table, we see some very young hitters who are all (except Schrock and maybe Inciarte) striking out well above the 2021 league average of 23.2%. While they are all striking out at a high clip, the model expected them to do a little better. That could mean there’s something missing from the model that better explains K%, which is likely, but it could also mean that these players have something in their plate discipline profile that suggests a skill is in place that could lower their K%. For example, Leody Taveras struck out 32.4% of the time, but my model says, based on the inputs, he should have struck out less. If we go back and look at all the inputs of the model, we can see that Taveras actually had a lower O-Swing% than the average in the training set (not to be confused with the 2021 league average) and a higher O-Contact%. These skills lower his predicted K%, but his below-average Z-Contact% and above-average Zone% and F-Strike% are hurting him in reality. Pitchers attacked Taveras in 2021. The takeaway for Taveras should be that he does a good job of limiting the chase and makes contact when he does chase, but needs to do better at making contact when the ball is in the zone. Here’s what the other side of the spectrum looks like:

Largest Positive K%/Pred K% Diff
Name Team O-Swing% Swing% Contact% CStr% K% Pred K_Diff
Alejandro Kirk TOR 28.4 44.8 82.1 19.8 11.6 19.7 8.1
Danny Santana BOS 34.1 53.9 67.3 11.8 23.6 30.9 7.3
Oscar Mercado CLE 26.7 47.2 74.2 14.3 17.6 24.6 7.0
Alex Avila WSN 16.5 36.8 62.4 20.9 33.3 40.3 7.0
Kyle Lewis SEA 25.4 47.0 68.3 15.7 25.2 32.1 6.9
Vladimir Guerrero Jr. TOR 28.3 47.3 73.9 12.7 15.8 22.6 6.8
Keston Hiura MIL 34.7 51.9 54.3 13.6 39.1 45.9 6.8
Jesús Aguilar MIA 32.0 44.5 75.3 18.2 18.2 24.8 6.6
Kelvin Gutierrez – – – 32.3 45.1 69.8 20.7 25.8 32.2 6.4
Tomás Nido NYM 53.4 65.3 64.2 8.3 27.3 33.6 6.3

The table above shows us who the model thinks should have struck out a little more. Let’s use Danny Santana as an example. His O-Swing% finished about 3 percentage points higher than the training data set’s average (34.1% Santana, 30.9% average), his Z-Swing% finished nearly 12 points higher (79.8% Santana, 67.9% average), and his O-Contact% was far below the average (48.0% Santana, 62.2% average). Mixing a low O-Contact% with a high O-Swing% is not a good combo, and it seems the model really thought Santana would strike out more than he did. Perhaps the fact that he can keep his CStr% low kept him from heading back to the dugout with a stadium “Cha-Ching” noise ringing out, but keeping a profile like this in 2022 should create more “Cha-Ching” sounds.
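The K_Diff column in both tables appears to be the predicted K% minus the actual K%, so a negative value marks a hitter who struck out more than the model expected. Here is a small sketch of that bookkeeping using a handful of rows copied from the tables; the `rows` list and the `k_diff` helper are illustrative names, not part of the original workflow.

```python
# (name, actual 2021 K%, predicted K%) copied from the two tables above.
rows = [
    ("Andrew Romine", 37.5, 23.3),
    ("Jahmai Jones", 36.1, 24.0),
    ("Leody Taveras", 32.4, 25.2),
    ("Alejandro Kirk", 11.6, 19.7),
    ("Danny Santana", 23.6, 30.9),
    ("Vladimir Guerrero Jr.", 15.8, 22.6),
]


def k_diff(actual, predicted):
    """Predicted minus actual, rounded to one decimal like the tables."""
    return round(predicted - actual, 1)


# Sort from most negative (struck out more than predicted)
# to most positive (struck out less than predicted).
ranked = sorted(
    ((name, k_diff(actual, predicted)) for name, actual, predicted in rows),
    key=lambda item: item[1],
)

for name, diff in ranked:
    print(f"{name:<22} {diff:+.1f}")
```

Sorting the full 2021 player pool this way and taking ten rows from each end would give lists shaped like the two tables.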

Experimentation is fun for me and it helps me learn more about the intricacies of hitting. Maybe you’ve made it to the end of this article because you like math, maybe it’s because you’re interested in how projection systems work, or maybe you just thought I would recommend a hitter to target because I think he’ll strike out less often and now you feel let down. If you fall into that last category, try to think of things like this: certain plate discipline metrics have more impact on a player’s K% than others. For example, if you’re excited about a prospect but don’t really know if he is draftable yet, dig into his plate discipline profile, use the coefficients and equation to see what his predicted K% might be, and make your decision from there. That’s half the fun of draft prep! Of course, you could always just ask me to look into one specific player in the comments and I’d be happy to share my results. As always, the experiment will continue to evolve, and in another post I’ll use this same system to predict BB%. Until then, go to Willians Astudillo’s savant page, click on “Show Random Video” and watch him swing away at every ball thrown to him and never strike out.





8 Comments
jdm
2 years ago

Nice work. I’ve always looked at plate discipline metrics to see if a player’s improvement in K% is signal or just noise, but I was too lazy to do a systematic examination.

I would offer one suggestion though. I don’t know what standard errors you got from this fit, but you might get better estimates using a more parsimonious version of this model (the current one might have introduced some multicollinearity). For example, since Swing% is just a weighted average of O-Swing% and Z-Swing%, I would guess that the true coefficients of O-Swing% and Z-Swing% are distorted by the inclusion of Swing%.

When I looked at the coefficient estimates, it seems strange that O-Swing and Z-Swing have nontrivial negative coefficients but raw Swing%’s coefficient is nontrivially positive (seems like the coefficient for Swing% is mitigating some of the effect resulting from its subsets since the sum of the coefficients is close to zero). But I have done zero work and this is all purely speculation without any other model results so I could be way off base here. Regardless, I enjoyed the article and I’m looking forward to future editions.

jgrub7
2 years ago
Reply to  jdm

Agreed, and I was thinking exactly the same thing. I do not think it makes sense to include Swing% and Contact%, because they are being double counted and some multicollinearity is definitely introduced.
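The overlap both commenters describe can be checked against the Keibert Ruiz line from the article. The sketch below assumes the usual plate discipline definitions (Swing% as a Zone%-weighted blend of Z-Swing% and O-Swing%, and SwStr% as Swing% times the share of swings that miss); those definitions are not spelled out in the article itself, and the variable names are only illustrative.

```python
# Keibert Ruiz's 2021 inputs, copied from the calculation table in the article.
ruiz = {
    "O-Swing%": 41.4,
    "Z-Swing%": 67.8,
    "Swing%": 51.8,
    "Contact%": 88.1,
    "Zone%": 39.4,
    "SwStr%": 6.2,
}

zone_share = ruiz["Zone%"] / 100.0

# Assumed identity: overall swing rate is the zone-weighted average of the
# in-zone and out-of-zone swing rates.
swing_from_parts = zone_share * ruiz["Z-Swing%"] + (1.0 - zone_share) * ruiz["O-Swing%"]

# Assumed identity: swinging strikes per pitch = swings per pitch * miss rate.
swstr_from_parts = ruiz["Swing%"] * (1.0 - ruiz["Contact%"] / 100.0)

print(f"Swing%  listed {ruiz['Swing%']:.1f}, rebuilt {swing_from_parts:.1f}")
print(f"SwStr%  listed {ruiz['SwStr%']:.1f}, rebuilt {swstr_from_parts:.1f}")
```

Both rebuilt values match the listed ones to within rounding, which suggests Swing% and SwStr% carry little information beyond the other inputs. That is the setting in which individual regression coefficients become hard to interpret, even if the overall predictions hold up.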