Not Impossible, Just Improbable: Beat the Streak Is Back!

You may remember, in the before times, a game called Beat the Streak. The game challenged baseball fans to hypothetically beat one of the greatest records of all time. The idea was to pick one player each day to get a hit and to do that 57 consecutive times, beating Joe DiMaggio’s 56-game hitting streak. Way back in preseason 2020, I wrote about my ambitions of becoming a millionaire by using predictive machine learning models to help win the competition. The game’s 2020 cancellation gave me time to think, time to read, time to learn how noisy my upstairs neighbors are, and time to build a better model.

As you can imagine, beating the streak is hard to do. Many have tried, all have failed, and some have pointed out the insanely low probability of actually reaching 57. But whether you’re attempting to actually beat those very, very small odds, or you’re just looking for daily fantasy production, predicting daily hit probabilities can be useful. This seems like a natural fit for machine learning, so fit we will!
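To put a rough number on "insanely low," here is a back-of-the-envelope sketch. It assumes every at-bat and every pick is an independent trial, which real baseball is not, and the .300 average, four at-bats, and the 80% and 90% pick rates are illustrative inputs rather than figures from the model.

```python
def prob_at_least_one_hit(batting_avg: float, at_bats: int) -> float:
    """P(at least one hit), treating each at-bat as an independent trial."""
    return 1.0 - (1.0 - batting_avg) ** at_bats


def prob_streak(per_pick_prob: float, length: int = 57) -> float:
    """P(`length` successful picks in a row), assuming independent picks."""
    return per_pick_prob ** length


# Illustrative inputs only: a .300 hitter who gets 4 at-bats.
p_game = prob_at_least_one_hit(0.300, 4)
print(f"P(hit in one game):   {p_game:.4f}")
print(f"P(57 straight picks): {prob_streak(p_game):.2e}")

# Even a hypothetical picker who is right 80% or 90% of the time rarely gets there.
for p in (0.80, 0.90):
    print(f"per-pick p={p:.2f} -> P(57 straight) = {prob_streak(p):.2e}")
```

Under these assumptions a .300 hitter gets a hit in about 76% of his games, and even a picker who is right 90% of the time finishes 57 in a row well under 1% of the time. That gap is why small gains in daily hit probability matter so much.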

If you could give a computer some variables to learn from, let’s say 10 or fewer, what variables would you choose? What factors would you say are most important to a hitter getting a hit? You might think of righty/lefty matchups. Ok, check. What else? Think about this from the manager’s point of view when trying to set the lineup. Kiley McDaniel and Eric Longenhagen discuss this exact situation in their book, Future Value:

 

“There’s a lot of ideas to consider when a manager is setting a lineup for the game that night….The most important concepts to consider in this situation are the plane of the pitchers’ fastball to the plate and the plane of the hitter’s swing through the zone…[I]magine the hitter being a power hitter who hits the ball in the air and has the bat speed to get away with having a bit of a longer path to the zone (think Ronald Acuna), designed to increase bat speed and scoop the ball into the air. The ball is steep coming down from the pitcher’s hand, much steeper than average, and the hitter’s swing path is also steep coming up at the ball, also much steeper than average. You can see why this is a good matchup, all else being equal, his swing path is “on plane” with the pitcher for a longer time, giving him margin for error.”

 

That’s a lot for a manager to consider, but not for a computer. If I were the manager, I would be using a model to tell me which players were most likely to hit tomorrow’s starter. Ok, I would have an analyst build me a model to tell me all that stuff, preferably in short order. Beat the Streak allows me to play the role of both analyst and manager. Fun! Here are a few rows of training data, the data the computer initially learns from:

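Model Training Data

| # | Pitch Type | Release X | Release Y | P Throws | B Stands | Plate X | Plate Z | LA | Spin Rate | Park Factor | Hit |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | SL | -0.84 | 6.82 | R | R | 0 | 2.03 | -21 | 2154 | 99 | 0 |
| 2 | SL | -1 | 6.87 | R | L | 0.4 | 2.41 | -44 | 2203 | 100 | 0 |
| 3 | FF | -1.34 | 6.71 | R | R | 0.02 | 2.23 | 11 | 2368 | 99 | 1 |
| 4 | FF | -1.1 | 6.6 | R | L | 0.54 | 1.75 | 3 | 2348 | 97 | 0 |
| 5 | SL | -1.14 | 6.83 | R | R | -0.15 | 1.83 | -35 | 2300 | 101 | 0 |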

The model is learning from all of these unique combinations and determining which patterns produce a hit. Keeping it simple limits the number of unrealistic assumptions I have to make when I deploy my model. For example, I can’t know for sure whether there will be a shift on or not. Beat the Streak forces me to make my pick before the National Anthem is even sung, so there’s only so much I can prepare for. But I do know a given pitcher’s average fastball spin rate, and I can adjust that as the season goes on. From the variables I did use, here’s what the model says is most important when predicting a hit:

 

Feature Importance Chart

 

Since I’m using a random forest model to predict my target, the feature importances are a little different than the coefficients of a regression. If you’re interested, you can read about how they’re calculated here. The computer says, “Ok, I’ve learned what makes a hit and what doesn’t and now I’m a trained model.” But now I, the human, need to feed that trained model fresh, new data. The problem is that I don’t know what a pitcher will throw or where those pitches will end up, but I can assume that he will do something similar to what he’s done in the past. The same goes for the hitter and his launch angle.
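As a rough illustration of that workflow, here is a minimal sketch using pandas and scikit-learn. It is not the actual pipeline behind the model: the snake_case column names, the one-hot encoding, the tree count, and the park factor in the "fresh" row are all assumptions, and five rows are far too few to learn anything real. The fresh row is built from the average of the slider rows in the sample table plus Brantley's 10.2 launch angle from the results table.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# The five sample rows from the training-data table.
rows = [
    ("SL", -0.84, 6.82, "R", "R", 0.00, 2.03, -21, 2154, 99, 0),
    ("SL", -1.00, 6.87, "R", "L", 0.40, 2.41, -44, 2203, 100, 0),
    ("FF", -1.34, 6.71, "R", "R", 0.02, 2.23, 11, 2368, 99, 1),
    ("FF", -1.10, 6.60, "R", "L", 0.54, 1.75, 3, 2348, 97, 0),
    ("SL", -1.14, 6.83, "R", "R", -0.15, 1.83, -35, 2300, 101, 0),
]
columns = ["pitch_type", "release_x", "release_y", "p_throws", "b_stands",
           "plate_x", "plate_z", "la", "spin_rate", "park_factor", "hit"]
train = pd.DataFrame(rows, columns=columns)

# One-hot encode the categorical columns (an assumed encoding choice).
categorical = ["pitch_type", "p_throws", "b_stands"]
X = pd.get_dummies(train.drop(columns="hit"), columns=categorical, dtype=float)
y = train["hit"]

# Hyperparameters here are placeholders, not tuned values.
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)

# Impurity-based importances are non-negative and sum to 1,
# unlike the signed coefficients of a regression.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# Build a fresh row by assuming the pitcher repeats his past sliders.
sliders = train[train["pitch_type"] == "SL"]
fresh = {
    "pitch_type": "SL",
    "p_throws": "R",
    "b_stands": "L",
    "la": 10.2,         # hitter's average launch angle, from the results table
    "park_factor": 99,  # hypothetical value for illustration
}
for col in ["release_x", "release_y", "plate_x", "plate_z", "spin_rate"]:
    fresh[col] = sliders[col].mean()

# Encode the new row and align it to the training columns.
X_new = pd.get_dummies(pd.DataFrame([fresh]), columns=categorical, dtype=float)
X_new = X_new.reindex(columns=X.columns, fill_value=0)

hit_prob = model.predict_proba(X_new)[0, 1]
print(f"Predicted hit probability: {hit_prob:.2f}")
```

The `reindex` step matters in practice: a single new row only produces dummy columns for the categories it contains, so it has to be aligned to the training columns before the model can score it.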

On opening day, the model predicted Michael Brantley was likely to get a hit. It chose Brantley only because it thought Chris Bassitt’s pitch plane and Brantley’s swing plane would add up to a hit. Of course, Bassitt had to throw some pretty specific pitches. Something like this:

 

PitchLocations

 

That shouldn’t be too hard to believe, right? The idea of Bassitt throwing a slider low in the zone didn’t raise any red flags. Brantley went 3-for-4 and got two of those hits off Bassitt. My model predicted Brantley would get a hit off a slider thrown low in the zone, yet it was a changeup closer to the middle of the zone that resulted in Brantley sliding into second.

 

A comparison of what the model predicted (above) and what actually happened can provide some insight into whether the model is totally random or providing some help. Pitch locations (plate_x and plate_z in the image above) are much more important, by the model’s standards, than pitch type. But leave a changeup right there in the middle of the zone like that and a professional like Brantley is going to hit it. Was it a lucky guess, or is there something the model is picking up on that I otherwise wouldn’t? It’s simply too early to tell. Here’s the performance so far of the model, which I’m calling “Jolt” in honor of the true record holder, Joe DiMaggio:

Model Hit Pick Results

| # | Date | Pitcher | Home Team | Batter | LA | Outcome |
|---|---|---|---|---|---|---|
| 1 | 4/1 | Chris Bassitt | OAK | Michael Brantley | 10.2 | 3-for-4 |
| 2 | 4/1 | Kyle Hendricks | CHC | Bryan Reynolds | 10.2 | 1-for-4 |
| 3 | 4/2 | Antonio Senzatela | COL | Corey Seager | 11.9 | 3-for-5 |
| 4 | 4/3 | Zach Plesac | DET | Jeimer Candelario | 12.1 | 3-for-4 |
| 5 | 4/4 | Adrian Houser | MIL | Luis Arraez | 11.4 | 3-for-3 |
| 6 | 4/5 | Matt Shoemaker | DET | Jeimer Candelario | 13.3 | 0-for-4 |

Two picks were made on opening day (4/1) as Beat the Streak allows players to “Double Down”, rewarding you with 2 hits for the day but adding risk.
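For anyone keeping score at home, the streak implied by these picks can be tallied with a short sketch. The scoring rule is an assumption on my part: each day is all-or-nothing, so a Double Down day adds 2 only if both picks get a hit, and any miss resets the streak to zero. The function names are made up for illustration.

```python
from itertools import groupby

# (date, batter, outcome) for each pick in the results table.
picks = [
    ("4/1", "Michael Brantley", "3-for-4"),
    ("4/1", "Bryan Reynolds", "1-for-4"),
    ("4/2", "Corey Seager", "3-for-5"),
    ("4/3", "Jeimer Candelario", "3-for-4"),
    ("4/4", "Luis Arraez", "3-for-3"),
    ("4/5", "Jeimer Candelario", "0-for-4"),
]


def got_hit(outcome: str) -> bool:
    """True if an outcome string such as '3-for-4' records at least one hit."""
    hits, _at_bats = outcome.split("-for-")
    return int(hits) > 0


def tally_streak(picks):
    """Return (current, longest) streak, scoring each day as all-or-nothing."""
    current = longest = 0
    # groupby relies on the picks already being ordered by date.
    for _date, day in groupby(picks, key=lambda pick: pick[0]):
        results = [got_hit(outcome) for _, _, outcome in day]
        if all(results):
            current += len(results)  # a Double Down day adds 2
            longest = max(longest, current)
        else:
            current = 0  # assumed rule: any miss resets the streak
    return current, longest


current, longest = tally_streak(picks)
print(f"current streak: {current}, longest: {longest}")
# prints "current streak: 0, longest: 5"
```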

 

Just like it’s hard for the best hitters in the world to get a hit in a major league game, it’s hard for mediocre data scientists and baseball fans to predict who will be successful. But as Jolt learns more and more from this year’s data, I’m hoping to make it a little further than 6. Let’s say, 57? If I get there, I’ll be sure to let you know. For now, pay attention to swing and pitch planes and see if you can add a new, fun dimension to your fantasy season. You may just win a huge pile of cash.





10 Comments
ihatepants
2 years ago

I’ve been fascinated with the challenge the last few seasons. Here are several factors I’d also have the model consider.
With today’s bullpens and shorter pitcher starts, I’ve also included the strength of a bullpen as a factor, using the RP positional WAR estimates from the ZiPS article.
I also look for hitters who perform well on the road, since the away team is always guaranteed to bat in the 9th, whereas a home team can miss the 9th if they’re ahead. I generally favor hitters in better lineups as well, to increase the chances of more hits and more times through the lineup. I think average plate appearances per game is a good way to look at it.
I’m also considering team defense and looking for opponents in the bottom 10 by DRS. I’m also looking for the most consistent hitters. Some have good averages but are streaky, so I prefer to track how many games a hitter gets at least one hit.

ihatepants
2 years ago
Reply to  Lucas Kelly

What language/tool did you build the model in? R? Python? I’m using the project to build out my Python skills.

b.now
2 years ago
Reply to  ihatepants

I’d also be interested in learning more about the how, as an amateur data scientist myself. I haven’t played around much with anything (mostly just trying to absorb concepts), and mixing in baseball seems like the easiest way to learn more. So I’ll start with a couple of newbie questions: what’s the best source of data for developing a model? Does FG have an API I could connect to? Is it best to extract CSVs from the leaderboard pages? Is there a best method to account for current-season data?

Mostly, any starting points would be greatly appreciated. Thanks for the rather interesting article. Looking forward to seeing more about it.