Creating Synthetic Data In a Data-less World

What will we do without the zeros and ones of spring training? The underground, black market .csv file that comes from the person who knows the person who operates a Rapsodo in a mini-camp? How will we go on without knowing spin rates or the depth of clay infield impression drilled by various brands of signature spikes? I have an idea: let’s make it up.

I won’t pretend to know how video games like Out of the Park Baseball and MLB The Show create their simulations of a starting pitcher’s fastball, but they must be sampling from a pool of data that the pitcher has showcased before, right? Mariano Rivera was known for his high and tight cutter boring in on the hands of left-handed hitters, and video games must just take that data and replicate it with a slight variation here and there. That’s what I’ll do in this post: try to create, or simulate, data.

In my first official post at FanGraphs, I wrote about a model I built to aid in selecting a player to get a hit each day in hopes of beating the streak in the yearly contest called, you guessed it, “Beat the Streak.” I am not millions of dollars richer now than I was when I wrote that article, and a little logic will tell you that I did not, in fact, beat the streak. For those unfamiliar with the game, let me explain in one run-on sentence. Each day the contestant (that’s you) must choose a player who will get a hit, and you must do this for 57 consecutive days, thus beating Joe DiMaggio’s 56-game hit streak. Sounds simple enough, right? While it is simple to guess who will get a hit, it’s incredibly complicated to use a model to predict or project who will get a hit. This is mostly because it’s difficult to pin down what will happen before the game starts and feed it into a model. But, just like video games create simulations, I’ll try to simulate what should happen based on what has happened already.
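To put a rough number on how hard the contest is, here is a back-of-the-envelope sketch. It rests on an assumption of mine that is not part of the model: every daily pick succeeds independently with the same probability, so a 57-day streak has probability p^57. The function name and the daily probabilities are illustrative only.

```python
def streak_probability(p_hit: float, days: int = 57) -> float:
    """Chance of `days` straight successful picks, assuming independent days."""
    if not 0.0 <= p_hit <= 1.0:
        raise ValueError("p_hit must be between 0 and 1")
    return p_hit ** days


# Illustrative daily success rates (assumed, not measured).
for p in (0.70, 0.80, 0.90):
    print(f"daily hit probability {p:.0%}: streak chance {streak_probability(p):.2e}")
```

Under that simplification, even a pick that lands 80% of the time survives 57 straight days only a few times in a million, which is why guessing well on one day is so much easier than beating the streak.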

I’ll use one specific example by going back in time and pretending I’ve just woken up on August 22nd, 2021, and am using my new approach. Come with me, won’t you? For simplicity’s sake, let’s also pretend there was only one game on that day. You can almost hear the baseball sounds spilling out of Wrigley Field, can’t you? Now, we’re in our pretend time travel world and we’ve decided to play Beat the Streak. We need to make a pick. Alec Mills is pitching for the Cubs, taking on a young Carlos Hernández of the visiting Royals. The model I’ve built has been trained on data from the past two months (start_dt='2021-06-20', end_dt='2021-08-20') and it’s ready to deploy to predict who will get a hit in this matchup. Here are the starting lineups:

2021-08-22: Royals @ Cubs

| Cubs | Royals |
| --- | --- |
| Rafael Ortega – CF | Whit Merrifield – 2B |
| Frank Schwindel – 1B | Nicky Lopez – SS |
| Ian Happ – LF | Salvador Perez – C |
| Patrick Wisdom – 3B | Andrew Benintendi – LF |
| Matt Duffy – 2B | Carlos Santana – 1B |
| Jason Heyward – RF | Hunter Dozier – RF |
| Austin Romine – C | Michael A. Taylor – CF |
| Sergio Alcántara – SS | Emmanuel Rivera – 3B |
| Alec Mills – P (Righty) | Carlos Hernández – P (Righty) |

At this point in the season, Mills has a 4.19 ERA and has given up 86 hits in 77.1 IP, for an opponent batting average of 0.279. Hernández, on the other hand, has a 4.33 ERA and has given up 45 hits in 52 IP, for an opponent batting average of 0.234. It looks like we chose a good game to make a pick from. Remember that Beat the Streak only asks you to pick a player to get a hit. Walks don’t count, doubles are just as valuable as singles, and speed, while not in my model, does play a role. So, go ahead, take your pick. Who would you have chosen in these matchups?

Rather than choose, I’ll use my model to predict a hit. Now comes the part where I need to create some synthetic data. My model is built to predict a hit based on just a few things: pitch type, pitch location, spin rate, launch angle, hit speed, and a few other Statcast data points. I need to feed this model the same kind of data that it has been trained on, but I don’t know what Alec Mills or Carlos Hernández will throw. I don’t know how hard Carlos Santana will hit the ball. I don’t know what launch angle Nicky Lopez will send a ball into play with. What I can do is collect data points from what they have already done in the past two months to create simulated data. If you are familiar with machine learning, you might be worried about feeding the model data that it has already been trained on. But the variances that I’ll add into each event should be enough to trick it. Here’s a pseudocode version of what this process looks like:

Pitching data

– create an empty dataframe for pitchers
– fill the dataframe one row at a time with randomly sampled data points for randomly sampled pitches in that pitcher’s arsenal
– compile it up into synthetic pitching data

Hitting data

– create an empty dataframe for hitters
– fill the dataframe one row at a time with randomly sampled data points from previous batted ball data from each player in the lineup that day
– compile it up into synthetic hitting data

Batted ball data

– merge the two data sets together for simulated batted ball data
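
The steps above can be sketched in Python with pandas. This is a minimal illustration under stated assumptions, not the actual code behind the model: the history frames are tiny placeholders rather than real pitch logs, the helper names `sample_rows` and `simulate_matchup` are hypothetical, and the "variance" is modeled here as small Gaussian noise scaled to each column's observed range. The column names follow Statcast conventions (`plate_x`, `release_spin_rate`, `launch_speed`), except that `pitcher` and `batter` hold names instead of IDs for readability.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Placeholder history standing in for two months of Statcast rows.
pitch_history = pd.DataFrame({
    "pitcher": ["Mills"] * 6,
    "pitch_type": ["FF", "FF", "CH", "CH", "CU", "SL"],
    "plate_x": [-0.2, 0.4, 0.4, 0.8, 0.5, 1.1],
    "plate_z": [2.5, 1.9, 0.5, 1.5, 2.6, 2.9],
    "release_spin_rate": [2186, 2163, 1741, 1511, 2224, 2499],
})
batted_history = pd.DataFrame({
    "batter": ["Rivera"] * 3 + ["Santana"] * 3,
    "launch_angle": [9, 14, -5, 45, 20, 12],
    "launch_speed": [105.9, 79.0, 92.0, 71.8, 98.3, 88.1],
})


def sample_rows(history, key, name, n, noise_cols, noise_frac=0.02):
    """Draw n past rows for one player (with replacement), then jitter them."""
    past = history[history[key] == name]
    seed = int(rng.integers(1 << 31))
    sim = past.sample(n=n, replace=True, random_state=seed).reset_index(drop=True)
    for col in noise_cols:
        # Noise is a small fraction of the column's observed range (assumed).
        scale = noise_frac * (past[col].max() - past[col].min())
        sim[col] = sim[col] + rng.normal(0.0, scale, size=n)
    return sim


def simulate_matchup(pitcher, batters, n_per_batter=3):
    """Pair sampled pitches with sampled batted balls for every batter."""
    frames = []
    for batter in batters:
        pitches = sample_rows(pitch_history, "pitcher", pitcher, n_per_batter,
                              ["plate_x", "plate_z", "release_spin_rate"])
        contact = sample_rows(batted_history, "batter", batter, n_per_batter,
                              ["launch_angle", "launch_speed"])
        # Row i of the pitch sample meets row i of the batted-ball sample.
        frames.append(pd.concat([pitches, contact], axis=1))
    return pd.concat(frames, ignore_index=True)


synthetic = simulate_matchup("Mills", ["Rivera", "Santana"])
print(synthetic.round(2))
```

Sampling whole rows keeps a pitch's type, location, and spin together, and it weights pitch types by how often they appear in the history. Sampling each column on its own would be simpler but could pair a changeup's spin rate with a fastball's label.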

Now, I can run this hypothetical data through my model to make predictions, sort by the probability of the event occurring, and analyze the event to see if it makes any sense at all. Here’s a version of the output simplified to fit on the page:
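
As a rough sketch of that scoring-and-sorting step: the code below assumes the model exposes a scikit-learn-style `predict_proba` method, which is my assumption and not something stated here. `ToyHitModel` is a made-up placeholder rule so the example runs on its own, and `rank_hit_picks` is a hypothetical helper name. The four events reuse launch angle and hit speed values from the output table, but the probabilities this toy rule produces will not match the real model's.

```python
import numpy as np
import pandas as pd


class ToyHitModel:
    """Placeholder with a scikit-learn-style predict_proba; NOT the real model."""

    def predict_proba(self, X: pd.DataFrame) -> np.ndarray:
        # Made-up rule: harder contact and line-drive launch angles score higher.
        z = 0.05 * (X["launch_speed"] - 90.0) - 0.03 * (X["launch_angle"] - 15.0).abs()
        p_hit = 1.0 / (1.0 + np.exp(-z.to_numpy()))
        # Columns are [P(no hit), P(hit)], matching the scikit-learn layout.
        return np.column_stack([1.0 - p_hit, p_hit])


def rank_hit_picks(model, events: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    """Score each simulated event and sort from most to least likely hit."""
    scored = events.copy()
    scored["hit_prob"] = model.predict_proba(scored[feature_cols])[:, 1]
    return scored.sort_values("hit_prob", ascending=False).reset_index(drop=True)


# Simulated events; values borrowed from the output table for illustration.
events = pd.DataFrame({
    "batter": ["Rivera", "Benintendi", "Santana", "Perez"],
    "pitch_type": ["FF", "FF", "FF", "CH"],
    "launch_angle": [9.0, 54.0, 45.0, 24.0],
    "launch_speed": [105.9, 74.1, 71.8, 81.1],
})

picks = rank_hit_picks(ToyHitModel(), events, ["launch_angle", "launch_speed"])
print(picks[["batter", "pitch_type", "hit_prob"]])
```

Because the ranking function only depends on `predict_proba`, a trained classifier with the same interface could be swapped in for the placeholder without changing the sorting logic.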

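Hit Picks by Probability

| Batter | Hit Prob | Pitch Type | LA (°) | Hit Speed (mph) | Plate X (ft) | Plate Z (ft) | Spin Rate (rpm) | P Throws | B Stand | Park Factor |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Emmanuel Rivera | 44.2% | FF | 9.0 | 105.9 | -0.18 | 2.5 | 2186.0 | R | R | 99 |
| Andrew Benintendi | 42.1% | FF | 54.0 | 74.1 | 0.45 | 1.89 | 2163.0 | R | L | 99 |
| Carlos Santana | 41.6% | FF | 45.0 | 71.8 | 0.75 | 2.98 | 2163.0 | R | R | 99 |
| Salvador Perez | 33.6% | FF | 14.0 | 90.9 | 0.33 | 2.08 | 2212.0 | R | R | 99 |
| Salvador Perez | 31.2% | CH | 24.0 | 81.1 | 0.41 | 0.49 | 1741.0 | R | R | 99 |
| Andrew Benintendi | 31.4% | CH | 60.0 | 90.9 | 0.41 | 1.53 | 1511.0 | R | L | 99 |
| Hunter Dozier | 30.3% | CH | 36.0 | 81.9 | 0.84 | 1.53 | 1741.0 | R | R | 99 |
| Emmanuel Rivera | 29.2% | CU | 7.0 | 92.0 | 0.46 | 2.55 | 2224.0 | R | R | 99 |
| Michael A. Taylor | 29.0% | SL | 38.0 | 84.4 | 2.76 | 2.92 | 2499.0 | R | R | 99 |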

This output shows my pre-game selections: who is most likely to get a hit based on simulated data. Assuming Alec Mills throws pitches very similar to those he has thrown in the past, I can use it to see who has the swing path and hard-hit ability that best match Mills’ release point (not shown in the output above, but in the model), where he locates the pitch, the pitch type, and so on. The answer to that question seems to be Emmanuel Rivera? Let’s check in on what actually happened to see how the model performed.

Cubs/Royals Box Score (08-22-21)

Yes! One step closer to $5.6 million in this hypothetical world where baseball actually exists! Had I picked Emmanuel Rivera like the model suggested, I would have comfortably made it to the next day’s pick. Here are those two hits. I didn’t run the model to predict for Cubs hitters, but the process would have been exactly the same. I promise I didn’t just find a game where there were lots of hits for this article. While I don’t like the range of prediction percentages (something in the model needs to be tightened up), and the simulated launch angles look a little funny, I do like that it predicted hits to come off of Mills’ fastballs and changeups. I also like that the hit speeds of the predicted hits are reasonable.

I can’t give the model credit for Rivera’s second hit, since it came off a sinker from reliever Adrian Sampson. But his first hit did come off a Mills sinker. The model thought a hard-hit ball (~106 MPH) with a nine-degree launch angle would likely be a hit, and Rivera smacked a 79 MPH single with a 14-degree launch angle. 27 MPH is a significant difference, so maybe the model just got lucky. Likely what happened is the sim data randomly found a hit speed that was nearly Rivera’s max (107.8). However, the two hitters that had the best day on the Royals were Andrew Benintendi and Carlos Santana, and the model predicted them as the 2nd and 3rd most likely to hit. The next step for this model (named Jolt after DiMaggio himself) is to automate the process and generate hit picks with my first sip of coffee each day. Oh, yeah, and the season needs to start as well.




