Creating Synthetic Data In a Data-less World

by Lucas Kelly

March 10, 2022

What will we do without the zeros and ones of spring training? The underground, black market .csv file that comes from the person who knows the person who operates a Rapsodo in a mini-camp? How will we go on without knowing spin rates or the depth of clay infield impression drilled by various brands of signature spikes? I have an idea: let’s make it up.

I won’t pretend to know how video games like Out of the Park Baseball and MLB The Show create their simulations of a starting pitcher’s fastball, but they must be sampling from a pool of data that the pitcher has showcased before, right? Mariano Rivera was known for his high and tight cutter boring in on the hands of left-handed hitters, and video games must just take that data and replicate it with a slight variation here and there. That’s what I’ll do in this post: try to create, or simulate, data.

In my first official post at FanGraphs, I wrote about a model I built to aid in selecting a player to get a hit each day in hopes of beating the streak in the yearly contest, you guessed it, “Beat the Streak.” I am not millions of dollars richer now than I was when I wrote that article, and a little logic will tell you that I did not, in fact, beat the streak. For those unfamiliar with the game, let me explain in one run-on sentence. Each day the contestant (that’s you) must choose a player who will get a hit and you must do this for 57 consecutive days, thus beating Joe DiMaggio’s 56-game hit streak. Sounds simple enough, right? While it is simple to guess who will get a hit, it’s incredibly complicated to use a model to predict or project who will get a hit. This is mostly because it’s difficult to pin down what will happen before the game starts and feed it into a model. But, just like video games create simulations, I’ll try to simulate what should happen based on what has happened already.
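To get a feel for how steep that 57-day climb is, here is a back-of-the-envelope sketch. It assumes every daily pick has the same chance of getting a hit and that the days are independent, which is a simplification and not something the contest or my model guarantees. The function name `streak_probability` is just an illustrative helper.

```python
def streak_probability(daily_hit_prob, days=57):
    """Chance of a perfect streak, assuming independent days with equal odds."""
    return daily_hit_prob ** days


# Compare a few hypothetical daily success rates.
for p in (0.70, 0.80, 0.90):
    print(f"daily {p:.0%} -> 57-day streak {streak_probability(p):.2e}")
```

Even a pick that comes through nine days out of ten leaves the full streak a long shot, which is why squeezing a few extra points of daily accuracy out of a model matters so much.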

I’ll use one specific example by going back in time and pretending I’ve just woken up on August 22nd, 2021, and am using my new approach. Come with me, won’t you? For simplicity’s sake, let’s also pretend there was only one game on that day. You can almost hear the baseball sounds spilling out of Wrigley Field, can’t you? Now, we’re in our pretend time travel world and we’ve decided to play Beat the Streak. We need to make a pick. Alec Mills is pitching for the Cubs, taking on a young Carlos Hernández of the visiting Royals. The model I’ve built has been trained on data for the past two months (start_dt='2021-06-20', end_dt='2021-08-20') and it’s ready to deploy to predict who will get a hit in this matchup. Here are the starting lineups:

2021-08-22: Royals @ Cubs

Cubs		Royals
Name		Name
Rafael Ortega – CF		Whit Merrifield – 2B
Frank Schwindel – 1B		Nicky Lopez – SS
Ian Happ – LF		Salvador Perez – C
Patrick Wisdom – 3B		Andrew Benintendi – LF
Matt Duffy – 2B		Carlos Santana – 1B
Jason Heyward – RF		Hunter Dozier – RF
Austin Romine – C		Michael A. Taylor – CF
Sergio Alcántara – SS		Emmanuel Rivera – 3B
Alec Mills – P (Righty)		Carlos Hernández – P (Righty)

At this point in the season, Mills has a 4.19 ERA and has given up 86 hits in 77.1 IP, good for an opponent batting average of .279. Hernández, on the other hand, has a 4.33 ERA and has given up 45 hits in 52 IP, for an opponent batting average of .234. It looks like we chose a good game to make a pick from. Remember that Beat the Streak only asks you to pick a player to get a hit. Walks don’t count, doubles are just as valuable as singles, and speed, while not in my model, is a factor. So, go ahead, take your pick. Who would you have chosen in these matchups?

Rather than choose, I’ll use my model to predict a hit. Now comes the part where I need to create some synthetic data. My model is built to predict a hit based on just a few things: pitch type, pitch location, spin rate, launch angle, hit speed, and a few other Statcast data points. I need to feed this model the same kind of data that it has been trained on, but I don’t know what Alec Mills or Carlos Hernández will throw. I don’t know how hard Carlos Santana will hit the ball. I don’t know what launch angle Nicky Lopez will send a ball into play with. What I can do is collect data points from what they have already done in the past two months to create simulated data. If you are familiar with machine learning, you might be worried about feeding the model data that it has already been trained on. But the variance that I’ll add into each event should be enough to trick it. Here’s a pseudocode version of what this process looks like:

Pitching data

– create an empty dataframe for pitchers
– fill the dataframe one row at a time with randomly sampled data points for randomly sampled pitches from that pitcher’s arsenal
– compile it up into synthetic pitching data

Hitting data

– create an empty dataframe for hitters
– fill the dataframe one row at a time with randomly sampled data points from previous batted ball data from each player in the lineup that day
– compile it up into synthetic hitting data

Batted ball data

– merge the two data sets together for simulated batted ball data
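The steps above can be sketched in a few lines of Python. This is a minimal illustration, not my actual code: it uses plain lists of dictionaries instead of dataframes so it runs with only the standard library, the history values are made-up placeholders, the helper names (`sample_pitches`, `sample_batted_balls`, `merge_rows`) are hypothetical, and pairing pitch rows with batted-ball rows by position is an assumption about how the merge could work.

```python
import random

# Placeholder history (illustrative values only, not real Statcast data).
PITCH_HISTORY = [
    {"pitch_type": "FF", "plate_x": -0.18, "plate_z": 2.50, "spin_rate": 2186},
    {"pitch_type": "FF", "plate_x": 0.45, "plate_z": 1.89, "spin_rate": 2163},
    {"pitch_type": "CH", "plate_x": 0.41, "plate_z": 0.49, "spin_rate": 1741},
    {"pitch_type": "CH", "plate_x": 0.84, "plate_z": 1.53, "spin_rate": 1511},
]
BATTED_BALL_HISTORY = {
    "Batter A": [{"launch_angle": 9.0, "hit_speed": 105.9},
                 {"launch_angle": 14.0, "hit_speed": 79.0}],
    "Batter B": [{"launch_angle": 24.0, "hit_speed": 81.1},
                 {"launch_angle": 36.0, "hit_speed": 90.9}],
}


def sample_pitches(history, n_rows, rng):
    """Build synthetic pitch rows one at a time from a pitcher's history."""
    by_type = {}
    for row in history:
        by_type.setdefault(row["pitch_type"], []).append(row)
    # Drawing from the raw list keeps each pitch type's usage frequency.
    pitch_types = [row["pitch_type"] for row in history]
    rows = []
    for _ in range(n_rows):
        pitch_type = rng.choice(pitch_types)
        pool = by_type[pitch_type]
        synthetic = {"pitch_type": pitch_type}
        # Each feature is drawn separately, so a row can mix values from
        # different real pitches of the same type.
        for feature in ("plate_x", "plate_z", "spin_rate"):
            synthetic[feature] = rng.choice(pool)[feature]
        rows.append(synthetic)
    return rows


def sample_batted_balls(history_by_batter, rows_per_batter, rng):
    """Build synthetic batted-ball rows for every hitter in the lineup."""
    rows = []
    for batter, history in history_by_batter.items():
        for _ in range(rows_per_batter):
            rows.append({
                "batter": batter,
                "launch_angle": rng.choice(history)["launch_angle"],
                "hit_speed": rng.choice(history)["hit_speed"],
            })
    return rows


def merge_rows(pitch_rows, batted_rows):
    """Pair rows by position to form simulated batted-ball events."""
    return [{**batted, **pitch} for pitch, batted in zip(pitch_rows, batted_rows)]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
batted = sample_batted_balls(BATTED_BALL_HISTORY, rows_per_batter=3, rng=rng)
pitches = sample_pitches(PITCH_HISTORY, n_rows=len(batted), rng=rng)
events = merge_rows(pitches, batted)
```

Sampling each feature on its own is one way to get events that resemble, but do not exactly repeat, what the model saw in training.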

Now, I can run this hypothetical data through my model to make predictions, sort by the probability of the event occurring, and analyze the event to see if it makes any sense at all. Here’s a version of the output simplified to fit on the page:

Hit Picks by Probability

Batter	Hit Prob	Pitch Type	LA	Hit Speed	Plate X	Plate Z	Spin Rate	P Throws	B Stand	Park Factor
Emmanuel Rivera	44.2%	FF	9.0	105.9	-0.18	2.5	2186.0	R	R	99
Andrew Benintendi	42.1%	FF	54.0	74.1	0.45	1.89	2163.0	R	L	99
Carlos Santana	41.6%	FF	45.0	71.8	0.75	2.98	2163.0	R	R	99
Salvador Perez	33.6%	FF	14.0	90.9	0.33	2.08	2212.0	R	R	99
Salvador Perez	31.2%	CH	24.0	81.1	0.41	0.49	1741.0	R	R	99
Andrew Benintendi	31.4%	CH	60.0	90.9	0.41	1.53	1511.0	R	L	99
Hunter Dozier	30.3%	CH	36.0	81.9	0.84	1.53	1741.0	R	R	99
Emmanuel Rivera	29.2%	CU	7.0	92.0	0.46	2.55	2224.0	R	R	99
Michael A. Taylor	29.0%	SL	38.0	84.4	2.76	2.92	2499.0	R	R	99
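Sorting the scored events is the easy part. The sketch below uses the batter, probability, and pitch type columns copied from the table above; collapsing to each hitter’s single best event is my own illustrative choice for turning rows into a pick list, and the function names are hypothetical.

```python
# Rows copied from the table above: (batter, hit probability in %, pitch type).
SCORED_EVENTS = [
    ("Emmanuel Rivera", 44.2, "FF"),
    ("Andrew Benintendi", 42.1, "FF"),
    ("Carlos Santana", 41.6, "FF"),
    ("Salvador Perez", 33.6, "FF"),
    ("Salvador Perez", 31.2, "CH"),
    ("Andrew Benintendi", 31.4, "CH"),
    ("Hunter Dozier", 30.3, "CH"),
    ("Emmanuel Rivera", 29.2, "CU"),
    ("Michael A. Taylor", 29.0, "SL"),
]


def rank_events(events):
    """Sort events from most to least likely to be a hit."""
    return sorted(events, key=lambda event: event[1], reverse=True)


def best_per_batter(events):
    """Keep only each batter's highest-probability event, then rank them."""
    best = {}
    for batter, prob, pitch in events:
        if batter not in best or prob > best[batter][1]:
            best[batter] = (batter, prob, pitch)
    return rank_events(best.values())


picks = best_per_batter(SCORED_EVENTS)
for batter, prob, pitch in picks:
    print(f"{batter:<18} {prob:5.1f}%  {pitch}")
```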

This output shows my pre-game selections: who is most likely to get a hit based on simulated data. Assuming Alec Mills throws pitches much like those he has thrown in the past, it tells me whose swing path and hard-hit ability best match Mills’ release point (in the model, but not shown in the output above), pitch location, pitch type, and so on. The answer to that question seems to be Emmanuel Rivera?? Let’s check in on what actually happened to see how the model performed.

Cubs/Royals Box Score (08-22-21)

Yes! One step closer to $5.6 million in this hypothetical world where baseball actually exists! Had I picked Emmanuel Rivera like the model suggested, I would have comfortably made it to the next day’s pick. Here are those two hits. I didn’t run the model to predict for Cubs hitters, but the process would have been exactly the same. I promise I didn’t just find a game where there were lots of hits for this article. While I don’t like the range of prediction percentages (something in the model needs to be tightened up), and the simulated launch angles look a little funny, I do like that it predicted hits to come off of Mills’ fastballs and changeups. I also like that the hit speeds of the predicted hits are reasonable.

I can’t give the model credit for Rivera’s second hit; it came off a sinker from reliever Adrian Sampson. But his first hit did come off a Mills sinker. The model thought a hard-hit ball (~106 MPH) with a nine-degree launch angle would likely be a hit, and Rivera smacked a 79 MPH single with a 14-degree launch angle. 27 MPH is a significant difference, so maybe the model just got lucky. Most likely, the sim data randomly landed on a hit speed that was nearly Rivera’s max (107.8 MPH). However, the two hitters that had the best day on the Royals were Andrew Benintendi and Carlos Santana, and the model predicted them as the second and third most likely to get a hit. The next step for this model (named Jolt after DiMaggio himself) is to automate the process and generate hit picks with my first sip of coffee each day. Oh, yeah, and the season needs to start as well.

2 Comments

Travis L (Member since 2016)

4 years ago

Would you mind posting a link to your code? Gist.github.com is great for this!

Lucas Kelly (FanGraphs Staff)

Reply to Travis L

I’m happy to post sections of code, but I won’t be posting the entire process. If you’re curious about a certain aspect, let me know and I’ll post to my GitHub and share a link.
