A lesson I Learned This Week About Statcast
We’re in the final days of July, and understandably much of the conversation has centered on the trade deadline. Trades are a great circus in many ways, touching on a plethora of the interests we fans have in the game. The deadline is the genesis of many of the greatest discussions in baseball, and it certainly deserves the mantle it has been given. However, the trade deadline, and the All-Star Game for that matter, while fascinating, both obscure another interesting aspect of this game we love: July is the point of the season where both pitching and offense peak. Pitchers are, generally speaking, throwing at their highest velocities, with the greatest movement. Home runs are hit with the highest frequency and travel the longest distances. We are deep enough into the season that everyone is warmed up and locked in, but not quite deep enough for players to really begin suffering from fatigue. July is a great month for baseball.
It is important to keep in mind that the game goes through this natural peak toward the middle of the season, especially now that we focus so much on increasingly abstract measures of the game coming from Statcast. We have exit velocities, launch angles, route efficiencies, top speeds, and on and on. This is all great information, and you know I love this stuff, but it is important to take a step back and recognize our own limitations when it comes to analyzing it.
Personally, I’ve struggled with this concept up through today, and I am working on being more careful in the future. It is easy for me to look at average batted ball velocities, distances, and home run rates and forget how important the time of year and the weather are. It comes up pretty often when I am working on my Citi Field Home Run project. Last year, Citi Field saw 2.69 home runs per game in July, the highest of any month in the stadium’s history. This year it is at 3.31 home runs per game. At one point it was 4.5 home runs per game for the month, before a few lean games knocked it down a bit. Now, this is an interesting factoid, perhaps. You could scream “juiced ball!” and perhaps you’d be right. That’s not the important part, though. Looking at this data, I realized that each season the months tend to follow a certain pattern. I don’t know whether this pattern generalizes to other stadiums or to MLB as a whole, but for Citi Field I see the following home run per game rates (normalized to the July rate):
| Month | HR Rate |
| --- | --- |
| April | 0.83 |
| May | 0.76 |
| June | 0.82 |
| July | 1.00 |
| August | 0.78 |
| September | 0.76 |
| October | 0.90 |
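The normalization above is simple enough to sketch in a few lines of Python. The monthly counts below are made-up placeholders, not the real Citi Field data:

```python
# Sketch (not the author's actual code): normalize monthly HR-per-game
# rates to the July rate. The counts here are illustrative placeholders.
monthly = {            # month: (home_runs, games) -- made-up numbers
    "April": (25, 10),
    "May":   (23, 10),
    "June":  (27, 11),
    "July":  (33, 10),
}

rates = {m: hr / g for m, (hr, g) in monthly.items()}
normalized = {m: round(r / rates["July"], 2) for m, r in rates.items()}
print(normalized)  # July is 1.00 by construction
```

Dividing every month by the July rate makes the peak month the baseline, so the table reads as "fraction of the July rate" rather than raw home runs per game.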
We all know that weather plays a big part in the game, but when you start looking at abstract stats, I think we have a tendency to overlook context and potential flaws. It is the man-in-a-lab-coat effect: you feel like someone dressed as a scientist probably knows what they are talking about, even if what they are saying doesn’t quite add up. You become more willing to subconsciously overlook small details, and that can lead to poor interpretations and mistakes.
Everyone knows the ball doesn’t travel as far during the early spring due to cold weather, and that it travels farther as the weather warms up. This is often discussed in terms of home runs, which is understandable, because home runs are easy to quantify and have been the only real way we could objectively measure offensive performance in the past. Batted balls in play were a black box for most of baseball history, and since we could largely predict the outcomes of games without them, they were discarded and largely ignored. But now we have more and more information about these batted balls, everything from launch angle and exit velocity through fielder positioning, route efficiency, and speed. I suspect we are prone to overlooking the importance of weather when interpreting this data.
For example, I’ve calculated the average distance for batted balls in the 2015 and 2016 seasons using the following criteria: exit velocity between 100 and 102 mph, vertical launch angle between 25 and 27 degrees, and only right-handed pitchers facing right-handed batters. I did this to remove as many variables as possible.
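A filter like this could be sketched with pandas. The column names below (`launch_speed`, `launch_angle`, `p_throws`, `stand`, `hit_distance_sc`, `game_date`) follow Baseball Savant’s CSV export and are my assumption here, not necessarily the author’s actual pipeline:

```python
import pandas as pd

# Sketch: apply the batted-ball filter described above to a
# Statcast-style DataFrame, then average distance by month.
# Column names are assumptions based on Baseball Savant exports.
def monthly_avg_distance(df: pd.DataFrame) -> pd.Series:
    sel = df[
        df["launch_speed"].between(100, 102)      # exit velocity, mph
        & df["launch_angle"].between(25, 27)      # vertical launch angle, deg
        & (df["p_throws"] == "R")                 # right-handed pitcher
        & (df["stand"] == "R")                    # right-handed batter
    ]
    months = pd.to_datetime(sel["game_date"]).dt.month
    return sel["hit_distance_sc"].groupby(months).mean()
```

With a narrow velocity and angle window plus fixed handedness, whatever monthly variation remains is much more likely to come from external conditions such as weather.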
| Month | 2015 (ft) | 2016 (ft) |
| --- | --- | --- |
| April | 375 | 382 |
| May | 383 | 389 |
| June | 387 | 390 |
| July | 392 | 393 |
| August | 380 | — |
| September | 374 | — |
| October | 379 | — |
The batted ball distances, given a tight range of initial parameters, vary considerably by month, following a pattern you may have suspected. This isn’t groundbreaking material, but it is important to keep in mind. In 2015, balls batted in this manner varied by 18 feet in average distance between July and September. How many balls would have fallen for hits had they traveled just a foot farther in September, sailing over the head and just out of reach of an outfielder?
When you look at batted ball velocity, keep in mind the context in which the game was played: the time of season, the location of the ballpark, and the weather. The value we place on these velocities changes with respect to those other variables.
With my xStats, I have actual home runs and estimated home runs side by side for all pitchers. Overall, the estimates are reasonably close to the actual numbers; many pitchers have roughly equal numbers of estimated and actual home runs, often within half a home run. As a result, my scFIP can be nearly identical to FIP in many cases, which is by design: it was meant to be a very small correction to FIP rather than a totally independent number. Some pitchers have given up far more or fewer home runs than expected, and those are interesting cases to examine separately. However, I also have monthly totals of expected and actual home runs, and as you may have guessed, these numbers diverge following the same sort of pattern as the average batted ball distances.
| Month | Actual HR | Estimated HR | Difference |
| --- | --- | --- | --- |
| Total | 3474 | 3443.30 | 30.70 |
| April | 740 | 812.57 | -72.57 |
| May | 965 | 1001.31 | -36.31 |
| June | 1012 | 967.34 | 44.66 |
| July | 757 | 662.05 | 94.95 |
For xStats, I assume average success rates for all batted balls. This works well overall, but when you look at the numbers on a month-to-month basis, the issue with weather and temperature rears its ugly head. In April, May, and September you see overestimates of success rates; in June, July, and August, you see underestimates. It evens out in the end, but as you look at smaller and smaller sample sizes, you may see larger and larger divergences between the average success rates for batted balls given raw data and the actual in-game success rates. Context becomes increasingly important with smaller samples.
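The monthly divergence is easy to compute from the table above. This snippet simply reproduces that arithmetic (the figures are from the table; the variable names are mine, not part of xStats):

```python
# Reproduce the actual-vs-estimated HR comparison from the table above.
# Negative difference = the model overestimated home runs (cold months);
# positive difference = the model underestimated them (warm months).
actual    = {"April": 740,    "May": 965,     "June": 1012,   "July": 757}
estimated = {"April": 812.57, "May": 1001.31, "June": 967.34, "July": 662.05}

diff = {m: round(actual[m] - estimated[m], 2) for m in actual}

overestimated  = [m for m, d in diff.items() if d < 0]
underestimated = [m for m, d in diff.items() if d > 0]
print(diff)
```

The sign flip between the spring months and midsummer is the weather effect in miniature: a single league-average success rate is too generous in April and too stingy in July.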
I’m not sure how I can address this problem with xStats, but brainstorming solutions isn’t the goal here. I merely hope to pass on a lesson I’ve learned this week about Statcast and baseball in general. This is a rich beautiful game with a never ending rabbit hole of complexity and surprises, and I wouldn’t have it any other way.
Andrew Perpetua is the creator of CitiFieldHR.com and xStats.org, and plays around with Statcast data for fun. Follow him on Twitter @AndrewPerpetua.