PCA Outlier Hitters

Baseball collects a lot of data. It’s awesome. FanGraphs is a fun place for data. Exit velocity, spin rate, launch angle: these are fun data points. But the vast majority of data floating around in lakes and clouds is generally not as exciting. Take some kind of machinery, for example. Right now there are gears whirling, sensors sensing, detectors detecting; you get the point. This type of data is typically referenced in discussions about the Internet of Things (IoT). Baseball analytics has always benefited from what is learned in industry, and in this post I’ll be investigating whether a common industry technique, Principal Component Analysis (PCA), can be useful in baseball analytics.

When data comes streaming in on a constant basis, it can be difficult to analyze, let alone store, in traditional ways. Take, for example, a sensor in a delivery truck monitoring the workings of an engine part. The goal when analyzing this data is generally to detect an anomaly or a degradation so that mechanics can be alerted. Most likely, the truck doesn’t have one sensor; it has many, and all those data points are streaming in and compounding into a giant snowball of rows and columns. PCA is a technique that takes all that high-dimensional data and summarizes it in a simpler way, keeping as much of the variation in the original data as it can. What once was perhaps 1,000 columns of data can become two, three, or four columns that still do a pretty good job of explaining the original 1,000.
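To make the "many columns down to a few" idea concrete, here’s a minimal sketch in Python using only NumPy. It is not tied to any real truck or sensor feed: the 50 "sensor" columns are synthetic, generated from three hidden drivers plus a little noise, and the function name `pca_reduce` is just a label I’ve made up for illustration.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X (rows = observations) onto its top principal components."""
    # PCA works on centered data, so subtract each column's mean
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions,
    # ordered from most to least variance explained
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    scores = X_centered @ Vt[:n_components].T
    return scores, explained[:n_components]

# Synthetic "sensor" data: 500 readings of 50 columns (made up for illustration)
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))    # three hidden drivers
mixing = rng.normal(size=(3, 50))     # how each driver shows up in each sensor
sensors = latent @ mixing + 0.1 * rng.normal(size=(500, 50))

scores, explained = pca_reduce(sensors, n_components=3)
print(scores.shape)      # (500, 3)
print(explained.sum())   # share of total variance kept by the three components
```

Because the fake data really is driven by three underlying factors, three components keep nearly all of the variance. Real sensor data is rarely that tidy, so the share retained is something to check rather than assume.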

Now, to the baseball!

Certainly, there are sensors in baseball. I’m sure bat sensors and swing sensors are collecting all kinds of biomechanical data that ends up floating around in some team cloud. I don’t have access to that. However, Statcast data is not too far from streaming data. While it’s not collected all day, it does come in nearly every minute of a baseball game. If you’ve ever analyzed Statcast data, you know that there’s a lot going on in the raw data. The goal of this process is to make it easier to find outlier, or anomalous, batted ball events.

I’m only going to focus on three Statcast columns: ‘hit_distance_sc’, ‘launch_speed’, and ‘launch_angle’. I’ve collected all Statcast data from 2021, and after removing rows with missing values I’m left with just under 92,000 rows of data. Next, I’ll run the PCA to reduce these three columns down to just one. When I plot the resulting principal component against the observation number, we can start to see outliers.
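As a rough sketch of that reduction step (not the code behind this post), the snippet below runs a one-component PCA on three columns named after the Statcast fields. The data is synthetic, and the formula tying distance to speed and angle is a crude invention so the columns are correlated. I’m also assuming the columns are centered but not rescaled, which the write-up above doesn’t specify; standardizing each column first is a common alternative and would give different scores.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Synthetic stand-ins for the three Statcast columns (NOT real data)
launch_speed = rng.normal(88, 13, n).clip(30, 120)   # mph
launch_angle = rng.normal(12, 25, n).clip(-80, 80)   # degrees
# Made-up relationship: hard-hit balls near a ~28 degree angle travel farthest
hit_distance_sc = (
    4.0 * (launch_speed - 30) * np.exp(-((launch_angle - 28) / 30) ** 2)
    + rng.normal(0, 15, n)
).clip(0, None)                                      # feet

X = np.column_stack([hit_distance_sc, launch_speed, launch_angle])
X = X[~np.isnan(X).any(axis=1)]    # drop rows with missing values

# One-component PCA: center, then take the first right singular vector
X_centered = X - X.mean(axis=0)
_, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
direction = Vt[0]
if direction[0] < 0:
    # The sign of a principal direction is arbitrary; orient it so that
    # longer hit distances get higher scores
    direction = -direction

pc1 = X_centered @ direction                 # one score per batted ball
explained_ratio = S[0] ** 2 / np.sum(S ** 2)
print(direction)         # loadings on distance, speed, angle
print(explained_ratio)   # share of total variance captured by this component
```

One thing to watch with unscaled columns: the column with the largest spread (here, distance in feet) tends to dominate the component, so printing the loadings is a quick way to see what the single number is really measuring.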

I’ve simply plotted the principal component against the event number in order to visualize the outliers. The x-axis really means nothing here, but look at the yellow data points at the top. These are the batted ball events that I’m considering outliers: anything with a principal component above 260. Here’s a quick look at the players with the most outliers in this analysis:

Outlier Events
Among all 2021 hitters
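Counting outliers per hitter is just a filter and a tally. Here’s a small sketch of that step: the 260 cutoff comes from the chart above, but the player labels and scores are placeholders I’ve invented, not real events.

```python
from collections import Counter

THRESHOLD = 260  # principal component cutoff used in this post

# (player, principal component score) pairs; made-up values for illustration
events = [
    ("Player A", 301.2), ("Player B", 145.0), ("Player A", 262.5),
    ("Player C", 259.9), ("Player B", 270.1), ("Player A", -35.4),
]

# Keep only the events above the cutoff, then count them by player
outliers = [(name, score) for name, score in events if score > THRESHOLD]
counts = Counter(name for name, _ in outliers)

leaderboard = counts.most_common()   # sorted from most to fewest outliers
print(leaderboard)                   # [('Player A', 2), ('Player B', 1)]
```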

Here are a few examples of those events:

Event Data Sorted by Principal Component
Name              Hit Distance (ft)  Exit Velocity (mph)  Launch Angle (deg)  Principal Component
Nolan Arenado                 484.0                106.8                16.0                  326
Ronald Acuña Jr.              481.0                111.9                27.0                  325
Marcell Ozuna                 479.0                114.3                25.0                  323
Ryan McMahon                  478.0                109.4                28.0                  322
Franchy Cordero               474.0                118.6                29.0                  319
Among all 2021 hitters

Does this look like outlier behavior to you?

What about this?


These three data points are consolidated into one number, telling us which hitters stand out from the rest in batted ball events. Statcast data may not be IoT streaming data, but the analytical techniques that are used can be similar. Is there much use here for fantasy managers? Maybe. Outlier detection, in this case, is not much different from sorting for max exit velocity on leaderboards and then keeping an eye on any new player that pops up. But this process tells us a little more than a one-column leaderboard sort does, and a PCA can be conducted on any combination of metrics.

In reality, the technique that I’ve presented here may be better suited to detecting outliers of interest for MLB as a whole. I can’t think of anything right now that MLB would want to detect in large data sets to inform decision making, but maybe something will come up.





8 Comments
ryannicholasparker (member)
2 years ago

can you check this for Pete Crow Armstrong? what I’m saying is I want a PCA PCA

rustydude (member)
2 years ago

Does he hit in the Pacific Coast League? Then you’d have a PCL PCA PCA.