PCA Outlier Hitters

Baseball collects a lot of data. It’s awesome. FanGraphs is a fun place for data. Exit velocity, spin rate, launch angle: these are fun data points. But the vast majority of data floating around in lakes and clouds is not nearly as exciting. Take machinery, for example. Right now there are gears whirring, sensors sensing, detectors detecting; you get the point. This type of data typically comes up in discussions about the Internet of Things (IoT). Baseball analytics has always benefited from what is learned in industry, and in this post I’ll investigate whether a common industry technique, principal component analysis (PCA), can be useful in baseball analytics.

When data streams in constantly, it can be difficult to analyze, let alone store, in traditional ways. Take, for example, a sensor in a delivery truck monitoring an engine part. The goal when analyzing this data is generally to detect an anomaly or degradation so that mechanics can be alerted. Most likely the truck has not one sensor but many, and all those data points stream in and compound into a giant snowball of rows and columns. PCA is a technique that takes all that high-dimensional data and summarizes it in fewer dimensions while keeping as much of the original variation as possible. What once was perhaps 1,000 columns of data can become two, three, or four columns that still do a pretty good job of explaining the original 1,000.
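To make the "many columns down to a few" idea concrete, here is a minimal sketch using NumPy. The sensor readings are synthetic: 50 hypothetical sensor columns are generated from 3 hidden factors plus a little noise, which is an assumption made purely for illustration and not real truck or baseball data. The helper name `explained_variance_ratio` is also my own.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance captured by each principal component."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)          # PCA works on mean-centered columns
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2       # proportional to component variances
    return variances / variances.sum()

# Synthetic stand-in for streaming sensor data (assumption, not real readings):
# 50 sensor columns that are all driven by 3 hidden factors plus small noise.
rng = np.random.default_rng(42)
n_rows, n_sensors, n_factors = 500, 50, 3
factors = rng.normal(size=(n_rows, n_factors))
mixing = rng.normal(size=(n_factors, n_sensors))
readings = factors @ mixing + rng.normal(scale=0.1, size=(n_rows, n_sensors))

ratio = explained_variance_ratio(readings)
cumulative = np.cumsum(ratio)
print("Variance explained by the first 3 components:", round(float(cumulative[2]), 4))
```

Because the synthetic columns really are driven by only three factors, the first three components capture nearly all of the variance. Real sensor data will rarely be this tidy, so the cumulative share is worth checking before deciding how many components to keep.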

Now, to the baseball!

Certainly, there are sensors in baseball. I’m sure bat sensors and swing sensors are collecting all kinds of biomechanical data that ends up floating around in some team cloud. I don’t have access to that. However, Statcast data is not too far from streaming data. While it’s not collected all day, it does come in nearly every minute of a baseball game. If you’ve ever analyzed Statcast data, you know that there’s a lot going on in the raw data. The goal of this process is to make it easier to find outlier, or anomalous, batted-ball events.

I’m only going to focus on three Statcast fields: ‘hit_distance_sc’, ‘launch_speed’, and ‘launch_angle’. I’ve collected all Statcast data from 2021, and after removing rows with missing values I’m left with just under 92,000 rows. Next, I’ll run the PCA to reduce these three columns down to just one. When I plot the resulting principal component against the observation number, we can start to see outliers.
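For readers who want to try the reduction step themselves, here is a minimal sketch of collapsing three columns into one principal component with NumPy. It is not necessarily the exact procedure used for this post: the data below is randomly generated stand-in data rather than real Statcast rows, the function name `first_principal_component` is hypothetical, and whether to standardize the columns first is a choice the post does not spell out, so it is left as an optional flag.

```python
import numpy as np

def first_principal_component(X, standardize=False):
    """Project each row of X onto the first principal component."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    if standardize:
        # Optional: put every column on the same scale before the PCA.
        centered = centered / X.std(axis=0, ddof=1)
    # Rows of vt are the principal directions, ordered by variance explained.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    # The sign of a principal direction is arbitrary; orient it so that the
    # largest-magnitude loading is positive.
    if direction[np.argmax(np.abs(direction))] < 0:
        direction = -direction
    scores = centered @ direction
    explained = s[0] ** 2 / np.sum(s ** 2)
    return scores, direction, explained

# Synthetic stand-in data (assumption: NOT real Statcast values).
rng = np.random.default_rng(0)
n = 1000
launch_speed = rng.normal(88, 14, n)                 # mph
launch_angle = rng.normal(12, 25, n)                 # degrees
noise = rng.normal(0, 30, n)
hit_distance = np.clip(2.5 * launch_speed + 1.5 * launch_angle - 80 + noise, 0, None)

# Column order mirrors hit_distance_sc, launch_speed, launch_angle.
X = np.column_stack([hit_distance, launch_speed, launch_angle])
scores, direction, explained = first_principal_component(X)
print("Loadings:", np.round(direction, 3))
print("Share of variance explained:", round(float(explained), 3))
```

Without standardizing, the column with the largest spread (here, distance in feet) tends to dominate the component. With real data, the rows would come from the three Statcast columns after dropping missing values.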

I’ve simply plotted the principal component against the event number in order to visualize the outliers. The x-axis really means nothing here, but look at the yellow data points at the top. These are the batted-ball events that I’m considering outliers: anything with a principal component above 260. Here’s a quick look at the players with the most outliers in this analysis:
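The flag-and-count step can be sketched in a few lines. The 260 cutoff comes from the plot described above; the hitter names and scores below are made up for illustration, and `count_outliers` is a hypothetical helper name.

```python
from collections import Counter

import numpy as np

def count_outliers(names, scores, cutoff=260.0):
    """Count, per hitter, the events whose score is above the cutoff."""
    scores = np.asarray(scores, dtype=float)
    mask = scores > cutoff                      # True for outlier events
    flagged = np.asarray(names)[mask].tolist()  # hitter for each outlier event
    return Counter(flagged)

# Made-up example events (assumption: not real 2021 results).
names = ["Hitter A", "Hitter B", "Hitter A", "Hitter C", "Hitter A", "Hitter B"]
scores = [301.2, 120.5, 265.0, 259.9, 88.1, 270.3]

counts = count_outliers(names, scores)
for name, total in counts.most_common():
    print(name, total)
```

A fixed cutoff read off a plot is the simplest possible rule. A percentile of the scores would be another reasonable way to set it.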

[Table: Outlier Events, among all 2021 hitters]

Here are a few examples of those events:

Event Data Sorted by Principal Component

Name               Hit Distance (ft)   Exit Velocity (mph)   Launch Angle (deg)   Principal Component
Nolan Arenado      484.0               106.8                 16.0                 326
Ronald Acuña Jr.   481.0               111.9                 27.0                 325
Marcell Ozuna      479.0               114.3                 25.0                 323
Ryan McMahon       478.0               109.4                 28.0                 322
Franchy Cordero    474.0               118.6                 29.0                 319

Among all 2021 hitters

Does this look like outlier behavior to you?

What about this?


The combination of these data points is consolidated into one number, telling us who stands out from the rest in batted-ball events. Statcast data may not be IoT streaming data, but the analytical techniques can be similar. Is there much use here for fantasy managers? Maybe. Outlier detection, in this case, is not much different from sorting leaderboards by max exit velocity and then keeping an eye on any new player who pops up. But this process tells us a little more than a one-column leaderboard sort does, and a PCA can be run on any combination of metrics.

In reality, the technique I’ve presented here may be better suited to detecting outliers of interest to MLB as a whole. I can’t think of anything right now that MLB might want to detect in large data sets to inform decision making, but maybe something will come up.





ryannicholasparker
4 years ago

can you check this for Pete Crow Armstrong? what I’m saying is I want a PCA PCA

rustydude (Member since 2021)
4 years ago

Does he hit in the Pacific Coast League? Then you’d have a PCL PCA PCA.