Year Three of xStats–A Review

I have spent the past few years creating a family of stats that I’ve called xStats. These stats use Statcast batted ball metrics to analyze each player, and I then manipulate and export the results in a form I hope is useful for fans and analysts.

Exit velocity and launch angle data are good, and I include those, but they aren’t yet intuitive for most baseball fans, so I have set out to display my data in terms of numbers that are more relatable, namely the standard slash line numbers. I have expected batting average, on base percentage, slugging percentage, batting average on balls in play, and weighted on base average. For pitchers I have bbFIP, which is an ERA scalar. Today, though, I’m only going to be looking at batters.

These stats are available, but they don’t help much unless you know how well they are working. To that end, I have created the following table, which compares the regular, standard slash line to the xStats slash line.

The top portion of the table shows the correlation coefficient (R) for each respective stat, and the bottom shows the mean squared error (MSE). I further divided this into 2015 vs 2017 stats (15-17), 2016 vs 2017 (1yr), and 2015+2016 vs 2017 (2yr). I hope these labels make sense.

Coefficient of Correlation and Mean Squared Error
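| Source | Sample | Metric | AVG | OBP | SLG | BABIP | wOBA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Regular | 1yr 15-17 | R | .403 | .463 | .469 | .312 | .439 |
| xStats | 1yr 15-17 | R | .449 | .484 | .513 | .323 | .494 |
| Regular | 1yr | R | .353 | .417 | .408 | .300 | .372 |
| xStats | 1yr | R | .423 | .467 | .490 | .380 | .465 |
| Regular | 2yr | R | .444 | .497 | .500 | .350 | .465 |
| xStats | 2yr | R | .426 | .498 | .518 | .303 | .487 |
| Regular | 1yr 15-17 | MSE | .00137 | .00163 | .00609 | .00203 | .00190 |
| xStats | 1yr 15-17 | MSE | .00110 | .00279 | .00544 | .00152 | .00166 |
| Regular | 1yr | MSE | .00133 | .00147 | .00563 | .00188 | .00171 |
| xStats | 1yr | MSE | .00101 | .00275 | .00453 | .00128 | .00137 |
| Regular | 2yr | MSE | .00103 | .00113 | .00429 | .00154 | .00131 |
| xStats | 2yr | MSE | .00085 | .00102 | .00362 | .00125 | .00109 |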

xStats has a stronger correlation coefficient for each stat in both single season variations. In the two year version, xStats is weaker in batting average and BABIP, and roughly tied in terms of OBP. I’ll get back to this in a moment.

The single season variants of xStats have less error in every stat except OBP. For the two year version, xStats is superior in every regard.
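For readers who want to reproduce this kind of comparison, here is a minimal sketch of the two measures used in the table: the Pearson correlation coefficient (R) and the mean squared error (MSE) between a prior-season stat and the following season's observed stat. The function names and the sample values are hypothetical and are not taken from the xStats code or from real players.

```python
import math


def pearson_r(predicted, actual):
    """Pearson correlation coefficient (R) between two equal-length sequences."""
    n = len(predicted)
    if n != len(actual) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    covariance = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    spread_p = sum((p - mean_p) ** 2 for p in predicted)
    spread_a = sum((a - mean_a) ** 2 for a in actual)
    if spread_p == 0 or spread_a == 0:
        raise ValueError("R is undefined when either sequence is constant")
    return covariance / math.sqrt(spread_p * spread_a)


def mean_squared_error(predicted, actual):
    """Average of the squared differences between predicted and actual values."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


# Made-up wOBA values for four imaginary batters, for illustration only.
prior_season = [0.300, 0.320, 0.340, 0.360]
next_season = [0.310, 0.315, 0.350, 0.355]

print("R:  ", round(pearson_r(prior_season, next_season), 3))
print("MSE:", round(mean_squared_error(prior_season, next_season), 5))
```

A higher R means the earlier numbers track the later ones more closely, while a lower MSE means the earlier numbers land closer to the later ones on average. The two can disagree, which is why the table reports both.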

First, a comment about OBP. The single season xStats versions do not include a predicted walk component at this time. It is something I should include in the future, but at this point xStats assumes the walk rate and strikeout rate witnessed in that season. The two year variation includes a predicted walk and strikeout rate. As a result, the two year variation is better at predicting OBP.
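To make the walk-rate point concrete, here is a rough sketch of how an expected OBP could be assembled when only the hits are modeled and the walks are simply carried over from the observed season. It uses the standard OBP formula, (H + BB + HBP) / (AB + BB + HBP + SF); the function name, the parameter names, and the idea of substituting an expected hit total are my assumptions for illustration, not the actual xStats implementation.

```python
def expected_obp(expected_hits, walks, hit_by_pitch, at_bats, sac_flies):
    """OBP with modeled hits but observed walks, HBP, at bats, and sac flies.

    Assumption: only the hit total comes from a batted ball model; every
    other input is the raw count witnessed in that season.
    """
    denominator = at_bats + walks + hit_by_pitch + sac_flies
    if denominator <= 0:
        raise ValueError("need at least one qualifying plate appearance")
    return (expected_hits + walks + hit_by_pitch) / denominator


# Made-up season line: the same expected hits with two different walk totals.
low_walks = expected_obp(expected_hits=150.0, walks=40, hit_by_pitch=5,
                         at_bats=500, sac_flies=5)
high_walks = expected_obp(expected_hits=150.0, walks=80, hit_by_pitch=5,
                          at_bats=500, sac_flies=5)
print(round(low_walks, 3), round(high_walks, 3))
```

Because the walk total passes straight through, any season with an unusually high or low walk rate carries that quirk directly into the expected OBP, which is one way a single season version could drift from the following year's result.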

Second, a comment about the two year variation. These two year numbers come from a sheet I posted preseason, which you can find here under “Batters New, 3/18/2017.” During the course of the baseball season I rewrote much of the code involved with these multi-year xStats variants. In the process I found a few errors in the code relating to strikeouts, which are now fixed. I hope the newest versions will fare better going forward.

Batters xStats Pegged Correctly

Generally speaking, xStats had less error than the standard stats, especially in terms of wOBA. Using wOBA as a measuring stick, I have identified a few players whom xStats forecast more accurately than the standard metrics did.

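| Line | AVG | OBP | SLG | BABIP | wOBA |
| --- | --- | --- | --- | --- | --- |
| 2016 Standard | .298 | .357 | .500 | .361 | .365 |
| 2016 xStats | .261 | .354 | .432 | .306 | .323 |
| 2017 Standard | .270 | .350 | .424 | .299 | .332 |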

Andrew Benintendi had a BABIP-fueled surge in performance in 2016, and he came back down to earth in 2017. Perhaps this wasn’t the greatest surprise in the world, but it is nice to see that xStats just about nailed his stats across the board. A lot of people had named him the most likely player to win Rookie of the Year prior to the season. With his 2016 figures, he may have had a shot, but if you had taken a look at his xStats numbers you might have felt differently.
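One plausible way to check which line was the better forecast for a single player is to compare absolute errors against the following season. The sketch below does this with the wOBA figures from the table above, assuming the last column is wOBA; the function name is hypothetical, and this is not necessarily how the players in this section were selected.

```python
def closer_forecast(standard, xstat, actual):
    """Return which prior-season number landed closer to the next season.

    The result is a tuple of (label, standard_error, xstat_error), where the
    label is "xStats", "standard", or "tie".
    """
    standard_error = abs(standard - actual)
    xstat_error = abs(xstat - actual)
    if xstat_error < standard_error:
        label = "xStats"
    elif standard_error < xstat_error:
        label = "standard"
    else:
        label = "tie"
    return label, standard_error, xstat_error


# wOBA values from the table above: 2016 standard, 2016 xStats, 2017 standard.
label, standard_error, xstat_error = closer_forecast(0.365, 0.323, 0.332)
print(label, round(standard_error, 3), round(xstat_error, 3))
```

In this case the standard 2016 wOBA misses the 2017 figure by about 33 points, while the xStats figure misses by about 9.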

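| Line | AVG | OBP | SLG | BABIP | wOBA |
| --- | --- | --- | --- | --- | --- |
| 2016 Standard | .340 | .368 | .550 | .398 | .387 |
| 2016 xStats | .299 | .344 | .468 | .345 | .338 |
| 2017 Standard | .277 | .332 | .439 | .324 | .331 |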

In addition to these numbers, I produced a second xStats slash line in the offseason by adjusting his batted ball profile, which you can read about here. That slash line was:

| Line | AVG | OBP | SLG | BABIP | wOBA |
| --- | --- | --- | --- | --- | --- |
| Adjusted xStats | .277 | .313 | .458 | .352 | .348 |

As it turns out, there is some merit in adjusting batted ball distributions when they are observed to be atypical, for example when a given batter hits far too many high exit velocity balls or too many low exit velocity ones. Generally speaking, batters tend to follow a predictable distribution of high, medium, and low end batted balls. This distribution is governed by their maximum exit velocity, which is in turn governed by their swing mechanics and technical skill. Once you find a player’s maximum exit velocity, you should be able to draw a distribution of batted balls which will roughly approximate their future performance.

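| Line | AVG | OBP | SLG | BABIP | wOBA |
| --- | --- | --- | --- | --- | --- |
| 2016 Standard | .264 | .349 | .475 | .278 | .343 |
| 2016 xStats | .298 | .430 | .518 | .326 | .380 |
| 2017 Standard | .289 | .366 | .536 | .322 | .374 |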

Justin Bour had an injury-plagued 2016 season, and a lot of people seemed to cast doubt on his future value as a result. xStats certainly did not. Rather, it saw him as a threat in terms of both batting average and power. His OBP number got a bit wacky, but again, this version of xStats does not include a predictive measure for walk rates. In every other respect, xStats totally nailed Bour’s 2017 performance.

These are three examples of players xStats performed very well with. Other names are:

Jean Segura, Mookie Betts, Ian Kinsler, Cameron Maybin, Keon Broxton, Jarrett Parker, Yangervis Solarte, Lucas Duda, Ichiro Suzuki, Drew Butera, Tony Wolters, Robbie Grossman, Joey Rickard, Ryan Braun, Gregor Blanco, and Adeiny Hechavarria.

Batters xStats Entirely Missed

Of course, there are batters that xStats totally whiffed on. Some had injury-plagued seasons, like Miguel Cabrera. Others appear to have dramatically changed their approach at the plate. Still others simply had down years.

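| Line | AVG | OBP | SLG | BABIP | wOBA |
| --- | --- | --- | --- | --- | --- |
| 2016 Standard | .298 | .376 | .483 | .356 | .367 |
| 2016 xStats | .314 | .442 | .522 | .371 | .388 |
| 2017 Standard | .282 | .369 | .439 | .336 | .348 |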

In 2016, Yelich had much more of a power stroke, which disappeared for much of the 2017 season, especially in the second half, when Yelich appears to have homed in on a line drive approach. Having a line drive approach doesn’t mean power numbers disappear, though, and in fact can mean quite the opposite. Rather, Yelich appears to have a set of swing mechanics that are ideally suited for hitting balls between 10 and 20 degrees, prime line drive real estate. When he focused on hitting balls in this range, his production soared. Continue hitting balls hard in this area, and occasionally your hard hit doubles will clear a fence, but your HR/FB might not be as high.

In other words, Yelich doesn’t appear to be trying to lift pitches in the top half of the zone any longer. Instead, he is focusing on the middle of the zone, where he will hit a bunch of line drives. As a result of this change of approach, his expected slugging was way off the mark.

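| Line | AVG | OBP | SLG | BABIP | wOBA |
| --- | --- | --- | --- | --- | --- |
| 2016 Standard | .277 | .305 | .448 | .292 | .319 |
| 2016 xStats | .250 | .292 | .369 | .272 | .277 |
| 2017 Standard | .284 | .321 | .479 | .286 | .335 |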

I don’t have much to say about Didi, other than I believe the xStats are probably accurate. I don’t know if there is a single ballplayer in MLB who is taking better advantage of his home ballpark. Didi is hitting boatloads of extremely low probability home runs (we’re talking 10 percenters) that are just barely clearing the short porch in Yankee Stadium. Perhaps you could say the difference in Didi’s xStats stems from Yankee Stadium park effects, but at the end of the day Didi is not producing contact commensurate with his game production.

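| Line | AVG | OBP | SLG | BABIP | wOBA |
| --- | --- | --- | --- | --- | --- |
| 2016 Standard | .255 | .306 | .427 | .271 | .311 |
| 2016 xStats | .268 | .344 | .437 | .292 | .324 |
| 2017 Standard | .230 | .281 | .409 | .234 | .292 |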

Maikel Franco had an absolutely terrible 2017 season, a year in which he was expected to take a step forward and cement his place on the rebuilding Phillies roster. Franco appeared to be playing through a few injuries, namely a knee injury he suffered early in the season. He also suffered a wrist injury later in the year, but that was long after his performance had tanked.

xStats saw Franco as a guy who was slowly marching forward in terms of value, but he went in a very different direction. Hopefully his weak performance was the result of minor injuries and he can recover to full strength for next season.

More players xStats missed:

Kendrys Morales, T. J. Rivera, Martin Maldonado, Byron Buxton, Mark Reynolds, Alex Bregman, Cesar Hernandez, Brett Gardner, Charlie Blackmon, Tim Beckham, and Tyler Flowers.

Disappointment, Changes, Optimism

Generally speaking, projection systems use three or more years’ worth of data for each player, comparing that data to trends observed throughout the history of the game, be it recent history or otherwise. With xStats, I haven’t had this luxury, since it has been built on the back of a new technology that debuted in 2015.

Some people (most people) probably would have waited 3-5 years to gather data before publishing this sort of information, but personally I preferred to put it out there as quickly as possible. With that said, the two year “predictions” are a bit disappointing. I was hoping they would be more accurate. But they were better than nothing. Generally speaking, they offered insight that the standard stats did not, and that is a good thing.

Going forward, though, there is a lot of room for growth. Sample size remains a problem, not in terms of batters but in terms of overall batted balls. Three hundred thousand batted balls may seem like a lot, but the number feels smaller and smaller the more you dig into specific scenarios, such as how to determine whether balls will clear a certain wall in a given ballpark, or how to treat super high launch angle batted balls. There are some batted balls that are so odd that they may only have 5 similarly hit balls. With more data you can start to dig into more and more specialized areas, such as park effects.

Speaking of park effects, they were a big source of improvement during the course of this 2017 season. Earlier in the season I moved to an 8 point park effect system: singles, doubles, triples, and home runs, each split by left handed and right handed batters. In recent weeks I began shifting towards a horizontal angle based park effect. For now, I am using large arcs across the field, but as the dataset grows I hope to move towards smaller and smaller arcs. This is especially important at Fenway Park, for example, where the Green Monster plays a large role in game performance.
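To illustrate the horizontal angle idea, here is a minimal sketch of an arc based park factor lookup. Everything in it is an assumption made for illustration: the three arcs and their boundaries, the sign convention (negative angles toward left field), the park name, the factor value, and the choice to apply the factor as a simple multiplier on a league average success rate. It is not the xStats code.

```python
from bisect import bisect_right

# Hypothetical arc boundaries in degrees of horizontal angle, where -45 is
# the left field line and +45 is the right field line (assumed convention).
ARC_EDGES = [-45.0, -15.0, 15.0, 45.0]
ARC_NAMES = ["left", "center", "right"]


def arc_for_angle(horizontal_angle):
    """Map a fair ball's horizontal angle to one of the coarse arcs."""
    if not ARC_EDGES[0] <= horizontal_angle <= ARC_EDGES[-1]:
        raise ValueError("angle is outside fair territory")
    index = bisect_right(ARC_EDGES, horizontal_angle) - 1
    # A ball hit exactly down the right field line belongs to the last arc.
    return ARC_NAMES[min(index, len(ARC_NAMES) - 1)]


def park_adjusted_rate(league_rate, park, horizontal_angle, hit_type, factors):
    """Scale a league average success rate by the factor for the ball's arc.

    Arcs with no entry in `factors` fall back to a neutral factor of 1.0.
    """
    key = (park, arc_for_angle(horizontal_angle), hit_type)
    return min(1.0, league_rate * factors.get(key, 1.0))


# Made-up factor: doubles to the left arc are 30% more likely in this park.
example_factors = {("example_park", "left", "double"): 1.30}

print(park_adjusted_rate(0.20, "example_park", -30.0, "double", example_factors))
print(park_adjusted_rate(0.20, "example_park", 10.0, "double", example_factors))
```

Moving to smaller arcs would only require adding more entries to the boundary list, but each arc then needs enough batted balls behind it to estimate a stable factor, which is where the sample size concern comes back in.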

Converting the average MLB success rates into a fully park adjusted success rate is one of the biggest challenges I am currently facing. Additionally, I still need to develop a better way to handle foot speed and I want to include expected Runs Scored and RBI totals. Hopefully I can address the runs and RBI problems during the offseason.

I hope you found value in these stats over the course of the season, and I’m doing everything I can to increase their value going forward.

Andrew Perpetua is the creator of and, and plays around with Statcast data for fun. Follow him on Twitter @AndrewPerpetua.

