ERA-FIP, and the Importance of Situational Context

I like a lot of pitchers who have underperformed this year. With strikeout and walk rates (K%, BB%) of 20.5 percent and 7.0 percent, respectively, Drew Hutchison delivers everything I want from a mid-rotation fantasy starter. With a 5.19 ERA and a 1.47 WHIP, however, he delivers a flaming bag of feces to my doorstep.

The same can be said for Taijuan Walker who, after a terribly rough start to the season, dazzled for seven straight starts before recently tossing three stinkers. With plate discipline ratios better than Hutchison’s and just 22 years old, Walker demonstrates the skill set and ceiling that have earned him consensus top-20 honors on prospect lists from 2012 through 2014. Yet his 5.06 ERA and 1.29 WHIP have left fantasy owners not only disappointed but also reeling.

Hutchison and Walker share a common trait: their ERAs dwarf their fielding independent pitching (FIP) statistics. FIP was designed to demonstrate a pitcher’s true performance in light of the events he can control — that is, events independent of balls put into play at the mercy of the defense supporting him (among other things).

Meanwhile, I’ve heard (or read) analysts say (write) things like “so-and-so seems like he consistently outperforms his FIP.” It’s easy to accept this as truth and not give it a second thought, especially if the ERA of the pitcher in question truly has outperformed his FIP or xFIP year after year.

Still, I hesitate to accept it as a universal truth, so I set out to do some research.

Disclaimer

There exists online a vast literature chronicling the different ways we should capture a pitcher’s true performance. From Deserved Run Average (DRA) to Contextual FIP (cFIP), sabermetricians are (or, perhaps, Jonathan Judge by himself is) blazing new trails regarding pitcher evaluation. This piece of research is far less painstaking, but it also investigates why what happened doesn’t align with what should have happened rather than better ways to describe what should have happened.

In other words: Sorry if this has been done already. It’s new to me, and I hope it’s new to at least one person reading this, too.

Hypothesis

A pitcher will post an ERA better than his FIP or xFIP if he pitches objectively better with men in scoring position than he does with the bases empty. I measure pitching “objectively better” using xFIP, which rewards strikeout rates and punishes walk and fly ball (FB%) rates. Basically, it aggregates several variables into one easily digestible one.
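For reference, here is a minimal Python sketch of how FIP and xFIP are usually computed from counting stats, following the standard FanGraphs definitions as I understand them. The constant (3.10) and the league HR/FB rate (10.5 percent) are placeholder assumptions that change every season, and the function and parameter names are mine, purely for illustration.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Fielding independent pitching from raw counting stats.

    `constant` is a placeholder; the real value is set per season so that
    league-average FIP matches league-average ERA.
    """
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


def xfip(fb, bb, hbp, k, ip, lg_hr_per_fb=0.105, constant=3.10):
    """Same as FIP, but actual home runs are replaced by expected home runs:
    fly balls allowed times the league-average HR/FB rate (placeholder default).
    """
    expected_hr = fb * lg_hr_per_fb
    return (13 * expected_hr + 3 * (bb + hbp) - 2 * k) / ip + constant
```

Holding strikeouts, walks and fly balls fixed, a pitcher who allows fewer home runs lowers his FIP but leaves his xFIP untouched, which is the whole distinction between the two.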

Data

The data spans 477 pitcher-seasons across five years (2010 through 2014), each from a pitcher who qualified for the ERA title in the season in question. Starting pitchers populate the data primarily, but since I forgot to filter strictly for starters, some relievers may have slipped into the mix as well.

Some Observations and Stuff

Using the data set described above, allow me to list a series of facts, without personal interjection, that will help illuminate the fundamental predicament and validate my methodology.

  • A pitcher’s ERA correlates weakly from year to year (R = .287).
  • FIP correlates pretty strongly from year to year (R = .519).
  • xFIP correlates even more strongly than FIP (R = .628).
  • With that said, a pitcher’s xFIP with the bases empty doesn’t correlate as strongly as overall xFIP, but it still tracks better than overall FIP (R = .569).
  • xFIP with men in scoring position correlates only moderately well (R = .433).
  • The sample demonstrates mean xFIPs of 3.74 with bases empty and 4.13 with men in scoring position, indicating that pitchers’ skills, on average, tend to deteriorate slightly under pressure.
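The year-to-year correlations above can be sketched as follows. This assumes each pitcher-season is paired with the same pitcher's following season, which is one common way to do it rather than a description of the exact procedure used here. The rows are made-up placeholders, not the actual data set, and the function names are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def year_to_year_pairs(rows):
    """Pair each (pitcher, season) value with the same pitcher's next season.

    `rows` is an iterable of (pitcher, season, value) tuples. Seasons without
    a qualifying follow-up season are dropped.
    """
    by_key = {(pitcher, season): value for pitcher, season, value in rows}
    return [
        (value, by_key[(pitcher, season + 1)])
        for (pitcher, season), value in by_key.items()
        if (pitcher, season + 1) in by_key
    ]


# Hypothetical xFIP values for illustration only.
rows = [
    ("A", 2010, 3.50), ("A", 2011, 3.60), ("A", 2012, 3.40),
    ("B", 2010, 4.20), ("B", 2011, 4.00),
    ("C", 2011, 3.90), ("C", 2013, 4.10),  # no consecutive seasons, so no pair
]
pairs = year_to_year_pairs(rows)
r = pearson_r([first for first, _ in pairs], [second for _, second in pairs])
```

The same two functions work for any of the metrics listed (ERA, FIP, situational xFIP); only the value column changes.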

Methodology

I specify two separate dependent variables:

  1. ERA-FIP, or the difference between a pitcher’s end-of-season ERA and his end-of-season FIP
  2. ERA-xFIP, or the difference between a pitcher’s end-of-season ERA and his end-of-season xFIP

Thus, I capture the magnitude of a pitcher’s over- or under-performance in a particular year in terms of earned runs allowed (or prevented) per nine innings.

I also incorporate my skills-based explanatory variables…

  • xFIP with bases empty (E_xFIP)
  • xFIP with men in scoring position (S_xFIP)

… as well as some luck-based explanatory variables:

  • batting average on balls in play (BABIP) with bases empty (E_BABIP)
  • BABIP with men in scoring position (S_BABIP)
  • ratio of home runs to fly balls (HR/FB) with bases empty (E_HR/FB)
  • HR/FB with men in scoring position (S_HR/FB)

I then run a series of linear regressions before providing you with the goods.

The Goods

I first examine how situational xFIPs correlate with the gap between a pitcher’s end-of-season ERA and his various end-of-season FIPs:

ERA-FIP = α*E_xFIP + β*S_xFIP + ε
ERA-xFIP = α*E_xFIP + β*S_xFIP + ε

Pitiful adjusted R-squared statistics and statistically insignificant S_xFIPs in each model quickly ravage my primary hypothesis. At this point, there appears to be no indication a pitcher’s measurable skills in different situations affect his ERA relative to his FIP.
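For anyone who wants to replicate models of this shape, here is a minimal, dependency-free Python sketch of an ordinary least squares fit with two explanatory variables and an adjusted R-squared. It adds an intercept term, which is an assumption on my part since the equations as written show none, and the helper names (solve_linear, ols_two_regressors) are hypothetical. A proper stats package would also report the significance tests mentioned above.

```python
def solve_linear(a, b):
    """Solve a small square system a @ x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols_two_regressors(y, x1, x2):
    """Fit y = c + a*x1 + b*x2 by least squares.

    Returns (c, a, b, adjusted_r_squared). Needs more than three observations.
    """
    n = len(y)
    design = [[1.0, u, v] for u, v in zip(x1, x2)]
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(3)]
           for i in range(3)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(3)]
    beta = solve_linear(xtx, xty)
    fitted = [sum(coef * val for coef, val in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1 - ss_res / ss_tot
    # Two regressors plus an intercept leave n - 3 degrees of freedom.
    adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - 3)
    return beta[0], beta[1], beta[2], adj_r_squared
```

Here y would be ERA-FIP or ERA-xFIP for each pitcher-season, and x1 and x2 the bases-empty and scoring-position versions of whichever metric is being tested.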

So I turn my attention toward the luck-based situational statistics. One can argue that BABIP and maybe HR/FB, too, are not entirely driven by luck, and I concede the point. However, the evidence that BABIP and HR/FB reflect skill is much stronger for hitters than it is for pitchers; the latter do, for whatever reason, seem to be subject to much more so-called luck in regard to these particular metrics.

Therefore, if a pitcher has trouble controlling the outcomes of balls put into play, smaller situational samples could, and perhaps should, exacerbate the trend.
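To put a rough number on the sample-size point, here is a small Python sketch that treats every ball in play as an independent trial with the same chance of becoming a hit. That is a simplifying assumption, and the ball-in-play counts are hypothetical round numbers rather than figures from the data set.

```python
from math import sqrt


def babip_standard_error(true_babip, balls_in_play):
    """Binomial standard error of an observed BABIP around its true rate."""
    return sqrt(true_babip * (1 - true_babip) / balls_in_play)


# Hypothetical counts: more balls in play with bases empty than with
# men in scoring position over a season.
se_bases_empty = babip_standard_error(0.292, 400)
se_scoring_position = babip_standard_error(0.292, 100)
```

Under these assumptions, cutting the sample to a quarter of its size doubles the standard error (roughly .023 versus .045 here), so a pitcher's BABIP with men in scoring position can wander much further from his true rate on luck alone.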

I specify models similar to the previously stated ones, swapping out the situational xFIPs for situational HR/FBs:

ERA-FIP = α*E_HR/FB + β*S_HR/FB + ε
ERA-xFIP = α*E_HR/FB + β*S_HR/FB + ε

Here, the adjusted R-squared statistics differ markedly. In fact, one appears to validate situational luck whereas the other doesn’t. Alarming as it may seem at first glance, it actually makes perfect sense because that’s exactly how FIP and xFIP differ: the former essentially regards a pitcher’s HR/FB as a controllable skill whereas the latter replaces a pitcher’s HR/FB with the league-average rate, basically assuming it to be luck-based.

Thus, depending on which camp you’re in, pitcher HR/FB rates matter when evaluating pitcher over- or under-performance using xFIP but not when using FIP.

At a crossroads, I turn to the situational BABIPs, substituting them in for the situational HR/FBs:

ERA-FIP = α*E_BABIP + β*S_BABIP + ε
ERA-xFIP = α*E_BABIP + β*S_BABIP + ε

Now we’re getting somewhere. The adjusted R-squared statistics demonstrate strong positive relationships between the dependent and explanatory variables. In other words, situational BABIPs explain a lot of the difference between ERA and the FIPs. (Good band name. Write it down.)

It’s a good time, then, to make some more observations and stuff about the luck-based, rather than skills-based, metrics.

Some More Observations and Stuff

  • A pitcher’s HR/FB with bases empty correlates very poorly from year to year (R = .174).
  • HR/FB with men in scoring position exhibits almost no year-to-year correlation (R = .045).
  • A pitcher’s BABIP with bases empty correlates more poorly than does his HR/FB (R = .103).
  • BABIP with men in scoring position exhibits no correlation whatsoever (R = .006).
  • On average, a pitcher’s HR/FB drops from 10.1 percent with bases empty to 9.2 percent with men in scoring position.
  • On average, a pitcher’s BABIP drops from .292 with bases empty to .280 with men in scoring position. The latter, however, has a much wider range of observed outcomes and a larger standard deviation than the former. This may lend very modest insight as to how defenses, rather than pitchers, perform under pressure.

Conclusions and Limitations

This wasn’t a particularly rigorous exercise, nor did it reveal any shocking hidden truths. In fact, it more or less confirms something I think is rather intuitive: the difference between a pitcher’s ERA and his FIP is driven largely by luck on balls in play — for xFIP, by luck on balls in play and home runs as a ratio to fly balls — not just generally but situationally.

Moreover, and perhaps most importantly: as explained in the previous section, pitchers have little control over when good or bad luck happens mid-game. A pitcher can experience bad luck on balls in play with no men on and suffer a much more tolerable fate (in terms of ERA) than one who experiences the same exact bad luck with men on base.

Also, the models potentially, if not certainly, suffer from omitted variable bias (at the expense of keeping the models simple). For example, a pitcher may exhibit losses in command under pressure — missing spots, grooving pitches — or he may change his sequencing and become more predictable. These factors would likely create observable differences in a pitcher’s BABIPs and HR/FBs by situation.

Still, I would expect to see those factors reflected in the models specifying xFIP as the explanatory variable given it attempts to capture a pitcher’s inherent skill in a particular situation. Yet, for whatever reason, a meaningful relationship eludes us. It’s also worth mentioning I omit one of the primary situations observed by FanGraphs: “men on base,” which sounds like strictly a man on first base. Including this split would likely add a small degree of explanatory power to the models, but not enough for me to regret excluding it.

For anyone who’s wondering: I also specified a model that included both the skills-based and luck-based metrics as explanatory variables. The fit of the model improved but only very slightly, and BABIP still provided almost all of its explanatory power.

Lastly, my work with expected isolated power (xISO) and expected BABIP (xBABIP) leads me to believe that a pitcher can situationally experience large differences in his batted ball profile allowed. If a pitcher consistently allowed more hard contact, fly balls, etc. with men on base, it would undermine his xFIP and reinforce his situational BABIP (and probably HR/FB, too). Thus, batted ball information could strip away some of the luck components inherent to BABIP and HR/FB. However, I can’t confirm or deny any of these aforementioned potential correlations because I haven’t tested them yet. Work for another day, I suppose.

Advice

Honestly, I don’t know if I have any. I want to trust Walker, and everything I know about fielding independent statistics, regression to the mean and all of this here keeps me encouraged about his rest-of-season prospects. But I know there’s a non-zero chance he may, beyond any reasonable explanation, simply continue to allow a .358 BABIP with men on base or allow twice as many home runs per fly ball with men on base (20.4 to 20.8%) as with bases empty (9.3%) despite negligible situational differences in his batted ball profile.

Again, maybe there is a reasonable explanation, but I certainly haven’t provided it here today. A pitcher can only guarantee so much with his fully controllable skill set, and it appears difficult to predict exactly when he will benefit (or suffer) from good (or bad) luck.





Two-time FSWA award winner, including 2018 Baseball Writer of the Year, and 8-time award finalist. Featured in Lindy's magazine (2018, 2019), Rotowire magazine (2021), and Baseball Prospectus (2022, 2023). Biased toward a nicely rolled baseball pant.

16 Comments
Walt
8 years ago

Is it possible that a pitcher like Walker may simply not have good mechanics from the stretch with runners on base, allowing hitters to have an easier time recognizing pitches from him? This would lead to the higher BABIP w/ men on base.

Jeff Zimmerman
8 years ago
Reply to  Walt

The Nolan Ryan Syndrome

Mike
8 years ago
Reply to  Walt

My thoughts exactly.