Fixing xFIP, Pt. 2: SP/RP Splits

Last week, I recommended an improvement for expected fielding independent pitching (xFIP) without dismantling the original FIP framework upon which it was built. FIP describes the relationship between ERA and strikeouts, walks, and home runs allowed; xFIP does the same but attempts to remove the luck component from home runs by multiplying the number of fly balls a pitcher allows by the league-average rate of home runs to fly balls (HR/FB) — the rationale being HR/FB is notoriously fickle to project year to year.
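For reference, here is a minimal sketch of the two metrics in Python. It uses the commonly published FIP weights (13 for home runs, 3 for walks plus hit batters, -2 for strikeouts). The constant of 3.10 and the 10.5% league HR/FB rate are placeholder assumptions, since both shift from season to season, and the function names and sample pitcher line are purely illustrative.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """FIP from raw counts; `constant` is an assumed league-scaling term."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


def xfip(fb, bb, hbp, k, ip, lg_hr_per_fb=0.105, constant=3.10):
    """xFIP: swap actual home runs for fly balls times the league HR/FB rate."""
    expected_hr = fb * lg_hr_per_fb
    return fip(expected_hr, bb, hbp, k, ip, constant)


# Hypothetical pitcher line, not a real player.
line = dict(bb=50, hbp=5, k=200, ip=180.0)
print(round(fip(hr=20, **line), 2))
print(round(xfip(fb=160, **line), 2))
```

The only difference between the two functions is the home run input, which is the piece both recommendations in this series target.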

The recommendation: change HR/FB to include line drives (LDs) and exclude infield fly balls (IFFBs, aka pop-ups). It’s worth noting our dark overlord David Appelman once explained how removing pop-ups from aggregate fly balls insignificantly affects xFIP. Additionally, less than 1% of line drives result in home runs. The recommendation, then, seems like the merging of two separate but equally fruitless endeavors, given the facts.

Yet changing the HR/FB component in xFIP to be “HR/(oFB + LD)” (where oFB is outfield fly balls, i.e., fly balls minus pop-ups) substantially improved the metric’s correlation with same-year ERA. Adjusted r², which measures the strength of the relationship from 0 to 1, increased from 0.42 to 0.55 using Statcast data (0.44 to 0.53 using FanGraphs data). I hypothesize that, when added to fly balls, line drives (despite resulting in very few home runs) give a more holistic indication of the average contact quality and launch angle a pitcher allows.
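One plausible reading of the HR/(oFB + LD) change, sketched in Python: take outfield fly balls as FB minus IFFB, add line drives, and multiply by the matching league-wide rate. The league totals and pitcher counts below are made up for illustration, and the function names are hypothetical. The actual regressions may treat this component differently.

```python
def expected_hr_original(fb, lg_hr, lg_fb):
    """Original xFIP input: fly balls times league HR/FB."""
    return fb * (lg_hr / lg_fb)


def expected_hr_modified(fb, iffb, ld, lg_hr, lg_fb, lg_iffb, lg_ld):
    """Modified input: (oFB + LD) times league HR/(oFB + LD), with oFB = FB - IFFB."""
    lg_rate = lg_hr / ((lg_fb - lg_iffb) + lg_ld)
    return ((fb - iffb) + ld) * lg_rate


# Made-up league totals and pitcher counts, for illustration only.
league = dict(lg_hr=5000, lg_fb=40000, lg_iffb=4000, lg_ld=25000)
print(expected_hr_original(150, league["lg_hr"], league["lg_fb"]))
print(expected_hr_modified(fb=150, iffb=20, ld=90, **league))
```

Two pitchers with the same fly ball total can get different expected home run counts under the modified version if one allows more pop-ups or fewer line drives.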

Today’s recommendation: account for start/relief splits.

Although I thought of this independently, the idea itself is far from original. In fact, when Matt Swartz developed SIERA, he explicitly included a variable for the percentage of innings pitched as a starter (SP%).

It’s no accident Swartz did this. Pitcher performance behaves differently in start and relief roles. Consider this summary of starter and reliever performance the last decade:

SP vs RP (2009-18)
Split   ERA    FIP    xFIP   HR/FB
SP      4.18   4.14   4.09   11.4%
RP      3.85   3.92   4.01   10.5%

In aggregate, FIP and xFIP should equal ERA. The differences between ERA and FIP/xFIP above imply either or both of the following:

  • FIP fails to distinguish what appear to be distinct impacts related to strikeouts, walks, and home runs for each split; and
  • Because starter FIP exceeds starter xFIP, it stands to reason xFIP underestimates HR/FB for starters (and vice versa for relievers) — which is confirmed by the table above.
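The gaps implied by the SP vs RP table can be computed directly. This is plain arithmetic on the published figures; only the variable names are mine.

```python
# ERA, FIP, and xFIP from the SP vs RP (2009-18) table.
splits = {
    "SP": {"ERA": 4.18, "FIP": 4.14, "xFIP": 4.09},
    "RP": {"ERA": 3.85, "FIP": 3.92, "xFIP": 4.01},
}

# Positive gap: the estimator sits below actual ERA for that role.
gaps = {
    role: {m: round(v["ERA"] - v[m], 2) for m in ("FIP", "xFIP")}
    for role, v in splits.items()
}

for role, g in gaps.items():
    print(role, g)
```

Starter ERA runs 0.04 above FIP and 0.09 above xFIP, while reliever ERA runs 0.07 below FIP and 0.16 below xFIP. The gaps point in opposite directions for the two roles and are larger for xFIP in both cases.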

Before diving in further, I was curious to know if higher HR/FB rates for starters are the product of penalties incurred the second and third times through the order (TTO). While there are distinctly defined penalties per TTO, starters still exhibited higher HR/FB rates than relievers no matter the scenario…:

SP TTO (2009-18)
TTO       ERA    FIP    xFIP   HR/FB
1st       3.40   3.86   3.88   11.0%
2nd       4.17   4.16   4.10   11.5%
3rd-4th   5.23   4.48   4.36   11.8%

… reinforcing the notion that HR/FB, quite simply, behaves fundamentally differently for relievers.

There are a couple of potential approaches for accommodating this difference, one of which is to create separate FIP (and xFIP) equations for starters and relievers. This is OK in theory but becomes understandably messy (if you’re calculating by hand, at least) for pitchers in hybrid roles. A computer could quickly compute an average FIP value weighted by innings thrown as a starter and in relief, but the idea of doing so still seems disagreeable to me.
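A minimal sketch of that weighted-average idea, assuming the two role-specific FIP values have already been computed from separate starter and reliever equations (which this post does not derive). The function name and the swingman's numbers are hypothetical.

```python
def weighted_fip(fip_sp, ip_sp, fip_rp, ip_rp):
    """Innings-weighted average of a starter-equation FIP and a reliever-equation FIP."""
    total_ip = ip_sp + ip_rp
    if total_ip == 0:
        raise ValueError("pitcher has no innings")
    return (fip_sp * ip_sp + fip_rp * ip_rp) / total_ip


# Hypothetical swingman: 60 IP as a starter, 40 IP in relief.
print(round(weighted_fip(fip_sp=4.30, ip_sp=60.0, fip_rp=3.60, ip_rp=40.0), 2))
```

Innings should be passed as true decimals (60 1/3 innings as roughly 60.33, not the box-score shorthand 60.1).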

Instead, I opted for Swartz’s approach: add a variable that calculates the percentage of innings pitched as a starter (SP%). Technically, this compromises the integrity of FIP, arguably more so than a weighted-average FIP, because it adds an entirely new variable. However, this new variable acts more like a constant term than anything else, given most pitchers (especially fantasy-relevant ones) pitch exclusively in the rotation or the bullpen.

Using FanGraphs data for all pitchers who threw at least 60 innings in a season from 2017-18 (n = 547), I specified six regressions — two for FIP, four for xFIP:

  • FIP
    1. The original equation
    2. The original equation, plus SP%
  • xFIP
    1. The original equation
    2. The original equation, plus SP%
    3. The original equation, but with (oFB + LD) instead of FB
    4. The original equation, plus SP%, but with (oFB + LD) instead of FB
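To make the comparison concrete, here is a rough sketch of fitting one of these regression pairs and computing adjusted r², using only the Python standard library. The data are synthetic: the per-inning rates, the noise level, and the 0.3-run starter effect are all invented for illustration and are not the FanGraphs sample. The `ols` and `adjusted_r2` helpers are my own, not the author's code.

```python
import random


def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X)b = X'y."""
    m = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(m)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(m)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(m):
        pivot = max(range(c, m), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(m):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][m] / A[i][i] for i in range(m)]


def adjusted_r2(X, y, beta):
    """Adjusted r-squared; X includes the intercept column, so p = len(beta) - 1."""
    n, p = len(y), len(beta) - 1
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - (ss_res / (n - p - 1)) / (ss_tot / (n - 1))


# Synthetic data only: per-inning rates plus a made-up 0.3-run starter effect.
random.seed(0)
rows, era = [], []
for _ in range(500):
    k = random.uniform(0.6, 1.3)
    bb = random.uniform(0.2, 0.5)
    hr = random.uniform(0.05, 0.2)
    sp = random.choice([0.0, 1.0])
    rows.append((k, bb, hr, sp))
    era.append(3.0 - 2 * k + 3 * bb + 13 * hr + 0.3 * sp + random.gauss(0, 0.5))

X_base = [[1.0, k, bb, hr] for k, bb, hr, _ in rows]
X_sp = [[1.0, k, bb, hr, sp] for k, bb, hr, sp in rows]
beta_base, beta_sp = ols(X_base, era), ols(X_sp, era)
print(round(adjusted_r2(X_base, era, beta_base), 3))
print(round(adjusted_r2(X_sp, era, beta_sp), 3), round(beta_sp[-1], 2))
```

Adjusted r² is used rather than plain r² because it penalizes each added variable, so an extra term like SP% only raises it if it explains more than noise would.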

The adjusted r² values are summarized below. FIP shows a small improvement, and xFIP shows more substantial ones:

Adjusted r² Values
Metric   Original   SP%    oFB + LD   Both
FIP      0.66       0.67   n/a        n/a
xFIP     0.46       0.50   0.56       0.58

Despite a small increase in the goodness of fit for FIP, the SP% variable itself appears to be a statistically significant addition to the model, with a coefficient of 0.29. That means, if given two pitchers — one a full-time starter, the other a full-time reliever — with perfectly identical rates of strikeouts, walks and hits by pitch, and home runs allowed per inning, you can expect the starter to have an ERA roughly three-tenths of a run higher than the reliever. In other words: FIP very slightly overestimates the talent of starting pitchers (and vice versa for relievers).
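The interpretation of that coefficient can be sketched as simple arithmetic. This assumes SP% enters the model as a fraction between 0 and 1, which is what the full-time starter versus full-time reliever comparison implies. The function name and the swingman example are hypothetical.

```python
SP_COEF = 0.29  # SP% coefficient reported for the FIP regression


def role_gap(sp_pct_a, sp_pct_b, coef=SP_COEF):
    """Expected ERA difference (a minus b) from the SP% term alone, other inputs equal."""
    return coef * (sp_pct_a - sp_pct_b)


print(round(role_gap(1.0, 0.0), 2))   # full-time starter vs. full-time reliever
print(round(role_gap(0.4, 0.0), 3))   # hypothetical swingman: 40% of innings as a starter
```

For a swingman who throws 40% of his innings as a starter, the same term implies a gap of about 0.12 runs relative to a pure reliever with identical peripherals.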

For xFIP, last week’s recommendation (to include line drives and exclude pop-ups) was more fruitful (adjusted r² = 0.56) than today’s recommendation (0.50). However, both produce visible improvements in xFIP’s correlation with same-year ERA, with a combination of both recommendations producing the best goodness of fit (0.58). Using the same hypothetical as before, you could expect a starter’s ERA to be more than half a run higher than a reliever’s ERA. Including the recommended change to HR/FB, the difference falls to something closer to four-tenths of a run (because more weight is attributed to the new HR/(oFB + LD) variable).

All told, this recommendation (to include an SP% variable in xFIP) is not necessarily the best way to account for differences in starter and reliever performance. It very well may be that a weighted-average approach, with two separate FIP equations for starters and relievers, is the best solution. Doing so would produce different coefficients for every variable in the equation. Intuitively, this makes the most sense (to me, at least). For example, if it’s true that a walk or hit by pitch is less harmful to a reliever than a starter, then each model’s unique coefficients would reflect this.

Ultimately, this is less about what we should do with FIP or xFIP than about how we should interpret them. Honestly, I could have ended this post after the first table once I made it clear that starter and reliever performance is not the same. It helps to show with rigor that it’s true, but more than anything, just keep in mind that FIP slightly overestimates starter talent and underestimates reliever talent, all else equal.

As an aside: one can argue that if FIP is slightly off in its descriptions of starter and reliever talent, then FIP-based wins above replacement (WAR) might be mischaracterizing pitcher value (in favor of starters, who should have higher FIPs and, thus, lower WARs).





Two-time FSWA award winner, including 2018 Baseball Writer of the Year, and 8-time award finalist. Featured in Lindy's magazine (2018, 2019), Rotowire magazine (2021), and Baseball Prospectus (2022, 2023). Biased toward a nicely rolled baseball pant.

16 Comments
Jimmy von Albade
4 years ago

Re your last point: Couldn’t one also argue that that should be baked into the replacement level of a reliever or starting pitcher, at least to some extent, thereby offsetting any ERA-FIP differences increasing starter value?

Jimmy von Albade
4 years ago

Yeah I believe it’s baked into the bbref version but I’m not sure about this one. Either way, I would assume you’re right that it’s pretty much irrelevant to gauging player value.