There is a lot of data in baseball. Last season, there were 184,872 plate appearances. Each one of those plate appearances was made up of as many as 16 pitches. Add to that everything that happens once a ball is put in play, from batted ball speeds and locations to baserunner and fielder movements, and all the different splits you can dream up of seasonal data. It can quickly overwhelm. That is why databases are critical for rigorous research, especially if you want to study trends across multiple seasons or even decades.
But sometimes, a spreadsheet is enough. Perhaps you lack the SQL expertise to quickly solve a problem that you could solve quickly in Excel. Perhaps you lack the access (wink wink) to a database of baseball statistics, and it simply isn’t worth your time to create that infrastructure when there are ready alternatives. Whatever the case, the reality is that modern spreadsheets provide the tools you need to do some sophisticated number-crunching as long as performance is not the bottleneck of the task at hand. And as long as you’re willing to get a bit creative with the built-in functions.
The reason I make the last point is that likely many of you have fallen down the VLOOKUP, INDEX, and MATCH well where you want to apply multiple filters to a dataset. It is simple enough in concept, but the more rules you apply, the more unwieldy your nest of functions becomes. The formula bar is an effective display for simple tasks, but it becomes awfully difficult to discern which function your close-parenthesis closes when you have several dozen of them and your formula has multiple line breaks.
Fortunately, Excel has tonic for just that situation. It is called SUMPRODUCT. As it was originally intended, SUMPRODUCT simply multiplies the values of one array by the corresponding values of a second, equal-sized array, and adds them together. Here is a basic example:
That may make it seem like SUMPRODUCT has a very specialized purpose, but I assure you that SUMPRODUCT can solve absolutely any lookup or splitting problem more simply and more elegantly than any other combination of lookup and reference functions.
SUMPRODUCT derives its vast power from a simple little feature that Excel has that I wish every spreadsheet had, which is that Excel interprets TRUE and FALSE values as 1 and 0, respectively, when arithmetic is applied to them. Fire up the program, type =TRUE into a cell and then multiply that cell by a number. You will return that number in your new cell.
The reason that quirk is particularly powerful when combined with SUMPRODUCT is that it allows you to apply logical operators to every value in a column in order to select only the values that meet your criteria. Since FALSE values present as 0s and since SUMPRODUCT multiplies all the values of a given row together, any cell that evaluates as FALSE based on the logic you apply then cascades through the rest of that row, eliminating that row from your result set. That also makes it really easy to apply multiple criteria to multiple columns in a single step.
It is difficult to succinctly explain what happens in words, so let’s return to the previous example. Pretend that, for some unknown reason, I wanted to add up all of the values in Column B, but only if their corresponding values in Column A were greater than 10. Here is how SUMPRODUCT could solve that problem:
The machine evaluates this problem this way. In cell A1, it asks, is 5 greater than 10? Since the answer is no, it evaluates to FALSE. That FALSE then becomes a 0 when it is multiplied by the value in B1, 8, and so that row returns 0. The same happens for each subsequent row. In A2, is 10 greater than 10? No, so the row returns 0. In A3, is 15 greater than 10? Yes, so TRUE becomes 1 times 24, which equals 24. In A4, is 20 greater than 10? Yes, so TRUE becomes 1 times 32, which equals 32. Add 0, 0, 24, and 32, and the result is 56. That is what cell D1 displays.
There are two important things to note about the functionality of SUMPRODUCT when it uses logical statements. First is that any array that has logic applied to it must be in its own set of parentheses. Second is that you need to combine arrays with logic applied to them with value arrays using an asterisk, the multiplication sign in Excel, rather than with commas. Without diving too deep into the inner-workings, the reason for that is so the Boolean castings as numbers happens before the function can throw a fit because it is reading a value in an unexpected format. If you forget either the parentheses or the asterisk, your SUMPRODUCT will probably return either a 0 or a #VALUE.
Now let’s apply this method to some actual fantasy problems. To start with, I downloaded the ZiPS projections from our Projections pages. For demonstration purposes, I grabbed each hitter position separately and then added in a column where I wrote each position as abbreviated text (e.g. C for catcher, 1B for first baseman). Here is the top of that list sorted by WAR:
I’ll start simple. If I wanted to know how many players were projected to hit at least 30 home runs, I would type this into an open cell:
The answer is 12. If I wanted to know how many first basemen were projected to hit 30 or more home runs, I would type (remember to add double quotes around text cells):
The answer is 5. If I wanted to know how many corner infielders were projected to hit 30 or more home runs, I would type:
The answer is 7. If I wanted to know how many players were projected to hit at least 15 home runs, steal at least 15 bases, and hit at least .290, I would type:
The answer is 2651.
SUMPRODUCT is also a simple way to marry two data sets that may not be easily combined through sorting (e.g. if one set had more records than another or had multiple copies of some records). To illustrate, I’ve pulled in the final statistics for all hitters from 2013 into a second worksheet titled 2013 (the first sheet is named 2014 ZiPS). Here is what the top of that list looks like, also sorted by WAR:
If you are using multiple leaderboards from FanGraphs, you’ll want to rely on the playerid column to join them. However, in a pinch, any unique column (or combination of columns that then create a unique identifier) will work. Just be very careful with names because, sometimes, players share a name (looking at you Chris Carpenters and Ryan Brauns).
As a proof of concept, I’ll start by adding a 2013 BABIP column to my 2014 ZiPS worksheet. In cell Z2 (the first row of the first open column of the sheet), I would type:
For Mike Trout, the top player on the list, the formula returns .376, which matches his listed BABIP in the 2013 sheet. Then, I simply copy that formula down the column for the rest of the rows to fill in everyone’s BABIP from last season.
As a quick tangent: when you do this type of work which requires copying formulas down columns or across rows, pay close attention to your absolute and relative cell references. By default, all references in Excel are relative, so if I copy a cell that refers to cell A1 in a formula to a new cell one column to the right, then the formula in that second cell will now refer to cell B1. If I wanted a formula to always refer to a specific column even if I copied the formula to multiple cells, I would need to set an absolute reference to that column (e.g. $A1). You can do the same for a row (e.g. A$1) or both the row and the column (e.g. $A$1). This is explained in more detail here.
In order to bring in every player’s BABIP from the 2013 sheet, I have to use a combination of both absolute and relative cell references. Since I want each player’s playerid in the 2014 ZiPS sheet compared to the entire list of playerids in the 2013 sheet, I need to lock the column range from 2 to 956. In contrast, I want that comparison made for each row individually, so I need a relative reference on the logical comparison (Y2).
Tangent number two: the absolute column references before the Vs and Ys in the formula and the relative column references before the Ls make it really easy to copy that formula to new rows to retrieve the other stats. If I copy the formula for Mike Trout’s BABIP in cell Z2 into cell AA2, I see .323, which was his batting average from 2013 (the column to the right of BABIP in the 2013 sheet).
Combining those two applications of SUMPRODUCT allows you to combine data in countless arrangements. As an example, I’ve made a third sheet called HR Totals by Pos and set up the following framework:
On the Y-Axis, I have each position. On the X-Axis, I have various home run thresholds. To fill in the data, I would type the following in cell B2:
=SUMPRODUCT((‘2014 ZiPS’!$B$2:$B$1032=$A2)*(‘2014 ZiPS’!$I$2:$I$1032>=B$1)*1)
Then, you can simply copy the formula to all of the cells to create a grid of players projected to reach each home run threshold split by position:
For anyone who is restricted to only freely available software, SUMPRODUCT with logic will not work in Google spreadsheets as it does in Excel. However, there is an effective alternative, so long as you do not mind dipping your toes in database waters.
I copied the 2014 ZiPS worksheet from my Excel example into a Google doc (note that as I selected only the visible rows to be copied, the column labels changed from A to Z to A to J). To run one of my previous examples, which is to find out how many first basemen are projected to hit at least 30 home runs, I will rely on the QUERY command, which uses their Visualization API:
=QUERY(A2:J, “SELECT COUNT(A) WHERE B = ‘1B’ AND C >= 30”)
The first argument establishes a range as your dataset—and it’s smart enough to find the end of that data, which is why there is no trailing row number after the J. The second argument specifies what data you want returned. It works very much like SQL. The machine interprets this one as provide the count of rows in Column A where the label in column B is 1B and the value in Column C is at least 30.
That returns 5, which is the same answer I got when I used SUMPRODUCT in Excel.
Finally, if you want to dynamically query your dataset based on cell values, you can do that by concatenating those values into your query:
=QUERY(A$2:J, “SELECT COUNT(A) WHERE B='”&L2&”‘ AND C>=30”)
That returns 5 when cell L2 contains 1B:
In my experience, SUMPRODUCT and QUERY can seem prohibitively complicated at first, but if you stick with them for a bit, they will eventually click for you and make your analytical life much easier. If you have questions, I’ll do my best to (eventually) answer all of them in the comments.
Scott Spratt is a fantasy sports writer for FanGraphs and Pro Football Focus. He is a Sloan Sports Conference Research Paper Competition and FSWA award winner. Feel free to ask him questions on Twitter – @Scott_Spratt