This is a first look at something we’ve had in the cooker for a while. The statistical history of minor league baseball is erratically documented. Official averages published by the leagues themselves were often incomplete, when they were published at all, and were not always accurate. For at least the last four decades, researchers have attempted to fill the gap by compiling league summaries and averages from newspaper accounts and other sources.
Both official and researcher-compiled averages share the same problem: the process behind them is completely undocumented. Newspaper boxscores are themselves not always reliable or complete, so compiling any set of averages from them necessarily involves editorial judgment. Further, as Retrosheet has documented well, even in Major League Baseball discrepancies are the rule rather than the exception in historical data.
We have been developing a method for carrying out historical minor league research with a focus on transparency and reproducibility. We’ve piloted it on several leagues, working all the way through the 1911 Central League, and we’re now ready to share some preliminary results on our methodology.
First, we’ve developed an efficient method for capturing boxscore data. Here is a sample of the first few days from the 1911 Central League season, captured from Sporting Life. The key idea is to capture the text of the boxscore in a format that is as close to the original as possible. Further, these files are raw: we capture all data – names, statistics, etc. – as given in the source.
Next, a set of scripts parses these files, matches up player and umpire names, and carries out any data cleaning. From that, we can output any number of potentially useful reports. For example, here is our summary report, which includes a standings analysis, games by position for all players, and player day-by-days. This gives a great overview of how the season evolved. We can also export data into other formats, such as CSV game logs, team performance logs, and player logs.
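To make the matching and export steps concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not the project's actual scripts: the row layout, the alias table, the identifiers, and the function names `match_names` and `to_csv` are all hypothetical, and the names and numbers are made up.

```python
import csv
import io

def match_names(rows, aliases):
    """Attach a canonical player_id to each raw row, keeping the name as printed.

    `aliases` maps (team, name-as-printed) to a canonical identifier.
    Rows with no known alias are returned separately for editorial review
    rather than being guessed at.
    """
    matched, unmatched = [], []
    for row in rows:
        player_id = aliases.get((row["team"], row["name"]))
        if player_id is None:
            unmatched.append(row)
        else:
            # Copy the row so the raw capture itself is never modified.
            matched.append({**row, "player_id": player_id})
    return matched, unmatched

def to_csv(rows, fields):
    """Serialize rows to CSV text with a fixed column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# Made-up sample rows: two spellings of one player, plus an unknown name.
raw_rows = [
    {"team": "Team A", "name": "Smith", "ab": 4, "h": 2},
    {"team": "Team A", "name": "Smyth", "ab": 3, "h": 1},
    {"team": "Team B", "name": "Brown", "ab": 4, "h": 0},
]
aliases = {
    ("Team A", "Smith"): "smith01",
    ("Team A", "Smyth"): "smith01",
}

matched, unmatched = match_names(raw_rows, aliases)
player_log = to_csv(matched, ["team", "name", "player_id", "ab", "h"])
```

Keeping the printed spelling next to the canonical identifier means a reader of the log can still see what the source said, and unmatched names surface as an explicit list instead of being dropped silently.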
An important point is that the process of producing the report and logs linked here is completely scripted, so every derived value can be traced back to its origin. For example, when a team used two or more pitchers, it was often customary to list hits allowed and innings pitched for all but one of the pitchers, as the balance can be inferred from the team totals. The script carries out this inference, but because capturing the original data as-is is kept separate from the script that does the calculation, we always know where a particular datum came from. Similarly, if a correction is made to a boxscore, we can separate which data are from the source and which are editorial judgments.
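The pitcher inference described above can be sketched as follows. This is a minimal illustration assuming a simple record per pitcher with an `origin` tag; the `PitcherLine` type, the `infer_balance` function, and the sample names and numbers are hypothetical and not taken from the project's scripts.

```python
from dataclasses import dataclass, replace
from fractions import Fraction
from typing import List, Optional

@dataclass(frozen=True)
class PitcherLine:
    name: str
    hits: Optional[int]           # None when the boxscore omits it
    innings: Optional[Fraction]   # exact thirds of an inning
    origin: str = "source"        # "source" or "inferred"

def infer_balance(lines: List[PitcherLine], team_hits: int,
                  team_innings: Fraction) -> List[PitcherLine]:
    """Fill in the one unlisted pitcher from the team totals.

    Returns a new list; the captured lines are left untouched. If the
    boxscore does not have exactly one fully unlisted pitcher, nothing
    is inferred and the lines are returned as given.
    """
    missing = [p for p in lines if p.hits is None and p.innings is None]
    if len(missing) != 1:
        return list(lines)
    target = missing[0]
    listed = [p for p in lines if p is not target]
    if any(p.hits is None or p.innings is None for p in listed):
        return list(lines)  # partially listed lines: leave for manual review
    hits = team_hits - sum(p.hits for p in listed)
    innings = team_innings - sum(p.innings for p in listed)
    if hits < 0 or innings < 0:
        raise ValueError("listed pitcher lines exceed the team totals")
    inferred = replace(target, hits=hits, innings=innings, origin="inferred")
    return [inferred if p is target else p for p in lines]

# Made-up example: the starter is unlisted, the reliever is given as
# 4 hits in 3 1/3 innings, and the team allowed 9 hits over 9 innings.
captured = [
    PitcherLine("Smith", None, None),
    PitcherLine("Jones", 4, Fraction(10, 3)),
]
completed = infer_balance(captured, team_hits=9, team_innings=Fraction(9))
```

Because the inferred line carries `origin="inferred"` and the captured list is never modified, any report built from `completed` can still distinguish figures printed in the source from figures computed by the script.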
This is something we’ll continue to develop over the coming months as part of a larger methodology for systematically documenting the statistical history of baseball. In the interests of supporting transparency and reproducibility, we’ll be making available our files (both the raw data input and the processing scripts) under open-source licenses. Watch this blog (and our Facebook and Twitter feeds) for updates.