As part of our mission to document the history of baseball in tidy datasets, we produce or support the dissemination of many Open Data resources. Below is a list of some of our highlighted datasets. We maintain many of these as repositories on our GitHub page.
The Chadwick Register contains an accounting of people associated with baseball, with basic identifying data and cross-references against several identification systems. It is one of our flagship offerings, and so it has its own page with full information.
The Databank is a compilation of historical Major League Baseball statistical and related data, based on earlier work disseminated by Sean Forman, Sean Lahman, and many other contributors. The current version can be obtained from our git repository. There is a related Yahoo discussion group where you can find out the latest information.
In addition to directly downloading the data, you can use Databank data via:
- The Lahman Baseball Database
- Baseball Trivia HQ
- Baseball Baseline Cherrypicker – a nifty site where you can explore graphically how robust statements like “Jack Morris won more games in the 1980s than anyone else” are to changing the time frame quoted.
Retrosheet is the undisputed king of Open Data projects in baseball. As a service, we provide a git repository which contains all of the downloadable files published by Retrosheet, in one convenient location. Aside from being able to download all Retrosheet files with one click, the repository has other uses:
- Using the repository, you can see what changes Retrosheet makes in each release over time. For example, this page lets you browse the changes between the June 2016 and December 2016 releases, and shows how amazingly chock-full of new information each release is.
- The official Retrosheet files are kept in the branch named official. In addition, we maintain a branch named master, which includes minor corrections to the files that are identified between Retrosheet releases.
There are many interesting things that can be done with Retrosheet’s data. If you want to generate your own datasets based on it, you can use our eponymous Chadwick command-line tools to extract lots of good stuff from it. For convenience, we have created the retrosplits repository, which contains several common types of reports often generated from Retrosheet data:
- Game-by-game records for players and teams, including all statistical categories for each game.
- Various splits derived from play-level data, including batting vs LHP/RHP, batting by runners on base, batting by current fielding position, pitching vs LHP/RHP, pitching by runners on base, and head-to-head matching statistics for batter/pitcher pairs.
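To give a rough sense of what producing a split involves, the sketch below aggregates a few play-level rows into batting totals against left- and right-handed pitchers. It is only an illustration: the column names (batter, pitcher_hand, ab, h), the player IDs, and the sample rows are assumptions made up for this example, and do not reflect the actual layout of the retrosplits files or the output of the Chadwick tools.

```python
import csv
import io
from collections import defaultdict

# Hypothetical play-level data. The column names and rows are illustrative
# assumptions, not the real retrosplits or Chadwick output layout.
SAMPLE = """batter,pitcher_hand,ab,h
smitj001,L,1,1
smitj001,R,1,0
smitj001,L,1,0
doej0001,R,1,1
doej0001,R,0,0
"""


def batting_splits(rows):
    """Sum at-bats and hits for each (batter, pitcher hand) pair."""
    totals = defaultdict(lambda: {"ab": 0, "h": 0})
    for row in rows:
        key = (row["batter"], row["pitcher_hand"])
        totals[key]["ab"] += int(row["ab"])
        totals[key]["h"] += int(row["h"])
    return dict(totals)


splits = batting_splits(csv.DictReader(io.StringIO(SAMPLE)))
for (batter, hand), t in sorted(splits.items()):
    # Guard against a split with no at-bats (e.g. walks only).
    avg = t["h"] / t["ab"] if t["ab"] else None
    print(batter, "vs", hand + "HP", t["ab"], t["h"], avg)
```

The same grouping idea extends to the other splits listed above by changing the key, for example to the base-runner situation or to the batter/pitcher pair.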
Minor League Averages
We are proud to provide the historical minor league statistical data to baseball-reference.com. As part of this, we are developing a comprehensive accounting of pre-1960 performance data for minor leagues. This is a challenging research task, as the documentation of these leagues is often fragmentary, and not all leagues kept records with as much loving attention as we might have wished.
One component of this is our repository of transcriptions of minor league averages. This dataset contains a structured electronic version of the information included in published league averages, including (in due course) those appearing in Guides and newspapers and those compiled by individual researchers.
This repository is being made possible thanks to the stalwart efforts of a number of dedicated minor league buffs who have operated under the direction of Frank Hamilton, with substantial contributions by (alphabetically) Cliff Blau, Art Cantu, and Jim Sarrantonio.
An important health warning: This repository contains direct transcriptions of the published information. The published information is known to contain errors and has various other limitations. This information is therefore just the first step in producing an edited and corrected account of the statistical history of minor league baseball (at least, as accurately as can be reconstructed from records). In due course, we will publish under Open Data licenses every step in that process, from source to the final data that will appear on baseball-reference. (Watch this space!)
Minor League Boxscores
Complementary to the capture and processing of published averages for minor leagues, we also capture some game-level data. Not all leagues have published averages available, and for most leagues prior to 1960, players with limited appearances (less-thans) are generally not included. We want to document everyone who has played professional baseball; in some cases the only way to do this is from data at the level of the individual game. Game-level data is also useful to establish what part of a year a player spent with a particular club, helping to tease apart the records of similarly-named players.
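As a simple illustration of that last point, the sketch below derives the first and last known game date for each player with each club from a list of game-level appearances. The names, clubs, dates, and the tuple layout are all hypothetical, invented for this example; they are not drawn from our boxscore data and do not represent its format.

```python
from datetime import date

# Hypothetical game-level appearances: (player name, club, game date).
# All values here are made up for illustration.
APPEARANCES = [
    ("J. Smith", "Peoria", date(1912, 5, 1)),
    ("J. Smith", "Peoria", date(1912, 6, 15)),
    ("J. Smith", "Dubuque", date(1912, 7, 4)),
    ("J. Smith", "Dubuque", date(1912, 9, 1)),
]


def stints(appearances):
    """Return {(player, club): (first_game, last_game)}."""
    spans = {}
    for player, club, day in appearances:
        first, last = spans.get((player, club), (day, day))
        spans[(player, club)] = (min(first, day), max(last, day))
    return spans


player_stints = stints(APPEARANCES)
for (player, club), (first, last) in sorted(player_stints.items()):
    print(player, club, first.isoformat(), last.isoformat())
```

Date ranges like these can suggest whether two similarly-named entries are one player who changed clubs (non-overlapping spans) or two different players (overlapping spans with different clubs), though any such inference still needs checking against the sources.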
Our repository of minor league boxscores publishes the data we have captured so far. The methodology is similar in spirit to how we approach minor league averages. We have developed a simple, human-readable, text-based format that enables quick capture of the data from a boxscore with no special software required. (We thank Jack Morris for extensively road-testing this format on several leagues; his feedback has been invaluable in refining the process.) A tool extracts the data into a collection of CSV files for further analysis and processing.
The data quality of newspaper boxscores is, of course, even sketchier in most cases than that of published averages. Nevertheless, this information has its uses, as long as it is taken with a sufficient dosage of salt.
Obituaries and Necrology
One of the reasons to have datasets of the history of baseball is to record and remember the contributions of men and women in the game over the years. We in particular note and honour their contributions as they pass on. Thanks to the Baseball Necrology e-group, and with special thanks to Jack Morris, we are maintaining a repository of obituaries and death notices. This includes text or images of the originals where available, as well as biographical data extracted from the original documents into a parseable format. In addition to providing a suitable memorial to mark their passing, this repository helps us improve the biographical information that we report in the Register and which we provide to baseball-reference.com.
Parallel to the obituaries and necrology repository, we also maintain a collection of vital records pertaining to people in baseball history, which we also use to produce and improve our biographical information. We again thank the Baseball Necrology e-group and in particular Jack Morris for their signal contributions to this collection.