This summer we concluded a new licensing agreement with baseball-reference.com. Jointly with our partners at Hidden Game Sports, we supply the site's historical data for major, minor, and international leagues.
As part of this exciting tie-up, we’ve agreed to develop a methodology for documenting the pre-1960 minor leagues. There are some interesting challenges in constructing this dataset. The documentation of minor leagues is quite fragmentary, and some leagues and years are documented more reliably than others. Mix in aliases, incorrectly spelled names, widespread misrepresentation of dates of birth, and just plain confusion, and there is a lot of work to be done. Many researchers have worked on aspects of the historical record for the minor leagues, but to date there has been no systematic integration of this information into a well-edited whole.
Given the magnitude of the task, we believe the only way to make meaningful progress is to follow Open Data principles. Work will need to progress incrementally and iteratively. We need to be able to trace information from its original source through to what eventually appears on sites like baseball-reference, and to integrate new information reliably without losing work that has been done before. The project can only be successful if it draws on the accumulated knowledge of experts. Because there are so many sources, many of which require thoughtful corrections, it is essential to have as many eyes as possible looking at the data. (In software engineering, this is sometimes captured in Linus’s Law, named for Linus Torvalds: “Given enough eyeballs, all bugs are shallow.”)
We have begun this process with two new repositories. (You can see all of our open data repositories on this summary page.) One is our collection of transcriptions of minor league averages. This will host data from league-level averages, both official and researcher-generated, transcribed directly from the originals with a bare minimum of interpretation. Because historically most league averages do not include performance data for all players, we have a complementary collection of captured minor league boxscores. Both datasets are licensed under the Creative Commons Attribution 4.0 International license. We particularly want to thank Frank Hamilton for his work in setting up the structure that will underpin the averages, and Jack Morris for patiently road-testing the boxscore methodology. There are many others to thank as well; credits will be listed alongside the materials in the respective repositories.
It should be emphasised that these are collections of rather raw data from sources. The capture is an important first step in producing the most reliable account of the history of the minor leagues – but indeed it is only the first step. In the coming months, as we acquire a critical mass of data in these repositories, we will roll out the process by which these sources will be combined, errors corrected, players’ careers identified, and demographic information like birth and death dates included.
All of these steps will follow Open Data and Open Source principles. The whole process will be transparent, and, just as importantly, everyone can contribute. There are many useful roles to play and many tasks to do, some of which need only 15 to 30 minutes at a time. There is certainly much transcription to be done. But, as we are making the data and some source code for tools available as well, there are also opportunities to contribute to the data cleaning – or, indeed, to come up with new and interesting views on or uses for the data which we have not yet anticipated.
This process will take some time to unfold. Retrosheet has been publishing data for two decades now, and still has interesting work to complete. We are certainly hopeful that substantial chunks of data will become available relatively quickly. Equally, though, patience is required. We want to build this project to last.
If you have questions – or want to find out how you can do more to help – please get in touch with Ted Turocy at ted.turocy (aht) gmail (daht) com.