2016: not so bad a year for baseball passings

Much has been made of the apparently large number of celebrity passings during 2016. We thought we’d have a look back at 2016 in MLB, and see how the year stacked up.  Short answer: In terms of former MLB players who have gotten their next call-up, 2016 was a rather typical year.

As of this writing, 84 former MLB players have been identified as having passed away during the 2016 calendar year. This figure is surely not quite complete, as reports do trickle in over time, but experience of previous years suggests we can make a good estimate two weeks into 2017. This figure compares to 88 each in 2015 and 2014. The last year in which 100 former MLBers passed away was 2011.

These are rather typical figures for recent years. In fact, the bad years for MLB passings were the 1960s: every year from 1958 through 1974 saw at least 100, peaking at 142 in 1969 and 140 in 1970.

A corollary of this is that the number of people living today who played in MLB is larger than ever. Set against last year’s death toll of 84 are the 258 players who saw their first MLB action. The last years that did not see an increase in the living MLB fraternity were 1976 and 1970 (in each of which debuts exactly equalled known deaths). The worst single year was 1959, in which the 91 debutants were overshadowed by the 119 passings, for a net population decrease of 28.

We should note here that there are currently 174 players who debuted before 1911 for whom no year of death is yet known. Filtering them out, we can estimate that the current average age of living people who have appeared in MLB is around 49.5. (I have computed this crudely as 2016 – average birth year + 1/2, which should be good enough for present purposes.) That is up from about 49.1 in 2010 and 48.6 in 2000. But in 1990 it was around 49.0 and in 1960 48.4; you have to go back to, for example, 1940, when it was 46.9, or 1930 (44.1) to see a substantial shift.
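The crude estimate described above can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical and the birth years are made-up placeholders, not figures from the register. The half-year term is read here as allowing for birthdays being spread through the year, which is an assumption about the intent.

```python
from statistics import mean


def crude_average_age(birth_years, as_of_year):
    """Estimate average age as: as_of_year - mean(birth year) + 1/2.

    The +1/2 is assumed to allow for birthdays being spread evenly
    through the calendar year.
    """
    return as_of_year - mean(birth_years) + 0.5


# Placeholder birth years for illustration only (not real data).
sample_birth_years = [1950, 1966, 1972, 1985, 1992]
print(crude_average_age(sample_birth_years, 2016))
```

The same function gives the figure for an earlier year if it is fed the birth years of players who were alive in that year and that year as `as_of_year`.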

The combination of expansion and more fluid rosters has more than compensated for the fact that people live longer. The average age of those who passed in 2016 (which again I have crudely estimated as death year – birth year) was 78.6; it has been over 78 almost every year since 2000, but was 76.6 in 1990, 74.4 in 1980, 69.3 in 1970, 70.4 in 1960, 66.7 in 1950… you get the drift.

However, it seems improbable that this trend of debuts outstripping passings can continue indefinitely; one would not think that too much more expansion would be in the cards any time soon, and likewise increasing the debut count would presumably require changes to roster rules.  But it could be quite some time. There are somewhere around 9449 current and former players who debuted since 1937 for whom we don’t have a death date; of them only about 1348 are over 70, 479 are over 80, and 76 are over 90.  It would be interesting for someone a bit more actuarially inclined to estimate when we might next see deaths matching debuts given the current age profile of MLBers.

Meanwhile, for those who are interested, here are the counts of debuts and (known) deaths of (current or former) players, by year, since 1871:

year debuts deaths net
1871 115 0 115
1872 64 2 62
1873 34 1 33
1874 30 1 29
1875 84 1 83
1876 28 2 26
1877 18 1 17
1878 21 1 20
1879 55 3 52
1880 38 1 37
1881 21 10 11
1882 95 2 93
1883 79 4 75
1884 356 6 350
1885 51 5 46
1886 73 10 63
1887 73 6 67
1888 75 17 58
1889 71 11 60
1890 191 11 180
1891 78 16 62
1892 54 18 36
1893 56 19 37
1894 63 20 43
1895 85 20 65
1896 62 12 50
1897 59 23 36
1898 83 19 64
1899 81 20 61
1900 21 23 -2
1901 131 25 106
1902 138 22 116
1903 98 26 72
1904 92 28 64
1905 114 35 79
1906 109 21 88
1907 99 30 69
1908 124 30 94
1909 158 29 129
1910 155 36 119
1911 188 30 158
1912 221 42 179
1913 201 34 167
1914 235 37 198
1915 178 43 135
1916 100 47 53
1917 93 39 54
1918 76 48 28
1919 99 36 63
1920 137 43 94
1921 119 37 82
1922 137 53 84
1923 149 49 100
1924 123 50 73
1925 116 39 77
1926 96 56 40
1927 108 48 60
1928 112 64 48
1929 109 73 36
1930 98 54 44
1931 102 52 50
1932 93 54 39
1933 70 65 5
1934 113 80 33
1935 115 64 51
1936 101 69 32
1937 110 88 22
1938 109 75 34
1939 127 55 72
1940 104 79 25
1941 120 79 41
1942 110 73 37
1943 146 73 73
1944 153 73 80
1945 124 81 43
1946 107 74 33
1947 98 93 5
1948 114 80 34
1949 92 90 2
1950 105 94 11
1951 110 91 19
1952 114 83 31
1953 100 93 7
1954 121 78 43
1955 137 99 38
1956 89 98 -9
1957 96 82 14
1958 107 101 6
1959 91 119 -28
1960 104 103 1
1961 112 111 1
1962 147 121 26
1963 132 110 22
1964 130 108 22
1965 117 112 5
1966 103 109 -6
1967 115 104 11
1968 108 122 -14
1969 183 142 41
1970 140 140 0
1971 113 102 11
1972 122 100 22
1973 131 108 23
1974 144 110 34
1975 130 91 39
1976 106 106 0
1977 159 102 57
1978 148 90 58
1979 123 94 29
1980 145 117 28
1981 146 90 56
1982 142 100 42
1983 158 85 73
1984 132 88 44
1985 125 74 51
1986 164 99 65
1987 155 72 83
1988 141 83 58
1989 150 80 70
1990 169 84 85
1991 192 95 97
1992 162 76 86
1993 203 106 97
1994 114 101 13
1995 247 81 166
1996 194 100 94
1997 180 97 83
1998 208 76 132
1999 210 79 131
2000 204 78 126
2001 199 92 107
2002 204 98 106
2003 182 91 91
2004 208 86 122
2005 206 79 127
2006 221 94 127
2007 211 80 131
2008 238 83 155
2009 204 79 125
2010 203 96 107
2011 239 100 139
2012 206 78 128
2013 230 95 135
2014 234 88 146
2015 254 88 166
2016 258 84 174
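
For readers who want to work with the table directly, here is a minimal Python sketch that parses rows in the layout above, checks that the net column equals debuts minus deaths, and picks out the years with no growth. Only a handful of rows from the table are included, and the helper name `parse_rows` is hypothetical.

```python
# A few rows copied from the table above, in the same layout.
TABLE = """\
year debuts deaths net
1956 89 98 -9
1959 91 119 -28
1970 140 140 0
1976 106 106 0
2016 258 84 174
"""


def parse_rows(text):
    """Turn whitespace-separated rows into dicts keyed by the header names."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    return [dict(zip(header, map(int, line.split()))) for line in lines[1:]]


rows = parse_rows(TABLE)

# The published net column should equal debuts minus deaths.
assert all(r["net"] == r["debuts"] - r["deaths"] for r in rows)

# Years in which the living MLB population did not grow.
no_growth = [r["year"] for r in rows if r["net"] <= 0]
# The year with the largest net decrease among the rows supplied.
worst = min(rows, key=lambda r: r["net"])

print(no_growth)
print(worst["year"], worst["net"])
```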

Open Data for pre-1960 minor leagues

This summer we concluded a new licensing agreement with baseball-reference.com. Jointly with our partners at Hidden Game Sports, we provide them with their historical data for major, minor, and international leagues.

As part of this exciting tie-up, we’ve agreed to develop a methodology for documenting the pre-1960 minor leagues. There are some interesting challenges in constructing this dataset. The documentation of minor leagues is quite fragmentary, and some leagues and years are documented more reliably than others. Mix in aliases, incorrectly spelled names, widespread misrepresentation of dates of birth, and just plain confusion, and there is a lot of work to be done. Many researchers have done work on aspects of the historical record for the minor leagues, but to date there is no systematic integration of this information into a well-edited whole.

Given the magnitude of the task, we believe the only way to make meaningful progress is to follow Open Data principles. Work will need to progress incrementally and iteratively. We need to be able to trace information from the original source through to what eventually appears on sites like baseball-reference, and to integrate new information reliably without losing work that has been done before. The project can only succeed if it draws on the accumulated knowledge of experts. Because there are so many sources, many of which require thoughtful corrections, it is essential to have as many eyes as possible looking at the data. (In software engineering, this is sometimes captured in Linus’s Law, named for Linus Torvalds: “Given enough eyeballs, all bugs are shallow.”)

We have begun this process with two new repositories. (You can see all of our open data repositories on this summary page.) One is our collection of transcriptions of minor league averages. This will host data from league-level averages, both official and researcher-generated, transcribed directly from the originals with a bare minimum of interpretation. Because historically most league averages do not include performance data for all players, we have a complementary collection of captured minor league boxscores. Both datasets are licensed under the Creative Commons Attribution 4.0 International license. We particularly want to thank Frank Hamilton for his work in setting up the structure that will underpin the averages, and Jack Morris for patiently road-testing the boxscore methodology. There are many others to thank as well; credits will be listed alongside the materials in the respective repositories.

It should be emphasised that these are collections of rather raw data from sources. The capture is an important first step in producing the most reliable account of the history of the minor leagues – but indeed it is only the first step. In the coming months, as we acquire a critical mass of data in these repositories, we will roll out the process by which these sources will be combined, errors corrected, players’ careers identified, and demographic information like birth and death dates included.

All of these steps will follow Open Data and Open Source principles. The whole process will be transparent, and, just as importantly, everyone can contribute. There are many useful roles to be played, and many tasks that can be done, some even if you only have 15-30 minutes free at a time. There is certainly much transcription to be done. But, as we are making the data and some source code for tools available as well, there are opportunities for contributing to the data cleaning – or, indeed, for coming up with new and interesting views on or uses for the data which we have not as yet anticipated.

This process will take some time to unfold. Retrosheet has been publishing data for two decades now, and still has interesting work yet to be completed.  We are certainly hopeful that substantial chunks of data will become available relatively quickly. Equally, though, patience is required. We want to build this project to last.

If you have questions – or want to find out how you can do more to help – please get in touch with Ted Turocy at ted.turocy (aht) gmail (daht) com.

Most consecutive batters retired against one team

Recently on the Retrosheet email list, someone asked: what are the longest streaks of consecutive batters retired by one pitcher against one club? As Retrosheet currently has complete play-by-play back through 1974, we can answer that question reliably that far back. Here is the list of pitchers who have retired 30 or more consecutive batters against one club in that span. (All opponents are referred to by their current team ID, so, e.g., David Cone’s feat was against the team currently known as the Washington Nationals, which was of course the Montréal Expos at the time.)

Pitcher Surname Use name Opponent Length Started Ended
garcf002 Garcia Freddy ANA 39 20060509 20060913
bakes002 Baker Scott KCA 38 20070730 20070831
browt001 Browning Tom LAN 38 19880911 19880916
bosic001 Bosio Chris BOS 35 19930422 19930704
cainm001 Cain Matt HOU 34 20110828 20120715
hernf002 Hernandez Felix TBA 34 20120430 20140512
ryu-h001 Ryu Hyun-jin CIN 34 20130727 20140526
belct001 Belcher Tim NYN 34 19920830 19920904
bassa001 Bass Anthony SFN 33 20110824 20120428
krukm001 Krukow Mike NYN 32 19860522 19860601
sheeb001 Sheets Ben PIT 32 20060725 20060913
rogek001 Rogers Kenny ANA 32 19940624 19950511
downs001 Downs Scott TEX 32 20090608 20110827
myerr001 Myers Randy SFN 32 19870526 19890519
maddg002 Maddux Greg SFN 32 20060813 20060819
montj101 Montague John ANA 31 19770716 19790404
coned001 Cone David WAS 31 19920616 20000605
buehm001 Buehrle Mark TBA 31 20090723 20100421
sancj002 Sanchez Jonathan SDN 31 20090520 20090710
blylb001 Blyleven Bert ANA 31 19850619 19850624
welld001 Wells David MIN 31 19970809 19980811
bradd002 Braden Dallas TBA 31 20100509 20100822
nomoh001 Nomo Hideo TOR 31 20010525 20010531
darvy001 Darvish Yu HOU 30 20120615 20130402
johnr005 Johnson Randy ATL 30 20030815 20060626
remlm001 Remlinger Mike WAS 30 19970917 19970928
westj001 Westbrook Jake DET 30 20030904 20040425
bibbj001 Bibby Jim ATL 30 19810519 19830525

The aforementioned streak by Cone was the longest in terms of time span – he retired 31 consecutive Expos, but over a span from June 1992 through June 2000. The second longest span was by Billy Wagner, who threw a “perfect game” against Phillies batters between 23 August 2001 and 13 June 2006.

One could also ask the inverse question: What are the longest streaks of not getting a batter out, against one team?  The current holder of this dubious distinction is Luke Hudson, who failed to retire 15 consecutive Indians over two outings in 2005 and 2006:

Pitcher Surname Use name Opponent Length Started Ended
hudsl001 Hudson Luke CLE 15 20050625 20060813
hefnj001 Hefner Jeremy PHI 13 20120920 20130410
lemad101 Lemanczyk Dave SEA 13 19780506 19780718
dierl101 Dierker Larry CIN 12 19750924 19760603
albem001 Albers Matt CHN 12 20080625 20120617
romer002 Romero Ricky TBA 12 20120902 20130508
kiled001 Kile Darryl CIN 11 19980410 19990715
hursb001 Hurst Bruce NYA 11 19870929 19880606
welsc001 Welsh Chris PHI 11 19820501 19830621
burkj001 Burkett John LAN 11 19900928 19910428
joneb003 Jones Bobby COL 11 20010927 20020530
jansc001 Janssen Casey BOS 11 20090818 20100426
brewb001 Brewer Billy ATL 11 19970629 19980404
leitm001 Leiter Mark LAN 11 19960417 19960711
raggb001 Raggio Brady MIA 11 19980515 19980515
hawka001 Hawkins Andy BOS 11 19890926 19900605
nomoh001 Nomo Hideo CHN 11 19970812 19980418
nelsg001 Nelson Gene TEX 11 19910626 19910702
hochl001 Hochevar Luke CHA 11 20100515 20110405
welsc001 Welsh Chris SDN 11 19860817 19861005
violf001 Viola Frank BOS 11 19850901 19860530
venam002 Venafro Mike DET 11 20010508 20010808
richc002 Richard Clayton MIN 11 20090410 20110617
dresr001 Drese Ryan NYA 11 20020713 20020724
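
The core of both calculations is finding the longest run of identical outcomes in a chronological sequence. The following Python sketch illustrates that idea only; it is not the code used to produce the tables. The input representation (one boolean per batter faced, already filtered to one pitcher against one opponent and ordered across games) is an assumption, as is the question of which plays count as "retired", which is left to whoever prepares the input.

```python
from itertools import groupby


def longest_streak(outcomes, retired=True):
    """Length of the longest run of consecutive batters who were retired
    (or, with retired=False, who were not retired).

    `outcomes` is a chronological list of booleans, one per batter faced
    by one pitcher against one opponent, spanning as many games as needed.
    True means the batter was retired.
    """
    return max(
        (sum(1 for _ in run) for value, run in groupby(outcomes) if value == retired),
        default=0,
    )


# Made-up sequence for illustration only.
faced = [True, True, False, True, True, True, False, False, True]
print(longest_streak(faced))                 # longest run of batters retired
print(longest_streak(faced, retired=False))  # longest run of batters not retired
```

Because the sequence runs across game boundaries, a streak can span several outings, as with the multi-year streaks in the tables above.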

The source code we used to generate these results can be found here.

Chadwick Persons Register, release 2015-04-05

A new release of the Chadwick Persons Register is now available, just in time for MLB’s Opening Day.

Not too many major new goodies this time around; just the steady march of revised and expanded data.

The 1911 Central League, game-by-game

This is a first view on something we’ve had in the cooker for a while.  The statistical history of minor league baseball is erratically documented. Official averages published by leagues themselves were incomplete (when published at all), and even when published were not always particularly accurate. For at least the last four decades, researchers have attempted to fill the gap by compiling league summaries and averages from newspaper accounts and other sources.

Both official and researcher-compiled averages share the same problem: they’re completely undocumented. Newspaper boxscores are themselves not always that reliable or complete; compiling any set of averages from them is necessarily going to involve editorial judgment. Further, as Retrosheet has documented well, even in Major League Baseball, discrepancies are the rule rather than the exception when looking at historical data.

We have been developing a method for carrying out historical minor league research with a focus on transparency and reproducibility. We’ve piloted this on several leagues, including a complete pass through the 1911 Central League, and we’re now ready to share some preliminary results on our methodology.

First, we’ve developed an efficient method for capturing boxscore data. Here is a sample of the first few days from the 1911 Central League season, captured from Sporting Life. The key idea is to capture the text of the boxscore in a format that is as close to the original as possible. Further, these files are raw: we capture all data – names, statistics, etc. – as given in the source.

Then, we have a set of scripts which parse these files, and then do all of the matching up of player and umpire names and any data cleaning.  From that, we can then output any number of potentially useful reports.  For example, here is our summary report, which includes a standings analysis, games by position for all players, and then player day-by-days.  This gives a great overview of the season’s evolution. We can also export data into other formats, such as CSV game logs, team performance logs, and player logs.

An important point is that the process of producing the report and logs linked here is completely scripted. For example, when a team used two or more pitchers, it was often customary to list hits allowed and innings pitched by all but one of the pitchers, as the balance can be inferred. The script does carry out this inference, but the separation between capturing the original data as-is and the script which does the calculation allows us to know where a particular datum came from. Similarly, if a correction is made to a boxscore, we can separate which data are from the source, and which are editorial judgments.
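
The inference described above can be sketched as follows. This is a hedged illustration rather than our actual script: the function name, the dictionary layout, and the pitcher label are hypothetical, and the figures are made up. The point is that the inferred value carries a provenance label, so it is never mistaken for a value captured from the source.

```python
def fill_unlisted_pitcher(team_total, listed):
    """Infer the one unlisted pitcher's figure as the team total minus
    the figures listed for the other pitchers.

    `listed` maps pitcher name to the figure given in the boxscore.
    The result is labelled as inferred so it stays distinguishable
    from values captured directly from the source.
    """
    balance = team_total - sum(listed.values())
    if balance < 0:
        raise ValueError("listed figures exceed the team total; check the capture")
    return {"value": balance, "provenance": "inferred"}


# Made-up example: the team allowed 9 hits in total, and the boxscore
# lists 4 of them against one pitcher, leaving the other to be inferred.
hits = fill_unlisted_pitcher(9, {"Pitcher A": 4})
print(hits)
```

Innings pitched could be handled the same way if they are first converted to a whole number of outs, which avoids arithmetic on thirds of an inning.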

This is something we’ll continue to develop over the coming months as part of a larger methodology for systematically documenting the statistical history of baseball. In the interests of supporting transparency and reproducibility, we’ll be making available our files (both the raw data input and the processing scripts) under open-source licenses. Watch this blog (and our Facebook and Twitter feeds) for updates.

Useful tips for the data-oriented researcher

There was a nice post yesterday on the Impact of Social Sciences blog by Carly Strasser on data management for the data-oriented researcher:

http://blogs.lse.ac.uk/impactofsocialsciences/2015/02/09/data-versioning-open-science/

The OP is written primarily for social scientists, and primarily those pursuing an academic career, but many of the points apply to doing work with any kind of data, including sports data, especially:

1. Learn to code in some language. Any language.
2. Stop using Excel. Or at least stop ONLY using Excel.
3. Learn about how to properly care for your data.
4. Write a data management plan.
5. Read Reinventing Discovery by Michael Nielsen.
6. Learn version control.

8. Let everyone watch.

These echo closely the core values of Chadwick Bureau, and how we go about producing and maintaining quality datasets. We hope to come back to some of these points in products and projects we have planned for 2015, so do stay tuned.

Baseball-reference adds winter league stats, Cuban stats

Baseball-reference.com has added winter league and Arizona Fall League stats, provided by Chadwick Bureau:

http://www.sports-reference.com/blog/2015/02/winter-league-statistics-added/

http://www.sports-reference.com/blog/2015/02/arizona-fall-league-stats-added-to-baseball-reference/

In addition, historical Cuban National Series stats are up, thanks to Brian Cartwright; we are pleased to have been part of facilitating their publication.

git repository of Retrosheet data updated

We have now updated our git repository containing all the downloadable Retrosheet files at https://github.com/chadwickbureau/retrosheet.

Advantages of using this repository include:

  1. You can download all the Retrosheet files in one go, rather than individually downloading each archive from the Retrosheet site.
  2. You can see what changes in the files in each release, as the repository has several years of history you can look back at.
  3. In the master branch, we also maintain some patches to the Retrosheet files to correct minor formatting or other data errors.

Enjoy!

Chadwick Persons Register updated 2014-11-03

With the World Series closing out the (North American summer) seasons, we have just posted a new version of the Chadwick Persons Register.

This includes all players to appear in North American affiliated leagues, North American independent leagues, NPB, and the KBO in 2014, as well as identifier cross-references where known.

It also includes provisional baseball-databank IDs for players who made their MLB debuts in 2014. We will post an update over the off-season once Retrosheet identifiers have been confirmed for the debutants.

As always, enjoy!

KBO stats on baseball-reference

http://www.sports-reference.com/blog/2014/06/kbo-stats-back-to-1999-baseball-reference-com/

We’re very pleased to have collaborated with baseball-reference.com, Brian Cartwright, Patrick Bourgo, and SABR’s Korea chapter to help bring a first version of Korea Baseball Organization stats to the bb-ref website.