Posted on 06/22/2013
One hundred sixty-two games is a long season, during which a hitter will go through several hot streaks and slumps. As a manager fills out a lineup in the "dog days of summer," wouldn't it be nice to know which solid, consistent hitter to keep in the middle of the order, a hitter who can be counted on to get a hit or get on base nearly every game? There are advanced metrics that attempt to measure many different aspects of a player's game. However, I found a small gap: I could not find a metric that attempts to measure how consistent a batter is at getting hits or getting on base. I have developed two statistics that try to measure this. I call them the "On-Base Consistency Index" (OBCI) and the "Hit Consistency Index" (HCI). What follows is the idea behind the indices, where I got the data, and how I processed that data. The next post will contain examples and analysis of specific players, years, and careers.
Both the OBCI and HCI are built on the same fundamental idea: a "streaky" batter will have long streaks of games in which they get on base (or get a hit) followed by long streaks of games in which they don't. A consistent hitter should also have long streaks of games in which they get on base (or get a hit), but only very short streaks of games in which they do not. To calculate OBCI and HCI, streak lengths are determined from game log data (described later). The streak lengths are then compared in a few different ways, resulting in one number for each index. The higher the number, the more consistent the hitter - at least according to these stats. Until I figure out exactly what I am going to do with these stats, I won't give away the exact formulas. I will say that the formulas have changed drastically over the development process and that they are fairly simple; the hardest part of the calculation is gathering the data. The formulas for OBCI and HCI are exactly the same; each is simply applied to its own data.
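To make the streak idea concrete, here is a minimal Java sketch of how streak lengths could be pulled out of a game log. It is not the actual OBCI/HCI code: the class name `StreakLengths`, the method `fromGames`, and the choice to represent a game log as a `boolean[]` (true meaning the batter reached base, or got a hit, in that game) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class StreakLengths {
    /** Holds the lengths of success runs and failure runs, in game order. */
    public static final class Streaks {
        public final List<Integer> success = new ArrayList<>();
        public final List<Integer> failure = new ArrayList<>();
    }

    /**
     * Splits a chronological game log into runs of consecutive equal values.
     * reached[i] is true if the batter got on base (or got a hit) in game i.
     */
    public static Streaks fromGames(boolean[] reached) {
        Streaks s = new Streaks();
        int run = 0;
        for (int i = 0; i < reached.length; i++) {
            run++;
            boolean lastGame = (i == reached.length - 1);
            // A run ends at the final game or when the next game's result differs.
            if (lastGame || reached[i + 1] != reached[i]) {
                (reached[i] ? s.success : s.failure).add(run);
                run = 0;
            }
        }
        return s;
    }

    public static void main(String[] args) {
        boolean[] games = {true, true, true, false, true, true, false, false, true};
        Streaks s = fromGames(games);
        System.out.println("success streaks: " + s.success); // [3, 2, 1]
        System.out.println("failure streaks: " + s.failure); // [1, 2]
    }
}
```

The same routine would serve both indices: feed it "reached base" flags for OBCI-style input and "got a hit" flags for HCI-style input. How the two lists are then compared is the part the formulas cover, and it is deliberately left out here.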
One thing that I needed to be careful with was the extreme values. The record for the longest hitting streak in MLB history is 56 games, set by Joe DiMaggio in 1941. I struggled for several weeks with how to handle such streaks. I was attempting to put a value on what happened most often. With all due respect to Joe DiMaggio and his record - which will likely never be broken - a 56-game hitting streak is not something that happens often. However, such long streaks take up a significant chunk of the season and should not be disregarded. For these reasons, I decided that the top 10% of streaks - both success streaks (getting on base or getting a hit) and failure streaks - should be treated specially. This produced a more balanced and understandable number.
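As a rough illustration of singling out the extreme streaks, the sketch below picks the longest tenth of a list of streak lengths. Several things here are assumptions rather than the method actually used: that "top 10%" means the longest streaks by count, that the count is rounded up, and the names `ExtremeStreaks` and `longestTenth`. What is done with those streaks afterwards is part of the unpublished formulas and is not shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ExtremeStreaks {
    /**
     * Returns the longest 10% of the given streak lengths (count rounded up),
     * sorted from longest to shortest. The input list is not modified.
     */
    public static List<Integer> longestTenth(List<Integer> lengths) {
        List<Integer> sorted = new ArrayList<>(lengths);
        Collections.sort(sorted, Collections.reverseOrder());
        // Integer ceiling of size / 10; assumed rounding rule.
        int count = (sorted.size() + 9) / 10;
        return new ArrayList<>(sorted.subList(0, count));
    }

    public static void main(String[] args) {
        List<Integer> lengths = new ArrayList<>();
        Collections.addAll(lengths, 1, 2, 1, 12, 3, 1, 2, 5, 1, 4, 2, 1);
        System.out.println("extreme streaks: " + longestTenth(lengths)); // [12, 5]
    }
}
```

Under this reading the same call would be made twice, once on the success streaks and once on the failure streaks, so that an unusually long slump is handled the same way as an unusually long hot streak.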
Now to the data. The information used here was obtained free of charge from and is copyrighted by Retrosheet. Interested parties may contact Retrosheet at 20 Sunset Rd., Newark, DE 19711 (this statement is required by the Retrosheet license). They have box scores for every game since 1915 and play-by-play data for most games since 1945 available for download. The data comes in "event files," and the required data is extracted from these files using the tools Retrosheet supplies. These tools are Windows .exe programs.
Since I haven't owned a Windows machine in about 8 years, I had to use MacPorts and WINE to run them on my iMac. I also wrote a small PHP script to run them all at once. After all the box score and play-by-play data had been extracted, I wrote a Java program to compile the data player by player. The Comparator interface was instrumental for this step because I needed the games played by each batter to be in chronological order (Retrosheet does not provide them that way). Then I wrote the Consistency Index object that handles the calculation of the OBCI and HCI. I also wrote a very simple GUI for viewing the indices and related statistics for a chosen player. There is also an option to compare two players at a time, or the same player in two different time periods.
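For readers unfamiliar with Java's `Comparator`, the sketch below shows the kind of chronological sort described above. The `GameLine` class, its fields, and the encoding of the date as a `yyyymmdd` integer with a separate game number for doubleheaders are hypothetical stand-ins; the real program's classes are not shown in this post.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class ChronologicalSort {
    /** One batter's line for one game (hypothetical layout). */
    public static final class GameLine {
        public final int date;        // yyyymmdd, e.g. 20120705
        public final int gameNumber;  // 0 for a single game, 1 or 2 in a doubleheader
        public final boolean hit;     // got at least one hit
        public final boolean onBase;  // reached base at least once

        public GameLine(int date, int gameNumber, boolean hit, boolean onBase) {
            this.date = date;
            this.gameNumber = gameNumber;
            this.hit = hit;
            this.onBase = onBase;
        }
    }

    /** Orders games by date, then by game number within the same date. */
    public static final Comparator<GameLine> BY_DATE = new Comparator<GameLine>() {
        @Override
        public int compare(GameLine a, GameLine b) {
            int byDate = Integer.compare(a.date, b.date);
            return byDate != 0 ? byDate : Integer.compare(a.gameNumber, b.gameNumber);
        }
    };

    public static void main(String[] args) {
        List<GameLine> games = new ArrayList<>();
        games.add(new GameLine(20120705, 0, true, true));
        games.add(new GameLine(20120403, 2, false, true));
        games.add(new GameLine(20120403, 1, true, true));
        Collections.sort(games, BY_DATE);
        for (GameLine g : games) {
            System.out.println(g.date + " game " + g.gameNumber);
        }
    }
}
```

Getting this order right matters because streak lengths depend entirely on which game follows which; a doubleheader sorted the wrong way round could split or merge a streak.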
As I write this post, my computer is hard at work calculating the top OBCIs and HCIs for every year from 1915 to 2012. It is also calculating the top indices for every three-year and five-year period. All three of these programs are written in Java as well. Once I have that data and am able to analyze it, I'll write a new post. I think that data will be very interesting. It should be noted that a single season is almost too little data for these indices. You can learn something by analyzing single-season data, but there is much more to learn from three-year, five-year, and career data. If there is any player, year, or span of years you would like to read about, please leave a comment or contact me and I will be glad to include it.
Allow me to introduce myself. I am a mathematician and programmer. Currently, I am working on a PhD in arithmetic geometry. I like to write about many things including math, sports, programming, education, and technology.