How accurate is the data?

After creating the initial data that appears on this site I had a great desire to expand upon it and add extra information to each file, such as power-plays or new ball information. Thankfully sense took hold and I’ve instead spent some time writing scripts that let me easily check the data for correctness. Among the checks performed are one to test that the team score is correct for each innings and another to check that the runs and balls faced are correct for each individual player’s innings.
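As a rough illustration of those two checks, here is a minimal Python sketch. It is not the actual script: the record layout (the `batsman`, `runs_batsman`, `extras` and `extras_type` keys) and the function names are assumptions made up for this example, and the expected figures are assumed to come from a published scorecard. It also assumes the usual scoring convention that a wide does not count as a ball faced.

```python
from collections import defaultdict


def summarise_innings(deliveries):
    """Derive the team total and each batsman's runs and balls faced.

    `deliveries` is a list of dicts using a hypothetical layout:
    "batsman", "runs_batsman", "extras" and an optional "extras_type".
    """
    total = 0
    batting = defaultdict(lambda: {"runs": 0, "balls": 0})
    for d in deliveries:
        # Every run, off the bat or as an extra, counts towards the team.
        total += d["runs_batsman"] + d["extras"]
        card = batting[d["batsman"]]
        card["runs"] += d["runs_batsman"]
        # Assumption: wides are not counted as balls faced.
        if d.get("extras_type") != "wide":
            card["balls"] += 1
    return total, dict(batting)


def check_innings(deliveries, expected_total, expected_batting):
    """Return a list of mismatches between the data and the scorecard.

    `expected_batting` maps a player name to a (runs, balls) tuple.
    An empty list means both checks passed.
    """
    total, batting = summarise_innings(deliveries)
    problems = []
    if total != expected_total:
        problems.append(
            f"team total: derived {total}, scorecard {expected_total}"
        )
    for name, (runs, balls) in expected_batting.items():
        got = batting.get(name, {"runs": 0, "balls": 0})
        if (got["runs"], got["balls"]) != (runs, balls):
            problems.append(
                f"{name}: derived {got['runs']} ({got['balls']}), "
                f"scorecard {runs} ({balls})"
            )
    return problems
```

A check like the second one is the kind that would flag runs credited to the wrong player, since that player's derived figures would no longer match the scorecard.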

After running the new scripts on the initial data files I found a few small errors, generally down to confusion between players with the same surname, such as Brendon and Nathan McCullum. I quickly fixed the errors and now run the scripts on every data file before release. Each available data file now accurately records every ball bowled in a match: who was facing, bowling, or at the non-striker’s end; any runs scored, whether off the bat or as extras; the method of any dismissal; and so on.

Now that I’m confident about the accuracy of the published data I have a number of possible paths to follow. I could expand the data available for each game, properly document the data format, look for other sources of data, or refine the structure of the data. I’m not sure which way to proceed. Any thoughts? Whichever I choose, I’ll continue to add new international games as they’re played.

Small Steps

This site contains the initial data files I generated for a number of international matches in 2009. They’re accurate, but not in a format I’m completely happy with. Rather than wait until I work out the right format I’m just throwing them out there and will change them as required.

Three things inspired me to do my initial work on generating data files for cricket matches: the first was the book Moneyball by Michael Lewis, about the efforts of the Oakland Athletics baseball team to use statistical analysis to build their roster; the second was a post on Pappus plane briefly mentioning a database of cricket data; the third was my discovery of the inspiring work of Aneesh at Against The Spin in providing data for numerous T20 matches.

After a brief discussion with Aneesh, I decided to put some work into expanding on what he had done. Rather than going into mind-numbing detail about the process, I’ll simply say that I succeeded in adding further details of each wicket (who was out, how, and who was involved), better player names, and non-striker information. These additions are merely the first small steps towards the level of data I would like to see available to statisticians. My thoughts on where this may go will come at a later date.