How accurate is the data?

Posted: 18th of September, 2009

After creating the initial data that appears on this site I had a great desire to expand upon it and add extra information to each file, such as power-plays or new-ball information. Thankfully sense took hold, and I’ve instead spent some time writing scripts that let me easily check the data for correctness. Among the checks performed are one to test that the team score is correct for each innings, and another to check that the runs scored and balls faced in each individual player’s innings are correct.
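To give a flavour of these two checks, here is a minimal sketch in Python. It is not the actual script, and it assumes a made-up ball-by-ball layout: each delivery is a dict with hypothetical keys `batter`, `runs_bat`, `extras` and `extras_type`. It also assumes the usual scoring convention that a wide does not count as a ball faced. The deliveries and expected figures are invented purely for illustration.

```python
from collections import defaultdict


def summarise_innings(balls):
    """Rebuild the team total and each batter's runs/balls from the deliveries."""
    team_total = 0
    batting = defaultdict(lambda: {"runs": 0, "balls": 0})
    for ball in balls:
        # Every run, off the bat or as an extra, counts towards the team.
        team_total += ball["runs_bat"] + ball["extras"]
        card = batting[ball["batter"]]
        card["runs"] += ball["runs_bat"]
        # Assumption: wides are not counted as balls faced by the batter.
        if ball.get("extras_type") != "wide":
            card["balls"] += 1
    return team_total, dict(batting)


def check_innings(balls, expected_total, expected_batting):
    """Compare the rebuilt figures with a scorecard; return a list of problems."""
    total, batting = summarise_innings(balls)
    problems = []
    if total != expected_total:
        problems.append(f"team total: got {total}, expected {expected_total}")
    for player, (runs, faced) in expected_batting.items():
        got = batting.get(player, {"runs": 0, "balls": 0})
        if (got["runs"], got["balls"]) != (runs, faced):
            problems.append(
                f"{player}: got {got['runs']} off {got['balls']}, "
                f"expected {runs} off {faced}"
            )
    return problems


# Invented deliveries, for illustration only.
balls = [
    {"batter": "B McCullum", "runs_bat": 4, "extras": 0, "extras_type": None},
    {"batter": "B McCullum", "runs_bat": 0, "extras": 1, "extras_type": "wide"},
    {"batter": "B McCullum", "runs_bat": 1, "extras": 0, "extras_type": None},
    {"batter": "N McCullum", "runs_bat": 0, "extras": 1, "extras_type": "legbye"},
    {"batter": "N McCullum", "runs_bat": 2, "extras": 0, "extras_type": None},
]

scorecard = {"B McCullum": (5, 2), "N McCullum": (2, 2)}
problems = check_innings(balls, 9, scorecard)
```

With a matching scorecard the list of problems comes back empty. If runs had been credited to the wrong McCullum, the team total would still agree but both batters’ lines would be reported, which is why the two checks are worth running together.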

After running the new scripts on the initial data files I found a few small errors, generally down to confusion between players with the same surname, such as Brendon and Nathan McCullum. I quickly fixed the errors and now run the scripts on every data file before release. Each available data file now contains accurate information on every ball bowled in a match: who was facing, who was bowling, who was at the non-striker’s end, any runs scored (whether off the bat or as extras), the method of any dismissal, and so on.

Now that I’m confident about the accuracy of the published data I have a number of possible paths to follow. I could expand the data available for each game, properly document the data format, look for other sources of data, or refine the structure of the data. I’m not sure which way to proceed. Any thoughts? Whichever I choose, I’ll continue to add new international games as they’re played.