A long overdue addition – Women’s data

I’ve been making data available on this site since 2009 and have gradually increased the number of files I provide to the point that, as I write, 2,780 matches are available. Over time I’ve expanded from just matches involving Full Members, to the Indian Premier League, non-ODI international one-day matches, and international T20s. This gradual expansion means that I’m now providing over 380 matches involving only the Associates and Affiliates, so I’m no longer covering just the Full Members. This has been an improvement; however, there is still one issue: I’ve only been providing data for Men’s cricket.

I’ve wanted to add data for Women’s cricket for a while. I started the project with the idea of providing cricket data, but I didn’t really think of anything beyond Men’s cricket. Raf Nicholson expressed very well the trap I let myself fall into.

At its heart, it comes down to this: The first “C” in ICC has always, since its formation in 1909, stood for “cricket”, though what it should really have been called, up until it took control of women’s cricket in 2005, was the IMCC – the International Men’s Cricket Council. When a male journalist says, “I am a cricket correspondent”, he means “I am a men’s cricket correspondent.” When a blog refers to itself as an “England cricket blog”, what this generally means is “an England men’s cricket blog”. And when ordinary cricket fans say “cricket”, almost without exception what they really mean is “men’s cricket”. In short, men’s cricket is the default setting.

I very much fell into the trap of viewing Men’s cricket data as cricket data, and not considering Women’s cricket at all. This is unfair, and something I’ve been planning to fix. Men’s sports have awesome data as Allison McCann has noted, while Women’s sport is poorly served.

And just because the data doesn’t exist doesn’t mean we can’t compile it ourselves or make estimates based on what is available. I just think that in addition to praising the virtues of men’s sports data, we need to acknowledge that good women’s sports data is severely lacking.

I’m happy to announce that, as of today, Cricsheet will finally be providing data for Women’s cricket. The initial release consists of 257 matches, comprising 148 T20Is, 69 ODIs, 37 International T20s, and 3 Test Matches, and includes matches from as far back as 2009.

The addition of Women’s data has a practical implication for the data we already provide. The Data Format has just been changed to update the version from 0.6 to 0.7, to allow for the addition of gender as a new field in the info section. Right now this field contains either female or male, but I reserve the right to have other values in the future.
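As a minimal sketch of how the new field might be used, assume each data file has already been parsed into a Python dictionary with an `info` section that may hold `gender`. The `matches_for_gender` helper and the trimmed sample dictionaries are made up for illustration and are not part of the data format.

```python
def matches_for_gender(matches, gender):
    """Return the parsed matches whose info section has the given gender."""
    selected = []
    for match in matches:
        # .get() avoids a KeyError if a file has no gender field at all.
        if match.get("info", {}).get("gender") == gender:
            selected.append(match)
    return selected


# Hypothetical, heavily trimmed stand-ins for parsed data files.
sample = [
    {"info": {"gender": "female"}},
    {"info": {"gender": "male"}},
    {"info": {}},  # a file without the field is simply never selected
]

womens = matches_for_gender(sample, "female")
print(len(womens))
```

Comparing against an explicit value, rather than treating anything that is not male as female, keeps the code safe if other values appear in the field later.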

The Downloads page on the site has also been updated to allow users to download Women’s or Men’s matches in all of the variations we previously provided, as well as continuing to download all matches for all genders.

As Raf Nicholson wrote, men’s cricket is the default setting, and I’ve been guilty of having that mindset. Today is a small step towards changing that, and towards no longer viewing men’s cricket as the default.

Until that changes, we have a problem. Until that changes, I’m going to keep telling the world that I am a feminist. Cricket needs feminism. End of story.

Version updated to 0.6

The data version included in every data file I provide, and explained on the format page of the site, has just been changed from 0.5 to 0.6. This actually reflects a relatively minor change, and is the first time I’ve bumped the version number since February 2013.

In the 1st Test of 2014 between Pakistan and Australia, Sarfraz Ahmed was dismissed and play stopped for tea. After the break Zulfiqar Babar, who had been batting with Ahmed, didn’t come back out and retired hurt. This meant that in the data I needed to record 2 dismissals related to a single delivery. A complication had arisen.

As I’d never even considered multiple wickets on a single delivery as a possibility, and since it had never occurred in the previous 31,271 wickets I provide data for, I’ve had to tweak the data format, along with numerous scripts, to allow for this possibility. The change I’ve implemented allows the wicket entry on a delivery to contain a list of wickets, rather than always assuming just one. Balls where only a single wicket fell (all 31,271 of them thus far) are unchanged; this tweak simply allows for the possibility of something different.

If you’ve written code that uses the data I provide, you should make a small tweak to check for the existence of multiple wickets on a delivery. If you don’t, you’ll probably be fine, apart from when you try to process the single Test match where this issue occurs.
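The check could look something like the sketch below. It assumes a delivery has been parsed into a Python dictionary and that the wicket information sits under a `wicket` key holding either a single entry or a list of entries. The helper name `wickets_on` and the `player_out` key in the sample data are illustrative assumptions, not a statement of the exact format.

```python
def wickets_on(delivery):
    """Return the wickets on a delivery as a list (possibly empty)."""
    wicket = delivery.get("wicket")
    if wicket is None:
        return []          # no wicket fell on this ball
    if isinstance(wicket, list):
        return wicket      # newer form: a list of wickets
    return [wicket]        # older form: a single wicket entry


# Hypothetical parsed deliveries covering the three cases.
no_wicket = {"batsman": "Sarfraz Ahmed"}
one_wicket = {"wicket": {"player_out": "Sarfraz Ahmed"}}
two_wickets = {
    "wicket": [
        {"player_out": "Sarfraz Ahmed"},
        {"player_out": "Zulfiqar Babar"},
    ]
}

for delivery in (no_wicket, one_wicket, two_wickets):
    print([w["player_out"] for w in wickets_on(delivery)])
```

Normalising to a list in one place means the rest of your code can loop over wickets without caring which form a particular file uses.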

There will be substantial changes to the data format coming in the next number of months, which will add new information for many of the matches currently covered. These may require tweaks to some of your code, but I will be providing parallel versions of the data files for a period of time, allowing users to continue to use the older version while updating their code. More details on these changes soon.

9 new countries, 91 new matches

3 months ago, in October, we said that we would “need to give some further thought as to how we will deal with T20 matches that aren’t regarded by the ICC as ‘T20 Internationals’”.

Well, we’ve reached a conclusion and implemented it. We’ve just updated the site with 91 new data files, including 53 international T20 matches (not T20 internationals), and added an extra 9 countries to those we have some data for, namely Denmark, Hong Kong, Italy, Namibia, Nepal, Papua New Guinea, Uganda, United Arab Emirates, and the USA. All 53 of these matches come from the World T20 Qualifier in November 2013. The match_type used for these matches is IT20, meaning “International T20”.

What is the difference between a T20 International and an International T20?

“T20 International” and “International T20” sound like they refer to the same type of match, but there is a subtle difference. A “T20 International” is a match that is recognised by the ICC as being a full international. This means a match between Full Members or those Associates and Affiliates to whom the ICC has “granted” T20 status. An “International T20” is the name we’re using to cover all other international T20 matches, such as those involving a country that hasn’t been granted T20 status.

Confusingly a country can play both types of T20. Ireland did so during the World T20 Qualifiers, playing a “T20 international” against Canada in the group stage while playing “International T20s” in the other group matches.

We don’t think the distinction should exist. If a match is played between any two countries and follows the T20 rules, we think it should have the same status as any other similar match. Sadly, the ICC disagrees, so we note the difference for accuracy.

What about One-day matches that aren’t ODIs?

We don’t currently have any non-ODI one-day matches on the site. We’ll start adding these when the World Cup Qualifiers start in a week’s time. These will slowly appear on the site with a match_type of ODM.
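If you want to separate these matches from those with full ICC status in your own code, a small lookup on match_type may be enough. The sketch below uses only the two codes named above (IT20 and ODM); the constant name, the function name, and the assumption that you already have the match_type as a string are ours for illustration.

```python
# The two codes named in the text above, with a short description of each.
NON_STATUS_MATCH_TYPES = {
    "IT20": "International T20",
    "ODM": "non-ODI one-day match",
}


def lacks_icc_status(match_type):
    """True for international matches the ICC does not give full status."""
    return match_type in NON_STATUS_MATCH_TYPES


for code in ("IT20", "ODM"):
    print(code, "->", NON_STATUS_MATCH_TYPES[code], lacks_icc_status(code))
```

Any other match_type value simply returns False, so existing code that only understands the older match types keeps working.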

We will look into adding older data for these types of match too; however, finding the data will be the main problem, as always. It’s rare that ball-by-ball commentary is provided for these matches, and it’s even more rare for it to be accurate when it is done, sadly. We’re quite good at fixing errors by now (practice will do that) but some sources are so catastrophically bad that we have to just throw our hands up and walk away.

When is an ODI not an ODI?

A person without knowledge of the cricket world, when asked to define a “one-day international”, might say that it would be a match between two countries played on a single day under an appropriate set of rules. This apparently reasonable explanation would be incorrect. This illogical position is currently causing uncertainty as we investigate a forthcoming change of emphasis, and is one of many small signs of the inequality in the world of cricket.

As we write there are 106 nations listed as members of the ICC, 10 Full Members, 37 Associates, and 59 Affiliates. Of those 106 members only the Full Members have the permanent right to play ODIs. 6 of the Associates/Affiliates are “granted” the right to play ODIs for a limited period, depending on how they do in the World Cricket League. The other 90 members don’t get to play ODIs, but they can play one-day matches against other countries (although rarely, if ever, a Full Member).

In June, Andrew Nixon made an observation on Twitter regarding “the lack of variety at the top of international cricket. Same teams playing each other over and over”. We responded that “It’s disturbing how many people have no idea more than 10 countries play the game, and that they don’t see that as a problem”. At that point we realised that we were falling into a variation of that mistake.

At the moment we provide data for Tests, ODIs, T20 Internationals, and IPL matches. This means that for international cricket we’re focussing exclusively on 16 out of the 106 ICC members. The vast majority of our data files feature only Full Members. We hadn’t even been attempting to include any other international matches. We’ve now decided this must change to include matches featuring all countries. This causes a dilemma for us as we have to decide how to refer to these matches on the site and within the data we provide. It may seem like a minor issue, but it’s one that is standing in the way of this expansion.

We were originally tempted to call all one-day matches between countries ODIs and to be done with it; however, we feel that there should be a way to indicate that the matches were viewed as distinct from ODIs at the time they were played. In an ideal world there would be no such distinction, but our opinion on this shouldn’t cloud our work in creating reliable data, so that option is off the table. We’re gradually coming to the view that we’ll just call each of these matches a “One-day match” and use the short code of ‘ODM’ for the match_type in the data files. It’s a small difference but it seems to fulfill our requirements.

We haven’t yet implemented this addition in any data file. We still have a number of issues to deal with first, mainly on the website, so that we can adequately distinguish between ODIs and ODMs, but also so that we can have useful coverage figures. We also need to give some further thought as to how we will deal with T20 matches that aren’t regarded by the ICC as ‘T20 Internationals’, as well as dealing with multi-day matches such as the Intercontinental Cup. Once we deal with some, if not all, of these issues we’ll look into adding matches involving the rest of the cricket world to the site. The main problem there will be in finding any source of ball-by-ball data, but we’ll worry about that problem when we can.

Expanded coverage data

Today we updated the coverage data page to provide extra information on the level of coverage we’re providing on the site. Previously we were providing only total coverage figures; now we’re providing extra breakdowns by Full Members, and Associates and Affiliates, as well as showing coverage figures for matches between the two groups. Our overall coverage figure stays at 92.02%, as it was before, but it can now be seen that we have coverage for 98.24% of matches we’ve attempted to cover between the Full Members. We’ve also broken the yearly figures down by the same criteria.
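To illustrate how a single overall figure can hide differences between the groups, here is a small sketch of the arithmetic. The counts are entirely made up for the example and are not our real figures, and the group names and the `coverage_percent` helper are ours for illustration.

```python
def coverage_percent(covered, attempted):
    """Percentage of attempted matches we have data for, to 2 decimals."""
    if attempted == 0:
        return 0.0
    return round(100 * covered / attempted, 2)


# Made-up (covered, attempted) counts, NOT the site's real numbers.
groups = {
    "full_members": (90, 100),
    "associates_and_affiliates": (10, 40),
    "mixed": (30, 60),
}

breakdown = {
    name: coverage_percent(covered, attempted)
    for name, (covered, attempted) in groups.items()
}
overall = coverage_percent(
    sum(covered for covered, _ in groups.values()),
    sum(attempted for _, attempted in groups.values()),
)

print(breakdown)
print(overall)
```

With these invented counts the overall figure sits well below the Full Member figure, which is the same effect the new breakdown makes visible on the coverage page.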

This change is partly motivated by a desire to see more accurate figures ourselves. The overall coverage has been floating at around 92% for a number of months now, and we felt that the number was being dragged down by the difficulty in sourcing the data for the Associates and Affiliates. We were surprised at how high our coverage of Full Members matches is, and it has shown us how much work needs to be done for the Affiliates and Associates.

The other motivation we have for splitting the coverage data is that we’re going to start trying to add more international matches. This will include any full international match for a country for which we can source data, such as the recent Italy vs Denmark T20 match. Inevitably this will have the effect of reducing our overall coverage figure, so it will be useful to have more detailed figures so that we can see a more nuanced picture.

We’ll go into more detail regarding our plan to expand our coverage in the near future.

Now available: Zip files

I’ve had a couple of requests over the years for zip files of different groups of matches, particularly international T20s, and Indian Premier League matches. I’ve generally created a file for those groups, sent the requester a link to it, and left it at that. I’ve now made a slight tweak to that process. You’ll now find a section called *Zip Files* on the homepage which contains links to 5 zip files, one each for Test matches, One-day internationals, T20 internationals, IPL matches, and one zip file of all 1,182 matches we currently have. The zip files contain the same data files you can still download individually; however, if you’re after a number of matches, one of the zip files might make more sense for you.

I’ve written a script to generate the zip files based on the criteria I provide, so if anyone wants/needs a different subset feel free to ask.
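For anyone who would rather build their own subset from files they have already downloaded, the general idea can be sketched with Python’s standard `zipfile` module. This is not the script used for the site. The `build_zip` helper, the `.yaml` extension, the file names, and the selection rule are all assumptions made for the example.

```python
import tempfile
import zipfile
from pathlib import Path


def build_zip(source_dir, zip_path, wanted):
    """Write every file in source_dir accepted by wanted() into zip_path."""
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(Path(source_dir).iterdir()):
            if path.is_file() and wanted(path):
                # arcname keeps only the file name, not the full local path.
                archive.write(path, arcname=path.name)
                added.append(path.name)
    return added


# Demonstration with throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "data"
    data_dir.mkdir()
    (data_dir / "1001.yaml").write_text("info: {}\n")
    (data_dir / "1002.yaml").write_text("info: {}\n")
    (data_dir / "notes.txt").write_text("not a match file\n")

    zip_path = Path(tmp) / "matches.zip"
    added = build_zip(data_dir, zip_path, lambda p: p.suffix == ".yaml")
    with zipfile.ZipFile(zip_path) as archive:
        names = sorted(archive.namelist())

print(added)
```

The `wanted` argument is just a function that says yes or no to each file, so a different subset only needs a different rule.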

A new data version

I’ve been adding matches to the site on a fairly regular basis for the last few years, despite the lack of new articles on the site. Now, however, the period of silence is finally over, as there is a new data version to announce. Today I’ve moved all of the data files to version 0.5 and made a few other small changes to the site. First of all we’ll deal with the data format changes for version 0.5. These are fairly minor for the most part.

The first change is the addition of a revision field to the meta section of the file. This is set to 1 for every file at the moment and will increment any time there is a revision to the file. This replaced the updated field which I’ve decided was of little use.
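One possible use of the revision field is deciding whether a copy you downloaded earlier is out of date. The sketch below assumes both copies have been parsed into Python dictionaries with a `meta` section; the `needs_refresh` helper is hypothetical, and treating a missing revision as 0 is our assumption for files from before this version.

```python
def needs_refresh(local_file, latest_file):
    """True when the latest copy has a higher revision than the local one."""
    # Assumption: a file without a revision field predates version 0.5.
    local_rev = local_file.get("meta", {}).get("revision", 0)
    latest_rev = latest_file.get("meta", {}).get("revision", 0)
    return latest_rev > local_rev


# Hypothetical, trimmed stand-ins for two parsed copies of one match.
mine = {"meta": {"revision": 1}}
same = {"meta": {"revision": 1}}
newer = {"meta": {"revision": 2}}

print(needs_refresh(mine, same), needs_refresh(mine, newer))
```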

The second change is the addition of new fields to deal with the situation where a match is decided by a bowl-out. The first is a bowl_out field in the outcome part of the info section, which indicates which team won the match by the bowl-out. The second bowl_out, an addition to info, is an array containing the details of the actual bowl-out. It lists each ball bowled, showing the bowler and the outcome. An example of a bowl-out can be seen in the file for the first West Indian T20 international in 2006.

The final change is the addition of a supersub entry to any delivery in which a super-substitution was made. This will be an array containing an entry for each substitution, containing in, out, and team fields showing which player came in, who was replaced, and which team made the substitution. You can see the only example on the site at this time in a South Africa vs New Zealand T20 match from 2005.
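Reading these entries could look like the sketch below. It assumes the deliveries of a match have been parsed into Python dictionaries, each optionally carrying a `supersub` list whose entries have the `in`, `out`, and `team` fields described above. The `substitutions` helper and the placeholder team and player names are invented for the example.

```python
def substitutions(deliveries):
    """Collect (team, player_out, player_in) for every super-substitution."""
    found = []
    for delivery in deliveries:
        # Most deliveries have no supersub entry, so default to an empty list.
        for sub in delivery.get("supersub", []):
            found.append((sub["team"], sub["out"], sub["in"]))
    return found


# Hypothetical parsed deliveries with placeholder names.
deliveries = [
    {"bowler": "Bowler X"},
    {
        "bowler": "Bowler X",
        "supersub": [{"team": "Team A", "out": "Player 1", "in": "Player 2"}],
    },
]

print(substitutions(deliveries))
```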

A number of changes are already in the works for version 0.6 of the data. More details on what those will be will come in the next few weeks.

How accurate is the data?

After creating the initial data that appears on this site I had a great desire to expand upon it and add extra information to each file, such as power-plays or new ball information. Thankfully sense took hold and I’ve instead spent some time writing scripts to allow me to easily check the data for correctness. Among the checks performed are one to test that the team scores are correct for each innings and another to check that the runs and balls faced for each individual player’s innings are correct.
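The two checks mentioned above can be sketched roughly as follows. This is not one of the actual scripts. The delivery layout (`batter`, `batter_runs`, `extras`, `wide`), the `check_innings` helper, and the sample numbers are all invented for illustration; the only cricket rule relied on is that a wide does not count as a ball faced.

```python
from collections import defaultdict


def check_innings(deliveries, expected_total, expected_batters):
    """Return a list of problems found; an empty list means all checks pass."""
    problems = []

    # Check 1: the team score should equal the sum over every delivery.
    total = sum(d["batter_runs"] + d["extras"] for d in deliveries)
    if total != expected_total:
        problems.append(f"team total {total} != expected {expected_total}")

    # Check 2: runs and balls faced for each batter.
    tallies = defaultdict(lambda: [0, 0])  # batter -> [runs, balls faced]
    for d in deliveries:
        tally = tallies[d["batter"]]
        tally[0] += d["batter_runs"]
        if not d.get("wide", False):  # a wide is not a ball faced
            tally[1] += 1
    for batter, expected in expected_batters.items():
        got = tuple(tallies.get(batter, (0, 0)))
        if got != tuple(expected):
            problems.append(f"{batter}: got {got}, expected {tuple(expected)}")
    return problems


# Invented data: four deliveries, one of them a wide.
balls = [
    {"batter": "Batter A", "batter_runs": 4, "extras": 0},
    {"batter": "Batter A", "batter_runs": 0, "extras": 1, "wide": True},
    {"batter": "Batter B", "batter_runs": 1, "extras": 0},
    {"batter": "Batter A", "batter_runs": 0, "extras": 0},
]
scorecard = {"Batter A": (4, 2), "Batter B": (1, 1)}

print(check_innings(balls, 6, scorecard))
```

Returning a list of problems, rather than stopping at the first one, makes it easier to see every discrepancy in a file in a single run.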

After running the new scripts on the initial data files I found a few small errors, generally down to confusion between players with the same surname, such as Brendon and Nathan McCullum. I quickly fixed the errors and now run the scripts on every data file before release. Each available data file is now accurate in containing information on every ball bowled in a match, including who was facing, bowling or at the non-striker’s end, any runs scored, whether off the bat or extras, methods of dismissal etc.

Now that I’m confident about the accuracy of the published data I have a number of possible paths to follow. I could expand the data available for each game, properly document the data format, look for other sources for data, or refine the structure of the data. I’m not sure which way to proceed. Any thoughts? Regardless of the one I choose I’ll continue to add new international games as they’re played.

Small Steps

This site contains the initial data files I generated for a number of international matches in 2009. They’re accurate, but not in a format I’m completely happy with. Rather than wait until I work out the right format I’m just throwing them out there and will change them as required.

Three things inspired me to do my initial work on generating data files for cricket matches: the first was the book Moneyball by Michael Lewis regarding the efforts of the baseball team the Oakland Athletics to use statistical analysis to build the roster; the second was a post on Pappus plane briefly mentioning a database of cricket data; the third, my discovery of the inspiring work of Aneesh at Against The Spin in providing data for numerous T20 matches.

After a brief discussion with Aneesh, I decided to put some work into trying to expand on his work. Rather than going into mind-numbing detail regarding the process I’ll simply say that I succeeded in adding further details of each wicket, such as who was out, how, and who was involved, better player names, and non-striker information. These additions are merely the first small steps towards the level of data I would like to see available to statisticians. My thoughts on where this may go will come at a later date.