The Marine Corps Marathon – cleaning the data

The 2014 Marine Corps Marathon took place on October 27th on a course that runs through Washington, DC and Arlington, VA. This year my son ran in the race, so I was there for the excitement. But being an observer introduced me to some of the data quality issues. During the race I was set up to get a text message as he passed the 10k, 20k, 30k, and 40k points in the race. The system failed miserably. I got the texts, but some were so late as to be meaningless. After the race the more significant data quality issue became apparent: they had the wrong start time for him.

Now step back. The Marine Corps Marathon uses an RFID chip on each runner to track where they are in the race. It is also used to capture each runner's start and finish times. With over 20,000 finishers in the race it is impossible for all of them to cross the start line at the same time. Also, with 20,000 runners there are certain to be problems with the system. When one views the results page for the race there are two time markers for each runner: the net time and the clock time. The clock time is the time from the starting gun until the runner crossed the finish line. The net time is the time from when the runner crossed the start line until he or she crossed the finish line. So, for example, the first runner listed on the results page crossed the finish line in 3:53, but took 3:50 to run the race. For my son the net time was in error. He crossed the start line about three minutes into the race, but the initial results showed that he crossed at the gun. That is what got me looking at the data.
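The relationship between the two times can be sketched in a few lines of Python. This is only an illustration, assuming the times are stored as H:MM:SS strings; the function names are mine and have nothing to do with the race's timing system.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def start_offset(clock_time: str, net_time: str) -> int:
    """Seconds after the gun at which the runner crossed the start line."""
    return to_seconds(clock_time) - to_seconds(net_time)


# Hypothetical runner with a clock time of 3:53:00 and a net time of 3:50:00.
print(start_offset("3:53:00", "3:50:00"))  # 180, i.e. three minutes after the gun
```

An offset of zero means the runner is recorded as starting at the gun, which is what the initial results showed for my son.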

An attempt is made to put the faster runners at the front of the queue so that they do not have to pass the slower runners during the race. One of my hopes is to do some analysis of how well the predictions worked. But that is for a future post.

The results for the race were available on race day, but were labeled as unofficial. I downloaded those results and did a bit of analysis. The official results were posted about a week ago, so I downloaded those results and looked again at some of the issues. As it turns out, not all of the problems with the start time were resolved. It actually looks like very few were corrected.

The official results listed 23,468 finishers. It is not apparent until you look at the finish times that the list includes 88 runners who are in what are called the “rim” and “wheel” categories. Their results are also hidden on a separate page. I decided for my purposes to delete them from the list. One can argue whether that is appropriate or not. This left 23,380 runners.

While downloading the data I discovered that at least one runner was in the file twice. A quick check of the file found two other cases. There were 23,377 runners remaining.
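The two cleanup steps so far, which take the list from 23,468 to 23,380 and then to 23,377 runners, can be sketched as below. This is a minimal illustration rather than what I actually ran: the column names `bib` and `category`, the `run` label, and the sample rows are all made up, and it assumes a repeated runner shows up as a repeated bib number.

```python
def clean_results(rows):
    """Drop the separately listed categories, then drop repeated runners."""
    excluded = {"rim", "wheel"}   # the categories removed from the list
    seen_bibs = set()
    cleaned = []
    for row in rows:
        if row["category"] in excluded:
            continue              # listed on the separate results page
        if row["bib"] in seen_bibs:
            continue              # same runner already kept once
        seen_bibs.add(row["bib"])
        cleaned.append(row)
    return cleaned


# Made-up rows: one wheel entry and one runner listed twice.
rows = [
    {"bib": 101, "category": "run"},
    {"bib": 102, "category": "wheel"},
    {"bib": 101, "category": "run"},
    {"bib": 103, "category": "run"},
]
print(len(clean_results(rows)))  # 2
```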

Next up was an examination of the distribution of net time vs clock time for the runners. The first problem was that one runner, at least according to the data, started the race five minutes before the gun. Perhaps they were standing on the wrong side of the start line when the race started. You can watch the start of the race and the finish line videos here. The count of how many runners crossed the start line by seconds into the race was informative. It looks very much like the issue of getting the correct start time was not addressed for most runners. In the first second of the race 303 runners crossed the start line. In the next second this dropped to 54 runners. After that it quickly leveled off to 30-40 runners per second. Given the experience with my son, I very much doubt that the 303 number is correct. Also, looking at the run times of those who, according to the database, crossed the line in the first second, I very much doubt that the count of 303 runners is anywhere near correct.
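A count like this is easy to reproduce once each runner has a start offset, meaning clock time minus net time in seconds. Here is a rough sketch using made-up offsets; the function names are hypothetical and the sample values are not from the race file.

```python
from collections import Counter


def starts_per_second(offsets):
    """Count how many runners crossed the start line at each second offset."""
    return Counter(offsets)


def before_the_gun(offsets):
    """Offsets below zero: runners recorded as starting before the gun."""
    return [offset for offset in offsets if offset < 0]


# Made-up offsets in seconds after the gun.
offsets = [0, 0, 0, 1, 2, 2, -300, 185]

counts = starts_per_second(offsets)
for second in sorted(counts):
    print(second, counts[second])

print(before_the_gun(offsets))  # [-300]
```

A spike at offset zero followed by a much flatter count in the seconds after it is the pattern that made me suspicious.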

To put this in perspective, the error rate is actually very low. Likely, given the size of the crowd, the RFID chip did not register when some of the runners crossed the start line. The system must have defaulted to placing them at the start line when the race started. This is a failure rate of around one percent. Runners who brought up the issue had their times corrected. My son’s was changed. I do not have any idea how they made the correction. They did ask him to provide any indication of where he was in the crowd. From the photos his friends took he was able to give them some bib numbers of other runners near him.
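The arithmetic behind that rough figure, using the counts from this post, looks like this. It treats all 303 first-second runners as failures, which overstates things, since some of them really did start at the gun.

```python
at_gun = 303         # runners recorded as crossing in the first second
finishers = 23_377   # runners left after my cleanup

rate = at_gun / finishers
print(f"{rate:.1%}")  # 1.3%
```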

Next up, assuming I get to it, is some analysis of the run times of those who ran the race and how well the placement of the runners in the queue worked.

2 Comments

  1. Chris Bosken says:

    Just a guess, but is it possible that some of those 303 runners who started at the same time were disabled/partially disabled starters who started before the gun? I think in some of my races disabled participants are sent out before everyone.

    Alternatively, perhaps they were ‘professionals’ who were categorized differently than the rest of the amateurs? That is, because they are professionals they agree to have their time measured from the gun, regardless of their actual start?

    Or the data is just wrong.

  2. Larry says:

    Thanks for the comment Chris.

    Good points. Professionals, perhaps – I could look at the list of top finishers and match them up with the group of 303 and see where they come in.

    As for the disabled, two thoughts. First, those in wheelchairs, those being pushed and the like are identified on a separate and somewhat hidden web page. There were 88 in that group. I deleted them from my analysis. So at least those are not in the group of 303. I do think that some of those are usually sent off early as they tend to be faster than the best runners. I don’t know what the MCM does for that group. There were 51 of the 88 who finished with a quicker time than did the winner of the marathon.

    At least in the unofficial data some of the times were wrong. There were corrections between the unofficial file and the official file. So I expect some errors remain.

    I think I’ll do a bit more looking at the group of 303 and see what their race pace was compared to the winners. I’m guessing that somewhere between 50 and 100 belong in the group.

    Larry…..
