Thursday, September 7, 2017

Developing an Analytical Database of the Boston Marathon with Power BI


As I’ve mentioned in my recent blog posts, I try to combine my passion for data and insights, with my interests in other areas. Over the past half-dozen years, I have become an avid distance runner. At first, I took up running to lose weight and stay in somewhat decent shape. However, I soon learned to enjoy the fresh air, scenery, and escape from a hectic life style (and the near constant interruption of electronic devices). I find that distance running is a great stress reliever and helps provide a sense of balance in my life.

I have been fortunate enough to have completed two marathons and nearly 20 half-marathons. However, I am not yet fast enough to qualify for the most prestigious distance running event of all time, the Boston Marathon, the oldest of the organized marathons, with competitions every April on Patriots’ Day, since 1897.

I have completed two Boston Half Marathons and have found the Boston Athletic Association (BAA), which coordinates both Boston Half Marathon and Marathon, to be an amazingly dedicated organization. I therefore assumed that the BAA must have a comprehensive and searchable database of all Boston Marathon finishers since 1897. I was wrong. The BAA does have a searchable database for finishers from 2001 through 2016. However, the only searchable source provided for finishers prior to 2001 is a PDF file which happens to have 24,808 pages. As anyone trying to search a PDF has found, it can be very tedious and not suitable for analysis.

So, I wondered… could I create a complete searchable data base using Power BI and if so how long would it take? The answer definitely yes, and it took about 16 hours in total. The resulting report is shown below, along with a summary of how I developed the database and Power BI report.

Sourcing Data and Creating Calculations

The primary data source for the database of finishers was the PDF file consisting of the 24,808 pages provided by the BAA on its website. This file has a list of finishers from 1897 through 2016. Since Power BI does not have the ability to read PDF files directly, I used my Adobe subscription to save the file to Excel. This was helpful, but Adobe was not able to produce a delimited data set, so the Excel file had one row for each finisher for 1897 through 2016 with all attributes combined into one column. Fortunately, the column did include a space between each attribute of the PDF, so I was able to use the “Split Column (by space delimiter)” feature of Power Query to create separate columns for Finisher Name, Gender, Year, Country, and Finishing Time. A separate file for 2017 finishers is provided by the BAA, and I used the same technique to convert this file in Power Query and then appended the 2017 finishers table to the 1897-2016 finishers table.

Now with the data set complete, I needed to form a few calculations using the Power BI DAX language to provide finisher counts, fastest time, and average finisher time.

Creating the Report

I used a variety of out-of-the-box and custom visuals to create the report. I used slicers for Year and Gender, and a searchable slicer for Name so that I could search on any finisher. I used cards to display the Fastest Time and Average Finisher Time, and a table to provide a detailed list of results for the selected finisher.

I leveraged tables to display top finishers, the winner, and country ranking. Finally, I used a custom visual, the Enlighten World Flags slicer, to provide a visually engaging way to filter by country. As far as I can determine, this is the only database that provides a Boston Marathon historical ranking by country.

Caveats, Insights and Summary

Obviously, a report is only as good as the accuracy of the data. In this case, I would say that the data is approximately 95% accurate. I did notice that a few of the rows came over to Excel missing key data attributes, and it turns out that there was some data missing (data appears to be incomplete for 2005, 2008, and 2009) in the source PDF file. Furthermore, it appears that in the early years of the Boston Marathon, if a participant did not finish in under 4 hours, their time was not recorded (which seems quite harsh).

So, what insights did I glean from analyzing the data?

  • Despite the fact that women were not allowed to officially participate in the Boston Marathon until 1972, more than 170,000 have finished with an average finishing time of 4:02:50.
  • More than 35% of the all winners are from the USA. However, the USA has had only one winner (Meb Keflezighi in 2014) since 1994. The last USA female winner was Lisa Larsen-Weidenbach in 1985. The fastest American in history was Yannick Benjamin who ran 2:04:24 in 2011, finishing third. Amazingly, this was 29 minutes faster than the time he recorded in 2010. It is hard to comprehend someone reducing their time by that margin in one year at such an elite level. It must have taken its toll, as Yannick never finished another Boston Marathon.
  • 2011 was by far the fastest race as six finishers finished under 2:05. I can’t imagine running a sub 2:05 marathon and not winning!
  • Despite an excellent distance running legacy, the United Kingdom has only one champion, Ron Hill, who won in 1970.
  • Since 2008, Kenyans and Ethiopians have captured all titles but one.
  • More than 16,000 runners had official finishing times in 2013, the year of the Boston Marathon bombing.

In summary, I think the most basic learning for me was the ability to take a PDF file and actually use the magic of Power BI to shape it into a usable database, suitable for analytics and reporting.

Here is a link to the report. Feel free to share it with your friends and colleagues, especially if they have finished the Boston Marathon! Maybe someday, my name will be in this report!

Let us help you drive data and find insights. Learn more about our Business Intelligence & Analytics capabilities.

Neal Levin is the Vice President of Business Intelligence & Analytics at Rightpoint. Follow Neal on Twitter and LinkedIn.