Tuesday, April 28, 2015

Are You Smarter Than a Rhode Islander? Analyzing Jeopardy! Data

Jeopardy! is an American trivia game show that has been running for over 30 seasons now. When it comes to game shows, it is a true battle of the brains: contestants have only a few seconds to answer difficult questions ranging from Greek mythology to Shakespeare to sports and entertainment.

Jeopardy! promo featuring its famous host, Alex Trebek


So what makes a good Jeopardy! player? If you had to choose the perfect contestant in order to win, say, a bet with your friends about the outcome of the next show, who or what would you choose? To replace such thoughts with hard numbers, I dove into historical data on the past 31 seasons of Jeopardy!:

Step 1: Find and extract data
Luckily, a dedicated fan-run website keeps meticulous tabs on each show, including the contestants (name, origin, and occupation), the scores at the end of each round, and even the questions themselves. Copying and pasting this would be a chore even an unpaid intern couldn't finish. To speedily extract the information, I used import.io, an incredibly intelligent tool which can automatically scrape the data the user requests across multiple pages.

Step 2: Cleaning the data
The fans aren't getting paid to do this, so naturally they missed a show here and there, or cut off a contestant's name or hometown. What do you do with empty or null values? What about data from different show variations such as Teen Jeopardy! or Jeopardy! Kids Edition? My point is, data isn't perfect, and we have to make some choices about how to prune the bonsai tree of possible outliers.
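
As a rough illustration of those choices, here is a minimal Python sketch of one possible cleaning pass. It is not the process I actually ran; the field names ("name", "state", "show_type", "final_score") and the sample rows are made up for the example, and dropping incomplete rows is only one of several reasonable policies.

```python
# Hypothetical scraped rows; the field names and values are assumptions.
rows = [
    {"name": "A. Smith", "state": "UT", "show_type": "regular", "final_score": "12000"},
    {"name": "", "state": "RI", "show_type": "regular", "final_score": "8000"},
    {"name": "B. Jones", "state": None, "show_type": "regular", "final_score": "9500"},
    {"name": "C. Lee", "state": "CA", "show_type": "teen", "final_score": "15000"},
    {"name": "D. Park", "state": "NY", "show_type": "regular", "final_score": "n/a"},
]

REQUIRED_FIELDS = ("name", "state", "final_score")


def clean(rows, keep_show_types=("regular",)):
    """Drop rows with missing fields, special editions, or unparseable scores."""
    cleaned = []
    for row in rows:
        # Skip rows where a required field is empty or null.
        if any(not row.get(field) for field in REQUIRED_FIELDS):
            continue
        # Skip show variations such as teen or kids editions.
        if row["show_type"] not in keep_show_types:
            continue
        # Skip rows whose score cannot be read as an integer.
        try:
            score = int(row["final_score"])
        except ValueError:
            continue
        cleaned.append({**row, "final_score": score})
    return cleaned


cleaned_rows = clean(rows)
print(len(cleaned_rows), "of", len(rows), "rows kept")
```

Whether to drop a row or fill in the gap depends on the question being asked: a missing hometown matters for a state-by-state comparison but not for an overall average.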

Step 3: Asking questions
There are numerous potential variables which may determine a good contestant, but our choices are limited to the data collected. Age, sex, and income, for example, would be interesting to consider but are not currently tracked. I chose to separate Jeopardy! contestants by two factors: state of residence and occupation.

Step 4: Presenting data visually
So, does a state determine whether or not you'll be a winner? Let's take a look at these graphics, which I created using Tableau:


Figure 1: Average Jeopardy! score by contestant's state

It's quite easy to see Utah emerge as the top performer among the show's contestants. That may not be a fair comparison, considering that a huge majority of those data points came from one person: Ken Jennings, a software engineer from Salt Lake City who strung together 74 consecutive wins. Still, the map gives us color-coded data on the average score from each state. Note that the overall average winnings for a contestant on Jeopardy! are about $9,500.

Let's explore another two charts:

Figure 2: Winning ratio for contestants from a given state


In this graphic, deep red indicates a low chance of winning. With three contestants per game, the theoretical chance for each contestant is 1/3, or 0.333, and you'll see only slight variation among most states. However, Utah again comes out ahead with a 78% chance to win(!), while a contestant hailing from Rhode Island has only a 15% chance to win, less than half the theoretical probability. Is something in the water over there?
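
For readers who want to reproduce this kind of ratio outside of Tableau, here is a minimal Python sketch of the calculation: wins divided by appearances, per state. The (state, won) pairs are invented sample data, not the real dataset, and the function name is my own.

```python
from collections import defaultdict

# Hypothetical appearances as (state, won) pairs; not real show data.
appearances = [
    ("UT", True), ("UT", True), ("UT", False),
    ("RI", False), ("RI", False), ("RI", True),
    ("CA", True), ("CA", False), ("CA", False),
]


def win_ratio_by_state(appearances):
    """Return {state: wins / appearances} for every state seen."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for state, won in appearances:
        games[state] += 1
        wins[state] += int(won)
    return {state: wins[state] / games[state] for state in games}


ratios = win_ratio_by_state(appearances)
for state, ratio in sorted(ratios.items()):
    print(f"{state}: {ratio:.3f}")
```

Note that this counts every appearance, so a returning champion contributes one data point per game played.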

For completeness, we'll show a heat map of where Jeopardy! contestants come from. Counting repeat appearances, there have been over 12,000 guests on the show.

Figure 3: Geographical frequency distribution of contestants

Unsurprisingly, there are large clusters of contestants from major US metro areas such as Los Angeles, New York, and Washington DC. It should be noted that the numbers of contestants from Utah and Rhode Island are relatively small (100 and 47, respectively, compared to over 2,000 from California and 1,200 from New York), but still large enough to draw conclusions, depending on the degree of confidence we want. Interestingly, the Law of Large Numbers seems to hold here: states with larger contestant populations approach the theoretical mark mentioned before of a 0.333 chance of winning.
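
To put a number on "degree of confidence", one option is a 95% Wilson score interval around each observed win rate. The sketch below is an add-on, not part of the original analysis. It plugs in the figures quoted above (15% of 47 appearances for Rhode Island, 78% of 100 for Utah) and assumes every appearance is an independent trial, which is clearly not true for Utah, where one contestant accounts for most of the games.

```python
import math


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a proportion p_hat observed over n trials.

    z=1.96 corresponds to roughly 95% confidence.
    """
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Win rates and appearance counts as quoted in the post.
ri_low, ri_high = wilson_interval(0.15, 47)
ut_low, ut_high = wilson_interval(0.78, 100)

print(f"Rhode Island: {ri_low:.3f} to {ri_high:.3f}")
print(f"Utah:         {ut_low:.3f} to {ut_high:.3f}")
print(f"Theoretical:  {1 / 3:.3f}")
```

If an interval excludes 0.333, the gap from the theoretical rate is unlikely to be pure sampling noise under the independence assumption; with repeat champions in the data, that assumption deserves some skepticism.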


Lastly, I've identified a few major categories that cover the occupations listed for Jeopardy! contestants over the years. The average score for each, as well as the count of appearances, is shown below:

Figure 4: Average contestant winnings by occupation group. The number of contestants is labeled to the right of the bar.
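
Grouping free-text occupations into categories can be done in many ways. The Python sketch below shows one crude keyword-matching approach; the group names and keywords are hypothetical and are not the exact rules behind Figure 4.

```python
# Hypothetical keyword-to-group mapping; the real grouping rules may differ.
GROUP_KEYWORDS = {
    "Student": ("student",),
    "Engineer": ("engineer",),
    "Manager": ("manager",),
    "Stay-at-home parent": ("mom", "homemaker"),
}


def occupation_group(occupation):
    """Return the first group whose keyword appears in the occupation text."""
    text = occupation.lower()
    for group, keywords in GROUP_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return group
    return "Other"


for job in ["Software Engineer", "graduate student", "stay-at-home mom", "attorney"]:
    print(job, "->", occupation_group(job))
```

Substring matching is blunt (an occupation matching two groups lands in whichever is checked first), so a real pass would need manual review of the leftovers.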


Students and engineers at the top, managers and moms at the bottom. Is this what you expected? What other metrics do you think would be good for comparing Jeopardy! performance?

In any case, I hope you've enjoyed reading, and I really, really hope you're smarter than a Rhode Islander.

--
Tech notes: for those interested, I have the data files available for sharing in both .mdf (SQL Server) and .xlsx (Excel) formats.
