Everyday Data with Yoav

Wednesday, June 7, 2017

It's Always Sunny in Miami, and Not So Much in Boston

My old roommate and I were involved in a vicious feud. You see, he liked to open the windows when it was warm out; however, he always neglected to go back and close them when it inevitably got cold again. This resulted in me shivering for a few hours before I realized that, while it was 75 degrees only a day ago, it was now 55 degrees and I should really put on some pants.

The essence of the problem can be described in visual form:

This is a common problem for anyone who's lived in New England, where any Uber trip between April and June is likely to involve complaints about the weather.

Just how common is this problem? I set out to find just how fickle the daily temperature is among the top 25 US cities. In choosing these cities, I used a combination of metrics: population, GDP, and net migration. The analysis was done using the statistical software R, with weather records from Weather Underground. (For a more detailed, technical view of how I worked with the data, click here)

I first set out to find the variance of the weather for these cities as a measure of volatility. I used daily high temperature data for all 366 days of 2016 (a leap year). I ranked the cities by their temperature variance, in descending order, below:

But what does this really mean? Variance is defined as the average of the squared deviations from the mean. In other words, a high variance in this context would mean that a city has many days in which the temperature is much different than the average temperature for the entire year. This could mean many days of ping-pong hot and cold weather, or simply a very pronounced changing of the seasons. Let's look at a temperature graph of a few selected cities. I've ordered them from high to low variance:

We see that cities with high variance, such as Minneapolis and Denver, have temperatures that reach a wide range throughout the year, and while there's a definite pattern (cold winter, warm summer), there are still sharp week-to-week and sometimes day-to-day changes - see Boston's dip to a high of about 60 degrees in July and then a jump to the 90s a week after. On the flip side, cities with extremely low variance such as Miami and San Francisco have much flatter, lower-amplitude graphs. Amazingly, Miami's daily high temperature hovers between 72 and 92 degrees Fahrenheit a full 90% of the year!
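For the curious, here is a minimal Python sketch of the variance calculation described above. It is not the R code behind the charts: the function name and both temperature lists are invented for illustration, and it divides by n (population variance), whereas R's built-in `var()` divides by n - 1.

```python
from statistics import mean, pvariance

def temperature_variance(daily_highs):
    """Population variance: the average squared deviation from the mean."""
    mu = mean(daily_highs)
    return sum((t - mu) ** 2 for t in daily_highs) / len(daily_highs)

# Made-up daily highs in degrees F, for illustration only
# (these are NOT Weather Underground records).
flat_city = [80, 82, 81, 79, 83, 80, 82]
swingy_city = [40, 75, 55, 90, 30, 65, 85]

print(temperature_variance(flat_city))
print(temperature_variance(swingy_city))

# The standard library's pvariance should agree with the hand-rolled version.
print(pvariance(flat_city))
```

A flat, Miami-style series gives a small number and a ping-pong series gives a large one, which is all the ranking relies on.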

This doesn't fully answer my original question - in which cities are you most likely to experience the frustration of making a completely new daily evaluation of your window openings, how many layers to wear, and whether or not to finally put away your winter clothes for good? To answer this, I've calculated the temperature swings of each day over the previous day. I've chosen a difference of +/- 10 degrees as an arbitrary metric for "pain-in-the-butt-weather". Let's see the results:

So, almost every third day in Boston seems to be a jump in temperature. However, in Miami, you can paint a pretty accurate weather forecast of tomorrow if you've been outside today. I guess now we know why Pitbull is so happy.
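The day-over-day swing count can be sketched in a few lines of Python. This is an illustration under assumptions, not the original R analysis: the function name and sample temperatures are made up, and I am assuming a swing "counts" when the change is at least the threshold (the post does not say whether exactly 10 degrees qualifies).

```python
def count_swing_days(daily_highs, threshold=10):
    """Count days whose high differs from the previous day's high
    by at least `threshold` degrees, in either direction."""
    return sum(
        1
        for yesterday, today in zip(daily_highs, daily_highs[1:])
        if abs(today - yesterday) >= threshold
    )

# Made-up week of daily highs in degrees F, for illustration only.
sample_highs = [75, 55, 58, 70, 69, 81, 80]

print(count_swing_days(sample_highs))                # 10+ degree swings
print(count_swing_days(sample_highs, threshold=20))  # 20+ degree swings
```

Running the same function with `threshold=20` gives the stricter count used in the PS below.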

PS - I also calculated the number of days in a year with 20+ degree temperature swings. Denver leads with 21, but almost all cities have 10 or fewer of these days (Miami and San Francisco have 0). So, if you wake up one morning and it's 20 degrees colder or hotter than the day before, do note that this is pretty rare, and therefore definitely merits complaining (I'll be here to listen!)

Tuesday, April 28, 2015

Are You Smarter Than a Rhode Islander? Analyzing Jeopardy! Data

Jeopardy! is an American trivia game show that has been running for over 30 seasons now. When it comes to game shows, it is a true battle of the brains: contestants have a few seconds to answer difficult questions spanning from Greek mythology to Shakespeare to sports and entertainment.

Jeopardy! promo containing its infamous host, Alex Trebek

So what makes a good Jeopardy! player? If you had to choose the perfect contestant in order to win, say, a bet with your friends about the outcome of the next show, who or what would you choose? To replace such thoughts with hard numbers, I dove into historical data on the past 31 seasons of Jeopardy!:

Step 1: Find and extract data
Luckily, a dedicated fan-run website keeps meticulous tabs on each show, including the contestants (name, origin, and occupation), the scores at the end of each round, and even the questions themselves. Copying and pasting this would be a chore even an unpaid intern couldn't finish. To speedily extract the information, I used import.io, an incredibly intelligent tool which can automatically scrape the data the user requests across multiple pages.

Step 2: Cleaning the data
The fans aren't getting paid to do this, so naturally they missed a show here and there, or cut off a contestant's name or hometown. What do you do with empty or null values? What about data from different show variations such as Teen Jeopardy! or Jeopardy! Kids Edition? My point is, data isn't perfect, and we have to make some choices about how to prune the bonsai tree of possible outliers.

Step 3: Asking questions
There are numerous potential variables which may determine a good contestant, but our choices are limited to the data collected. Interesting variables to consider which are currently untracked may be age, sex, or income, for example. I chose to separate Jeopardy! contestants by two factors: state of residence and occupation.

Step 4: Presenting data visually
So, does a state determine whether or not you'll be a winner? Let's take a look at these graphics, which I created using Tableau:

Figure 1: Average Jeopardy! score by contestant's state

It's quite easy to see Utah emerge as the top performer among the show's contestants. That may not be a fair comparison, considering that a huge majority of those data points came from one person - Ken Jennings, a software engineer from Salt Lake City who strung together 74 consecutive wins. However, we can see color-coded data on the average score from each state. Note that the overall average winnings for a contestant on Jeopardy! are about $9,500.

Let's explore another two charts:


Figure 2: Winning ratio for contestants from a given state

In this graphic, deep red color indicates a low chance of winning. The hypothetical chance for each contestant is 1/3, or 0.333, and you'll see only slight variation among most states. However, Utah again comes out ahead with a 78% chance to win(!), while a contestant hailing from Rhode Island has only a 15% chance to win, less than half the theoretical probability. Is something in the water over there?

For completeness, we'll show a heat map of where Jeopardy! contestants come from. Counting repeat appearances, there have been over 12,000 guests on the show.

Figure 3: Geographical frequency distribution of contestants

Unsurprisingly, there are large clusters of contestants from major US metro areas such as Los Angeles, New York, and Washington DC. It should be noted that the numbers of contestants from Utah and Rhode Island are relatively small (100 and 47, respectively, compared to over 2000 in California and 1200 in New York), but still relevant enough to draw conclusions, depending on the degree of confidence we want. Interestingly, the Law of Large Numbers seems to hold here: states with larger contestant populations approach the theoretical mark mentioned before of a 0.333 chance of winning.
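To put a rough number on "depending on the degree of confidence we want", here is a small Python sketch of how much a state's win rate would be expected to wobble around 0.333 by chance alone. It uses the contestant counts quoted above and the usual standard error of a proportion. The big assumption (mine, and not really true) is that every appearance is an independent 1-in-3 shot; returning champions like Ken Jennings break that independence, so treat this as a ballpark only.

```python
from math import sqrt

def win_rate_standard_error(n_contestants, p=1 / 3):
    """Standard error of an observed win rate if each of n appearances
    were an independent win with probability p (a simplifying assumption)."""
    return sqrt(p * (1 - p) / n_contestants)

# Contestant counts as quoted in the post (California and New York are approximate).
state_counts = [
    ("Rhode Island", 47),
    ("Utah", 100),
    ("New York", 1200),
    ("California", 2000),
]

for state, n in state_counts:
    se = win_rate_standard_error(n)
    print(f"{state}: n={n}, chance-level wobble is roughly +/- {2 * se:.3f}")
```

The wobble shrinks with the square root of the contestant count, which is the Law of Large Numbers at work: big states hug 0.333 while small states can stray much further.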

Lastly, I've identified a few major categories which have comprised the occupations listed for Jeopardy! contestants over the years. The average score for each as well as the count of appearances is listed below:


Figure 4: Average contestant winnings by occupation group. The number of contestants is labeled to the right of the bar.

Students and engineers at the top, managers and moms at the bottom. Is this what you expected? What other metrics do you think would be good for comparing Jeopardy! performance?

In any case, I hope you've enjoyed reading, and I really, really hope you're smarter than a Rhode Islander.

--
Tech notes: for those interested, I have the data files available for sharing in both .mdf (SQL Server) and .xlsx (Excel)

Tuesday, September 2, 2014

Brute-force: introduction to hacking

In late August of 2014, a large set of celebrity information was hacked, with the most "newsworthy" material being nude or explicit photos. Per this article, the security hole existed in Apple's iCloud (specifically, the Find My iPhone feature) which allowed potential hackers to use "brute-force" attacks to gain entry to user accounts.

So, what is brute-force? Stated simply, if you are trying to open a numerical combination lock with 4 digits (0-9 making up 10 possibilities for each digit) and you don't know the code, you can try every combination until it opens: 1111, 2918, 3345, etc.
Since each of the 4 positions can independently be any of 10 digits, the number of possibilities is 10*10*10*10 = 10^4 = 10,000.
Meaning that given enough time and finger strength, you WILL break the code in 10,000 tries or fewer (about 5,000 on average).
Brute-force hacking is the simplest form of hacking there is, and usually takes the longest. Other methodologies may or may not be detailed in the future.
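As a toy illustration of the idea (a sketch against a pretend lock, not a real attack tool), here is how trying every 4-digit combination in order might look in Python. The function name and the `is_correct` callback are made up for this example; the "secret" is simply one of the sample codes above.

```python
from itertools import product

def brute_force_lock(is_correct, digits=4):
    """Try every combination from 0000 upward until `is_correct` accepts one.
    Returns the winning guess and how many attempts it took."""
    for attempt_number, combo in enumerate(product("0123456789", repeat=digits), start=1):
        guess = "".join(combo)
        if is_correct(guess):
            return guess, attempt_number
    return None, 10 ** digits  # every combination tried, none worked

secret = "2918"  # stand-in for the unknown code
guess, tries = brute_force_lock(lambda candidate: candidate == secret)
print(guess, tries)
```

Because the loop walks through all 10^4 combinations, it can never need more than 10,000 attempts.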

10,000 tries is quite a lot - which is why bike thieves usually use a hammer instead

If this code were a digital password, one could use a computer program or internet script to automatically input the 10,000 different combinations to gain access to the protected content. A computer, being much faster than the average typing human, could knock this task out in a few hours (a maximum of 10,000 seconds, or about 2 hrs and 45 mins) even if we assume a leisurely 1 second per try. However, per Wikipedia, good "cracking" programs can submit attempted passwords at a rate of 100+ million per second.

Consider most websites which require you to have a password of a minimum of 8 characters, using lowercase (26), uppercase (26), digits (10) and special characters such as % ^ & @ * etc. (let's say 15 - it can vary per website). Note that this is assuming the English/Latin language alphabet base. The number of password combinations for a password of exactly 8 characters is thus:

(26+26+10+15)^8 = 77^8 = 1.2 x 10^15 combinations. Dividing by 100 million (1 x 10^8) guesses per second gives 1.2 x 10^7 seconds to try every combination, or about 143 days. This number further increases if you have the option of using 9, 10, 11 etc. characters. Likewise, if you limit yourself to only 8 lowercase letters and no digits or special symbols, your password will take 35 minutes to crack, given that the program attempts only lowercase letters first. This underscores the need for a "strong password".
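If you want to check the arithmetic or plug in your own password rules, here is a short Python sketch. The function name is my own; the guess rate and the character counts are the ones quoted above, and the result is the worst case (every combination tried).

```python
GUESSES_PER_SECOND = 100_000_000  # rate quoted in the text

def seconds_to_exhaust(alphabet_size, length, rate=GUESSES_PER_SECOND):
    """Worst-case time to try every password of exactly `length` characters."""
    return alphabet_size ** length / rate

full_charset = seconds_to_exhaust(26 + 26 + 10 + 15, 8)  # lower + upper + digits + specials
lowercase_only = seconds_to_exhaust(26, 8)

print(f"Full character set: {full_charset / 86_400:.0f} days")
print(f"Lowercase only:     {lowercase_only / 60:.0f} minutes")
```

Try changing the length from 8 to 10 to see how quickly the numbers blow up.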

So does this mean every password can be hacked given enough time? Well, yes. But, like your normal phone screen lock, trying too many wrong passwords results in the user being locked out from trying again - an important security feature. Unfortunately, this feature was neglected in just ONE Apple application which required a sign-in. So, given a celebrity's AppleID (usernames and email addresses are not exactly private most of the time), the hackers went to work.

So, what have we learned here?
1. Buy Android
2. Use strong passwords
3. Read my blog

(See - Apple's rebuttal)

Sunday, August 10, 2014

Blood, sweat, tears, and cryptography

AES (Advanced Encryption Standard) is quite a bit more powerful than the methods described in my previous blog post. It is widely used among private companies and governments to encrypt text, passwords, and file contents. There are several other encryption standards, which I won't delve into. This website provides a free tool to code and decode messages using AES, and so do many others.

On a personal note, those close to me will know that the past two years have brought some of the biggest challenges in my life (don't mean to be dramatic, I swear!), and also some fun, memorable moments. To all who have been there in some way or another - I would like to thank you in the form of an encrypted message. All you have to do is use the link with your initials to find my individualized salute to you. Oh, and you'll each have to ask me for your individualized link & password.

JH
FP
PN
TS
JV
LH
NT
JG
ISO
VT

Cryptography, part one

In an alternate universe, after the British used careful analysis of social networks (great short article, do read) to narrow in on Paul Revere, the colonial hero did not have much time to deliver his famous message. Knowing he might be caught at any moment, he decided to encrypt his message. That is, to turn words into code.

Paul Revere - silversmith, patriot, and amateur code boy

His first attempt was rather weak - turning all letters to corresponding numbers.

A	B	C	D	E	F	G	H	I	J	K	L	M	N	O	P	Q	R	S	T	U	V	W	X	Y	Z	*
1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27

Figure 1

His message now looked like this:

T	H	E	*	B	R	I	T	I	S	H	*	A	R	E	*	C	O	M	I	N	G
20	8	5	27	2	18	9	20	9	19	8	27	1	18	5	27	3	15	13	9	14	7

Figure 2
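Revere's first scheme is simple enough to sketch in a few lines of Python. The function names are my own; the mapping is the one from Figure 1, with `*` (number 27) standing in for a space.

```python
# A=1 ... Z=26, and '*' (the space marker) = 27, as in Figure 1.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ*"

def to_numbers(message):
    """Replace each letter (or space) with its position in ALPHABET."""
    return [ALPHABET.index(ch) + 1 for ch in message.upper().replace(" ", "*")]

def to_letters(numbers):
    """Reverse the substitution."""
    return "".join(ALPHABET[n - 1] for n in numbers).replace("*", " ")

codes = to_numbers("THE BRITISH ARE COMING")
print(codes)
print(to_letters(codes))
```

Since every letter always maps to the same number, anyone who guesses the scheme can run it backwards just as easily.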

This "code" could easily be cracked by anyone who intercepted the numbers. So he tried again, with a slightly more complex replacement - he split the alphabet into 3, and gave each letter a number and a symbol, going backwards:

Figure 3

The message could now be written and passed on more discreetly, but it was still not secure. If the police caught Revere and intercepted his letter, it would be only a short matter of time until they figured out the one-to-one correspondence between letter and code. How could Revere ensure a more random coding and translation? Since he did not have access to modern computing, the answer was matrices.

Matrices can be used as a mathematical basis for cryptology. Using a numerical message, such as in Figure 2, we can use matrix multiplication to "jumble up" numbers to a substantial degree of randomness, making coded messages more difficult to decipher. Besides the message itself, the process requires an encoding matrix, which must be square in shape and invertible (if you are rusty on matrix algebra, don't worry about this part). The larger the square encoding matrix, the more secure the encryption. We will use the 3x3 matrix below, with the message matrix split into columns of 3 for multiplication purposes.

7 2 1

0 3 -1

-3 4 -2

Multiply the encoding matrix above by the message matrix below. The message matrix is the original numerical message, [20,8,5,27,2,18...], written down the columns 3 numbers at a time; the last column is padded with 27s (spaces) to fill it out.

20 27 9 19 1 27 13 7

8 2 20 8 18 3 9 27

5 18 9 27 5 15 14 27

The following matrix results:

161 211 112 176 48 210 123 130

19 -12 51 -3 49 -6 13 54

-38 -109 35 -79 59 -99 -31 33

Paul Revere can now write these numbers down on paper, and the code won't be as obvious. For one thing, we are not using 1-27 anymore, and there is no one-to-one correspondence for letters. However, to solve the code, Revere's compatriots will have the key, or the original 3x3 encoding matrix, and multiply its inverse by Revere's transformed new matrix. I won't bother showing these steps - the result, as we've said before, will alert them just in time:

T	H	E	*	B	R	I	T	I	S	H	*	A	R	E	*	C	O	M	I	N	G
20	8	5	27	2	18	9	20	9	19	8	27	1	18	5	27	3	15	13	9	14	7
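For anyone who does want to see the skipped steps, here is a minimal Python sketch of the round trip using plain lists (no libraries). The helper names are my own. The decoding matrix is the inverse of the 3x3 encoding matrix, which I worked out by hand: the encoding matrix happens to have determinant 1, so its inverse contains only whole numbers.

```python
ENCODING_MATRIX = [[7, 2, 1], [0, 3, -1], [-3, 4, -2]]
# Inverse of the matrix above, computed by hand (determinant is 1).
DECODING_MATRIX = [[-2, 8, -5], [3, -11, 7], [9, -34, 21]]
SPACE = 27  # the '*' symbol, also used as padding

def matmul(a, b):
    """Plain-list matrix product of a (m x n) and b (n x p)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def to_columns(numbers, height=3, pad=SPACE):
    """Write the message down the columns, padding the last column."""
    padded = numbers + [pad] * (-len(numbers) % height)
    return [padded[row::height] for row in range(height)]

def from_columns(matrix):
    """Read a matrix back column by column into a flat list."""
    return [matrix[r][c] for c in range(len(matrix[0])) for r in range(len(matrix))]

message = [20, 8, 5, 27, 2, 18, 9, 20, 9, 19, 8, 27, 1, 18, 5, 27, 3, 15, 13, 9, 14, 7]

encoded = matmul(ENCODING_MATRIX, to_columns(message))
decoded = from_columns(matmul(DECODING_MATRIX, encoded))

print(encoded)
print(decoded[:len(message)])  # drop the padding to recover the original numbers
```

Multiplying `ENCODING_MATRIX` by `DECODING_MATRIX` gives the identity matrix, which is exactly why the second multiplication undoes the first.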

Monday, July 14, 2014

Find love and success with the help of SQL

This blog post took a while to get off the ground. When it comes to data analysis, the first step is finding valid data, and the next is putting it into a usable format. Afterwards, it's all a walk in the park. I chose to play around with SQL (Structured Query Language, an efficient programming language that is used with databases - organized storage places of large amounts of data) using the data set from the 2010 United States Census.

Why? Mostly because I'm sick and tired of seeing articles such as "Best 20 Cities to Live in Your 20s". So, I decided to do my own version with some basic database wizardry.

Step 1: Finding the data
This was fairly simple using a Google search, as the data is readily available from a government website. However, I could not find it in SQL Server, MySQL, or even PostgreSQL format, and had to settle for an Access database (see here). Also unfortunately, I had to download each state's files separately and load them into the database. This was manual work, but I had help from a blog post I found with some great instructions.

Step 2: Getting the data into the tool you want
I chose SQL Server 2012. The import wizard made this pretty easy.

Step 3: Manipulate the data and extract useful information. See rest of blog post, but it's pretty much summarized in this picture

Step 4: Profit (one hopes)

Alright, let's get to it. Assume you are a bright-eyed, 22-year-old male college graduate looking to relocate for your first real job. You have the following requirements on where you want to live:

- High concentration of Hispanic population, because you love nothing more than a good Cuban sandwich
- A high female-to-male ratio in your age group, because dating is important
- You absolutely MUST live in Texas, because everything is bigger there. Preferably, you wish to live in a city (for our purposes, population > 250,000)

Let's go ahead and crunch that into SQL Server:

As you can see, young women are the most plentiful in Fort Worth, but not by that much - merely a 21:20 ratio. You may have better luck trying another state. However, you don't have to go far to find a heavily Hispanic area, as only Plano is under 25% Hispanic among the major Texas cities. Note that I ordered the results by a single criterion, descending Female-to-Male ratio. A more novel approach would be assigning weights to each category (let's say you care about the opposite gender only 3 on a scale of 10 and about the ethnicity of your neighborhood about 6 on a scale of 10) and computing a total score that more accurately reflects your needs. Unfortunately, the US Census either does not ask or does not make readily available other important social markers which would really be of use. Some examples include median household income, job availability, air & water quality, or perhaps even happiness index.
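For readers who want to try something similar without SQL Server, here is a rough sketch using Python's built-in sqlite3 module. To be clear about what is invented: the table name, column names, the 20-24 age bracket, the placeholder cities, and every number in it are made up for illustration and are NOT census figures or the schema of the actual Census database. The query mirrors the three requirements, and the weighted score at the end is one possible take on the 6-and-3 weighting idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE city_stats (
        city TEXT, state TEXT, population INTEGER,
        hispanic_pct REAL, females_20_24 INTEGER, males_20_24 INTEGER
    )
""")
# Placeholder rows, invented for this example (not real census data).
conn.executemany("INSERT INTO city_stats VALUES (?, ?, ?, ?, ?, ?)", [
    ("City A", "TX", 900_000, 40.0, 42_000, 40_000),
    ("City B", "TX", 300_000, 20.0, 15_000, 15_500),
    ("City C", "TX", 120_000, 55.0, 6_000, 5_000),   # too small, filtered out
    ("City D", "CA", 800_000, 30.0, 30_000, 29_000), # wrong state, filtered out
])

QUERY = """
    SELECT city, hispanic_pct,
           1.0 * females_20_24 / males_20_24 AS female_to_male
    FROM city_stats
    WHERE state = 'TX' AND population > 250000
    ORDER BY female_to_male DESC
"""
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)

def weighted_score(hispanic_pct, female_to_male, w_ethnicity=6, w_dating=3):
    """Combine both criteria into one number using the example weights."""
    return w_ethnicity * hispanic_pct / 100 + w_dating * female_to_male

ranked = sorted(rows, key=lambda r: weighted_score(r[1], r[2]), reverse=True)
print([city for city, _, _ in ranked])
```

The two criteria sit on different scales (a percentage and a ratio), so a more careful version would normalize them before weighting.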

Databases store any sort of useful data, and SQL helps us retrieve it. This can be anywhere from stock market history to advanced sports statistics.

If you ever need to make a complex life decision, crunching the numbers might not seem sexy, but you never know when it could be helpful.

Tuesday, July 1, 2014

An introduction to Linux - what and why?

If you're like me, you may have done some projects on Amazon's cloud service, Amazon EC2. Amazon's web services are increasing rapidly in popularity, mostly because of the large availability of cheap hosting and computer workstations they offer. Most of the servers that can be rented run on Linux. So that raises the question - what is Linux? What is Unix? Heck, is it Unix or UNIX? Did someone mention Ubuntu? Okay, let's dive in.

UNIX is an operating system. An operating system, put simply, is software that manages the computer's hardware and how it interacts with other software. This includes scheduling tasks, resource management, and security features. Common examples are Microsoft's Windows 8 OS for computers and Apple's iOS for mobile phones. While UNIX is almost non-existent in the consumer realm for personal computers, it has many features and applications that have made it widely used in business computing, especially with servers and mainframes. Later operating systems that were modeled on UNIX include, among others, Linux.

Linux is open source, so it is free and works on a wide variety of systems, and users around the world can share and modify code for their own purposes, creating a huge community of developers. Compared to the dominant Windows platforms, some claim Linux also has superior performance speed and less proliferation of viruses and other threats. Different flavors of Linux include Ubuntu, Debian, and Red Hat, to name a few... with Ubuntu among the most popular.

Many times you'll be working in the Linux terminal. Refer to the picture below -

Anyone born after 1990 will see this for the first time and think: "shit, now what?"

Linux does have a GUI you can work in, but the terminal remains popular. It works like the Windows command line, but with a much more extensive command vocabulary. This is also called "the shell". For my fellow struggling young developers, if this is a lot of information to take in all at once, don't worry. I was so confused by this at first that I thought Unix was the command line language for Linux, and that "bash" was something you did to the keyboard when too many errors came up.

I will explore more of this in a future post, including some popular terminal commands. If you have any comments let me know!

PS - shoutout to a good friend who helped explain some of this all to me earlier this week