Monday, July 14, 2014

Find love and success with the help of SQL

This blog post took a while to get off the ground. When it comes to data analysis, the first step is finding valid data, and the next is putting it into a usable format. Afterwards, it's all a walk in the park. I chose to play around with SQL (Structured Query Language, an efficient programming language that is used with databases - organized storage places of large amounts of data) using the data set from the 2010 United States Census.

Why? Mostly because I'm sick and tired of seeing articles such as "Best 20 Cities to Live in Your 20s". So, I decided to do my own version with some basic database wizardry.

Step 1: Finding the data
This was fairly simple using a Google search, as the data is readily available from a government website. However, I could not find it in SQL Server, MySQL, or even PostgreSQL format, and had to settle for an Access database (see here). Also unfortunately, I had to download each state's files separately and load it onto the database. This was manual work, but I had help from a blog post I found with some great instructions.

Step 2: Getting the data into the tool you want
I chose SQL Server 2012. The import wizard made this pretty easy.

Step 3: Manipulate the data and extract useful information. See rest of blog post, but it's pretty much summarized in this picture


Step 4: Profit (one hopes)

Alright, let's get to it. Assume you are a bright-eyed, 22-year old male college graduate looking to relocate for your first real job. You have the following requirements on where you want to live:

- High concentration of Hispanic population, because you love nothing more than a good Cuban sandwich
- A high female-to-male ratio in your age group, because dating is important
- You absolutely MUST live in Texas, because everything is bigger there. Preferably, you wish to live in a city (for our purposes, population > 250,000)

Let's go ahead and crunch that into SQL Server:



As you can see, young women are the most plentiful in Forth Worth, but not by that much - merely a 21:20 ratio. You may have better luck trying another state. However, you don't have to go far to find a heavily Hispanic area, as only Plano is under 25% Hispanic of the major Texas cities. Note that we can only order the results by one criterion, and I chose to order by descending Female-to-Male ratio. A more novel approach would be assigning weights to each category (let's say you care about the opposite gender only 3 on a scale of 10 and about the ethnicity of your neighborhood about 6 on a scale of 10) and computing a total score that more accurately reflects your needs. Unfortunately, the US Census either does not ask or does not make readily available other important social markers which would really be of use. Some examples include median household income, job availability, air & water quality, or perhaps even happiness index.

Databases store any sort of useful data, and SQL helps us retrieve it. This can be anywhere from stock market history to advanced sports statistics.

If you ever need to make a complex life decision, crunching the numbers might not seem sexy, but you never know when it could be helpful

No comments:

Post a Comment

Happy New Year and Don't Blow Yourself Up: Exploring National Injury Data with Python

Every year, hundreds of thousands of Americans go to the Emergency Room for various injuries resulting from common consumer products. Do you...