Monday, July 14, 2014

Find love and success with the help of SQL

This blog post took a while to get off the ground. When it comes to data analysis, the first step is finding valid data, and the next is putting it into a usable format. After that, it's all a walk in the park. I chose to play around with SQL (Structured Query Language, a language for querying and managing databases, which are organized stores of large amounts of data) using the data set from the 2010 United States Census.

Why? Mostly because I'm sick and tired of seeing articles such as "Best 20 Cities to Live in Your 20s". So, I decided to do my own version with some basic database wizardry.

Step 1: Finding the data
This was fairly simple using a Google search, as the data is readily available from a government website. However, I could not find it in SQL Server, MySQL, or even PostgreSQL format, and had to settle for an Access database (see here). Unfortunately, I also had to download each state's files separately and load them into the database. This was manual work, but I had help from a blog post I found with some great instructions.

Step 2: Getting the data into the tool you want
I chose SQL Server 2012. The import wizard made this pretty easy.

Step 3: Manipulating the data and extracting useful information
See the rest of this blog post, but it's pretty much summarized in this picture.


Step 4: Profit (one hopes)

Alright, let's get to it. Assume you are a bright-eyed, 22-year-old male college graduate looking to relocate for your first real job. You have the following requirements for where you want to live:

- A high concentration of Hispanic residents, because you love nothing more than a good Cuban sandwich
- A high female-to-male ratio in your age group, because dating is important
- You absolutely MUST live in Texas, because everything is bigger there. Preferably, you wish to live in a city (for our purposes, population > 250,000)

Let's go ahead and crunch that into SQL Server:



As you can see, young women are the most plentiful in Fort Worth, but not by much: the ratio is merely 21:20. You may have better luck trying another state. However, you don't have to go far to find a heavily Hispanic area, as Plano is the only major Texas city that is under 25% Hispanic. Note that a result set has only one primary sort order (any further ORDER BY columns only break ties), and I chose to sort by descending female-to-male ratio.

A more novel approach would be to assign a weight to each category (let's say you care about the opposite gender only 3 on a scale of 10 and about the ethnicity of your neighborhood 6 on a scale of 10) and compute a total score that more accurately reflects your needs. Unfortunately, the US Census either does not ask about or does not make readily available other important social markers that would really be of use. Some examples include median household income, job availability, air & water quality, or perhaps even a happiness index.
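To make the filtering and the weighted-score idea concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server. Everything in it is an assumption for illustration: the table name city_stats, its columns, and especially the rows, which are made-up placeholders and not census figures. The min-max scaling is also just one possible way to put the two criteria on a comparable footing before applying the 3 and 6 weights.

```python
import sqlite3

# Hypothetical schema with placeholder rows; these are NOT real census numbers.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE city_stats (
        city TEXT, state TEXT,
        total_pop INTEGER, hispanic_pop INTEGER,
        females_20_24 INTEGER, males_20_24 INTEGER
    )
""")
conn.executemany(
    "INSERT INTO city_stats VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("City A", "TX", 800000, 320000, 42000, 40000),
        ("City B", "TX", 300000, 60000, 15000, 16000),
        ("City E", "TX", 400000, 240000, 20000, 20000),
        ("City C", "TX", 120000, 70000, 9000, 8000),     # filtered out: too small
        ("City D", "OK", 500000, 100000, 30000, 25000),  # filtered out: not Texas
    ],
)

# Texas cities over 250,000 people, sorted by female-to-male ratio (descending).
# Multiplying by 1.0 forces real division instead of integer division.
query = """
    SELECT city,
           1.0 * females_20_24 / males_20_24 AS f_to_m,
           1.0 * hispanic_pop / total_pop    AS hispanic_share
    FROM city_stats
    WHERE state = 'TX' AND total_pop > 250000
    ORDER BY f_to_m DESC
"""
rows = conn.execute(query).fetchall()


def min_max(values):
    """Rescale values to the 0..1 range so different criteria are comparable."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def weighted_ranking(rows, w_ratio=3, w_hispanic=6):
    """Combine both criteria into one score and sort by it, best first."""
    ratios = min_max([r[1] for r in rows])
    shares = min_max([r[2] for r in rows])
    scored = [
        (r[0], w_ratio * nr + w_hispanic * ns)
        for r, nr, ns in zip(rows, ratios, shares)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = weighted_ranking(rows)
for city, score in ranking:
    print(f"{city}: {score:.2f}")
```

With these placeholder rows, the plain ORDER BY puts City A first, while the weighted score moves City E to the top, because the Hispanic share counts twice as much as the gender ratio. Different weights or a different scaling method would of course give a different ranking.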

Databases store any sort of useful data, and SQL helps us retrieve it. This can be anything from stock market history to advanced sports statistics.

If you ever need to make a complex life decision, crunching the numbers might not seem sexy, but you never know when it could be helpful.

Tuesday, July 1, 2014

An introduction to Linux - what and why?

If you're like me, you may have done some projects on one of Amazon's cloud services, Amazon EC2. Amazon's web services are rapidly growing in popularity, mostly because of the wide availability of cheap hosting and computer workstations they offer. Most of the servers that can be rented run on Linux. So that raises the question - what is Linux? What is Unix? Heck, is it Unix or UNIX? Did someone mention Ubuntu? Okay, let's dive in.

UNIX is an operating system. An operating system, put simply, is software that manages the computer's hardware and how it interacts with other software. This includes scheduling tasks, resource management, and security features. Common examples are Microsoft's Windows 8 OS for computers and Apple's iOS for mobile phones. While UNIX itself is rarely seen on consumer personal computers (Apple's OS X, which is Unix-based, is the main exception), it has many features and applications that have made it widely used in business computing, especially with servers and mainframes. Later operating systems that were modeled on UNIX include, among others, Linux.

Linux is open source, so it is free and works on a wide variety of systems, and users around the world can share and modify code for their own purposes, creating a huge community of developers. Compared to the dominant Windows platforms, some claim Linux is also faster and less prone to viruses and other threats. Different flavors of Linux (called distributions) include Ubuntu, Debian, and Red Hat, to name a few, with Ubuntu among the most popular.

Many times you'll be working in the Linux terminal. Refer to the picture below -
Anyone born after 1990 will see this for the first time and think: "shit, now what?"

Linux does have a GUI you can work out of, but the terminal remains popular. It works like the Windows command line, but with a much more extensive command vocabulary. The program that reads and runs the commands you type is called "the shell", and bash is one of the most common shells. For my fellow struggling young developers, if this is a lot of information to take in all at once, don't worry. I was so confused by this at first that I thought Unix was the command line language for Linux, and that "bash" was something you did to the keyboard when too many errors came up.

I will explore more of this in a future post, including some popular terminal commands. If you have any comments, let me know!

PS - shoutout to a good friend who helped explain some of this to me earlier this week.
