Open Data is a fairly new, and even foreign, concept in many establishments. Organizations may have an excel document/spreadsheet/file that they haven't found necessary to change since 2005. This works for them and there's nothing wrong with that but once you have external stakeholders who are expected to use this data it's absolutely crucial to ensure that this data is easy to work with.
At Develop Digitally, Open Data is literally our business. We know the value of Open Data and thus do our best to ensure that all of the data that we share adheres to Open Data standards.
Sadly most organisations, especially older ones, don't adhere to Open Data standards. With all the simple to use, well documented APIs, CSVs, XLS and other developer friendly formats we use on an almost daily basis it's very easy to take this for granted.
I'll happily admit, I'm definitely guilty of being spoiled by the proliferation of API's currently available. So when I was tasked with converting an old unwieldy XLS into a svelte CSV I was excited (and a little intimidated).
This will be the first of a two part blog looking at the steps I took (and in some cases wish I took) to wrangle the data. For this blog I'll be talking specifically about what you should do before you even write a single line of code.
Before we jump in we should probably define data wrangling right? Data Wrangling is the process of cleaning and unifying messy and complex data sets.
Okay, so you finally got that dataset, huh? You're excited. I get it. But slow down. Breathe. Don't open your text editor, not yet. Before we even get there here's what you should do before you write a single line of code. Make notes of whatever you observe in any of these steps. Once you start coding they’ll prove handy.
1 - Identify Column Headings and Data
Take a look at the data and identify what exactly is being shown paying special attention to what will become your column headings and what will fall under each heading. Sometimes this will be pretty straightforward but it's important to make sure you look over the dataset with a critical eye to ensure you don't miss anything. These column headings will serve as the foundations for parsing our Data.
2 - Look for Patterns
Now that you've identified your column headings and data look at the provided data for patterns. Humans use patterns to help make things more readable at a glance. These patterns can also help us as we start writing our parsing algorithm. For example, in the datasets I was given some patterns I observed were: the first column in a row is the month. If the starting column in a row is empty it indicates the end of a year and the next starting column will be the new year.
3 - Look for Anomalies
Chances are the data you got isn't perfect. Sure, you could fix this with a couple lines of code but you could also save yourself the time and energy and just do it the good old fashioned manual way. (Make sure you're not deleting anything crucial!)
Okay, we’re almost at the fun part! Look out for Part 2 where we’ll look at actually writing code to wrangle this into some useful data.