A typical Big Data architecture is able to handle structured data and unstructured data. Databases are great to handle large amounts of structured data, which can be queried very efficient and fast. But databases are not an appropriate store for unstructured data like text, xml, logs, etc. The challenge with unstructured data is, that every row could looks a little bit differently which makes it very hard to handle during ETL and for modeling. Hadoop is a much more flexible store, which allows us to store and query this kind of data more efficiently. So a very efficient and scalable Big Data architecture could look like this:
But do we really need every time a Hadoop cluster, when we just work on a subset of our data in order to find trends and pattern in it?
Which car Brand is more popular?
The answer is NO. Microsoft Excel in combination with the free Add-in Power Query is also capable to do that. So let’s try to find some trends in Twitter data about car brands. I collected some data of several days about different car brands like Audi, BWM, Mercedes, Porsche, etc. This data is stored in text files. The dataset can be downloaded here.
Choose the data source
First of all we start Excel (Excel 2013 in my case) and we build a data mash up on the Twitter data. I choose import data “From Folder”…
… and I fill in the folder path in the next window.
Power Query starts to analyze which content is stored in my source folder.
Design the data mashup
In order to structure the content of our text files we need to click the small arrows next to “Content”.
In the next screen we see our content as only one column with no further structure. Since we know the structure of our text files we can use Power Query do define it. We need to right-click on the column “Column1” and choose “Split Column” and then “By Delimiter…”. As delimiter we choose “Semicolon”.
Now we can see our column structure and just need to rename the columns to some speaking names like “Tweet”, “Tweeter”, “Date” and “Brand”.
The column “Brand” contains values like Audi A1, Audi A3, VW Golf, etc. In order to be able to analyze our data by brand and by sub-brand we need to split this column into two columns, the separator is “Space”. The final result is shown in the picture below.
The next column we need to adjust is “Date”. When we right-click on the “Date” column and choose “Change Type”, we can see that the data type is currently “Text”. In order to transform it into a date column we choose “Change Type” and than “Using Locale…”. We pick “Date, Time and Timezone” as data type and because the date in our text column is represented in a English format, we pick “English (United States)” as “Locale”.
Now the “Date” column is defined in a date format, but we don’t want to analyze the data on a timestamp-level, day-level is good enough. So we right-click on the “Date” column and choose “Transform” and than “Date”.
Note: If you want to have both information, date and time, create a copy of the “Date” column by choosing “Duplicate Column”. You can name then one column “Date” and one “Time” and transform it appropriately.
In the last step we just filter some error rows with “null” values in column values. This can be done very easily when you left-click on the small arrow next to the “Tweeter” column and unselect the “(null)” value.
Now we are done. We can review our transformation steps in the pane on the right side. I think this exercise was very easy to understand and to accomplished. Compared to ETL tools like SSIS this is much less effort and you don’t need further skills.
Analyze the data
All our data is in good shape, now we can start analyze it. In order to do this we load our data into the data model. If we choose this all data is stored in the xVelocity in-memory engine of Excel called Power Pivot. Power Pivot is a high performance columnstore in-memory engine that is already build-in Excel 2013. It’s works like SAP HANA on your desktop.
Now let’s create a pivot chart to visualize the data. What we see is that more people talk about brands like Audi and VW and not so much about Mercedes. The reason could be, that Mercedes cars are often bought by older people that probably don’t use new medias like Twitter or Facebook so much. So if you are a marketing guy at Mercedes, you should think about, if Twitter is really the right marketing channel for you.
As we have seen Excel and Power Query are fantastic and very powerful tools to do data Mashups also on unstructured data, like text, sensor data, twitter, xml, etc. You don’t need to setup a Hadoop cluster to get started. This perfectly fits into the Self Service BI strategy of Microsoft since Power User are able to do this very easily.