A typical Big Data architecture is able to handle structured data and unstructured data. Databases are great to handle large amounts of structured data, which can be queried very efficient and fast. But databases are not an appropriate store for unstructured data like text, xml, logs, etc. The challenge with unstructured data is, that every row could looks a little bit differently which makes it very hard to handle during ETL and for modeling. Hadoop is a much more flexible store, which allows us to store and query this kind of data more efficiently. So a very efficient and scalable Big Data architecture could look like this:
But do we really need every time a Hadoop cluster, when we just work on a subset of our data in order to find trends and pattern in it?
Which car Brand is more popular?
The answer is NO. Microsoft Excel in combination with the free Add-in Power Query is also capable to do that. So let’s try to find some trends in Twitter data about car brands. I collected some data of several days about different car brands like Audi, BWM, Mercedes, Porsche, etc. This data is stored in text files. The dataset can be downloaded here.
Choose the data source
First of all we start Excel (Excel 2013 in my case) and we build a data mash up on the Twitter data. I choose import data “From Folder”…
… and I fill in the folder path in the next window.
Power Query starts to analyze which content is stored in my source folder.
Design the data mashup
In order to structure the content of our text files we need to click the small arrows next to “Content”.
In the next screen we see our content as only one column with no further structure. Since we know the structure of our text files we can use Power Query do define it. We need to right-click on the column “Column1” and choose “Split Column” and then “By Delimiter…”. As delimiter we choose “Semicolon”.
Now we can see our column structure and just need to rename the columns to some speaking names like “Tweet”, “Tweeter”, “Date” and “Brand”.
The column “Brand” contains values like Audi A1, Audi A3, VW Golf, etc. In order to be able to analyze our data by brand and by sub-brand we need to split this column into two columns, the separator is “Space”. The final result is shown in the picture below.
The next column we need to adjust is “Date”. When we right-click on the “Date” column and choose “Change Type”, we can see that the data type is currently “Text”. In order to transform it into a date column we choose “Change Type” and than “Using Locale…”. We pick “Date, Time and Timezone” as data type and because the date in our text column is represented in a English format, we pick “English (United States)” as “Locale”.
Now the “Date” column is defined in a date format, but we don’t want to analyze the data on a timestamp-level, day-level is good enough. So we right-click on the “Date” column and choose “Transform” and than “Date”.
Note: If you want to have both information, date and time, create a copy of the “Date” column by choosing “Duplicate Column”. You can name then one column “Date” and one “Time” and transform it appropriately.
In the last step we just filter some error rows with “null” values in column values. This can be done very easily when you left-click on the small arrow next to the “Tweeter” column and unselect the “(null)” value.
Now we are done. We can review our transformation steps in the pane on the right side. I think this exercise was very easy to understand and to accomplished. Compared to ETL tools like SSIS this is much less effort and you don’t need further skills.
Analyze the data
All our data is in good shape, now we can start analyze it. In order to do this we load our data into the data model. If we choose this all data is stored in the xVelocity in-memory engine of Excel called Power Pivot. Power Pivot is a high performance columnstore in-memory engine that is already build-in Excel 2013. It’s works like SAP HANA on your desktop.
Now let’s create a pivot chart to visualize the data. What we see is that more people talk about brands like Audi and VW and not so much about Mercedes. The reason could be, that Mercedes cars are often bought by older people that probably don’t use new medias like Twitter or Facebook so much. So if you are a marketing guy at Mercedes, you should think about, if Twitter is really the right marketing channel for you.
Summary
As we have seen Excel and Power Query are fantastic and very powerful tools to do data Mashups also on unstructured data, like text, sensor data, twitter, xml, etc. You don’t need to setup a Hadoop cluster to get started. This perfectly fits into the Self Service BI strategy of Microsoft since Power User are able to do this very easily.
July 30, 2013 at 2:27 pm
I couldn’t resist commenting. Well written!
September 24, 2013 at 10:47 pm
Nice article, I will like to try it out. Where can i find the data.
September 25, 2013 at 5:30 pm
If you like I can mail it to you, just let me know. It’s not a big data set.
September 25, 2013 at 5:37 pm
Yes please do, i will like to try it out. my email is mike.amegashie at gmail.com
October 2, 2013 at 4:28 pm
Excellent article!
One quick question – what did you use to capture the Twitter feed?
Thanks
Paul
October 2, 2013 at 11:05 pm
Hi Paul,
we used SSIS and the Twitter API to extract the data into csv files first. Hope that helps.
Cheers,
Marcel
October 13, 2013 at 10:27 am
Great post! Please share yor raw data.
Thanks in advanced.
April 17, 2014 at 7:39 pm
You can download the data here: https://onedrive.live.com/redir?resid=EDD20A747F7B8FDE!39586&authkey=!AGm0hy4eXvkCuno&ithint=folder%2c.zip
April 4, 2014 at 12:03 pm
Nice tutorial! Could you please mail me the data set to blog@derevabuddy.com so that i can try it too? Thanks
April 17, 2014 at 7:39 pm
You can download the dataset here: https://onedrive.live.com/redir?resid=EDD20A747F7B8FDE!39586&authkey=!AGm0hy4eXvkCuno&ithint=folder%2c.zip
January 4, 2016 at 6:55 pm
Thanks for the nice article! May I know How many rows you have catches?
I am worry about the Excel might run out of memory.