In my post about what makes a Data Scientist, I talked about the role of a Data Scientist and the kind of work these people do. Next, I want to talk about the Data Science laboratory and what a typical workplace could look like.
I started by creating a Virtual Machine where I can play and experiment with different tools. So what does my laboratory look like?
- Virtual Machine with Windows Server 2012 as my OS
- SQL Server 2012 for relational data sets
- Hadoop cluster based on HDInsight for unstructured data sets (logs, Twitter, text, sensor data, etc.). It can be downloaded as HDInsight Server or used as the HDInsight Azure Service in the cloud.
- Excel 2013 with Power Query
- R environment and RStudio for advanced analytics
Is this platform able to handle large data sets and complex analysis? Of course. SQL Server is a highly scalable database. The Hadoop cluster in Windows Azure can be extended up to 32 nodes. Excel ships with an in-memory column store engine called Power Pivot, which can handle millions of records in a highly compressed format. And R can be scaled with solutions from providers like Revolution Analytics.
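To give a feeling for why a column store can keep millions of records in a compressed format, here is a minimal sketch in Python. It is not Power Pivot's actual implementation. It only illustrates two generic techniques often used in column stores, dictionary encoding and run-length encoding, and the `country` column and both function names are made up for this example.

```python
from itertools import groupby


def dictionary_encode(column):
    """Replace each distinct value with a small integer id.

    Returns the value-to-id dictionary and the list of ids.
    """
    dictionary = {}
    ids = []
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        ids.append(dictionary[value])
    return dictionary, ids


def run_length_encode(ids):
    """Collapse consecutive repeats into (id, count) pairs."""
    return [(key, len(list(group))) for key, group in groupby(ids)]


# Hypothetical, already sorted "country" column with 12 rows.
country = ["DE"] * 4 + ["FR"] * 3 + ["US"] * 5

dictionary, ids = dictionary_encode(country)
runs = run_length_encode(ids)

print(dictionary)  # value -> id mapping
print(runs)        # (id, run length) pairs
```

In this toy case, twelve strings shrink to a dictionary with three entries and three (id, count) pairs. Columns with few distinct values and long runs benefit the most, which is typical for many business data sets.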
So now we have a good laboratory for more advanced analytics. Let’s see what we can do with it in my next post.