PASS Business Analytics Conference 2015 – Day 1

This is my first time visiting the PASS Business Analytics Conference, so I want to take the chance to wrap up my impressions of the first day. The conference takes place in Santa Clara, and I personally think California is a great place to be. The sunny weather makes it very easy to combine interesting and challenging sessions with networking, and people are in an open mood to get to know each other. I really like the very friendly and open nature of the people here.

Let me give you some more details on the conference itself. The clear focus of the conference is Data Analytics. It’s a very good mixture of technical, business and strategic sessions. The audience is very different from the typical, more technical PASS conference in Seattle. Altogether there are 600 participants from 24 countries and more than 60 hours of sessions with great speakers. I had never heard most of them before, but they really give you new ideas and food for thought that you would never have come up with yourself.



The keynote started with PASS president Thomas LaRock, followed by a great talk by Carlo Ratti, the Director of the MIT Senseable City Lab. The lab focuses on how sensors are changing cities into real-time cities with lots of new opportunities. Cities are already covered with digital sensors today, so that people can react to what they measure. It is estimated that by 2020 about 50 billion connections will need to be handled in the world, and most of these connections will be sensors and devices rather than mobile phones. He showed several examples of how cities covered with digital sensors are becoming smart cities these days. MIT is running several research projects in that area, for example:

Mobility and Smart Cities

HubCab is a project that started by analyzing taxi peaks in New York and trying to bring in shared mobility. Besides MIT, Audi and GE are also part of that project, which makes me, as a German, proud that we are part of that innovation. He presented some interesting numbers: even though self-driving cars are still on the way, only 20% of today’s cars would be needed for our mobility. An average car is only used 5% of the time, so there is a huge opportunity to make mobility more efficient, especially in large cities, and also to enhance the quality of life. So sharing is much more intelligent, and what if we combine the self-driving experience with the sharing concepts and even rethink how traffic is organized? Do we still need traffic lights then? There are some nice concepts and research projects on self-driving intersections that organize traffic much more intelligently and efficiently than traffic lights have for the last 100 years. This is a big opportunity for all of us.

The Copenhagen Wheel

Another interesting and fascinating thing that came out of an MIT research project is the Copenhagen Wheel. It is a completely new way of bicycle mobility and a great example of how sensors and technology can help us create new experiences. The best way to understand it is to watch the video:

Follow the trash

One last project that was interesting to me was “moma-followup”. The idea is to follow obsolete electronics, which are normally just thrown away, across the world for recycling or reuse. To do this, MIT developed two different types of self-reporting technologies that are coupled with the obsolete electronics to see what happens to them:


For me, the first day really gave some new ideas about what is going on in Data Analytics. It is an amazing time to see what can be achieved with technology these days, and it is not always just about earning more money. It is interesting to see how fast technology adoption happens and what role great universities like MIT, Boston, Harvard, Stanford and many others play in that area. So, a really great day one, and I will follow up on day 2.


Big Data = Big Fun – The world’s first Microsoft carnival session

I think today was a very historic day. Ralph “de Bräd” Kemperdick (Microsoft Germany) organized the world’s first carnival session at Microsoft in Cologne. Together with some other Jecks (Hilmar, Christos, Tillman and me) we presented some funny sessions about Big Data, and we had a lot of fun together with our guests and visitors. Thanks to Ralph and Microsoft for organizing this, and to all the guests who joined. We all agreed that this year was only the small start of a tradition that will hopefully fill pavilions and arenas in the coming years. And for the rest of you who couldn’t join, here are some impressions.

In this sense…Kölle Alaaf!







A call to all Data Engineers and BI workers

In the last two years I have had the chance to get my hands on some very exciting Data Analytics projects, and I want to take the chance to recap and to reach out to Data Engineers and BI consultants. Why?

In IT we see lots of trends coming up every year. Some go, some stay, but sometimes we also see a paradigm shift. These shifts have a tremendous impact on the way we worked before and how we will work in the future, for example the rise of the Internet, the era of Business Intelligence and Data Warehousing, and the whole e-commerce shift. And now we can see a new shift coming up:

The era of Big Data

The difficult thing with a paradigm shift is that we need to rethink certain ideas, the way we did business before and the way we will do it in the future. And if we don’t do it, others will, and we will not be as successful as we have been in the past. So let me tell that story in more detail.

Big Data vs. Data Analytics


Big Data has been out there for a while now, and people already understand that storing large amounts of data is not good enough. There was a big hype about it, and we are now at a point where the words “Big Data” have already acquired a negative touch. It’s very exciting to see the big progress in new technologies like Hadoop, where customers can store nearly all their data. But in 90% of cases it is totally useless to throw all your data at a data analysis problem. Just talking about technologies does not meet the needs of users and customers anymore.

I also don’t like to talk about Big Data because the term is misleading. Instead I’d like to talk about Data Analytics, because that’s what it’s all about. So the focus is clearly on analyzing data and creating value out of it. This is not big news either, but we were told that only a specific type of person with specific knowledge can do this: Data Scientists.

These people are currently seen as the heroes of the analytics market, and everybody is looking for one, with little or no luck. So here’s my point: analyzing data is not a new venture; in the areas of Business Intelligence and Data Mining people have been doing this for years. But what has changed, and where does the big shift happen?

We can clearly say that we can’t get around Data Analytics anymore. If you talk with customers and you only want to talk about Data Warehouses and BI, you are missing half of the discussion. All the companies I talk to are clearly thinking about Big Data or Data Analytics and how they can combine it with their Data Warehouse and BI solutions. But technology has become secondary in these discussions. Don’t get me wrong, Data Warehouses are still necessary and in use, but the focus has clearly changed. We see new types of data that are interesting to analyze, like streaming, social, log and sensor data, and there are also new ways to analyze data, like pattern recognition, predictions, clustering, recommendations, etc. So the operational data that is typically stored in Data Warehouses is still necessary, but it has to be combined with the other types of data I mentioned before. In today’s discussions with customers, it’s all about use cases and solutions.

And to close the loop, let me quickly come back to the Data Scientists. I agree that we need statistical and mathematical skills to solve problems like customer segmentation, next best offers, recommendations, predictions, data correlations, etc., but we need many more skills to provide whole solutions to customers, so a good team mix is much more important.

New skills and approaches

With the era of Big Data and the new analytical possibilities we also see new solution approaches. Data Analytics projects are much more iterative and evolutionary because research on your data is a big part of the work. Companies discover new use cases, and sometimes they change their whole business model because they find competitive advantages or new possibilities for revenue.

A good example of this is Smart Homes. We can see that digitalization is now arriving at our homes. In the near future, the devices in our homes will be fully connected and will share data with each other. When I set my wake-up alarm for the next morning, an app will tell my heating system. My heating system then knows when I want to take a shower and need warm water, or when I want to drive my electric car.

Energy providers are highly interested in this information and in my daily energy consumption behavior. Why?

Because when they understand my energy consumption better, they can better predict their energy sales and how much energy is consumed at a certain time. And when they can better predict the energy consumption of their customers, they can better manage their purchases on the energy exchange market.

The challenge with these new business models, and there are plenty of others, is that they are new. And for energy companies that have offered power supply in a very classical way for decades, this is a big change. That is also why technology providers like Google are entering the market: they know how to handle the data, how to analyze it, and how to use it in business models to provide additional services. If you don’t accept these changes in business models, even when they take some time to settle in the market, you will wake up when it is too late, because applying changes takes time, and companies need experience in order to apply these changes step by step.

And I think this is the most important lesson of the last years. You can stick with your old business models if they work, but if an industry is changing you need to adapt. Data Analytics is happening in several industries, and the most successful companies are those that start small, gain experience very quickly, and are able to adopt the changes. There are very good examples in Germany, like the Otto Group in retail, Yello Strom in the energy sector, and also some new startups.

As I mentioned before, Data Analytics projects need to be very iterative in their approach. A lot of projects start with an idea, a use case or just a feeling, and we need to quickly understand whether there is a business case behind it or not. In order to support those projects we need a different approach, which I call “Laboratory and Factory”.


The Laboratory

The Laboratory is for experiments. Here we can test all our use cases and ideas, or just discover patterns in data. The important thing is that it must be cheap. We don’t want to spend much money on experiments if we don’t know the business case behind them yet. The work can be compared to „panning for gold“: there is plenty of gold to be found, but for every gold nugget a multiple of sand needs to be panned. From a technology perspective I would use whatever fits the use case. In the laboratory we should be flexible and use different technologies like SQL, Hadoop, Storm, Pig, R or Python, D3, or other tools that help solve our problems.

From a data perspective we can work on a subset of the data. What we probably want to avoid is data sampling, which is often very time consuming, so we prefer working on real data first. So the main goal of the laboratory is to prove, or disprove, our use cases and business cases as quickly and cheaply as possible.


The Factory

After we have proven our business cases in the laboratory, we can apply our data analytics models in the factory. The factory means that we operate these models on a daily basis and couple them to our business processes. Here we typically use the existing enterprise platforms, and we often see mixed solutions of classical data analytics platforms combined with open source technologies. A new requirement of the last years is that we want to apply our analytical models to the whole data history, and technologies and servers are now capable of making this possible. So our factory gives us the integration of our analytical models into our daily business at enterprise scale, on the whole data set, and probably enriched with external data.

New technologies


A few months ago I had the chance to visit the European Hadoop conference in Amsterdam, and it was a great opportunity to get another view on Big Data and Data Analytics from an open source perspective. It has become very obvious that Hadoop and NoSQL-based technologies drive the Big Data and analytics market. There is a whole industry behind companies like Cloudera, Hortonworks, MapR and others that push new innovations. The Stinger initiative, for example, was a team project of around 45 companies with 140 developers that improved the performance of Hive by a factor of 100 within one year. Imagine the power of innovation that companies like Google, Yahoo, Facebook, Hortonworks and also Microsoft can bring to these technologies when they combine their skills. Clearly, when you come from a traditional BI stack like SQL Server, Teradata, Oracle or SAP, you would say that there are still some gaps in usability and ease of use. But on the other side, these technologies are built just for Big Data solutions. They offer fantastic capabilities, even if it is sometimes harder to work with them.

And when you see that all the big players in the market, like IBM, Oracle, Teradata, Microsoft and SAP, have partnerships with Hadoop platform providers, it is very clear that there is no way around these technologies anymore. It is just a question of how to combine them best. Microsoft, for example, has a nice offering with the Analytics Platform System (APS), a scale-out appliance where you can mix Hadoop and SQL Server in a highly parallel and very high-performing way.


I personally believe in the new paradigm shift of Big Data and Data Analytics. I have already had the chance to enjoy several projects in that area, and I’m very happy to start on new ones in the next weeks. That is also the reason why I wanted to write this article. In order to stay competitive we need to accept changes in the market and start to deal with them. What does that mean?

We have to keep learning: new technologies, different approaches, new business models, etc. Traditional BI projects will still be done in the future, but the really interesting and challenging projects will all deal with Data Analytics. There are lots of really fascinating use cases which, due to non-disclosure agreements, I can’t talk about in more detail. But what I can say is that a little bit of ETL, SQL, a Data Warehouse and some reports are not good enough anymore. Technologies these days can do much more, and customers are starting to understand this. The demand and the expectations are increasing, especially for analytical models in combination with business process optimization. So the times are very exciting, and I encourage everybody to get started. If you don’t know where, let me know…



Data Science Labs – Data Analytics with SQL Server and R… (Part 1)


…how well do they play together?

I am continuing my Data Science Lab series with a new story about Data Analytics based on SQL Server and R. As I already described in the Data Science workplace post, R is a typical and very often used platform for more advanced analytics like regressions, predictions or recommendations. And the good thing is, it’s free. But what is also typical is that the data we use for those analytics is stored in relational databases like SQL Server. So in order to apply analytics to our data, we first need to connect R to our SQL Server database. All tests and examples you will find below have been done with R version 3.0.2 and RStudio version 0.98.501, which I can highly recommend when you work with R.

Connecting R and SQL Server

In order to connect R to SQL Server we first need to create a connection. Since there is no dedicated R driver package for SQL Server, we use the RODBC package. An ODBC connection to SQL Server can be made in the following way:

#load ODBC library
library(RODBC)

#create connection (here using Windows integrated authentication)
cn <- odbcDriverConnect(connection="Driver={SQL Server Native Client 11.0};server=localhost;database=TPCH;trusted_connection=yes;")

You need to replace the values for server and database according to your settings. The variable cn stores the connection object that we need for our queries.
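As a small sketch (assuming the cn connection object from above), you can sanity-check the connection with a trivial query and close it when you are done:

```r
#load ODBC library
library(RODBC)

#quick sanity check: run a trivial query against the connection
print(sqlQuery(cn, "select @@version"))

#close the connection when finished to free the ODBC handle
odbcClose(cn)
```

Closing connections explicitly is a good habit, because ODBC handles are a limited resource outside of R’s garbage collection.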

Query a table

After we have created a connection we can now query our tables. In order to query a table we have several options:

Read with sqlFetch

The function sqlFetch reads some or all of a table from a database into a data frame. Here is an example (I reuse the connection object cn from the previous example):

#load data
data <- sqlFetch(cn, 'myTable', colnames=FALSE, rows_at_time=1000)

The variable data stores the result of our table in a data frame. I will come to the details and advantages of a data frame in my next post.

Read with sqlQuery

By using sqlQuery we are able to submit a SQL query to a database and get the results. Example:

#load data
data <- sqlQuery(cn, "select * from myTable")

#or, with full control over the retrieval parameters:
status <- sqlGetResults(cn, as.is = FALSE, errors = TRUE,
			max = 0, buffsize = 1000000,
			nullstring = NA_character_, na.strings = "NA",
			believeNRows = TRUE, dec = getOption("dec"),
			stringsAsFactors = default.stringsAsFactors())

In this case as well, the result of our query is stored in the variable data as a data frame. sqlGetResults is a mid-level function that is called after sqlQuery or odbcQuery to retrieve waiting results into a data frame. Its main use is with max set to a non-zero value, in which case it retrieves the result set in batches with repeated calls. This is useful for very large result sets, which can then be subjected to intermediate processing.
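To illustrate that batching behavior, here is a sketch (the table name myTable and the batch size of 10,000 rows are just placeholder assumptions) of how repeated sqlGetResults calls with a non-zero max could process a large result set chunk by chunk:

```r
#load ODBC library
library(RODBC)

#submit the query without fetching the results yet
odbcQuery(cn, "select * from myTable")

#fetch the pending result set in batches of up to 10,000 rows
repeat {
  chunk <- sqlGetResults(cn, max = 10000)
  if (!is.data.frame(chunk)) break   #no more pending rows
  #intermediate processing of the chunk would happen here
  print(nrow(chunk))
  if (nrow(chunk) < 10000) break     #last, partial batch
}
```

This way only one batch has to fit into memory at a time, at the price of doing the processing inside the loop instead of on one big data frame.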

Read with odbcQuery

The last option is to use odbcQuery. This is a low-level function that talks directly to the ODBC interface and has some advantages that I will explain later on. Here is an example:

status  <- odbcQuery(cn, "select * from myTable")
data <- odbcFetchRows(cn, max = 0, buffsize = 10000,
			nullstring = NA_character_, believeNRows = TRUE)
error <- odbcGetErrMsg(cn)

odbcFetchRows returns a data frame of the pending rowset, limited to max rows if max is greater than 0. buffsize may be increased from the default of 1000 rows for increased performance on a large dataset. This only has an effect when max = 0 and believeNRows = FALSE (either for the ODBC connection or for this function call), in which case buffsize is used as the initial allocation length of the R vectors to hold the results. (Values of less than 100 are increased to 100.) If the initial size is too small the vector length is doubled, repeatedly if necessary.
odbcGetErrMsg returns a (possibly zero-length) character vector of pending messages.

What about Performance?

In order to compare the different R functions we need a test case and a baseline first. I will only look at read performance, so let’s define our test case first.
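As a sketch of how such a comparison could be run (the connection cn and the LINEITEM table name are assumptions based on the earlier examples and the test setup below), each read path can simply be wrapped in system.time:

```r
#load ODBC library
library(RODBC)

#time the three read paths against the same table
t_fetch <- system.time(d1 <- sqlFetch(cn, "LINEITEM"))
t_query <- system.time(d2 <- sqlQuery(cn, "select * from LINEITEM"))
t_odbc  <- system.time({
  odbcQuery(cn, "select * from LINEITEM")
  d3 <- odbcFetchRows(cn)
})

#print elapsed times side by side
print(rbind(sqlFetch = t_fetch, sqlQuery = t_query, odbcQuery = t_odbc))
```

The elapsed column of system.time is the wall-clock time we care about here; running each measurement several times and taking the best value reduces caching and warm-up effects.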

My laboratory


To give you an idea about the test environment, I will quickly give you some numbers on my machine. Since I don’t have any high-performance servers at hand, I just picked my tweaked notebook.

Having said that, I also want to point out that I was not so interested in absolute loading times. I bet you can run faster tests with more expensive hardware. Here I want to show you the different approaches and how they affect performance relative to each other, rather than in absolute numbers.

  • HP EliteBook 8470p, 4 cores, 2.9 GHz, 16 GB RAM, SSD drive
  • SQL Server 2014, CTP2
  • R (64 bit) and RStudio

The test case

Since it really matters what kind of data we use for testing, I picked the TPC-H data model and data generator as my test data set. The good thing about the TPC-H data model is that you can download it yourself and do the same tests on your machine. I created the LINEITEM table in my SQL Server with 1 million rows, which is about 140 MB of table storage. Here is a screenshot of the table structure:


Setting the stage – our baseline

In order to have comparable results, I first want to see the baseline numbers and some reference numbers of my test lab. To get those numbers I tested the physical read speed of my SSD drive, using SQLIO, to exclude hardware bottlenecks from the testing. Secondly, I also tested BCP and SSIS (SQL Server Integration Services) to see how fast I can read the data with these tools. Here are the results:


As you can see, I get a really good speed from my physical SSD drive, so it will not be a bottleneck in our further tests. The speed of SSIS is as fast as we would expect. If you want to understand how to tune this, please have a look at my SSIS Performance Guidelines:

Recap and next steps

So far we have seen the different ways to connect R with SQL Server. In order to understand how well they perform in data analytics scenarios, we need a test scenario and some baseline performance values. In my next post I will show how fast the different R functions are, how we can tune them, and how we can fix some bottlenecks in order to speed up our read performance.


Conference time – From data to business advantage

I’m happy to say that I will present another session on Data Analytics together with Rafal Lukawiecki on the 21st of March in Zurich. The event is organized by Microsoft and Rafal and focuses on Data Analytics with the Microsoft platform.

I already met Rafal at the SQL Server Conference 2014, where he gave the keynote, and he is a fantastic speaker. Besides conferences, Rafal Lukawiecki is also a strategic consultant at Project Botticelli Ltd, where he focuses on making advanced analytics easy, insightful and useful, helping clients achieve better organizational performance. I’m a big fan of his work, and I can encourage everybody who wants to learn about Data Analytics to use his trainings and videos.

So if you are interested and have some time, please join. You can register here:

Analytic Powerhouse–Parallel Data Warehouse and R

It’s already some weeks ago that I presented this session about PDW and R at the SQL Server Conference in Darmstadt. The conference itself was very nice; we had about 350 participants, mostly from Germany. I want to take the chance to say thank you to the organization team, especially to PASS and Microsoft.

For everybody who missed my talk, I have uploaded my slides to SlideShare:

The slides are still in German, but I will write another blog post in English about the R integration and my findings. If you have any questions or feedback, please let me know.


Data Science Labs – Analyze Twitter data with Excel and Power Query

A typical Big Data architecture is able to handle both structured and unstructured data. Databases are great at handling large amounts of structured data, which can be queried very efficiently and quickly. But databases are not an appropriate store for unstructured data like text, XML, logs, etc. The challenge with unstructured data is that every row can look a little bit different, which makes it very hard to handle during ETL and for modeling. Hadoop is a much more flexible store, which allows us to store and query this kind of data more efficiently. So a very efficient and scalable Big Data architecture could look like this:


But do we really need a Hadoop cluster every time, when we just work on a subset of our data in order to find trends and patterns in it?

Which car Brand is more popular?

The answer is NO. Microsoft Excel in combination with the free add-in Power Query is also capable of doing that. So let’s try to find some trends in Twitter data about car brands. I collected data over several days about different car brands like Audi, BMW, Mercedes, Porsche, etc. This data is stored in text files. The dataset can be downloaded here.

Choose the data source

First of all we start Excel (Excel 2013 in my case) and build a data mashup on the Twitter data. I choose to import data “From Folder”…


… and I fill in the folder path in the next window.


Power Query then starts to analyze which content is stored in my source folder.


Design the data mashup

In order to extract the content of our text files, we need to click the small arrow next to “Content”.


In the next screen we see our content as a single column with no further structure. Since we know the structure of our text files, we can use Power Query to define it. We right-click the column “Column1”, choose “Split Column” and then “By Delimiter…”. As the delimiter we choose “Semicolon”.


Now we can see our column structure and just need to rename the columns to some meaningful names like “Tweet”, “Tweeter”, “Date” and “Brand”.


The column “Brand” contains values like Audi A1, Audi A3, VW Golf, etc. In order to analyze our data by brand and by sub-brand, we need to split this column into two columns, with “Space” as the separator. The final result is shown in the picture below.


The next column we need to adjust is “Date”. When we right-click the “Date” column and choose “Change Type”, we can see that the data type is currently “Text”. In order to transform it into a date column, we choose “Change Type” and then “Using Locale…”. We pick “Date, Time and Timezone” as the data type, and because the date in our text column is represented in an English format, we pick “English (United States)” as the “Locale”.


Now the “Date” column has a date format, but we don’t want to analyze the data at timestamp level; day level is good enough. So we right-click the “Date” column and choose “Transform” and then “Date”.


Note: If you want to keep both pieces of information, date and time, create a copy of the “Date” column by choosing “Duplicate Column”. You can then name one column “Date” and one “Time” and transform them appropriately.

In the last step we just filter out some error rows with “null” values. This can be done very easily by left-clicking the small arrow next to the “Tweeter” column and unselecting the “(null)” value.


Now we are done. We can review our transformation steps in the pane on the right side. I think this exercise was very easy to understand and to accomplish. Compared to ETL tools like SSIS, this takes much less effort, and you don’t need special skills.


Analyze the data

All our data is in good shape, so now we can start analyzing it. To do this we load our data into the data model. When we choose this option, all data is stored in the xVelocity in-memory engine of Excel, called Power Pivot. Power Pivot is a high-performance, in-memory columnstore engine that is already built into Excel 2013. It works like SAP HANA on your desktop.


Now let’s create a pivot chart to visualize the data. What we see is that more people talk about brands like Audi and VW and not so much about Mercedes. One reason could be that Mercedes cars are often bought by older people who probably don’t use new media like Twitter or Facebook very much. So if you are in marketing at Mercedes, you should think about whether Twitter is really the right marketing channel for you.



As we have seen, Excel and Power Query are fantastic and very powerful tools for doing data mashups, also on unstructured data like text, sensor data, Twitter data, XML, etc. You don’t need to set up a Hadoop cluster to get started. This fits perfectly into the Self-Service BI strategy of Microsoft, since power users are able to do this very easily.