Conference time – SQL Server Conference 2015


While everybody is preparing for Christmas and the holidays, some people are already preparing the first conferences of early next year. So I’m very happy that the SQL Server Conference 2015 will take place in Darmstadt again. We will have two days full of great sessions with international speakers, so English-speaking attendees will also find plenty of sessions to follow. In addition, there is a full day dedicated to pre-conference trainings.

I’m also very happy to say that I have the chance to support the community with a session on “SAP HANA, Power Pivot and SQL Server – In-Memory Technologies Compared”.

So register, take a seat and enjoy the sessions.

What’s your passion in life?

This morning I saw an interesting interview with Microsoft’s CEO Satya Nadella at TU Berlin (Satya Nadella as guest at TU Berlin), in which he talked about the new generation, technologies, students, changing business models and passion. It made me think about my own passion and what makes me happy every day, even if I have to get up at 5 in the morning (with a smile, of course).


So here are my 4 major passions:

  • It’s about my customers: I really like the times we live in. We can see a big paradigm shift in the areas of digital enterprises, Big Data, Data Analytics and the Internet of Things. A lot of fascinating projects are happening these days, and I feel very lucky to be part of some of them.
  • It’s about my team: Because this new world is changing so fast and is so challenging, it’s important to have a great team in order to be successful. What I like about my team is that we share the same passion. And when you share the same passion, you can achieve fantastic things and have a lot of fun and success.
  • It’s about technologies: I have a good technical background and have focused my research in the last months on Data Analytics, Big Data, Machine Learning and the new technologies that came up with the evolution of the cloud. There is a great video by Lars Thomsen about “520 weeks of future” that I keep coming back to: 520 Wochen Zukunft — die zweite Dekade der grossen Chancen. Unfortunately the video is in German, but it makes me think.
  • It’s about education: I learned to educate myself in my first days at university. All of my fellow students were already familiar with the Internet, websites, MP3s and all the great things of the year 2000, so I had to catch up. After some years in consulting, with lots of experience gathered in so many projects, I felt it was time to share these experiences with others and help educate the next generation. That’s why I started to give trainings, speak at conferences, publish whitepapers and write a blog.

So what are your passions? What excites you in your day-to-day job?

Have fun…

Impressions from the SQL Saturday #313

This year we again had a great SQL Saturday here in Germany. I want to take the chance to say thank you to the organization team, all the speakers and the volunteers for the big effort and time they put in to make this happen. In my opinion it was again a great success. We used the same location as last year, the campus in Sankt Augustin, and I personally like the combination of a university campus and a technology event like this. This year we also had more participants than last year, and the team was able to deliver very good sessions in 5 parallel tracks (http://sqlsaturday.com/313/schedule.aspx).

I also had the chance to give some of my time to the SQL community with a session about the different In-Memory technologies in SAP HANA, Power Pivot and SQL Server. If you are interested in my slides, I have shared them on SlideShare.

 

If you have any feedback or questions, don’t hesitate to reach out. Finally, I also want to share some impressions from the conference with you. Thank you to our photographer Dirk, who put together some more photos on OneDrive: VCSL3p


A call to all Data Engineers and BI workers

In the last two years I have had the chance to get my hands on some very exciting Data Analytics projects, and I want to take the opportunity to recap and to reach out to Data Engineers and BI consultants. Why?

In IT we see lots of trends coming up every year. Some go, some stay, but sometimes we also see a paradigm shift. These shifts have a tremendous impact on the way we worked before and how we will work in the future; examples are the rise of the Internet, the era of Business Intelligence and Data Warehousing, and the whole e-commerce shift. And now we can see a new shift coming up:

The era of Big Data

The difficult thing with a paradigm shift is that we need to rethink certain ideas: the way we did business before and the way we will do it in the future. And if we don’t, others will, and we will not be as successful as we have been in the past. So let me tell that story in more detail.

Big Data vs. Data Analytics


Big Data has been out there for a while now, and people already understand that storing large amounts of data is not good enough. There was a big hype about it, and we are now at a point where the term “Big Data” has taken on a negative touch. It’s very exciting to see the big progress in new technologies like Hadoop, which let customers store nearly all of their data. But in 90% of the cases it is useless to simply throw all your data at a data analysis problem. Just talking about technologies does not meet the needs of users and customers anymore.

I also don’t like to talk about Big Data, because the term is misleading; instead I’d rather talk about Data Analytics, because that is what it’s all about. The focus is clearly on analyzing data and creating value out of it. This is not big news either, but we were told that only a specific type of person with specific knowledge can do this: Data Scientists.

They are currently seen as the heroes of the analytics market, and everybody is looking for one, with little or no luck. So here’s my point: analyzing data is not a new venture; in Business Intelligence and Data Mining, people have been doing it for years. But what has changed, and where does the big shift happen?

We can clearly say that we can’t get around Data Analytics anymore. If you talk with customers and only want to talk about Data Warehouses and BI, you are missing half of the discussion. All the companies I talk to are clearly thinking about Big Data or Data Analytics and how they can combine it with their Data Warehouse and BI solutions. But technology has become secondary in these discussions. Don’t get me wrong, Data Warehouses are still necessary and in use, but the focus has clearly changed. We see new types of data that are interesting to analyze, like streaming, social, log and sensor data, and there are also new ways to analyze data, like pattern recognition, predictions, clustering, recommendations, etc. So the operational data that is typically stored in Data Warehouses is still necessary, but it has to be combined with the other types of data I mentioned before. In today’s discussions with customers, it’s all about use cases and solutions.

And to close the loop, let me quickly come back to the Data Scientists. I agree that we need statistical and mathematical skills to solve problems like customer segmentation, next-best-offer recommendations, predictions, data correlations, etc., but we need many more skills to provide complete solutions to customers, so a good team mix is much more important.

New skills and approaches

With the era of Big Data and the new analytical possibilities, we also see new solution approaches. Data Analytics projects are much more iterative and evolutionary, because research on your data is a big part of the work. Companies discover new use cases, and sometimes they change their whole business model because they find competitive advantages or new sources of revenue.

A good example of this is the Smart Home. We can see that digitalization is now arriving at our homes. In the near future the devices in our homes will be fully connected and will share data with each other. When I set my wake-up alarm for the next morning, an app will pass this on to my heating system. My heating system then knows when I want to take a shower and need warm water, or when I want to drive my electric car.

Energy providers are highly interested in this information and in my daily energy consumption behavior. Why?

Because when they understand my energy consumption better, they can better predict their energy sales and how much energy is consumed at a given time. And when they can better predict the energy consumption of their customers, they can better manage their purchases on the energy exchange market.

The challenge with these new business models, and there are plenty of others, is that they are new. For energy companies that have offered power supply in a very classical way for decades, this is a big change. That is also why technology providers like Google are entering the market: they know how to handle the data, how to analyze it and how to use it in business models that provide additional services. If you don’t accept these changes in business models, even if they take some time to settle on the market, you will wake up when it is too late, because applying changes takes time and companies need experience in order to apply them step by step.

And I think this is the most important lesson of the last years. You can stick with your old business models if they work, but if an industry is changing, you need to adapt. Data Analytics is happening in several industries, and the most successful companies are those that start small, gain experience quickly and are able to adopt the changes. There are very good examples in Germany, like the Otto Group in retail, Yello Strom in the energy sector and also some new startups.

As I mentioned before, Data Analytics projects need to be very iterative in their approach. A lot of projects start with an idea, a use case or a gut feeling, and we need to quickly understand whether there is a business case behind it or not. In order to support such projects we need a different approach, which I call “Laboratory and Factory”.

[Figure: the “Laboratory and Factory” approach]

The Laboratory

The Laboratory is for experiments. Here we can test all our use cases and ideas, or just discover patterns in the data. The important thing is that it must be cheap. We don’t want to spend much money on experiments if we don’t know the business case behind them yet. The work can be compared to “panning for gold”: there is plenty of gold to be found, but for every gold nugget, a multiple of sand needs to be panned. From a technology perspective, I would use whatever fits the use case. In the laboratory we should be flexible about different technologies like SQL, Hadoop, Storm, Pig, R, Python, D3 or other tools that help solve our problems.

From a data perspective, we can work on a subset of the data. What we probably want to avoid is data sampling, which is often very time-consuming, so we prefer to start with real data. So the main goal of the laboratory is to…


The Factory

After we have proven our business cases in the laboratory, we can move our data analytics models into the factory. The factory means that we operate these models on a daily basis and couple them with our business processes. Here we typically use the existing enterprise platforms, and we often see mixed solutions of classical data analytics platforms combined with open source technologies. A new requirement of the last years is that we want to apply our analytical models to the whole data history, and today’s technologies and servers are capable of making this possible. So the factory gives us the integration of our analytical models into our daily business at enterprise scale, on the whole data set and possibly enriched with external data.

New technologies


Some months ago I had the chance to visit the European Hadoop conference in Amsterdam, and it was a great opportunity to get another view on Big Data and Data Analytics from an open source perspective. It has become very obvious that Hadoop- and NoSQL-based technologies drive the Big Data and analytics market. There is a whole industry behind companies like Cloudera, Hortonworks, MapR and others that pushes new innovations. The Stinger initiative, for example, was a joint project of around 45 companies with 140 developers that improved the performance of Hive by a factor of 100 within one year. Imagine the power of innovation that companies like Google, Yahoo, Facebook, Hortonworks and also Microsoft can bring to these technologies when they combine their skills. Clearly, when you come from a traditional BI stack like SQL Server, Teradata, Oracle or SAP, you will say that there are still some gaps in usability and ease of use. But on the other hand, these technologies are built specifically for Big Data solutions. They offer fantastic capabilities, and some of them are great technologies, even if it is sometimes harder to work with them.

And when you see that all the big players in the market, like IBM, Oracle, Teradata, Microsoft and SAP, have partnerships with Hadoop platform providers, it is very clear that there is no way around these technologies anymore. It is just a question of how best to combine them. Microsoft, for example, has a nice offering with the Analytics Platform System (APS), a scale-out appliance in which you can mix Hadoop and SQL Server in a highly parallel and very high-performing way.

Summary

I personally believe in the new paradigm shift of Big Data and Data Analytics. I have already had the chance to enjoy several projects in that area, and I’m very happy to start new ones in the coming weeks. That is also the reason why I wanted to write this article: in order to stay competitive, we need to accept the changes in the market and start to deal with them. What does that mean?

We have to keep learning: new technologies, different approaches, new business models, etc. Traditional BI projects will still be done in the future, but the really interesting and challenging projects will all deal with Data Analytics. There are lots of really fascinating use cases that I can’t discuss in more detail due to non-disclosure agreements. But what I can say is that a little bit of ETL, SQL, a Data Warehouse and some reports is not good enough anymore. Technologies these days can do much more, and customers are starting to understand this. Demand and expectations are increasing, especially for analytical models combined with business process optimization. So these are very exciting times, and I encourage everybody to get started. And if you don’t know where, let me know…

 


June is my conference month

The last months have been quite busy for me, with lots of very interesting projects and exciting customers, especially in the area of Big Data and Data Analytics. That’s why it is currently a little quiet on my blog. Apologies for this, but I will keep on writing. Additionally, June seems to be my conference month this year, and I would be happy to see and talk to some of you. So here is my current schedule:

TDWI Conference 2014, 23. – 25.06.

I will have a session together with Ralph Kemperdick from Microsoft on “Analytics Platform System (formerly known as PDW) – Real World Experiences”.

 

SQL Saturday 313, 28.6.

I’m very happy to again have the chance to present at the SQL Saturday in Germany. This year the SQL Saturday will again take place at the Hochschule Bonn-Rhein-Sieg, and my session will be about “Comparing SAP HANA, Power Pivot and SQL Server – In-Memory Technologies”.

 

Datalympics 2014, 02.07.

This conference is new to me; I’m very excited about it and happy to give a talk. My session will be on “Analytical Powerhouse − Data Analytics based on Microsoft”.

 

I think these will be very exciting weeks, and after my vacation I will follow up on my blog.

Data Science Labs – Data Analytics with SQL Server and R… (Part 1)


…how well do they play together?

I continue my Data Science Labs series with a new story about Data Analytics based on SQL Server and R. As I already described in the Data Science workplace, R is a typical and very frequently used platform for more advanced analytics like regressions, predictions or recommendations. And the good thing is, it’s free. But it is also typical that the data we use for such analytics is stored in relational databases like SQL Server. So in order to apply analytics to our data, we first need to connect R to our SQL Server database. All tests and examples you will find below were done with R version 3.0.2 and RStudio version 0.98.501, which I can highly recommend when you work with R.

Connecting R and SQL Server

In order to connect R to SQL Server we first need a connection. Since there is no dedicated SQL Server package for R, we use the standard ODBC package (RODBC) together with the SQL Server ODBC driver. An ODBC connection to SQL Server can be made in the following way:

#load ODBC library
library(RODBC)
#create connection (adjust server and database to your environment)
cn <- odbcDriverConnect(connection = paste0(
  "Driver={SQL Server Native Client 11.0};",
  "server=localhost;database=TPCH;",
  "trusted_connection=yes;"))

You need to replace the values for server and database according to your settings. The variable cn stores the connection object that we need for our queries.
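A quick sanity check never hurts: odbcGetInfo shows the driver and server details behind a connection, and odbcClose releases it when we are done. A minimal sketch:

#show driver, DBMS and server details of the connection
odbcGetInfo(cn)

#...run queries...

#release the connection when we are done
odbcClose(cn)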

Query a table

After we have created a connection, we can query our table. To do so, we have several options:

Read with sqlFetch

The function sqlFetch reads some or all of a table from a database into a data frame. Here is an example (I reuse the connection object cn from the previous example):

#load data
data <- sqlFetch(cn, 'myTable', colnames=FALSE, rows_at_time=1000)

The variable data now holds the contents of the table as a data frame. I will come to the details and advantages of data frames in my next post.
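For a first impression of what came back, the standard data frame helpers are enough; a small sketch (myTable is just the placeholder name used above):

#inspect what we received from SQL Server
nrow(data)     #number of rows fetched
str(data)      #column names and the R types RODBC mapped them to
head(data, 5)  #first five rows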

Read with sqlQuery

By using sqlQuery we are able to submit a SQL query to a database and get the results. Example:

#load data: sqlQuery submits the statement and fetches the complete result set
data <- sqlQuery(cn, "select * from myTable")
#sqlGetResults with all of its parameters spelled out; it retrieves the
#waiting results of a previously submitted query (see the note below)
status <- sqlGetResults(cn, as.is = FALSE, errors = TRUE,
			max = 0, buffsize = 1000000,
			nullstring = NA_character_, na.strings = "NA",
			believeNRows = TRUE, dec = getOption("dec"),
			stringsAsFactors = default.stringsAsFactors())

In this case, too, the result of our query is stored in the variable data as a data frame. sqlGetResults is a mid-level function: it is called after a call to sqlQuery or odbcQuery to retrieve waiting results into a data frame. Its main use is with max set to a non-zero value, in which case it retrieves the result set in batches over repeated calls. This is useful for very large result sets that need intermediate processing.
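As a rough sketch of that batched pattern (table name and batch size are just placeholders), one could submit the statement with odbcQuery and then pull the rows in chunks:

#submit the statement without fetching anything yet
odbcQuery(cn, "select * from myTable")

batchsize <- 100000
chunks <- list()
repeat {
  #fetch at most 'batchsize' rows of the pending result set
  chunk <- sqlGetResults(cn, max = batchsize)
  if (!is.data.frame(chunk)) break       #no more pending results
  #...intermediate processing of 'chunk' could happen here...
  chunks[[length(chunks) + 1]] <- chunk
  if (nrow(chunk) < batchsize) break     #last (partial) batch received
}
data <- do.call(rbind, chunks)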

Read with odbcQuery

The last option is to use odbcQuery. This is a low-level function that talks directly to the ODBC interface and has some advantages that I will explain later on. Here is an example:

#submit the query; the return value signals success or failure
status <- odbcQuery(cn, "select * from myTable")
#fetch the pending rowset
data <- odbcFetchRows(cn, max = 0, buffsize = 10000,
			nullstring = NA_character_, believeNRows = TRUE)
#collect any pending ODBC error messages
error <- odbcGetErrMsg(cn)

odbcFetchRows returns a data frame of the pending rowset, limited to max rows if max is greater than 0. buffsize may be increased from the default of 1000 rows for increased performance on a large dataset. This only has an effect when max = 0 and believeNRows = FALSE (either for the ODBC connection or for this function call), in which case buffsize is used as the initial allocation length of the R vectors to hold the results. (Values of less than 100 are increased to 100.) If the initial size is too small the vector length is doubled, repeatedly if necessary.
odbcGetErrMsg returns a (possibly zero-length) character vector of pending messages.
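Since odbcQuery only reports a status code, checking it and reading the pending messages is the low-level way of handling errors; a minimal sketch:

status <- odbcQuery(cn, "select * from myTable")
if (status < 0) {
  #something went wrong: print the pending ODBC messages
  print(odbcGetErrMsg(cn))
} else {
  data <- odbcFetchRows(cn)
}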

What about Performance?

In order to compare the different R functions, we first need a test case and a baseline. I will only look at read performance. So let’s define our test case first.

My laboratory


To give you an idea of the test environment, here are some numbers for my machine. Since I don’t have any high-performance servers available, I just picked my tweaked notebook.

Having said that, I also want to point out that I was not interested in the absolute loading times. I bet you can run faster tests on more expensive hardware. I want to show you the different approaches and how they affect performance relative to each other, rather than the absolute numbers.

  • HP EliteBook 8470p, 4 cores, 2.9 GHz, 16 GB RAM, SSD
  • SQL Server 2014, CTP2
  • R (64 bit) and RStudio

The test case

Since it really does matter what kind of data we use for testing, I picked the TPC-H data model and data generator as my test data set. The good thing about the TPC-H data model is that you can download it yourself and run the same tests on your machine. I created the LINEITEM table in my SQL Server with 1 million rows, which amounts to about 140 MB of table storage. Here is a screenshot of the table structure:

[Screenshot: structure of the TPC-H LINEITEM table]

Setting the stage – our baseline

In order to have comparable results, I first want to establish the baseline and some reference numbers of my test lab. To get those numbers, I tested the physical read speed of my SSD drive with SQLIO to exclude hardware bottlenecks from the tests. Secondly, I also tested BCP and SSIS (SQL Server Integration Services) to see how fast I can read the data with these tools. Here are the results:

[Chart: baseline read performance of the SSD (SQLIO), BCP and SSIS]

As you can see, I get a really good speed from my physical SSD drive, so it will not be a bottleneck in our further tests. The speed of SSIS is as fast as we would expect. If you want to understand how to tune this, please have a look at my SSIS Performance Guidelines:

Recap and next steps

So far we have seen the different ways to connect R with SQL Server and read data. In order to understand how well they perform in data analytics scenarios, we defined a test case and collected some baseline performance numbers. In my next post I will show how fast the different R functions are, how we can tune them and how we can fix some bottlenecks in order to speed up our read performance.
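If you want to experiment yourself in the meantime, a minimal sketch of how such a comparison could be timed in R looks roughly like this (LINEITEM is the TPC-H table described above; the connection settings are the placeholders from the beginning of the post):

#rough sketch: time the three read paths against the TPC-H LINEITEM table
library(RODBC)
cn <- odbcDriverConnect(connection = paste0(
  "Driver={SQL Server Native Client 11.0};",
  "server=localhost;database=TPCH;trusted_connection=yes;"))

timings <- c(
  sqlFetch  = system.time(d1 <- sqlFetch(cn, "LINEITEM"))["elapsed"],
  sqlQuery  = system.time(d2 <- sqlQuery(cn, "select * from LINEITEM"))["elapsed"],
  odbcQuery = system.time({
    odbcQuery(cn, "select * from LINEITEM")
    d3 <- odbcFetchRows(cn, buffsize = 100000, believeNRows = FALSE)
  })["elapsed"]
)
print(timings)  #elapsed seconds per read method
odbcClose(cn)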


Conference time – From data to business advantage

I’m happy to say that I will present another session on Data Analytics together with Rafal Lukawiecki on the 21st of March in Zurich. This event is organized by Microsoft and Rafal and focuses on Data Analytics with the Microsoft platform.

I already met Rafal at the SQL Server Conference 2014, where he gave the keynote, and he is a fantastic speaker. Besides conferences, Rafal Lukawiecki is also a strategic consultant at Project Botticelli Ltd (projectbotticelli.com), where he focuses on making advanced analytics easy, insightful and useful, helping clients achieve better organizational performance. I’m a big fan of projectbotticelli.com, and I encourage everybody who wants to learn about Data Analytics to use his trainings and videos.

So if you are interested and have some time, please join. You can register here: https://msevents.microsoft.com/cui/EventDetail.aspx?EventID=1032573660&culture=de-CH