PASS Business Analytics Conference 2015 – Day 1

This is my first visit to the PASS Business Analytics Conference, so I want to take the chance to wrap up my impressions of the first day. The conference takes place in Santa Clara, and I personally think California is a great place to be. The sunny weather makes it very easy to combine interesting and challenging sessions with networking and getting to know each other, and I really like the very friendly and open nature of the people here.

Let me give you some more details on the conference itself. The clear focus of the conference is Data Analytics, and it is a very good mixture of technical, business and strategic sessions. The audience is quite different from the typical, more technical PASS conference in Seattle. Altogether there are 600 participants from 24 countries and more than 60 hours of sessions with great speakers. Most of them I had never heard before, but they really give you new ideas and food for thought that you would never have come up with on your own.


Keynote

The keynote started with the PASS president Thomas LaRock, followed by a great talk by Carlo Ratti, the Director of the MIT Senseable City Lab. The MIT Senseable City Lab focuses on how sensors are turning cities into real-time cities with lots of new opportunities. Cities today are already covered with digital sensors, so that people can react to them. One estimate is that by 2020 about 50 billion connections will need to be handled worldwide, and most of these connections will be sensors and devices rather than mobile phones. He showed several examples of how cities covered with digital sensors can react to that data and become smarter cities. The MIT is running several research projects in this area, for example:

Mobility and Smart Cities

The Hub Cab is a project that started by analyzing taxi peaks in New York and tries to bring in shared mobility. Besides the MIT, Audi and GE are also part of that project, which makes me, as a German, proud that we are part of that innovation. He presented some interesting numbers: even though self-driving cars are on the way, only 20% of today's cars would be needed for our mobility. An average car is only used 5% of the time, so there is a huge opportunity to make mobility more efficient, especially in large cities, and to enhance the quality of life. Sharing is much more intelligent, so what if we bring the self-driving experience and the sharing concepts together and even rethink how traffic is organized? Do we still need traffic lights then? There are some nice concepts and research projects on self-driving intersections that organize traffic much more intelligently and efficiently than traffic lights have done for the last 100 years. This is a big opportunity for all of us.

The Copenhagen Wheel

Another interesting and fascinating thing that came out of an MIT research project is the Copenhagen Wheel. It is a completely new approach to bicycle mobility and a great example of how sensors and technology can help us create new experiences. The best way to understand it is to watch the video: https://www.youtube.com/watch?v=S10GMfG2NMY

Follow the trash

One last project that was interesting to me was “moma-followup”. The idea is to follow obsolete electronics, which are normally just thrown away, across the world to see how they are recycled or reused. To do this, the MIT developed two different types of self-reporting technologies that are coupled with the obsolete electronics, and then watched what happened: https://www.youtube.com/watch?v=5fHZYyOZU4c

Summary

The first day really gave me some new ideas about what is going on in Data Analytics. It is an amazing time to see what can be achieved with technology these days, and it is not always just about earning more money. It is interesting to see how fast technology adoption happens and what role great universities like MIT, Boston, Harvard, Stanford and many others play in that area. So, a really great day one, and I will follow up on day 2.


A call to all Data Engineers and BI workers

In the last two years I have had the chance to get my hands on some very exciting Data Analytics projects, and I want to take the chance to recap and to reach out to Data Engineers and BI Consultants. Why?

In IT we see lots of trends coming up every year. Some go, some stay, but sometimes we also see a paradigm shift. These shifts have a tremendous impact on the way we worked before and on how we will work in the future, for example the rise of the Internet, the era of Business Intelligence and Data Warehousing, and the whole E-Commerce shift. And now we can see a new shift coming up:

The era of Big Data

The difficult thing with a paradigm shift is that we need to rethink certain ideas, the way we did business before and the way we will do it in the future. And if we don't do it, others will, and we will not be as successful as we have been in the past. So let me get into that story in more detail.

Big Data vs. Data Analytics


Big Data has been out there for a while now, and people already understand that storing large amounts of data is not good enough. There was a big hype about it, and we are now at a point where the term “Big Data” has already acquired a negative touch. It is very exciting to see the big progress in new technologies like Hadoop, where customers can store nearly all of their data. But in 90% of the cases it is totally useless to throw all your data at a data analysis problem. Just talking about technologies does not meet the needs of users and customers anymore.

I also don't like to talk about Big Data, because the term is misleading. Instead I like to talk about Data Analytics, because that is what it is all about: the focus is clearly on analyzing data and creating value out of it. This is not big news either, but we were told that only a specific type of person with a specific kind of knowledge can do this: Data Scientists.

These people are currently seen as the heroes of the analytics market, and everybody is looking for one, with little or no luck. So here is my point: analyzing data is not a new venture; in the areas of Business Intelligence and Data Mining people have been doing this for years. But what has changed, and where does the big shift happen?

We can clearly say that we can't get around Data Analytics anymore. If you talk with customers and you only want to talk about Data Warehouses and BI, you are missing half of the discussion. All the companies I talk to clearly think about Big Data or Data Analytics and how they can combine it with their Data Warehouse and BI solutions. But technology has become secondary in these discussions. Don't get me wrong, Data Warehouses are still necessary and in use, but the focus has clearly changed. We see new types of data that are interesting to analyze, like streaming, social, log and sensor data, and there are also new ways to analyze data, like pattern recognition, prediction, clustering, recommendations, etc. So the operational data that is typically stored in Data Warehouses is still necessary, but it has to be combined with the other types of data I mentioned before. In today's discussions with customers, however, it is all about use cases and solutions.

And in order to close the loop, let me quickly come back to the Data Scientists. I agree that we need statistical and mathematical skills to solve problems like customer segmentation, next best offers, recommendations, predictions, data correlations and so on, but we need many more skills to provide whole solutions to customers, so a good team mix is much more important.

New skills and approaches

With the era of Big Data and the new analytical possibilities we also see new solution approaches. Data Analytics projects are much more iterative and evolutionary, because research on your data is a big part of the work. Companies discover new use cases, and sometimes they change their whole business model because they find competitive advantages or new sources of revenue.

A good example of this are Smart Homes. We can see that digitalization is now arriving at our homes. In the near future the devices in our homes will be fully connected and will share data with each other. When I set my wake-up alarm for the next morning, an app will tell my heating system. My heating system then knows when I want to take a shower and need warm water, or when I want to drive my electric car.

Energy providers are highly interested in this information and in my daily energy consumption behavior. Why?

Because when they understand my energy consumption better, they can better predict their energy sales and how much energy is consumed at a certain time. And when they can better predict the energy consumption of their customers, they can better manage their purchases of power on the energy exchange market.

The challenge with these new business models, and there are plenty of others, is that they are new. And for energy companies that have offered power supply in a very classical way for decades, this is a big change. That is also why technology providers like Google are entering the market: they know how to handle the data, how to analyze it and how to use it in business models that provide additional services. If you do not accept these changes in business models, even though they take some time to settle in the market, you will wake up when it is too late, because applying changes takes time and companies need experience in order to apply these changes step by step.

And I think this is the most important lesson of the last years. You can stick with your old business models if they work, but if an industry is changing you need to adapt. Data Analytics is happening in several industries, and the most successful companies are those that start small, gain experience quickly and are able to adopt the changes. There are very good examples in Germany, like the Otto Group in retail, Yello Strom in the energy sector and also some new startups.

As I mentioned before, Data Analytics projects need a very iterative approach. A lot of projects start with an idea, a use case or just a feeling, and we need to quickly understand whether there is a business case behind it or not. In order to support those projects we need a different approach, which I call “Laboratory and Factory”.


The Laboratory

The Laboratory is for experiments. Here we can test all our use cases and ideas, or just discover patterns in data. The important thing is that it must be cheap. We don't want to spend much money on experiments if we don't know the business case behind them yet. The work can be compared to “panning for gold”: there is plenty of gold to be found, but for every gold nugget a multiple of sand needs to be panned. From a technology perspective I would use whatever fits the use case. In the laboratory we should stay flexible and use different technologies like SQL, Hadoop, Storm, Pig, R or Python, D3 or other tools that help solve our problems.

From a data perspective we can work on a subset of the data. What we probably want to avoid is data sampling, which is often very time consuming, so we prefer to work with real data first. So the main goal of the laboratory is to…


The Factory

After we have proven our business cases in the laboratory, we can apply our data analytics models to the factory. The factory means that we operate these models on a daily basis and couple them to our business processes. Here we typically use the existing enterprise platforms, and we often see mixed solutions of classical data analytics platforms combined with Open Source technologies. A new requirement of the last years is that we want to apply our analytical models to the whole data history, and technologies and servers are now capable of making this possible. So the factory gives us the integration of our analytical models into our daily business at enterprise scale, on the whole data set and possibly enriched with external data.

New technologies


Some months ago I had the chance to visit the European Hadoop conference in Amsterdam, and it was a great opportunity to get another view on Big Data and Data Analytics from an Open Source perspective. It has become very obvious that Hadoop and NoSQL based technologies drive the Big Data and Analytics market. There is a whole industry behind companies like Cloudera, Hortonworks, MapR and others that pushes new innovations. The Stinger initiative, for example, was a joint project of around 45 companies with 140 developers that improved the performance of Hive by a factor of 100 within one year. Imagine the power of innovation that companies like Google, Yahoo, Facebook, Hortonworks and also Microsoft can bring to these technologies when they combine their skills. Clearly, if you come from a traditional BI stack like SQL Server, Teradata, Oracle or SAP, you would say that there are still some gaps in usability and ease of use. But on the other hand these technologies are built specifically for Big Data solutions. They offer fantastic capabilities, and some of them are great technologies, even if they are sometimes harder to work with.

And when you see that all the big players in the market like IBM, Oracle, Teradata, Microsoft and SAP have partnerships with Hadoop platform providers, it is very clear that there is no way around these technologies anymore. It is just a question of how to combine them best. Microsoft, for example, has a nice offering with the Analytics Platform System (APS), a scale-out appliance in which you can mix Hadoop and SQL Server in a highly parallel and very high-performing way.

Summary

I personally believe in the new paradigm shift of Big Data and Data Analytics. I have already had the chance to enjoy several projects in that area, and I am very happy to start on new ones in the coming weeks. But that is also the reason why I wanted to write this article: in order to stay competitive we need to accept the changes in the market and start to deal with them. What does that mean?

We have to keep learning: new technologies, different approaches, new business models, and so on. Old, traditional BI projects will still be done in the future, but the really interesting and challenging projects will all deal with Data Analytics. There are lots of really fascinating use cases which, due to non-disclosure agreements, I cannot talk about in more detail. But what I can say is that a little bit of ETL, SQL, building a Data Warehouse and some reports is not good enough anymore. Technologies these days can do much more, and customers are starting to understand this. The demand and the expectations are increasing, especially for analytical models in combination with business process optimization. So it is a very exciting time, and I encourage everybody to get started, and if you don't know where, let me know…

 


Data Science Labs – Data Analytics with SQL Server and R… (Part 1)


…how well do they play together?

I am continuing my Data Science Lab series with a new story about Data Analytics based on SQL Server and R. As I already described in my post about the Data Science workplace, R is a typical and very widely used platform for more advanced analytics like regressions, predictions or recommendations. And the good thing is: it is free. What is also typical is that the data we use for those analytics is stored in relational databases like SQL Server. So in order to apply analytics to our data, we first need to connect R to our SQL Server database. All tests and examples you will find below have been done with R version 3.0.2 and RStudio version 0.98.501, which I can highly recommend when you work with R.

Connecting R and SQL Server

In order to connect R to SQL Server we first need a connection. Since there are no R-specific drivers for SQL Server available, we use the general ODBC package (RODBC). An ODBC connection to SQL Server can be made in the following way:

#load ODBC library
library(RODBC)
#create connection
cn <- odbcDriverConnect(connection="Driver={SQL Server Native Client 11.0};server=localhost;database=TPCH;trusted_connection=yes;")

You need to replace the values for server and database according to your settings. The variable cn stores the connection object that we need for our queries.
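
Before running any queries it can be useful to verify that the connection was actually established. Here is a minimal, optional sanity check, assuming the connection object cn from above; odbcGetInfo and sqlTables are standard RODBC functions:

#optional sanity check of the connection
if (!inherits(cn, "RODBC")) stop("connection to SQL Server could not be established")
#show driver and DBMS information
odbcGetInfo(cn)
#list the tables visible through this connection
head(sqlTables(cn))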

Query a table

After we have created a connection we can query our table. To do this we have several options:

Read with sqlFetch

The function sqlFetch reads some or all of a table from a database into a data frame. Here is an example (I reuse the connection object cn from the previous example):

#load data
data <- sqlFetch(cn, 'myTable', colnames=FALSE, rows_at_time=1000)

The variable data stores the result of our table as a data frame. I will come to the details and advantages of a data frame in my next post.
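
To get a first impression of what sqlFetch returned, it helps to look at the size, the first rows and the column types of the data frame. A minimal sketch, reusing the data variable from above:

#inspect the imported data frame
dim(data)      #number of rows and columns
head(data, 5)  #first five rows
str(data)      #column names and the data types R has assigned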

Read with sqlQuery

By using sqlQuery we are able to submit a SQL query to a database and get the results. Example:

#submit the query and fetch a first batch of rows
data <- sqlQuery(cn, "select * from myTable", max = 1000)
#retrieve the remaining rows of the pending result set
rest <- sqlGetResults(cn, as.is = FALSE, errors = TRUE,
			max = 0, buffsize = 1000000,
			nullstring = NA_character_, na.strings = "NA",
			believeNRows = TRUE, dec = getOption("dec"),
			stringsAsFactors = default.stringsAsFactors())

Also in this case the result of our query is stored in the variable data as a data frame, here limited to the first 1,000 rows by the max argument; the call to sqlGetResults then retrieves the remaining rows. sqlGetResults is a mid-level function that is called after sqlQuery or odbcQuery to retrieve waiting results into a data frame. Its main use is with max set to a non-zero value, in which case it retrieves the result set in batches with repeated calls. This is useful for very large result sets that need some intermediate processing.
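
To make the batching pattern concrete, here is a minimal sketch, assuming the connection cn and the table myTable from above, that processes a large result set in chunks of 10,000 rows; each call to sqlGetResults returns the next batch until the result set is exhausted:

#process a large result set in batches of 10,000 rows
batch <- sqlQuery(cn, "select * from myTable", max = 10000)
rowsProcessed <- 0
while (is.data.frame(batch) && nrow(batch) > 0) {
	#intermediate processing of the current batch goes here
	rowsProcessed <- rowsProcessed + nrow(batch)
	#fetch the next batch of rows
	batch <- sqlGetResults(cn, max = 10000)
}
rowsProcessed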

Read with odbcQuery

The last option is to use odbcQuery. This is a low-level function that talks directly to the ODBC interface and has some advantages that I will explain later on. Here is an example:

#submit the query; a status code is returned
status <- odbcQuery(cn, "select * from myTable")
#fetch all pending rows into a data frame
data <- odbcFetchRows(cn, max = 0, buffsize = 10000,
			nullstring = NA_character_, believeNRows = TRUE)
#read any pending error messages for this connection
error <- odbcGetErrMsg(cn)

odbcFetchRows returns a data frame of the pending rowset, limited to max rows if max is greater than 0. buffsize may be increased from the default of 1000 rows for increased performance on a large dataset. This only has an effect when max = 0 and believeNRows = FALSE (either for the ODBC connection or for this function call), in which case buffsize is used as the initial allocation length of the R vectors to hold the results. (Values of less than 100 are increased to 100.) If the initial size is too small the vector length is doubled, repeatedly if necessary.
odbcGetErrMsg returns a (possibly zero-length) character vector of pending messages.

What about Performance?

In order to compare the different R functions we first need a test case and a baseline. I will only look at read performance, so let's define our test case first.

My laboratory


To give you an idea of the test environment, here are some numbers for my machine. Since I don't have any high-performance servers available, I just picked my tweaked notebook.

Having said that, I also want to point out that I am not so interested in the absolute loading times. I bet you can run faster tests with more expensive hardware. Here I want to show you the different approaches and how they affect performance relative to each other, rather than the absolute numbers.

  • HP EliteBook 8470p, 4 cores, 2.9 GHz, 16 GB RAM, SSD drive
  • SQL Server 2014, CTP2
  • R (64 bit) and RStudio

The test case

Since it really matters what kind of data we use for testing, I picked the TPC-H data model and its data generator as my test data set. The good thing about the TPC-H data model is that you can download it yourself and run the same tests on your own machine. I created the LINEITEM table in my SQL Server with 1 million rows, which is about 140 MB of table storage. Here is a screenshot of the table structure:

[Screenshot: structure of the LINEITEM table]

Setting the stage – our baseline

In order to have comparable results, I first want to establish the baseline and some reference numbers for my test lab. To get these numbers I tested the physical read speed of my SSD drive with SQLIO, to exclude hardware bottlenecks from the testing. Secondly, I also tested BCP and SSIS (SQL Server Integration Services) to see how fast I can read the data with these tools. Here are the results:

[Table: baseline read throughput of the SSD (SQLIO), BCP and SSIS]

As you can see, I get a really good speed from my physical SSD drive, so it will not be a bottleneck in our further tests. The speed of SSIS is as fast as we would expect. If you want to understand how to tune this, please have a look at my SSIS Performance Guidelines.

Recap and next steps

So far we have seen the different ways to connect R with SQL Server. In order to understand how well they perform in data analytics scenarios, we need a test scenario and some baseline performance values. In my next post I will show how fast the different R functions are, how we can tune them and how we can fix some bottlenecks in order to speed up our read performance.


Conference time – From data to business advantage

I'm happy to say that I will present another session on Data Analytics together with Rafal Lukawiecki on the 21st of March in Zurich. The event is organized by Microsoft and Rafal and focuses on Data Analytics with the Microsoft platform.

I already met Rafal at the SQL Server Conference 2014, where he gave the keynote, and he is a fantastic speaker. Besides conferences, Rafal Lukawiecki is also a strategic consultant at Project Botticelli Ltd (projectbotticelli.com), where he focuses on making advanced analytics easy, insightful and useful, helping clients achieve better organizational performance. I am a big fan of projectbotticelli.com, and I encourage everybody who wants to learn about Data Analytics to use his trainings and videos.

So if you are interested and have some time, please join. You can register here: https://msevents.microsoft.com/cui/EventDetail.aspx?EventID=1032573660&culture=de-CH

Analytic Powerhouse–Parallel Data Warehouse and R

It has already been some weeks since I presented this session about PDW and R at the SQL Server Conference in Darmstadt. The conference itself was very nice; we had about 350 participants, mostly from Germany. I want to take the chance to say thank you to the organization team, especially to PASS and Microsoft.

For everybody who missed my talk, I have uploaded my slides to SlideShare: http://de.slideshare.net/marcelfranke/analytic-powerhouse-parallel-data-warehouse-und-r

The slides are still in German, but I will write another blog post in English about the R integration and my findings. If you have any questions or feedback, please let me know.

 

Create a Data Science Lab with Microsoft and Open Source Tools

…this was the title of the session I gave at SQL Rally Nordic. Together with my friends Gerhard, Alexei and Thomas and a crowd of the best speakers in the PASS community, we had the chance to present in front of 500 attendees from more than 20 different countries. If you missed the conference you can find some impressions here: http://www.flickr.com/photos/passevents/

So, all in all, a very good event, and my respect goes to the organization team, which can be very proud of what it achieved. The least I can do is share my slides, so have fun.

Data Science Labs – Cluster analysis with Microsoft BI and R

As I pointed out in my post about the Data Science workplace, I am going to write more about how we can leverage tools like R in combination with Microsoft BI. In my last post I briefly introduced a typical Big Data architecture and showed how we can leverage Excel and Power Query to analyze unstructured data.

The Scenario

This post will focus more on the analytical side of Data Science. We will start with the AdventureWorks Data Warehouse, which is already installed in my workplace image. AdventureWorks is a demo database created by Microsoft that represents a company selling bicycles through different channels. In our scenario we only look at Internet Sales. The data model looks like this:

[Diagram: AdventureWorks Internet Sales data model]

It is a kind of snowflake schema with dimensions like Product, Customer, Promotion, Date, Currency, etc.

What we want to analyze is how we can group customers by income and sales amount into different clusters in order to optimize our marketing.

Cluster analysis is a very common practice and is explained very well on Wikipedia: http://en.wikipedia.org/wiki/Cluster_analysis.

Data Preparation

So first of all we need to prepare the dataset we want to analyze. For this I create a view which combines sales and income data per customer:

CREATE VIEW [dbo].[vCustomerSalesIncome]
AS
SELECT
    fact.[CustomerKey]
    ,sum([SalesAmount]) as SalesAmount
    ,dimC.YearlyIncome
  FROM [dbo].[FactInternetSales] fact
  INNER JOIN dbo.DimCustomer dimC 
        on fact.CustomerKey = dimC.CustomerKey
  where CurrencyKey = 100 --USD
  group by fact.[CustomerKey], dimC.YearlyIncome

The result of the view looks like the following table:

[Table: result of the vCustomerSalesIncome view, with CustomerKey, SalesAmount and YearlyIncome]

Now we can start using R to find clusters.

Data Analysis

R is a language and software environment for statistical computing and graphics. It is free software, available in source code form under the terms of the Free Software Foundation's GNU General Public License. It runs on UNIX, Linux, Windows and Mac OS and is very often used in the analytics area. More information can be found here: http://www.r-project.org/

I prefer to work with RStudio which can be downloaded here: http://www.rstudio.com/

Load the data into R

All calculations in R happen in memory, so first of all we need to load our base dataset. To do this I load the ODBC library, create a new database connection to my AdventureWorks database in SQL Server and read the view “vCustomerSalesIncome” (the one we created before) via the sqlFetch command. At the end we close the database connection.

#load ODBC library
library(RODBC)
#create connection
cn<-odbcDriverConnect(connection="Driver={SQL Server Native Client 11.0};server=localhost;database=AdventureWorksDW2012;trusted_connection=yes;")
#load customer data
custSalesInc<-sqlFetch(cn,'vCustomerSalesIncome')
odbcClose(cn)

We can have a look in RStudio at what our dataset looks like now: 11,818 rows have been imported, and the table structure matches the one in our database.

[Screenshot: the custSalesInc data frame in RStudio]
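
The same check can also be done directly in the R console instead of the RStudio viewer; a minimal sketch:

#quick check of the imported dataset
nrow(custSalesInc)   #11,818 rows in this example
head(custSalesInc)   #CustomerKey, SalesAmount and YearlyIncome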

Data Density

Next we want to get a better feeling for our data. Plots in R are very good for data visualization, so let's create two plots that show the data density of the sales amount and the yearly income of our customers.

To do this we call the plot function and apply the density function to the dataset columns “SalesAmount” and “YearlyIncome”; the main attribute defines the header of each chart.

par(mfrow=c(1,2))
plot(density(custSalesInc$SalesAmount), main="Density Sales Amount")
plot(density(custSalesInc$YearlyIncome), main="Density Yearly Income")

The result is shown in the next picture. What we see is that the average sales amount per customer is around 1,200 USD and the average yearly income is around 56,000 USD. This information helps us to better understand our customers' sales behavior and to define the clusters more precisely.

[Plot: density of SalesAmount and YearlyIncome]
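
If you prefer numbers to reading the plots, the same figures can be checked with a numeric summary. A minimal sketch; the exact values depend on the AdventureWorks data you loaded:

#numeric summary of both measures
summary(custSalesInc$SalesAmount)
summary(custSalesInc$YearlyIncome)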

Define the income cluster

Based on the data density we now define the income clusters for low, medium and high income. For this we use an ifelse function that assigns a score based on the yearly income:

  • 3 if the yearly income is > 60,000 USD
  • 2 if the yearly income is <= 60,000 USD and > 30,000 USD
  • 1 if the yearly income is <= 30,000 USD

#define income cluster
IncHigh.cluster <- ifelse(custSalesInc$YearlyIncome > 60000, 3, 0)
IncMid.cluster <- ifelse(custSalesInc$YearlyIncome <= 60000 & custSalesInc$YearlyIncome > 30000, 2, 0)
IncLow.cluster <- ifelse(custSalesInc$YearlyIncome <= 30000, 1, 0)

Now we need to combine these vectors into one data set.

#bind clusters together
cat <- cbind(IncLow.cluster, IncMid.cluster, IncHigh.cluster)

The result looks like the following picture. For every customer (represented by each row) we get a score for the yearly income.

[Table: IncLow.cluster, IncMid.cluster and IncHigh.cluster scores per customer]

Nice, but what we really want is a single column that tells us which cluster a certain customer belongs to. For this we create a new object “IncCategoryKey”, in which we simply sum the three cluster variables, and bind the result to our custSalesInc dataset.

IncCategoryKey <- IncLow.cluster + IncMid.cluster + IncHigh.cluster
#bind data set and category keys
custSalesInc <- cbind(custSalesInc,IncCategoryKey)

As a result we get a structure like this, which shows the income category score for each customer.

[Table: custSalesInc with the new IncCategoryKey column]

We can also assign friendly names to the categories to make them more readable. The factor function in R allows us to define objects as categorical variables.

#assign labels to data set
IncCategoryLabel = factor(IncCategoryKey, labels = c("Low", "Mid", "High"))
#bind data set and category labels
custSalesInc <- cbind(custSalesInc,IncCategoryLabel)

The first parameter is the variable that needs to be converted; the second parameter contains our labels, defined as a vector (created with the c() function) of “Low”, “Mid” and “High”. The last step binds the labels to our original dataset.

[Table: custSalesInc with the IncCategoryKey and IncCategoryLabel columns]

Visualize the income clusters

In the last step we want to visualize the income clusters with the help of a plot. For this we define a boxplot that shows the income in USD on the x-axis and the customer cluster on the y-axis.

par(mfrow=c(1,1))
boxplot(custSalesInc$YearlyIncome~custSalesInc$IncCategoryLabel,main="Boxplot Income", xlab="Income in USD",ylab="Income Cluster",horizontal=TRUE)

As a result we get this nice graphic:

[Boxplot: yearly income per income cluster]

What we can see is that in the lowest cluster the minimum income is 10,000 USD, the maximum is 30,000 USD and the median is around 20,000 USD (the thick black line). The medium cluster reaches from 40,000 USD up to 60,000 USD, with a median around 50,000 USD. In the highest cluster the values are spread from 60,000 USD up to 100,000 USD, and we have some outliers in the range of 150,000 USD.
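
These numbers can also be derived directly from the data instead of reading them off the boxplot. A minimal sketch using tapply; again, the exact values depend on your AdventureWorks data:

#five-number summary of the yearly income per income cluster
tapply(custSalesInc$YearlyIncome, custSalesInc$IncCategoryLabel, summary)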

Save the results in our Data Warehouse

In the previous examples we used R for statistical calculations and visualizations. Now that we are done with the cluster analysis, we can also save the results back to our Data Warehouse and use them for further analytics.

cn<-odbcDriverConnect(connection="Driver={SQL Server Native Client 11.0};server=localhost;database=AdventureWorksDW2012;trusted_connection=yes;")
sqlSave(cn,custSalesInc,rownames=FALSE,tablename="InternetSalesCluster",colnames=FALSE,append=FALSE)
odbcClose(cn)

We create a new ODBC connection to the AdventureWorks database and use the sqlSave function to store the results, which gives us an additional table with the income cluster per customer.

[Table: the InternetSalesCluster table in the AdventureWorks database]
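
As a quick check that the write-back worked, the new table can be read again through a fresh connection. A minimal sketch; the table name InternetSalesCluster comes from the sqlSave call above:

#re-open the connection and look at the saved cluster table
cn <- odbcDriverConnect(connection="Driver={SQL Server Native Client 11.0};server=localhost;database=AdventureWorksDW2012;trusted_connection=yes;")
sqlQuery(cn, "select top 10 * from InternetSalesCluster")
odbcClose(cn)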

Summary

This post showed how easily we can take data stored in our Data Warehouse and use it in R for more sophisticated analytics. R is a great tool for this, and also for data visualization. What is important is that these analytics are not a one-way road and do not become a new data silo. That is why I also showed how we can save the results back into our Data Warehouse, so that other users are able to work with them.
