As I pointed out in my post about a Data Science workplace I ‘m going to write more about how we can leverage tools like R in combination with Microsoft BI. In my last post I shortly introduced a typical Big Data architecture and showed how we can leverage Excel and Power Query to analyze unstructured data.

The Scenario

This post will focus more on the analytical side of Data Science. We will start with the AdventureWorks Data Warehouse, which is already installed in my workplace image. AdventureWorks is a demo case created by Microsoft, that represents a company that sells bicycles on different channels. In our scenario we only look at Internet Sales. The data model looks like this:

image

It’s a kind of snowflake schema that has dimensions like Product, Customer, Promotion, Date, Curreny, etc.

What we want to analyze is, how can we group customers by income or sales amount into different cluster in order to optimize our marketing.

Cluster analysis is a very common practice and very good explained in Wikipedia: http://en.wikipedia.org/wiki/Cluster_analysis.

Data Preparation

So first of all we need to prepare our dataset, that we want to analyze. Therefor I create a view, which combines sales and income data per customer.

CREATE VIEW [dbo].[vCustomerSalesIncome]
AS
SELECT
    fact.[CustomerKey]
    ,sum([SalesAmount]) as SalesAmount
    ,dimC.YearlyIncome
  FROM [dbo].[FactInternetSales] fact
  INNER JOIN dbo.DimCustomer dimC 
        on fact.CustomerKey = dimC.CustomerKey
  where CurrencyKey = 100 --USD
  group by fact.[CustomerKey], dimC.YearlyIncome

The result of the view looks like the following table:

image

No we can start using R in order to find clusters.

Data Analysis

R is a language and software environment for statistical computing and graphics. It’s a free software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It runs under UNIX, Linux, Windows and MacOS and is very often used in the Analytics area. More information can be found here: http://www.r-project.org/

I prefer to work with RStudio which can be downloaded here: http://www.rstudio.com/

Load the data into R

All the calculations in R happen in-memory. So first of all we need to load our basic dataset. To do this I import the ODBC library, create a new database connection to my AdventureWorks database in SQL Server and read the view “vCustomerSalesIncome” (the one we created before) via the sqlFetch command. At the end we close the database connection.

#load ODBC library
library(RODBC)
#create connection
cn<-odbcDriverConnect(connection="Driver={SQL Server Native Client 11.0};server=localhost;database=AdventureWorksDW2012;trusted_connection=yes;")
#load customer data
custSalesInc<-sqlFetch(cn,'vCustomerSalesIncome')
odbcClose(cn)

We can have a look in RStudio how our dataset now looks like. 11,818 rows have been imported, the table structure looks like in our database.

image

Data Density

Next we want to get a better feeling about our data. Plots in R are very good for data visualizations. So let’s create two plots that visualize the data density of the sales amount and the yearly income of our customers.

Therefor we call the plot function and apply the density function on our dataset columns “SalesAmount” and “YearlyIncome” and the main attribute defines the header of our chart.

par(mfrow=c(1,2))
plot(density(custSalesInc$SalesAmount), main="Desity Sales Amount")
plot(density(custSalesInc$YearlyIncome), main="Desity Yearly Income")

The result is shown in the next picture. What we see is, that the average sales amount per customer is around 1,200 USD and the average yearly income is around 56,000 USD. These information help us to better understand our customer sales behavior and to define the cluster more precisely.

Rplot

Define the income cluster

Based on our data density we now define the income clusters for low, medium and high income. Therefor we use an ifelse-function, that assigns a score based on the yearly income:

  • 3 if the yearly income is > 60,000 USD
  • 2 if the yearly income is <= 60,000 USD and > 30,000 USD
  • 1 if the yearly income is <= 30,000 USD
#define income cluster
IncHigh.cluster <- ifelse(custSalesInc$YearlyIncome > 60000, 3, 0)
IncMid.cluster <- ifelse(custSalesInc$YearlyIncome <= 60000 & custSalesInc$YearlyIncome > 30000, 2, 0)
IncLow.cluster <- ifelse(custSalesInc$YearlyIncome <= 30000, 1, 0)

Now we need to combine these vectors together to one data set.

#bind clusters together
cat <- cbind(IncLow.cluster, IncMid.cluster, IncHigh.cluster)

The result looks like the following picture. For every customer (represented by each row) we get a score for the yearly income.

image

Nice, but what we really want is only one column that tells us in which cluster a certain customer is. Therefor we need to create a new object “IncCategoryKey” where we just sum the 3 variables for our clusters and bind the result to our custSalesInc dataset.

IncCategoryKey <- IncLow.cluster + IncMid.cluster + IncHigh.cluster
#bind data set and category keys
custSalesInc <- cbind(custSalesInc,IncCategoryKey)

As a result we get a structure like this, that shows the income category score for each customer.

image

We can also assign some friendly names to the categories, to make it more readable. The factor function in R allows us to define objects as category variables.

#assign labels to data set
IncCategoryLabel = factor(IncCategoryKey, labels = c("Low", "Mid", "High"))
#bind data set and category labels
custSalesInc <- cbind(custSalesInc,IncCategoryLabel)

The first parameter defines the name of the variable that needs to be converted, the second parameter contains our labels, defined as a vector (defined by the c() function) of “Low”, “Mid” and “High”. The last step binds the labels to our original dataset.

image

Visualize the income clusters

In our last step we want to visualize the income cluster with the help of a plot. Therefor we define a boxplot that shows the Income in USD on the x-axis and the customer cluster on the y-axis.

par(mfrow=c(1,1))
boxplot(custSalesInc$YearlyIncome~custSalesInc$IncCategoryLabel,main="Boxplot Income", xlab="Income in USD",ylab="Income Cluster",horizontal=TRUE)

As a result we get this nice graphic:

image

What we can see is, that in the lower cluster the minimum income is 10,000 USD, the maximum is 30,000 USD and the average is around 20,000 USD (thick black line). The medium cluster reaches from 40,000 USD up to 60,000 USD, whereas the average is around 50,000 USD. In the highest cluster values are spread from 60,000 USD up to 100,000 USD and we have some outliners in the range of 150,000 USD.

Save the results in our Data Warehouse

In the previous examples we used R for statistic calculations and visualizations. Now that we are done with the cluster analysis, we can also save our results back to our Data Warehouse to use it for further analytics.

cn<-odbcDriverConnect(connection="Driver={SQL Server Native Client 11.0};server=localhost;database=AdventureWorksDW2012;trusted_connection=yes;")
sqlSave(cn,custSalesInc,rownames=FALSE,tablename="InternetSalesCluster",colname=FALSE,append=FALSE)
odbcClose(cn)

We create a new ODBC connection to the AdventureWorks database and use the sqlSave function to store the results, which gives us an additional table with the income cluster per customer.

image

Summary

This post showed how easy we can use data stored in our Data Warehouse and use it in R to do more sophisticated analytics. R is a great tool to do this, also for data visualization. Important is, that these analytics are not a one-way road and a new data silo. Therefor I also showed how we can save the results back into our Data Warehouse so that other users are able to work with them.

References