…how well do they play together?

I continue my Data Science Lab series with a new story about Data Analytics based on SQL Server and R. As I already described in the Data Science workplace R is a typical and very often used platform for more advanced analytics like regressions, predictions or recommendations. And the good thing is, it’s free. But what is also typical is, that the data we use for those analytics is stored in relational databases like SQL Server. So in order to apply analytics on our data we need to first connect R to our SQL Server database. All test and examples you will find later on have been done with R version 3.0.2 and RStudio version 0.98.501, which I can highly recommend, when you work with R.

Connecting R and SQL Server

In order to connect R to SQL Server we need to use a connection first. Since there are no special drivers for SQL Server available we have to use the normal ODBC package. An ODBC connection to SQL Server can be made in the following way:

#load ODBC library
#create connection
cn <- odbcDriverConnect(connection="Driver={SQL Server 
Native Client 11.0};server=localhost;database=TPCH;

You need to replace the values for server and database according to your settings. The variable cn stored the connection object that we need for our queries.

Query a table

After we have create a connection we can now query our table. In order to query a table we have certain options:

Read with sqlFetch

The function sqlFetch reads some or all of a table from a database into a data frame. Here is an example (I reuse the connection object cn from the previous example):

#load data
data <- sqlFetch(cn, 'myTable', colnames=FALSE, rows_at_time=1000)

The variable data stores the result of our table in a data frame. I will come to details and advantages of a data frame in my next post.

Read with sqlQuery

By using sqlQuery we are able to submit a SQL query to a database and get the results. Example:

#load data
data <- sqlQuery(cn, "select * from myTable")
status <- sqlGetResults(cn, = FALSE, errors = TRUE,
			max = 0, buffsize = 1000000,
			nullstring = NA_character_, na.strings = "NA",
			believeNRows = TRUE, dec = getOption("dec"),
			stringsAsFactors = default.stringsAsFactors())

Also in this case the result of our query is stored in the variable data as a data frame. sqlGetResults is a mid-level function. It is called after a call to sqlQuery or odbcQuery to retrieve waiting results into a data frame. Its main use is with max set to non-zero when it will retrieve the result set in batches with repeated calls. This is useful for very large result sets which can be subjected to intermediate processing.

Read with odbcQuery

The last option is to use odbcQuery. This is a low-level function that talks directly with the ODBC interface and has some advantages, that I will explain later on. Here is an example:

status  <- odbcQuery(cn, "select * from myTable")
data <- odbcFetchRows(cn, max = 0, buffsize = 10000,
			nullstring = NA_character_, believeNRows = TRUE)
error <- odbcGetErrMsg(cn)

odbcFetchRows returns a data frame of the pending rowset, limited to max rows if max is greater than 0. buffsize may be increased from the default of 1000 rows for increased performance on a large dataset. This only has an effect when max = 0 and believeNRows = FALSE (either for the ODBC connection or for this function call), in which case buffsize is used as the initial allocation length of the R vectors to hold the results. (Values of less than 100 are increased to 100.) If the initial size is too small the vector length is doubled, repeatedly if necessary.
odbcGetErrMsg returns a (possibly zero-length) character vector of pending messages.

What about Performance?

In order to compare the different R function we need a test case and a baseline first. I will just have a look at read performance. So lets define our test case first.

My laboratory


To give you an idea about the test environment I will quickly give you some numbers of my machine. Since I don’t have any high performance servers I can use I just picked my tweaked notebook.

Having said that I also want to point out that I was not so interested in the absolute loading times. I bet that you can run faster tests with more expensive hardware. Here I want to show you the different approaches and how they affect the performance relatively to the absolute numbers.

  • HP EliteBook 8470p, 4 Cores, 2.9 GHz, 16 GR RAM, SSD Card
  • SQL Server 2014, CTP2
  • R (64 bit) and RStudio

The test case

Since it does really matter what kind of data we use for testing I picked the TPC-H data model and data generator as my test data set. The good thing with the TPC-H data model is, that you can download it also on your own and do the same tests on your machine. I created the LINEITEM table in my SQL Server with 1 million rows which is about 140 MB table storage. Here is a screenshot of the table structure:


Setting the stage – our baseline

In order to have some comparable results I first want to see what the baseline numbers and some reference numbers of my test lab are. To get to those numbers I tested the physical read speed of my SSD drive to exclude hardware bottlenecks from testing. I used SQLIO for that. Secondly I also tested BCP and SSIS (SQL Server Integration Services) to see how fast I can read the data with these tools. Here are the results:


As you can see I get a really good speed of my physical SSD drive which will not be a bottleneck in our further tests. The speed of SSIS is as fast as we would expect. If you what to understand how we can tune this, please have a look at my SSIS Performance Guidelines:

Recap and next steps

So far we have seen the different ways how we can connect R with SQL Server. In order to understand how well they perform for data analytics scenarios we need to create a test scenario and get some base values of performance. In my next post I will show how fast the different R functions are, how we can tune them and how we can fix some bottlenecks in order to speed up our read performance.