I will continue my blog series about SQL Server 2012 Parallel Data Warehouse. If you missed my earlier posts of this series you can find them here:
In this blog post we will have a more detailed look on the POC experiences I made with PDW v2 and also with PDW v1. Since these POCs are under NDA I’m not allowed to tell anything about my customer or the customer background. I will purely focus on the results, which are impressive enough
Customer Scope & Expectations
Before starting a POC it’s always important to define a clear scope, customer expectations and acceptance criteria. In my case my customer had a clear expectation.
- Increase ETL load Performance
- Increase load performance by factor 10
- ETL Performance compared to ELT
- SSIS migration effort
- Show support of parallelism handling by PDW
- Current limitations
- Relational Query Performance
- Cube Processing Performance
I think this is a very clear mission, so let’s see how fast we can get it…
Current customer & POC environment
The next picture shows high-level the customer production environment:
The BI solutions is split into a relational Data Warehouse and into an Analysis Services cube. Each part has a dedicated server, the cube side is supported by very fast Fusion-IO cards, the database server is equipped with 16 cores and 128 GB of RAM. ETL is handled by SQL Server Integration Services, which also runs on the database server.
The POC environment consist of a PDW for the relational database and a separated ETL where the cube was also installed on.
We used a SQL Server 2012 Parallel Data Warehouse with 8 compute nodes. The ETL Server had 24 Cores, 96 GB of RAM and was connected to the PDW via Infiniband and Ethernet.
In order to start the POC we first had to migrate the database from SQL Server to PDW. Here just very quickly what we did in 1 week with 2 people:
- Migration of 3 Databases (220 tables, 80 views, 40 stored procedures)
- Adoption of 3 SSIS packages (only connection strings and use of special PDW destination task)
- Adoption of the SSAS cube (only connection strings)
SSIS Load Performance
Just to remember, the easiest SSIS package we can build can give us just a write performance of 25 – 30 MB/s, even with very good hardware. On a normal SMP environment the only way to get this faster is to parallelize the load streams. These techniques I already explained in detail here:
And if we apply these techniques we will propably get SSIS packages like this, that give us a write performance of 120 MB/s:
But this was not the goal of this POC, we wanted to show how good PDW can parallelize our write operations without having such complex SSIS packages with index handling, partition management and duplicate business logic.
For our tests we only used a SSIS package with one data flow (originally design by our customer), removed all index handling (disabled tasks on the left) and partition management operations. So it looks like this:
The first results were a little bit disappointing. We only saw slightly better performance than on a normal SQL Server with 38 MB/s.
But then we tested T-SQL based ETL with a big difference in case of performance, as you can see on the picture above. We also tested different database designs like loading into a Heap, Clustered Index, Clustered Columnstore Index and into a Heap first and apply a Clustered Columnstore index afterwards. But why is SSIS so slow?
The reason is, that we have a highly parallel architecture for our SQL Server, but in order to load data into our Data Warehouse we move the data out of PDW, through the SSIS pipeline and then back to PDW. So the SSIS pipeline becomes our bottleneck. The solution?
We need to read faster in order to push the SSIS pipeline.
We can to this very easily by adding only more OleDB read task to our data flow so that I looks like this:
The performance results are pretty awesome, we get a nearly linear scale, up to 180 MB/s.
We applied this technique to our customer SSIS packages and here are the results:
Now we also migrated the SSIS logic to T-SQL based ETL and compared again:
Does it mean I need to migrate all my SSIS package to T-SQL code, if I use SQL Server Parallel Data Warehouse? Of course not. As I showed, we also get a very good performance with SSIS, but you can get a better load performance with T-SQL, when you need it. So if you need more than 200 MB/s write performance, you should better go with T-SQL based ETL.
The next topic I want to highlight is data compression. We all know data compression already from SQL Server, where we can specify per table or even per partition the storage format (uncompressed, row compressed or page compressed). SQL Server 2012 Parallel Data Warehouse ships with Microsoft’s in-memory technology xVelocity, which allows us to store complete tables in a column oriented, highly compressed format directly in-memory.
In our POC we wanted to analyze, how much additional compression we can get with that technology:
In order to test the compression rates we took 3 different data sets (2 bn., 12 bn. and 80 bn. rows). The first table shows the compression factor on a normal SQL Server 2012, where page compression offers the highest compression rate.
The table below shows the compression rates on PDW. As you can see a normal heap table already offers a high compression rate, because in PDW data is always compressed. But when we store our data in column oriented way within a columnstore table, we get a much higher compression rate. So PDW offers a 5-7x better data compression than SQL Server 2012.
Relational Query Performance
The last point I want to highlight is the relational query performance. We used a customer reference query for our tests and executed the same query without any optimizations on PDW. We also didn’t create any indexes on the tables, and here are the results:
The table above shows in the first line the base line performance of the query on the current customer production system. With no further optimizations we achieved a 4x better performance. Querying data from a Columnstore table instead of a heap gives us extra 2x performance and by adjusting the workload management to use more resources, we are able to run the same query 8x faster. PDW has a build-in workload management that can be used to assign resources to certain users or groups. So if you have a heavy or very critical query, you can assign more resources to it, in order to run faster.
Based on what we have seen so far I want to quickly summarize the POC results:
- SSIS based load processes are up to 6x faster
- T-SQL based load processes are up to 20x faster
- Relational queries are 8x faster in average
- 50x data compression (compared to uncompressed data)
- Performance & Storage can be scaled in a linear way by adding additional PDW nodes
- PDW abstraction layer simplifies database administration
- Automated creation & distribution of database files
- Automated creation & distribution of database file groups
- Indices are used rarely