Analytic Powerhouse–Parallel Data Warehouse and R

It's already been a few weeks since I presented this session about PDW and R at the SQL Server Conference in Darmstadt. The conference itself was very nice; we had about 350 participants, mostly from Germany. I want to take the chance to say thank you to the organization team, especially to PASS and Microsoft.

For everybody who missed my talk, I uploaded my slides to slideshare: http://de.slideshare.net/marcelfranke/analytic-powerhouse-parallel-data-warehouse-und-r

The slides are still in German, but I will write another blog post in English about the R integration and my findings. If you have any questions or feedback, please let me know.

 

Create a Data Science Lab with Microsoft and Open Source Tools

…this was the title of the session I gave at SQL Rally Nordic. Together with my friends Gerhard, Alexei and Thomas and a crowd of the best speakers in the PASS community, I had the chance to present in front of 500 attendees from more than 20 different countries. If you missed the conference you can find some impressions here: http://www.flickr.com/photos/passevents/

All in all it was a very good event, and my respect goes to the organization team, which can be very proud of what they achieved. The least I can do is share my slides, so have fun.

Rock your data with SQL Server 2012 Parallel Data Warehouse (PDW) – POC Experiences

I will continue my blog series about SQL Server 2012 Parallel Data Warehouse. If you missed the earlier posts in this series you can find them here:

In this blog post we will take a more detailed look at the POC experiences I gained with PDW v2 and also with PDW v1. Since these POCs are under NDA I'm not allowed to say anything about the customer or the customer background. I will focus purely on the results, which are impressive enough.

Customer Scope & Expectations

Before starting a POC it's always important to define a clear scope, the customer expectations and the acceptance criteria. In my case the customer had clear expectations:

  • Increase ETL load performance
    • Increase load performance by a factor of 10
    • Compare ETL performance to ELT
    • SSIS migration effort
    • Show how well PDW handles parallelism
  • Current limitations
  • Relational query performance
  • Cube processing performance

I think this is a very clear mission, so let’s see how fast we can get it…

Current customer & POC environment

The next picture shows a high-level view of the customer's production environment:

image

The BI solution is split into a relational Data Warehouse and an Analysis Services cube. Each part has a dedicated server: the cube side is backed by very fast Fusion-io cards, and the database server is equipped with 16 cores and 128 GB of RAM. ETL is handled by SQL Server Integration Services, which also runs on the database server.

image

The POC environment consisted of a PDW for the relational database and a separate ETL server on which the cube was also installed.

We used a SQL Server 2012 Parallel Data Warehouse with 8 compute nodes. The ETL server had 24 cores and 96 GB of RAM and was connected to the PDW via InfiniBand and Ethernet.

POC Preparation

In order to start the POC we first had to migrate the database from SQL Server to PDW. Here, very briefly, is what we did in 1 week with 2 people:

  • Migration of 3 databases (220 tables, 80 views, 40 stored procedures)
  • Adaptation of 3 SSIS packages (only connection strings and the use of the special PDW destination task)
  • Adaptation of the SSAS cube (only connection strings)

SSIS Load Performance

As a reminder: the simplest SSIS package we can build gives us a write performance of only 25–30 MB/s, even with very good hardware. In a normal SMP environment the only way to make this faster is to parallelize the load streams. I have already explained these techniques in detail here:

If we apply these techniques we will probably end up with SSIS packages like this, which give us a write performance of 120 MB/s:

image

But this was not the goal of this POC. We wanted to show how well PDW can parallelize our write operations without such complex SSIS packages with index handling, partition management and duplicated business logic.

For our tests we used an SSIS package with only one data flow (originally designed by our customer) and removed all index handling (the disabled tasks on the left) and partition management operations, so it looks like this:

image

Results

The first results were a little disappointing. At 38 MB/s we only saw slightly better performance than on a normal SQL Server.

image

We then tested T-SQL based ETL, with a big difference in performance, as you can see in the picture above. We also tested different database designs: loading into a heap, into a clustered index, into a clustered columnstore index, and into a heap first with a clustered columnstore index applied afterwards.
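To make the tested table designs concrete, here is a minimal sketch of the four load targets using PDW's CREATE TABLE ... WITH syntax. All table and column names are invented for illustration, and the exact option spelling may vary by appliance update:

    -- Variant 1: plain heap, hash-distributed across the compute nodes
    -- (no index clause: the table is stored as a heap, the default)
    CREATE TABLE dbo.FactSales_Heap (
        CustomerKey INT           NOT NULL,
        Amount      DECIMAL(18,2) NOT NULL
    )
    WITH (DISTRIBUTION = HASH(CustomerKey));

    -- Variant 2: clustered (row store) index
    CREATE TABLE dbo.FactSales_CI (
        CustomerKey INT           NOT NULL,
        Amount      DECIMAL(18,2) NOT NULL
    )
    WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED INDEX (CustomerKey));

    -- Variant 3: clustered columnstore index as the table storage format
    CREATE TABLE dbo.FactSales_CCI (
        CustomerKey INT           NOT NULL,
        Amount      DECIMAL(18,2) NOT NULL
    )
    WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX);

    -- Variant 4: load into the heap first, then build the columnstore afterwards
    CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales ON dbo.FactSales_Heap;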

But why is SSIS so slow? The reason is that we have a highly parallel architecture on the SQL Server side, but in order to load data into our Data Warehouse we move the data out of PDW, through the SSIS pipeline and then back into PDW. So the SSIS pipeline becomes our bottleneck. The solution?

We need to read faster in order to keep the SSIS pipeline filled.

We can do this very easily by simply adding more OLE DB source tasks to our data flow, so that it looks like this:

image

The performance results are pretty awesome: we get nearly linear scaling, up to 180 MB/s.

Customer results

We applied this technique to our customer's SSIS packages, and here are the results:

image

Now we also migrated the SSIS logic to T-SQL based ETL and compared again:

image

Does this mean I need to migrate all my SSIS packages to T-SQL code if I use SQL Server Parallel Data Warehouse? Of course not. As I showed, we also get very good performance with SSIS, but you can get better load performance with T-SQL when you need it. So if you need more than 200 MB/s write performance, you are better off with T-SQL based ETL.
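To give an idea of what T-SQL based ETL (really ELT) looks like on PDW, here is a hedged sketch of the basic pattern: transform with a CREATE TABLE AS SELECT (CTAS) statement that runs entirely inside the engine, then swap the tables. All object names are invented, and RENAME OBJECT follows the PDW syntax of that time, so check it against your appliance version:

    -- Transform and load in one fully parallel statement inside PDW
    CREATE TABLE dbo.FactSales_New
    WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX)
    AS
    SELECT s.CustomerKey,
           s.DateKey,
           s.Quantity * s.UnitPrice AS Amount
    FROM dbo.Stage_Sales AS s
    JOIN dbo.DimCustomer AS c ON c.CustomerKey = s.CustomerKey;

    -- Swap the new table in and drop the old one
    RENAME OBJECT dbo.FactSales TO FactSales_Old;
    RENAME OBJECT dbo.FactSales_New TO FactSales;
    DROP TABLE dbo.FactSales_Old;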

Data Compression

The next topic I want to highlight is data compression. We already know data compression from SQL Server, where we can specify the storage format per table or even per partition (uncompressed, row compressed or page compressed). SQL Server 2012 Parallel Data Warehouse ships with Microsoft's in-memory technology xVelocity, which allows us to store complete tables in a column-oriented, highly compressed, memory-optimized format.
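As a quick reminder of what those per-table and per-partition options look like on a plain SQL Server 2012 installation (the table name and partition number below are purely illustrative):

    -- Page compression for the whole table
    ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);

    -- Row compression for a single partition only
    ALTER TABLE dbo.FactSales REBUILD PARTITION = 3 WITH (DATA_COMPRESSION = ROW);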

In our POC we wanted to analyze how much additional compression we can get with that technology:

image

In order to test the compression rates we took 3 different data sets (2 bn., 12 bn. and 80 bn. rows). The first table shows the compression factor on a normal SQL Server 2012, where page compression offers the highest compression rate.

The table below shows the compression rates on PDW. As you can see, a normal heap table already offers a high compression rate, because in PDW data is always compressed. But when we store our data in a column-oriented way in a columnstore table, we get a much higher compression rate. Overall, PDW offers 5–7x better data compression than SQL Server 2012.
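If you want to check table sizes and compression on the appliance yourself, PDW has a DBCC command that reports rows plus reserved, data and index space per distribution. A minimal example with a hypothetical table name:

    -- Rows and space usage for each distribution of the table
    DBCC PDW_SHOWSPACEUSED ("dbo.FactSales");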

Relational Query Performance

The last point I want to highlight is relational query performance. We used a customer reference query for our tests and executed the same query on PDW without any optimizations. We also didn't create any indexes on the tables. Here are the results:

image

The first line of the table above shows the baseline performance of the query on the current customer production system. With no further optimizations we achieved 4x better performance. Querying the data from a columnstore table instead of a heap gives us another 2x, and by adjusting the workload management to use more resources we are able to run the same query 8x faster. PDW has built-in workload management that can be used to assign resources to certain users or groups. So if you have a heavy or very critical query, you can assign more resources to it in order to make it run faster.
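For completeness, here is a hedged sketch of how such a resource assignment can look. It follows the resource-class mechanism documented for later appliance updates and for Azure SQL Data Warehouse (predefined roles such as smallrc, mediumrc, largerc and xlargerc that a user's queries inherit); the exact class names may differ on your PDW version, and 'ReportUser' is a made-up login:

    -- Give the user a larger resource class so its queries get more memory and concurrency slots
    EXEC sp_addrolemember 'largerc', 'ReportUser';

    -- Move the user back to the default (small) resource class afterwards
    EXEC sp_droprolemember 'largerc', 'ReportUser';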

Summary

Based on what we have seen so far I want to quickly summarize the POC results:

  • SSIS based load processes are up to 6x faster
  • T-SQL based load processes are up to 20x faster
  • Relational queries are 8x faster on average
  • 50x data compression (compared to uncompressed data)
  • Performance & storage can be scaled linearly by adding additional PDW nodes
  • The PDW abstraction layer simplifies database administration
    • Automated creation & distribution of database files
    • Automated creation & distribution of database file groups
    • Indexes are rarely needed


Rock your data with SQL Server 2012 Parallel Data Warehouse (PDW) – What’s new?

As I already stated in my first post, I want to share some insights about the new SQL Server 2012 version of Parallel Data Warehouse. In this post I will talk about some new features as well as architecture changes.

Architecture Update

The SQL Server 2012 version of PDW introduces some major architecture changes:

  • As shown in the picture below, the Landing Zone and the Backup node have been removed from the appliance, and there are good reasons for it. The standard sizing of these components didn't meet most customer requirements, and it was very hard to find a configuration that would suit 80% of PDW customers. So the decision was made to remove these components, and customers now have more flexibility to size the ETL and Backup servers to their needs.

image

  • Expensive storage components have been replaced by economical high-density Direct Attached Storage
  • Virtualization of the software components based on Windows Server 2012 and Hyper-V
  • Modular Design and Smarter Scale Out: SQL Server 2012 PDW has reshaped the hardware specifications required of an appliance through software innovations to deliver optimal value to customers. The new version introduces a modular design, as shown in the following picture:

image

The smallest PDW appliance consists of a Base Scale Unit (with a Passive Scale Unit for HA) and can be extended by a number of Capacity Units. If you reach the physical space limit of the rack you can add up to 6 more racks.

Depending on the vendor, the units are shipped differently: HP offers a Base Scale Unit with 2 compute nodes, and every Capacity Unit also ships with 2 compute nodes (left side of the picture). Dell offers the Base Scale Unit and the Capacity Unit with 3 compute nodes each (right side of the picture).

If we look at the capacity, the smallest PDW appliance from HP (a Quarter Rack with only the Base Scale Unit) offers a capacity of 53 – 227 TB (depending on the compression rates) and a raw disk space of 45 TB (with 3 TB disks). The Quarter Rack from Dell provides a capacity of 79 – 3470 TB (depending on the compression rates) and a raw disk space of 68 TB (with 3 TB disks).

  • Further Scale: Based on the new modular hardware design and a hardware refresh, PDW now offers scale-out capacity of up to 5 petabytes.

Software Architecture Overview

The new version of PDW also comes with some major software updates. As the product name already states, PDW now runs on SQL Server 2012. The operating system on all hosts is Windows Server 2012 Standard Edition. All fabric and workload activity happens in Hyper-V virtual machines, which also run on Windows Server 2012 Standard Edition. A PDW Agent runs on all hosts and all VMs and collects appliance health data for fabric and workload.

image

A special PDW version of SQL Server 2012 Enterprise Edition is used on the Control node and on all Compute nodes to provide high scale database capabilities.

New Features

In addition to the new hardware and software architecture I want to highlight some more very interesting features for customers:

  • Columnar Storage: Microsoft continues the rollout of xVelocity and, with SQL Server 2012 PDW, provides a new primary storage type for databases. Customers can now choose between a row store and a new updatable version of the xVelocity memory-optimized columnstore as the table storage format. This means we can define a writable clustered columnstore index on a table (updates and bulk load are fully supported), so the whole table is stored in the memory-optimized, column-oriented format and we benefit from much higher compression.
  • Visual Studio 2012 Integration: As you probably know, the tool used for database administration and database development in PDW Version 1 was Nexus. With the new version of PDW we now have full support for SQL Server Data Tools for Visual Studio 2012.

image   image

The project types for SSIS, SSRS & SSAS are also fully supported with Visual Studio 2012.

  • Monitoring Enhancements: The monitoring of sessions, queries, loads, appliance health status and performance information has been completely redesigned. Microsoft did a very good job here, and the design resembles the Azure Portal. For customers this also means that monitoring and operations will look the same on-premises as in the cloud on Windows Azure. The following screenshot gives an impression of the new web user interface:

image

  • Polybase & Hadoop Integration: Today almost every big database vendor also provides a "Big Data" solution based on Hadoop, whether it's a vendor-specific distribution such as IBM's BigInsights, Microsoft's HDInsight, which is based on the Hortonworks Hadoop distribution, or a full appliance that comes pre-installed with Hadoop. No matter which Hadoop platform you choose, you still face the challenge of integration. Some types of data and analytics will happen on Hadoop (like text, log or sensor analysis), but we will still use databases for BI & reporting.

While this might be something you have decided to do, you realize that there is a big learning curve when your IT department needs to re-orient itself around HDFS, MapReduce, Hive, HBase, etc. rather than T-SQL and a standard RDBMS design. It requires significant re-training around Hadoop and its ecosystem as well as a major effort to integrate the Hadoop implementation with the data warehouse.

To meet the requirements of a modern data platform, it must provide insights to your end users without you having to acquire yet another third-party tool or purchase another expensive appliance.

The unfortunate reality is that no one vendor can deliver all the options you need at a cost you want to pay. They either have a data warehouse solution but no BI, or provide BI but no data warehousing. Some vendors provide a Big Data solution that is disconnected from their data warehouse solution. Finally, some vendors might have a solution for each workload and will happily charge you millions of euros or dollars for them all.

image

Microsoft goes a different way and, with SQL Server 2012 PDW, brings an integrated query layer called "Polybase" which enables queries across Hadoop and SQL Server. Data structures stored in Hadoop are described by tables in PDW, and customers can consume this data with standard T-SQL and can also join those tables with normal relational ones. So for end users who only consume this data it is totally transparent where the data comes from, and IT departments can use their normal database skill set to work with Hadoop. What do these tables look like?

image
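Since the screenshot is not reproduced here, the following is a rough sketch of what such a definition looked like in the PDW build I worked with. The Hadoop host, path and columns are made up, and later appliance updates changed the clause names (introducing separate external data source and file format objects):

    -- A PDW table definition that points at a delimited file set in HDFS
    CREATE EXTERNAL TABLE dbo.ClickStream (
        url        VARCHAR(500) NOT NULL,
        event_date DATE         NOT NULL,
        user_ip    VARCHAR(50)  NOT NULL
    )
    WITH (
        LOCATION = 'hdfs://hadoop-head-node:8020/logs/clickstream/',
        FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
    );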

Now we can easily query and join this table.

image
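And, again with made-up object names, querying and joining the external Hadoop table with a normal relational PDW table is plain T-SQL:

    -- Join web log data stored in Hadoop with a relational customer dimension
    SELECT   c.CustomerName,
             COUNT(*) AS Clicks
    FROM     dbo.ClickStream AS cs
    JOIN     dbo.DimCustomer AS c ON c.LastKnownIP = cs.user_ip
    GROUP BY c.CustomerName
    ORDER BY Clicks DESC;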

Polybase is developed by the PDW team together with the Gray Systems Lab and its famous team lead David DeWitt. More detailed information can be found on the project page: http://gsl.azurewebsites.net/Projects/Polybase.aspx or in the video of the PASS 2012 conference: http://www.sqlpass.org/summit/2012/DavidDewittSpotlight.aspx

Summary

As you can see from this post, Microsoft has invested heavily in its modern Big Data platform and gives customers the possibility to build high-scale solutions on SQL Server 2012 PDW as well as on HDInsight for Hadoop-based workloads. With Polybase, customers get a Hadoop query engine that is fully integrated into the Data Warehouse layer and can easily be consumed with T-SQL knowledge.

The new architecture and the modular design of the appliance give customers the possibility to start with a small investment and scale very cost-efficiently on demand.

In my next post I will talk in more detail about a POC I did with the new version and the results and lessons learned.

Source: Microsoft slide decks of SQL Server 2012 Parallel Data Warehouse

Rock your data with SQL Server 2012 Parallel Data Warehouse (PDW)

I had the chance to do a customer POC on a very early version of SQL Server 2012 Parallel Data Warehouse and would like to share the new PDW features as well as the impressive results we got. The good thing about this POC is that we also did the same POC on the SQL Server 2008 R2 version of PDW, so we have a direct comparison of the results.

If you are not familiar with Parallel Data Warehouse, I suggest my list of earlier posts about PDW as a starting point: https://dwjunkie.wordpress.com/pdw/

I will split this post into several topics that I will explain in more detail in my next posts:


Microsoft's Big Data Appliance

As you probably already heard, a new version of the Microsoft Parallel Data Warehouse will be available in the first half of 2013. So let me quickly go through the highlights of Version 2.

Functionality

The new version will be based on SQL Server 2012 and will also include the columnstore index enhancements. This means Microsoft also ships in-memory technology within its product. Most of the T-SQL limitations that we know from version 1 will also be gone.

Architecture & Hardware

Additionally, the new version will have some architectural changes, a hardware refresh (HP EDW v2) and a new Quarter Rack, which also makes it more attractive for smaller Data Warehouse projects and customers.

Big Data & more

As you probably also heard at the PASS Summit, PDW Version 2 will ship with Polybase. Polybase gives users the ability to query structured data stored in PDW and unstructured data stored in Hadoop in the same way (with T-SQL).

image

This means that Parallel Data Warehouse is and will be Microsoft's strategic Big Data platform and appliance, combining the scale and performance of a relational Data Warehouse with the velocity, variety and volume of unstructured data stored in Hadoop. And as Google already figured out, NoSQL will not be the solution for handling every kind of data with query performance that is acceptable to business users. If there is a certain structure in your data, you probably want to use it.

If you are interested in more details about Polybase have a look at the session of David DeWitt: http://www.sqlpass.org/summit/2012/DavidDewittSpotlight.aspx

Parallel Data Warehouse – POC lessons learned Part 3

The first part of the PDW lessons learned series gave an introduction to the goals of the POC, the customer, the requirements and the test cases. In the second part I gave some more insights into how to migrate the existing code base used for the POC (relational structures, stored procedures, SSIS packages) so that it works with PDW. This last post concentrates on the results and a summary of our findings.

ETL Performance

As shown in my series "SSIS Design for Performance and Scale", a typically designed SSIS package with only one stream in the data flow gives us a write performance of around 25 MB/s. Because data load performance is a critical issue in the current customer environment, one of the original SSIS packages looked like this:

image

What we can see here are 6 sequence containers that are executed as 6 threads in parallel; every container handles a subset of the data. But there are several disadvantages to that kind of design:

  • Complex parallel loading of partitions (partitioned by quarter)
  • Manual parallelism within the package design
  • Multiple implementations of the business logic, one per thread

With PDW we wanted to see how much the engine itself can help us in order to parallelize data loads. So we decided to remove all parallelism from the package so that it looks like this:

image

This gives us the following advantages:

  • Parallelism of the load processes is handled by PDW (let's see)
  • Much simpler SSIS design -> less implementation effort
  • Business logic exists only once -> easier and faster maintenance

So far so good, but what about the performance numbers? Let’s have a quick look at the 2 scenarios we have tested. We used the Landing Zone as host for Integration Services and executed all packages on this server.

image

Scenario 1

In the original customer scenario the SSIS package loads data from one table to another table within the same database. The whole database is stored in PDW.

Scenario 2

Because we had some trouble with the read performance, we decided to put a backup of the database on the landing zone and load the data from the landing zone to the PDW nodes.

Performance Results

The following matrix shows the performance numbers we got during our tests.

image

The column “Production Duration” shows the reference value from the current customer environment. This is our baseline number.

Now we can see 2 interesting behaviors. The first scenario is 2x slower, but the second one is up to 4x faster. So what is the reason for these big differences in performance?

  • Reading data from PDW is slow, around 8 MB/s (OLE DB driver) –> Scenario 1
  • Loading data into PDW is fast, around 45 MB/s –> Scenario 2
  • PDW automatically distributes the loading processes

Important: The product group is aware of the slow OLE DB driver and is working hard to make it faster. We have already seen much better performance with a later Appliance Update.

Summary

I often hear people saying that SSIS is not the right tool for data loading scenarios with PDW. It may not be the fastest option, but we have nevertheless seen good performance numbers. As a reminder, the simplest SSIS package on a normal SQL Server gives us a write performance of 25 MB/s. With PDW we get 45 MB/s, so performance is nearly doubled. If this performance is good enough for you, SSIS is still a good choice, and of course you can additionally apply the same parallelism techniques for further scale. And sometimes you don't have the option to migrate all existing SSIS packages to T-SQL based ETL.

My feeling is that much more performance is possible if the driver becomes faster and more efficient in the future.

The BCP tool (bulk copy utility) is also often recommended for data loading. This is a very fast option if you already get your source data as flat files. But if you first need to create flat files before you can use BCP, you are better off using SSIS with direct connectivity.

Relational Query Performance

Now let's have a look at the relational query part. We got 3 reference queries from the customer that we used for our tests. In order to make the queries work with the PDW database we only had to change the schema name (because only the "dbo" schema is currently supported), and that's it. And the performance? Awesome, as you can see here:

image

The column “Production Duration” shows the reference value from the current customer environment. This is our baseline number.

All queries are up to 8 times faster. This is the real power that PDW offers us: a highly scalable SQL engine with lots of query power. So no surprises here; it works as we expected.

Summary

So what do we learn from this POC? I'd like to recap the most important things we have seen in a short bullet list.

Usability


  • PDW is not (yet) a normal SQL Server. You need to understand the architecture of an MPP system and why it works differently from an SMP system.
  • It's not a no-brainer. You need to think about why you would need to buy an MPP system and which problem it should solve. It's not a "one size fits all" thing.
  • Plan for migration effort. There are features in SQL Server that are currently not supported in PDW. Work closely with the Microsoft team to understand your specific case; there are already a lot of good design patterns you can use to work around the missing features (one such pattern is sketched right after this list).
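As an example of such a design pattern: in the builds I worked with, statements like MERGE and updates joined to other tables were limited or unavailable, and the common workaround was to rebuild the affected table with CTAS and swap it in. A sketch with invented names:

    -- Rebuild the dimension with the updated values instead of updating it in place
    CREATE TABLE dbo.DimCustomer_New
    WITH (DISTRIBUTION = REPLICATE)
    AS
    SELECT c.CustomerKey,
           COALESCE(u.CustomerName, c.CustomerName) AS CustomerName
    FROM dbo.DimCustomer AS c
    LEFT JOIN dbo.Stage_CustomerUpdates AS u ON u.CustomerKey = c.CustomerKey;

    -- Swap the rebuilt table in and drop the old one
    RENAME OBJECT dbo.DimCustomer TO DimCustomer_Old;
    RENAME OBJECT dbo.DimCustomer_New TO DimCustomer;
    DROP TABLE dbo.DimCustomer_Old;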

Performance


  • The relational query engine is fast. You have seen the results, but you can also write statements that do not perform well. Think about data distribution and other techniques and avoid data shuffling (see the sketch after this list).
  • Plan your ETL. As we have seen, we get good parallelism support for SSIS, although the read problem still has to be solved. So the system helps us with built-in parallelism support and we can keep our SSIS packages much simpler for a longer time.
  • Think about ELT. I think that in an MPP environment the SSIS engine itself becomes the bottleneck. We take the data out of a highly parallelized engine, pump it through the buffer-oriented SSIS pipeline and put it back into our highly parallelized engine. This works like a funnel that only scales to a certain point. If you can start from scratch, or when you optimize certain packages, try to use SSIS only to bring the data into PDW and do the transformations within the engine.
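Here is the sketch referenced in the first bullet. The basic lever against data shuffling in PDW is the table distribution: hash-distribute large fact tables on a key that is also used for joins, and replicate small dimension tables to every compute node so joins to them stay local. Names are invented:

    -- Large fact table: hash-distributed on the join key so co-located joins need no data movement
    CREATE TABLE dbo.FactSales (
        SaleKey     BIGINT        NOT NULL,
        CustomerKey INT           NOT NULL,
        Amount      DECIMAL(18,2) NOT NULL
    )
    WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX);

    -- Small dimension: a full copy on every compute node, so any join to it is local
    CREATE TABLE dbo.DimCustomer (
        CustomerKey  INT           NOT NULL,
        CustomerName NVARCHAR(200) NOT NULL
    )
    WITH (DISTRIBUTION = REPLICATE);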

Feel free to draw your own conclusions from the presented results, or contact me with further questions or feedback. Don't hesitate, just let me know.
