As I already stated out in my first post I want share some insights about the new SQL Server 2012 version of Parallel Data Warehouse. In this post I will talk about some new features as well as architecture changes.

Architecture Update

The SQL Server 2012 version of PDW introduces some major architecture changes:

  • As show in the picture below the Landing Zone and the Backup node have been removed from the appliance and there good reasons for. The standard sizing of these component didn’t met most of the customer requirements and it was very hard to find a configuration that meets 80% of PDW customers. So decision has been made to remove these components and customers have now more flexibilities to size the ETL and the Backup node to their needs.

image

  • Expensive storage components have been replace by an economical high density Direct Attached Storage
  • Virtualization of the Software Components based on Windows Server 2012 and Hyper-V
  • Modular Design and Smarter Scale Out: SQL Server 2012 PDW has reshaped the very hardware specifications required of an appliance through innovations from the software to deliver optimal value to customers. The new version introduces a new modular design, as show in the following picture:

image

The smallest PDW appliance consists of a Base Scale Unit (with a Passive Scale Unit for HA) and can be extended by a number of Capacity Units. If you reach the physical space limit of the rack you can add up to 6 more racks.

Depending of the vendor the units are shipped differently: HP offers a Base Scale Unit with 2 compute nodes and every Capacity Unit is also shipped with 2 compute nodes (left side of the picture). Dell offers the Base Scale Unit and the Capacity Unit with 3 compute nodes each (right side of the picture).

If we have a look at the capacity the smallest PDW appliance from HP (Quarter Rack with only the Base Scale Unit) offers a capacity of 53 – 227 TB (depending on the compression rates) and a raw disk space of 45 TB (with 3 TB disks). The Quarter Rack of DELL provides a capacity of 79 – 3470 TB (depending on the compression rates) and a raw disk space of 68 TB (with 3 TB disks).

  • Further Scale: Based on the new modular hardware design and a hardware refresh PDW offers now a scale out capacity up to 5 Petabytes.

Software Architecture Overview

The new version of PDW provides also some major software updates. As the product name already states PDW is now running on SQL Server 2012. The operating system on all hosts is Windows Server 2012 Standard edition. All fabric and workload activity happens in Hyper-V virtual machines which also run on Windows Server 2012 Standard edition. A PDW Agent runs on all hosts and all VMs and collects appliance health data on fabric and workload.

image

A special PDW version of SQL Server 2012 Enterprise Edition is used on the Control node and on all Compute nodes to provide high scale database capabilities.

New Features

Additionally to the new hardware and software architecture I want to highlight some more very interesting features for customers:

  • Columnar Storage: Microsoft continues the rollout of xVelocity and provides with SQL Server 2012 PDW a new primary storage type for databases. Customers can now choose between a row store and a new updateable version of the xVelocity memory optimized columnstore as table storage format. This means that we can define a writable Clustered Columnstore Index (Updates and bulk load are fully supported) at a table so that the whole table is stored into memory and we benefit of a much higher compression by the column oriented storage format.
  • Visual Studio 2012 Integration: As you probably know the tool used for database administration and database development in PDW Version 1 was Nexus. With the new version of PDW we have now full support for SQL Server Data Tools for Visual Studio 2012.

image   image

Also the project types for SSIS, SSRS & SSAS are fully supported with Visual Studio 2012.

  • Monitoring Enhancements: The whole monitoring of sessions, queries, loads, appliance health status and performance information has been completely redesign. Microsoft did a very good job here and the design looks like the Azure Portal. This means also for customers that monitoring and operations will look like the same On Premises as well as in the Cloud on Windows Azure. The following screenshot gives an impression of the new web user interface:

image

    • Polybase & Hadoop Integration: Today almost every big database vendor also provide a “Big Data” solutions based on Hadoop. Whether it’s a vendor specific distribution like IBM’s BigInsights solution, which is based on Cloudera or Microsoft’s specific implementation of Hortonworks Hadoop distribution called HDInsight or a full appliance which comes pre-installed with Hadoop. No matter which Hadoop platform you choose you still have the challenge of integration. Some type of data and analytics will happen on Hadoop (like text, log or sensor analysis) but we will still use databases for BI & Reporting.

While this might be something you have decided to do, you realize that there is a big learning curve when your IT department needs to re-orient themselves around HDFS, MapReduce, Hive, Hbase, etc. rather than T-SQL and a standard RDBSMS design. It will require a significant re-training around Hadoop and the ecosystem as well as a major effort to integrate the Hadoop implementation with the data warehouse.

In order to meet the requirements of a modern data platform, it must provide insights to your end users without having to acquire another tool from a third party or another expensive appliance offering to purchase.

The unfortunate reality is that no one vendor can deliver on all the options you need at a cost that you want to pay. They either have a data warehouse solution but no BI or provide BI but no data warehousing. Some vendors provide a Big Data solution but is disconnected with their data warehouse solution. Finally, some vendors might have a solution for each workload and will happily charge you millions of euros or dollars for them all.

image

Microsoft goes a different way and brings with SQL Server 2012 PDW an integrated query layer called “Polybase” which enables queries across Hadoop and SQL Server. Data structures stored in Hadoop are described by tables in PDW and customers can consume these data by standard T-SQL and can also join those tables with normal relational ones. So to end users that only consume this data it’s totally transparent where the data comes from and IT departments can use their normal database skillset to work with Hadoop. How does those kind of tables look like?

image

Now we can easily query and join this table.

image

Polybase is developed by the PDW team together with the Gray System Lab and its famous team lead David DeWitt. More detailed information can be found on the project page: http://gsl.azurewebsites.net/Projects/Polybase.aspx or in the video of the PASS 2012 conference http://www.sqlpass.org/summit/2012/DavidDewittSpotlight.aspx

Summary

As you can see from this post Microsoft did very heavy investments in its modern Big Data platform and gives customers the possibility to invest build high scale solutions on SQL Server 2012 PDW as well as on HDInsight for Hadoop based workloads. With Polybase customers get a fully integrated Hadoop query engine into the Data Warehouse Layer that can easily consumed with T-SQL knowledge.

The new architecture and the modular design of the appliance gives customers the possibility to start with small investments and scale very cost efficient on demand.

In my next post I will talk in more detail about a POC I did with the new version and the results and lessons learned.

Source: Microsoft slide decks of SQL Server 2012 Parallel Data Warehouse