Parallel Data Warehouse (PDW) POC – lessons learned

image

The first version of Microsoft’s Parallel Data Warehouse is out for a while, now I had the chance to get my hands on it during a customer POC. Because PDW is an appliance solution the software is hardware bounded. You can’t download and install PDW on a normal server, you need the right hardware which needs to be MPP capable. Currently there are only 2 vendors providing PDW hardware, HP and DELL.

For all of you who need a further introduction into PDW I can recommend my past blog posts:

Customer Requirements

So let’s go back to the customer POC. My customer currently has a Data Warehouse solution build on SQL Server. The following list provides some high level information about the environment:

  • Data Volume: 15 TB
  • Number of users: 200
  • Biggest Dimension: 20 mill. records
  • Biggest Fact Table: 750 mill. records
  • Number of relational Data Warehouse layers: 3
  • Number of data sources: 15
  • Number of SSIS packages: 400
  • Daily processed data: 50 mill. records

Since his current hardware is not able to scale beyond this size he is looking for a new solution. Together with HP & Microsoft we started a 3 weeks POC to check if PDW can meet customer needs. Without going to much into detail let me quickly highlight the current customer pain points.

imageETL Performance: Currently all data loads are implemented with SQL Server Integration Services. Altogether there are about 400 SSIS package that are responsible for source system connectivity and to support data loading over 3 data warehouse layers. In the current environment data loads are done daily and have a predefined time frame for execution. The customer requirement was to get a least the same SSIS-Performance as on the current production system and to be able to scale SSIS in case of adding additional load processes to the daily load without breaking the current loading window.

imageRelational Query Performance: Today nearly 80% of all customer analysis is done based on an Analysis Services cube. The cube gives us the ability to present the complex data model and business logic in a simplified form that end users understand and are able to work with. But there is also a small amount of technical affine power users which want to be able to analyze data directly on the relational data model. So the demand for a scalable relational reporting solution is growing.

imageScalability: As mentioned before my customer is looking for a scalable Data Warehouse solution that is able to handle up to 50TB of relational data (estimated growth in the next 2 years).

 

 

imageAdministration: This area is not a real critical success criteria but the customer is also interested in administration efforts and IT operation processes like backup / restore, database administration, database reorg, etc.

 

 

Test cases

As you can imagine we can’t test all this within 3 weeks and there are also points that doesn’t make sense to test on your own, like scalability. I think based on the given MPP architecture it’s totally clear, that the system is able to scale for data volume. If you need more space, you need to buy an additional rack, this is how it works.

So clearly we had to focus on what we can really test within 3 weeks and what are the most important things the customer wants to see that are working so that he is able to make a decision to go ahead with PDW or not. Together with the team we agreed that we have a deeper look at ETL Performance and relational query performance.

ETL Performance

We decided to take a representative ETL process and run it against PDW.  This process includes:

  • 5 SSIS packages
  • 33 tables
  • 3 Stored Procedures (SP)
  • a User Defined Function (UDF)
    The first question that came up was “where do we executed the SSIS packages”?

As we all know from Best Practices it is recommended to have a dedicated environment for ETL and one for the relational engine. But what about PDW?

PDW is a solution for a relational SQL Server engine based on a MPP architecture. So the Best Practice about a dedicated ETL server is still valid. The good news is, that within the PDW architecture there is also a small ETL server included.

image

Why small? The server itself is not a high end scalable ETL server (it only has 6 Cores, 24 GB RAM, connected via Infiniband to the other servers). And because PDW is an appliance there is no option to put in more cores or memory in. So for the POC this server was good enough but for a real life project scenario an additional ETL server with enough hardware resources would be needed.

Relational Query Performance

To measure relational query performance the customer provided us 3 representative relational SQL queries with corresponding runtimes on the current environment. So we took these and executed them against PDW.

Next Steps

Within Part 2 of this PDW POC series I will dig into more details about our test cases, about the migration, issues we  faced with and of course the results and lessons learned we got. So stay tuned…

7 thoughts on “Parallel Data Warehouse (PDW) POC – lessons learned

  1. Bernhard says:

    Hi DWJunkie,

    SQL Server has the possibiltiy to show a graphical execution plan. As I no PDW has not this option. But there is the explain command, with this help it is possible to produce the execution plan for a query. The result is delivered in XML. Is there a tool to look at this data graphically? How did you work with that?

    Thanks
    Bernhard

    • Hi Bernhard,

      I don’t think so. The problem with PDW is, that it is a distributed node architecture. This means that the query is split up into different chunks and each chunk runs on a different node. This means you have different execution plans per node and then an overall one. But so far it is not possible to see the execution plans.

      DWJunkie

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s