
Parallel Data Warehouse – Slowly Changing Dimension

April 1, 2012

Most of my day-to-day work currently centers around Microsoft's Massively Parallel Processing appliance called PDW (Parallel Data Warehouse). Many of the techniques common to SMP (Symmetric Multiprocessing) systems, such as the Kimball-method Slowly Changing Dimension (SCD), are still very important.

Identifying the deltas between an incoming dataset and the destination dimension table can be one of the most resource-intensive, complex, and valuable processes in an SMP or PDW ETL system. Careful planning is necessary to ensure that the ETL process performs well and is scalable. Nowhere is this more important than on a PDW, which can process tens to hundreds of terabytes daily!
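To make the pattern concrete, here is a minimal set-based sketch of the classic type 2 expire-and-insert logic. The stg_Customer and dim_Customer tables, their columns, and the change test are all hypothetical, and the original PDW SQL dialect had DML restrictions, so treat this as the standard T-SQL shape rather than appliance-ready code.

-- Step 1: expire the current row for any customer whose tracked attributes changed.
UPDATE d
SET    d.RowEndDate = GETDATE(),
       d.IsCurrent  = 0
FROM   dbo.dim_Customer AS d
JOIN   dbo.stg_Customer AS s
       ON s.CustomerID = d.CustomerID
WHERE  d.IsCurrent = 1
  AND (s.City <> d.City OR s.Segment <> d.Segment);

-- Step 2 (run after Step 1): insert a new current row for both brand-new
-- customers and the customers whose old row was just expired above.
INSERT INTO dbo.dim_Customer
       (CustomerID, City, Segment, RowStartDate, RowEndDate, IsCurrent)
SELECT  s.CustomerID, s.City, s.Segment, GETDATE(), NULL, 1
FROM    dbo.stg_Customer AS s
LEFT JOIN dbo.dim_Customer AS d
       ON d.CustomerID = s.CustomerID
      AND d.IsCurrent  = 1
WHERE   d.CustomerID IS NULL;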

My previous post Building the Enterprise DW/BI System with SQL Server PDW describes the basic PDW components discussed below.

The following gives some real-world SCD statistics and lessons learned.

SCD – Replicated vs. Distributed Tables

Replicated Table ‘A’ – 28,977,757 rows – 34 min

Distributed Table ‘A’ – 28,977,757 rows – 21 seconds!

The difference between SCD on a replicated and a distributed table is dramatic. I expected that the distributed table might perform slightly faster because the table is distributed across all the compute nodes, maximizing the parallelism of the PDW. However, I didn't expect such a huge difference. (Note: 34 minutes to process more than 28 million rows using type 2 SCD is not bad either. Try doing that on your SMP!)

I found the reason for the difference in the query plan. The query plan contains several steps, and the most costly are the SCD DML operations (the updates that expire rows). Basically, for each update statement against a replicated table the PDW:

1 – Creates a copy of the target table/dimension table

2 – Performs the update on the copy

3 – Publishes the updated copy to each node

See the example below, which shows part of the SCD logic for a 1.5-million-row table:

(Figure: SCD update query plan against a replicated table)

Conclusion: Distributed tables can have superior DML performance, especially for large tables. However, this SCD performance increase comes at the cost of query performance, which will be slower when queries do not reference the distribution column. From my experience, I would recommend considering converting a replicated table to a distributed table for SCD DML performance if the table is greater than 5-10 million rows.
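For context, table geometry is declared in PDW's CREATE TABLE DDL. The sketch below is illustrative only; DimCustomer and its columns are hypothetical, but DISTRIBUTION = REPLICATE and DISTRIBUTION = HASH(column) are the two options being compared above.

-- Replicated: every compute node stores a full copy of the table,
-- which is why an update must rebuild and republish the whole table.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey  BIGINT       NOT NULL,
    CustomerID   INT          NOT NULL,
    City         VARCHAR(50)  NULL
)
WITH (DISTRIBUTION = REPLICATE);

-- Distributed: rows are hash-distributed across the nodes by CustomerKey,
-- so each node updates only its own slice of the table, in parallel.
CREATE TABLE dbo.DimCustomer_Distributed
(
    CustomerKey  BIGINT       NOT NULL,
    CustomerID   INT          NOT NULL,
    City         VARCHAR(50)  NULL
)
WITH (DISTRIBUTION = HASH(CustomerKey));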

Distribute the Table on a Key Column

If you choose to make the destination table/dimension table a distributed table, it is important to distribute it on one of the target table's key column(s); otherwise, the SCD DML operations will incur a heavy performance hit from ShuffleMove operations, because the compute nodes must shuffle/move data in order to satisfy the SCD DML queries.

72 million row Distributed Table

Not Distributed on a Key Column – 13 min.

Distributed on a Key Column – 30 seconds!

Example: Not Distributed on a Key Column:

(Figure: query plan showing ShuffleMoves when distributed on non-key columns)

Example: Distributed on a Key Column:

(Figure: query plan when distributed on a key column)

There is still a small ShuffleMove, but the most expensive ShuffleMoves have been eliminated. It might not always be possible to eliminate all ShuffleMoves.
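A practical way to check for this is PDW's EXPLAIN statement, which returns the distributed execution plan as XML without executing the statement; shuffle steps appear as SHUFFLE_MOVE operations in the plan. The table and column names below are hypothetical, carried over from the earlier sketch.

-- Prefix the DML with EXPLAIN to see the distributed plan without running it.
-- If both tables are hash-distributed on CustomerID, the join below is
-- distribution-compatible and no SHUFFLE_MOVE step should appear in the plan.
EXPLAIN
UPDATE d
SET    d.IsCurrent = 0
FROM   dbo.dim_Customer AS d
JOIN   dbo.stg_Customer AS s
       ON s.CustomerID = d.CustomerID;   -- join on the distribution column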

Conclusion:

Microsoft’s Parallel Data Warehouse offers a huge advantage over traditional SMP systems in Slowly Changing Dimension DML processing performance.

Microsoft Data Warehouse Offerings

October 6, 2011

James Serra has recently posted a lot of good information about Microsoft’s data warehouse offerings, along with links to further reading.

Microsoft SQL Server Reference Architecture and Appliances

Microsoft SQL Server Parallel Data Warehouse (PDW) Explained

The white paper Implementing a Microsoft SQL Server Parallel Data Warehouse Using the Kimball Approach is also a great way to go a little deeper into Microsoft’s Parallel Data Warehouse.

This white paper explores how the Kimball approach to architecting and building a data warehouse/business intelligence (DW/BI) system works with Microsoft’s Parallel Data Warehouse, and how you would incorporate this new product as the cornerstone of your DW/BI system. For readers who are not familiar with the Kimball approach, we begin with a brief overview of the approach and its key principles. We then explore the Parallel Data Warehouse (PDW) system architecture and discuss its alignment with the Kimball approach. In the last section, we identify key best practices and pitfalls to avoid when building or migrating a large data warehouse to a Microsoft SQL Server PDW system. This includes a look at how Parallel Data Warehouse can work with other Microsoft offerings such as SQL Server 2008 R2, the Microsoft Fast Track Reference Architecture for Data Warehouse, and the new Business Data Warehouse to provide a complete solution for enterprise-class data warehousing.

I have even proposed my own ideas to help with SSIS ETL Scalability.

As a side note, I was recently drawn into a discussion of Kimball vs. Inmon. In my experience, most people who say they understand both really don’t. Here are two good articles from the Kimball Group that explain some of the differences.

The Kimball bus architecture and the Corporate Information Factory: What are the fundamental differences?

Facts and Fables About Dimensional Modeling

Honestly, I believe that reading all of Kimball’s books should be a requirement for anyone who wants to be called a Microsoft BI Professional.

