Most of my day-to-day work is currently centered around Microsoft’s Massively Parallel Processing Appliance call PDW (Parallel Data Warehouse). Many of the same techniques common to SMP (Symmetric Multiprocessing)systems such as Kimball method Slowly Changing Dimension (SCD) are still very important.
Identifing the deltas between an incoming dataset and the destination dimension table can be one most resource intensive, complex, and valuable processes in a SMP or PDW ETL system. Careful planning is necessary to ensure that ETL process preforms well and is scalable. Nowhere is this more important than in a PDW that can process 10s to 100s of terabytes daily!
My previous post Building the Enterprise DW/BI System with SQL Server PDW describes the basic PDW components discussed below.
The following gives some real world SCD statistics and lessons learned.
SCD – Replicate vs. Distributed tables
Replicated Table ‘A’ – 28,977,757 rows – 34 min
Distributed Table ‘A’ – 28,977,757 rows – 21 seconds!
The difference between SCD on a replicated and distributed table is dramatic. I expected that the distributed table might preform slightly faster because the table is distributed across all the compute nodes and so therefore it maximizes the parallelism of the PDW. However I didn’t expected such a huge difference. (Note: 34 minutes to process 28 million + rows using type 2 SCD is not bad either. Try doing that on your SMP )
I found the reason for the difference in the Query Plan. The Query Plan contains several steps. The most costly steps are SCD DML operations (update expire rows) steps. Basically for each update statement the PDW;
1 – Creates a copy of the Target table /Dimension Table
2 – Preforms the update on the copy
3 – Publishes the updated copy to each node
See example below which shows part of the SCD logic of a 1.5 million row table:
Conclusion: Distributed tables can a have superior DML performance, especially for large tables, however this SCD performance increase will come at the cost of query performance which will be slower if the queries do not contain the distribution column. From my experience I would recommend considering the possibility of converting a replicated tabled to a distributed table for SCD DML performance improvement if the table is great than 5-10 million rows.
Distribute Table on a Key Column
If you chose to make the Destination Table/Dimension Table a Distributed Table it is important to distribute the table on one of the Traget Tables Key Column(s), other wise the SCD DML operations will incur a heavy performance hit of ShuffleMoveOperation because the processing nodes must shuffle/move data in order to satisfy the SCD DML queries.
72 million row Distributed Table
Not Distributed on a Key Column – 13 min.
Distributed a Key Column – 30 seconds!
Example: Not Distributed on a Key Column:
Example: Distributed on a Key Column
There is still a small ShuffleMove but the most expensive SufflesMoves have been eliminated. It might not always be possible eliminated all ShuffleMoves.
Microsoft’s Parallel Data Warehouse offers a huge advantage in Slowly Changing Dimension (DML) processing performance over traditional SMP systems.