Note: This post is a continuation of my previous post entitled ‘Metadata Driven ETL Process’. If you have not already done so, download Metadata ETL Demo.zip and walk through step-by-step the SQL script.
As a review, the purpose of the Metadata Driven ETL Process was to create a robust and flexible Metadata Driven ETL Process that addresses some the most challenging situations that I have found. This is primarily through the use of metadata columns which allow for a portion (or the entire) Dimension table to be re-created in the event of a disaster (or bad ETL logic) without any data loss, even when the source system does not contain historical changes.
While there are many positives to this approach there is one big downside. The Metadata Driven ETL Process still has to load the entire source system stage tables into memory, sort and join them together, and do a Row-by-Row comparison of each field in each row to identify which field caused the row to be changed.( If your source system table is on a SQL Server 2008 or greater DB you can use the MERGE statement, thereby completely bypassing the SSIS lookup task)
<rant> I know that in some cases you are provided with a ‘Last Change’ datetime stamp metadata column from the source table. However, most of these are implemented through triggers which can and often are by-passed by a DBA for performance reasons, especially in data recovery situations. Unless you can get the DBA to sign in blood. I suggest not relying on triggers</rant>
Even then, only doing a lookup between a single source system table and a single stage table preforms a lot better than the standard ETL Process (sorting and joining all the source systems tables together and then sorting joining that to the target dimension table and finding all the changes). See the SSIS ETL Scalability for more details.
But for large dimensions (millions of members) this can still be a lot of processing. Change Data Capture was developed to make finding changes in even large dimensions take only a few seconds! Read the following to get the best possible performance out of CDC: Tuning the Performance of Change Data Capture in SQL Server 2008
It was the output to CDC that gave me the first idea for the Metadata Driven ETL Process but as great as CDC is there are a few negatives. It is only as accurate as the transaction log file :) There times when the transaction log file doesn’t have the changes that you need (the initial load of the stage table) or you don’t want changes in the transaction log to be reflect in the dimension table.
I recently ran into a situation where, as part of a source system upgrade, the tables were truncated and completed reloaded. Initially the data was bad even after reloading the source table. It took them a few tries to get it right (yes this happened in production :) . As a result even though my dimension table was accurate in terms of preserving all the changes, it made analysis inaccurate for the time period that the upgrade was occurring.
To address this situation and many more I combined CDC for performance with ability to use Row-By-Row on an ad-hoc basis for flexibility.
How does this all work together? The answer is spectacularly!!!
Download Metadata Driven ETL with Change Data Capture.zip and follow each step of the Change Data Capture Demo.sql script