SQL Server Database Performance Tuning: ETL the Right Way?

The good thing about the Microsoft SQL Server Database Engine is that there are many ways to perform a given task, and the bad thing about the SQL Server Database Engine is that there are many ways to perform a given task. Sometimes it can be difficult to choose the best option for your situation, so most people naturally go with what they are most familiar with.

Take choosing an ETL process as an example. At its most basic level, an ETL process simply moves data from point A (the source) to point B (the data warehouse), but there are many ways to do this, and each one has its own inherent advantages and disadvantages. The hard part is that it is often not obvious that you have chosen the wrong ETL process until months or years later, when you start having performance and scalability issues with your ETL and the SQL Server data warehouse database. In some instances people get so frustrated that they just throw more hardware at the problem. While your hardware vendor might love this approach, in most cases it does not address the root cause, which is usually an architectural problem rather than a hardware problem.

The following is a list of some of the ways that an ETL process can be set up with Microsoft SQL Server. Note: this is not a complete list; I have just tried to include the most common methods that I have seen.

OLTP-based ETL-ish

  • Database Mirroring or SQL Server 2012 AlwaysOn Availability Groups
  • Replication (Transactional, Merge, Peer-to-Peer)
  • Log Shipping

Data Warehouse ETL

  • Change Data Capture – CDC
  • Integration Services – SSIS
  • BCP and Flat-Files

OLTP-based ETL

The OLTP-based ETL techniques are usually employed to offload the overhead of reporting and analytical queries to a secondary backup or “data warehouse” SQL Server. While they accomplish this goal reasonably well, they all come with inherent disadvantages that make them inappropriate for most SQL Server data warehouse ETL scenarios. The first is that every single DML operation is replayed on the “data warehouse” SQL Server database. This is great if you want a transactionally consistent warm backup, but it will hurt performance if that server happens to be a SQL Server “data warehouse.” For example, if a row is inserted into the source and then updated 100 times throughout the day, all of those DML transactions must be performed on the secondary “data warehouse” SQL Server, which also has to support all of the reporting and analytical queries. You may be able to squeeze a little more performance out of the “data warehouse” SQL Server with tricks like changing the isolation level to READ UNCOMMITTED, but that is a band-aid fix for a much larger issue.
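For what it is worth, the “trick” itself is a one-liner. The sketch below is purely illustrative: the dbo.Orders reporting table is hypothetical, and queries run under this isolation level can return dirty, uncommitted data.

```sql
-- The isolation-level "band-aid": reporting sessions read without taking
-- shared locks, so the replayed DML does not block them, but they may
-- return uncommitted (dirty) rows.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

SELECT SalesRegion, SUM(OrderTotal) AS Revenue
FROM dbo.Orders            -- hypothetical reporting table
GROUP BY SalesRegion;
```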

Scalability can also quickly become an issue. If your “data warehouse” SQL Server, which only has 5 sources, runs fine right now with only a few minor issues, what happens when the number of sources grows to 50 or 500? That is a 10- to 100-fold increase in the volume of DML transactions that need to be committed on the SQL Server data warehouse. I hope that your resume is up to date 😉

Aside from the issues mentioned above, additional processing will have to be done on the “data warehouse” SQL Server to find the deltas so that a traditional star schema (fact and dimension tables) can be populated.
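As an illustration of the kind of extra work that ends up on the “data warehouse” server, a delta pass over the replayed tables might look like the sketch below. The DimCustomer and Customer tables and their columns are hypothetical, assuming a simple Type 1 dimension.

```sql
-- Hypothetical Type 1 dimension load: compare the table replayed from the
-- OLTP source against the dimension to find the deltas the replication
-- stream does not hand you directly.
MERGE dbo.DimCustomer AS d
USING dbo.Customer    AS s          -- table replayed from the OLTP source
      ON d.CustomerKey = s.CustomerID
WHEN MATCHED AND (d.CustomerName <> s.CustomerName OR d.City <> s.City) THEN
    UPDATE SET d.CustomerName = s.CustomerName,
               d.City         = s.City
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerKey, CustomerName, City)
    VALUES (s.CustomerID, s.CustomerName, s.City);
```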

To be honest, most of the environments that use OLTP-based ETLs have not implemented basic data warehouse best practices, such as star schemas, and so they will naturally pay a price in performance and scalability down the road.

SQL Server Data Warehouse ETL

The SQL Server data warehouse ETLs are better aligned with daily ETL workloads. They extract the data as it exists at a specific point in time. This is important because the SQL Server data warehouse does not have to process every DML transaction leading up to that point in time. After the nightly extraction and transformation processes are complete, all of the SQL Server's resources can be dedicated to reporting and analytical queries, which allows for better performance and scalability.
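As a rough illustration of the point-in-time idea, a nightly extract only pulls the current state of the rows touched during the load window instead of replaying every intermediate change. The dbo.Orders source table, its ModifiedDate column, and the window values are all assumptions for this sketch.

```sql
-- Hypothetical nightly extract: pull rows as they exist right now, bounded
-- by a load window, rather than replaying every intermediate DML operation.
DECLARE @WindowStart datetime2 = '2012-06-01T00:00:00';
DECLARE @WindowEnd   datetime2 = '2012-06-02T00:00:00';

SELECT OrderID, CustomerID, OrderTotal, ModifiedDate
FROM dbo.Orders                        -- hypothetical source table
WHERE ModifiedDate >= @WindowStart
  AND ModifiedDate <  @WindowEnd;
```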

Incremental data extraction techniques, such as Change Data Capture (CDC), scale very well because only the deltas are extracted from the source, not the full data sets. CDC also has the additional benefit of being able to tell you which DML operation produced the new row version and even which columns were changed! This is great because it can greatly reduce the processing required for the transformation step that converts the data into a dimensional format.
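Here is a rough sketch of how that looks in practice. The dbo.Orders table is hypothetical, and a real ETL would persist the last-processed LSN between runs rather than calling sys.fn_cdc_get_min_lsn every time.

```sql
-- Enable CDC on the database and on a hypothetical source table.
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
     @source_schema        = N'dbo',
     @source_name          = N'Orders',     -- hypothetical source table
     @role_name            = NULL,
     @supports_net_changes = 1;

-- Later, the ETL asks only for what changed since the last load window.
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Orders');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT __$operation,        -- 1 = delete, 2 = insert, 4 = update
       __$update_mask,      -- bitmask of the columns that changed
       *
FROM cdc.fn_cdc_get_net_changes_dbo_Orders(@from_lsn, @to_lsn, N'all with mask');
```

The __$operation and __$update_mask columns are what let the transformation step skip work it does not need to do.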

Of course, this approach assumes that you have gone through the effort of building a solid dimensional model in the first place. If you have, then things like the Fast Track reference architectures, Parallel Data Warehouse, page compression, and columnstore indexes will work beautifully to scale your data warehouse into multiple petabytes if needed!
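On a well-modeled fact table, two of those features are each a single statement. The FactSales table and its columns below are hypothetical, and note that SQL Server 2012 columnstore indexes are nonclustered and make the table read-only.

```sql
-- Page-compress a hypothetical fact table, then add a nonclustered
-- columnstore index for scan-heavy warehouse queries.
ALTER TABLE dbo.FactSales
    REBUILD WITH (DATA_COMPRESSION = PAGE);

CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_FactSales
    ON dbo.FactSales (DateKey, CustomerKey, ProductKey, SalesAmount);
```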

In short, use the right tool for the right job, and in the long term that will save you a lot of extra work and pain. The old adage is correct: when all you have is a hammer, everything looks like a nail. So don't try hammering a screw or you will get screwed!
