Archive for August, 2012

Great blog posts

August 19, 2012

I don’t usually do this, but there have been recent blog posts that, due to the importance of the information being shared, I would like to call attention to and preserve their conclusions on my blog as well.

Thomas Kejser has created a great series of posts explaining some of the fundamentals of Columnstore Indexes. I particularly enjoy his latest post, entitled “What is the Best Sort Order for a Column Store?”, because it gives some SQL queries that can be used to calculate things such as data entropy.

For example, the relative frequency p(xi) of each distinct value of a column X:

-- p(xi): the fraction of rows holding each distinct value of X
SELECT X
, CAST(COUNT(*) AS FLOAT) / SUM(COUNT(*)) OVER (PARTITION BY NULL) AS pX
FROM [Table]
GROUP BY X

And the entropy of the column, H(X) = -SUM( p(xi) * LOG2( p(xi) ) ):

-- H(X): the entropy of column X, in bits
-- (the two-argument LOG(x, base) requires SQL Server 2012 or later)
SELECT -1 * SUM(pX * LOG(pX, 2)) AS H
FROM
(
    SELECT X
    , CAST(COUNT(*) AS FLOAT) / SUM(COUNT(*)) OVER (PARTITION BY NULL) AS pX
    FROM [Table]
    GROUP BY X
) AS p
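
As a quick sanity check of the entropy query (my own toy example, not from Thomas’s post), a column with four equally likely values should come out at exactly 2 bits:

-- Four equally likely values => H(X) = 2 bits
SELECT -1 * SUM(pX * LOG(pX, 2)) AS H
FROM
(
    SELECT X
    , CAST(COUNT(*) AS FLOAT) / SUM(COUNT(*)) OVER (PARTITION BY NULL) AS pX
    FROM (VALUES ('A'), ('B'), ('C'), ('D')) AS t(X)
    GROUP BY X
) AS p;   -- returns 2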

The Data Warehouse Junkie also wrote a post entitled “Design SSIS for Performance and Scale – Baseline tests” which demonstrates some solid, proven conclusions based on the Fast Track recommendations.

On SSIS degrees of parallelism and data fragmentation: “the DOP setting has no influence on our SSIS performance and this also means that the BULK API is single threaded (because performance does not increase). As you can also see we don’t get any fragmentation of our data when we increase the DOP setting.”

This is very good to know.

I also think that his conclusion about enabling page compression matches what I have seen on the Fast Track systems I have worked with. Enabling page compression comes at the cost of increased load times but gains better I/O throughput (sequential reads/writes in MB/sec). For most DW systems this is a good trade-off, but if your DW is not I/O bound and you have a very tight load window, then you might want to consider not using page compression.

In either case, the most important thing to do is to test it first before you make any decisions!
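
For page compression specifically, SQL Server can estimate the savings before you commit to a rebuild, so a quick test is cheap. A sketch, with dbo.FactSales standing in as a placeholder for your own table:

-- Estimate what PAGE compression would save on a (placeholder) fact table
EXEC sp_estimate_data_compression_savings
     @schema_name = 'dbo'
   , @object_name = 'FactSales'
   , @index_id = NULL
   , @partition_number = NULL
   , @data_compression = 'PAGE';

-- Then rebuild with PAGE compression and compare load times and scan MB/sec
ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);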

Scale Out SSIS

August 12, 2012

There are many ways to parallelize SSIS ETL processing, and there are plenty of examples on the net already, so I am not going to cover the full variety of methods here.

One of the resources that describes some of the more standard methods is Matt Masson’s presentation SSIS Design Patterns.

What I would like to describe is one way to scale out SSIS ETL processing for a specific use case.

Scenario:

    • There are many sources that all have identical schemas
    • You have multiple SSIS servers

The standard approach is to develop an SSIS package for each source and then spread the SSIS packages among the SSIS servers, scheduling them to run in parallel as needed. While this approach works well, I recently had to architect a solution where the number of sources was constantly changing: new sources were constantly being brought online or disabled, and developing a new SSIS package for each one would take too long.

What I developed was a Round_Robin table that contained a list of all the source databases along with their connection information. Since their schemas were identical for the data being extracted, I could develop one SSIS package and then deploy many copies of that same package to the SSIS servers.

The packages were developed in such a way that, when executed, they would first look up a source from the Round_Robin table with a status of “enabled”, get the connection information, extract the data, and then mark the source as “complete”. This cycle would repeat until there were no sources left to extract. Additionally, since multiple packages with the same logic lived on the server, I could execute as many of them as needed; each would just grab the next source database from the Round_Robin table and loop until all the source databases were complete.
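
To make that concrete, here is a minimal sketch of the claim step in T-SQL, assuming a Round_Robin table with DB_ID, DatabaseName, and Status columns (the names are illustrative, not the exact production code). The UPDLOCK and READPAST hints let several packages each claim a different row concurrently without blocking one another:

DECLARE @Claimed TABLE (DB_ID INT, DatabaseName SYSNAME);

-- Atomically claim the next "enabled" source and flip it to "processing"
UPDATE TOP (1) rr
SET    rr.Status = 'processing'
OUTPUT inserted.DB_ID, inserted.DatabaseName INTO @Claimed (DB_ID, DatabaseName)
FROM   dbo.Round_Robin AS rr WITH (UPDLOCK, READPAST, ROWLOCK)
WHERE  rr.Status = 'enabled';

-- An empty result means no sources are left and the package can stop
SELECT DB_ID, DatabaseName FROM @Claimed;

When the extract finishes, the package simply marks that DB_ID as “complete” with a corresponding UPDATE.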

Below is a simplified example:

Execute Demo Package to create Test DB and Round_Robin table.


Select all the rows from the Round_Robin table


Every source database has been assigned an ID and has a status. An additional connection string column could be added so that the connection string information for each database can be passed to the package; however, because this is a simple example, I did not include it.
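
If you did want it, the change is a one-liner (the column name and size are just an illustration); the package would then map the value into its connection manager with an expression:

-- Hypothetical extension: store connection information per source
ALTER TABLE dbo.Round_Robin ADD ConnectionString NVARCHAR(4000) NULL;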

Notice that one of the databases is “disabled”. Any database not in “enabled” status will be ignored by the package by default.

Press the continue button in BIDS


Notice that the DB_ID variable is now 4

Select all the rows from the Round_Robin table


Notice that the source database has now been marked as “processing” so that no other running package will try to use the same database.

Press the continue button in BIDS


DB_ID has been changed to 3

Select all the rows from the Round_Robin table


Database 4 has been marked as “complete” and Database 3 is marked as “processing”.

Keep pressing the continue button in BIDS until all the databases have been processed.


Notice that the package loops through the source databases until it reaches the “none” database (DB_ID = 0). While this example only shows one package looping through the databases, the package can be copied many times and deployed to multiple SSIS servers, and adding a new source database is as simple as inserting a new record into the Round_Robin table. The same technique can be used with any other kind of source, such as flat files. Additional functionality can be added as well, such as grouping sources into time slots so they are processed in groups at various times.
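
As a rough sketch of the time-slot idea (the LoadGroup column and values are hypothetical):

-- Hypothetical: tag each source with a load group / time slot
ALTER TABLE dbo.Round_Robin ADD LoadGroup TINYINT NOT NULL DEFAULT (1);

-- The claim query then only picks up sources for the current slot
DECLARE @CurrentLoadGroup TINYINT = 1;

UPDATE TOP (1) rr
SET    rr.Status = 'processing'
FROM   dbo.Round_Robin AS rr WITH (UPDLOCK, READPAST, ROWLOCK)
WHERE  rr.Status = 'enabled'
  AND  rr.LoadGroup = @CurrentLoadGroup;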

Sample SSIS Solution

Performance Tuning: Data Warehouse Storage Throughput (MB/sec)

August 1, 2012

SQL Server Data Warehouse Workload Optimization

There are significant differences between achieving optimal resource balance for a SQL Server Data Warehouse system and accomplishing the same optimization for a lookup-based online transaction processing (OLTP) system. OLTP systems are heavily indexed in order to support high concurrency (many users) with low-latency retrieval of a small number of rows from datasets that often contain very little historical data. This workload produces significant disk head movement, which is measured in random I/O reads and writes (IOPS). Data warehouse analytical queries, however, have relatively low concurrency (fewer users) and involve much larger historical datasets that benefit greatly from sequential reads and writes (disk scans). As a result, Data Warehouse systems are best measured by analyzing total throughput (sequential reads and writes) in MB/sec rather than the more traditional OLTP focus on operations per second (IOPS).

SQL Server Data Warehouse Balanced Architecture

Because the data warehouse’s scan-centric (sequential read/write) workload is best measured in MB/sec, the hardware components need to be architected in such a way that the maximum data throughput (MB/sec) the storage system can deliver can actually be consumed by the processors. Having a balanced Data Warehouse architecture means that the storage system has been optimized for scan-centric operations (sequential reads/writes measured in MB/sec) and that the processors can receive and process the data delivered by the storage at that same MB/sec rate.

Storage

In a SQL Server Data Warehouse, the throughput of the storage is the most critical, complex, and costly portion of the system. Because the datasets behind Data Warehouse analytical queries are too large to fit in memory, the storage must be able to quickly locate and deliver the data from disk, where it can then be aggregated by the processors. In most data warehouse systems it is the storage that is the bottleneck, so careful monitoring and measuring of its performance is essential.

The best tool for measuring storage performance is a free Microsoft tool called SQLIO. It can test the performance of storage under a great variety of scenarios and produces very detailed results, which can then be charted over time. SQLIO should be run once the storage hardware is installed, to validate that it has been configured properly, and again during regularly scheduled maintenance cycles, to ensure that the storage is still performing properly.
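
For reference, a sequential-read test run looks something like the following (the file path, duration, and sizes are placeholders of mine; check the readme and parameter file that ship with SQLIO for the authoritative option list):

rem 512 KB sequential reads for 300 seconds, 4 threads, 8 outstanding I/Os
sqlio -kR -s300 -fsequential -b512 -o8 -t4 -LS -BN E:\sqlio_testfile.dat

Running the same test again with -frandom and a small block size (e.g. -b8) gives you the IOPS-style counterpoint to compare against.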

Laptop Throughput (MB/sec)

As an example, I used SQLIO to generate performance statistics for my personal laptop. I also used David K Lee’s excellent Storage Analysis website to produce the following chart, which summarizes the SQLIO results very nicely.

[chart: laptop SQLIO results, sequential vs. random reads/writes in MB/sec]

As you can tell from the graph, my puny laptop is actually pretty well configured for Data Warehouse workloads. The sequential reads/writes perform much better than random reads/writes.

In fact, I would guess that my laptop might perform better than SANs that cost 100x more, simply because my laptop has dedicated Direct Attached Storage (DAS) that does not have to fight with other resources on a huge (OLTP/IOPS-centric) SAN that every application shares.

Conclusion:

If your SQL Server Data Warehouse is performing poorly, download SQLIO and see if a laptop can get better performance. If so, then you might have a problem with your storage ;)

Additional Resources:

Fast Track Data Warehouse Reference Guide for SQL Server 2012 – The Bible for all things DW. It is a must read!!!

Dell Reference Configuration – This Dell “starter” Fast Track gets 4,700+ MB/sec!!!


Also see: SQL Server OLTP vs. Data Warehouse Performance Tuning