Posts Tagged ‘ETL’

SQL Server Database Performance Tuning: Update Fact Tables?

April 7, 2013

I have been asked on several occasions whether rows/values in a SQL Server fact table should be updated. The answer in almost all cases is NO! Fact tables should not be updated. There are several really good reasons why fact measure values should almost never be updated.

1. Consistent Reporting: If I run a report today which reports on last week’s measure values I should get the exact same measure values if I run the same report tomorrow. The end users will not be happy if their historical numbers keep changing every day that they run the report!

2. Fact Table Processing: Performing updates can be a very costly operation on the SQL Server data warehouse server! Fact tables can and should be very large. They can contain hundreds of millions, billions, or even trillions of rows, and they should have only one index on each surrogate key column (unless you are fortunate enough to have SQL Server 2012, in which case you can create a single Columnstore Index that includes all the surrogate keys). All of that makes updating anything in a fact table very costly, and it should be done with great care and thought.
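For example, on SQL Server 2012 that single Columnstore Index over the surrogate keys might be sketched roughly like this (the FactSales table and its key columns are hypothetical names, not from any real schema):

-- Hypothetical sketch: one nonclustered columnstore index covering all surrogate keys.
-- SQL Server 2012 supports only the nonclustered (read-only) columnstore index.
CREATE NONCLUSTERED COLUMNSTORE INDEX CSI_FactSales_AllKeys
ON dbo.FactSales (DateKey, ProductKey, TransactionTypeKey);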

But I also realize that we live in the real world. And in the real world, fact transaction measure values sometimes change (even though they shouldn't!!). I won't name names, but there are plenty of source systems which allow fact transaction measure values to be updated. So what can you do? You need to record those changing facts, but you also want to preserve reporting consistency, and you don't want to exponentially increase the time it takes to process the transactional fact table.

Well there are a few options:

1. Convert the transactional fact table to a snapshot fact table (more on that later)

2. Create the ‘Adjustment’ facts in the ETL process and insert them into the Fact table

The second option works especially well if the source system actually tells you when the fact was last updated, e.g. via a LastUpdatedDate column. Personally, I like to create another dimension, TransactionType, which holds one member representing the 'normal' non-adjusted facts and another member for the 'adjustment' facts that have to be created in the ETL. In the end the ERD would look something like the following:

[ERD diagram]
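In place of the image, here is a rough DDL sketch of that shape; all table and column names are hypothetical and just for illustration:

-- Hypothetical sketch of the model in the ERD above.
CREATE TABLE dbo.DimTransactionType (
    TransactionTypeKey  int IDENTITY(1,1) PRIMARY KEY,
    TransactionTypeName varchar(20) NOT NULL   -- 'normal' or 'adjustment'
);

CREATE TABLE dbo.FactSales (
    DateKey            int NOT NULL,            -- surrogate key to the Date dimension
    ProductKey         int NOT NULL,            -- surrogate key to the Product dimension
    TransactionTypeKey int NOT NULL,            -- surrogate key to DimTransactionType
    Measure            decimal(18, 2) NOT NULL
);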

An example scenario, using a flattened fact table for simplicity:

Let's say the fact table starts on 01/01/2013 with only 3 transactions (one for each product):

DateKey    DimProduct   DimTransType   Measure
20130101   ProdA        normal         100
20130101   ProdB        normal         100
20130101   ProdC        normal        100

However, on 01/02/2013 it was realized that the value entered on 01/01/2013 for ProdA was incorrect and should really have been 75. The ETL should be able to identify the transaction via its LastUpdatedDate and calculate the difference between the historical 'normal' value and the new target value. That difference should then be inserted into the fact table as an 'adjustment' transaction.

DateKey    DimProduct   DimTransType   Measure
20130101   ProdA        normal         100
20130101   ProdB        normal         100
20130101   ProdC        normal         100
20130101   ProdA        adjustment     -25

Now normal + adjustment still sums to the correct value of 75, yet report consistency can still be preserved because the report or cube can choose to see the fact values with or without the adjustment records. This also greatly reduces the ETL fact load processing requirements as opposed to updating the facts in place. Note: having a good table partitioning strategy is critical for large fact tables.
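As a hypothetical illustration of that note, partitioning the fact table by its DateKey might be sketched like this (the boundary values and names are made up):

-- Hypothetical sketch: monthly range partitions on the DateKey surrogate key.
CREATE PARTITION FUNCTION pfFactDateKey (int)
AS RANGE RIGHT FOR VALUES (20130101, 20130201, 20130301);

CREATE PARTITION SCHEME psFactDateKey
AS PARTITION pfFactDateKey ALL TO ([PRIMARY]);

-- The fact table would then be created (or rebuilt) ON psFactDateKey (DateKey).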

I also like preserving the adjustment date value, as that allows the business to see how far back transactions are being corrected. In many cases I have seen managers shocked by the frequency of these 'rare' corrections.

DateKey    AdjDateKey   DimProduct   DimTransType   Measure
20130101   20130101     ProdA        normal         100
20130101   20130101     ProdB        normal         100
20130101   20130101     ProdC        normal         100
20130101   20130103     ProdA        adjustment     -25
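To make the ETL step concrete, here is a minimal T-SQL sketch of generating those adjustment rows. The staging table, fact table, column names, and parameter values are all hypothetical, and a real implementation would also have to handle surrogate key lookups and late-arriving data:

-- Assumed ETL parameters (hypothetical values).
DECLARE @LastLoadDate       datetime = '20130102';  -- when the previous ETL run finished
DECLARE @CurrentLoadDateKey int      = 20130103;    -- DateKey for today's adjustment load
DECLARE @AdjustmentTypeKey  int      = 2;           -- 'adjustment' member of DimTransactionType

-- Insert one adjustment row per changed source transaction: the difference between
-- the new source value and everything already loaded for that date/product.
-- FactSales is assumed here to include the AdjDateKey column from the table above.
INSERT INTO dbo.FactSales (DateKey, AdjDateKey, ProductKey, TransactionTypeKey, Measure)
SELECT
    f.DateKey,
    @CurrentLoadDateKey,
    f.ProductKey,
    @AdjustmentTypeKey,
    s.NewMeasure - SUM(f.Measure)                   -- e.g. 75 - 100 = -25
FROM dbo.StageTransactions AS s
JOIN dbo.FactSales AS f
      ON  f.DateKey    = s.DateKey
      AND f.ProductKey = s.ProductKey
WHERE s.LastUpdatedDate > @LastLoadDate             -- only rows the source flagged as changed
GROUP BY f.DateKey, f.ProductKey, s.NewMeasure
HAVING s.NewMeasure - SUM(f.Measure) <> 0;          -- skip no-op corrections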

ETL from Oracle to SQL Server

December 21, 2011

I was recently asked to provide some performance tuning recommendations for an ETL process that extracts data from Oracle using Informatica (Unix) and loads it into some SQL Server Staging tables.

Initially the ETL process was taking 20+ hours to complete. After some preliminary analysis it was determined that the first major bottleneck was that Informatica was doing row-by-row inserts into the SQL Server staging tables!

Informatica could read data at about 3,500 rows per second from a single Oracle table but was only inserting into SQL Server at about 300 rows per second. The nightly ETL load is 4+ terabytes, so it became apparent why it was taking 20+ hours to complete!

Note: This installation of Informatica was on a Unix server, so it did not have the SQL Server drivers needed to do bulk inserts. If you install Informatica on a Windows server you can take advantage of those drivers and do bulk inserts.

Our recommendations came in the form of three options:

1. Informatica flat-file dump from Oracle with an SSIS load into the SQL Server staging tables. See the diagram below.

[Diagram: Option 1, Informatica flat-file dump with SSIS load]

While this technique greatly improved ETL performance, it would require completely redesigning the existing Informatica mappings and buying another server just to hold the flat files.

Plus, it seems a tad outdated to still be doing flat-file dumps between two perfectly good database engines in 2011 simply because the ETL tool can't do bulk inserts! (A plain T-SQL sketch of that flat-file staging load appears after the options below.)

2. Upgrade DataDirect to version 6.1, which has a driver that can do bulk inserts into SQL Server. (Actually, DataDirect 6 ships with the driver disabled!!!) The downside is that it costs some serious $$$ to upgrade, but no additional development work is needed once the upgrade is complete.

3. Use the SSIS Oracle Connector to pull directly from Oracle. Here are the performance results. There are many factors which can affect performance, so your results will vary.

Source   Destination                     Rows        Time (min)   Rows/sec
Oracle   SQL Server DB (local laptop)    1,000,000   6            2,778
Oracle   SQL Server DB                   1,000,000   6.1          2,732
Oracle   Flat file                       1,000,000   5.4          3,086
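For reference, the staging-table half of option 1 does not strictly need SSIS; a plain T-SQL BULK INSERT of the dump file would look roughly like the sketch below. The file path, delimiter, and table name are hypothetical:

-- Hypothetical sketch: bulk load a pipe-delimited flat-file dump into a staging table.
BULK INSERT dbo.StageOrders
FROM '\\etlserver\dumps\orders_20111221.txt'
WITH (
    FIELDTERMINATOR = '|',
    ROWTERMINATOR   = '\n',
    BATCHSIZE       = 100000,
    TABLOCK          -- allows a minimally logged load under the simple or bulk-logged recovery model
);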

Of course I prefer the SSIS option because it performs roughly as well as Informatica but without all the additional cost. With more optimization SSIS can even outperform Informatica. See the links at the bottom of the post for more information.

Steps to install the SSIS Oracle Connector on a dev box that uses BIDS

1. Download and install both the 32-bit and 64-bit SSIS Oracle Connector. BIDS debugging runs in 32-bit, so you need both!

2. Download and install Oracle SQL Developer and Java Platform JDK

3. Set up the connection to the Oracle DB in SQL Developer.

[Screenshot: SQL Developer connection setup]

4. Set up the Oracle connection in BIDS.

[Screenshot: SSIS Oracle connection configuration]

**Make sure that the TNS service name is in the format <IP>:<port>/<service name>. Also think about how to manage the credentials used to access Oracle.
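For example, a hypothetical value in that format (the address and service name are made up; 1521 is the default Oracle listener port) would be:

10.20.30.40:1521/ORCL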

Also, for 'Data access mode' in the 'Oracle Source' dialog box, use "SQL command". It performs a lot better!

[Screenshot: Oracle Source configuration]
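As a hypothetical example, the SQL command text is just an Oracle-side query; selecting only the columns you actually need keeps the rows narrow and the transfer fast (the table and columns below are made up):

-- Hypothetical Oracle-side query used as the 'SQL command' text in the Oracle Source.
SELECT order_id,
       product_id,
       order_date,
       quantity,
       amount
FROM   sales.orders
WHERE  order_date >= TO_DATE('2011-12-01', 'YYYY-MM-DD')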

**To install the SSIS Oracle Connector on the server itself, follow the same steps above, but you won't need to install the 32-bit components 😉

Good Performance Tuning Articles Related to SSIS and ETL:

The Data Loading Performance Guide:

This document describes techniques for bulk loading large data sets into SQL Server. It covers both the available techniques and methodologies to performance-tune and optimize the bulk loading process.

We Loaded 1TB in 30 Minutes with SSIS, and So Can You

In February 2008, Microsoft announced a record-breaking data load using Microsoft® SQL Server® Integration Services (SSIS): 1 TB of data in less than 30 minutes. That data load, using SQL Server Integration Services, was 30% faster than the previous best time using a commercial ETL tool. This paper outlines what it took: the software, hardware, and configuration used. We will describe what we did to achieve that result, and offer suggestions for how to relate these techniques to typical scenarios. Even for customers who don’t have needs quite like this benchmark, such efforts can teach a lot about getting optimal performance.

Considerations for High Volume ETL Using SQL Server Integration Services

Providing for high volume extract, transformation, and load (ETL) throughput requires consideration of both best practices for standard relational databases, and of high performance techniques for ETL.

This article presents a number of suggestions relating to high-volume ETL throughput, specifically using SQL Server Integration Services. Aspects of both ETL performance and management are described.

Fast Track Data Warehouse 3.0 Reference Guide

This paper defines a reference configuration model (known as Fast Track Data Warehouse) using an I/O balanced approach to implementing a symmetric multiprocessor (SMP)-based SQL Server data warehouse with proven performance and scalability expectations for data warehouse workloads. The goal of a Fast Track Data Warehouse reference configuration is to achieve a cost-effective balance between SQL Server data processing capability and realized component hardware throughput.

Best Practices for Data Warehousing with SQL Server 2008 R2

There is considerable evidence that successful data warehousing projects often produce a very high return on investment. Over the years a great deal of information has been collected about the factors that lead to a successful implementation versus an unsuccessful one. These are encapsulated here into a set of best practices, which are presented with particular reference to the features in SQL Server 2008 R2. The application of best practices to a data warehouse project is one of the best investments you can make toward the establishment of a successful Business Intelligence infrastructure.

Data Integration at Microsoft: Technologies and Solution Patterns

Microsoft SQL Server Integration Services Performance Design Patterns