Posts Tagged ‘SSIS’

Great BLOG posts

August 19, 2012

I don’t usually do this, but there have been some recent blog posts that, because of the importance of the information being shared, I would like to call attention to and preserve their conclusions here as well.

Thomas Kejser has created a great series of posts explaining some of the fundamentals of Columnstore Indexes. I particularly enjoyed his latest post, “What is the Best Sort Order for a Column Store?”, because it gives some SQL queries that can be used to calculate things such as data entropy.

Example:

p(xi), the relative frequency of each distinct value of X:

SELECT X
, CAST(COUNT(*) AS FLOAT) / SUM(COUNT(*)) OVER () AS pX
FROM [Table]
GROUP BY X

H(X), the entropy of X, i.e. -SUM(pX * LOG2(pX)) over all values:

SELECT -1 * SUM(pX * LOG(pX, 2)) AS H
FROM
(
SELECT X
, CAST(COUNT(*) AS FLOAT) / SUM(COUNT(*)) OVER () AS pX
FROM [Table]
GROUP BY X
) AS p

The Data Warehouse Junkie also wrote a post entitled “Design SSIS for Performance and Scale – Baseline tests” which demonstrates some good and proven conclusions based on the Fast Track recommendations.

On SSIS degrees of parallelism and their impact on data fragmentation: “the DOP setting has no influence on our SSIS performance and this also means that the BULK API is single threaded (because performance does not increase). As you can also see we don’t get any fragmentation of our data when we increase the DOP setting.”

This is very good to know.

I also think that his conclusion about enabling page compression matches what I have seen with the Fast Track systems that I have worked with. Enabling page compression comes at the cost of increased load times but gains better I/O throughput (sequential reads/writes in MB/sec). For most DW systems this is a good trade-off, but if your DW is not I/O bound and you have a very tight load window, then you might want to consider not using page compression.

In either case, the most important thing to do is to test it first before you make any decisions!
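For example, a quick way to test the trade-off is to estimate the compression savings first and then rebuild a candidate table with page compression and re-run the load. A minimal sketch (the table name dbo.FactSales is just a placeholder):

-- estimate the space savings of PAGE compression for a candidate table
EXEC sp_estimate_data_compression_savings
     @schema_name      = 'dbo',
     @object_name      = 'FactSales',
     @index_id         = NULL,
     @partition_number = NULL,
     @data_compression = 'PAGE';

-- rebuild the table with PAGE compression, then compare load times and scan throughput
ALTER TABLE dbo.FactSales
REBUILD WITH (DATA_COMPRESSION = PAGE);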

SSIS 2012 Flat File Connection Manager

July 15, 2012

There have been several enhancements made and features added to Integration Services with the release of SQL Server 2012.

One of the SSIS components that has been enhanced is the Flat File Connection Manager. The Flat File Connection Manager no longer errors out when columns are missing from the flat file. See row 2 in the example below:

Figure 1


Figure 1 is a simple example of a text file with two columns and three rows that tests three different scenarios. Row 1 has values for both columns, row 2 is missing the last column, and row 3 contains extra columns which are not defined in the SSIS package metadata.

Figure 2


In Figure 2 I set up a very simple SSIS package that pulls all the rows from the text file in Figure 1. Notice that row 2 from the text file is missing the last column and returns an empty/null value. In older versions of SSIS, row 2 would have produced an error at run time.

The interesting thing about this example is that row 3, column 2 contains all the values for the columns that were not originally defined in the SSIS metadata. This means that unless you want all those values in the last column, you should redirect those rows using a Conditional Split (see the expression sketch below).
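As a rough sketch of that redirect, a Conditional Split expression along these lines would catch rows where the last defined column still contains the column delimiter (the column name Col2 and the comma delimiter are assumptions for this example, not part of the package above):

FINDSTRING([Col2], ",", 1) > 0

Rows matching the condition can then be sent to an error output or a separate destination.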

SSIS Expressions 4000 Character Limit

July 1, 2012

One of the most powerful and most overlooked features of SSIS is the ability to have virtually every property of any component controlled by an expression. This gives SSIS packages great flexibility.

However, in SSIS 2008 the output of an expression cannot exceed 4000 characters. In most cases your expression output will not exceed this limit, but if you are ever unfortunate enough to run into it, you will soon realize that there is no easy work-around.

See this Microsoft Connect item: SSIS Maximum string length of 4000 is too restrictive

Testing this restriction is as easy as creating an expression on a variable that returns more than 4000 characters.

Example: REPLICATE("1", 4001)


Note: The above uses BIDS Helper to evaluate the expression.

If you change the expression to REPLICATE("1", 4000), then the expression is evaluated successfully.


However, this limitation has been removed in SSIS 2012. Just open up SSDT (no more BIDS) and create the following expression on a variable: REPLICATE("1", 40001)


Cartesian/Cross Join

February 19, 2012

I was once asked in a job interview if I knew what a Cartesian/Cross Join was.

A more interesting question is: How can a Cross Join be useful?

There are actually many good reasons to use a Cross Join. Outlined below are a couple of the most common scenarios that I have used in Data Warehousing:

1. Junk Dimension

In addition to the advantages listed in the Kimball article, Junk Dimensions can greatly reduce the overhead of processing a large dimension. Consider the following real-life scenario:

An existing type 1 slowly changing customer dimension with 10+ million members needs to have a set of flag/Boolean attributes (one for each day of the week) added to it. And just to make things worse, these attributes need to be type 2. Do you change the already large dimension to SCD 2 on attributes that might change frequently? NO!

Solution: Create a Junk Dimension with all possible combinations of the flag/Boolean indicators. See the code and screenshots below:

-- create a table variable to hold the weekday flag values
declare @WeekFlag table
(Sunday bit,
Monday bit,
Tuesday bit,
Wednesday bit,
Thursday bit,
Friday bit,
Saturday bit)
-- insert the two possible Boolean values for each weekday
insert into @WeekFlag values( 0,0,0,0,0,0,0)
insert into @WeekFlag values( 1,1,1,1,1,1,1)
-- look at the table
select * from @WeekFlag
-- self join the table using cross joins to generate every combination (2^7 = 128 rows)
select a.Sunday,b.Monday,c.Tuesday,d.Wednesday,e.Thursday,f.Friday,g.Saturday
from (select Sunday from @WeekFlag) as a
cross join (select Monday from @WeekFlag) as b
cross join (select Tuesday from @WeekFlag) as c
cross join (select Wednesday from @WeekFlag) as d
cross join (select Thursday from @WeekFlag) as e
cross join (select Friday from @WeekFlag) as f
cross join (select Saturday from @WeekFlag) as g

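If you want to persist the result as an actual junk dimension table with a surrogate key, a minimal sketch of my own (the name dbo.DimWeekFlag is a placeholder) could look like the following; the two-row VALUES constructors generate the same 128 combinations as the table-variable approach above:

-- materialize all 128 weekday-flag combinations into a junk dimension with a surrogate key
select  row_number() over (order by su.v, mo.v, tu.v, we.v, th.v, fr.v, sa.v) as WeekFlagKey,
        cast(su.v as bit) as Sunday,
        cast(mo.v as bit) as Monday,
        cast(tu.v as bit) as Tuesday,
        cast(we.v as bit) as Wednesday,
        cast(th.v as bit) as Thursday,
        cast(fr.v as bit) as Friday,
        cast(sa.v as bit) as Saturday
into    dbo.DimWeekFlag
from       (values (0),(1)) as su(v)
cross join (values (0),(1)) as mo(v)
cross join (values (0),(1)) as tu(v)
cross join (values (0),(1)) as we(v)
cross join (values (0),(1)) as th(v)
cross join (values (0),(1)) as fr(v)
cross join (values (0),(1)) as sa(v)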

2. Time Dimension

“Time Dimension” here does not mean a time dimension with a granularity of a day (like AdventureWorks) but a dimension with a granularity of LESS THAN A DAY. Example: analyzing transactions down to the millisecond.

Don’t try to combine a dimension with a granularity of a day with one at a granularity of a millisecond, or you will end up with billions of members or more!!

Solution:

Cross Join each level of the Time Dimension.

Milliseconds -> Seconds -> Minutes -> Hours

Here is an example of the first cross join in the series:

-- build a table of millisecond values 0 through 999
declare @MilliSecTable table ( MilliSecValue int)
declare @Msec as int = 0
insert into @MilliSecTable values (@Msec)
while (select MAX(MilliSecValue) from @MilliSecTable) < 999
begin
    insert into @MilliSecTable values((select MAX(MilliSecValue) from @MilliSecTable) + 1)
end
-----------------------------------------------------------------------------------------
-- build a table of second values 0 through 59
declare @SecTable table ( SecValue int)
declare @Sec as int = 0
insert into @SecTable values (@Sec)
while (select MAX(SecValue) from @SecTable) < 59
begin
    insert into @SecTable values((select MAX(SecValue) from @SecTable) + 1)
end

-- cross join the two levels: 60,000 rows (60 seconds x 1,000 milliseconds)
select SecValue,MilliSecValue from @SecTable cross join @MilliSecTable


**Note: The above code also performs a lot better than simply incrementing each member by 1 millisecond. That could literally take days 😉
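To complete the chain, the remaining levels can be built the same way and then cross joined together. Here is a rough, self-contained sketch of my own (run it as its own batch; the table and column names are illustrative, not from the original example):

-- build the second, minute and hour levels (0-59, 0-59 and 0-23) using the same loop pattern
declare @SecondTable table ( SecondValue int)
insert into @SecondTable values (0)
while (select MAX(SecondValue) from @SecondTable) < 59
    insert into @SecondTable values ((select MAX(SecondValue) from @SecondTable) + 1)

declare @MinuteTable table ( MinuteValue int)
insert into @MinuteTable values (0)
while (select MAX(MinuteValue) from @MinuteTable) < 59
    insert into @MinuteTable values ((select MAX(MinuteValue) from @MinuteTable) + 1)

declare @HourTable table ( HourValue int)
insert into @HourTable values (0)
while (select MAX(HourValue) from @HourTable) < 23
    insert into @HourTable values ((select MAX(HourValue) from @HourTable) + 1)

-- cross join the levels: 86,400 rows, one per second of the day; cross joining @MilliSecTable
-- from the example above as well would extend the grain down to the millisecond
select h.HourValue, m.MinuteValue, s.SecondValue
from @HourTable as h
cross join @MinuteTable as m
cross join @SecondTable as s
order by h.HourValue, m.MinuteValue, s.SecondValue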

Resources:

The Many-to-Many Revolution: This takes the basic idea of T-SQL cross joins and applies it to SSAS. It is a great read!

CHECKSUM vs HashBytes

February 1, 2012

Hashing can be useful in Data Warehousing as well 😉 It can give you the ability to break large problems into smaller, more manageable pieces or to scale out your ETL process 😉 PDW even uses a similar method to distribute data internally among all the compute nodes.

In reality there is no competition between CHECKSUM and HashBytes. CHECKSUM is the clear loser all the way around, as BOL (clearly?) states:

“CHECKSUM satisfies the properties of a hash function: CHECKSUM applied over any two lists of expressions returns the same value if the corresponding elements of the two lists have the same type and are equal when compared using the equals (=) operator. For this definition, null values of a specified type are considered to compare as equal. If one of the values in the expression list changes, the checksum of the list also generally changes. However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.”

The CHECKSUM algorithm is very simple but fast. Depending on the data types used, or if there are null values, collisions can occur frequently. I have used GUIDs to produce collisions in just a few thousand rows!
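A quick way to see this for yourself is to generate a batch of GUIDs and compare how many distinct values CHECKSUM and HashBytes produce. This is a minimal sketch of my own (it assumes nothing beyond a scratch temp table):

-- generate 100,000 GUIDs into a temp table
select top (100000) newid() as Id
into #Guids
from sys.all_objects a cross join sys.all_objects b;

-- if DistinctChecksums is less than DistinctGuids then CHECKSUM has produced collisions;
-- the MD5 HashBytes count should match the GUID count
select  count(distinct Id)                                          as DistinctGuids,
        count(distinct checksum(Id))                                as DistinctChecksums,
        count(distinct hashbytes('MD5', cast(Id as varbinary(16)))) as DistinctMd5Hashes
from    #Guids;

drop table #Guids;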

Thomas Kejser has written a great series of articles about Hash functions

Exploring Hash Functions in SQL Server:

Hash distributing rows is a wonderful trick that I often apply. It forms one of the foundations for most scale-out architectures. It is therefore natural to ask which hash functions are most efficient, so we may choose intelligently between them.

In this blog post, I will benchmark the built-in functions in SQL Server. I will focus on answering two questions:

  1. How fast is the hash function?
  2. How well does the hash function spread data over a 32-bit integer space?

I know there is also the question about how cryptographically safe the function is, but this is not a necessary property for scale-out purposes – hence, that aspect is out of scope for this blog.

Writing New Hash Functions for SQL Server: In this blog, I will explore ideas for extending SQL Server with new, fast hash functions. As will be shown, the high-speed, built-in functions CHECKSUM and BINARY_CHECKSUM are not always optimal hash functions when you require the function to spread data evenly over an integer space. I will show that it can be useful to extend SQL Server with a new CLR User Defined Function (UDF).

ETL from Oracle to SQL Server

December 21, 2011

I was recently asked to provide some performance tuning recommendations for an ETL process that extracts data from Oracle using Informatica (Unix) and loads it into some SQL Server Staging tables.

Initially the ETL process was taking 20+ hours to complete. After some preliminary analysis it was determined that the first major bottleneck was that Informatica was doing row-by-row inserts into the SQL Server stage tables!

Informatica could read data at about 3,500 rows per second from a single Oracle table but was only inserting into SQL Server at about 300 rows per second. The nightly ETL load is 4+ terabytes, so it became apparent why it was taking 20+ hours to complete!

Note: This installation of Informatica was on a Unix server, so it did not have the SQL drivers to do bulk inserts. If you install Informatica on a Windows server, then you can take advantage of the SQL drivers and do bulk inserts.

Our recommendations came in the form of three options:

1. Informatica Oracle flat file dump with an SSIS load to SQL Server staging tables (see the diagram below):

[Diagram: Option 1 – Informatica flat file dump]

While this technique greatly improved ETL performance, it would require completely redesigning the existing Informatica mappings and buying another server just to hold the flat files.

Plus, it seems a tad outdated to still be doing flat file dumps in 2011 between two perfectly good database engines simply because the ETL tool can’t do bulk inserts!

2. Upgrade DataDirect to version 6.1, which has a driver that can do bulk inserts into SQL Server. (Actually, DataDirect 6 ships with the driver disabled!!!) The downside is that it costs some serious $$$ to upgrade, but there is no additional development work needed once the upgrade is complete.

3. Use the SSIS Oracle Connector to pull directly from Oracle. Here are the performance results. There are many factors which can affect performance, so your results will vary.

Source   Destination                    Rows        Time (min)   Rows/sec
Oracle   SQL Server DB (local laptop)   1,000,000   6            2,778
Oracle   SQL Server DB                  1,000,000   6.1          2,732
Oracle   Flat file                      1,000,000   5.4          3,086

Of course, I prefer the SSIS option because it performs roughly equal to Informatica but without all the additional cost. With more optimizations, SSIS can even outperform Informatica. See the links at the bottom of the post for more information.

Steps to install the SSIS Oracle Connector on a dev box that uses BIDS

1. Download and install the 32-bit and 64-bit SSIS Oracle Connector. BIDS debugging runs in 32-bit, so you need both!

2. Download and install Oracle SQL Developer and Java Platform JDK

3. Set up the connection to the Oracle DB in SQL Developer

[Screenshot: SQL Developer connection]

4. Set up the Oracle connection in BIDS.

[Screenshot: SSIS Oracle connection configuration]

**Make sure that the TNS service name is in the following format: <IP>:<port>/<service name>. Also think about how to manage the credentials used to access Oracle.

Also, for 'Data access mode' in the 'Oracle Source' dialog box, use "SQL command". It performs a lot better!

[Screenshot: Oracle Source configuration]
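As a rough illustration (the schema, table, column names and date filter below are made up for this example), the SQL command is just a plain Oracle SELECT; listing the columns explicitly keeps the connector from pulling data you don't need:

-- hypothetical Oracle source query used in the 'SQL command' data access mode
SELECT ORDER_ID,
       ORDER_DATE,
       CUSTOMER_ID,
       ORDER_AMOUNT
FROM   SALES.ORDERS
WHERE  ORDER_DATE >= TO_DATE('2011-12-01', 'YYYY-MM-DD')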

**To install the SSIS Oracle Connector on the server, follow the same steps above, but you won’t need to install the 32-bit version 😉

Good Performance Tuning Articles that are related to SSIS and ETL:

The Data Loading Performance Guide:

This document describes techniques for bulk loading large data sets into SQL Server. It covers both the available techniques as well as methodologies to performance tune and optimize the bulk loading process.

We Loaded 1TB in 30 Minutes with SSIS, and So Can You

In February 2008, Microsoft announced a record-breaking data load using Microsoft® SQL Server® Integration Services (SSIS): 1 TB of data in less than 30 minutes. That data load, using SQL Server Integration Services, was 30% faster than the previous best time using a commercial ETL tool. This paper outlines what it took: the software, hardware, and configuration used. We will describe what we did to achieve that result, and offer suggestions for how to relate these techniques to typical scenarios. Even for customers who don’t have needs quite like this benchmark, such efforts can teach a lot about getting optimal performance.

Considerations for High Volume ETL Using SQL Server Integration Services

Providing for high volume extract, transformation, and load (ETL) throughput requires consideration of both best practices for standard relational databases, and of high performance techniques for ETL.

This article presents a number of suggestions relating to high volume ETL throughput, specifically using SQL Server Integration Services. Aspects of both ETL performance and management are described.

Fast Track Data Warehouse 3.0 Reference Guide

This paper defines a reference configuration model (known as Fast Track Data Warehouse) using an I/O balanced approach to implementing a symmetric multiprocessor (SMP)-based SQL Server data warehouse with proven performance and scalability expectations for data warehouse workloads. The goal of a Fast Track Data Warehouse reference configuration is to achieve a cost-effective balance between SQL Server data processing capability and realized component hardware throughput.

Best Practices for Data Warehousing with SQL Server 2008 R2

There is considerable evidence that successful data warehousing projects often produce a very high return on investment. Over the years a great deal of information has been collected about the factors that lead to a successful implementation versus an unsuccessful one. These are encapsulated here into a set of best practices, which are presented with particular reference to the features in SQL Server 2008 R2. The application of best practices to a data warehouse project is one of the best investments you can make toward the establishment of a successful Business Intelligence infrastructure.

Data Integration at Microsoft: Technologies and Solution Patterns

Microsoft SQL Server Integration Services Performance Design Patterns

Dimensional Modeling Financial Data in SSAS

October 26, 2011

Part 1 – Parent Child Hierarchies and Pseudo GL Accounts

new: Financial Data Warehouse video 

This post is born out of some real-life lessons that I learned while putting together a Financial Data Warehouse based on General Ledger (GL) data, i.e. all the GLs in the foundational financial statements, from Net Income and Balance Sheet to Statistical GLs and Cash Flow statements.

While there are many great sources for dimensional modeling (Kimball’s book The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition) is essential), there are few resources which discuss dimensional modeling for financial data. I am going to assume that the reader already has a firm knowledge base in dimensional modeling. If not, then see my recommendations for books that will help form that foundation: The Start of Something Great.

To be frank, this could be a book-sized topic in and of itself, but I am going to try to cover all the basics so that the reader can avoid the major pitfalls.

The first step is to gain an understanding of the business processes being modeled (GL creation and maintenance, period-end close-out, etc.) and the associated requirements (financial statements: Net Income, Balance Sheet, Cash Flow, etc.). It is absolutely critical to gain a deep understanding of the GL analytical attributes, e.g. Business Units, Cost Centers, and Sales Channels. Don’t be surprised when you find out that most of the critical GL attributes are not maintained in any source system but in Excel spreadsheets!

I am going to dispense with all the gory details that can be encountered in this process and jump right into the nature and structure of Financial Statements, because their structure will have a significant effect on how the data is modeled.

To make things easier, I am going to use AdventureWorks as a common point of reference. Here is an example Net Income statement:

Adventure Works Cube -> Financial Reporting -> rows = named set -> Summary P&L, columns = Amount

I have rearranged the rows from the named set Summary P&L by clicking and dragging them into something that might look like a financial statement.

A typical financial system will generate financial statements like the one shown above through a series of calculations for each line in the financial statement.

Example:

Line/row 5 (Net Sales) = line 2 (Gross Sales) – line 3 (Returns and Adjustments) – line 4 (Discounts)

In other words, Net Sales is calculated by aggregating its dependents/children (Gross Sales, Returns and Adjustments, and Discounts).

Note: I know that Returns and Discounts need to be subtracted instead of added; I will discuss ways of dealing with that later.

In fact, if you look at the rest of the P&L you will realize that all of the line items have a parent-child relationship to each other, with Net Income being the parent of all the other line items. The following shows exactly what a financial GL parent-child hierarchy might look like.

Create another pivot table by using the following:

Adventure Works Cube -> Financial Reporting -> Rows = Accounts, Columns = Amount

This is the full GL Accounts parent-child hierarchy, and it is incredibly valuable for providing consistent and flexible financial analysis. The user can drill down to any level, starting from the top (Net Income) down to the bottom (individual GLs), and perform all their analysis at any level of granularity. Add to that the analytical attributes like Business Unit, Sales Channel, and Cost Center, and you have an extremely valuable solution capable of analyzing any part of the company horizontally (the GL hierarchy) and vertically (analytical attributes, which can have their own hierarchies) instantly!!

Note: The full GL hierarchy not only contains the Net Income GLs but the Balance Sheet and Statistical GLs as well. Net Income is a child of the Balance Sheet, typically under one of the Current Retained Earnings accounts.

Take a look at the DimAccount table in the AdventureWorksDW database:

Notice how the parent of the Balance Sheet member is null. That is because the Balance Sheet is the parent of all GLs, so it has no parent of its own.
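To see the raw parent-child structure, a query along these lines should do it (a sketch based on how I recall DimAccount being laid out in AdventureWorksDW; double-check the column names in your copy):

-- each row points at its parent through ParentAccountKey; the top-level member has NULL
SELECT AccountKey,
       ParentAccountKey,
       AccountDescription,
       AccountType,
       Operator            -- the unary operator (+/-) applied when aggregating children
FROM   dbo.DimAccount
ORDER BY ParentAccountKey, AccountKey;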

OK, now that we have explored the potential of the GL parent-child hierarchy, the downside is that most financial source systems don’t have a nice GL hierarchy pre-built for you. You are going to have to get business consensus on the structure of the hierarchy and build it yourself, and you will also need a way to manage it once it is built. You will also have to create pseudo GL accounts that don’t already exist in the source system. These pseudo GLs will be parent GLs, like Balance Sheet or Net Income, which will be required to complete the calculations in the financial statements.

For example, what do you do when new GLs are created, and where do they go in the hierarchy?

Fortunately, there is a solution for that through Master Data Management. Microsoft’s Master Data Management solution is called Master Data Services, and it handles GL parent-child hierarchies very nicely.

Building the GL hierarchy is not a trivial task. The one that I built recently was for a larger company and had 180,000-plus GLs. All the GLs (members of the GL Account dimension) had to be put into exactly the correct order for the calculations to be correct.

The next post will discuss the other critical parts of the Financial DW solution.

Things to consider: Parent Child Hierarchies and Cube performance:

See: Analysis Services 2008 R2 Performance Guide

Note: Chris Webb has a good blog post about how to change the order of the named set (Income Statement) http://cwebbbi.wordpress.com/2013/07/31/ordering-of-named-sets-in-excel/

The Start of Something Great

July 20, 2011

The purpose of this blog is to archive, as much as I can, the solutions that I have found or developed throughout my journey in Data Warehousing. These are things that I have found useful, so they might be useful for you as well.

Getting started:

Gathering a good knowledge base is a good place to start. While blogs are a good way to find information on specific problems, I like to read a few books on a subject before I start diving in. They provide me with good context for understanding the subject area. Listed below, in no particular order, are the books I have read that have helped me the most.

Kimball:

MDS

SQL Server

SSAS

SSIS:

SSRS:

Read, Develop, Break, Fix and Repeat …..