Archive for December, 2011

ETL from Oracle to SQL Server

December 21, 2011

I was recently asked to provide some performance tuning recommendations for an ETL process that extracts data from Oracle using Informatica (on Unix) and loads it into SQL Server staging tables.

Initially the ETL process was taking 20+ hours to complete. After some preliminary analysis, it was determined that the first major bottleneck was that Informatica was doing row-by-row inserts into the SQL Server staging tables!

Informatica could read from a single Oracle table at about 3,500 rows per second but was only inserting into SQL Server at about 300 rows per second. The nightly ETL load is 4+ terabytes, so it became apparent why it was taking 20+ hours to complete! To put the gap in perspective (the row count here is purely illustrative): 20 million rows takes roughly 18.5 hours at 300 rows per second, but only about 1.6 hours at 3,500 rows per second.

Note: This installation of Informatica was on a Unix server, so it did not have the SQL Server drivers needed to do bulk inserts. If you install Informatica on a Windows server, you can take advantage of the SQL Server drivers and do bulk inserts.
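To see why that matters (the table and values below are purely illustrative), row-by-row loading means the driver issues one fully logged, individually acknowledged INSERT per row:

```sql
-- Hypothetical sketch of what a non-bulk driver effectively sends:
-- one statement and one network round trip per row.
INSERT INTO dbo.StageOrders (order_id, customer_id, order_total)
VALUES (1001, 42, 19.95);

INSERT INTO dbo.StageOrders (order_id, customer_id, order_total)
VALUES (1002, 87, 250.00);

-- ...repeated millions of times. Each statement is parsed, logged, and
-- acknowledged on its own, which is why throughput tops out at a few
-- hundred rows per second.
```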

Our recommendations came in the form of three options:

1. Informatica Oracle flat-file dump with an SSIS load to SQL Server staging tables. See the image below.

[Image: Option 1, Informatica flat-file dump]

While this technique greatly improved ETL performance, it would require completely redesigning the existing Informatica mappings and buying another server just to hold the flat files.

Plus, it seems a tad outdated to still be doing flat-file dumps in 2011 between two perfectly good database engines simply because the ETL tool can't do bulk inserts!
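For what it's worth, the SQL Server side of option 1 boils down to a bulk load of each dump file. A minimal T-SQL sketch, assuming a hypothetical staging table, file path, and delimiters:

```sql
-- Hypothetical bulk load of an Informatica flat-file dump.
-- Table name, file path, and delimiters are illustrative only.
BULK INSERT dbo.StageOrders
FROM '\\etlshare\informatica\orders_20111221.dat'
WITH (
    FIELDTERMINATOR = '|',     -- delimiter the dump was written with
    ROWTERMINATOR   = '\n',
    BATCHSIZE       = 100000,  -- commit in chunks to keep the log in check
    TABLOCK                    -- enables minimally logged bulk loading
);
```

SSIS gets the same effect through the OLE DB Destination's fast load option.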

2. Upgrade DataDirect to version 6.1, which has a driver that can do bulk inserts into SQL Server. (DataDirect 6 actually ships with the driver disabled!) The downside is that it costs some serious $$$ to upgrade, but there is no additional development work needed once the upgrade is complete.

3. Use the SSIS Oracle Connector to pull directly from Oracle. Here are the performance results. There are many factors that can affect performance, so your results will vary.

Source   Destination                    Rows        Time (min)   Rows/sec
Oracle   SQL Server DB (local laptop)   1,000,000   6.0          2,778
Oracle   SQL Server DB                  1,000,000   6.1          2,732
Oracle   Flat file                      1,000,000   5.4          3,086

Of course, I prefer the SSIS option because it performs roughly on par with Informatica but without all the additional cost. With more optimizations, SSIS can even outperform Informatica. See the links at the bottom of the post for more information.
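One of the cheapest optimizations is keeping the staging tables bulk-load friendly. As a hedged sketch (the table and columns are hypothetical; the Data Loading Performance Guide linked below covers this in depth), create the stage as a bare heap so the load can be minimally logged:

```sql
-- Hypothetical staging table designed for fast bulk loads:
-- a heap with no indexes, keys, constraints, or triggers.
CREATE TABLE dbo.StageOrders
(
    order_id     BIGINT         NOT NULL,
    customer_id  BIGINT         NOT NULL,
    order_date   DATETIME       NOT NULL,
    order_total  DECIMAL(18, 2) NOT NULL
);
-- Build any indexes after the load completes, not before.
```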

Steps to install the SSIS Oracle Connector on a dev box that uses BIDS

1. Download and install the 32-bit and 64-bit SSIS Oracle Connector. BIDS debugging runs in 32-bit, so you need both!

2. Download and install Oracle SQL Developer and the Java Platform (JDK)

3. Set up the connection to the Oracle DB in SQL Developer

[Image: SQL Developer connection settings]

4. Set up the Oracle connection in BIDS.

[Image: SSIS Oracle connection configuration]

**Make sure the TNS service name is in the format <IP>:<port>/<service name>, for example 192.168.1.10:1521/ORCL (values here are illustrative). Also think about how to manage the credentials used to access Oracle.

Also, for 'Data access mode' in the 'Oracle Source' dialog box, use "SQL command". It performs a lot better!

[Image: Oracle Source configuration]
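For illustration, here is the sort of query you might paste into the SQL command box (the schema, table, and columns are hypothetical). Projecting and filtering at the source keeps unnecessary data off the wire:

```sql
-- Hypothetical Oracle-side source query for 'SQL command' mode.
SELECT order_id,
       customer_id,
       order_date,
       order_total
FROM   sales.orders
WHERE  order_date >= TO_DATE('2011-12-01', 'YYYY-MM-DD')
```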

**To install the SSIS Oracle Connector on the server, follow the same steps above, but you won't need to install the 32-bit version 😉

Good Performance Tuning Articles Related to SSIS and ETL:

The Data Loading Performance Guide:

This document describes techniques for bulk loading large data sets into SQL Server. It covers both the available techniques and methodologies for performance tuning and optimizing the bulk loading process.

We Loaded 1TB in 30 Minutes with SSIS, and So Can You

In February 2008, Microsoft announced a record-breaking data load using Microsoft® SQL Server® Integration Services (SSIS): 1 TB of data in less than 30 minutes. That data load, using SQL Server Integration Services, was 30% faster than the previous best time using a commercial ETL tool. This paper outlines what it took: the software, hardware, and configuration used. We will describe what we did to achieve that result, and offer suggestions for how to relate these techniques to typical scenarios. Even for customers who don’t have needs quite like this benchmark, such efforts can teach a lot about getting optimal performance.

Considerations for High Volume ETL Using SQL Server Integration Services

Providing for high volume extract, transformation, and load (ETL) throughput requires consideration of both best practices for standard relational databases, and of high performance techniques for ETL.

This article presents a number of suggestions relating to high-volume ETL throughput, specifically using SQL Server Integration Services. Aspects of both ETL performance and management are described.

Fast Track Data Warehouse 3.0 Reference Guide

This paper defines a reference configuration model (known as Fast Track Data Warehouse) using an I/O balanced approach to implementing a symmetric multiprocessor (SMP)-based SQL Server data warehouse with proven performance and scalability expectations for data warehouse workloads. The goal of a Fast Track Data Warehouse reference configuration is to achieve a cost-effective balance between SQL Server data processing capability and realized component hardware throughput.

Best Practices for Data Warehousing with SQL Server 2008 R2

There is considerable evidence that successful data warehousing projects often produce a very high return on investment. Over the years a great deal of information has been collected about the factors that lead to a successful implementation versus an unsuccessful one. These are encapsulated here into a set of best practices, which are presented with particular reference to the features in SQL Server 2008 R2. The application of best practices to a data warehouse project is one of the best investments you can make toward the establishment of a successful Business Intelligence infrastructure.

Data Integration at Microsoft: Technologies and Solution Patterns

Microsoft SQL Server Integration Services Performance Design Patterns

Data Mining

December 14, 2011

Data mining techniques can be used in virtually all business applications, answering various types of business questions. In truth, given the software available today, all you need is the motivation and the know-how. In general, data mining can be applied whenever something could be known, but is not. The following examples describe some scenarios:

Recommendation generation — What products or services should you offer to your customers? Generating recommendations is an important business challenge for retailers and service providers. Customers who are provided appropriate and timely recommendations are likely to be more valuable (because they purchase more) and more loyal (because they feel a stronger relationship to the vendor). For example, if you go to online stores such as Amazon.com or Barnesandnoble.com to purchase an item, you are provided with recommendations about other items you may be interested in. These recommendations are derived from using data mining to analyze the purchase behavior of all of the retailer's customers, and applying the derived rules to your personal information.

Anomaly detection — How do you know whether your data is "good" or not? Data mining can analyze your data and pick out those items that don't fit with the rest. Credit card companies use data mining–driven anomaly detection to determine if a particular transaction is valid. If the data mining system flags the transaction as anomalous, you get a call to see if it was really you who used your card. Insurance companies also use anomaly detection to determine if claims are fraudulent. Because these companies process thousands of claims a day, it is impossible to investigate each case, and data mining can identify which claims are likely to be false. Anomaly detection can even be used to validate data entry, checking to see if the data entered is correct at the point of entry.

Churn analysis — Which customers are most likely to switch to a competitor? The telecom, banking, and insurance industries face severe competition. On average, obtaining a single new mobile phone subscriber costs more than $200. Every business would like to retain as many customers as possible. Churn analysis can help marketing managers identify the customers who are likely to leave and why, and as a result, they can improve customer relations and retain customers.

Risk management—Should a loan be approved for a particular customer? Since the subprime mortgage meltdown, this is the single most common question in banking. Data mining techniques are used to determine the risk of a loan application, helping the loan officer make appropriate decisions on the cost and validity of each application.

Customer segmentation — How do you think of your customers? Are your customers the indescribable masses, or can you learn more about your customers to have a more intimate and appropriate discussion with them? Customer segmentation determines the behavioral and descriptive profiles for your customers. These profiles are then used to provide personalized marketing programs and strategies that are appropriate for each group.

Targeted ads — Web retailers or portal sites like to personalize their content for their Web customers. Using navigation or online purchase patterns, these sites can use data mining solutions to display targeted advertisements to their Web navigators.

Forecasting — How many cases of wine will you sell next week in this store? What will the inventory level be in one month? Data mining forecasting techniques can be used to answer these types of time-related questions.
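To make one of these scenarios concrete, here is a hedged sketch of a churn prediction query in DMX, the SQL-like language used by SSAS data mining (the model, data source, and columns are all hypothetical):

```sql
-- Hypothetical DMX prediction query against an SSAS mining model.
-- [ChurnModel], [MyDataSource], and the column names are illustrative.
SELECT
    t.CustomerID,
    PredictProbability([Churned]) AS ChurnProbability
FROM [ChurnModel]
PREDICTION JOIN
    OPENQUERY([MyDataSource],
        'SELECT CustomerID, Tenure, MonthlyCharges FROM dbo.Customers') AS t
    ON  [ChurnModel].[Tenure]         = t.Tenure
    AND [ChurnModel].[MonthlyCharges] = t.MonthlyCharges
ORDER BY PredictProbability([Churned]) DESC
```

Marketing can then aim retention offers at the customers at the top of the list.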

Resources:

Data Mining with Microsoft SQL Server 2008: Provides in-depth knowledge about data mining with SSAS. Covers the algorithms, DMX, and SSIS data mining solutions in detail. This is an absolute must for learning how to develop a data mining solution. It has tons of great examples.

Predixion Insight: Jamie MacLennan, Director of SSAS at Microsoft, left Microsoft and started Predixion, which offers the best SSAS-based data mining product on the market. The Excel demo comes with great examples as well.

Data Mining Techniques: Provides good examples of how to develop a strategy for using Data Mining to solve business problems.

Data Mining Webcasts

Automation of Data Mining Using Integration Services: This article is a walkthrough that illustrates how to build multiple related data models by using the tools that are provided with Microsoft SQL Server Integration Services. In this walkthrough, you will learn how to automatically build and process multiple data mining models based on a single mining structure, how to create predictions from all related models, and how to save the results to a relational database for further analysis. Finally, you view and compare the predictions, historical trends, and model statistics in SQL Server Reporting Services reports.