Scale Out SSIS

There are many ways to parallelize SSIS ETL processing, and I am not going to cover the variety of methods here; there are plenty of examples on the net already.

One of the resources that describes some of the more standard methods is Matt Masson’s presentation SSIS Design Patterns.

What I would like to describe is one way to scale out SSIS ETL processing for a specific use case.

Scenario:

    • There are many sources that all have identical schemas
    • You have multiple SSIS servers

The standard approach is to develop an SSIS package for each source and then spread the SSIS packages among the SSIS servers. You then schedule them to run in parallel as needed. While this approach works well, I recently had to architect a solution where the number of sources was constantly changing. New sources were constantly being brought online or disabled, and developing new SSIS packages for each change would take too long.

What I developed was a Round_Robin table that contained a list of all the source databases along with their connection information. Since their schemas were identical for the data that was being extracted, I could develop one SSIS package and then deploy many copies of that same package to the SSIS servers.
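As a sketch, the Round_Robin table might look something like this. The column names and types here are my illustration of the idea, not the exact schema from the sample solution:

```sql
-- Hypothetical schema for the Round_Robin control table.
CREATE TABLE dbo.Round_Robin
(
    DB_ID            INT           NOT NULL PRIMARY KEY,
    DatabaseName     NVARCHAR(128) NOT NULL,
    ConnectionString NVARCHAR(512) NULL,        -- optional, see note below
    [Status]         VARCHAR(20)   NOT NULL     -- 'enabled', 'disabled', 'processing', 'complete'
        CONSTRAINT DF_Round_Robin_Status DEFAULT ('enabled')
);
```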

The packages were developed in such a way that when they were executed they would first look up a source from the Round_Robin table with a status of “enabled”, get the connection information, extract the data, and then mark the source as “complete”. This cycle would continue until there were no more sources left to be extracted. Additionally, since there were multiple packages with the same logic on the server, I could execute as many packages as needed and they would just grab the next source database from the Round_Robin table and loop until all the source databases were complete.
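The lookup-and-claim step can be sketched in T-SQL. This is my own illustration rather than the exact query from the sample package; the key point is that claiming the next “enabled” source and marking it “processing” happens in a single atomic statement, so two packages can never grab the same database:

```sql
-- Atomically claim the next enabled source (returns 0 rows when none are left).
UPDATE TOP (1) rr
SET    rr.[Status] = 'processing'
OUTPUT inserted.DB_ID, inserted.DatabaseName, inserted.ConnectionString
FROM   dbo.Round_Robin AS rr WITH (READPAST, UPDLOCK, ROWLOCK)
WHERE  rr.[Status] = 'enabled';

-- After the extract finishes, mark the source complete.
UPDATE dbo.Round_Robin
SET    [Status] = 'complete'
WHERE  DB_ID = @DB_ID;  -- @DB_ID: the value returned by the claim above
```

The READPAST hint lets concurrent packages skip rows that another package has already locked, instead of blocking on them.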

Below is a simplified example:

Execute the Demo Package to create the Test DB and the Round_Robin table.


Select all the rows from the Round_Robin table


Every source database has been assigned an ID and a status. An additional connection string column could be added so that the connection information for the database can be passed to the package; however, because this is a simple example, I did not include it.

Notice that one of the databases is “disabled”. Any database not in “enabled” status will be ignored by the package by default.

Press the continue button in BIDS


Notice that the DB_ID variable is now 4

Select all the rows from the Round_Robin table


Notice that the source database has now been marked as “processing” so that no other running package will try to use the same database.

Press the continue button in BIDS


DB_ID has been changed to 3.

Select all the rows from the Round_Robin table


Database 4 has been marked as “complete” and Database 3 is marked as “processing”.

Keep pressing the continue button in BIDS until all the databases have been processed.


Notice that the package loops through the source databases until it reaches the “none” database (DB_ID = 0). While this example only shows one package looping through the databases, the package can be copied many times and deployed to multiple SSIS servers, and adding a new source database is as simple as inserting a new record into the Round_Robin table. The same technique can be used with any other kind of source, such as flat files. Additional functionality can be added as well, such as grouping sources into time slots to be processed in groups at various times.
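For example, the time-slot grouping mentioned above could be a hypothetical extra column on the control table; each scheduled run then claims only from its own slot. The column name and slot values here are illustrative assumptions:

```sql
-- Hypothetical extension: assign each source to a processing window.
ALTER TABLE dbo.Round_Robin
    ADD TimeSlot VARCHAR(20) NULL;  -- e.g. 'nightly', 'hourly'

-- A package scheduled for the nightly window claims only nightly sources.
UPDATE TOP (1) rr
SET    rr.[Status] = 'processing'
OUTPUT inserted.DB_ID, inserted.DatabaseName
FROM   dbo.Round_Robin AS rr WITH (READPAST, UPDLOCK, ROWLOCK)
WHERE  rr.[Status] = 'enabled'
  AND  rr.TimeSlot = 'nightly';
```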

Sample SSIS Solution


2 Responses to “Scale Out SSIS”

  1. Ralph D. WIlson II Says:

    This is an excellent post. You provide a simple, yet elegant and extensible, solution to an often encountered problem.

In addition to using this approach to handle the constantly changing resources, I am wondering whether this might also be a functional approach for handling multiple packages with multiple data sources.

For instance, by adding two more columns, one to identify which package (i.e., set of packages on various servers) would be executing and another identifying the source (or target) file location, you could employ this approach to allow for, say, a set of packages that load incoming data into a staging table or that export a set of data to a file system. That way, you could have multiple instances of the import/export package running in order to optimize the overall process.

I can think of several extensions of this concept that could serve to make the overall processing of data much more efficient.

  2. Garrett Edmondson Says:

–Ralph
Thanks for the kind words. I am happy to see that it has value for someone else as well.
It can absolutely be extended in the fashion that you described. I have created frameworks that store many values in the “Round_Robin” table, like dynamically changing source queries. I have even implemented packages that have different control flows and/or data flows executed depending on the requirements.
Example:
Package A just does data extraction logic; once all of the extractions are complete, it then switches over to loading or transformation depending on what is left to be processed.
You can also add BiML to dynamically create SSIS packages: https://garrettedmondson.wordpress.com/2012/06/17/dynamically-generate-ssis-packages/
