Archive for the ‘MPP’ Category

Performance Tuning SQL Server Data Warehouse Database

December 13, 2012

I was recently invited to give presentation about performance tuning SQL Data Warehouse databses  to the The Southeast Michigan SQL Server User Group. Blue Granite was kind enough to pay for my travel to the user group.

I don’t usually give many presentations. I am more of a heads down coder by heart, but I love sharing my ideas and experience with others! Let me know what you think.

My presentation is titled Data Warehousing from the Trenches. It is a  presentation of lessons learn based on my recent involvement with Microsoft’s Massive Parallel Processing (MPP) Appliance called Parallel Data Warehouse (PDW). In it I also cover some of the new features of SQL Server 2012 like ColumnStore Indexes and how they are going to greatly enhance SQL Serve’s Data Warehouse scalability and performance.

Presentation PowerPoint Slide Deck

Note: The PDW projects that I have worked on to date where as an Independent Contractor with B.I. Voyage. My Linked-in Profile provides an over view of my work experience.



November 28, 2012

There has been a lot of discussions recently around ‘Big Data’. Most of the conversations  that I have been involved with around ‘BIG DATA’  have, in general, been very productive and informative. They have naturally lead into Data Warehousing best practices and how the business can best utilize it’s IT resources to make better decisions.  (IT is a resource and can help generate revenue!!! NOT an expense)

However, a significant number of conversations, especially on the internet, seem to contain more hype than real  substance around Big Data. It is not unusual for me to see “Big Data” in the title and then have the article or presentation say nothing substantive about it !

It can be hard to filter the hype/marketing from the truth about how proper Big Data management can improve the bottom-line.

So I have compiled a brief list of resources to help with understanding Big Data. I will add to the list as new resources become available.


Incorporating Big Data Technologies into your BI & DW Strategy (Chicago): Microsoft in partnership with BlueGranite (full discloser: my employer) is putting together a Big Data presentation to help Business Decision Makers understand the value of Big Data.

RobKerr’s Blog: If you’re a #Microsoft #MSBI Pro, it’s time to learn about #BigData and #Hadoop: Here’s how! Great BIG DATA links !

BI Academy: If you can’t make the presentation in Chicago Rob Kerr (SQL MVP)  has put together series of great videos which provide an introduction to BIG DATA, Hadoop, and Data Warehousing and how they interact together.

Microsoft Big Data : Provides some great introductory material around BIG DATA. If you are a Data Warehouse Professional I would highly recommend that you download HDInsight and get comfortable with it. Yes I know yet another thing to learn but it it worth it!

Polybase: This is still bleeding edge stuff that will first appear in version 2 of SQL Server Parallel Data Warehouse PDW, but I am guessing that similar functionality might make it’s way into regular SQL Server SMP. It ability to query and join both structured and unstructured data is AWESOME!

SQL 2012 DW Learning Resources

July 8, 2012

There are a lot of things to learn with the release of SQL Server 2012.  While there are many good resources to learn about the new features, TechEd is one the best places to learn from the actual engineers that made the feature or from industry leaders:

TechEd North America

TechEd is Microsoft’s premier technology conference for IT professionals and developers, offering the most comprehensive technical education across Microsoft’s current and soon-to-be- released suite of products, solutions, tools, and services. This June we celebrate the 20th annual TechEd North America in Orlando. Experience four days of hands-on training from Microsoft experts.

The following are some of my favorite sessions:

Killer Real-World PowerPivot Examples -This session is even more jam packed with amazing PowerPivot report examples than last year’s presentation. This is *the* session to attend if you want to get the most out of PowerPivot visualizations. See real-world PowerPivot reports from clients in Education, Retail, Banking, and Telco, and learn a variety of visualization techniques including sparklines, slicers, and charts. You will leave this fun session armed with plenty of ideas for your next personal BI project. Oh, and watch out for flying Koalas. #TEDBI330

SQL Server AlwaysOn: Active Secondaries – One of the most highly requested features in Microsoft SQL Server 2012 is the ability to offload read workloads and backups to secondary replicas in AlwaysOn configurations. In this talk we describe, using demos, how this feature works, how to configure it, and what the advantages and limitations are.

Enterprise Information Management (EIM): Bringing Together SSIS, DQS, and MDS – Enterprise Information Management (EIM) is an industry term for managing your data for data integration, quality, and governance, and is an important part of the Microsoft SQL Server 2012 release. This session revolves around a demo which brings together our EIM functionality in SQL Server 2012 and which tells our Credible, Consistent Data story. We show you how SQL Server Integration Services (SSIS), Data Quality Services (DQS), Master Data Services (MDS) and other Microsoft technologies work together to provide a comprehensive EIM solution. #TEDBI310

SQLCAT: SQL Server HA and DR Design Patterns, Architectures, and Best Practices Using Microsoft SQL Server 2012 AlwaysOnThis session describes how to build practical end-to-end HA and DR solutions for mission critical applications using SQL Server 2012 AlwaysOn technologies. For example, how to combine Failover Cluster Instances with Availability Groups to provide a complete HA and DR solution, and how to use Availability Groups Multiple Secondaries to replace existing database mirroring + log shipping solutions. We share learnings from early customer engagements, and discuss some practical customer examples on why they picked a particular solution. We also discuss upgrade from existing solutions to new "Denali" solutions with minimal downtime. #TEDBI316

Incremental ETL Using CDC for SQL and Oracle with SQL Server Integration Services (SSIS) 2012 Exploring real-world customer scenarios, we take a look at how the new Change Data Capture (CDC) components for SSIS in Microsoft SQL Server 2012 simplify incremental ETL and Data Warehouse loads. We demo the new functionality in SSIS, and show off how the new tools make it easy to set up and configure CDC against both SQL Server and Oracle sources. We present the different CDC configuration options, and demonstrate design patterns that can be used to overcome common problems you’ll face while doing incremental loading of data. If your ETL solution needs to handle incremental loads from data sources that were designed explicitly to support them, then this is a session you can’t afford to miss. #TEDBI322

Optimizing Microsoft SQL Server Analysis Services for Big Data – Big Data is now a reality and more than ever we are asked to analyze greter volumes of data. Find out how to make Analysis Services work harder and more effectively in this new world of Big Data. Specific strategies will be discussed. #TEDBI412

Big Data, Big Deal? Are you ready for the exploding world of big data? Do you know the difference between Hive and Pig? Do you know why MapReduce is being taught in many universities rather than SQL? If not, pay attention because this talk will help get you started in understanding this new world. While sometimes the Hadoop toolkit (which includes HDFS, MapReduce, Hive, Pig, and Sqoop) is used as an alternative to relational database systems such as SQL Server, more frequently customers are using it as a complementary tool. Sometimes it may be used as an ETL tool or to perform an initial analysis of a freshly acquired data set to determine whether or not it is worth loading into the data warehouse, and sometimes to process massive data sets that are too big to even contemplate loading into all but the very largest data warehouses. In addition to covering the basics of the various parts of the Hadoop stack, this talk discusses the strengths and weakness of the Hadoop approach compared to that provided by relational database systems and explores how the two technologies can be used productively in conjunction with one another. #TEDBI209

SQL Server Columnstore Performance TuningColumnstore indexes, and a new, vector-based query execution method called “batch mode,” can speed up query processing in data warehouses and data marts using Microsoft SQL Server 2012 by a breathtaking 10X to 100X. This talk describes how to tune indexes, queries, and load processes to help your customers get the most out of columnstore indexes and query execution. Learn secrets that can help you turn modest performance gains into stunning, orders-of-magnitude improvements, and blow away the competition. We focus on how to work around constraints that limit the use of batch processing when outer joins, unions, and certain other constructs are present in queries. We also describe how to load columnstores effectively, and even perform real-time loads on columnstore indexes that are technically read-only.

Building the Fastest SQL ServersSQL Server speed boils down to how fast we can get data in and out of storage. Sooner or later, we’re going to run out of memory to cache. In this session, we take a lesson from Microsoft’s blazing-fast Fast Track Reference Architectures, look at why they work so well for the wildest queries, and explain how to build our own SQL Servers for quick queries. Microsoft Certified Master Brent Ozar shows you how to test your own servers and compares them to servers he’s worked with in the field.

Building BI Solutions with Microsoft SQL Server PDW AU3SQL Server is releasing AU3 version of Parallel Data Warehouse (PDW) with a new software architecture featuring a cost-based optimizer, stored procedure support, SQL Server security model, as well as the Microsoft TDS protocol that allows native integration with existing Microsoft tools. The new architecture delivers order of magnitude (10x or more) performance improvement compared to the existing AU2 product, and greatly improves connectivity and programmability functionality. This session focuses on the benefits to BI solutions using SQL Server PDW, showing how to integrate PDW AU3 into key BI solutions: Microsoft BI Platform as well as key third-party products –

BISM: Multidimensional vs. TabularWith the introduction of BISM Tabular in Microsoft SQL Server 2012, you have two options for a new project. BISM Multidimensional, formerly known as UDM, is still an option. In this session, see a comparison between the two “engines” available in Analysis Services to learn when they are similar and when they are different. Learn what to choose and why, depending on your requirements.

Optimizing Your BI Semantic Model for Performance and Scale – The release of Microsoft SQL Server 2012 allows standalone tabular models. This session goes into the internal architecture of the engine and describes implications on performance and scale of the tabular models and how they can be tuned effectively.

Massively Parallel Processing and the Parallel Data Warehouse

April 15, 2012

The text of this post came the Microsoft Technical Article “Implementing a Microsoft SQL Server Parallel Data Warehouse Using the Kimball Approachthat I thought was so import that I reproduce  here is my BLOG.

Data growth is related to Moore’s law. As computers get faster and more powerful, we are using them to process more data in more complex applications. Traditional sources such as ERP systems are processing more transactions as our organizations grow. Newer sources such as web browsing activity, mobile devices, and social media are creating data sets that are orders of magnitude larger than what we’ve dealt with in the past.

We call these large scale data sets “big data.” Large organizations have always had to deal with big data, but the boundaries that define big data have expanded with the raw power of computers. The quantity of data you have to deal with only becomes a problem when it becomes difficult to work within your existing systems environment. For some organizations, this could be as small as 100s of gigabytes. For larger, more experienced organizations, the cross-over point is more in the tens of terabytes.

The term big data is a major buzzword in IT today precisely because big data holds the potential for big value. However, your ability to extract value from large data sets in your transaction systems and social media interactions is predicated on your ability to actually manage the data. Companies such as Google and Yahoo! have helped pioneer new technologies such as MapReduce and Hadoop to help deal with the massive quantities of unstructured data they collect. At the same time, mainstream technologies for dealing with more structured big data sets, such as massively parallel processing (MPP) systems and column store databases, are experiencing significant growth.

The common big data challenge in most organizations is to figure out how to make the large-scale volume of transactions in their corporate systems available and usable in an analytic environment. Dealing with newer sources of data, such as social networking, is important, but it is not the number-one problem in most companies today.

In this section we begin with a brief exploration of the strengths and limits of single-server systems when it comes to big data. We then dig into the MPP approach and explore the Parallel Data Warehouse system architecture.

The Strengths and Limits of Single-Server Systems

Most of the “servers” you work with are single servers with shared resources. Each CPU core can work with any section of memory or disk, and all memory and disk is available to each core. This all-in-one architecture is known as symmetric multi processing (SMP). As shown in Figure 6, the CPU cores connect to the memory and disk via a system bus. This shared connection supports high speed inter-process communication, memory, and data sharing, and is easier to manage because it is a single physical unit.


Figure 6: SMP architecture

However, SMP systems have a limit when it comes to big data; they can scale only to the point where their system bus gets overloaded. Too many CPUs making simultaneous requests for data on the system bus creates a traffic jam. As usage grows, the system bus becomes a bottleneck and limits the total amount of processing that can take place on an SMP system. There are ways to mitigate this contention by creating localized subsets of CPUs and memory, but this only extends the limit.

The Massively Parallel Processing Alternative

One time-tested strategy for handling large amounts of data is to avoid this bottleneck by distributing data and processing across many servers, or nodes, each of which has its own memory and disk so they can share the workload. This approach, known as massively parallel processing (MPP), has been around for several decades and is the basis for many of the largest super computers in existence today. Due to their high cost and complexity, MPP systems have historically been used by the largest companies and governmental organizations.

This massively parallel architecture lies at the heart of Microsoft’s Parallel Data Warehouse system. Parallel Data Warehouse is a Microsoft SQL Server product designed to scale data warehouses from tens to hundreds of terabytes of data. It delivers the MPP architecture using an “appliance” model, providing preconfigured, optimized commodity hardware and software and a single point of support.

The SQL Server PDW Architecture

Figure 7 shows an abbreviated architecture for a SQL Server PDW MPP system. A user query request would come into the control node, which breaks the SQL into multiple parallel operations and distributes them out to the compute nodes where the actual data resides. A special module called Data Movement Services coordinates any needed data movement among nodes taking place between and handles any functions that need to be resolved centrally. When the compute nodes are finished, the control node handles post-processing and re-integration of results sets for delivery back to the users.


Figure 7: The SQL Server PDW massively parallel architecture

Each compute node is a separate SMP server running SQL Server. Compute nodes in current SQL Server PDW configurations ship with dual, hex-core CPUs, 96 GB of memory, and local tempdb workspace. They are connected together using dual InfiniBand network to support high-speed node-to-node data sharing for cross-node computations known as data shuffling. This network also connects the compute nodes to the control and administrative nodes to support high-speed data loading, extraction of query results, backup, and other administrative functions.

The disk subsystems for the compute nodes are managed by a storage area network (SAN) component with high-speed dual Fibre Channel connectivity. This data bus supports high-speed I/O, and failover redundancy. The compute nodes and disk drives are physically housed in the same rack, called a data rack.

There are three types of administrative service nodes that share the control rack with the control node. These include:

  • Management Nodes, which provide the DBA or data center operations interface to access and manage the overall solution and support the system’s internal network.
  • A Landing Zone Node, where cleansed data is staged and prepped before loading into the data warehouse.
  • A Backup Node and the appropriate associated storage. The Backup Node provides high-speed integrated backup at the database level. This is tied to the organization’s overall backup strategy and systems.

The SQL Server PDW is a large-scale enterprise class system and has built-in redundancies:

  • Primary data is stored as RAID1.
  • Hardware redundancy includes redundant power supplies, spare disks, compute nodes, control nodes, and management servers, mostly designed to support automatic failover.

A Scalable Appliance

SQL Server Parallel Data Warehouse is sold as a data warehouse appliance: a set of commodity hardware and Microsoft software pre-configured to meet the needs of a range of data sizes and performance. This makes sense because configuring the individual components, network and connectivity throughput, and disk subsystem performance is a significant effort, more than most IT shops would care to take on. With the appliance, all components and network connectivity are carefully designed, configured, and balanced for optimal performance, and necessary software on all nodes is pre-installed and pre-configured.

The MPP architecture can be scaled up by adding racks of compute nodes. The base system starts with one rack. On an HP appliance, for example, a full rack holds 10 nodes, and additional 10-node racks can be added up to a total of 40 nodes. The 40-node limit is more due to the definition of the product and not an inherent limit of the system design. SQL Server PDW uses its backup and restore facility to make expanding a SQL Server PDW is fairly straight forward: back up the database, add the new rack, reconfigure, and restore. The database restoration automatically redistributes the data across all nodes.

Microsoft is working with several hardware vendors to offer SQL Server PDW systems. HP is the first to market with a publicly available product at this writing.

SQL Server PDW Data Management

The physical architecture of distributed nodes with local data means the large data sets have to be distributed across the nodes in a way that will support both data load and query processes. The goal is to get each node and CPU core working as hard as possible on every query. In the data warehouse, fact tables are distributed evenly across nodes so each node will have work to do.

Efficient processing on nodes results when local fact table subsets can join to local dimensions tables, which can be achieved if dimension tables are replicated to all nodes. SQL Server PDW allows you to specify distributed or replicated tables at time of creation, and then transparently manages placing the appropriate data on the appropriate compute nodes at load time.

The Kimball Approach on SQL Server PDW

How does SQL Server PDW fit with the Kimball approach? When you compare it with our principles, it fits quite well. It provides good usability and flexibility because in most cases, you can build a set of atomic-level dimensional models with conformed dimensions. It performs well because the workload is distributed across all the compute nodes rather than bottlenecked on a single server. SQL Server PDW gets an additional performance boost at the node level because SQL Server has functions to support dimensional models, including star-join optimization. And SQL Server PDW’s support for replicated dimension tables allows many common query scenarios to be satisfied without more expensive data-shuffling operations.

Parallel Data Warehouse – Slowly Changing Dimension

April 1, 2012

Most of my day-to-day work is currently centered around Microsoft’s Massively Parallel Processing Appliance call PDW (Parallel Data Warehouse).  Many of the same techniques common to SMP (Symmetric Multiprocessing)systems such as Kimball method Slowly Changing Dimension (SCD) are still very important.

Identifing the deltas between an incoming dataset and the destination dimension table can be one most resource intensive, complex, and valuable processes in a SMP or PDW ETL system. Careful planning is necessary to ensure that ETL process preforms well  and is scalable. Nowhere is this more important than in a PDW that can process 10s to 100s of terabytes daily!

My previous post Building the Enterprise DW/BI System with SQL Server PDW describes the basic PDW components discussed below.

The following gives some real world SCD statistics and lessons learned.

SCD – Replicate vs. Distributed tables

Replicated Table ‘A’ – 28,977,757 rows – 34 min

Distributed Table ‘A’ – 28,977,757 rows – 21 seconds!

The difference between  SCD on a replicated and distributed table is dramatic. I expected that the distributed table might preform slightly faster because the table is distributed across all the compute nodes and so therefore it maximizes the parallelism of the PDW. However I didn’t expected such a huge difference. (Note: 34 minutes to process 28 million + rows using type 2 SCD is not bad either. Try doing that on your SMP )

I found the reason for the difference in the Query Plan. The Query Plan contains several steps. The most  costly steps are SCD DML operations (update expire rows) steps. Basically for each update statement the PDW;

1 – Creates a copy of the Target table /Dimension Table

2 – Preforms the update on the copy

3 – Publishes the updated copy to each node

See example below which shows part of the SCD logic of a 1.5 million row table:

SCd update replicated tables

Conclusion: Distributed tables can a have superior DML performance, especially for large tables, however this SCD performance increase will come at the cost of query performance which will be slower if the queries do not contain the distribution column. From my experience I would recommend considering the possibility of converting  a replicated tabled to a distributed table for SCD DML performance improvement if the table is great than 5-10 million rows.

Distribute Table on a Key Column

If you chose to make the Destination Table/Dimension Table a Distributed Table it is important to distribute the table on  one of the Traget Tables Key Column(s), other wise the SCD DML operations will incur a heavy performance hit of ShuffleMoveOperation because the processing nodes must shuffle/move data in order to satisfy the SCD DML queries.

72 million row Distributed Table

Not Distributed on a Key Column – 13 min.

Distributed a Key Column –   30 seconds!

Example: Not Distributed on a Key Column:

Distributed on non Key Columns

Example: Distributed on a Key Column

Distributed on Key Columns

There is still a small ShuffleMove but the most expensive SufflesMoves have been eliminated.  It might not always be possible eliminated all ShuffleMoves.


Microsoft’s Parallel Data Warehouse offers a huge advantage in Slowly Changing Dimension (DML) processing performance over traditional SMP systems.

Building the Enterprise DW/BI System with SQL Server PDW

March 25, 2012

The text of this post came the Microsoft Technical Article “Implementing a Microsoft SQL Server Parallel Data Warehouse Using the Kimball Approachthat I thought was so import that I reproduce  here is my BLOG.

Building the Enterprise DW/BI System with SQL Server PDW

Most readers considering a Parallel Data Warehouse already have a data warehouse in place and are looking for ways to help handle growing data and performance demands. Many of these next-generation, large-scale data warehouse/business intelligence systems are evolving from existing DW/BI systems that are designed based on the Kimball approach. In this case, the transition to SQL Server PDW will be straightforward.

In this section we run through the basic steps for converting an existing SMP-based Kimball data warehouse to a Parallel Data Warehouse server, including the impact of SQL Server PDW on the DBA. We’ll also explore additional roles SQL Server PDW can play, including serving as the central source or hub, in a distributed data warehouse environment, as an ETL transformation engine, and as a platform for providing real-time analytic data.

Preparation and Installation

The SQL Server PDW system must live in a data center and involves at least two racks, so you should do some planning with your server management group before the truck shows up on installation day. Since it uses InfiniBand, any other servers you want to benefit from fast data transfer functions will need InfiniBand connections and to be located close enough to the SQL Server PDW server to meet any cable limitations.

Vendor installation is usually part of the purchase and takes a few days depending on what issues show up.

Part of planning should include some consideration of your overall conversion strategy. The options range from directly converting the existing data warehouse to completely re-architecting the system as part of the migration process. We will focus on the direct conversion approach in this section and discuss the re-architecting options later in this paper.

Data Migration

Once the machine is up and running, the next step is to create the new database, instantiate the target objects and their properties, and copy over the data. The Parallel Data Warehouse database is a SQL engine, but it is a bit different from the SMP-based SQL Server database. This is mostly because it is a parallel processing system, and some things don’t work quite the same. Certain functions have an underlying assumption of serial processing that doesn’t work in a parallel environment. Other functions, such as distributing data across nodes for parallel execution, don’t exist in the SMP environment.

If you are converting an existing SMP SQL Server database to SQL Server PDW, you can use a tool the Microsoft PDW team has built to help. It creates tables, adjusts indexes and partitioning, suggests distribution strategies for the fact tables, identifies problems such as data types that do not have direct equivalents in SQL Server PDW, and generates the actual BCP out scripts to get data from SQL Server and load scripts to load data into SQL Server PDW.

If your existing data warehouse is not SQL Server, the initial data migration is still fairly straightforward as long as you have a solid set of dimensional models. It shouldn’t take more than a few hours depending on the number of tables involved.

One big advantage of the SQL Server PDW system from the DBA’s perspective is the simplification it brings to physical data management. The physical location of data, including filegroups, disk layout, LUNs, and tempdb location, is all handled automatically as part of the core SQL Server PDW system.

There is one high-level physical decision to be made when moving to a massively parallel environment: how the tables should be split up across the nodes. There are two primary ways to physically instantiate tables in SQL Server PDW: replicated or distributed. The CREATE TABLE DDL includes a distribution clause where this is specified.

Replicated Tables

A replicated table looks like a single table to anyone who accesses SQL Server PDW, but it is actually replicated out to all compute nodes on the server. That is, there is one copy of the table on each node.

The purpose of replicating tables is to improve performance by having local copies of data on each node to support local joins. Replicated tables are generally used for dimensions and lookup tables to support local joins to the fact tables.

The replicated tables are managed by the system transparently. From the DBA’s perspective, the CREATE TABLE syntax is pretty simple:


CustomerKey int NOT NULL,

Name varchar(50),

ZipCode varchar(10))



The default is REPLICATE if the distribution clause is omitted.

Distributed Tables

The rows of a distributed table are spread across all nodes as evenly as possible. Each row is written out to a distribution which is a storage location on a node. There are eight distributions on each compute node, each with its own disks. In other words, no copies are made; each row in the source table ends up in only one distribution on one compute node. The rows are mapped to the distributions using a hash function on a column from the table.

The goal of distribution is to improve performance by maximizing parallel processing. Fact tables are usually the largest tables in the data warehouse, and are usually distributed.

Figure 9 shows a simplified version of the distribution of a Sales Fact table across eight compute nodes based on the CustomerKey column.


Figure 9: Fact table distribution

The Customer Key from each row from the Incoming Sales Fact Data in the upper left is put through a hash function. The hashed values map to a single distribution on a single node. For example, the row for customer key 44 hashes to 0x1C, which maps to the last distribution of the first compute node. Here is the DDL for the distributed table shown in Figure 9:



CustomerKey INT,

DollarAmount MONEY)



The choice of the distribution column is key, so to speak. If a few customers accounted for a large percentage of sales, using Customer Key would lead to an imbalance in the data distribution. One or two distributions would end up with a larger percentage of the data. This imbalance is called data skew. One or a few distributions with 10% more rows than average may cause problems, and a difference of greater than 30% will lead to poor performance. This makes sense because each query has to wait for all nodes to complete, and any node with significantly more data will take longer than the others when processing queries involving skewed data.

The primary criteria for selecting a good column for distribution are high cardinality and even row counts. There are other considerations for choosing the distribution column. For example, it’s not a good idea to choose a column that is often constrained to a single value in user queries. If users typically constrain on a single day, then the DateKey column is not a good candidate because all the rows for that day will end up in a single distribution. Other factors come into play when selecting a distribution key, such as distributing multiple fact tables that may need to be joined together to support certain analytics.

The parallel processing power of the SQL Server PDW system allows you to test your distribution key choice. Pick a distribution key, load the table, and run some distribution queries and a representative set of user queries against it. If you find a problem, you can create another version of the distributed table from the first version by using the CREATE TABLE AS SELECT statement and changing the column in the DISTRIBUTION = HASH () clause. This is generally much faster than you would expect because of the parallel processing. Of course, you need enough space to make multiple copies of your large fact tables, even if they are only experimental.

Dealing with Very Large Dimensions

As we said, dimensions are almost always replicated in the SQL Server PDW data warehouse. As a rule of thumb, dimension tables that are 5 GB uncompressed or smaller should be replicated. You do have to allow for space on each node for the replicated table. A 5-GB dimension would compress to around 2 GB, which would take up a total of 20 GB once it is replicated across a 10-node rack. By the way, compression is automatic and mandatory in SQL Server PDW.

Just to get a sense of the dimension table size that qualifies for replication, a Product dimension like the one shown in Figure 5 with a 500-byte uncompressed row size could hold about 10 million rows before you might consider other options.

Dimensions larger than 5 GB uncompressed are not unheard of, especially when dealing with big data. If you have a dimension that exceeds the replication threshold, you have two main options in a parallel environment: distribution or normalization.

Distributing a large dimension leverages the same parallel processing power with the fact table. However, if the rows needed to resolve a query are not on the same node as the associated fact table, the required dimension keys must be “shuffled” between nodes. SQL Server PDW is designed to move data rapidly when necessary for a query processing step, but it’s always faster to stay local.

In some cases, it may be possible to distribute the dimension using the same surrogate key as the fact table. This shared distribution key means the joins remain local because the required dimension rows are on the same nodes as corresponding fact rows.

The second option is to normalize very large dimensions to reduce their size and make their replication less burdensome. If the product dimension shown in Figure 5 had 10 million rows, it would require about 2 GB (depending on column widths and compression ratios), which is around the replication boundary. The normalized product table shown in Figure 4 would only require about 325 MB to hold 10 million rows. Obviously 325 MB is going to be easier to copy out to 10 or 20 nodes than 2 GB.

If most queries against a large dimension only return or constrain against a few columns, consider creating an outrigger dimension. That is, the core dimension will contain the commonly used columns. The rest of the columns are put into a separate table, called an outrigger, with the same surrogate key. The core dimension can then be replicated, and will join locally to the fact tables. Queries that require the less common attributes can bring them in with a single join to the replicated outrigger dimension. This is an easy way to get back into the range where replication works without having to completely normalize the dimension.

You can insulate users from the complexity of a normalized or outrigger design by providing views that re-combine the normalized columns back into a single dimension. Test these views to make sure they do not negatively impact performance.

You may have heard that normalization is a requirement for MPP systems. Some historical context might help explain this. Early MPP systems had tighter space constraints, lower bandwidth between nodes, and more costly storage. This led to a default practice of normalizing the dimensions in order to reduce the amount of data replicated onto each node. MPP vendors glossed over this need to normalize by arguing that you should use a normalized model because it is the “industry standard” for an enterprise data warehouse. Do not be fooled by this reverse logic. Normalizing dimensions is an MPP design choice made to improve performance by reducing the amount of data that must be replicated and stored across the nodes. Again, the need to normalize a dimension has been a rare occurrence in SQL Server PDW implementations to date.

Additional DDL

There are a few additional design decisions to make in defining the data warehouse tables. There are typically far fewer indexes on an MPP system because they are not needed. Do use clustered indexes where it makes sense. In most cases, this means creating a clustered index on the surrogate key of the dimension tables, and on the same column used for partitioning the fact tables. Use non-clustered indexes with care. In many cases, they are not needed because of the parallel processing speed, and they add maintenance, slow the load process, and take up space.

Fact tables may be partitioned for the same reasons you would partition on an SMP system, such as rolling window management or load isolation that uses a SWITCH operation. Partitioning is conceptually simpler in SQL Server PDW because it is fully specified as part of the table creation DDL rather than through a separate partition function and scheme.

Create an ETL System to Load the Target Model

Once you have tables defined in SQL Server PDW, the next step is to load data into them. The initial data transfers will most likely use scripts to bulk copy the existing data warehouse history into the SQL Server PDW. Moving forward, if you were using SQL Server Integration Services, your ETL system should function essentially the same with SQL Server PDW as it did with your prior data warehouse. For example, SQL Server PDW has its own source and destination connections you will use in your Integration Services packages. However, there are a few product differences that will impact your ETL system.

Surrogate Key Assignment

The IDENTITY property of an integer field is not supported in SQL Server PDW. This makes sense when you realize rows in a distributed table will be inserted across many separate nodes. The cost of keeping track of incremental identity assignments across multiple nodes in a parallel process would dramatically slow any insert process. If you were using the IDENTITY property to assign surrogate keys to your dimensions, you will need to manage this either in your ETL process by keeping surrogate key values in a table and assigning them incrementally, or in the INSERT statement by using the ROW NUMBER ACROSS function.

Cached Lookups Only

If you use Integration Services Lookup transformations in your existing ETL packages, make sure you select Full cache in the Cache mode section when querying SQL Server PDW, which pre-populates the lookup cache. Using the Lookup transformation to perform a non-cached SELECT operation against incoming Integration Services pipeline rows is inefficient with SQL Server PDW.

The Landing Zone

The SQL Server PDW system has a separate staging server as part of the control rack called the Landing Zone. Incoming data from the Integration Services connections or the SQL Server PDW bulk loader (DWLoader.exe) flow through the Landing Zone prior to being distributed to the compute nodes for permanent storage. The Landing Zone quickly reads incoming rows from files or Integration Services and sends them off to compute nodes in a round-robin fashion using a module called the Data Movement System (DMS) which, not surprisingly, handles data movement around the system. On each compute node, a DMS instance will hash the rows and send them back out to the DMS instance of the node to which they map. This receiver DMS inserts the row into a staging table where any sorting and indexing takes place. The final step uses SELECT INTO to copy the data from the staging table to the target table. All this happens behind the scenes and is managed by the system.

This whole flow keeps data loading in a highly parallel fashion and minimizes any processing work actually performed on the Landing Zone.

One benefit of parallel processing is the load process can run while users are querying the data. The loader processes get lower priority, so they have little impact on user queries. This means you can process yesterday’s load without having to limit user access. It also means you could do near-real time data loads to give access to current data where it’s needed.

Transact-SQL Compatibility

SQL Server PDW has its own variant of SQL with extensions to support parallel processing. Some functions in the SMP SQL Server product have not been implemented in SQL Server PDW. Some of these were omitted because they are functions that do not translate well into a parallel environment. For example, the IDENTITY property is not supported as described in the ETL section.

Transact-SQL compatibility with SQL Server SMP is not yet fully complete, and Microsoft continues to add functionality through frequent updates. You will want to test any existing scripts or stored procedures that are part of your current operations against the latest functionality provided by SQL Server PDW.

System Management and Tuning

SQL Server PDW has its own Central Administration console that provides easy management and monitoring of the system. It uses a product similar to SQL Server Management Studio that is aware of the multi-node nature of the system and monitors sessions, queries, loads, backups, node activity, and alerts and errors.

From a tuning perspective, it’s best to take a simple approach on SQL Server PDW, starting with minimal indexes as described in the physical design section and testing performance with a representative set of user queries once you get the data loaded. If it works, no problem. If not, you can use the Central Administration console to inspect individual query plans to see where the bottlenecks are. For example, your fact table distribution may be skewed, so most of the processing is on a single node. In this case, you can try a different distribution column using the CREATE TABLE AS SELECT statement as described earlier. You may need indexes for specific types of queries. For example, one customer needed to query individual phone numbers from the Customer dimension for some of their lookup reports. A non-clustered index on phone number did the trick. This is something you would typically consider for queries that are used often and have a large impact on the user community.

Additional Opportunities

There are several additional roles and requirements Parallel Data Warehouse can take on beyond hosting the enterprise data warehouse. From an enterprise information perspective, SQL Server PDW can integrate with existing systems by serving as the system of record for analytic data and providing that data to downstream bulk consumers. From an ETL processing perspective, SQL Server PDW can act as a large-scale ETL engine to manage the bulk transformation of big data sets. SQL Server PDW can also support near real-time data warehousing, which is critical for certain analytics.

Integration with Existing Systems

There are many situations where the enterprise data warehouse needs to feed large data sets to downstream systems. In many cases, these are extensions of the DW/BI system in the form of data marts which can be fed from SQL Server PDW in a hub-and-spoke fashion. The definition of data mart is quite fluid; it often describes a component that exists for historical and/or political reasons and adds significant work without adding much value. Data marts and other downstream data consumers can also include purpose-built architectural components. For example, it may make sense to create a subset of enterprise data on a separate server to allow integration with business unit or divisional data. We’ve also seen large chunks of data exported from the EDW to support research or data mining on a dedicated server. Operational systems such as a sales force automation system, or customer relationship management system, often import large subsets of the EDW to provide context to their processes. We’ve also seen subsets created for business-specific reasons. For example, one company wanted to provide sales data to their customers, but decided to create a separate data mart for each customer for security reasons.

If you need to integrate with existing systems, SQL Server PDW can help. Remote Table Copy is a high-speed table copying function that can transfer tables from the SQL Server PDW to SQL Server running on a locally connected SMP server. Data transfer rates can be as fast as 400 GB per hour. Once the data is in the target SQL Server machine, you would complete the ETL process to properly integrate it into the database with appropriate indexes, partitioning, and any other required constraints.

An Opportunity for Improvement

If you have downstream data marts that were created for historical performance and/or political reasons, and which no longer serve a true business need, we encourage you to examine them carefully. This multi-layered, multi-model approach adds significant work, time, redundancy, and cost to the enterprise DW/BI system implementation. Implementing a SQL Server PDW system offers a chance to re-architect these leftover appendages into a more efficient and effective enterprise information environment.

This platform improvement strategy seeks to replace the existing DW/BI system by unplugging the existing data marts and redirecting or rewriting BI queries and reports to pull directly from the SQL Server PDW. This approach is usually more disruptive and requires more effort, but ultimately it leads to a simpler, more robust, more responsive enterprise information resource. Simply integrating SQL Server PDW into the existing environment sounds appealing because it is low impact in the short term. However, in the long term, you may be perpetuating systems that are inefficient, confusing, and costly.

SQL Server PDW as the Transformation Engine

Organizations dealing with particularly large data sets and operating with narrow load windows may not have time to use a separate ETL system to process the data before loading it into the SQL Server PDW. In these cases, SQL Server PDW can serve as a large-scale transformation engine as part of an overall EDW architecture. This approach generally involves loading the data directly into tables in the SQL Server PDW database, and then performing ETL lookups as INSERT-SELECT operations joining staging tables to dimension tables to lookup surrogate keys in bulk. This approach applies the full power of the parallel environment to the core ETL processes.

Real Time Options

While most of the analytic data in the data warehouse does not need to be loaded on a less-than-24-hour basis, some business opportunities require more frequent data loads. SQL Server PDW’s parallel load process supports “near real time loading” under the Read Uncommitted isolation level (Dirty Reads). Loads can be run while users query tables and these data loads have a low impact on the overall performance of concurrently-running queries.


SQL Server Parallel Data Warehouse offers a viable platform for supporting large-scale data warehouses into the hundreds of terabytes. The appliance nature of the system makes it relatively easy to configure, install, tune, manage, and expand. SQL Server PDW provides parallel processing of queries against dimensional models on atomic data to address the Kimball approach’s goals of query performance, usability, and flexibility on an enterprise information resource.

For more information: Parallel Data Warehouse on SQL Server Web Site Fast Track Data Warehouse on SQL Server Web site HP Business Data Warehouse Appliance on SQL Server Web site HP Business Decision Appliance at SQL Server Web site SQL Server Web site SQL Server TechCenter SQL Server DevCenter