Archive for April, 2012

Enterprise Data Warehouse Architecture Options

April 25, 2012

The text of this post comes from the Microsoft Technical Article “Implementing a Microsoft SQL Server Parallel Data Warehouse Using the Kimball Approach,” which I thought was so important that I am reproducing it here on my blog.

Enterprise Data Warehouse Architecture Options

Of course, your architecture depends on your business requirements along with technical, historical, and political factors. While business requirements are usually business-specific, we can start with a list of broad requirements, or architecture goals, that most organizations would support:

1. Excellent query performance for users

2. High ease of use

3. Flexibility

4. Enterprise-wide usage and value

5. Good maintainability

From these broad goals, we can derive the primary components of an enterprise data warehouse architecture as follows:

Performance

Excellent query performance is a given requirement. What counts as excellent depends on your user expectations; obviously, response times in seconds are desirable when possible, though in some cases response times measured in minutes or even hours may be considered excellent, given the data volumes and query complexities. Just to be clear, excellent performance should be accomplished in a fashion that is entirely transparent to the user. There should be no need for the user to learn which aggregates to use, or which data mart has the required data, or how to add an optimizer hint.

Ease of Use

The DW/BI industry generally agrees that the dimensional model is the easiest to work with. Ease of use from a business user perspective is ultimately determined by the BI tools that front-end the data warehouse database. Developers, or any users, who are writing reports and applications that directly access the data warehouse database generally end up writing their access code in SQL. Ease of use from a developer’s perspective is driven mostly by the physical database model.

It is much easier to present a dimensional model in the BI tool metadata layer if the underlying atomic data model is already dimensional.

Flexibility

The first determinant of flexibility stems from the grain of the fact tables you create. If your fact tables are captured at the lowest level of detail available, known as the atomic level, you can always aggregate the data up to any attribute of any dimension. Therefore, your DW/BI system must capture the atomic detail to provide maximum flexibility.

Conformed dimensions are also a contributor to flexibility. They allow users to query data from separate business processes such as sales and inventory, and correctly combine the results on shared dimension attributes, such as product or region. In effect, they allow users to compare apples to apples across the enterprise.
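As a concrete illustration, here is a hedged sketch of what such a drill-across query might look like in SQL. The table and column names (FactSales, FactInventory, DimProduct, ProductCategory) are hypothetical, not taken from the paper; the point is the pattern of summarizing each business process separately and then combining the results on the shared, conformed attribute.

    -- Hypothetical drill-across query combining two business processes
    -- on a conformed Product dimension attribute.
    SELECT s.ProductCategory,
           s.TotalSalesAmount,
           i.TotalQuantityOnHand
    FROM
        (SELECT d.ProductCategory, SUM(f.SalesAmount) AS TotalSalesAmount
         FROM FactSales f
         JOIN DimProduct d ON f.ProductKey = d.ProductKey
         GROUP BY d.ProductCategory) AS s
    JOIN
        (SELECT d.ProductCategory, SUM(f.QuantityOnHand) AS TotalQuantityOnHand
         FROM FactInventory f
         JOIN DimProduct d ON f.ProductKey = d.ProductKey
         GROUP BY d.ProductCategory) AS i
        ON s.ProductCategory = i.ProductCategory;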

Note that a correctly defined dimensional model has the exact same flexibility from an analytic query perspective as a normalized model; they are relationally equivalent. A query to sum up Sales by Region will give the exact same answer in either model.

Enterprise Resource

A true enterprise information resource has three main components: all data is available to all users, data is aligned across disparate business processes, and there is one analytic system of record for each data element. Let’s examine each of these statements in turn.

All data must be available to any analyst who might need it because all data is useful to everyone in the organization. The analyst in Logistics needs to know sales by geography and distribution center. The analyst in Sales needs to know sales by customer and region. The analyst in Marketing needs to know sales by product. These are all different summary queries on the same atomic sales fact data. Do not let these departmental differences lead you to think departmental data marts might be a good solution. As soon as you limit Marketing to product summaries, they will insist on customer detail to support a customer segmentation analysis. Every analyst ends up needing access to all the data at the atomic level at some point.

Data must be aligned across business processes because that’s what allows users to combine data from multiple, disparate sources across the enterprise in a fast and correct manner. The data in the enterprise data warehouse needs to be integrated via the enterprise set of conformed dimensions identified in the bus matrix. Conformed dimensions and the work that goes into creating and maintaining them is a major component of this enterprise resource; conformed dimensions are the struts that hold the enterprise data framework together.

A single source reduces the confusion and wasted time that results from having multiple data marts with overlapping data content. Having a single analytic system of record may involve multiple physical copies for performance reasons, but this is a compromise. If multiple copies of the data are needed, these copies must be built from a single, central data warehouse database. While there may be transformations for analytic reasons, if the same tables and attributes exist in multiple places, they must be presented with the same names and definitions to avoid error and confusion.

The large-scale server power of a product like SQL Server PDW allows you to provide a true enterprise information resource: a single version of the truth without the extra time, resources, and maintenance required to copy data out to multiple data marts.

Maintainability

The simpler your architecture, the easier it will be to operate and maintain. A single, high-performance database with atomic-level detail and fast summarization based on a dimensional model is the simplest way to meet the broad enterprise requirements.

Architectural Compromises

You may need to adjust your ideal DW/BI system architecture if it cannot meet the enterprise requirements. Performance is the most common area requiring compromise; if the ideal architecture is not working, it’s not so ideal. Before throwing in the towel on performance, it is important to make sure your ideal architecture is properly tuned. If it still doesn’t work, the most common compromise is data distribution.

Performance Tuning

Performance is platform-dependent. Indexes and aggregates are the two standard performance tools in the DW/BI system, and these vary widely across database product and platform. For example, in an SMP environment, it can take a long time to run a query that asks for total sales for the last five years. In this case, it makes sense to create aggregated tables once during the ETL process that can be used over and over to answer summary level queries. (Note that these aggregate tables need to be transparent to the user to maintain ease of use.)
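For example, a summary table like the following might be built once per ETL run so that year-level questions never have to scan the atomic fact table. This is only a sketch; the table and column names are assumptions, not the paper's example.

    -- Illustrative aggregate built during ETL (hypothetical names).
    SELECT d.CalendarYear,
           p.CategoryName,
           SUM(f.SalesAmount)   AS SalesAmount,
           SUM(f.SalesQuantity) AS SalesQuantity
    INTO   AggSalesYearCategory
    FROM   FactSales f
           JOIN DimDate    d ON f.DateKey    = d.DateKey
           JOIN DimProduct p ON f.ProductKey = p.ProductKey
    GROUP BY d.CalendarYear, p.CategoryName;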

However, the MPP environment offers a third performance tool: parallel processing. Distributing query tasks across multiple nodes may allow summary level queries to be answered on the fly. This greatly simplifies the design, tuning, and maintenance of indexes and aggregates. You can generally rely on the brute force power of the underlying parallel processing architecture for excellent query performance in the MPP environment.

Distributed Processing

If performance tuning or parallel processing isn’t enough, you may have to create separate subsets of the data warehouse and host them on downstream servers. These data marts may be departmental in focus; data is often limited to a few subject areas and summarized. (If the data marts contained atomic-level data from all business processes, you would be back where you started with the enterprise data warehouse.) From a performance point of view, the idea is to offload a subset of users and queries to a dedicated platform. This is a crude form of distributed processing, and is probably less effective than simply adding another rack to the SQL Server PDW machine where it could be used by the entire organization when needed.

There are times when this distributed strategy makes sense. Certain data may be useful or interesting only to a small analytic community. Other data may be sensitive and require strict physical access limitations. In some cases, the desire for a separate server is organizational; a certain department may insist on having its data on its own server. As we will describe in the implementation section, you may have existing data marts with extensive reports and applications built on them. In this case, it’s much easier to initially populate these downstream marts from the SQL Server PDW rather than rewrite the reports and applications to work directly from the SQL Server PDW.

In these cases, the SQL Server PDW can act as the central source of the distributed data warehouse. SQL Server PDW has a Remote Table Copy feature that will propagate tables to these downstream SQL Server systems at high speeds. The target systems need to be physically located close enough to the SQL Server PDW so they can connect to the InfiniBand network, since this is part of the speed component. If the downstream systems are designed based on Microsoft’s Fast Track architecture, the data transfer rates can be significant. The downstream systems can also be any data mart running SQL Server 2008 or above. This includes the new HP Business Data Warehouse appliance, optimized for SQL Server 2008 R2, and the HP Business Decision Appliance, a BI appliance also optimized for SQL Server 2008 R2.

Extended Analytic Functionality

There are several ways to extend the functionality of the core data warehouse within the SQL Server platform. SQL Server Analysis Services online analytic processing (OLAP) brings more advanced analytic functionality and improved performance for complex queries. Analysis Services data mining brings predictive analytics that can leverage the ability of the parallel processing server to draw out valuable patterns and relationships from vast quantities of transactions.

Analytic Marts

Most organizations that have been working with DW/BI systems for several years have advanced beyond simple reporting. They are building complex analytic applications using predictive analytics and multi-faceted dashboards that draw key performance indicators from across the enterprise. It can be difficult to create the queries that populate these advanced BI tools because they need to make multiple passes against multiple fact tables at different levels of summary. In this environment, it is usually easier to create additional data sets that pre-integrate and calculate most of the analytics.

This can be done either as tables in the SQL Server PDW, or as a separate OLAP data mart. These types of calculated data sets in the relational data warehouse are often called snapshot fact tables or accumulating fact tables. Common snapshot tables include inventory balances at a point in time, or month end account balances in financial services.
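As a hedged example of the snapshot pattern, the following sketch appends one month-end balance row per account, rolled up from an atomic transaction fact table; the table and column names are hypothetical.

    -- Build the April 2012 month-end balance snapshot (illustrative only).
    INSERT INTO FactAccountBalanceSnapshot (SnapshotDateKey, AccountKey, MonthEndBalance)
    SELECT 20120430 AS SnapshotDateKey,          -- month-end surrogate date key
           t.AccountKey,
           SUM(t.TransactionAmount) AS MonthEndBalance
    FROM   FactAccountTransaction t
    WHERE  t.DateKey <= 20120430                 -- all activity through month end
    GROUP BY t.AccountKey;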

The OLAP option is particularly compelling because OLAP databases, such as SQL Server Analysis Services, are designed to perform more advanced analytic calculations, and provide a performance boost through the creation and management of aggregates. The language used to access Analysis Services, called multidimensional expressions (MDX), was created to support analytics. It has a built-in understanding of date relationships such as current month, year to date, and prior year. It can also navigate hierarchies, such as moving from district to region to country.

In either case, the SQL Server PDW core data warehouse would serve as the data foundation, and these analytic tables or marts would be built from its cleaned and conformed data store.

In an interesting combination of functionality, Analysis Services can also be used as a query management layer for SQL Server PDW in what is known as relational OLAP (ROLAP) mode. In this mode, Analysis Services retrieves data directly from SQL Server PDW at query time, rather than using a pre-loaded OLAP database within Analysis Services. It also provides full access to the advanced analytic capabilities offered by the MDX language. User queries are submitted to Analysis Services from the BI tool layer, translated into SQL, and submitted to the SQL Server PDW database.

Data Mining

Microsoft’s data mining functionality can drive a range of interesting predictive analytics including forecasting, recommendation engines, and customer segmentation. The data mining component itself is an Analysis Services feature, and runs on an Analysis Services server.

Parallel Data Warehouse can serve as the data source that feeds the data mining engine, thus enabling models based on the vast amounts of transaction level detail stored in SQL Server PDW, in combination with the richly attributed dimensions. Generating the input data sets is often the hard part of data mining because these data sets typically involve multiple full-table scans to identify behaviors and changes in behavior over time.
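To make this concrete, a data mining input set is typically one row per customer with behavioral attributes rolled up from the fact table. The sketch below is illustrative only; the table and column names are assumptions.

    -- One-row-per-customer observation set for a segmentation or forecasting model.
    SELECT c.CustomerKey,
           c.Region,
           COUNT(*)                       AS OrderCount,
           SUM(f.SalesAmount)             AS TotalSpend,
           COUNT(DISTINCT p.CategoryName) AS CategoriesPurchased
    FROM   FactSales f
           JOIN DimCustomer c ON f.CustomerKey = c.CustomerKey
           JOIN DimProduct  p ON f.ProductKey  = p.ProductKey
    GROUP BY c.CustomerKey, c.Region;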

BI Reporting and Applications

One of the original principles of the Kimball approach listed in the first section of this white paper is to provide a complete DW/BI solution. This includes providing user access for ad-hoc exploration and BI reports and applications that deliver value to the business that was identified in the requirements gathering process. Microsoft offers a set of reporting and analysis tools as part of its overall DW/BI product stack, and Parallel Data Warehouse is a fully participating member of this ecosystem. Reporting Services and Report Builder queries and reports, and third-party BI tools, can draw from SQL Server PDW like any other SQL Server database. The same is true for Microsoft Office tools such as Microsoft Excel and Microsoft PowerPivot. All of these user access methods can be hosted in Microsoft SharePoint and delivered in the context of a rich BI portal experience. .NET applications can access SQL Server PDW via ADO.NET drivers, and third-party tools can communicate with SQL Server PDW using OLE DB and ODBC. All of these drivers accompany the SQL Server PDW product.

Enterprise Data Warehouse Architecture Summary

In summary, the above goals lead us to the following components of the ideal enterprise data warehouse architecture:

Component | Goals Addressed
Atomic data | Flexibility, enterprise resource
Single data store | Enterprise resource, maintainability
Parallel processing and/or aggregates | Performance
Dimensional model | Ease of use for all user communities
Conformed dimensions | Enterprise resource, integration
Attribute change tracking | Enterprise resource, ease of use, accurate history

Table 1: Architecture components and goals

I include attribute change tracking even though it is more a function of the ETL process because it is mandatory from a business perspective, and its ease of implementation is a function of the underlying data model. Therefore, we list it as part of the core data warehouse architecture. A graphical model of this architecture can be depicted as follows:

clip_image002

Figure 8 – High-level enterprise data warehouse in the DW/BI system architecture

Parallel Data Warehouse occupies center stage in this architecture. In many cases it can provide a no-compromise solution, with a single set of atomic-level data stored in dimensional models, using parallel processing to provide performance, and organized as an enterprise resource based on the bus matrix and conformed dimensions. SQL Server PDW can also take on many of the hard core ETL processes if need be, a function we will discuss in the implementation section coming up.


Massively Parallel Processing and the Parallel Data Warehouse

April 15, 2012

The text of this post comes from the Microsoft Technical Article “Implementing a Microsoft SQL Server Parallel Data Warehouse Using the Kimball Approach,” which I thought was so important that I am reproducing it here on my blog.

Data growth is related to Moore’s law. As computers get faster and more powerful, we are using them to process more data in more complex applications. Traditional sources such as ERP systems are processing more transactions as our organizations grow. Newer sources such as web browsing activity, mobile devices, and social media are creating data sets that are orders of magnitude larger than what we’ve dealt with in the past.

We call these large scale data sets “big data.” Large organizations have always had to deal with big data, but the boundaries that define big data have expanded with the raw power of computers. The quantity of data you have to deal with only becomes a problem when it becomes difficult to work within your existing systems environment. For some organizations, this could be as small as 100s of gigabytes. For larger, more experienced organizations, the cross-over point is more in the tens of terabytes.

The term big data is a major buzzword in IT today precisely because big data holds the potential for big value. However, your ability to extract value from large data sets in your transaction systems and social media interactions is predicated on your ability to actually manage the data. Companies such as Google and Yahoo! have helped pioneer new technologies such as MapReduce and Hadoop to help deal with the massive quantities of unstructured data they collect. At the same time, mainstream technologies for dealing with more structured big data sets, such as massively parallel processing (MPP) systems and column store databases, are experiencing significant growth.

The common big data challenge in most organizations is to figure out how to make the large-scale volume of transactions in their corporate systems available and usable in an analytic environment. Dealing with newer sources of data, such as social networking, is important, but it is not the number-one problem in most companies today.

In this section we begin with a brief exploration of the strengths and limits of single-server systems when it comes to big data. We then dig into the MPP approach and explore the Parallel Data Warehouse system architecture.

The Strengths and Limits of Single-Server Systems

Most of the “servers” you work with are single servers with shared resources. Each CPU core can work with any section of memory or disk, and all memory and disk is available to each core. This all-in-one architecture is known as symmetric multiprocessing (SMP). As shown in Figure 6, the CPU cores connect to the memory and disk via a system bus. This shared connection supports high-speed inter-process communication, memory, and data sharing, and is easier to manage because it is a single physical unit.

clip_image002

Figure 6: SMP architecture

However, SMP systems have a limit when it comes to big data; they can scale only to the point where their system bus gets overloaded. Too many CPUs making simultaneous requests for data on the system bus creates a traffic jam. As usage grows, the system bus becomes a bottleneck and limits the total amount of processing that can take place on an SMP system. There are ways to mitigate this contention by creating localized subsets of CPUs and memory, but this only extends the limit.

The Massively Parallel Processing Alternative

One time-tested strategy for handling large amounts of data is to avoid this bottleneck by distributing data and processing across many servers, or nodes, each of which has its own memory and disk so they can share the workload. This approach, known as massively parallel processing (MPP), has been around for several decades and is the basis for many of the largest super computers in existence today. Due to their high cost and complexity, MPP systems have historically been used by the largest companies and governmental organizations.

This massively parallel architecture lies at the heart of Microsoft’s Parallel Data Warehouse system. Parallel Data Warehouse is a Microsoft SQL Server product designed to scale data warehouses from tens to hundreds of terabytes of data. It delivers the MPP architecture using an “appliance” model, providing preconfigured, optimized commodity hardware and software and a single point of support.

The SQL Server PDW Architecture

Figure 7 shows an abbreviated architecture for a SQL Server PDW MPP system. A user query request comes into the control node, which breaks the SQL into multiple parallel operations and distributes them out to the compute nodes where the actual data resides. A special module called Data Movement Services coordinates any data movement that needs to take place between nodes and handles any functions that need to be resolved centrally. When the compute nodes are finished, the control node handles post-processing and re-integration of result sets for delivery back to the users.

clip_image004

Figure 7: The SQL Server PDW massively parallel architecture

Each compute node is a separate SMP server running SQL Server. Compute nodes in current SQL Server PDW configurations ship with dual hex-core CPUs, 96 GB of memory, and local tempdb workspace. They are connected using a dual InfiniBand network to support high-speed node-to-node data sharing for cross-node computations known as data shuffling. This network also connects the compute nodes to the control and administrative nodes to support high-speed data loading, extraction of query results, backup, and other administrative functions.

The disk subsystems for the compute nodes are managed by a storage area network (SAN) component with high-speed dual Fibre Channel connectivity. This data bus supports high-speed I/O, and failover redundancy. The compute nodes and disk drives are physically housed in the same rack, called a data rack.

There are three types of administrative service nodes that share the control rack with the control node. These include:

  • Management Nodes, which provide the DBA or data center operations interface to access and manage the overall solution and support the system’s internal network.
  • A Landing Zone Node, where cleansed data is staged and prepped before loading into the data warehouse.
  • A Backup Node and the appropriate associated storage. The Backup Node provides high-speed integrated backup at the database level. This is tied to the organization’s overall backup strategy and systems.

The SQL Server PDW is a large-scale enterprise class system and has built-in redundancies:

  • Primary data is stored as RAID1.
  • Hardware redundancy includes redundant power supplies, spare disks, compute nodes, control nodes, and management servers, mostly designed to support automatic failover.

A Scalable Appliance

SQL Server Parallel Data Warehouse is sold as a data warehouse appliance: a set of commodity hardware and Microsoft software pre-configured to meet the needs of a range of data sizes and performance. This makes sense because configuring the individual components, network and connectivity throughput, and disk subsystem performance is a significant effort, more than most IT shops would care to take on. With the appliance, all components and network connectivity are carefully designed, configured, and balanced for optimal performance, and necessary software on all nodes is pre-installed and pre-configured.

The MPP architecture can be scaled up by adding racks of compute nodes. The base system starts with one rack. On an HP appliance, for example, a full rack holds 10 nodes, and additional 10-node racks can be added up to a total of 40 nodes. The 40-node limit is due more to the definition of the product than to an inherent limit of the system design. Expanding a SQL Server PDW is fairly straightforward thanks to its backup and restore facility: back up the database, add the new rack, reconfigure, and restore. The database restoration automatically redistributes the data across all nodes.

Microsoft is working with several hardware vendors to offer SQL Server PDW systems. HP is the first to market with a publicly available product at this writing.

SQL Server PDW Data Management

The physical architecture of distributed nodes with local data means the large data sets have to be distributed across the nodes in a way that will support both data load and query processes. The goal is to get each node and CPU core working as hard as possible on every query. In the data warehouse, fact tables are distributed evenly across nodes so each node will have work to do.

Efficient processing on nodes results when local fact table subsets can join to local dimension tables, which can be achieved if dimension tables are replicated to all nodes. SQL Server PDW allows you to specify distributed or replicated tables at time of creation, and then transparently manages placing the appropriate data on the appropriate compute nodes at load time.
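The choice is made in the table definition. The sketch below shows the general shape of the PDW DDL, with a fact table hash-distributed on a key and a dimension replicated to every compute node; treat the column lists as illustrative rather than as the paper's example.

    -- Fact table distributed across compute nodes by hashing on ProductKey.
    CREATE TABLE FactSales
    (
        DateKey      int   NOT NULL,
        ProductKey   int   NOT NULL,
        StoreKey     int   NOT NULL,
        SalesAmount  money NOT NULL
    )
    WITH ( DISTRIBUTION = HASH (ProductKey) );

    -- Dimension table replicated in full to every compute node for local joins.
    CREATE TABLE DimProduct
    (
        ProductKey   int          NOT NULL,
        ProductName  varchar(100) NOT NULL,
        CategoryName varchar(50)  NOT NULL
    )
    WITH ( DISTRIBUTION = REPLICATE );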

The Kimball Approach on SQL Server PDW

How does SQL Server PDW fit with the Kimball approach? When you compare it with our principles, it fits quite well. It provides good usability and flexibility because in most cases, you can build a set of atomic-level dimensional models with conformed dimensions. It performs well because the workload is distributed across all the compute nodes rather than bottlenecked on a single server. SQL Server PDW gets an additional performance boost at the node level because SQL Server has functions to support dimensional models, including star-join optimization. And SQL Server PDW’s support for replicated dimension tables allows many common query scenarios to be satisfied without more expensive data-shuffling operations.

The Kimball Approach

April 8, 2012

The text of this post comes from the Microsoft Technical Article “Implementing a Microsoft SQL Server Parallel Data Warehouse Using the Kimball Approach,” which I thought was so important that I am reproducing it here on my blog.

The Kimball Approach

There are a lot of misconceptions about dimensional modeling and the Kimball approach to building a DW/BI system. It’s worth reading this section even if you are already familiar with the Kimball approach; you might be surprised at what you learn. The Kimball approach to creating an enterprise data warehouse has several core principles:

1. Follow a proven methodology; I recommend the Kimball Lifecycle.

2. Understand business requirements so you can engage the business, prioritize your efforts, and deliver business value.

3. Design the data warehouse data sets for flexibility, usability, and performance.

4. Build and deliver quick, business process-based increments within an enterprise data framework known as the data warehouse bus matrix.

5. Design and build a DW/BI system architecture based on your business requirements, data volumes, and IT systems environment.

6. Build out the extract, transformation, and loading (ETL) system with standard components to deal with common design patterns found in the analytic data environment.

7. Provide the complete solution, including reports, query tools, applications, portals, documentation, training, and support.

All of these principles are explored at length in The Data Warehouse Lifecycle Toolkit book, Second Edition (Wiley, 2008). We will examine a few of them here in detail.

Follow a Proven Methodology: Lifecycle Steps and Tracks

The Kimball Lifecycle is a detailed methodology for designing, developing, and deploying data warehouse/business intelligence systems, as described in The Data Warehouse Lifecycle Toolkit, Second Edition. The diagram in Figure 1 summarizes the key steps in the Lifecycle.

The Lifecycle is an iterative approach, with each pass delivering a coherent set of data and an initial set of associated reports and applications. Each pass can typically be completed in 6 to 9 months, depending on the data complexity. Building out the full DW/BI system takes multiple iterations, each one loading a new data subject area, which plugs into the overall enterprise data framework called the bus matrix.

clip_image002

Figure 1: The Kimball DW/BI Lifecycle

The Kimball approach starts with understanding business requirements and determining how best to add value to the organization. The organization must agree on what the value of this data is before deciding to build a data warehouse to hold it. For example, capturing web browsing activity may allow you to gain deep insight into your customers’ behaviors and preferences, opening up new ways to better meet their needs. If you clearly identify and deliver business value, the resulting impact on the business should easily justify your SQL Server PDW investment.

The ideal starting point for most organizations is to perform an initial set of interviews to gather and prioritize enterprise-wide high level business requirements for information. The result is a priority ordered list of business processes that generate data, along with high value analytic opportunities supported by that data.

Once the list of business processes and associated opportunities has been identified and prioritized, the next step is to take the highest priority business process and gather detailed business requirements related to it. This second pass at requirements is much more focused on understanding the specifics around the required data source, including attributes, definitions, business rules, data quality, and the range of analytics and applications that will be built on top of this data set.

Once these detailed requirements are in place, the Lifecycle moves into the implementation phase beginning with design steps across three different tracks. The top track in Figure 1 is the technology track. The main goal here is to identify the functionality and associated tools needed to meet the identified business requirements.

The middle track in Figure 1 is the data track. The initial step is to define the logical data model needed to support the analytic requirements. In the Kimball approach, this is a dimensional model. Once the logical model is in place, the team can build the target database in the database environment. The nature of the physical model depends on the target platform. Many database products work best with a physical dimensional model, although a more normalized model may make sense on a few platforms. The last data step is to create the ETL system that will populate the target database as required. The ETL system is a significant effort, often consuming a majority of the initial project resources.

The bottom track in Figure 1 is concerned with the BI applications: the initial set of reports and analyses that will deliver business value to the organization. This track is split into two steps; the first is the design step where a small set of high value applications and reports are identified and specified in detail. The second step is the actual implementation where these applications and reports are built. This step often has to wait until near the end of the ETL development when data is actually available in the database. Note that these reports and analyses only serve as a starting point that helps solve a high-value problem. The dimensional model is not limited in any way to this subset of reports.

Once the three implementation tracks are complete, the Lifecycle comes back together to deploy the query tools, reports, and applications to the user community. This involves extensive communication, training, documentation, and support.

The next Lifecycle iteration usually begins during the deployment of the previous iteration, when the business analysts and designers can gather detailed requirements for the next highest priority business process, create the associated dimensional model, and start the process all over again. The Lifecycle’s incremental approach is a fundamental element that delivers business value in a short timeframe, while building a long-term, enterprise information resource.

The Data Warehouse Bus Matrix

The Enterprise Data Warehouse Bus Matrix is the data framework for the enterprise data warehouse. Figure 2 shows a simplified version of a bus matrix for a retail organization with a customer affinity program.

clip_image004

Figure 2: Example bus matrix

The row headers down the left side of the bus matrix define the organization’s primary business processes. A good way to think about these business processes is to think about the value chain of the organization. What are the activities in which your organization engages to provide your customers with the goods or services they need?

The column headers of the bus matrix represent the primary objects that participate in those business processes. Typical examples include customer, account, product, store, employee, patient, and date. These objects are called dimensions, and they must be pre-integrated to work with all the relevant business processes.

This pre-integration is called conforming, and it involves the hard organizational data governance work of deciding the standard names, descriptions, mappings, hierarchies, and business rules that will apply across the DW/BI system. This is essentially what master data management (MDM) is meant to do, and the DW/BI system benefits greatly from a separate MDM effort. In the absence of an MDM system, the DW/BI team must shoulder this dimension conforming effort. Once this definitional work is done, these dimensions become reusable components that can be applied to every associated business process. Most importantly, the conformed dimensions are the necessary framework for integration, where the results from two or more business process can be combined into a single BI deliverable.

Each row on the bus matrix is a business process data set that corresponds to a unit of work for the ETL system developers. Each business process data set needs a dedicated ETL module to extract the transaction facts, associate them with the conformed dimensions, and tie them together into a flexible dimensional model.

Data Model Design

The Kimball approach to data modeling takes a pragmatic look at the underlying database platform and chooses the appropriate physical model based on usability, flexibility, performance, and maintenance on that platform.

What Is a Dimensional Model?

Almost all dimensional models are classic star schemas, as shown in Figure 3. The numeric measurements (“facts”) of a business process are concentrated in the central fact table, and the context of the measurement is represented as a set of denormalized dimension tables, which surround the fact table. The keys that implement the joins between the dimension tables and the fact table should be anonymous integer keys. We call these surrogate keys.

clip_image006

Figure 3: An orders business process star schema
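A minimal sketch of such a star schema, assuming a generic SQL Server environment and hypothetical table names, looks like this: a single fact table carrying integer surrogate keys and numeric measures, surrounded by flat dimension tables.

    -- Denormalized dimension tables keyed on integer surrogate keys.
    CREATE TABLE DimDate     (DateKey     int PRIMARY KEY, CalendarDate date, CalendarYear smallint);
    CREATE TABLE DimProduct  (ProductKey  int PRIMARY KEY, ProductName varchar(100), CategoryName varchar(50));
    CREATE TABLE DimCustomer (CustomerKey int PRIMARY KEY, CustomerName varchar(100), Region varchar(50));

    -- Central fact table: one row per order line, all context carried by surrogate keys.
    CREATE TABLE FactOrders
    (
        DateKey       int   NOT NULL,
        ProductKey    int   NOT NULL,
        CustomerKey   int   NOT NULL,
        OrderQuantity int   NOT NULL,
        SalesAmount   money NOT NULL
    );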

Usability

All camps are in agreement that the most user-accessible data model in the data warehouse is the dimensional model. For example, a 2006 study in the journal Decision Support Systems found that dimensional models were significantly easier to understand and remember how to use than other more normalized models.

Flexibility

There is a school of thought that calls for a normalized, third normal form model at the atomic level of the data warehouse. Its proponents argue that this gives the most flexibility. While this may be true from a transaction processing perspective, it is important to remember we are building an analytic database. Most transaction systems are based on third normal form data models with the atomic level detail transactions captured in normalized fact tables. The third normal form school keeps this model as the data foundation of the enterprise data warehouse. This then requires additional transformation steps to get the data into its presentation form for user consumption, often involving another physical layer of departmental data marts.

There is a commonly held belief that dimensional models are based on a set of reports or analyses and are therefore less flexible. This is false and has never been part of the Kimball approach. The normalized model and a properly designed, atomic-level dimensional model are relationally equivalent. They can answer the exact same set of analytic queries.

Flexibility comes in part from the level of detail captured in the model. Another common misconception is that dimensional models are summary only. In fact, a strong design goal in the dimensional model is to always capture data at the lowest level of detail available, called the atomic level. The presence of atomic-level data allows users to roll the data up to any level of summarization required. Any aggregation prior to inclusion in the enterprise data warehouse means some detail will not be available, thus reducing flexibility.

Performance and Maintenance

The dimensional model keeps the atomic-level fact tables in their normalized form (by normalizing the dimension tables out of the fact table) for smaller size and better performance, but keeps each dimension in denormalized (flat) form. Note that such flat dimension tables contain exactly the same information as fully normalized (snowflaked) dimension tables but do not implement the separate tables and extra keys required to complete the normalization process. The dimensional model simplifies the physical design by dramatically reducing the number of tables and joins required for a given analytic query, which improves performance on most market leading database products running on single servers. In fact, all the major SMP database products, including SQL Server 2008, have built-in performance optimizations that leverage the dimensional model (search the web for “star join optimization” for more information on this). Using a dimensional model at the physical level is also easier to manage than a normalized model. Because it is already dimensional there is no need for a translation layer or separate data marts to make it user-presentable.

As we’ll explore in the architecture section, parallel processing platforms such as the Parallel Data Warehouse work a bit differently. Data is distributed across many independent query nodes across the server. Each of these nodes may hold a subset of the fact data, which may need to join to all of the dimensions. On SQL Server PDW, the standard approach is to replicate all of the dimensions out to each node so the node can perform local joins, thus preserving the physical dimensional model. However, in rare cases it may make sense to normalize and/or distribute very large dimensions on SQL Server PDW to save time in the replication process and to save space on each compute node.

Let’s be clear on what we are saying. In the best of all possible worlds, you would load the atomic-level detail into the data warehouse for flexibility. The user data model would be dimensional for usability, and the physical data model would also be dimensional for simplicity and performance. Our experience in the real world backs this up. We have found the dimensional model to be the most usable, flexible, best performing, and most maintainable data structure for analytic purposes on most platforms. We only compromise this design when the platform requires it for performance and the users can be transparently shielded from any increase in complexity.

Dimensions and Facts

As we described in the bus matrix discussion, dimensions are the objects that participate in an organization’s business processes. We generally model these as one table per object. Building the dimension in the ETL system involves joining the various normalized description and hierarchy tables that populate the dimension attributes and writing the results into a single table.

Figure 4 shows an example of typical product-related attributes in a normalized model.

clip_image008

Figure 4: The normalized source tables for Product attributes

The base table is called Product and it connects to the Sales fact table at the individual product key level. From this set of tables, it’s possible to create analytic calculations such as SUM([Sales $ Amount]) by CategoryName, or by ProductColorGroupDescr, or any other attribute in any of the normalized tables that describe the product. It’s possible, but it is not easy.

In the dimensional version of the Product table, we would join the product-related tables from Figure 4 once, during the ETL process, to produce a single Product dimension table. Figure 5 shows the resulting Product dimension based on the tables and attributes in Figure 4.

clip_image010

Figure 5: The denormalized Product dimension
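A hedged sketch of that one-time ETL join follows. The normalized source table and column names are assumptions standing in for the tables in Figure 4; the result is one flat row per product.

    -- Flatten the normalized product tables into a single Product dimension.
    SELECT p.ProductKey,
           p.ProductName,
           sc.SubcategoryName,
           c.CategoryName,
           cg.ProductColorGroupDescr
    INTO   DimProduct
    FROM   Product p
           JOIN ProductSubcategory sc ON p.SubcategoryID = sc.SubcategoryID
           JOIN ProductCategory    c  ON sc.CategoryID   = c.CategoryID
           JOIN ProductColorGroup  cg ON p.ColorGroupID  = cg.ColorGroupID;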

Obviously, it is still possible to calculate SUM([Sales $ Amount]) by CategoryName, or by ProductColorGroupDescr, or any other attribute in the Product dimension, which includes all the product related attributes from the normalized model. Note that the two models are equivalent from an analytic perspective.

Usability is significantly improved for BI application developers and ad-hoc users with the dimensional version. In this simple example, the ten tables that contain the 12 product attributes are combined into a single table. This 10 to 1 reduction in the number of tables the user (and optimizer) must deal with makes a big difference in usability and performance. When you apply this across the 15 or 20 dimensions you might typically find associated with a Sales business process, the benefits are enormous.

The main difference between the two approaches is that the normalized version is easier to build if the source system is already normalized; but the dimensional version is easier to use and will generally perform better for analytic queries.

Tracking Attribute Variations over Time

Every analytic data store must provide a means to accurately track dimension attributes as they change over time. Tracking attribute changes allows the business to report on the state of the world as it was at any point in time, answering questions like “What were sales by territory as of December 31st last year?” It also supports accurate causal analysis by associating the attribute values that were in effect when an event occurred with the event itself. For example, what postal code did a customer live in when they bought a certain product two years ago?

The most efficient way to capture these changes, from both an ease-of-use and a performance perspective, is to add a row to the dimension whenever an attribute changes, assigning a new surrogate key and capturing the effective date and end date for each row. These are commonly referred to as Type 2 slowly changing dimensions. You can see these control columns at the bottom of the Product dimension in Figure 5.
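A minimal Type 2 sketch is shown below, assuming a hypothetical staging table, an IDENTITY-style surrogate key on the dimension, and row effective/end date control columns. It expires the current row for each changed product and inserts a new row carrying the changed attribute values.

    -- Step 1: expire the current row when a tracked attribute has changed.
    UPDATE d
    SET    RowEndDate   = '2012-04-24',
           IsCurrentRow = 0
    FROM   DimProduct d
           JOIN StageProduct s ON d.ProductBusinessKey = s.ProductBusinessKey
    WHERE  d.IsCurrentRow = 1
      AND  d.CategoryName <> s.CategoryName;

    -- Step 2: insert a new current row; the surrogate key is generated automatically.
    INSERT INTO DimProduct
           (ProductBusinessKey, ProductName, CategoryName,
            RowStartDate, RowEndDate, IsCurrentRow)
    SELECT s.ProductBusinessKey, s.ProductName, s.CategoryName,
           '2012-04-25', '9999-12-31', 1
    FROM   StageProduct s
           JOIN DimProduct d ON d.ProductBusinessKey = s.ProductBusinessKey
    WHERE  d.RowEndDate = '2012-04-24'            -- the rows expired in step 1
      AND  d.CategoryName <> s.CategoryName;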

While tracking attribute changes over time places a burden on the ETL process, it improves performance for user queries because the joins between the facts and dimensions are simple equijoins on integer keys. This also improves ease of use because the BI semantic layer does not have to handle more complex, multi-column unequal joins in order to retrieve the correct dimension row for any given historical fact event.

Tracking changes over time is a mandatory business requirement, regardless of the underlying data model you use. It is possible to track changes in a normalized model, but the complexity of keeping multiple versions across dozens of tables associated with a single dimension is much greater than dealing with changes in a single, denormalized dimension table.

For a basic description of slowly changing dimensions, search http://www.kimballgroup.com for an article titled Many Alternate Realities (http://www.kimballgroup.com/html/articles_search/articles2000/0002IE.htmll).

For a discussion of more advanced change tracking techniques, see the article titled Slowly Changing Dimensions Are Not Always as Easy as 1, 2, 3 (http://www.kimballgroup.com/html/articles_search/articles%202005/0503IE.html).

For more information on tracking changes in a normalized model, please see Design Tip #90 Slowly Changing Entities at http://www.kimballgroup.com (http://www.kimballgroup.com/html/07dt/KU90SlowlyChangingEntities.pdf).

Performance Depends on the Platform

While the dimensional model provides the best performance and usability in a majority of standard scale DW/BI systems, it is not a one-size-fits-all answer. The underlying physical data structure needed to get the best performance on a given platform is the right choice, as long as it does not compromise the ease-of-use principle.

As we explore the Parallel Data Warehouse system, we will discuss the design and performance tradeoffs, and hopefully end up with the best of all possible worlds.

Parallel Data Warehouse – Slowly Changing Dimension

April 1, 2012

Most of my day-to-day work is currently centered around Microsoft’s Massively Parallel Processing appliance called PDW (Parallel Data Warehouse). Many of the techniques common to SMP (Symmetric Multiprocessing) systems, such as the Kimball-method Slowly Changing Dimension (SCD), are still very important.

Identifying the deltas between an incoming dataset and the destination dimension table can be one of the most resource-intensive, complex, and valuable processes in an SMP or PDW ETL system. Careful planning is necessary to ensure that the ETL process performs well and is scalable. Nowhere is this more important than on a PDW that can process tens to hundreds of terabytes daily!
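As a simple illustration of the delta step, the following hedged sketch compares a staged extract to the current dimension rows and keeps only the new or changed records; the table and column names are hypothetical.

    -- Keep only rows that are new or have a changed tracked attribute.
    SELECT s.ProductBusinessKey,
           s.ProductName,
           s.CategoryName
    FROM   StageProduct s
           LEFT JOIN DimProduct d
                  ON d.ProductBusinessKey = s.ProductBusinessKey
                 AND d.IsCurrentRow = 1
    WHERE  d.ProductBusinessKey IS NULL            -- brand-new product
       OR  d.ProductName  <> s.ProductName         -- or an attribute changed
       OR  d.CategoryName <> s.CategoryName;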

My previous post Building the Enterprise DW/BI System with SQL Server PDW describes the basic PDW components discussed below.

The following gives some real world SCD statistics and lessons learned.

SCD – Replicate vs. Distributed tables

Replicated Table ‘A’ – 28,977,757 rows – 34 min

Distributed Table ‘A’ – 28,977,757 rows – 21 seconds!

The difference between SCD processing on a replicated versus a distributed table is dramatic. I expected that the distributed table might perform slightly faster because the table is distributed across all the compute nodes and therefore maximizes the parallelism of the PDW. However, I didn’t expect such a huge difference. (Note: 34 minutes to process 28 million+ rows using Type 2 SCD is not bad either. Try doing that on your SMP!)

I found the reason for the difference in the query plan. The query plan contains several steps, and the most costly steps are the SCD DML operations (updates that expire rows). Basically, for each update statement against a replicated table, the PDW:

1 – Creates a copy of the Target table /Dimension Table

2 – Performs the update on the copy

3 – Publishes the updated copy to each node

See the example below, which shows part of the SCD logic for a 1.5-million-row table:

SCD update on replicated tables

Conclusion: Distributed tables can have superior DML performance, especially for large tables; however, this SCD performance increase comes at the cost of query performance, which will be slower if the queries do not contain the distribution column. From my experience, I would recommend considering converting a replicated table to a distributed table for SCD DML performance improvement if the table is greater than 5-10 million rows.

Distribute Table on a Key Column

If you choose to make the destination/dimension table a distributed table, it is important to distribute the table on one of the target table’s key columns; otherwise, the SCD DML operations will incur a heavy performance hit from ShuffleMove operations, because the processing nodes must shuffle/move data in order to satisfy the SCD DML queries.

72 million row Distributed Table

Not Distributed on a Key Column – 13 min.

Distributed on a Key Column – 30 seconds!

Example: Not Distributed on a Key Column:

Distributed on non Key Columns

Example: Distributed on a Key Column

Distributed on Key Columns

There is still a small ShuffleMove, but the most expensive ShuffleMoves have been eliminated. It might not always be possible to eliminate all ShuffleMoves.
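For reference, converting a replicated dimension to a distributed one can be done with a CREATE TABLE AS SELECT that hashes on the key column used by the SCD joins; the sketch below uses hypothetical names and is not the exact statement from my environment.

    -- Rebuild the dimension distributed on the business key used in the SCD comparisons,
    -- so the expire/insert statements can run node-locally instead of shuffling rows.
    CREATE TABLE DimProduct_Distributed
    WITH ( DISTRIBUTION = HASH (ProductBusinessKey) )
    AS
    SELECT * FROM DimProduct;    -- the original replicated table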

Conclusion:

Microsoft’s Parallel Data Warehouse offers a huge advantage in Slowly Changing Dimension (SCD) DML processing performance over traditional SMP systems.