L1 L2 and L3 caches

May 19, 2013

 

I recently upgraded the OS on my laptop to windows 8 and there are things that I like and things that I don’t like about it. I think that most of the things that don’t like have more to do with some of the new navigation features and should become second nature over time.

One of my favorite surprises in win8 is the new Task Manager. When I open it up, I am presented with a much cleaner interface that also provides more information than the previous versions of Task Manager. One these new goodies is now you can see the sizes of the L1, L2, and L3 caches:

clip_image001

These CPU caches act like stepping stones for data as it travels from main memory (RAM) to the CPU and the closer the cache is to the CPU the faster the data can be processed by the CPU. For example L1 and L2 caches are orders of magnitude faster than the L3 cache and as you can tell from the above image the closer the cache is to the CPU (L1) the smaller the cache size is.

If you have an application that you want to process faster on the CPU then a good way to do it would be to have the application divide up its instructions and data in batches that would ideally fit in the L1 or L2 caches. Then magically your application would run faster (there is more to it than just that but I am keeping things simple).

Well that is great but what do L1, L2, and L3 caches have to with SQL Server and data warehousing? Well in a word ALOT! The new SQL Server 2012 Columnstore Indexes not only compress and store data (tables and indexes) in a column-wise format but this columnar data is also broken down into “batches” that just so happen to match the sizes of the L1 and L2 caches on the server. This means that all things being equal, the same data set will actually process faster on the CPU using Columnstore Index’s data “batches” than had the same data been process without Columnstore Indexes.

And guess what, you can actually see this happen in SSMS when you run query on a tables that has a Columnstore Index and view the Execution Plan.

Example:

clip_image003

The table dbo.CSI_100m_Fact has a Columnstore Index and at execution time the query uses Actual Execution Mode = BATCH for the Columnstore Index Scan operator which mean that it is doing its L1 and L2 magic behind the scenes and your queries now run much faster!

Note: If you create a columnstore index and your data warehouse query does not use the columnstore index then that mean that your data model is wrong (fact and dimension tables) or your query does not match typical data warehouse workloads. See: Columnstore Indexes on MSDN.

Table Partitioning with Fact Tables

May 5, 2013

 Table partitioning for fact tables is one of the best ways to improve processing for fact data loads during the ETL process and it also can dramatically improve query response time as well. Table partitioning should be used on all but the most trivial fact tables (and even then I would still use it). However, despite its great advantages and the fact that it has been around since SQL Server 2005, I still find it under-utilized.

It compliments Kimball Style Transaction or Snapshot Fact tables very nicely (See Slowly Changing Facts).  In SQL Server 2012 the number of possible partitions per tables has even been increase to 15,000! This great news because that means if you partition your fact table by days that means that the fact table could hold more than 40 years’ worth of data without changing the partition scheme !!! And if the fact table is loaded nightly with a single day’s data then partition switching becomes a breeze.

I won’t go into all the details but partition switching allows inserts and deletes to a partitioned fact tables, no matter how large, to occur almost instantly. Updates can occur orders of magnitude faster as well (but I don’t recommend updating fact tables).

Example:

I am not going to go through all the steps. See Script for details

1.      Create two tables and populate them with data:

1 .     dbo.TablePatitionDemoFact – holds historical fact data

2.      dbo.TablePatitionDemoFact_switch  –  simulates new rows to be loaded into fact table from nightly ETL process

 

2.      selectmax([OrderDateKey])asTablePartitionDemoFact

3.      fromTablePartitionDemoFact

4.      unionselectcount(1)fromTablePartitionDemoFact

5.       

6.       

7.      selectdistinct[OrderDateKey]  asTablePartitionDemoFact_switch

8.      fromTest.dbo.TablePartitionDemoFact_switch

9.      unionselectcount(1)fromTablePartitionDemoFact

 

TablePartitionDemoFact:

      60398 – rows

      20040731 – max transaction date

TablePartitionDemoFact_switch:

     60398 – rows

     20040801 - max transaction date

 

Both tables have about 60,000 rows but the _switch table has rows that need to be inserted in the fact.

2. To switch the rows  by simply using the Alter Table… switch command

ALTERTABLE[dbo].[TablePartitionDemoFact_switch]SWITCHPARTITION 2

        TO[dbo].[TablePartitionDemoFact]  PARTITION (2)

 

Rerun the queries from before and you will see that rows have now been switched to the fact table in < 1 Second!

 

TablePartitionDemoFact:

    120,796 – row count

    20040801

 

TablePartitionDemoFact_switch:

    0 

 

The process can easy be reversed if the data needs to be modified or archive/deleted by simply reversing the tables is the switch command.

 

ALTERTABLE[dbo].[TablePartitionDemoFact]SWITCHPARTITION 2

        TO[dbo].[TablePartitionDemoFact_switch]  PARTITION (2)

 

TablePartitionDemoFact:

   60398 – row count

   20040731

TablePartitionDemoFact_switch:

   60398 – row count

   20040801

 

Note: Both table schemas must be identical. This includes indexes as well! During the switch process no table or indexes updates are required. There is just a simple pointer/metadata operation and the magic happens almost instantly!

 

For the purpose of this demo I didn’t create any indexes just to make things simpler but now you might understand why Columnstore Indexes SQL Server were not updateable! Because the assumption was that any decently sized fact table is already partitioned, so therefore no need to make columnstore indexes updateable.

 

The best practice is to bulk load the data, in parallel if possible, into the switch/stage table, using minimal logged transactions. Create the required indexes and then simply switch the partitions. It is that simple. This process will perform well on fact tables with billions or even trillions rows!

 Script

Bret Ozar has put together a wonderful list of the most import Table Partitioning Resources

I personally like using SQL Server Partition Management tool from Code Plex

Slowly Changing Facts?

April 21, 2013

This post is a follow-up to the previous post Update Fact Tables? Recently I have seen people try to convert transactional fact tables into quasi-slowly changing transactional fact tables. These quasi-transactional fact tables will most likely have something like an EffectiveDate, ExpiryDate and maybe even an IsCurrent indicator columns. Below is a generic representation:

clip_image002

In many cases this done because the measure or fact values change over time and the analyst want not only the most recent values but the historical values as well. While something like the above diagram does address those issues, it does so at a very large cost, which will make the quasi-transactional fact unable to scale into the future.

Unless you are told by the source system which fact rows have changed, all the source rows will have to be compared against the latest row versions in the quasi-Transactional Fact Table. In other words, whether in SSIS (please don’t use SSIS to do this) or on the data base engine, both data sets of fact rows will have to be filter, sorted, joined by some key, and then compared on a column by column basis to find the changes (or you could use Hashbytes, never us Checksum !!!). The new row versions are then inserted and the old row versions have to be updated with new expiry dates.

While this might perform well on a small scale, let’s say with only a few thousand rows, imagine what would happen if the tables were increased to millions, billions, or even trillions of rows. Once you run out of memory things will get very slow very quickly!!

The correct solution is to change the Quasi-Transactional Fact Table into a true Snapshot Fact Table. In my experience too many data warehouse developers don’t understand the difference between these two very important types of fact tables.

If you have not already done so please read or re-read The Data Warehouse Lifecycle Toolkit. It should be required reading for anyone before they sit down and design their data warehouse. I will not rehearse what that book says because it does such a good job, but I will use one very important chart:

image

Notice that both Transactional and Snapshot Fact Tables are insert only tables.

<rant> Accumulating Snapshot Fact Tables are very rare and I have only used them once in my 10 years of experience. In my experience if you are justifying the update of your fact table by calling it an Accumulating Snapshot you are most likely doing something wrong. </rant>

Now with this knowledge we can go back and convert the original quasi-transactional fact table to a true Snapshot Table.

clip_image004

Some might protest that now that all of the source fact rows are now just inserted every day that the fact table will become huge. To that I say great! You want fact tables with lots of rows. The more rows the better! This is because there are very neat ways of managing large tables like Table Partitioning (see: Table Partitioning and Columnstore Indexes).

Update Fact Tables?

April 7, 2013

I have been asked on several occasions if rows/values in a fact table should be updated. The answer to this question in almost all cases is NO! Facts should not be updated. There a several really good reasons why facts values should never be updated.

1. Consistent Reporting: If I run a report today which reports on last week’s values I should get the exact same values if I run the same report tomorrow. The end users will not be happy if their historical numbers keep changing every day that they run the report!

2. Fact Table Processing: Preforming updates on tables can be a very costly operation on the data warehouse server! Fact tables can and should be very large. They can contain 100’s millions, billions, or trillions of rows and they should have one index on each surrogate key column as well (useless you are fortunate enough to have SQL Server 2012 then just create one Columnstore Index which includes all the surrogate keys). All that makes updating anything in a fact table very costly and should be done with great care and thought.

But I also realize that we also live in the real world. And in the real world sometimes fact transaction values change (even though they shouldn’t!!). I won’t name names but there are plenty of source systems which allow fact transaction values to be updated. So what can you do? You need to record those changing facts, but you also want to preserve reporting consistency and you don’t what to exponentially increase the time it takes to process the transactional fact table.

Well there are a few options:

1. Covert the transactional fact table to a snapshot fact table (more on that later)

2. Create the ‘Adjustment’ facts in the ETL process and insert them into the Fact table

The second option works especial well if the source systems actually tells you when the fact was last updated i.e. LastUpdatedDate. Personally, I like to create another TransactionType dimension which will hold a member representing the ‘normal’ non-adjusted facts and a member for the ‘adjustment’ facts that have to be created in the ETL. In the end the ERD would look something like the following:

clip_image002

An example scenario using a flatten fact table for simplicity:

Let say that the fact table starts at 01/01/2013 with only 3 transactions (one for each product)

DateKey DimProduct DimTransType Measure
20130101 ProdA normal 100
20130101 ProdB normal 100
20130101 ProdC normal 100

However on 01/02/2013 it was realized that the data entered on 01/01/2013 for ProdA was incorrect and the real value should have been 75. The ETL should hopefully be able to identify the LastUpdatedDate of the transaction and calculate the difference between the historical ‘normal’ value and the new target value. That difference should then be inserted in the fact table as an ‘adjustment’ transaction.

DateKey DimProduct DimTransType Measure
20130101 ProdA normal 100
20130101 ProdB normal 100
20130101 ProdC normal 100
20130101 ProdA adjustment -25

Now normal + adjustment still sums to the correct value of 75 but report consistency can still be preserved because the report or cube can choose to see the fact values with or without the adjustment records. This will also greatly reduce the ETL fact load processing requirements as oppose to just updating the facts in place. Note: having a good table partitioning strategy is critical for large fact tables.

I also like preserving the adjustment date value as well, as that will allow the business can see how far back transactions are being corrected. In many cases I have seen managers shocked by the frequency of these ‘rare’ corrections Smile

DateKey AdjDateKey DimProduct DimTransType Measure
20130101 20130101 ProdA normal 100
20130101 20130101 ProdB normal 100
20130101 20130101 ProdC normal 100
20130101 20130103 ProdA adjustment -25

PASS Business Analytics Conference

March 3, 2013

Just in case you have been living under a rock for the last few months, PASS along with Microsoft are holding the first ever PASS conference dedicated solely to Business Analytics, in other words data warehousing, Business Analytics (aka BI), and Predicative Analytics (aka Data Mining or Machine learning). A (not so) new guy called Big Data and unstructured data will be making his appearance as well.

These are exciting times that we are living in. There are so many cool things going around Big Data and also a lot of miss-information as well. This conference will contain some great information from some of the leaders in BA (AKA BI). Be there or be square (or least watch the web broadcasts)

PASS BUSINESS ANALYTICTS CONFERENCE

image

ETL the Right Way?

February 10, 2013

The good thing about SQL Server is that there can be many ways to perform a certain task and the bad thing about SQL Server is that there can many ways to perform a certain task. Sometimes it can be difficult to choose the best option for your situation and so most people naturally go with what they are most familiar with.

Take choosing an ETL process as an example. At its most basic level an ETL process simply moves data from point A (source) to point B (data warehouse) but there are many ways to do this and each one has its own inherent advantages and disadvantages. The hard part is that sometimes it is not immediately obvious that you have choosen the incorrect ETL process until months or years later when you start having performance and scalability issues with your ETL and the data warehouse. In some instances people get so frustrated that they just throw more hardware at the problem. While your hardware vendor might love this approach, in most cases this doesn’t solve the root cause of the problem, which in most cases is more of an architectural problem than a hardware problem.

The following is a list of some of the ways that an ETL process can be setup with SQL Server. Note: this is not a complete list. I just tried to list the most common methods that I have seen.

OLTP based ETL-ish

  • Mirroring or 2012  Database Availability Groups
  • Replication (Transactional, Merge, Peer-to-Peer )
  • Log Shipping

Data Warehouse ETL

  • Change Data Capture – CDC
  • Integration Services – SSIS
  • BCP and Flat-Files

OLTP based ETL

The OLTP based ETL techniques are usually employed to off load the overhead of reporting/analytical queries to a secondary backup or “data warehouse” server. While they do accomplish this goal reasonable well they all come with inherent disadvantages that make them inappropriate for most data warehouse ETL scenarios. The first is that every single DML operation is replayed on the “data warehouse” server. This is great if you want a transactionally consistent warm-backup however it will negatively affect the performance if that server happens to be a ‘data warehouse’. For example: If a row is inserted into the source and then updated 100 times throughout the day then all those DML transactions will need to be performed on the secondary “data warehouse” server which also must support all the reporting and analytical queries as well. You may be able to squeeze a little bit more performance out of the “data warehouse” server by using little tricks like changing the isolation level to READUNCOMITTED but this is a band aide fix to a much larger issue.

Scalability can also quickly become an issue. If your “data warehouse” server, which only has 5 sources, runs fine right now with only few minor issues what happens  if the number of sources is increased to 50 or 500? That is 10 to 100 times an increase in volume of DML transactions that need to be committed on the data warehouse server. I hope that your resume is up to date ;)

Aside from the issues mentioned above, additional processing will have to done on the “data warehouse” server to find the deltas so that a traditional star schema (fact and dimension tables) can be populated.

To be honest most the environments that use OLTP based ETLs have not implemented basic data warehouse best practices such as star schemas and so they will naturally pay a price in terms of performance and scalability in the future.

Data Warehouse ETL

The Date Warehouse ETLs are more aligned to daily ETL workloads. They extract the data as it exists at a specific point of time. This is important because the data warehouse server does not have to process all of the DML transactions up to a specific point in time. After the nightly extraction and transformation processes are complete, all of the server’s resources can be dedicated to reporting and analytical queries. This allows for better performance and scalability.

Using incremental data extraction techniques, such as Change Data Capture (CDC), scale very well because only the deltas are extracted from the source and not the full data sets. CDC also has the additional benefit of being able to tell you the DML operation that made the new row version and even which columns were changed! This great because it can greatly reduce the processing required for the transformation process to convert the data into a dimensional format.

Of course this approach assumes that you have gone through the effort to build solid dimensional model in the first place. If you have done that then things like the Fast Track Reference Appliances, Parrellel Data Warehouse, Page Comperssion, and ColumnStore Indexes will work beautifully   to  scale your data warehouse into multiple petabytes if needed!

In short use the right tool for the right job and in the long term that will save you a lot of extra work and pain. The old adage is correct: When you have a hammer everything looks like a nail. So don’t try hammering a screw or you will get screwed!

Data Warehousing Best Practices

December 13, 2012

I was recently invited to give presentation about Data Warehouse best practices to the The Southeast Michigan SQL Server User Group. Blue Granite was kind enough to pay for my travel to the user group.

I don’t usually give many presentations. I am more of a heads down coder by heart, but I love sharing my ideas and experience with others! Let me know what you think.

My presentation is titled Data Warehousing from the Trenches. It is a  presentation of lessons learn based on my recent involvement with Microsoft’s Massive Parallel Processing (MPP) Appliance called Parallel Data Warehouse (PDW). In it I also cover some of the new features of SQL Server 2012 like ColumnStore Indexes and how they are going to greatly enhance SQL Serve’s Data Warehouse scalability and performance.

Presentation PowerPoint Slide Deck

Note: The PDW projects that I have worked on to date where as an Independent Contractor with B.I. Voyage. My Linked-in Profile provides an over view of my work experience.

BIG DATA !?

November 28, 2012

There has been a lot of discussions recently around ‘Big Data’. Most of the conversations  that I have been involved with around ‘BIG DATA’  have, in general, been very productive and informative. They have naturally lead into Data Warehousing best practices and how the business can best utilize it’s IT resources to make better decisions.  (IT is a resource and can help generate revenue!!! NOT an expense)

However, a significant number of conversations, especially on the internet, seem to contain more hype than real  substance around Big Data. It is not unusual for me to see “Big Data” in the title and then have the article or presentation say nothing substantive about it !

It can be hard to filter the hype/marketing from the truth about how proper Big Data management can improve the bottom-line.

So I have compiled a brief list of resources to help with understanding Big Data. I will add to the list as new resources become available.

Resources:

Incorporating Big Data Technologies into your BI & DW Strategy (Chicago): Microsoft in partnership with BlueGranite (full discloser: my employer) is putting together a Big Data presentation to help Business Decision Makers understand the value of Big Data.

RobKerr’s Blog: If you’re a #Microsoft #MSBI Pro, it’s time to learn about #BigData and #Hadoop: Here’s how! Great BIG DATA links !

BI Academy: If you can’t make the presentation in Chicago Rob Kerr (SQL MVP)  has put together series of great videos which provide an introduction to BIG DATA, Hadoop, and Data Warehousing and how they interact together.

Microsoft Big Data : Provides some great introductory material around BIG DATA. If you are a Data Warehouse Professional I would highly recommend that you download HDInsight and get comfortable with it. Yes I know yet another thing to learn but it it worth it!

Polybase: This is still bleeding edge stuff that will first appear in version 2 of SQL Server Parallel Data Warehouse PDW, but I am guessing that similar functionality might make it’s way into regular SQL Server SMP. It ability to query and join both structured and unstructured data is AWESOME!

PAGEIOLATCH_XX is evil

November 18, 2012

Ok well is it probably not nice to pass a moral judgment on something that, through no fault of it’s own, behaves exactly the way that it was designed to behave. But with that said PAGEIOLATCH waits can be a symptom of a disk I/O subsystem issue. What this means is that SQL Server gets your wonderfully written query and first looks in memory to see  if all the data that it needs is in memory. If all the data that SQL Server needs for the query is not in memory then it has to get it from disk.

SQL Server then  reserves some space in memory for the data that it needs from the disk. It sends out a request for that data to the storage engine and waits for the data to come back from disk.

If SQL Server has to wait along time on the data then that means that it is trying to suck a lot of data through a tiny “straw” (the connection between SQL server and it precious data). In other words there is too little throughput  see Why is my DW Query so SLLLOW?!

Scenario:

The CIO walks into your cubical and says that while he was gulfing with the CEO yesterday ..err I mean in an “important meeting” yesterday. The CEO said that his business analyst are complaining that the data warehouse is  preforming  so poorly that the analyst could not deliver the reports that CEO needs for his very “important meetings”.

Using SSMS you discover that there are significant PAGEIOLATCH waits occurring on the server.

select top 1000 * from  sysprocesses
where lastwaittype like ‘PAGEIOLATCH%’

select * from SYS.DM_OS_WAIT_STATS where wait_type like ‘PAGEIO%’

image

To verify what you have discovered you open up Profiler and locate the longest running query that you can find to see if it is waiting Disk IO

image

Executing the query with:

DBCC dropcleanbuffers;
SET STATISTICS IO ON; SET STATISTICS TIME ON;

yields the following results:

Table ’100millionFact’. Scan count 9, logical reads 2040817, physical reads 7, read-ahead reads 2040416, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 68845 mselapsed time = 154302 ms.

2040817 * 8 / 1024 = 15,943.88 MB

Total query execution time is about 2.5 minutes but the CPU is only working for about 1 minute. The rest of the time (about 1.5 minutes) SQL Server is waiting for the data from disk. It can also be derived from the results that the data warehouse is only getting about 100 MB/sec in throughput.

Another way to view the IO problem is with Performance Monitor using Avg. Disk Bytes/Read (black line)

image

Notice that while Avg. Disk Bytes/Read is high my CPU is getting bored waiting for the data from disk!!!

Conclusion:

If the CEO wants to have the data warehouse performance increased then the throughput capabilities will have to be increased.

To ETL or To ELT ?

November 4, 2012

ELT v.s. ELT has been something that has been talked about a lot over the years but I still find that it is still an esoteric subject to most people. One reason for this might be because, in the beginning days of SSIS, there were some mixed signals which has lead to some confusion that still persists to today.

Just to make sure that we are all on the same page, ETL , for the purposes of this post, means that you are using SSIS  to first Extract data from your source system(s)and then SSIS to Transform the data into a dimensional format using all those pretty transforms in  SSIS Toolbox. The last step is then to Load the data into the target dimension or fact tables.

image

Figure 1: Pretty Transforms

They are even prettier in SSDT. Awwwe …they are so cute!

ELT then means first Extracting the data from your sources and then Loading it to a staging location on the data warehouse server. The data is then Transformed into a dimensional format not with SSIS but with the DATABASE ENGINE. In other words you won’t get to use many of those pretty transforms in the Data Flow – SSIS toolbox. This is usually accomplished by using  the Control Flow – Execute T-SQL Statement to execute store procedures or SQL commands.

image

Figure 2: ELT SSIS Package

As shown in Figure 2 it is best to use the T-SQL Merge command to dimensionalize the data because it preforms much faster than separate update, insert, and delete commands.

Which is Better ?

So now the question becomes, which is better for data warehousing loads ETL or ELT? The answer is… ELT !!!! or in other words ALWAYS ELT !!! The reason for this is strait forward, the database engine can preform asynchronous transformations much more efficiently than SSIS !

The SQL Server Database Engine has been around much more longer than SSIS and has been tuned a lot over the years to preform things like sorts and joins (which are need to dimensionalize data) efficiently. That is kind of the whole point of the Query Engine and Query Optimizer!

Bottom Line

Let the Database Engine do what it is best at namely TRANSFORMATIONS and let SSIS do what it does best  namely data pipeline and workflow management. The right tool for the right job!

SSIS Transforms

In the world of SSIS not all transforms are created equal in terms of performance. If you must use them, then you should be aware of which ones will cause the most performance issues.

image

Figure 3: Data Flow Transformations

The worst offenders starting on the right  are the fully-blocking asynchronous transformations followed by the partially-blocking ones which are not as bad but they can still cause issues. The synchronous data flow transformation preform good but you still need to watch out for memory pressure as the datasets increase in size.

Last note: NEVER use the OLE DB Command for a data warehouse batch load process it is pure evil because  it does DML commands on a row-by-row basis. Good luck loading a lot rows with that!


Follow

Get every new post delivered to your Inbox.