Archive for November, 2012

BIG DATA !?

November 28, 2012

There has been a lot of discussion recently around ‘Big Data’. Most of the conversations that I have been involved with around ‘BIG DATA’ have, in general, been very productive and informative. They have naturally led into Data Warehousing best practices and how the business can best utilize its IT resources to make better decisions. (IT is a resource and can help generate revenue!!! NOT an expense)

However, a significant number of conversations, especially on the internet, seem to contain more hype than real substance around Big Data. It is not unusual for me to see “Big Data” in the title and then have the article or presentation say nothing substantive about it!

It can be hard to filter the hype/marketing from the truth about how proper Big Data management can improve the bottom-line.

So I have compiled a brief list of resources to help with understanding Big Data. I will add to the list as new resources become available.

Resources:

Incorporating Big Data Technologies into your BI & DW Strategy (Chicago): Microsoft in partnership with BlueGranite (full disclosure: my employer) is putting together a Big Data presentation to help Business Decision Makers understand the value of Big Data.

RobKerr’s Blog: If you’re a #Microsoft #MSBI Pro, it’s time to learn about #BigData and #Hadoop: Here’s how! Great BIG DATA links!

BI Academy: If you can’t make the presentation in Chicago, Rob Kerr (SQL MVP) has put together a series of great videos which provide an introduction to BIG DATA, Hadoop, and Data Warehousing and how they interact with each other.

Microsoft Big Data: Provides some great introductory material around BIG DATA. If you are a Data Warehouse Professional, I would highly recommend that you download HDInsight and get comfortable with it. Yes, I know, yet another thing to learn, but it is worth it!

Polybase: This is still bleeding-edge stuff that will first appear in version 2 of SQL Server Parallel Data Warehouse (PDW), but I am guessing that similar functionality might make its way into regular SQL Server SMP. Its ability to query and join both structured and unstructured data is AWESOME!

SQL Server Database Performance Tuning: PAGEIOLATCH_XX is evil

November 18, 2012

OK, well, it is probably not nice to pass a moral judgment on something that, through no fault of its own, behaves exactly the way that it was designed to behave. But with that said, PAGEIOLATCH waits can be a symptom of a disk I/O subsystem issue. What this means is that SQL Server gets your wonderfully written query and first looks in memory to see if all the data that it needs is already there. If all the data that SQL Server needs for the query is not in memory then it has to get it from disk.

SQL Server then  reserves some space in memory for the data that it needs from the disk. It sends out a request for that data to the storage engine and waits for the data to come back from disk.

If SQL Server has to wait a long time on the data, then that means that it is trying to suck a lot of data through a tiny “straw” (the connection between SQL Server and its precious data). In other words, there is too little throughput (see Why is my DW Query so SLLLOW?!).

Scenario:

The CIO walks into your cubicle and says that while he was golfing with the CEO yesterday ..err, I mean in an “important meeting” yesterday, the CEO said that his business analysts are complaining that the data warehouse is performing so poorly that the analysts could not deliver the reports that the CEO needs for his very “important meetings”.

Using SSMS you discover that there are significant PAGEIOLATCH waits occurring on the server.

select top 1000 * from sysprocesses
where lastwaittype like 'PAGEIOLATCH%'

select * from sys.dm_os_wait_stats where wait_type like 'PAGEIO%'
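The same check can be sketched against the newer DMVs as well; this is a minimal, illustrative version (run it in any database, no objects assumed beyond the system views):

```sql
-- Sessions currently waiting on a PAGEIOLATCH latch (i.e. waiting on disk I/O
-- to bring pages into the buffer pool)
SELECT session_id, wait_type, wait_time, command
FROM sys.dm_exec_requests
WHERE wait_type LIKE 'PAGEIOLATCH%';

-- Cumulative PAGEIOLATCH wait time since the last service restart,
-- worst offenders first
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PAGEIO%'
ORDER BY wait_time_ms DESC;
```

Note that sys.dm_os_wait_stats is cumulative, so a large number by itself proves nothing; it is the trend relative to other wait types that matters.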


To verify what you have discovered, you open up Profiler and locate the longest-running query that you can find to see if it is waiting on disk I/O.


Executing the query with:

DBCC DROPCLEANBUFFERS;
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

yields the following results:

Table '100millionFact'. Scan count 9, logical reads 2040817, physical reads 7, read-ahead reads 2040416, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 68845 ms, elapsed time = 154302 ms.

2,040,817 pages * 8 KB/page / 1024 = 15,943.88 MB read from disk

Total query execution time is about 2.5 minutes but the CPU is only working for about 1 minute. The rest of the time (about 1.5 minutes) SQL Server is waiting for the data from disk. It can also be derived from the results that the data warehouse is only getting about 100 MB/sec in throughput.
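The arithmetic above can be checked with a quick throw-away query (the literals are just the numbers from the STATISTICS IO/TIME output above):

```sql
-- 2,040,817 pages read, 8 KB per page, elapsed time ~154.3 seconds
SELECT 2040817 * 8.0 / 1024          AS mb_read_from_disk,  -- = 15943.88 MB
       2040817 * 8.0 / 1024 / 154.3  AS mb_per_sec;         -- roughly 103 MB/sec
```

About 100 MB/sec is roughly what a single commodity disk path can deliver, which is why the CPU spends most of the elapsed time idle.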

Another way to view the IO problem is with Performance Monitor using Avg. Disk Bytes/Read (black line)


Notice that while Avg. Disk Bytes/Read is high, my CPU is getting bored waiting for the data from disk!!!

Conclusion:

If the CEO wants to have the data warehouse performance increased then the throughput capabilities will have to be increased.

SQL Server Database Performance Tuning: To ETL or To ELT ?

November 4, 2012

ETL vs. ELT has been talked about a lot over the years, but I still find that it is an esoteric subject to most people. One reason for this might be that, in the early days of SQL Server SSIS, there were some mixed signals which have led to confusion that persists to this day.

Just to make sure that we are all on the same page: ETL, for the purposes of this post, means that you are using SSIS to first Extract data from your source system(s) and then SSIS to Transform the data into a dimensional format using all those pretty transforms in the SSIS Toolbox. The last step is then to Load the data into the target dimension or fact tables.


Figure 1: Pretty Transforms

They are even prettier in SSDT. Awwwe …they are so cute!

ELT then means first Extracting the data from your sources and then Loading it to a staging location on the SQL Server data warehouse server. The data is then Transformed into a dimensional format not with SSIS but with the DATABASE ENGINE. In other words, you won’t get to use many of those pretty transforms in the Data Flow SSIS toolbox. This is usually accomplished by using the Control Flow’s Execute T-SQL Statement Task to execute stored procedures or SQL commands.


Figure 2: ELT SSIS Package

As shown in Figure 2, it is best to use the T-SQL MERGE command to dimensionalize the data because it performs much faster than separate UPDATE, INSERT, and DELETE commands.
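A minimal sketch of that MERGE pattern for a Type 1 dimension load might look like this (the dbo.DimCustomer and stg.Customer table and column names are illustrative, not from the post):

```sql
-- Upsert staged rows into the dimension in one set-based statement:
-- update changed rows, insert brand-new ones
MERGE dbo.DimCustomer AS tgt
USING stg.Customer    AS src
    ON tgt.CustomerKey = src.CustomerKey
WHEN MATCHED AND tgt.CustomerName <> src.CustomerName THEN
    UPDATE SET tgt.CustomerName = src.CustomerName
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerKey, CustomerName)
    VALUES (src.CustomerKey, src.CustomerName);
```

Because the whole upsert is one statement, the engine gets one plan, one pass over the data, and one log stream, instead of the three it would need for separate UPDATE/INSERT/DELETE steps.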

Which is Better ?

So now the question becomes: which is better for data warehousing loads, ETL or ELT? The answer is… ELT!!!! Or, in other words, ALWAYS ELT!!! The reason for this is straightforward: the database engine can perform asynchronous transformations much more efficiently than SSIS!

The SQL Server Database Engine has been around much longer than SSIS and has been tuned a lot over the years to perform things like sorts and joins (which are needed to dimensionalize data) efficiently. That is kind of the whole point of the Query Engine and Query Optimizer!

Bottom Line

Let the Database Engine do what it is best at, namely TRANSFORMATIONS, and let SSIS do what it does best, namely data pipeline and workflow management. The right tool for the right job!

SSIS Transforms

In the world of SSIS not all transforms are created equal in terms of performance. If you must use them, then you should be aware of which ones will cause the most performance issues.


Figure 3: Data Flow Transformations

The worst offenders, starting on the right, are the fully-blocking asynchronous transformations, followed by the partially-blocking ones, which are not as bad but can still cause issues. The synchronous data flow transformations perform well, but you still need to watch out for memory pressure as the datasets increase in size.

Last note: NEVER use the OLE DB Command for a SQL Server data warehouse batch load process. It is pure evil because it performs DML commands on a row-by-row basis. Good luck loading a lot of rows with that!
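To make the row-by-row problem concrete: the OLE DB Command effectively fires one statement per pipeline row, where a staging table plus a single set-based statement does the same work in one shot. A sketch, with illustrative table names (dbo.FactSales and stg.SalesUpdates are not from the post):

```sql
-- What the OLE DB Command does under the covers, once PER ROW:
--   UPDATE dbo.FactSales SET Quantity = ? WHERE SalesKey = ?;

-- The set-based ELT alternative: land the rows in staging first,
-- then issue ONE statement for the whole batch
UPDATE f
SET    f.Quantity = s.Quantity
FROM   dbo.FactSales   AS f
JOIN   stg.SalesUpdates AS s
       ON f.SalesKey = s.SalesKey;
```

One statement means one plan and one scan instead of a million round trips, which is exactly the ELT point made above.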