I get asked this question a lot and in most instances the answer is… the query is returning a lot of data.
Most Data Warehouse and/or Analytical queries return a lot of data because they aggregate transactions (facts) from a low level of granularity (day) to a higher level of granularity (month). The higher target level of granularity, the greater the number of transactions that need be aggregated.
On other words, in order to find out what my total sales for this year is, I have to aggregate all of the sales transactions for this year. This could be mean that millions, billions, or even TRILLIONS of sales transactions need to be aggregated to produce the Sales Grand Total for the year.
From a hardware perspective all of your transactions are sitting in your storage/disks (DAS/NAS). At query execution time all of that data needs to be moved from storage to the processors to be aggregated.
The “pipe” that connects storage to the CPUs needs to be wide enough to flood the processors with data so that CPUs/users are not sitting idle waiting for data! This “pipe” width or throughput is measured in MB/sec.
There are standard to tools measure your Data Warehouse System’s throughput in MB/sec. As explained in a previous post.
Once you know the DW throughput in MB/sec you then need to measure the amount of data returned by your query. By comparing the throughput (MB/sec) to the amount of data returned, it should then not be so surprising why your queries are taking so long!
Example:
- DW MB/sec = 1 MB/sec
- Query data = 100 MB
Minimum query response time = 100 seconds
If you need reduce the query response by 1/2 to 50 seconds then you need double the Data Warehouse’s throughput to 2 MB/sec (or reduce the amount of data returned think: ColumnStore Indexes/Page Compression
)
That is great, but how do I measure the amount of data that is returned by a query? Glad you asked !
Create/Populate test table: – (use AdventureWorks)
select top 10000000 a.*
into test10m
from [dbo].[FactInternetSales] a
,[dbo].[FactInternetSales] b
Very Simple Query Example:
DBCC dropcleanbuffers;
GO
SET STATISTICS IO ON;
GO
SET STATISTICS TIME ON;
GO
SELECT [SalesTerritoryKey], sum([SalesAmount])
FROM [dbo].[NoCompression_10m]
group by [SalesTerritoryKey]
Option (MAXDOP 8);
GO
SET STATISTICS IO OFF;
GO
SET STATISTICS TIME OFF;
GO
Message Results:
Table ‘test10m’. Scan count 9, logical reads 204082, physical reads 0, read-ahead reads 204082, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 8016 ms, elapsed time = 14522 ms.
Follow this simple formula:
Logical Reads * 8 (single DB page) / 1024 (MB)
204,082 * 8 / 1024 = 1,594 MB
Amazingly if you check the size of the table in SSMS it reads: 1,594 MB as well !! In other words this query returns all the data from the table.
Summary:
Data Warehouses are built to move large amounts of data from storage to the CPUs and then aggregate the data. If you know the throughput of the that connects storage to the processors (MB/sec) and the amount of data returned by the target query, then you can start the process of planning how to build a scalable and predictable Data Warehouse. Now you can accurately set expectations upfront with the users.
Note: there many other factors that will effect the DW’s performance, this is just one of the most fundamental. No amount of query tuning or index creation will ever make query preform faster than the DW’s throughput MB/sec.
September 24, 2012 at 7:54 am |
So, is there any merit to performing what might be termed pre-aggregation? For instance, if your initial level of granularity is at a raw transaction level and you know that you can expect to need the aggregations at day, week, month, quarter, and year level then, at least for historical (i.e., say, last year and older) data, might it be worthwhile to perform those aggregations one time into a set of fact tables so that ad hoc and reporting queries that are analyzing day over day, wek over week, etc., don’t have to constantly aggregate and re-aggregate the results?
September 24, 2012 at 9:51 am |
To handle pre-aggregations I would highly recommend using SSAS multi-dimensional cubes. Cubes are much easier to develop and maintain than aggregate tables, provided that the Data Warehouse is architected correctly (facts/dimensions). They also have many other features that can be leveraged by the users i.e. Pivot Tables for ad-hoc analysis. In SQL Server 2012 the paradigm shifts to Columnstore Indexes plus SSAS tabular cubes.
September 24, 2012 at 10:07 am |
As you may have guessed, I am not well versed in SSAS. I was partially basing my question on my experience with a rather massive data warehouse on Oracle. Apparently, they had run into some processing issues and had created a set of tables to handle the aggregates at various time intervals which they then used in doing their analysis. I kind of thought there had to be a better way, though.
I know that it posed an interesting problem for _me_ because, at the time, I was developing an application that would allow the marketing staff to define queries, pretty much in their own terminology, that could be saved and/or executed to refine the deployment of product based upon historical information. Because of the _sets_ of tables, I had to handle slight variations in column names and datatypes as well as handling non-trivial variations in the JOINs..
November 18, 2012 at 10:35 am |
[...] If SQL Server has to wait along time on the data then that means that it is trying to suck a lot of data through a tiny “straw” (the connection between SQL server and it precious data). In other words there is to little throughput see Why is my DW Query so SLLLOW?! [...]
April 29, 2013 at 1:11 pm |
Very good blog! Do you have any recommendations for aspiring writers?
I’m planning to start my own blog soon but I’m a little lost on everything.
Would you advise starting with a free platform like WordPress or go for
a paid option? There are so many options out there that I’m totally overwhelmed .. Any tips? Thank you!
May 1, 2013 at 7:31 am |
My buddy James just wrote a BLOG post about that.