I get asked this question a lot and in most instances the answer is… the query is returning a lot of data.
Most Data Warehouse and/or Analytical queries return a lot of data because they aggregate transactions (facts) from a low level of granularity (day) to a higher level of granularity (month). The higher the target level of granularity, the greater the number of transactions that need to be aggregated.
In other words, in order to find out what my total sales for this year are, I have to aggregate all of the sales transactions for this year. This could mean that millions, billions, or even TRILLIONS of sales transactions need to be aggregated to produce the Sales Grand Total for the year.
From a hardware perspective all of your transactions are sitting in your storage/disks (DAS/NAS). At query execution time all of that data needs to be moved from storage to the processors to be aggregated.
The “pipe” that connects storage to the CPUs needs to be wide enough to flood the processors with data so that CPUs/users are not sitting idle waiting for data! This “pipe” width or throughput is measured in MB/sec.
There are standard tools to measure your Data Warehouse system’s throughput in MB/sec, as explained in a previous post.
Once you know the DW throughput in MB/sec you then need to measure the amount of data returned by your query. By comparing the throughput (MB/sec) to the amount of data returned, it should then not be so surprising why your queries are taking so long!
- DW MB/sec = 1 MB/sec
- Query data = 100 MB
Minimum query response time = 100 seconds
If you need to reduce the query response time by half, to 50 seconds, then you need to double the Data Warehouse’s throughput to 2 MB/sec (or reduce the amount of data returned; think: ColumnStore indexes/page compression).
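The arithmetic above can be sketched as a small helper. The figures here are the hypothetical 1 MB/sec and 100 MB numbers from the example, not measurements from a real system:

```python
def min_response_seconds(data_mb: float, throughput_mb_per_sec: float) -> float:
    """Lower bound on query response time: data moved divided by pipe width."""
    return data_mb / throughput_mb_per_sec

# Example figures from the text (hypothetical):
print(min_response_seconds(100, 1))  # 100.0 seconds
print(min_response_seconds(100, 2))  # doubling throughput halves the floor: 50.0
```

Note this is a floor, not a prediction: real response time can only be slower than this, never faster.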
That is great, but how do I measure the amount of data that is returned by a query? Glad you asked!
Create/Populate test table: – (use AdventureWorks; FactInternetSales is much smaller than 10 million rows, so a cross join is one way to multiply it up to size)

SELECT TOP 10000000 a.*
INTO dbo.test10m
FROM [dbo].[FactInternetSales] a
CROSS JOIN [dbo].[FactInternetSales] b;
Very Simple Query Example:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT [SalesTerritoryKey], SUM([SalesAmount])
FROM dbo.test10m
GROUP BY [SalesTerritoryKey]
OPTION (MAXDOP 8);
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
Table ‘test10m’. Scan count 9, logical reads 204082, physical reads 0, read-ahead reads 204082, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 8016 ms, elapsed time = 14522 ms.
Follow this simple formula:
Logical Reads * 8 KB (the size of a single database page) / 1024 = MB
204,082 * 8 / 1024 = 1,594 MB
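The same conversion as a sketch in Python (8 KB is SQL Server’s fixed database page size):

```python
PAGE_KB = 8  # SQL Server database pages are a fixed 8 KB

def logical_reads_to_mb(logical_reads: int) -> float:
    """Convert STATISTICS IO logical reads into megabytes scanned."""
    return logical_reads * PAGE_KB / 1024

# The logical reads figure from the STATISTICS IO output above:
print(round(logical_reads_to_mb(204_082)))  # ~1594 MB
```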
Amazingly, if you check the size of the table in SSMS it reads 1,594 MB as well! In other words, this query reads all of the data in the table.
Data Warehouses are built to move large amounts of data from storage to the CPUs and then aggregate that data. If you know the throughput of the “pipe” that connects storage to the processors (MB/sec) and the amount of data read by the target query, then you can start the process of planning how to build a scalable and predictable Data Warehouse. Now you can accurately set expectations upfront with the users.
Note: there are many other factors that will affect the DW’s performance; this is just one of the most fundamental. No amount of query tuning or index creation will ever make a query perform faster than the DW’s throughput in MB/sec allows.