Many times I am asked about how to create indexes on a data warehouse SQL Server Database so that queries execute much faster. Unfortunately, unless they are able to upgrade to SQL Server 2012 in order to use ColumnStore Indexes, there is usually very little that I can do to speed up the queries. This is usually do to the fact that the Data Warehouse Server was not architected for data warehouse workloads to begin with and therefore is physically constrained on its throughput (MB/sec).
So I have created this little demonstration to illustrate the point:
I created a 10 million row fact table based on the good old AdventureWorks DB and setup a very simple scan centric SQL Server data warehouse query:
With logical reads 208337 that means that 1,627 MB of data was returned from disk to satisfy the query. And magically, if you check the size of the table in SQL Server Management Studio it is also reads 1,627 MB. This is due to the fact that there were no indexes to read from and so the SQL Server had to scan the entire table:
Now indexes can be created to reduce the amount of data that needs to be scanned by only scanning the index data.
Note: The best practice is to actually create a separate index for each fact table column, but just to keep things simple I only created an index for the columns that will be queried.
Now if you run the same query again there the message box reads logical reads 36812 or 287 MB. Creating indexes drops by over 500% in the amount of data required for this query. This reduction in data being read is primarily due to the fact that the data is being read from the indexes, which are much smaller, and therefore less time consuming than read all the data directly from the table. See Index Scans:
[ix_NoCompression_10m_SalesAmount] – 27,176 * 8 / 1024 = 212 MB/sec
[ix_NoCompression_10m_SalesTerritoryKey] – 22,276 * 8 / 1024 = 174 MB/sec
However, the limitation to this approach is that the data being returned will never be smaller than indexes that are created on the fact table. Real world fact tables have billions of rows and so those indexes will not all fit into memory therefore the data must be retrieved directly from disk which will effect query performance negatively, especially if the data warehouse server has not been specifically tuned to deliver good throughput (MB/sec).
In fact even if query patterns (something which you cannot control) change, the amount of data required by the queries will remain about the same. In the following example I commented out 2 predicate values (SalesTerritory). One would think that now the data required would drop by about ½ as well.
–or SalesTerritoryKey = 6
–or SalesTerritoryKey = 7
However the logical reads only dropped from 36,812 (287 MB) to 33612 (262 MB) or only about 25 MB less data required by removing ½ of the predicate values. Even if the number of predicates values were too drop to 1, the logical reads are still 30128 or 235 MB. This is due to the fact that no matter how many predicate values are used the index for that predicate still has to be scanned. Remember Data Warehouse queries require a lot of historical data and so are scan centric. While scanning the index might not seem all that bad for this small 10 million row fact table imagine what would happen if you increased the number rows of rows by a factor of 100 or even 1,000 that would be potentially 10s of billions of rows and 100s of GB for the indexes!
Some of you might be tempted to say that an index could be created for each predicate value (SalesTerritory) while that is true it would be very quickly become unmanageable and in most cases it is nearly impossible to predict all the ways that the users will query the data for historical analysis.
The real answer is to correctly architect the data warehouse server ahead of time in order to deliver the query results in a predictable and acceptable period of time. However, with that said there are a few things that you can do to increase the ability data warehouse server to deliver higher MB/second than it normal could under normal conditions.
Page Compression works by cramming more data on to each database page which means that fewer pages need to be read from disk.
Note the indexes new sizes:
ix_NoCompression_10m_SalesAmount 14279 = 11 MB
ix_NoCompression_10m_SalesTerritoryKey 14350 = 11 MB
Due to using Page Compression the indexes are significantly smaller and so it is no surprise that if the same query is run again even less data needs to be scanned from the indexes:
Logical Reads = 20475 or 159 MB
Page compression reduces the data required from 287 MB to 159 MB or a saving of about 55%. In the real world I see about a 33% savings.
And just in case you were wondering, if you upgrade to SQL Server 2012 Columnstore Indexes the data required drops from 287 MB to an incredible 1.06 MB !!!
Complex indexing strategies for Data Warehouse workloads provide no additional benefit. The Data Warehouse Server needs to be carefully architected upfront to deliver the required MB/second throughput so that the users are happy with the performance of their queries.