Posts Tagged ‘columnstore’

Great BLOG posts part 2

October 21, 2012

some great resources have been made available with the release of SQL Server 2012 which provide a deep understanding to some the most important SQL Server data warehousing features.

Inside SQL Server 2012’s ColumnStore Index :

This talk will describe how the new ColumnStore index technology in SQL Server 2012 makes queries go faster. It will cover details of the storage and execution model down to how this model interacts with modern CPUs to deliver significant performance benefits on Data Warehouse queries.

This is an excellent presentation that explains why my results for measuring Columnstore Index Throughput in a previous post was so impressive.

DW Design With The Product Team

This talk will explain the patterns that we recommend to customers when designing data warehouses. This talk will help people learning about DW for the first time and also give insight for those who wish to learn more.

This a great introduction to Data Warehousing done the right way from a couple of the most knowledgeable people in the industry.

Fast Track: improving performance through correct LUN Mapping and Storage Enclosure configuration

Reference architectures and appliances give us greatly balanced configurations which help to speed up Data Warehouse deployments and vendors give very clear guidance on how to set it up for better performance. However, the old know wisdom “Trust but verify” is still true and may help your setup look way better.

I particularly like the following graphics which illustrate a problem that I have run across several times in the past:

Bad Mapping:

Good Mapping:

PDW How-To: Avoid ShuffleMove and PartitionMove Operations 

Reducing shufflemoves in PDW is one the most important ways  to get the best prossible perfromance out of PDW. When used correctly, CTAS  can help unleash the full potential of an MPP architecture.

Advertisements

SQL Server Database Performance Tuning: SQL Server Columnstore Index Throughput

September 2, 2012

SQL Server 2012 ColumnStore Indexes are the best thing to be added to SQL Server for Data Warehousing since … well ever. One of the greatest benefits  of ColumnStore Indexes is an incredible increase in performance. According to SQL Server Columnstore Index FAQ ColumnStore Indexes now rebranded as xVelocity can increase SQL query performance by a factor of 10x to 100x.

From my own experience these performance gains  are not unrealistic but the exact performance gains are dependent on the data cardinality (number of distinct values) and data types in the columns. The lower the data cardinality (few distinct values) the greater the performance gains.

Here is the claim from SQL Server Columnstore Index FAQ:

“When data is stored in column-wise fashion, the data can often be compressed more effectively than when stored in row-wise fashion. Typically there is more redundancy within a column than within a row, which usually means the data can be compressed to a greater degree. When data is more compressed, less IO is required to fetch the data into memory. In addition, a larger fraction of the data can reside in a given size of memory. Reducing IO can significantly speed up query response time. Retaining more of your working set of data in memory will speed up response time for subsequent queries that access the same data.”

Let’s test it out.

Scenario

Create a fact table

select top 100000000 a.*

into [100millionFact]

from [AdventureWorksDWDenali].[dbo].[FactInternetSales] as a,
[AdventureWorksDWDenali].[dbo].[FactInternetSales] as b ;
Go

I created a 100 million row fact table based on the 2012 AdventureWorks FactInternetSales table. It might not be very real world but it is still useful for demonstrating the capabilities of Columnstore Indexes.

Create a Columnstore Index on all columns

CREATE NONCLUSTERED COLUMNSTORE INDEX [NonClusteredColumnStoreIndex-20120723-215255] ON [dbo].[100millionFact]
(
[ProductKey],

[ShipDate]
)WITH (DROP_EXISTING = OFF) ON [PRIMARY]
GO

Execute Simple Aggregate Query

DBCC dropcleanbuffers;
GO
SET STATISTICS IO ON;
GO
SET STATISTICS TIME ON;
GO

SELECT [SalesTerritoryKey], sum([SalesAmount])
FROM [Test].[dbo].[100millionFact]
group by [SalesTerritoryKey]
OPTION (IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX)
GO
SET STATISTICS IO OFF;
GO
SET STATISTICS TIME OFF;
GO

Notice that I used dropcleanbuffers to remove anything from cache and I enabled to IO and TIME stats so that we can see the important query stats.

Query without Columnstore

image

2,040,817 logical reads * 8K per page / 1024 = 15,943 MB of data scanned into memory and then aggregate by the CPU in > 2 minute.

Query with SQL Server 2012 Columnstore Indexes

Now run the the same SQL query without the IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX hint.

image

1020 logical reads * 8k per page / 1024 = 7.9 MB of data scanned into memory and then aggregated in < 1 second.

Summary

In the simplistic example above, the throughput requirements for the aggregate SQL query dropped from about 16 GB to just 8 MB and CPU processing time from over 2 minutes to under 1 second by implementing a SQL Server 2012 Columnstore Index. Columnstore Indexes represent a tremendous performance increase for SQL Server data warehousing databases  because the most expensive, complex, and important part of  a SQL Server data warehouse server is getting the maximum throughput from storage to the CPU.

Code

Great BLOG posts

August 19, 2012

I don’t usually do this but there have been recent BLOG posts that due to the importance the information being shared, I would like to call attention too and preserve their conclusions on my BLOG as well.

Thomas Kesjer has created a great series of posts explaining some of the fundamentals of Columnstore Indexes. I particular enjoy the his latest post entitled “What is the Best Sort Order for a Column Store?” because it gives some SQL queries that can be used to calculate such things as data entropy.

example:

p(xi)

SELECT X
, CAST(COUNT(*) AS FLOAT) / SUM(COUNT(*)) OVER (PARTITION BY NULL) pX
FROM Table
GROUP BY X

 H(X)

SELECT -1 * SUM(pX * LOG(pX, 2)) AS H
FROM
(
SELECT X
, CAST(COUNT(*) AS FLOAT) / SUM(COUNT(*)) OVER (PARTITION BY NULL) pX
FROM Table
GROUP BY X
) AS p

The Data Warehouse Junkie also wrote a post entitled “Design SSIS for Performance and Scale – Baseline tests” which demonstrates some good and proven conclusions based on the Fast Track recommendations.

SSIS Degrees of parallelism impact data fragmentation: “the DOP setting has no influence on our SSIS performance and this also means that the BULK API is single threaded (because performance does not increase). As you can also see we don’t get any fragmentation of our data when we increase the DOP setting.”

This is very good to know.

I also think that his conclusion about enabling page compression matches what I have seen with the Fast Track systems that I have worked with as well. Enabling Page Compression comes at the cost of increasing load times but gaining better I/O throughput (sequential reads/writes in MB/sec).  For most DW systems this is a good trade off but if your DW  is not I/O and you have a very tight load window then you might to consider not using Page Compression.

In either case the most important thing to due is to test it first before you make any decisions!

Columnstore Indexes

March 15, 2012

See full whitepaper: Columnstore Indexes for Fast DW

” The SQL Server 11.0 release (2012) introduces a new data warehouse query acceleration feature based on a new type of index called the columnstore. This new index, combined with enhanced query optimization and execution features, improves data warehouse query performance by hundreds to thousands of times in some cases, and can routinely give a tenfold speedup for a broad range of queries fitting the scenario for which it was designed. It does all this within the familiar T-SQL query language, and the programming and system management environment of SQL Server. It’s thus fully compatible with all reporting solutions that run as clients of SQL Server, including SQL Server Reporting Services.

A columnstore index stores each column in a separate set of disk pages, rather than storing multiple rows per page as data traditionally has been stored. We use the term “row store” to describe either a heap or a B-tree that contains multiple rows per page. The difference between column store and row store approaches is illustrated below:
image

Figure 1: Comparison between row store and column store data layout
5
The columns C1…C6 are stored in different groups of pages in the columnstore index. Benefits of this are:

  •  only the columns needed to solve a query are fetched from disk (this is often fewer than 15% of the columns in a typical fact table),
  •  it’s easier to compress the data due to the redundancy of data within a column, and
  •  buffer hit rates are improved because data is highly compressed, and frequently accessed parts of commonly used columns remain in memory, while infrequently used parts are paged out.

The columnstore index in SQL Server employs Microsoft’s patented Vertipaq™ technology, which it shares with SQL Server Analysis Services and PowerPivot. SQL Server columnstore indexes don’t have to fit in main memory, but they can effectively use as much memory as is available on the server. Portions of columns are moved in and out of memory on demand.

SQL Server Denali columnstore indexes are “pure” column stores, not a hybrid, because they store all data for separate columns on separate pages. This improves I/O scan performance and buffer hit rates. SQL Server is the first major database product to support a pure columnstore index. Others have claimed that it was impossible to leverage pure columnstore technology in an established database product with a broad market. We’re happy to prove them wrong and we think you’ll be glad we did!
Using Columnstore Indexes

To improve query performance, all you need to do is build a columnstore index on the fact tables in a data warehouse. If you have extremely large dimensions (say more than 10 million rows) then you may wish to build a columnstore index on those dimensions as well. After that, you simply submit queries to SQL Server, and they can run much, much faster.”

Note: ColumnStore Index was just renamed to “xVelocity

Resources:

Columnstore Index Wiki – contains the latest information from white papers, code examples, to presentations.