Archive for February, 2012

Cartesian/Cross Join

February 19, 2012

I was once asked in a job interview if I knew what a Cartesian/Cross Join was.

A more interesting question is: How can a Cross Join be useful?

There are actually many good reasons to use a Cross Join. Outlined below are a couple of the most common scenarios in which I have used one in Data Warehousing:

1. Junk Dimension

In addition to the advantages listed in the Kimball article, Junk Dimensions can greatly reduce the overhead of processing a large dimension. Consider the following real-life scenario:

An existing Type 1 slowly changing customer dimension with 10+ million members needs to have a set of Flag/Boolean attributes (one for each day of the week) added to it. And to make things worse, these attributes need to be Type 2. Do you change the already large dimension to SCD 2 on attributes that might change frequently? NO!

Solution: Create a Junk Dimension with all possible combinations of the Flag/Boolean indicators. See the code and screenshot below:

--create table to hold weekday values
declare @WeekFlag table
(Sunday bit,
Monday bit,
Tuesday bit,
Wednesday bit,
Thursday bit,
Friday bit,
Saturday bit)
--insert boolean values for each weekday
insert into @WeekFlag values (0,0,0,0,0,0,0)
insert into @WeekFlag values (1,1,1,1,1,1,1)
--look at table
select * from @WeekFlag
--self join the table using cross joins to get all 2^7 = 128 combinations
select a.Sunday, b.Monday, c.Tuesday, d.Wednesday, e.Thursday, f.Friday, g.Saturday
from (select Sunday from @WeekFlag) as a
cross join (select Monday from @WeekFlag) as b
cross join (select Tuesday from @WeekFlag) as c
cross join (select Wednesday from @WeekFlag) as d
cross join (select Thursday from @WeekFlag) as e
cross join (select Friday from @WeekFlag) as f
cross join (select Saturday from @WeekFlag) as g

[Screenshot: results of the weekday flag cross join]
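As a hypothetical next step (the dbo.DimWeekFlag table and its column names are my own, not part of the original scenario), the 2^7 = 128 combinations could be persisted with an identity surrogate key, so the fact table only carries one small key instead of forcing seven Type 2 flags onto the 10+ million member customer dimension:

--minimal sketch: persist all 128 flag combinations into a junk dimension table
create table dbo.DimWeekFlag
(WeekFlagKey int identity(1,1) primary key,
Sunday bit,
Monday bit,
Tuesday bit,
Wednesday bit,
Thursday bit,
Friday bit,
Saturday bit)

insert into dbo.DimWeekFlag (Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday)
select su.v, mo.v, tu.v, we.v, th.v, fr.v, sa.v
from (values (0), (1)) as su(v)
cross join (values (0), (1)) as mo(v)
cross join (values (0), (1)) as tu(v)
cross join (values (0), (1)) as we(v)
cross join (values (0), (1)) as th(v)
cross join (values (0), (1)) as fr(v)
cross join (values (0), (1)) as sa(v)

Type 2 changes then touch only this tiny dimension rather than the large customer dimension.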

2. Time Dimension

“Time Dimension” here does not mean a Time Dimension with the granularity of a day (like AdventureWorks) but a Dimension with a granularity LESS THAN A DAY. Example: analyze transactions down to the millisecond.

Don’t try to combine a Dimension with a granularity of a day with one at millisecond granularity, or you will end up with billions of members (a single day alone has 86,400,000 milliseconds)!

Solution:

Cross Join each level of the Time Dimension.

Milliseconds -> Seconds -> Minutes -> Hours

Here is an example of the first cross join in the series:

--build a table of millisecond values (0-999)
declare @MilliSecTable table (MilliSecValue int)
declare @Msec as int = 0
insert into @MilliSecTable values (@Msec)
while (select MAX(MilliSecValue) from @MilliSecTable) < 999
begin
    insert into @MilliSecTable values ((select MAX(MilliSecValue) from @MilliSecTable) + 1)
end
---------------------------------------------------------------------------------------
--build a table of second values (0-59)
declare @SecTable table (SecValue int)
declare @Sec as int = 0
insert into @SecTable values (@Sec)
while (select MAX(SecValue) from @SecTable) < 59
begin
    insert into @SecTable values ((select MAX(SecValue) from @SecTable) + 1)
end

--cross join the two levels: 60 * 1,000 = 60,000 rows
select SecValue, MilliSecValue from @SecTable cross join @MilliSecTable

[Screenshot: results of the seconds x milliseconds cross join]

Note: The above code also performs a lot better than simply incrementing each member by 1 millisecond. That could literally take days 😉
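To complete the series, the remaining two levels can be chained on in the same way. The sketch below is one possible continuation (the @MinTable and @HourTable names are my own, and it assumes it runs in the same batch as the code above so that @SecTable and @MilliSecTable are still in scope):

--build a table of minute values (0-59)
declare @MinTable table (MinValue int)
declare @Min as int = 0
while @Min <= 59
begin
    insert into @MinTable values (@Min)
    set @Min += 1
end
--build a table of hour values (0-23)
declare @HourTable table (HourValue int)
declare @Hour as int = 0
while @Hour <= 23
begin
    insert into @HourTable values (@Hour)
    set @Hour += 1
end
--cross join all four levels: 24 * 60 * 60 * 1,000 = 86,400,000 rows, one per millisecond in a day
--(in practice you would insert these into the Time dimension table rather than return them to the client)
select h.HourValue, m.MinValue, s.SecValue, ms.MilliSecValue
from @HourTable as h
cross join @MinTable as m
cross join @SecTable as s
cross join @MilliSecTable as ms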

Resources:

The Many-to-Many Revolution: This takes the basic idea of T-SQL Cross Joins and applies it to SSAS. It is a great read!

CHECKSUM vs HashBytes

February 1, 2012
Hashing can be useful in Data Warehousing as well 😉 It gives you the ability to break large problems into smaller, more manageable pieces, or to scale out your ETL process 😉 PDW even uses a similar method to distribute data internally among all of its compute nodes.
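As a rough illustration of the idea (the business key values and the bucket count below are made up for the example), a hash of the business key can be used to assign each row to one of N buckets, so that parallel ETL streams each process only their own slice:

--hypothetical example: assign rows to one of 8 hash buckets for parallel ETL streams
declare @BucketCount int = 8

select k.CustomerBusinessKey,
abs(cast(hashbytes('MD5', k.CustomerBusinessKey) as int) % @BucketCount) as BucketId
from (values ('CUST-0001'), ('CUST-0002'), ('CUST-0003'), ('CUST-0004')) as k(CustomerBusinessKey)
--each ETL stream then filters on its own BucketId (0 through 7)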

In reality there is no competition between CHECKSUM and HashBytes. CHECKSUM is a clear loser all the way around, as BOL (clearly?) states:

“CHECKSUM satisfies the properties of a hash function: CHECKSUM applied over any two lists of expressions returns the same value if the corresponding elements of the two lists have the same type and are equal when compared using the equals (=) operator. For this definition, null values of a specified type are considered to compare as equal. If one of the values in the expression list changes, the checksum of the list also generally changes. However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.”

The CHECKSUM algorithm is very simple, but fast. Depending on the data types used, or if there are null values, collisions can occur frequently. I have used GUIDs to produce collisions in just a few thousand rows!
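If you want to see this for yourself, a quick (and unscientific) test along these lines can be run; the 10,000-row sample size is an arbitrary choice, and the collision counts will vary from run to run:

--generate a batch of GUIDs and count duplicate hash values for each function
declare @Guids table (Id uniqueidentifier)
declare @i int = 0
while @i < 10000    --increase the row count if no collisions appear on your system
begin
    insert into @Guids values (newid())
    set @i += 1
end

select count(*) - count(distinct checksum(Id)) as ChecksumCollisions,
count(*) - count(distinct hashbytes('MD5', convert(char(36), Id))) as HashBytesCollisions
from @Guids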

Thomas Kejser has written a great series of articles about hash functions:

Exploring Hash Functions in SQL Server:

Hash distributing rows is a wonderful trick that I often apply. It forms one of the foundations for most scale-out architectures. It is therefore natural to ask which hash functions are most efficient, so we may choose intelligently between them.

In this blog post, I will benchmark the built-in functions in SQL Server. I will focus on answering two questions:

  1. How fast is the hash function?
  2. How well does the hash function spread data over a 32-bit integer space?

I know there is also the question about how cryptographically safe the function is, but this is not a necessary property for scale-out purposes – hence, that aspect is out of scope for this blog.

Writing New Hash Functions for SQL Server: In this blog, I will explore ideas for extending SQL Server with new, fast hash functions. As will be shown, the high-speed, built-in functions CHECKSUM and BINARY_CHECKSUM are not always optimal hash functions when you require the function to spread data evenly over an integer space. I will show that it can be useful to extend SQL Server with a new CLR User Defined Function (UDF).