CHECKSUM vs HashBytes

Hashing can be useful in Data Warehousing as well 😉 It can give you the ability to break large problems into smaller, more manageable pieces, or to scale out your ETL process. PDW even uses a similar method to distribute data internally among all the compute nodes.

In reality there is no competition between CHECKSUM and HashBytes: CHECKSUM is a clear loser all the way around, as BOL clearly states:

“CHECKSUM satisfies the properties of a hash function: CHECKSUM applied over any two lists of expressions returns the same value if the corresponding elements of the two lists have the same type and are equal when compared using the equals (=) operator. For this definition, null values of a specified type are considered to compare as equal. If one of the values in the expression list changes, the checksum of the list also generally changes. However, there is a small chance that the checksum will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.”

The CHECKSUM algorithm is very simple but fast. Depending on the data types used, or if there are null values, collisions can occur frequently. I have used GUIDs to produce collisions in just a few thousand rows!
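To put numbers on why collisions show up so soon, the classic birthday bound gives the odds that at least two rows share a value in a 32-bit space. This is a back-of-the-envelope model of a perfectly uniform 32-bit hash; CHECKSUM over GUIDs spreads far worse than uniform, which is why collisions can appear within a few thousand rows:

```python
import math

def collision_probability(n_rows: int, space_bits: int = 32) -> float:
    """Birthday-bound approximation: probability that at least two of
    n_rows values collide in a space of 2**space_bits possible hashes."""
    space = 2 ** space_bits
    return 1.0 - math.exp(-n_rows * (n_rows - 1) / (2.0 * space))

# Even an ideal 32-bit hash reaches a coin-flip collision chance
# well under 100,000 rows:
for n in (10_000, 77_000, 1_000_000):
    print(n, round(collision_probability(n), 3))
```

Around 77,000 rows the probability of at least one collision already passes 50%, so a 32-bit hash value can never be treated as a unique row identifier.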

Thomas Kejser has written a great series of articles about hash functions.

Exploring Hash Functions in SQL Server:

Hash distributing rows is a wonderful trick that I often apply. It forms one of the foundations for most scale-out architectures. It is therefore natural to ask which hash functions are most efficient, so we may choose intelligently between them.

In this blog post, I will benchmark the built-in functions in SQL Server. I will focus on answering two questions:

  1. How fast is the hash function?
  2. How well does the hash function spread data over a 32-bit integer space?

I know there is also the question about how cryptographically safe the function is, but this is not a necessary property for scale-out purposes – hence, that aspect is out of scope for this blog.
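Question 2 can be checked with a simple bucket test: hash a batch of keys, assign each to a bucket by modulo, and compare the fullest and emptiest buckets. A minimal Python sketch of the idea, using zlib.crc32 as a stand-in hash since CHECKSUM only exists inside SQL Server:

```python
import zlib
from collections import Counter

def bucket_spread(values, n_buckets=16):
    """Hash each value with crc32, assign it to a bucket by modulo,
    and return the smallest and largest bucket sizes."""
    counts = Counter(zlib.crc32(v.encode()) % n_buckets for v in values)
    sizes = [counts.get(b, 0) for b in range(n_buckets)]  # count empty buckets too
    return min(sizes), max(sizes)

smallest, largest = bucket_spread(f"key-{i}" for i in range(100_000))
print(smallest, largest)  # a well-spread hash keeps both close to 100_000 / 16
```

A hash that spreads well keeps the extremes close together; a badly skewed one leaves some buckets (and therefore some compute nodes in a scale-out design) doing most of the work.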

Writing New Hash Functions for SQL Server: In this blog, I will explore ideas for extending SQL Server with new, fast hash functions. As will be shown, the high-speed, built-in functions CHECKSUM and BINARY_CHECKSUM are not always optimal hash functions when you require the function to spread data evenly over an integer space. I will show that it can be useful to extend SQL Server with a new CLR User Defined Function (UDF).



5 Responses to “CHECKSUM vs HashBytes”

  1. way0utwest Says:

    Hashbytes is definitely better for encryption purposes, but using SHA1 only, not MD5.

    However, for your purposes, I think it works well in some cases, but not great. Of course, it depends on your data; you might get enough of a spread to handle the scale-out processing.

    Looking forward to seeing the CLR version.

    • garrettedmondson Says:

      @way0utwest – good point. In the past I used only MD5 and it worked pretty well, until I ran into a situation where one of the source systems used only GUIDs for primary keys! 😦 With MD5 I got a lot of collisions, so I changed the hashing to one of the SHA algorithms and didn’t have any problems with collisions. BTW, Microsoft’s Parallel Data Warehouse appliance, which I am working on, uses hashing to distribute data over all of the compute nodes.

  2. Slowly Changing Facts? | Garrett Edmondson Says:

    […] by some key, and then compared on a column-by-column basis to find the changes (or you could use Hashbytes, never use Checksum!!!). The new row versions are then inserted and the old row versions have to be […]

  3. Aaron West Says:

    I noticed some time ago how truly terrible CHECKSUM is. I know it’s meant for shortening the size of indexes, where the original columns must be added to the query, but I was still appalled.

    The algorithm is easily inferred as a rotate-left-4 then XOR. This means that any string over 8 characters may start cancelling out its own bits by XORing them back to zero.

    The following returns 0x12345678:

    select cast(CHECKSUM(char(1)+char(2)+char(3)+char(4)+char(5)+char(6)+char(7)+char(8)) as binary(4))

    But any of these three returns 0, as does any 8-character string that is repeated twice:

    select cast(CHECKSUM(replicate(char(1),16)) as binary(4))
    select cast(CHECKSUM('DEADBEEFDEADBEEF') as binary(4))
    select cast(CHECKSUM('DEADHEADDEADHEAD') as binary(4))
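Aaron's results can be reproduced outside SQL Server with a small Python model of the inferred algorithm: rotate the running 32-bit value left by four bits, then XOR in the next byte. This is a community reconstruction for illustration, not Microsoft's documented implementation (the real function also folds character case and handles other data types):

```python
def checksum_model(data: bytes) -> int:
    """Model of SQL Server's CHECKSUM over a byte string:
    rotate-left-4 then XOR, confined to a 32-bit space."""
    h = 0
    for b in data:
        h = ((h << 4) | (h >> 28)) & 0xFFFFFFFF  # rotate left by 4 bits
        h ^= b
    return h

print(hex(checksum_model(bytes(range(1, 9)))))   # char(1)..char(8) → 0x12345678
print(hex(checksum_model(b"\x01" * 16)))         # → 0x0
print(hex(checksum_model(b"DEADBEEFDEADBEEF")))  # → 0x0
```

Eight 4-bit rotations walk the value exactly once around the 32-bit word, so any 8-byte block repeated twice XORs itself back to zero, which is precisely the self-cancelling behaviour described above.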

  4. GV Says:

    Very good run time with pretty high security:
    select * from a1resident a1 where a1.residentid in
    (select convert(numeric, dbo.DB_DataDecrypt(AResRefEx.residentid)) from AResRefEx where
    convert(date, dbo.DB_DataDecrypt(AResRefEx.DOBsec)) = convert(date, '12/21/1939')
    AND HASHBYTES('SHA2_512', convert(varchar(10), convert(date, '12/21/1939'))) = AResRefEx.DOB_hash)

    while this one takes a 10-minute run:
    select * from a1resident a1
    where convert(date, dbo.DB_DataDecrypt(a1.DOBsec)) = convert(date, '12/21/1939')
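The trick in GV's first query generalises: store a hash of the plaintext next to the ciphertext at write time, then search by comparing hashes instead of decrypting every row. A hedged Python sketch of the idea using SHA-512; the date format and field names here are illustrative assumptions, not taken from the schema above:

```python
import hashlib

def dob_hash(date_str: str) -> bytes:
    """SHA2-512 of the plaintext date, stored as a searchable fingerprint
    so the encrypted column never has to be decrypted for equality tests."""
    return hashlib.sha512(date_str.encode()).digest()

# Write time: store the hash alongside the ciphertext.
rows = [
    {"dob_cipher": b"<encrypted>", "dob_hash": dob_hash("1939-12-21")},
    {"dob_cipher": b"<encrypted>", "dob_hash": dob_hash("1950-01-01")},
]

# Query time: hash the search value once, then compare hashes only.
needle = dob_hash("1939-12-21")
matches = [r for r in rows if r["dob_hash"] == needle]
print(len(matches))  # → 1
```

One caution: birth dates are low-entropy, so anyone who obtains the hashes can feasibly try every plausible date; a keyed hash (e.g. an HMAC) closes that gap.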
