Use hash keys instead of string indexes

from the Artful SQL Server & Access Tips List


A common application requirement is support of searches based on some lengthy string, or even worse, a concatenation of a string and one or two integers, e.g., customer_name + start_date + stop_date. In a small table, you might not notice the impact. But suppose the table of interest contains 50 million rows. You will notice the impact – in both storage requirements and in search performance.

You don’t have to support such searches directly. Even though you accept these criteria as your input values, you don’t have to search 50 million rows to find the matching rows. There is a very slick alternative, known as “hash buckets” or “hash keys”.

A hash is the result of an algorithm applied to a given string that results in an integer. You feed said algorithm the string of interest, and you get back an integer.

Hash functions do not guarantee that their hash values are unique. Rather, they simply compress the expressed value into an integer. For example, an index based on customer_name (varchar(40))+ start_date (datetime)+ stop_date (datetime) will occupy 40 + 4 + 4 = 48 bytes per row. 50 million rows times 48 bytes yields 2,400,000,000 bytes.

Granted, in order to house 50 million rows of even the most basic data, you need serious hardware. But this index imposes a huge storage penalty – not to mention the performance cost of searching it.

Hash buckets to the rescue!

A hash bucket “hashes” the value of its input, returning an integer. Integer searches are very fast. Therefore, when confronted with indexing a very large number of rows on such a large index key, an excellent solution is to hash the key rather than create an index based literally on it.

Let’s explore this concept. The hash-bucket index corresponding to the example key described above would result in 50 million values times four bytes each = 200,000,000 bytes – a tenth the size, give or take a little.

The former index is guaranteed to be unique, but look at its cost. The latter index is not guaranteed to be unique. Rather, it uses the hash-bucket value to radically reduce the number of candidate rows. The index on the hash-bucket will typically contain duplicates, but using it as the first criterion reduces the range of visited rows dramatically.

Now, the question becomes, how to generate a hash-key? SQL Server provides an easy way, in the function Checksum(). Checksum() accepts a collection of parameters and does its work, returning an integer.

Try it out using the AdventureWorks sample database that ships with SQL Server 2005. Run this query:

USE AdventureWorks
SELECT Name, GroupName, Checksum(Name,GroupName)AS HashKey
FROM Adventureworks.HumanResources.Department
ORDER BY HashKey

This results in the following rows (clipped to 10 for brevity):

Name

GroupName

HashKey

Tool Design

Research and Development

-2142514043

Production

Manufacturing

-2110292704

Shipping and Receiving

Inventory Management

-1405505115

Purchasing

Inventory Management

-1264922199

Document Control

Quality Assurance

-922796840

Information Services

Executive General and Administration

-904518583

Quality Assurance

Quality Assurance

-846578145

Sales

Sales and Marketing

-493399545

Production Control

Manufacturing

-216183716

Marketing

Sales and Marketing

-150901473


In this case, given such a small table, the results happen to be unique, but there is no guarantee of uniqueness in calling the Checksum() function. Therefore, you should assume duplicates. But this is not a problem. Suppose the above query is written as a stored procedure which accepts two arguments, @Name and @GroupName. Suppose that the hash-bucket is indexed. Then imagine this:

USE AdventureWorks
SELECT Name, GroupName 
FROM Adventureworks.HumanResources.Department
WHERE Checksum(Name,GroupName)= Checksum(@Name,@GroupName)
  AND Name = @Name
  AND GroupName = @GroupName

Conceivably, the specified hash-key value will result in duplicate rows. However, this integer search will be very fast, and the additional qualifiers will then reduce the result-set to a single row. So in 50 million rows, you might initially visit 10 rows, then reduce the 10 to 1 row.

Chances are, you do not have a table handy that contains 50 million rows. But you can exploit this technique with many fewer rows than this example presents.

The final thing to add to this scenario is that you don’t even have to store the hash keys! Instead, create a calculated column whose value is the hash of your columns, then index the calculated column. This will result in lightning-quick searches of that index as the starting point of your search, which is then refined by the additional qualifiers. The gain in performance is truly dramatic.

Return to the Artful SQL Server & Access Tips page