Wednesday, March 30, 2011

Avoid Hash Aggregation

I’d been wondering about this, but Simon Sabin’s recent post just spelt it out for me: Hash Aggregation is all too easy to cause, and it eats performance for breakfast. We’re all used to the problem of SARGable predicates, right? You perform any operations on the parameter, not the column, so that you still hit the indexes you’d expect to hit. So for example you wouldn’t do this:

select blah
from table
where year(datecolumn) = @myyear          -- @myyear is an int year, e.g. 2011


but instead you’d do this:

select blah
from table
where datecolumn >= @yearstart            -- @yearstart is the datetime for 1 Jan of the year in question
  and datecolumn < dateadd(year, 1, @yearstart)


Well exactly the same problem occurs with aggregates, but without the fix. Say datecolumn was datetime2(0) – i.e. down to the second – and you want an annual aggregate of something. For example:

select year(datecolumn) as Year, ItemId, sum(cost)
from table
group by year(datecolumn), ItemId


Look what we did. We had to apply the function to the column, and hey presto, despite our underlying table having a clustered index on (datecolumn, ItemId), our aggregation is a Hash Aggregation, because SQL can’t infer the relationship between a year and a datetime. It can’t use a Stream Aggregation unless the data arrives in group-by order, which it can no longer infer from the index, and it reckons the hash is going to be cheaper than a full sort.
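
You can see which operator the optimiser picked by capturing the actual plan. A minimal sketch, assuming a fact table called dbo.FactTable (the name is made up for illustration):

set statistics xml on;     -- returns the actual execution plan alongside the results

select year(datecolumn) as Year, ItemId, sum(cost) as TotalCost
from dbo.FactTable
group by year(datecolumn), ItemId;
-- the plan shows a Hash Match (Aggregate) rather than a Stream Aggregate

set statistics xml off;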

But Hash Aggregation is expensive, as you only have to read the doco to realise:
“For the distinct or aggregate operators, use the input to build the hash table (removing duplicates and computing any aggregate expressions). When the hash table is built, scan the table and output all entries.”
http://technet.microsoft.com/en-us/library/ms189582.aspx

That hash just buffered all our data in order to do the aggregation. Oops, tempdb spill here we come…

A Stream Aggregation is much more performant: because the data arrives in order, the aggregation can be done ‘on the fly’. But to get that in our example we’d have to add a computed, indexed column ‘Year’, and indexing it means it’s persisted, so bang: we just added 4 bytes to every fact row and increased my table size by about 20% (my table is not very wide, but very, very long). And I still need the existing clustered index, so we’re creating a non-clustered index on top, and the relative cost of scanning that might mean we still don’t get the plan we want.
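
For what it’s worth, that workaround would look something like the sketch below (dbo.FactTable and the index name are stand-ins, not my real schema):

alter table dbo.FactTable
    add [Year] as year(datecolumn) persisted;      -- persisted computed column: the extra ~4 bytes per row

create nonclustered index IX_FactTable_Year_ItemId
    on dbo.FactTable ([Year], ItemId)
    include (cost);                                -- covering, so a scan of this index can feed a Stream Aggregate

select [Year], ItemId, sum(cost) as TotalCost
from dbo.FactTable
group by [Year], ItemId;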

How much worse is the Hash? I cooked up an example with some of our data against two scenarios:

  • doing an annual aggregate based on munging dates into years (the group by above)

  • doing the aggregate based on adding the year column and putting it in the clustered index

[screenshot: relative query cost of the two plans]

88% : 12%. Quite a difference (though the row count here is fairly small, so this may not be entirely representative).

Ideally SQL Server would ‘see through’ Year (and the other date-related functions) and deduce that it can still use the date-ordered clustered index to drive a stream aggregation, and that’s exactly what Simon has opened a Connect issue asking for.

That article seems to suggest there are workarounds, like joining to the date dimension table and grouping on columns from that, which I tried in vain. However, that may only be relevant when you’re limiting to a subset of the data. More importantly, it’s only viable if the original column was of type Date: if you had a datetime2(0) you’d need to join against a combined date-and-time dimension (which gets fairly large), and if you had datetime2(2) you’d need a dimension that went down to the hundredth of a second. That’s a big dimension (3,153,600,000 rows per year).
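
For reference, the shape of that workaround is roughly the following (DimDate and its column names are assumed, and the clean equi-join only works when the fact column is a pure date with no time part):

select d.CalendarYear, f.ItemId, sum(f.cost) as TotalCost
from dbo.FactTable f
join dbo.DimDate d on d.DateValue = f.datecolumn    -- only valid if datecolumn is date-typed
group by d.CalendarYear, f.ItemId;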

Ultimately, given my real-world scenario is doing lots of aggregations from the sub-minute up to the minute grain, I had to bite the bullet and change the column in the fact table to smalldatetime (and put the seconds portion in a separate column). And that seems like a really poor outcome for what you’d have thought was a bread-and-butter warehousing requirement[1].
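
Sketched out, the compromise looks roughly like this (all the names are illustrative rather than the real schema):

create table dbo.FactTable
(
    dateMinute     smalldatetime not null,   -- minute-grain timestamp: sorts correctly for stream aggregation
    secondOfMinute tinyint       not null,   -- 0-59, so the sub-minute detail isn’t lost
    ItemId         int           not null,
    cost           money         not null
);

create clustered index CIX_FactTable
    on dbo.FactTable (dateMinute, ItemId);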

Here’s a really excellent write-up of the two different operators from Craig Freedman’s blog:


[1] I’m only doing any of this at all because of limitations in SSAS and trying to avoid calc-and-sum performance issues, so it’s all a bit frustrating.
