Monday, May 23, 2011

More whinging about Hash Aggregate

I previously posted on why Hash Aggregate is to be avoided if possible. Unfortunately, Hash Aggregate vs. Stream Aggregate appears to be an ‘all or nothing’ proposition. And that sucks.

A Stream Aggregate can be relatively cheap (in terms of memory pressure) because it can leverage the ordering of the input data (typically clustered index ordering) and aggregate on-the-fly. By contrast, a Hash Aggregate has to cope with completely disordered input, so it must buffer all the output aggregates in memory until the input has been read in its entirety:

"The other aggregation operator, hash aggregate, is similar to hash join. It does not require (or preserve) sort order, requires memory, and is blocking (i.e., it does not produce any results until it has consumed its entire input). Hash aggregate excels at efficiently aggregating very large data sets."
[Craig Freedman - ]
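The difference is easy to sketch in Python (purely illustrative; this is a toy model of the two strategies, not SQL Server internals): a stream aggregate emits each group the moment the key changes, so it holds only one partial aggregate at a time, while a hash aggregate must hold a bucket for every group until the input is exhausted.

```python
def stream_aggregate(rows):
    """Input must arrive sorted by key; each group is emitted as soon
    as the key changes, so only one partial aggregate is in memory."""
    current_key, total = None, 0
    for key, value in rows:
        if key != current_key:
            if current_key is not None:
                yield current_key, total
            current_key, total = key, 0
        total += value
    if current_key is not None:
        yield current_key, total

def hash_aggregate(rows):
    """Handles any input order, but buffers every group until the whole
    input is consumed -- memory grows with the number of groups."""
    buckets = {}
    for key, value in rows:
        buckets[key] = buckets.get(key, 0) + value
    yield from buckets.items()
```

Both produce the same totals; the difference is purely in when results can be released and how much state must be held along the way.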

This is a good strategy because:

  • It copes with totally unordered data
  • Provided you are aggregating to a significantly higher grain than the source data, the volume of aggregates in the hash (and hence memory consumption) can be relatively small.

In my case, of course, the latter isn’t true: I’m aggregating across a significant amount of data, but not aggregating very much, so I am experiencing vast quantities of tempdb spill.

My input data is partially aligned to the aggregation (the most significant column is sorted correctly), so you’d expect the aggregation to leverage that smartly and output aggregates as soon as they are clearly ‘done’. For example, if you had a clustered index on Month/Day/Category and were doing sum(something) GROUP BY Month, Category, you and I both know that once you’ve finished spooling data for any given month, you can output all of that month’s buckets rather than hang on to them waiting for more. Unfortunately this doesn’t happen, which means that, given enough data, something will spill to disk:
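SQL Server doesn’t offer this, but the strategy I’m wishing for is easy to sketch (illustrative Python again, with hypothetical ‘month’/‘category’ columns standing in for the leading and trailing grouping keys): hash-aggregate within the current value of the sorted leading column, then flush those buckets the moment it rolls over.

```python
def partially_ordered_aggregate(rows):
    """Hypothetical hybrid: rows arrive sorted by the leading grouping
    column (e.g. Month) but not by the trailing one (e.g. Category).
    Hash within the current Month, then flush those buckets as soon as
    the Month rolls over -- memory is bounded by the number of groups
    in one Month, not in the whole input."""
    current_lead, buckets = None, {}
    for lead, trail, value in rows:
        if lead != current_lead:
            for k, v in buckets.items():   # this month is clearly 'done'
                yield (current_lead, k), v
            current_lead, buckets = lead, {}
        buckets[trail] = buckets.get(trail, 0) + value
    for k, v in buckets.items():           # flush the final month
        yield (current_lead, k), v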

“if we run out of memory, we must begin spilling rows to tempdb.  We spill one or more buckets or partitions including any partially aggregated results along with any additional new rows that hash to the spilled buckets or partitions”
[Craig’s blog again, same link as before]

One can only hope it’s the older buckets that get spilt first, but even so, when this happens you are taking the IO hit of streaming all that aggregate data to disk and then subsequently loading it all back in again. Ouch. Or you force a Stream Aggregate (using a query hint) and take the pain of the sort instead. More ouch.
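A toy simulation of that partition-spill behaviour (heavily simplified, illustrative Python only; the real operator’s eviction policy, grace-hash recursion and tempdb format are nothing this crude): when the in-memory bucket budget is exhausted, a whole partition of buckets, partial aggregates included, goes to ‘tempdb’, and later rows hashing to a spilled partition bypass memory entirely until a second pass re-aggregates them.

```python
def hash_aggregate_with_spill(rows, max_buckets=2, partitions=4):
    """Toy model of partition spilling in a hash aggregate."""
    in_memory, spilled = {}, {}   # spilled: partition -> rows on 'disk'
    for key, value in rows:
        p = hash(key) % partitions
        if p in spilled:
            spilled[p].append((key, value))     # straight to tempdb
            continue
        if key not in in_memory and len(in_memory) >= max_buckets:
            # out of memory: evict one partition, partial results and all
            victim = hash(next(iter(in_memory))) % partitions
            spilled[victim] = [(k, v) for k, v in in_memory.items()
                               if hash(k) % partitions == victim]
            for k, _ in spilled[victim]:
                del in_memory[k]
            if p == victim:
                spilled[p].append((key, value))
                continue
        in_memory[key] = in_memory.get(key, 0) + value
    results = dict(in_memory)
    for rows_on_disk in spilled.values():       # second pass: reload
        for key, value in rows_on_disk:
            results[key] = results.get(key, 0) + value
    return results
```

The totals come out right either way; the point is the double IO cost of writing partial aggregates out and reading them back, which is exactly the hit described above.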

For many workloads this probably isn’t much of an issue, but for my data warehouse this is starting to look like a major, major bottleneck.
