Wednesday, March 30, 2011

Avoid Hash Aggregation

I’d been wondering about this, but Simon Sabin’s recent post just spelt it out for me: Hash Aggregation is all too easy to do, but eats performance for breakfast. We’re all used to the problem of writing SARGable predicates, right? You perform any operations on the parameter, not the column, in order to correctly hit the indexes you’d expect to hit. So for example you wouldn’t do this:

select blah
from table
where year(datecolumn) = @myyear

but instead you’d do this:

select blah
from table
where datecolumn >= dateadd(year, @myyear - 1900, 0)
  and datecolumn < dateadd(year, @myyear - 1899, 0)

Well, exactly the same problem occurs with aggregates, but without the fix. Say datecolumn is datetime2(0) – i.e. down to the second – and you want an annual aggregate of something. For example:

select year(datecolumn) as [Year], itemId, sum(cost) as TotalCost
from table
group by year(datecolumn), itemId

Look what we did: we had to apply the function to the column, and hey presto, despite our underlying table having a clustered index on (datecolumn, itemId), our aggregation becomes a Hash Aggregation, because SQL can’t infer the relationship between a year and a datetime. It can’t use a Stream Aggregation unless the data arrives in group order – which it can no longer infer – and it estimates the hash is going to be cheaper than a full sort.

But Hash Aggregation is expensive, as you only have to read the doco to realise:
For the distinct or aggregate operators, use the input to build the hash table (removing duplicates and computing any aggregate expressions). When the hash table is built, scan the table and output all entries.

That hash just buffered all our data in order to do the aggregation. Oops, TempDb spill here we come…

A Stream Aggregation is much more performant because it can perform the aggregation ‘on the fly’ as the ordered rows stream past, but to get one in our example we’d have to have a computed, indexed column ‘Year’, and indexing it means persisting it – so bang, we just added 4 bytes to our fact rows and increased my table size by about 20% (my table is not very wide, but very, very long). And I still need the existing clustered index, so we’re creating a non-clustered index, and the relative cost difference for scanning it might mean we still don’t get what we want.
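For reference, the computed-column approach would look something like this – a sketch only, with ‘FactTable’ and the column names standing in for the real schema:

```sql
-- Illustrative sketch: persist a computed Year column and index it
-- so the optimizer can use a Stream Aggregation on (Year, itemId).
ALTER TABLE FactTable
ADD [Year] AS YEAR(datecolumn) PERSISTED;

CREATE NONCLUSTERED INDEX IX_FactTable_Year_ItemId
ON FactTable ([Year], itemId)
INCLUDE (cost);

-- The aggregate then groups on the persisted column directly:
SELECT [Year], itemId, SUM(cost) AS TotalCost
FROM FactTable
GROUP BY [Year], itemId;
```

This buys the ordered grouping at the cost of the extra 4 bytes per row and a second index, which is exactly the trade-off described above.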

How much worse is the Hash? I cooked up an example with some of our data against two scenarios:

  • doing an annual aggregate based on munging dates into years (the group by above)

  • doing the aggregate based on adding the year column and putting it in the clustered index


88% : 12%. Quite a difference (though the rowcount here is fairly small so this may not be entirely representative).

Ideally SQL would ‘see through’ Year() (and the other date-related functions) and deduce that it can still use the date-ordered clustered index to service a Stream Aggregation, which is exactly what Simon has opened a Connect issue asking for.

That article seems to suggest that there are workarounds, like joining to the date dimension table and grouping based on columns from that, which I tried in vain. However, this may only be relevant when you’re limiting to a subset of the data. More importantly, this is only viable if the original column was of type Date: if you had a DateTime2(0) you’d need to join against a combined Date-And-Time dimension (getting fairly large). If you had DateTime2(2), you’d need to join against a date dimension that went down to the hundredth of a second. That’s a big dimension (3,153,600,000 rows / year).
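For the record, the kind of query I mean is the one below – DimDate and its columns are illustrative names, and as noted it only makes sense when the fact column is a plain Date:

```sql
-- Sketch of the date-dimension workaround (names are placeholders).
-- Only viable when datecolumn is of type Date, so the join key matches.
SELECT d.[Year], f.itemId, SUM(f.cost) AS TotalCost
FROM FactTable f
JOIN DimDate d ON d.DateKey = f.datecolumn
GROUP BY d.[Year], f.itemId;
```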

Ultimately given my real world scenario is doing lots of aggregations from the sub-minute up to the minute grain, I had to bite the bullet, and change the column in the fact table to smalldatetime (and put the seconds portion in a different column). And that seems like a really poor outcome for what you would have thought would be a bread-and-butter warehousing requirement[1].
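The column split amounts to something like the following – a sketch with placeholder column names; note the datediff trick truncates to the minute, whereas a straight cast to smalldatetime would round to the nearest minute:

```sql
-- Sketch: split a datetime2(0) into a minute-grain smalldatetime
-- (4 bytes) plus a separate seconds column. Names are placeholders.
ALTER TABLE FactTable ADD
    minuteColumn smalldatetime NULL,
    secondColumn tinyint NULL;

UPDATE FactTable
SET minuteColumn = DATEADD(minute, DATEDIFF(minute, 0, datecolumn), 0),
    secondColumn = DATEPART(second, datecolumn);
```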

Here’s a really excellent write-up of the two different operators from Craig Freedman’s blog:

[1] I’m only doing any of this at all because of limitations in SSAS and trying to avoid calc-and-sum performance issues, so it’s all a bit frustrating.

Tuesday, March 29, 2011

Static Lookup Data: Prefer CASE to JOIN when possible

Getting into the nitty-gritty of fine-tuning the datamart I’m working on now, so spending a lot of time looking at execution plans, and wondering ‘why?’. Today: lookup data.

Normally in a datamart you’d denormalize as much as possible, ideally avoiding JOINs altogether, but there’s a trade-off with space and in some cases the space cost just can’t be justified. Sometimes you still have to add some lookup data (cost rates etc…) in your fact view.

Given a tiny amount of lookup data (10’s of rows), I thought putting it in a little table and doing a join would be pretty performant. The lookup table will basically just sit in RAM all the time, and it’s tiny compared to the fact data so the optimizer can’t possibly get it the wrong way round. And I thought searched CASE statements were expensive. Turns out they are nothing on a JOIN, even in the optimal case.

Adding columns using a CASE statement performs significantly better than the static lookup table strategy. Even when the CASE statement is rolled up into a table-valued function, the costs in my case are about 2:1 in favour (the TVF is successfully ‘optimized away’).

In both strategies SQL has to look at each-and-every row, so I’m surprised it’s that different, but I guess maybe the Hash Match is optimized for larger amounts of data (on the right side), and can’t match a CASE for trivial lookups. There’s also the cardinality to consider: a CASE devolves to a Compute Scalar, which means no change in the number of rows, but a Hash Match might output a row more than once (or drop rows) based on the number of matches on the right side of the query. I’m guessing this constraint means there’s a fundamentally cheaper way to process the data in memory.
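To make the comparison concrete, here’s the shape of the two alternatives – RateLookup, RateCode and the multiplier values are all made up for illustration:

```sql
-- JOIN version: tiny static lookup table, ends up as a Hash Match.
SELECT f.itemId, f.cost * r.Multiplier AS AdjustedCost
FROM FactTable f
JOIN RateLookup r ON r.RateCode = f.RateCode;

-- CASE version: same logic inlined, devolves to a Compute Scalar,
-- so cardinality is guaranteed unchanged.
SELECT f.itemId,
       f.cost * CASE f.RateCode
                    WHEN 'A' THEN 1.0
                    WHEN 'B' THEN 1.5
                    ELSE 2.0
                END AS AdjustedCost
FROM FactTable f;
```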

Here’s the before and after, with a relatively small number of rows (70,000 on my dev box, but it’s near exactly the same with 100+ million in the test environment):


Now you’ll have to excuse me, because I have some views to change…

Friday, March 25, 2011

Enabling Sql Server CLR Integration if not currently enabled

I don’t know, is calling RECONFIGURE unnecessarily an issue? Anyway, if you really want to only enable the integration if it’s not already enabled (and maybe want to do something else at the same time):

DECLARE @config TABLE (
    name sysname,
    minimum INT,
    maximum INT,
    config_value INT,
    run_value INT
)

INSERT INTO @config
EXEC sp_configure 'clr enabled'

IF EXISTS (SELECT run_value FROM @config WHERE run_value = 0)
BEGIN
    PRINT 'Enabling CLR integration'
    EXEC sp_configure 'clr enabled', 1
    RECONFIGURE
END

Wednesday, March 16, 2011

What bits to download for SQL Cumulative Updates

If you look at the download page for Sql 2008 R2 CU 6, you’ll see a choice of six different download packages, all for your platform:


Whu? Actually it turns out that all but one of these are subset packages of the main hotfix, that you’d only need to apply if – say – you only wanted to patch PowerPivot or SQL Native Client, but you didn’t want to have to download the whole update[*].

Quite why this isn’t deemed worth mentioning anywhere in the relevant knowledge base article or on the download page itself is beyond me, but fortunately I happened to stumble upon an article on the SQL CSS team blog that explained it all (after I’d already downloaded half of them) so now I know.

And now, so do you.


[*] Note the rules are different for Sql 2005 – you do have to download all of them. So no potential confusion there.

Wednesday, March 09, 2011

Extract Zip from PowerShell

Adapted from an old Powershell Guy post, and flipped around to unzip the files:

function unzip($zipPath, $destination){
    $shell = new-object -com shell.application
    $zip = $shell.NameSpace($zipPath)
    foreach($item in $zip.Items()){
        $shell.NameSpace($destination).CopyHere($item)
    }
}

Evil use of shell object (where is the managed API?), but seems to work quite nicely actually. Note carefully the brackets after $zip.Items().

Now I’ve worked it out it seems really easy to find others doing the same thing. Oh well.
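Usage is then just the below – paths are illustrative, and note that Shell.Application wants full paths and an existing destination folder:

```powershell
# Paths are placeholders; the destination folder must already exist,
# and both arguments need to be full (not relative) paths.
unzip 'C:\temp\archive.zip' 'C:\temp\extracted'
```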

Tuesday, March 01, 2011

Configuring HTTP Access for Analysis Services 2008 R2 64 bit

It’s basically no different from 2005, so follow these instructions, ensuring that (if needed) you download the OLE-DB provider for AS from the 2008 R2 feature pack site, and not one of the vanilla 2008 releases, and also ensuring that you don’t mark your app pool as ‘enable 32 bit’.

You may find that the web role services ‘ISAPI Extensions’ and ‘Windows Authentication’ aren’t even enabled on your box, and you will need both. You can add them by right-clicking the web server node:


A good way to test this is to try and connect using Management Studio. If it fails with an ‘unsupported data format’ error, and you put the pump on a different box from the SSAS instance, it’s probably a double-hop issue, and you’re going to have to send yourself insane talking to IT about configuring SSAS to use Kerberos authentication, constrained delegation, SPNs and all the normal fun stuff. I feel for you.
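If you do go down that road, the SPN registration typically looks something like the below – the host and account names are placeholders, so verify the exact form for your environment with whoever owns the domain:

```bat
rem MSOLAPSvc.3 is the service class SSAS uses; names are placeholders.
setspn -S MSOLAPSvc.3/ssasbox.corp.local CORP\ssasServiceAccount
rem If the IIS pump's app pool runs as a domain account, it needs an HTTP SPN:
setspn -S HTTP/pumpbox.corp.local CORP\appPoolAccount
```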

I bit the bullet and installed IIS directly on the SSAS box, and got a bit further. A good way to test at this point is to try and connect to SSAS locally using Management Studio, using the http://<server>/olap/msmdpump.dll path. If that works, at least you know the SSAS/IIS end is ok. But then I still couldn’t connect from Excel on my workstation, in a different domain, which was why I was doing all this in the first place.

I used the old ‘create local user with matching login/password’ trick, and ran Excel as that, but that didn’t work. Guessing the lack of trust between the domains would be a problem, I attempted to disable Kerberos and force NTLM (which seems to have changed a bit from how it used to be done on Windows 2003), but that still didn’t work either (IIS logs helped here). Eventually I just enabled anonymous access to the website, changed the app pool credentials to use Local Service (Network Service would have done), granted IUSR access to the cube (this bit I don’t quite follow – why not Local Service?) and finally it all worked. But note that now I was using anonymous access I didn’t need to install IIS on the SSAS box at all, since I could have done this from any box. Ho hum.

Finally go and vote for this to be supported out-of-the-box. With all the work that went into separating out the HTTPSys bits (for WCF et al), this should be a total no-brainer.
