Had a bit of fun a while back implementing a rolling average for an aggregation over a large table in a data warehouse.
Like you, I’d had the ‘avoid cursors and use sets’ mentality drilled into me, and when I see something like a rolling average I immediately think of it in set-based terms: come up with a way to join the table to itself over a window, and average over that window.
Something like this:
select t1.[group], t1.[date], avg(t2.value) as rolling_avg
from [table] t1
inner join [table] t2
    on t1.[group] = t2.[group]
    and t2.[date] < t1.[date]
    and t2.[date] >= dateadd(day, -15, t1.[date])
group by t1.[group], t1.[date]
And this all works fine in dev, but once the data volumes start stacking up, WHOAH have you got a problem and ohmygod is that really the number of logical reads being done!?
This problem is generally known as a triangular join, and can be found any time there is a non-equality join predicate between two tables without any further conditions applied: in the unbounded case every row joins to all of its predecessors, so a group of n rows produces roughly n²/2 joined rows. It’s a big gotcha for the ‘sets good, cursors bad’ mantra, because in many cases a cursor-based approach would be significantly better.
In this particular case, even though it’s not an unbounded triangular join (each row should only be joined to the previous 15 days, not to all prior rows), you get the same effect. The optimizer figures: ‘hey, I might just do the full cross join and then filter the rows’. And for a large data warehouse table that is very much a no-no.
So what’s the answer? Well, cursors, or even the quirky update approach if you are brave. But to provide a partitioned rolling average, both of these approaches require the data to be sorted by group, date. The cursor can apply that sort as an ‘order by’; the quirky update requires the destination table to already be in that order (and some other things too).
My table, like many in a data warehouse, is sorted by date, group, because inserts are done incrementally over time and so never cause any page splits. Sorting the data for a table this size (several TB) is expensive.
So I am thinking there must be a better way. One that can scan the table in the order it exists, store rows in buckets indexed by the ‘group’ column, prune as necessary, and quickly provide an average for the last 15 days for any incoming row. And I thought about how I could do it in T-SQL, and I thought about how easy it would be in C#…
…and then I remembered about CLR stored procs, and about 30 mins later I had a working solution that took my process down from 5+ hours to some handful of minutes. The implementation was trivial (think Dictionary<int,List<SomeValueWithDate>> and you’re there already), and deployment was amazingly easy thanks to Visual Studio database projects (aka GDR2) which just do it all for you.
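For the curious, here’s a minimal sketch of the shape that CLR proc takes. It isn’t my original code: the proc name (RollingAverage15), the table (dbo.FactTable), the column names, and the assumption that the group key is an int and the value a float are all placeholders you’d swap for your own.

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using Microsoft.SqlServer.Server;

public class StoredProcedures
{
    [SqlProcedure]
    public static void RollingAverage15()
    {
        // One bucket of (date, value) pairs per group, pruned as we go.
        var buckets = new Dictionary<int, List<KeyValuePair<DateTime, double>>>();

        // Shape of the rows we stream back to the caller.
        var output = new SqlDataRecord(
            new SqlMetaData("group", SqlDbType.Int),
            new SqlMetaData("date", SqlDbType.DateTime),
            new SqlMetaData("avg15", SqlDbType.Float));

        using (var conn = new SqlConnection("context connection=true"))
        {
            conn.Open();
            // ORDER BY [date] matches the table's physical order, so no extra sort is needed.
            var cmd = new SqlCommand(
                "select [group], [date], [value] from dbo.FactTable order by [date]",
                conn);

            SqlContext.Pipe.SendResultsStart(output);
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    int groupId = reader.GetInt32(0);
                    DateTime date = reader.GetDateTime(1);
                    double value = reader.GetDouble(2);   // assumes [value] is a float column

                    List<KeyValuePair<DateTime, double>> bucket;
                    if (!buckets.TryGetValue(groupId, out bucket))
                    {
                        bucket = new List<KeyValuePair<DateTime, double>>();
                        buckets.Add(groupId, bucket);
                    }

                    // Drop anything that has fallen out of the 15-day window.
                    DateTime cutoff = date.AddDays(-15);
                    bucket.RemoveAll(p => p.Key < cutoff);

                    // Average whatever is still in the window. The current row is only
                    // added afterwards, so it isn't included in its own average.
                    if (bucket.Count > 0)
                    {
                        double sum = 0;
                        foreach (var pair in bucket) sum += pair.Value;

                        output.SetInt32(0, groupId);
                        output.SetDateTime(1, date);
                        output.SetDouble(2, sum / bucket.Count);
                        SqlContext.Pipe.SendResultsRow(output);
                    }

                    bucket.Add(new KeyValuePair<DateTime, double>(date, value));
                }
            }
            SqlContext.Pipe.SendResultsEnd();
        }
    }
}

The nice property is that, because the rows arrive in the table’s existing date order, each bucket only ever holds the last 15 days of rows for its group, so memory use is bounded by the window size rather than the table size.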
The only slight problem was that, as with a normal stored proc, about all you can do with the output is INSERT it somewhere – you can’t use it in a JOIN, for example. I settled for capturing the output into a table variable, then doing the update as a second step back in T-SQL land.