Tuesday, July 19, 2011

Partitioned, Rolling Averages on SQL Server 2008

Had a bit of fun a while back implementing a rolling average for an aggregation over a large table in a data warehouse.
Like you I’d been drilled into the ‘avoid cursors and use sets’ mentality, and when I see something like a rolling average I immediately think of it in set-based terms: come up with a way to join the table to itself using a window, and average over that window.
Something like this:
select t1.[group], t1.[date], avg(t2.[value]) as rolling_avg
from [table] t1
inner join [table] t2
    on t1.[group] = t2.[group]
    and t2.[date] < t1.[date]
    and t2.[date] >= dateadd(day, -15, t1.[date])
group by t1.[group], t1.[date]
And this all works fine in dev, but once the data volumes start stacking up, WHOAH have you got a problem and ohmygod is that really the number of logical reads being done!?
This problem is generally known as a triangular join, and can be found anytime there is a non-equality join predicate between two tables, without any further conditions applied. It’s a big gotcha for the ‘sets good, cursors bad’ mantra, because in many cases a cursor based approach would be significantly better.
In this particular case, even though it’s not an unbounded triangular join (each row should only be joined to the previous 15 days, not to all prior rows), you get the same effect. The optimizer figures: ‘hey, I might just do the full cross join, and then filter the rows’. And for a large data warehouse table that is very much a no-no.
So what’s the answer? Well, cursors, or even the quirky update approach if you are brave. But to provide a partitioned rolling average, both of these approaches require the data to be sorted by group, date. The cursor can apply that sort as an ‘order by’; the quirky update requires the destination table to be in that order (and some other things too).
My table, like many in a data warehouse, is sorted by date, group, because inserts are done incrementally over time and so never cause any page splits. Sorting the data for a table this size (several TBs) is expensive.
So I am thinking there must be a better way: one that can scan the table in the order it exists, store rows in buckets indexed by the ‘group’ column, prune as necessary, and quickly provide an average over the last 15 days for any incoming row. And I thought about how I could do it in T-SQL, and I thought about how easy it would be in C#…
…and then I remembered CLR stored procs, and about 30 minutes later I had a working solution that took my process down from 5+ hours to a handful of minutes. The implementation was trivial (think Dictionary<int,List<SomeValueWithDate>> and you’re there already), and deployment was amazingly easy thanks to Visual Studio database projects (aka GDR2), which just do it all for you.
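To give a flavour of it, here’s a minimal sketch of that approach (the table and column names, dbo.FactTable, GroupId, EventDate and Value, are made up for illustration, and the window handling is simplified):

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using Microsoft.SqlServer.Server;

public static class RollingAverages
{
    [SqlProcedure]
    public static void PartitionedRollingAverage()
    {
        // One bucket of (date, value) pairs per group, pruned to the last 15 days seen
        var buckets = new Dictionary<int, List<KeyValuePair<DateTime, decimal>>>();

        // Shape of the result set streamed back to the caller
        var row = new SqlDataRecord(
            new SqlMetaData("GroupId", SqlDbType.Int),
            new SqlMetaData("EventDate", SqlDbType.DateTime),
            new SqlMetaData("RollingAvg", SqlDbType.Decimal, 18, 4));

        SqlContext.Pipe.SendResultsStart(row);

        using (var conn = new SqlConnection("context connection=true"))
        {
            conn.Open();
            // Scan the table once, in the order it is already stored (date, then group)
            var cmd = new SqlCommand(
                "select GroupId, EventDate, Value from dbo.FactTable order by EventDate, GroupId",
                conn);
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    int groupId = reader.GetInt32(0);
                    DateTime date = reader.GetDateTime(1);
                    decimal value = reader.GetDecimal(2);

                    List<KeyValuePair<DateTime, decimal>> bucket;
                    if (!buckets.TryGetValue(groupId, out bucket))
                    {
                        bucket = new List<KeyValuePair<DateTime, decimal>>();
                        buckets.Add(groupId, bucket);
                    }

                    // Drop anything that has fallen out of the 15-day window, then add this row
                    DateTime cutoff = date.AddDays(-15);
                    bucket.RemoveAll(p => p.Key < cutoff);
                    bucket.Add(new KeyValuePair<DateTime, decimal>(date, value));

                    // Average whatever is left in the bucket and stream the row out
                    decimal sum = 0;
                    foreach (var p in bucket)
                        sum += p.Value;

                    row.SetInt32(0, groupId);
                    row.SetDateTime(1, date);
                    row.SetDecimal(2, sum / bucket.Count);
                    SqlContext.Pipe.SendResultsRow(row);
                }
            }
        }

        SqlContext.Pipe.SendResultsEnd();
    }
}

Because it reads over the context connection and streams each row straight back through SqlContext.Pipe, the table only gets scanned once in its existing order, and memory stays bounded at one pruned 15-day bucket per group.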
The only slight problem was that, like a normal stored proc, about all you can do with the output is INSERT it somewhere: you can’t use it in a JOIN, for example. I settled for capturing the output into a table variable, then doing the update as a second step back in T-SQL land.

Friday, July 01, 2011

Model-first with Entity Framework and Oracle 11g (Part 2)

So having given up on the Oracle provider for now, and because people had said good things about them, I tried devArt’s dotConnect:

-- Table "Customers"
CREATE TABLE "Customers" (
    "Id" NUMBER(10) NOT NULL,
    "FirstName" VARCHAR2(50) NOT NULL,
    "AnInteger" NUMBER(10) NOT NULL,
    "ALong" NUMBER(18) NOT NULL,
    "IsSomethingTrue" NUMBER(1) NOT NULL,
    PRIMARY KEY ("Id")
)
/
CREATE SEQUENCE "Customers_SEQ"
/
CREATE OR REPLACE TRIGGER "Customers_INS_TRG"
BEFORE INSERT ON "Customers" FOR EACH ROW
BEGIN
    SELECT "Customers_SEQ".NEXTVAL INTO :NEW."Id" FROM DUAL;
END;
/

Better. We got our sequence triggers, and some slightly saner data type choices. We've still got the casing problem (and no check constraints for our boolean flag), but we can fix that now because the devArt DDL generation template is a real T4 template, not just a stub:

[image: the devArt DDL generation T4 template]

You’ll find yours in C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\IDE\Extensions\Microsoft\Entity Framework Tools\DBGen. There’s not loads in there; all the real work is in the include file at .\Entity Framework Tools\Templates\Includes\Devart SSDLToOracle.ttinclude.

A quick case-conversion modification to the OracleGenerator class:

public override string GetValidIdentifier(string identifier) {
    identifier = base.GetValidIdentifier(identifier);
    return ToOracleCase(identifier);
}

// Crude PascalCase -> ORACLE_CASE conversion: upper-case everything, and insert an
// underscore wherever a lower-case character is followed by a non-lower-case one
public static string ToOracleCase(string value) {
    StringBuilder output = new StringBuilder();
    bool wasLower = false;
    foreach (var c in value.ToCharArray())
    {
        bool isLower = char.IsLower(c);
        if (wasLower && !isLower)
            output.Append('_');

        output.Append(Char.ToUpper(c));
        wasLower = isLower;
    }
    return output.ToString();
}

...and we are ready to go (ok, it's a crude implementation, but look, we are off and away):

-- Table "CUSTOMERS"
CREATE TABLE "CUSTOMERS" (
    "ID" NUMBER(10) NOT NULL,
    "FIRST_NAME" VARCHAR2(50) NOT NULL,
    "AN_INTEGER" NUMBER(10) NOT NULL,
    "ALONG" NUMBER(18) NOT NULL,
    "IS_SOMETHING_TRUE" NUMBER(1) NOT NULL,
    PRIMARY KEY ("ID")
)
/
CREATE SEQUENCE "CUSTOMERS__SEQ"
/
CREATE OR REPLACE TRIGGER "CUSTOMERS__INS_TRG"
BEFORE INSERT ON "CUSTOMERS" FOR EACH ROW
BEGIN
    SELECT "CUSTOMERS__SEQ".NEXTVAL INTO :NEW."ID" FROM DUAL;
END;
/

Much better.

Update: The next day I realised I’d made a cardinal error. All any of this does is change the generated DDL, not the store schema in the EDMX model. I had to do that separately, which rather defeated the point of what I was trying to do. In the end I gave up, went database-first and wrote an XSLT that performs a PascalCasing conversion on the conceptual model object / attribute names. And re-mapped any badly mapped types. What a PITA.
