This is WiX:
candle.exe
dark.exe
heat.exe
light.exe
lit.exe
melt.exe
pyro.exe
smoke.exe
torch.exe
Jesus wept. Even the ‘fresh ground .jar of Java beans’ crowd outgrew this childishness…
.net dregs from my mental teacup
This is WiX:
candle.exe
dark.exe
heat.exe
light.exe
lit.exe
melt.exe
pyro.exe
smoke.exe
torch.exe
Jesus wept. Even the ‘fresh ground .jar of Java beans’ crowd outgrew this childishness…
I’m still not 100% sure I understand this, but here goes:
I swapped one of my packages to use a ‘table or view’ for a data source, rather than (as previously) a named query, and the performance dive-bombed. Think less ‘put the kettle on’, more ‘out for lunch and an afternoon down the pub’.
Profiler shows that when you use ‘table or view’, SSIS executes a select * from that_table_or_view on your behalf. But pasting exactly the same select * from profiler into a management studio window still ran pretty good, and completed the query in 3000-odd reads, compared to the 6 million + that SSIS had burnt before I cancelled it.
The view in question is just a straight join of two tables, fairly simple stuff. But profiler showed that SSIS was getting a very different execution plan from running the query in management studio, joining the tables in the opposite (wrong) order. This presumably explains why the number of reads went from 3000-odd to SSIS’s 6 million+. And why it was running a bit s..l..o..w.
Bizarrely even when I put the same select * back into SSIS using ‘Sql Command’ mode, it still ran quickly, provided there was a line break before the FROM. This made me think there must be a bad execution plan getting used somehow, and that extra whitespace was just enough to avoid it:
Looking closer at profiler, it appear that when using ‘Table or View’ SSIS first issues the same query with a SET ROWCOUNT 1 on, in order to get the metadata. This doesn’t happen when SSIS uses ‘Sql Command’ as its source: it executes sp_prepare and seems to get all its metadata from that.
So the only conclusion I can come to is that the execution plan is being poisoned by executing it with SET ROWCOUNT 1 on, picking a plan that’s more appropriate for one row than many. In ‘Sql Command’ mode this doesn’t appear to be an issue because SSIS gets the metadata a different way.
Which makes me think I will be avoiding ‘table or view’ like the plague from now on.
[Update 2009-06-19]
For the sake of completeness, here are the extra screenshots I wanted to put in originally, but my screen clipper was playing up (yet again).
Here’s the ‘right’ version you get from management studio or SSIS in ‘SQL Command’ mode (last two columns are ‘READS’ and ‘DURATION’ respectively):
And here’s what SSIS produces in ‘table or query mode’ (sorry I clipped off the bit above showing it executing the same query with SET ROWCOUNT 1 on, but it did happen):
…a very different plan.
I thought a good proof here would be to log in as a different user (plans are cached per-user), and execute the equivalent of what SSIS produced:
set rowcount 1
select * from lm.vw_RoadDatumWithRoadDetails
go
set rowcount 0
select * from lm.vw_RoadDatumWithRoadDetails
…unfortunately that didn’t seem to work (or rather fail), so there must be something else in the SSIS preamble that’s also involved in this case :-(
However Microsoft UK’s Simon Sabin has already posted a repro based on the Adventure Works 2008 database (which was pretty quick work). He says:
“Nested loop joins don’t perform if you are processing large numbers of rows, do to the lookup nature”
…which is exactly the strategy I see in the ‘bad’ plan above.
So who’s ‘fault’ is all this? You could make a good case for saying that SQL server should play safe and include the ROWCOUNT as part of its ‘does-a-query-plan-match?’ heuristic (along with a basket of other options it already includes), and that would probably be a good thing generally.
But I think the main question has got to be why does SSIS use SET ROWCOUNT at all? Any time I’ve ever wanted to get metadata I’ve always done a SELECT TOP 0 * FROM x. The ‘top’ is part of the query, so already generates a unique query plan, and it’s not like SSIS has to parse and re-write the query to insert the TOP bit : SSIS is already generating that whole query, based on the ‘table or view’ name: adding the ‘top’ clause would be trivial.
I feel a Connect suggestion coming on.
The default TFS drop folder scheme is a train-wreck where all the binaries and output files from the entire solution are summarily dumped into a single folder (albeit with some special-case handling for websites). Its just gross.
What I wanted, of course, was a series of folders by project containing only the build artefacts relevant for that project: ‘xcopy ready’ as I like to say. Pretty sensible yes? Quite a popular request if one Googles around: one wonders why the train-wreck scheme was implemented at all. But I digress.
Contrary to what you (and I) may have read, you actually have to do this using AfterBuild, not AfterCompile (which fires way too early in the build process[1]). So the canned solution is:
Put this in your TFSBuild.proj:
<PropertyGroup>
<CustomizableOutDir>true</CustomizableOutDir>
<!-- TEAM PROJECT
(this turns off the ‘train-wreck’ scheme, and goes back to each project’s output going to their bin\debug or whatever folder)
Put this in any project file that you want to capture the output from:
<Target Name="AfterBuild" Condition=" '$(IsDesktopBuild)'!='true' ">
<Message Text="Copying output files from $(OutDir) to $(TeamBuildOutDir)" />
<ItemGroup>
<FilesToCopy Include="$(OutDir)\**\*.*" />
</ItemGroup>
<Copy SourceFiles="@(FilesToCopy)" DestinationFiles="@(FilesToCopy ->'$(TeamBuildOutDir)\$(AssemblyName)\%(RecursiveDir)%(Filename)%(Extension)')" />
</Target>
And hey presto:
A series of output folders, by assembly, containing only the artefacts relevant to that assembly, as opposed to the
Note however that with this scheme you get all files marked ‘Copy to Output Directory’ but not files marked as Content, which makes this currently unusable for deploying websites and means it’s not strictly speaking xcopy-ready[2]. Hopefully there is an easy fix to this, otherwise I’ll be diving back into the SNAK codebase where I’ve solved this previously.
[1] before serialization assemblies are generated, and before the obj folder has been copied to the bin folder. A good diagnostic is to put <Exec Command="dir" WorkingDirectory="$(OutDir)"/> into an AfterCompile target, and take a look at what you get back.
[2] I have a big thing about this, which is why I dislike the ClickOnce build output so much. Build and deploy need to be considered separately.
Here at work we have our profiles, including our documents folders, remoted to the network. It’s a constant pain, and this morning the fun was how to get the CopySourceAsHTML visual studio Add-in working as a result, since its ‘normal’ install location, My Documents\Visual Studio 2008\AddIn’ is within that remoted profile.
Irrespective of fiddling around with CASPOL and security zones, Visual Studio seems intransigent in not wanting to load AddIns from network locations. Fortunately it turns out there are a number of different places that you can put add-ins to get them to load, which is probably better explained in this conversation.
As a result I dumped the addin into
%ALLUSERSPROFILE%\Application Data\Microsoft\VisualStudio\9.0\AddIns
C:\Documents and Settings\All Users\Application Data\Microsoft\VisualStudio\9.0\AddIns
…and it worked just fine.
Shows how often I actually refer back to those DVDs, but now that Silverlight 2 is out they don’t work (well the navigation doesn’t, which is nearly as bad)
There are a number of crappy workarounds, but fortunately at least one clever fix (short of re-authoring the DVDs with the updated online content). Naively I thought you could just add the application/x-silverlight-2-b2 mime-type mapping into the registry and it would work, but apparently not.
The Silverlight version detection / browser sniffing / forwards compatibility story has been monumentally shite so far, because I’ve seen a whole heap of the ‘Install Silverlight’ button in lieu of content, but immortalizing it onto countless DVDs really takes the biscuit.
Flash – if I recall correctly – got this right first time, that being over a decade ago.
Yeah, so your fancy-pants Linq query, and your fluent interface all looks so neat and expressive, but after you’ve tried to play ‘pin the null reference exception on the donkey’ in production for the n’th time you might consider that all those intermediate variables were generally considered good style for a reason. And it’s not just that you can pinpoint an error given a line number:
“Putting each statement on a line of its own provides an accurate view of a program’s complexity” – Steve McConnell [1]
That and naming the variables documented the code[2], right?
Everyone’s oooohh so up in arms if a method spans multiple pages on screen, but a 10 line fluent interface chain mixed with some linq and some lambdas is somehow ok!? Be serious now[3].
Dumb it down. One day you will appreciate it.
[1] Code Complete, Chapter 31.5 Para ‘Using Only One Statement Per Line’. See http://www.cc2e.com/Page.aspx?hid=291
[2] …and chapter 11.1 ‘Considerations in Choosing Good Names’
[3] Not that I’m advocating avoiding either Linq or Lambdas, but moderation in everything, right? Really not sold at all on Fluent Interfaces mind. So last year.
Merging in updated or newly created data from an external source is a pretty common task in SSIS. The canonical way[1] these upserts are handled is the SSIS Lookup Pattern:
Incoming rows are run through the Lookup component to identify which already exist in the destination. Those that do become UPDATEs, and those that don’t (the Lookup’s error output) become INSERTs. You can further refine this to only retrieve the inserted-or-updated rows from the datasource in the first place and save on bandwidth and so on (a future post on delta loads I think).
It’s fairly straightforward, but to my mind this has a few major problems.
Firstly, it’s a pain in the arse. For something that’s such a common requirement, this seems like a lot of shape-twiddling-clicky-click-dragging to get it up and running[1]. And I hate how brittle the OLE DB Command parameter mapping is. But that’s another story.
More concerning is that the update flow uses the OLE DB Command component, and that’s something that fires per row, unlike the OLE DB Destination that your INSERTs go to, which can batch and/or use fast load. Normally the updates are going to be in the minority compared to the inserts, but if you get a significant number of them it still costs time.
Probably the bigger problem for me with this pattern is that the performance of the lookup degrades as the data volume progressively increases. Every time you insert data there's more for the lookup to subsequently lookup against. You can (and should) limit the columns in the lookup to restrict it's memory blow out, but it's always going to just progressively get worse over time.
To me this is symptomatic that the problem is being solved at the wrong tier. Why bring large quantities of data up from the database to determine 'newness' of a row, when that's clearly something best done closer to the database itself? Especially as both sets of rows are going to the same place anyway.
So then you think about pouring data into a temporary table, then doing the upsert within Sql (based on a join), then dropping the temporary table. But you’ve got to create and drop the temporary table. What you need is a temporary table that’s transparently created for you. And what, after all, is an INSTEAD OF trigger, if not precisely that?
So my solution for this is a series of 'upsert views'. An upsert view is a view with an INSTEAD OF INSERT trigger that instead of the insert performs an upsert. In this case (being Sql 2005) the upsert is an IF … INSERT … ELSE … UPDATE construct, but a 'true' Sql 2008 upsert would be a minor tweak.
An upsert view looks a bit like this:
Select all-the-updatable-columns
From sometable
Non-updatable columns include timestamp and identity columns (you'd never mark a column you brought in from a source system as an identity column, so there's no need to mess about with identity inserts).
The instead of trigger looks a bit like this:
-- Insert all the genuinely new rows
Insert into [table]
(all-the-updatable-columns)
From
Select (all-the-updatable-columns)
From inserted
Where not exists (select 1 from [table] where (keys=inserted.keys)
-- Update all the modified rows
Update [table]
Set all-the-updatable-columns=inserted.all-the-updatable-columns
From
Inserted
Inner join [table] on (keys = inserted.keys)
Where (concurrency_indicators <> inserted.concurrency_indicators)
(We’ll explain the keys and the concurrency indicators in a moment)
What this produces is a view over your data table that takes records INSERTED into the view and performs an UPSERT into the base table. As a result your SSIS can be simplified to this:
In reality of course you normally have more than this: probably have some other lookups and some data type conversions. However you've not had to put the Upsert functionality into the SSIS (by hand): instead the database is handling it all for you.
Provided you enable the FIRE_TRIGGERS option, this is entirely compatible with using Fast Load, and this can make a significant performance difference for the time of your package execution. You've kept all the row-matching down in Sql where it can use indexes, optimisations and the page cache, and you've reduced the memory footprint of your SSIS packages. I shame myself by not having any hard comparison data to back this up however.
We (of course) generate[2] these upsert views based on various bits of metadata we store to indicate which are the keys and the concurrency indicators. In a data warehouse it’s quite common for the primary key to be different to that defined in the source system, so the key used for row matching – the ‘business key’ – has to be indicated somewhere. Column-level extended properties seemed a fairly natural fit:
Similarly the columns that are used for concurrency control – rowversion / datetime updated – are marked as follows:
(If you don’t have one then you use the ‘fallback’ concurrency model that so many of Microsoft’s design-time tools like to kick out: you compare every column to see if it’s changed (mark all non-key columns). This is clearly a bit yuck).
And that’s pretty much it. Provided you can cope with INSERTing into a view and getting UPDATEs to your table, (which is a bit odd, but can be catered for with appropriate naming conventions) then maybe you can benefit from this approach too.
[1] Sometimes you can get the Slowly Changing Dimension wizard to kick out something similar, but that’s pretty hacky, and doesn’t like using certain types of columns as the update indicator (IRRC). It also re-creates the downstream flow every time you open it, which obviously removes any customisations you make. Avoid.
[2] The generation process was a fairly simple PowerShell script. I’ll post the important bits up one day.