Sunday, November 4, 2007

Data Source Annotations

This post is something of a follow-up on the “Annotate like you mean it” best practice I mentioned in my best practices post last week. I always figured that the technique was something of a no-brainer, but when I mentioned it to Greg Low in our SQL Down Under conversation last week he seemed a little surprised (in a good sort of way) and if he hadn’t thought of it[1] odds are there are a few people out in SSIS-land who might not have thought of it either.

In my last post I talked about updating an SSIS package to use Common Table Expressions in the source query to replace a complex set of asynchronous transforms in the package’s data flow. The upside of this change was that the package performed better by several orders of magnitude. The downside is that the source query became more complex.

Now this isn’t inherently a bad thing, but it is undesirable in that it makes the package less self-documenting. In addition, because I tend to base my source queries on expressions[2], it is not always a simple task to figure out exactly what query is being executed. Because of this, and because self-description is an admirable goal for any development artifact, I will generally update my data flow with an annotation on the design surface that shows the query next to the data source component. In this scenario, the data flow ended up looking something like this:


As you can see, it is now much easier to understand the data being extracted than it would be to examine the data source component to find out on which variable its query was based, and then examine the variable to understand the expression on which it, in turn, was based.

The obvious drawback of this approach is that as soon as the annotation is created, it is instantly out of date. There is no way to create a “dynamic annotation” in SSIS[3], so there is no enforced relationship between the annotation and the data source component that it describes. This then puts something of a maintenance demand on the package developers, as they must remember to update the annotation each time the source query is updated.
I personally think that this is an acceptable tradeoff. The maintenance cost itself is low, and the benefits of having a well-documented package are considerable. Also, if you follow the best practice of “Get your metadata right first, not later” then you will likely not need to update the source query (or the annotation) very often.
[1] I have a great deal of respect for Greg’s knowledge and experience, so if there’s something that is new to him, it’s worth blogging about.
[2] Which will likely be a post of its own before too long.
[3] Although having annotations based on expressions would be a cool new feature, wouldn’t it?

No comments: