Building the Meta-data Diff Engine

From: Laconic2 <laconic2_at_comcast.net>
Date: Thu, 21 Oct 2004 11:43:19 -0400
Message-ID: <L9OdnSkAE828ROrcRVn-hg_at_comcast.com>



What follows in more skunk works. You have been warned!

On two of the contracts I worked, one of the things I was called upon to work with was a pair of databases that had diverged, because different DDL had been applied to the two.

I approached the first one by moderating a series of meetings between the two people who controlled the DDL, each on one database. I detected the differences pretty much by hand.

By the time I hit the second case, I was ready with a more automatic approach. I never would have been able to get the DBDs in the same room anyway. They were on opposite sides of the pond. And they spoke two different languages, which each of them insisted was called "English".

So I built a meta-database.

The meta-database contains user tables that are the reflection of the system tables in each of the databases. I only selected a subset of the data in the system tables. There were some tables and columns I didn't need. I had to add one column to each table in the meta-database, called DATABASE_NAME. This is completely unnecessary metadata in the original databases, due to what I'll call the "Tigger principle": ("The most wonderful thing is, I'm the only one!") But, without it, we've got the cat food problem all over again.

Then I loaded the meta-database from the respective databases with an INSERT...SELECT construct. The insert was into the user tables of my meta-database, while the select was from the system tables of one of the databases. The SELECT also has one extra item, a literal name of the databases, like this:

    , 'EAST_DB' as DATABASE NAME

This means I needed two separate loader scripts, one for each database, that differed by two trivial details, the connect statement, and the literal database name. Big deal.

This INSERT...SELECT construct is what make me call this approach "skunk works". It violates the "thou shalt not conflate meta-data with data" dictum all over the place. I use skunk works whenever I need to.

Once I had completed the loading, I had a database where I could construct some easy queries that would answer questions like these:

Find all the tables that are in one database but not both. Find all the tables that are in both databases, but have a different number of columns.
Find all the columns that are in only one database, but not both, and are in tables that are in both databases.
Find all the columns that are in both databases, but are based on different domains in the two databases.
Find all the columns that are in both databases, are based on the same domain, but have a different datatype, precision, or scale in the two databases.

I could have done the same for constraints, but it would have been a lot more work.

Among thousands of columns, I found one that was based on a floating point number on one side of the pond, and on an integer on the other side of the pond. And, it turns out, this was the answer to a problem that had plagued the client for months! But that was a byproduct of the effort. The main product was to produce a set of columns where the negotiation could proceed between the two sites as to whether the column should be removed from one site, or added to the other.

Tables that were only at one site were less of an issue, for reasons I can go into later.

All of the above was less than one days' work, from "Eureka!" to presented results. When they asked me how the heck I did it, I said, "I used a computer". The client was not amused.

This topic is tangentially related to a couple of topics started recently by Kenneth. Received on Thu Oct 21 2004 - 17:43:19 CEST

Original text of this message