Re: Data conversion between various schemas

From: Joe Novella <jnovella_at_voicenet.com>
Date: Tue, 16 Jul 2002 19:32:40 GMT
Message-ID: <3D3474DD.D83EB468_at_voicenet.com>


Paul Vernon wrote:

> > There are newer data profiling products which not only integrate
> > multiple databases into a consolidated normalized model, they do so
> > based on analyzing the data below the metadata. Through inferring
> > dependencies and redundancies across different sources, the analyst
> > can model the source data or map to a pre-defined fixed target.
>
> Indeed there are tools to help automate the normalisation process on
> individual data sources (my step 2).
> I'm not sure how well they work when attempting to integrate separate
> models together however. This is more a matter of matching entity
> predicates between sources rather than finding hidden functional
> dependencies.

Agreed. Though at least one of the data profiling tools on the market can infer what it calls "cross-file dependencies."

Other products do not infer cross-file dependencies. Rather, they compare the value sets of individual pairs of attributes and report overlap percentages. For example, they identify how many (and which) values of cust_id in one file also appear in customer_num in another file, and vice versa. Of course, that comparison is only one step in the redundancy analysis. The analyst has to decide whether cust_id and customer_num really should be the same attribute in the model. If so, the analyst creates a "synonym" between the attributes. If not, the analyst moves on to the next attribute pair.
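To give a rough idea of what that overlap comparison boils down to, here is a quick Python sketch. This is my own illustration, not any vendor's API; the column names cust_id and customer_num are just the ones from the example above.

# A rough sketch of the value-set overlap comparison described above.
def overlap_report(rows_a, col_a, rows_b, col_b):
    """How many distinct values of col_a appear in col_b, and vice versa."""
    vals_a = {r[col_a] for r in rows_a if r[col_a] is not None}
    vals_b = {r[col_b] for r in rows_b if r[col_b] is not None}
    common = vals_a & vals_b
    return {
        "a_in_b_pct": 100.0 * len(common) / len(vals_a) if vals_a else 0.0,
        "b_in_a_pct": 100.0 * len(common) / len(vals_b) if vals_b else 0.0,
        "only_in_a": sorted(vals_a - vals_b),
        "only_in_b": sorted(vals_b - vals_a),
    }

orders    = [{"cust_id": 101}, {"cust_id": 102}, {"cust_id": 999}]
customers = [{"customer_num": 101}, {"customer_num": 102}, {"customer_num": 103}]
print(overlap_report(orders, "cust_id", customers, "customer_num"))

The output (roughly 67% overlap each way, with 999 and 103 flagged as unmatched) is the kind of evidence the analyst weighs before deciding whether to create the synonym.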

Likewise, if there is little overlap between two attributes but there really should be a synonym between them, you should still be able to create the synonym. Of course, then you have another problem on your hands. :-)

Finally, other profiling tools can perform orphan analysis on sets of attributes. The purpose of redundancy profiling is to identify potential foreign keys and duplicate columns across files, tables, or systems, using the data itself, not the metadata or code.
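Just to illustrate again, an orphan check over a candidate composite foreign key amounts to something like the following. The table and column names here are made up for the example, not taken from any tool.

# A rough sketch of orphan analysis over a set of attributes
# (a candidate composite foreign key).
def find_orphans(child_rows, child_cols, parent_rows, parent_cols):
    """Return child key tuples with no matching key tuple in the parent."""
    parent_keys = {tuple(r[c] for c in parent_cols) for r in parent_rows}
    return [tuple(r[c] for c in child_cols)
            for r in child_rows
            if tuple(r[c] for c in child_cols) not in parent_keys]

order_lines = [{"region": "NE", "cust_id": 101},
               {"region": "NE", "cust_id": 999},   # no such customer: orphan
               {"region": "SW", "cust_id": 103}]
customers   = [{"region": "NE", "customer_num": 101},
               {"region": "SW", "customer_num": 103}]

print(find_orphans(order_lines, ("region", "cust_id"),
                   customers,   ("region", "customer_num")))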

Keep in mind, the data profiling tools do not make the normalization decisions. The analyst still has to determine which functional dependencies to include in the model and which redundancies are true foreign keys (or denormalized attributes). The difference is that you're making these decisions based on the underlying data rather than the metadata. Once those decisions are made, though, it's very easy for the profiling tool to normalize the sources.
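For what it's worth, the check behind inferring a single functional dependency from the data is conceptually simple: does each value of the candidate determinant map to exactly one value of the dependent attribute in the sample? A toy sketch, with illustrative attribute names only:

# A rough sketch of testing whether a candidate functional dependency
# holds in a data sample.
def fd_holds(rows, determinant, dependent):
    """True if determinant -> dependent holds for every row in the sample."""
    seen = {}
    for r in rows:
        lhs = tuple(r[c] for c in determinant)
        if lhs in seen and seen[lhs] != r[dependent]:
            return False   # same determinant, two different dependent values
        seen[lhs] = r[dependent]
    return True

rows = [{"cust_id": 101, "zip": "19103", "city": "Philadelphia"},
        {"cust_id": 102, "zip": "19103", "city": "Philadelphia"},
        {"cust_id": 103, "zip": "08540", "city": "Princeton"}]

print(fd_holds(rows, ("zip",), "city"))      # True in this sample
print(fd_holds(rows, ("city",), "cust_id"))  # False: city doesn't determine cust_id

A dependency that holds in the sample is still only a candidate; it's the analyst who decides whether it belongs in the model.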

From personal experience, I've used these tools to develop integrated models incorporating flat files, RDBMS tables, VSAM files, and IMS segments as data sources. It is still a lengthy process, but it is manageable, iterative, and repeatable.

Hope this helps.

--
Joseph Novella
Anthem Consulting, LLC
"Good ideas are not adopted automatically.  They must be driven into
practice with courageous patience." - Admiral Hyman Rickover
