Data Warehouse - extract methods

From: Dave <dave_at_spam-killer-dghconsulting.co.uk>
Date: Sun, 23 Feb 2003 11:28:37 +0000 (UTC)
Message-ID: <b3ab95$3ie$1@sparta.btinternet.com>

Hi,

This has probably been covered before, but it is very well hidden! I'm about to build a data warehouse, and am discovering that the schema design is the easy bit. One of my biggest challenges has been designing the extract processes. I'd appreciate it if anyone would be kind enough to share their experiences regarding the "E" bit of ETL. Here's my situation:

Prerequisites:

All the data sources are production Oracle8i databases. We don't have the option of upgrading these in the near future.
Anything other than the very cheapest commercial ETL tools are out, I'm afraid. Data Junction's ETL fitted our budget (the basic version is very cheap), but its features weren't adequate for some of our complex transformation rules.
The data warehouse database will be Oracle9i Release 2.
Some of the source tables are relatively large (60 million rows).

My options (as I see them):

Extract to flat files, ship files, then load with SQL*Loader/external tables.

Oracle seems to like this method (they mention it in their data warehousing docs). Somehow I don't. Change detection would be a major challenge, as we don't have timestamps on all the source tables. Even with timestamps, I'd still have to detect deletions. Some of the tables on the source database are relatively large, and we'd like to minimise performance hit and extract window. I would guess we'd have to implement some kind of proprietary change detection system, but I'm wary of the maintenance costs involved with proprietary code. On the plus side, the extract doesn't rely on the data warehouse database being up, and the load doesn't rely on the source databases being up.

2. Change Data Capture (CDC).

Not really an option, as the source databases are on 8i. I just thought I'd mention that I'm aware of this technology in 9i. Even if it was an option for us, I'd be slightly concerned about the related performance hit on our production databases (our source systems), and have not been able to find much information about CDC and performance. Presumably CDC would also require significant amounts of storage on the source systems.

3. Read-only snapshots (materialized views).

Implementing snapshots (materialized views) with fast refreshes would reduce the hit on our source systems. However, the initial hit would be large (requires an initial full refresh). The snapshot logs would require some storage on the source systems, but I wouldn't be too concerned about that if we implement regular refreshes. I would foresee us replicating the source tables to staging area tables on the data warehouse database. My main concern about this option is that it merely moves the change detection challenge to the data warehouse database. Change detection is clearly a major challenge here.

4. Proprietary solution over Oracle Net.

This option is just too scary to contemplate. The maintenance cost would be large, and the change detection logic could place additional load on the source systems. I want to avoid over-engineering the solution, and would like to use built-in Oracle features wherever possible.

Any help at all would be gratefully received.

Dave. Received on Sun Feb 23 2003 - 05:28:37 CST