Re: Data Warehouse - extract methods

From: Jusung Yang <JusungYang_at_yahoo.com>
Date: 23 Feb 2003 10:56:20 -0800
Message-ID: <130ba93a.0302231056.2c9a77ce@posting.google.com>

The analysis of your situation is pretty complete and accurate. If you are restricted in so many ways, you will have very few options. Change detection, or "delta" extraction, is indeed the major challenge. If ORACLE CDC is not an option for you now, you may have to design your own CDC - typically with trigger. Data storage is not a concern when you are dealing with incremental extraction. But you will have to see if you can live with the performance impact on the production system.

If you do decide to replicate the source table to either staging area or the warehouse, you can export/import the source tables from production and then use the "on prebuilt table" clause to create your materialized views. From then on you can fast refresh your MVs - provided they are fast refreshable.

Of course, as you also noted, the only benefits of doing the above is that the ETL will run in the staging or DW database and the production system will not take a performance hit during ETL. But you are simply shifting the workload from one server to another, and the solution is not scalable - and it won't be until a "delta" extraction stategy has been implemented.

Jusung Yang

"Dave" <dave_at_spam-killer-dghconsulting.co.uk> wrote in message news:<b3ab95$3ie$1_at_sparta.btinternet.com>...
> Hi,
>
> This has probably been covered before, but it is very well hidden! I'm about
> to build a data warehouse, and am discovering that the schema design is the
> easy bit. One of my biggest challenges has been designing the extract
> processes. I'd appreciate it if anyone would be kind enough to share their
> experiences regarding the "E" bit of ETL. Here's my situation:
>
>
> Prerequisites:
>
> 1. All the data sources are production Oracle8i databases. We don't have the
> option of upgrading these in the near future.
> 2. Anything other than the very cheapest commercial ETL tools are out, I'm
> afraid. Data Junction's ETL fitted our budget (the basic version is very
> cheap), but its features weren't adequate for some of our complex
> transformation rules.
> 3. The data warehouse database will be Oracle9i Release 2.
> 4. Some of the source tables are relatively large (60 million rows).
>
>
> My options (as I see them):
>
> 1. Extract to flat files, ship files, then load with SQL*Loader/external
> tables.
>
> Oracle seems to like this method (they mention it in their data warehousing
> docs). Somehow I don't. Change detection would be a major challenge, as we
> don't have timestamps on all the source tables. Even with timestamps, I'd
> still have to detect deletions. Some of the tables on the source database
> are relatively large, and we'd like to minimise performance hit and extract
> window. I would guess we'd have to implement some kind of proprietary change
> detection system, but I'm wary of the maintenance costs involved with
> proprietary code. On the plus side, the extract doesn't rely on the data
> warehouse database being up, and the load doesn't rely on the source
> databases being up.
>
> 2. Change Data Capture (CDC).
>
> Not really an option, as the source databases are on 8i. I just thought I'd
> mention that I'm aware of this technology in 9i. Even if it was an option
> for us, I'd be slightly concerned about the related performance hit on our
> production databases (our source systems), and have not been able to find
> much information about CDC and performance. Presumably CDC would also
> require significant amounts of storage on the source systems.
>
> 3. Read-only snapshots (materialized views).
>
> Implementing snapshots (materialized views) with fast refreshes would reduce
> the hit on our source systems. However, the initial hit would be large
> (requires an initial full refresh). The snapshot logs would require some
> storage on the source systems, but I wouldn't be too concerned about that if
> we implement regular refreshes. I would foresee us replicating the source
> tables to staging area tables on the data warehouse database. My main
> concern about this option is that it merely moves the change detection
> challenge to the data warehouse database. Change detection is clearly a
> major challenge here.
>
> 4. Proprietary solution over Oracle Net.
>
> This option is just too scary to contemplate. The maintenance cost would be
> large, and the change detection logic could place additional load on the
> source systems. I want to avoid over-engineering the solution, and would
> like to use built-in Oracle features wherever possible.
>
>
> Any help at all would be gratefully received.
>
> Dave.
Received on Sun Feb 23 2003 - 12:56:20 CST