
Re: What's FASTEST way to IMPORT Massive amounts of TRANSACTIONAL DATA? (Non-SQL*Loader)

From: Billy Verreynne <vslabs_at_onwe.co.za>
Date: Wed, 1 Dec 1999 10:34:10 +0200
Message-ID: <822mi1$7iq$1@ctb-nnrp1.saix.net>


[cross postings removed]

Papaya Paradise wrote in message
<821tgu$kpe$1_at_ash.prod.itd.earthlink.net>...

>SQL*Loader is fastest for raw pumping of data straight to tables but we
>have to do queries as we import to see if an Entity is already in the
>database. If they are, then update their ENTITY and ENTITY_ATTRIBUTES
>information in the database. If not then insert into the below type of
>tables.

Interesting problem. No matter how you cut and slice this problem, you need to check the keys of the existing data and new data, to determine what needs to be inserted (new keys) and what needs to be updated (existing keys).

A dirty method that may work, but which I will not recommend, is to use SQL*Loader anyway. Up the number of allowed errors (the ERRORS parameter) so that the complete input file will be processed by SQL*Loader. Run the file through SQL*Loader and load the data. All new keys will be inserted. All errors (i.e. existing keys) will be rejected to the bad file. Now update the table with the bad file data. Very dirty method IMO, but worth mentioning anyway.
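
Something along these lines - all table, column and file names here are just examples, and the key point is a high ERRORS setting so SQL*Loader keeps going past the duplicates (conventional path, so the unique key violations get rejected row by row to the bad file):

  -- entities.ctl (example only)
  LOAD DATA
  INFILE  'entities.dat'
  BADFILE 'entities.bad'
  APPEND
  INTO TABLE entity
  FIELDS TERMINATED BY ','
  (entity_key, entity_name, entity_type)

  -- allow (almost) any number of rejects so the whole file gets processed
  sqlldr userid=scott/tiger control=entities.ctl errors=999999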

Based on similar monthly loading problems in a data warehouse environment, one of the things I would try first is the following:

  1. Direct and parallel load the data using SQL*Loader into a temp table. Fastest way to get data into Oracle, especially if large volumes need to be loaded.
  2. Create an index on the key column(s) of the temp table.
  3. Run a CREATE TABLE ... UNRECOVERABLE AS SELECT, doing a MINUS between the temp table keys and the source table keys. This gives me a temp table with the new keys to be inserted (sketched after this list).
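
Roughly, with made-up names (depending on your version the keyword is UNRECOVERABLE or NOLOGGING):

  -- 1. sqlldr userid=... control=... direct=true parallel=true   (into temp_load)

  -- 2. index the key on the load table
  CREATE INDEX temp_load_key ON temp_load (entity_key)
    PARALLEL (DEGREE 4) UNRECOVERABLE;

  -- 3. new keys = keys in the load that are not in the target yet
  CREATE TABLE new_keys UNRECOVERABLE AS
    SELECT entity_key FROM temp_load
    MINUS
    SELECT entity_key FROM entity;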

OK, now we have identified the new keys. We can now join this key table to the temp table to identify either the to-be-inserted data or the to-be-updated data. As this join needs to be made twice (maybe more?), it may make sense to simply split the temp data table into two data sets now, doing these joins up front with two CREATE TABLE ... UNRECOVERABLE AS SELECT statements.
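
If you go the split route, it is just two more CREATE TABLE statements (again only a sketch):

  -- rows whose keys are new -> to be inserted
  CREATE TABLE ins_rows UNRECOVERABLE AS
    SELECT t.*
    FROM   temp_load t, new_keys n
    WHERE  t.entity_key = n.entity_key;

  -- rows whose keys already exist -> to be updated
  CREATE TABLE upd_rows UNRECOVERABLE AS
    SELECT t.*
    FROM   temp_load t
    WHERE  NOT EXISTS (SELECT 'x' FROM new_keys n
                       WHERE  n.entity_key = t.entity_key);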

4. Run an INSERT ... SELECT, inserting the new data. Either by using the original temp table joined to the new-keys table, or straight from the insert data set.
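
For example, assuming the work tables have the same column layout as the target:

  -- from the insert data set...
  INSERT INTO entity SELECT * FROM ins_rows;

  -- ...or straight from the load table joined to the new keys
  INSERT INTO entity
    SELECT t.*
    FROM   temp_load t, new_keys n
    WHERE  t.entity_key = n.entity_key;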

5. Run an UPDATE with a SELECT, using the data from the temp table outer joined to the new-keys table (keeping only the rows whose keys are not new), or simply doing the SELECT from the second (update) data set. Some ideas here to increase performance: make the rollback segments big enough so the transactions can be as large as possible, and do the UPDATE in parallel by giving each UPDATE's SELECT a range of keys to process.
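
The update itself would look something like this (column names are just an example):

  UPDATE entity e
  SET    (entity_name, entity_type) =
         (SELECT u.entity_name, u.entity_type
          FROM   upd_rows u
          WHERE  u.entity_key = e.entity_key)
  WHERE  e.entity_key IN (SELECT entity_key FROM upd_rows);

  -- to run it in parallel, give each session its own slice, e.g.
  --   ... AND e.entity_key BETWEEN :lo AND :hi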

One of the problems with this method is that you require space for the temp tables and indexes.

Other ideas are to maybe do the UPDATE first, then drop the indexes from the source table and run the INSERT. After you have inserted the new data, re-create the indexes. In fact, I would recommend that you try this irrespective of the method you use, as an INSERT can spend up to 99% of its time scanning huge unique indexes before inserting a row to ensure that it does not create duplicates. You can also create indexes in parallel, which can make the index re-create option much more viable.
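
Something like the below - and if the index backs a primary key constraint, you would disable the constraint rather than drop the index directly:

  DROP INDEX entity_key_idx;

  -- ... run the INSERT of the new rows ...

  CREATE UNIQUE INDEX entity_key_idx ON entity (entity_key)
    PARALLEL (DEGREE 8) UNRECOVERABLE;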

The main thing to do is to identify where the bottlenecks are likely to occur, and then focus on those areas and try to either work around them or optimise them. In my case I had lots of CPU power at my disposal as we were building the warehouse with OPS on 12 nodes. So once I had the data in Oracle, I could make use of massive parallel processing to address these problems. My bottleneck was getting the data down from the mainframe. Your bottlenecks may just be reversed. In some cases it's a good idea to perform as much pre-processing as possible (even on another platform), like supplying the source data system with a list of existing keys and having it generate two data sets for you - an insert-key data set and an update-key data set.

Many ways of skinning this cat. :-)

regards,
Billy
