

Re: de-dup process

From: Nigel Thomas <>
Date: Thu, 14 Dec 2006 11:47:16 -0800 (PST)
Message-ID: <>

>> Cannot clean data before loading as data is from many different sources that don't know about each other.

How many is many? If you really have many different load files (or can manufacture them), then you may be able to change the way the problem scales.

How do you decide which of the duplicates to use? If you can use the "first come - first served" rule, your de-duplication problem becomes much simpler - duplicate avoidance rather than duplicate removal.

Is there any easy way you can partition the incoming data so you can be sure that records in files in group A can't interfere with records in files in groups B, C, D, etc? If the files 'overlap' when they come in, you may still be able to split them on a partition key - ie a component of the unique key - before they get loaded.

Task I: "Reception"

First, as data files arrive, deal them out into N non-overlapping file groups, where with luck N is an acceptable degree of (external) parallelism. This task is not internally parallelised:

1. get the next file
2. place it as-is into the appropriate load file group for task III, OR pass it to a splitter process for task II if necessary
3. go back to step 1

This is NOT parallel - like the maitre d' in a restaurant, it does one thing quickly, then hands you off:
- "here is Annette to take you to your table" OR
- "your table is by the window"

Task II: "Your table is ready"

This step deals with physically partitioning individual files outside the database as a precursor to task III. You can skip it if you don't need to physically split files:

1. find a file that's ready to split (eg it's in a specific directory)
2. deal its records out into a file for each load group
3. go back to step 1

You can parallelise this task if necessary - have as many parallel splitter processes as the CPU and file system can stand, to ensure that there is always a file ready for every loader process in the next task.
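The dealing rule in tasks I/II and the "first come - first served" merge used by the loaders can be sketched in a few lines. This is a minimal Python illustration, not the actual implementation: the group count, key shape, and function names are all hypothetical, and in the database the merge step would of course be an Oracle MERGE statement rather than a dictionary.

```python
import hashlib

N_GROUPS = 4  # hypothetical external degree of parallelism


def group_for(key, n_groups=N_GROUPS):
    """Task II dealing rule: hash a component of the unique key, so
    records with the same key always land in the same group and the
    groups can never interfere with one another."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % n_groups


def split_records(records, n_groups=N_GROUPS):
    """Deal (key, payload) records out into one 'file' per load group
    (lists stand in for the split files here)."""
    groups = [[] for _ in range(n_groups)]
    for key, payload in records:
        groups[group_for(key, n_groups)].append((key, payload))
    return groups


def merge_first_come_first_served(base_table, temp_rows):
    """Task III merge under the first come - first served rule: a key
    already present in the base table wins, later duplicates are simply
    not inserted (duplicate avoidance, not removal). Roughly what an
    Oracle  MERGE ... WHEN NOT MATCHED THEN INSERT  does against
    BASE_TABLE. Returns the number of rows inserted."""
    inserted = 0
    for key, payload in temp_rows:
        if key not in base_table:
            base_table[key] = payload
            inserted += 1
    return inserted
```

Because the grouping is a pure function of the key, any number of splitter processes can run in parallel without coordinating with each other - the same key can only ever arrive in one group's files.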
This is parallel - like the number of waiters in a restaurant; one of them organises you, takes your order, etc.

Task III: "The meal"

Within each group, set up a loader process:

1. find a file that's ready to load for this group (eg it's in a specific directory)
2. load it into temporary table TEMP_TABLE_A (truncate first) - each group has its own temp table
3. merge TEMP_TABLE_A into your master table BASE_TABLE (using the technique Gints Plivna mentioned)
4. mark the file as loaded (eg move it to another directory)
5. go back to step 1 until files are exhausted (or you are)

This is parallel (like the number of tables in a restaurant). Restaurants can add more tables - but only if the kitchen can keep up... otherwise you get mad and leave.

Summary

You now have some flexibility to tune things:

- the size of the individual files - to get the best out of the hash join from TEMP_TABLE_A to BASE_TABLE
- the external degree of parallelism (how many loader group processes, how many splitter processes)
- the (internal) degree of parallelism within the database (for each group), so that the sum of parallel servers required across all groups matches the capacity of your server (ie DOP * N = function of max_parallel_servers)
- task II could be done inside Oracle (easier to manage?) - substitute temp tables for temp files throughout
- or task II could be pipelined into task III using FIFOs (so you don't have to double-handle the data on disc - but that can cause other problems)

And BTW, this doesn't necessarily need the Partitioning option...

HTH

Regards
Nigel

Received on Thu Dec 14 2006 - 13:47:16 CST
