
Re: 1 Billion 11 Byte Words... Need to Check Uniqueness Using Oracle

From: Bryan W. Taylor <bryan_w_taylor_at_yahoo.com>
Date: 10 Feb 2002 20:56:35 -0800
Message-ID: <11d78c87.0202102056.3b7680d@posting.google.com>


"Keith Boulton" <kboulton_at_ntlworld.com> wrote in message news:<Ziu98.11187$YA2.2300870_at_news11-gui.server.ntli.net>...
> I don't agree.
>
> All you need to do is have N instances of a process generating a sorted
> sub-set 1/Nth of the data, with a final process doing a merge of the
> intermediate files.

Actually, this is exactly what Oracle does if you ask it to sort more data from a table than will fit in memory.

Whether you sort N sub-sets first and combine them with a merge, or partially order the data into N pieces and then sort each piece, really makes little difference. If you choose N so that each piece is just smaller than physical memory, N-1 of the pieces still have to be written to disk and read back for the merge, and that IO should dominate the total run time. For example, with 11 GB of data and, say, 1 GB of memory available for sorting, you end up with about eleven pieces, ten of which have to be spilled to disk.
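
To make this concrete, here is a minimal Python sketch of the technique we're both describing (an external merge sort): split the input into sorted runs that fit in memory, spill each run to disk, then do an N-way merge. It assumes one word per line in a hypothetical input file and a made-up run size; it illustrates the general algorithm, not Oracle's internal implementation.

    import heapq
    import itertools
    import tempfile

    RUN_WORDS = 50_000_000  # words per in-memory run; pick so a run fits in RAM (assumption)

    def make_runs(path):
        """Sort memory-sized chunks of the input and spill each chunk to its own temp file."""
        runs = []
        with open(path) as src:
            while True:
                # Normalize every word to end with a newline so the runs merge cleanly.
                chunk = [w.rstrip("\n") + "\n" for w in itertools.islice(src, RUN_WORDS)]
                if not chunk:
                    break
                chunk.sort()
                run = tempfile.TemporaryFile(mode="w+")
                run.writelines(chunk)
                run.seek(0)
                runs.append(run)
        return runs

    def merged_stream(runs):
        """N-way merge of the sorted runs into one globally sorted stream of words."""
        return heapq.merge(*runs)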

Again, I think you are seriously neglecting the issues of IO throughput optimization on a multi-CPU, multi-disk machine. Oracle has been obsessed with exactly that for decades. You haven't given any indication of how you'd make sure you get as close as possible to the sustained transfer rate of your IO system.

> Its only 11GB of data, so you're unlikely to have a very high number of
> devices available. Given that each on the N intermediate processes does
> about the same amount of work, you can just spread the intermediate files
> over the output files. At each stage, you report an error and stop on
> discovery of a duplicate.

I don't understand what you are saying. The size of the datafile has exactly 0% effect on the "number of devices available".

Being able to quit at the first dup found isn't really much of an advantage, because you only save time when a dup actually exists; if the data turns out to be unique you still do the whole pass. I interpreted the problem to be finding all dups, anyway.
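
For what it's worth, here is what the duplicate check looks like on top of the merged stream from the sketch above (again, just an illustration): adjacent equal words in the sorted stream are the duplicates, and the early-exit option only pays off when a duplicate actually exists.

    def find_dups(sorted_words, stop_at_first=False):
        """Scan a sorted stream and collect words that appear more than once."""
        dups = []
        prev = None
        for word in sorted_words:
            # A duplicate shows up as two (or more) adjacent equal words.
            if word == prev and (not dups or dups[-1] != word):
                dups.append(word)
                if stop_at_first:
                    break
            prev = word
        return dups

    # Hypothetical usage:
    # runs = make_runs("words.txt")
    # print(find_dups(merged_stream(runs)))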

> The advantages include: you don't have to load the data first and then
> create an index which in itself may double the data processing time; you can
> tune the sort algorithm to suit your purposes; the underlying read of data
> from disk and parsing are much simpler.

The process I described did not call for the creation of an index, for exactly this reason. The IO associated with making the index persistent is unneeded; it is the sort involved in index creation that adds the value.

> I'm also not convinced that the Oracle sort algorithm is very efficient.

What are you basing this on? Sorting is a core competency for a database, so I'd be astounded if they didn't do it pretty well. I'm assuming that the data in question here is randomly chosen and randomly ordered, so I'd expect them to get O(n log n) just like everybody else. If you really think you'll get 10X out of a better sorting algorithm, you're crazy.

> I would expect to be able to achieve a 10-fold performance advantage by not
> using Oracle.

No way. I'm assuming that we're dealing with an Oracle instance tuned specifically for this task, of course. This is actually pretty similar to the kind of task an enterprise data warehouse would routinely do during a data load, so Oracle has a vested interest in optimizing this.

> Of course, if the objective was to ensure the ongoing uniqueness of an 11
> billion element set subject to inserts, updates and deletes, you'd be much
> better off using oracle.
Received on Sun Feb 10 2002 - 22:56:35 CST
