Oracle FAQ Your Portal to the Oracle Knowledge Grid

Re: 1 Billion 11 Byte Words... Need to Check Uniqueness Using Oracle

From: Keith Boulton <kboulton_at_ntlworld.com>
Date: Sun, 10 Feb 2002 13:15:24 -0000
Message-ID: <Ziu98.11187$YA2.2300870@news11-gui.server.ntli.net>


I don't agree.

All you need to do is have N instances of a process, each generating a sorted subset of 1/Nth of the data, with a final process doing a merge of the intermediate files.

It's only 11GB of data, so you're unlikely to have a very high number of devices available. Given that each of the N intermediate processes does about the same amount of work, you can just spread the intermediate files over the available devices. At each stage, you report an error and stop on discovery of a duplicate.

The advantages include: you don't have to load the data first and then create an index which in itself may double the data processing time; you can tune the sort algorithm to suit your purposes; the underlying read of data from disk and parsing are much simpler.

I'm also not convinced that the Oracle sort algorithm is very efficient.

I would expect to be able to achieve a 10-fold performance advantage by not using Oracle.

Of course, if the objective was to ensure the ongoing uniqueness of a 1-billion-element set subject to inserts, updates and deletes, you'd be much better off using Oracle.

Bryan W. Taylor <bryan_w_taylor_at_yahoo.com> wrote in message news:11d78c87.0202091151.3256e0f_at_posting.google.com...
> "Keith Boulton" <kboulton_at_ntlworld.com> wrote in message news:<YH498.7607$YA2.1485257_at_news11-gui.server.ntli.net>...
>
> > > Surely some sort of file based sort program would be a lot cheaper if you don't.
> > >
> > And very, very much faster!
>
> Done correctly it would be faster, but not by as much as you think.
> You are underestimating the capabilities of Oracle's multithreaded IO.
> It would not be a trivial programming task if you want to be
> competitive.
>
> The key is to avoid using the disk for more IO than is essential
> during the sort. Since you likely have more data than memory, you
> have to store it to disk and eventually read it back. This will be by
> far the slowest operation. IO management on a multi-disk SMP machine
> is not trivial. Oracle has multithreaded IO built in - you'll have to
> write your own. If your program isn't making multiple disks read and
> write simultaneously, you'll lose.
>
> The method would essentially parallel the method I outlined for oracle
> to use: split into separate files of manageable size based on a partial
> ordering hash. Then sort each file in memory and scan it for repeats.
>
> The only advantages you'll have over Oracle are 1) Oracle will put the
> data into blocks, which makes it expand and creates more IO, and 2)
> your in-memory executable code will probably be smaller, thereby
> allowing more memory for the sort and potentially allowing you to use
> fewer pieces. Along these lines, if the data is printable characters,
> compression of the 11-byte words will probably help substantially.
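The hash-partition method quoted above can be sketched as follows (a toy illustration under my own naming, not the poster's code; hashing on a word's leading byte stands in for whatever "partial ordering hash" a real implementation would choose):

```python
from collections import defaultdict

def partition_by_prefix(words, nbuckets=256):
    """Split the data into buckets small enough to sort in memory.
    Equal words always hash identically, so any duplicate pair
    necessarily lands in the same bucket."""
    buckets = defaultdict(list)
    for w in words:
        buckets[w[0] % nbuckets].append(w)  # w[0] is the leading byte
    return buckets

def bucket_has_repeat(bucket):
    """Sort one bucket in memory, then scan adjacent entries for repeats."""
    bucket.sort()
    return any(a == b for a, b in zip(bucket, bucket[1:]))

words = [b"alpha", b"beta", b"gamma", b"beta"]
dup = any(bucket_has_repeat(b) for b in partition_by_prefix(words).values())
print(dup)  # prints True: b"beta" appears twice
```

The partitioning pass is a single sequential read and N sequential writes, after which each bucket is an independent in-memory job, which is what makes the per-bucket sorts easy to run on separate disks or CPUs.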
Received on Sun Feb 10 2002 - 07:15:24 CST

