Re: 1 Billion 11 Byte Words... Need to Check Uniqueness with Oracle

From: Bjørn Augestad <b-augest_at_online.no>
Date: Sat, 09 Feb 2002 09:31:49 GMT
Message-ID: <90698.2305$3c.91948_at_juliett.dax.net>


Hooty and the Blowfish wrote:

>
> I have a group that is currently using Oracle but is considering
> moving to SQL Server to save some money. One of the business cases
> they're working with is the testing of 1 billion 11 character words
> for uniqueness. Apparently they've been sold on the idea that SQL
> Server will rock their monkey.
>
> I tend to disagree and believe that Oracle will handle this task much
> more elegantly.
>
> So that I can give this group some guidance: How would you go about
> testing uniqueness on 1 Billion 11 character words using oracle?
>
> Ideas to get you started:
> 1) Write small clients to do database inserts and distribute them
> across the network on small desktops. Run Oracle with one table, one
> column and one constraint (that it be a primary and unique key).
> Start the inserts and wait for an exception. You could play with the
> number of inserts per commit and number of boxes submitting
> connections to the database.
>
> 2) Install Oracle on 10 boxes and split the 1 Billion words into 10
> segments. Insert the numbers into the database and check for
> uniqueness. This won't prove uniqueness across the entire set so
> you'd then have to bulk insert or import all data from each Oracle
> database into one master database that checks uniqueness. Maybe this
> would be faster than checking uniqueness on every insert.
>
> 3) ??? Any other ideas?
>

If you are looking for ways to check one billion values for uniqueness, maybe this will help: I did something very similar a couple of months ago.

Using an RDBMS to do this won't work well. It creates too much overhead (locking, index updates, transaction logs, disk access, network round trips and more). It will probably take forever, which in most cases is too long :-)

How about this approach? I do assume a contemporary operating system, a 64-bit CPU and support for files larger than 2 GB.

  1. Find a way to convert each string into a 64-bit integer. This should be possible considering what you write in 2), where you call the words numbers: 11 decimal digits always fit in 64 bits. (man atoll)
  2. Convert the file(s) containing your strings into one file of 1,000,000,000 64-bit integers, i.e. an 8 GB binary file (hence the large-file requirement above).
  3. Sort the file with the standard C function qsort(): first map the file into memory using mmap(), then sort the mapped array in place. Now you have a sorted array of integers stored in a file.
  4. A new program can memory-map the file again and convert the void* returned from mmap() into a "long long *". Then traverse the array (of "long long") with a simple for loop and check adjacent entries: if a[n] == a[n + 1], you have a duplicate. (Both programs are sketched below.)
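
Here is a minimal sketch of steps 1 and 2, assuming the words really are decimal digit strings, one per line. The file names are placeholders and production code needs real error handling:

/* txt2bin.c - read one decimal word per line, write a flat file of
 * 64-bit integers.  Sketch only. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    FILE *in, *out;
    char line[64];
    long long val;

    if (argc != 3) {
        fprintf(stderr, "usage: %s words.txt words.bin\n", argv[0]);
        return 1;
    }
    if ((in = fopen(argv[1], "r")) == NULL
     || (out = fopen(argv[2], "wb")) == NULL) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof line, in) != NULL) {
        val = atoll(line);              /* 11 decimal digits fit easily */
        fwrite(&val, sizeof val, 1, out);
    }
    fclose(out);
    fclose(in);
    return 0;
}

And steps 3 and 4 as the second program. Two things to note: the comparison function compares explicitly instead of returning x - y, which could overflow, and because the mapping is MAP_SHARED, qsort() rearranges the file itself on disk (expect heavy paging on an 8 GB file):

/* binsort.c - mmap the binary file, qsort it in place, then scan
 * adjacent entries for duplicates.  Sketch only. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

static int cmp(const void *pa, const void *pb)
{
    long long a = *(const long long *)pa;
    long long b = *(const long long *)pb;
    return (a > b) - (a < b);           /* avoids overflow of a - b */
}

int main(int argc, char *argv[])
{
    int fd;
    struct stat st;
    long long *a;
    size_t i, n;

    if (argc != 2) {
        fprintf(stderr, "usage: %s words.bin\n", argv[0]);
        return 1;
    }
    if ((fd = open(argv[1], O_RDWR)) == -1 || fstat(fd, &st) == -1) {
        perror(argv[1]);
        return 1;
    }
    n = st.st_size / sizeof *a;
    a = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    qsort(a, n, sizeof *a, cmp);        /* step 3: sort the file in place */

    for (i = 0; i + 1 < n; i++)         /* step 4: adjacent compare */
        if (a[i] == a[i + 1])
            printf("duplicate: %lld\n", a[i]);

    munmap(a, st.st_size);
    close(fd);
    return 0;
}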

Hope this helps. Contact me if you're interested in some source code.

Bjørn Augestad

PS. If you do it this way and qsort() turns out to be too slow, check out the book "Programming Pearls". It has some very nice chapters on optimizing.
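
PPS. One concrete trick in that direction: a least-significant-digit radix sort makes eight linear passes over the data and never calls a comparison function, so it tends to beat qsort() on plain integers. A rough sketch of the standard technique (my own, not from the book; note it needs a scratch array as large as the input, another 8 GB here):

/* radix64.c - LSD radix sort of 64-bit values, one byte per pass.
 * Sketch only: no malloc failure handling. */
#include <stdlib.h>
#include <string.h>

void radix_sort(unsigned long long *a, size_t n)
{
    unsigned long long *tmp = malloc(n * sizeof *tmp);
    size_t count[256], i;
    int pass;

    for (pass = 0; pass < 8; pass++) {
        int shift = pass * 8;

        memset(count, 0, sizeof count);
        for (i = 0; i < n; i++)             /* histogram of this byte */
            count[(a[i] >> shift) & 0xff]++;
        for (i = 1; i < 256; i++)           /* prefix sums: end offsets */
            count[i] += count[i - 1];
        for (i = n; i-- > 0; )              /* stable scatter, back to front */
            tmp[--count[(a[i] >> shift) & 0xff]] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}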
