Re: 1 Billion 11 Byte Words... Need to Check Uniqueness with Oracle

From: Ashish Mittal <mittalashish_at_yahoo.com>
Date: Sun, 17 Feb 2002 08:22:25 GMT
Message-ID: <4LJb8.2695$Ot2.241359_at_sccrnsc01>


I think this question should be asked on the SQL Server and Oracle groups simultaneously, otherwise you are likely to get biased advice. However, one thing you might want to consider is that Oracle always does a sort group by, whereas SQL Server is biased towards multi-level hashes. I have been told the latter is better for large data sets.

Ashish
"heppy" <heppy62_at_yahoo.co.uk> wrote in message news:2ebb4802.0202091647.2ae89252_at_posting.google.com...
> Investigate the total cost of ownership of Oracle & SQL Server over
> the lifespan of your use; include hardware, software, peopleware,
> support contracts, scalability, growth, everything. Even then, if SQL
> Server comes out cheaper, buy Oracle :)
>
> I'm presuming you've multiple CPUs and multiple disks as a minimum,
> and you're using 8i or 9i. I'm assuming: a random distribution of
> data values; read-only operations; that the test is purely for
> performance.
>
> I'd use a partitioned table subject to the following reasoning.
> Roughly speaking, if it takes X seconds to check for duplicates in a 1
> million row table it will take twice X seconds to check for duplicates
> in a 2 million row table. Split the table in two partitions, put each
> table partition on different disk subsystems, execute the query in
> parallel on two CPUs, etc., and the time will come down to nearer X
> seconds again, even though you've twice the number of rows. This
> would be equivalent to your option 2) but without having to use
> multiple instances. If you have all your data in one instance you can
> find duplicates in one operation; there's the 'elegance'.
>
> The number of partitions will be a function of the physical disk
> layout and the data values. For example if you've alpha-only data and
> you partition like: part1 has values <'B', part2 has values <'C',
> etc., you'd have 26 partitions, each with 1,000,000,000/26 or approx.
> 38 million rows. If you had 13, 26, etc. disk subsystems then you
> could evenly distribute the datafiles across them in a round-robin
> fashion. It's intuitive that if you double the number of disks you
> have, you can double the I/O bandwidth and consequently (parallel)
> queries will execute more quickly, presuming CPU & memory are unbound;
> in other words the more disks you have, the better.
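>
> As a rough sketch, the DDL might look something like this (the table,
> column and tablespace names are made up; you'd spell out all 26
> partitions and map their tablespaces onto your disk subsystems):
>
>   CREATE TABLE words (
>     word  CHAR(11) NOT NULL
>   )
>   PARTITION BY RANGE (word) (
>     PARTITION p_a VALUES LESS THAN ('B') TABLESPACE ts_a,
>     PARTITION p_b VALUES LESS THAN ('C') TABLESPACE ts_b,
>     -- ... one partition per leading letter ...
>     PARTITION p_z VALUES LESS THAN (MAXVALUE) TABLESPACE ts_z
>   );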
>
> Create each partition with 0% free to ensure maximum density, no
> logging since we're not bothered about recovery, and a degree of
> parallelism commensurate with the number of CPUs you have. Create the
> database with a large block size - 16/32/64KB?? - certainly a multiple
> of your platform's disk I/O buffer size. Also perhaps char(11) may be
> more efficient than varchar2(11), if all your values are always 11
> characters long?
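>
> To the CREATE TABLE above, between the column list and the PARTITION
> BY clause, you'd add something like (the degree of 8 is just a
> stand-in for your CPU count):
>
>   PCTFREE 0
>   NOLOGGING
>   PARALLEL (DEGREE 8)
>
> and db_block_size (16384, say) has to go in the init.ora before you
> create the database, since it can't be changed afterwards.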
>
> The quickest way to get the data into the table would be to use
> SQL*Loader with the direct method. You could parallelise this
> operation by splitting your data into batches that would go into each
> partition, then loading each file concurrently.
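>
> A control file per batch might look something like this (file names,
> credentials and the words table are made up / carried over from the
> sketch above):
>
>   -- words_a.ctl: the batch of words starting with 'A'
>   LOAD DATA
>   INFILE 'words_a.dat'
>   APPEND
>   INTO TABLE words PARTITION (p_a)
>   (word POSITION(1:11) CHAR)
>
> then run one sqlldr session per batch concurrently, e.g.:
>
>   sqlldr userid=scott/tiger control=words_a.ctl direct=true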
>
> I don't know the constraints of index organised tables enough to state
> their applicability here, but if you can partition IOTs then do so.
> Otherwise, create a non-unique index on your 11-char column if you
> have other columns in your table, and analyse for the cost-based
> optimiser. Make the tablespace read-only. Then to determine
> duplicates: select column from partitioned_table group by column
> having count(*) > 1;
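>
> If you do go the index route, spelled out with the names from the
> sketches above (and a made-up index name) that's roughly:
>
>   CREATE INDEX words_ix ON words (word)
>     LOCAL NOLOGGING PARALLEL (DEGREE 8);
>
>   ANALYZE TABLE words ESTIMATE STATISTICS SAMPLE 5 PERCENT;
>
>   ALTER TABLESPACE ts_a READ ONLY;  -- and so on for each tablespace
>
>   SELECT word FROM words
>   GROUP BY word
>   HAVING COUNT(*) > 1;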
>
> In summary, I reckon inserting 1 billion rows into an indexed table
> would be a lot slower than bulk parallel loading them once, creating
> and analysing an index in parallel once, then parallel querying for
> the duplicates.
>
> You'd have to determine optimal values for the size (number of rows)
> in each partition and the degree of parallelism taking into account #
> disks, # CPUs etc. Also size the temporary tablespace, sort area and
> all the usual areas appropriately. With such a large dataset there
> will be opportunities
> to tune performance, so don't expect to get it right first time.
> Prototype with a non-partitioned table and, say, 500k rows.
>
> If you've 10s of gig of memory, you could put the whole DB on a RAM
> disk and forget disk I/O...
>
> Hooty and the Blowfish <bob_at_bob.com> wrote in message
> news:<v3s76uciii2742r4sgm19gruvtn5bcja9p_at_4ax.com>...
> > I have a group that is currently using Oracle but is considering
> > moving to SQL Server to save some money. One of the business cases
> > they're working with is the testing of 1 billion 11 character words
> > for uniqueness. Apparently they've been sold on the idea that SQL
> > Server will rock their monkey.
> >
> > I tend to disagree and believe that Oracle will handle this task much
> > more elegantly.
> >
> > So that I can give this group some guidance: How would you go about
> > testing uniqueness on 1 Billion 11 character words using Oracle?
> >
> > Ideas to get you started:
> > 1) Write small clients to do database inserts and distribute them
> > across the network on small desktops. Run Oracle with one table, one
> > column and one constraint (that it be a primary and unique key).
> > Start the inserts and wait for an exception. You could play with the
> > number of inserts per commit and number of boxes submitting
> > connections to the database.
> >
> > 2) Install Oracle on 10 boxes and split the 1 Billion words into 10
> > segments. Insert the numbers into the database and check for
> > uniqueness. This won't prove uniqueness across the entire set so
> > you'd then have to bulk insert or import all data from each Oracle
> > database into one master database that checks uniqueness. Maybe this
> > would be faster than checking uniqueness on every insert.
> >
> > 3) ??? Any other ideas?
