Re: SQLServer beats 8i finding top-10 duplicates

From: <mtownsen8888_at_my-deja.com>
Date: Thu, 13 Jan 2000 02:20:26 GMT
Message-ID: <85jcos$1qe$1@nnrp1.deja.com>

In article <Yiyd4.2110$v31.211757_at_ptah.visi.com>, Scott Narveson <psnarv_at_visi.com> wrote:
> Brian Peasland <peasland_at_edcmail.cr.usgs.gov> wrote:
> : To me, this is comparing apples to oranges.

<snip>

>
> Clearly the two particulars of the two queries are very different.
>
> However, I am working on an application which must run against both
> SQLServer and Oracle, and which would happily use a decently-
> performing implementation of the basic query posed in my original
> note. Vendor-specific tricks in each case are perfectly acceptable.
> SQLServer has a one that works nicely. I was hoping that some Oracle
> wizard could suggest Oracle-specific tweaks that could get the basic
> question answered more quickly for Oracle.
>
> So far, no takers... :-)
>
> - Scott
>

I spent the last week mulling this one over in some spare time.

Basically I repeated the exercise on 8.1.5 running on my 266 Mhz, 128 Mg RAM (200 Mg swap) NT laptop, using a single NTFS 10 Gig drive. I also used the default install parameters for Oracle8i. Here's my results

The full tablescan in my scenario took about 1 min 40 seconds
Bumping SORT_AREA_SIZE to a more representative size decreased query time to around 45 seconds (one pass sorts)
Altering the table parallel didn't do much on my hardware, but may on your environment if you have multiple CPUs and some available IO bandwidth. Try it with a degree of 4 and see what happens.
Adding a bitmapped index on num avoided the tablescan entirely and reduced the query to 8 seconds. So addition of an index increased query speed by a factor of 12. YMMV

The last result made me a bit suspicious - the figures that you posted would imply that SQLServer 7 is also managing to avoid the table scan somehow.

Logically, there are only two ways of doing this

There is some sort of index present in the SQLServer test. But you specifically said that no index was present - were you using clustered indexes without being aware of this ? If so, that would explain the difference. If that's the case, try asking for the LEAST duplicated rows, and see what happens. I'd be interested in the results.
I think the former is the most likely difference. However, it is possible that SQLServer is storing statistics on the number of duplicated rows for each data value, and is not actually looking at the data at all. While unlikely, this is a pretty keen idea - but only if your statistics are absolutely up to date, and only if they have been collected for each row. If your statistics are out of date, or have been collected with sampling, then using this approach you will get a result - it will just be incorrect.

Mark Townsend
Senior Product Manager
Oracle Corporation

P.S. I'm posting from from deja because I can no longer find the original post on my newserver.

(My Email id for follow ups is mtownsen_at_us.oracle.com)

Sent via Deja.com http://www.deja.com/
Before you buy. Received on Wed Jan 12 2000 - 20:20:26 CST