Re: Suggestions on how to efficiently search a large table
On Nov 15, 12:38 pm, bdbafh <bdb..._at_gmail.com> wrote:
> On Nov 15, 9:35 am, chrism..._at_gmail.com wrote:
>
>
>
> > On Nov 15, 7:42 am, bdbafh <bdb..._at_gmail.com> wrote:
>
> > > On Nov 14, 8:55 pm, chrism..._at_gmail.com wrote:
>
> > > > I have a billion+ row table that I need to do queries on.
>
> > > > The table looks something like this:
>
> > > > item
> > > > id number
> > > > process1_id number
> > > > process2_id number
> > > > process3_id char(35)
>
> > > > I need to perform multiple processes on each item. When I process an
> > > > item, I then set the corresponding column with the result of the
> > > > process.
>
> > > > I need to do queries like the following to determine which processes I
> > > > haven't completed yet:
>
> > > > select * from item where process1_id is null or process2_id is null
> > > > or process3_id is null
>
> > > > Any recommendations on how to index or query such a table to get good
> > > > performance? The cardinality of all the process*_id columns is very
> > > > high.
>
> > > > Thanks.
>
> > > You've provided no information regarding:
> > > - version of Oracle software
> > 10g
> > > - edition of Oracle software
> > 10.2.0.3
> > > - how many client processes will be hammering the table
> > 1
> > > - how the table is being loaded
>
> > loaded from a comma delimited file exported from a SQL Server
> > database
> > > - hardware, OS, clustering, RAC, server color
> > LS41, SE Linux, OCSF, RAC
> > > - desired response time
> > < 2 seconds
>
> > Also, I should have added that I will be limiting the query with a
> > rownum < 1000 in the WHERE clause.
>
> Ok.
> The subject line of this post is too narrowly scoped.
> Your issue is not just searching the table for rows of data to be
> processed; it also includes processing those rows and updating the
> rows once they have been processed.
> You mention that only one instance of a program will be processing
> rows that need to be processed ... but how much do you want to bet
> that somewhere down the road, additional resources will be devoted to
> processing the rows of interest in order to reduce the overall
> processing time?
> A design that works just fine for a single process will likely hit
> concurrency issues once it begins to be multiuser. Partitioning the
> data is probably the way to go.
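>
> As an untested sketch (assuming the Partitioning option is licensed;
> the column list matches your post, the partition count is arbitrary),
> hash partitioning on id would let each future worker take its own
> partition:
>
>   create table item (
>     id          number,
>     process1_id number,
>     process2_id number,
>     process3_id char(35)
>   )
>   partition by hash (id) partitions 8;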
>
> Back to your question -"how to efficiently search a large table?"
>
> To borrow from others:
>
> The fastest way to search a table is to not search it.
>
> In other words you want to perform the search against a smallish index
> to find the candidate rows.
> Your problem definition states that the data is loaded with columns of
> interest as null.
> Oracle b*tree indexes do not index entries in which all key columns
> are null.
> Oracle has support for function-based indexes that could be used to
> provide an access path to rows of interest.
> Rather than repeat content that is posted elsewhere, search for
> articles on the use of function-based indexes in Oracle 10g. You'll
> need to use the same expression in the function-based index as in the
> select statement.
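>
> An untested sketch of the idea (the index name and the exact case
> expression are mine, made up to match your example):
>
>   create index item_pending_ix on item (
>     case when process1_id is null
>            or process2_id is null
>            or process3_id is null
>          then 1 end
>   );
>
>   -- the query must repeat the identical expression to get this
>   -- access path; rows with all three columns filled in produce an
>   -- all-null key and simply drop out of the index
>   select *
>     from item
>    where case when process1_id is null
>                 or process2_id is null
>                 or process3_id is null
>               then 1 end = 1
>      and rownum < 1000;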
>
> Such indexes will add overhead to the loading operation.
> Depending upon the load type, rate and frequency, you might see large
> amounts of undo being read in order to provide a consistent read of
> the index(es).
> Is the same session going to be performing both the inserts and the
> selects/updates?
> If so then concurrency won't be an issue.
> If not, you might want to time things such that loads and processing
> aren't running at the same time in order to minimize response time.
>
> Follow-up questions:
>
> is the Oracle Partitioning option available to you? (if so, use it)
> how many rows per day are inserted?
> what size batch are they inserted as?
> how frequently are the batches inserted?
> will Oracle sql loader be used to perform the inserts in bulk?
> is bulk (array) processing being used if sql loader is not being used?
> how many blocks can be read by the existing database server per
> second?
>
> This won't be just a matter of keeping a static index in the keep pool
> for fast access, as the (function-based) index will be undergoing
> maintenance with each row inserted, updated or deleted ... which will
> generate undo and redo ... and changed blocks which will be written
> out at checkpoints (and at other times, also).
>
> The part that will be funny is that after you have all of this put
> together and meeting your service level agreement for response
> time ... reporting against the data will change and blow your resource
> usage sky high.
>
> good luck. have fun.
>
> -bdbafh
Thanks for all this information. I have actually come up with a pretty
simple way to avoid the problem I was having. The ID column is based on
a sequence, so there are no gaps in the values of this column. What I'm
now going to do is just walk the table using that column.

I now always include "ID between ? and ?" in my where clause, and just
keep track of the values I've already looked at. This query is extremely
fast (because ID is indexed) regardless of the extra criteria I add,
such as "process1_id is null or process2_id is null or process3_id is
null". The downside, of course, is that I will walk through chunks of
rows that have no hits, but the extreme speed of the query makes this
somewhat unimportant.
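Roughly, each chunk looks like this (a sketch; :lo and :hi stand for
the bind values my program advances after each pass):

  select *
    from item
   where id between :lo and :hi
     and (process1_id is null
       or process2_id is null
       or process3_id is null);

Received on Fri Nov 23 2007 - 07:59:28 CST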