Re: Help with sql query speed. Explain plan included

From: Charles Hooper <hooperc2000_at_yahoo.com>
Date: 16 Mar 2007 04:25:02 -0700
Message-ID: <1174044302.421345.162210@n76g2000hsh.googlegroups.com>

On Mar 15, 5:10 pm, "Valentin Minzatu" <valentinminz..._at_yahoo.com> wrote:
> On Mar 15, 1:53 pm, "Charles Hooper" <hooperc2..._at_yahoo.com> wrote:
> > On Mar 15, 9:14 am, JL19..._at_HOTMAIL.CO.UK wrote:
>
> > > Hello,
>
> > > I have a query about how to improve the speed of our sql and am asking
> > > for advice.
>
> > > My real system is complicated and relates to a call center, automated
> > > options chosen, and time on line using an automated telephone system.
> > > Rather than having to describe all the complicated bits, I have made
> > > up a situation using census data instead because we can all understand
> > > it.
>
> > > I have itemized the points of the made up situation. Here they are.
>
> > > 1. The data is census information.
> > > 2. There are 3 million people in the country.
> > > 3. Each person has to fill out a census form.
> > > 4. There are 80 questions on each census form.
> > > 5. Each question on the census form has an answer that is between 1
> > > and 100.
> > > 6. The country residents do not have to answer every question.
> > > 7. If they do not answer a question, a row in the table is not
> > > created.
>
> > > The information is stored in an Oracle database with the following
> > > structure.
>
> > > residentID questionNumber answer
> > > 1 1 10
> > > 1 2 45
> > > 1 3 67
> > > 1 4 55
> > > 2 1 15
> > > 2 2 56
> > > 2 3 12
>
> > > etc
>
> > > We need a query that that reads "give us the resident ID of any person
> > > that has answered more than 10 questions the same as the resident with
> > > ID number 1486674"
>
> > > Here is our sql.
>
> > > select tx.residentid
> > > from test1 t0,test1 tx
> > > where t0.residentid!=tx.residentid
> > > and t0.questionnumber=tx.questionnumber
> > > and t0.answer=tx.answer
> > > and t0.residentid=1486674
> > > group by tx.residentid
> > > having count(*)>10;
>
> > > The problem with this query is that it only returns 177 rows yet takes
> > > "Elapsed: 00:01:10.75" which is about 70 seconds.
>
> > > The only thing that is important to us is query speed. If the data
> > > needs restructuring to get speed then this is OK. But firstly is there
> > > a better SQL to this without restructuring of data.
>
> > > The plan statistics are calculated on 100% of rows in all tables with
> > > cascade at 2am every morning. They are up to date.
>
> > > Here is my plan. Index idxe has been created with "create unique index
> > > idxe on test1(residentid,questionnumber,answer)". Just in case it
> > > matters, there are no other table indexes but can easily add them if
> > > it improves performance.
>
> > > We use 10.2 on a Solaris V245.
>
> > > Thank you for helping in advance.
> > > John
>
> > > --------------------------------------------------------------------------------------
> > > | Id | Operation | Name | Rows | Bytes |TempSpc| Cost
> > > (%CPU)| Time |
> > > --------------------------------------------------------------------------------------
> > > | 0 | SELECT STATEMENT | | 33322 | 715K| | 26691
> > > (7)| 00:05:21 |
> > > |* 1 | FILTER | | | |
> > > | | |
> > > | 2 | HASH GROUP BY | | 33322 | 715K| 21M| 26691
> > > (7)| 00:05:21 |
> > > |* 3 | HASH JOIN | | 666K| 13M| | 24919
> > > (7)| 00:05:00 |
> > > |* 4 | INDEX RANGE SCAN | IDXE | 22 | 242 | | 3
> > > (0)| 00:00:01 |
> > > |* 5 | TABLE ACCESS FULL| TEST1 | 37M| 398M| | 24428
> > > (6)| 00:04:54 |
>
> > > PLAN_TABLE_OUTPUT
> > > ---------------------------------------------------
> > > Predicate Information (identified by operation id):
> > > ---------------------------------------------------
>
> > > 1 - filter(COUNT(*)>10)
> > > 3 - access("T0"."QUESTIONNUMBER"="TX"."QUESTIONNUMBER" AND
> > > "T0"."ANSWER"="TX"."ANSWER")
> > > filter("T0"."RESIDENTID"<>"TX"."RESIDENTID")
> > > 4 - access("T0"."RESIDENTID"=1486674)
> > > 5 - filter("TX"."RESIDENT"<>1486674)
>
> > > 22 rows selected.
>
> > I agree with much of what David Fitzjarrell stated regarding the
> > source of the problem and how possibly to correct the problem. You
> > are running Oracle 10g R2, which is performing a hash group by, so
> > there is no sort required by the GROUP BY. David mentioned the
> > SORT_AREA_SIZE parameter - while there is no sort indicated by the
> > explain plan, the HASH_AREA_SIZE by default is set to twice the
> > SORT_AREA_SIZE, so the parameter does have some relevance to your
> > case. The index on the RESIDENTID column is preventing two full
> > tablescans, but still allows one to occur, which results in 22 rows
> > being hash joined to 37,000,000 rows (average of 11 bytes per row).
> > The resulting hash join returns 666,000 rows and requires 13MB, and
> > during the GROUP BY operation 21MB is spilled to the temporary
> > tablespace to return 33,322 rows.
>
> > Most of the time is likely spent in returning the 37,000,000 rows
> > (likely from disk) and for performing the multi-pass hash join and
> > hash group by. You can set event 10104 generate a trace file to help
> > determine the effective HASH_AREA_SIZE needed to prevent the hashing
> > operations from spilling to disk. Your primary key index on the table
> > is likely RESIDENTID and QUESTIONNUMBER, and it appears that you
> > created another index on the RESIDENTID column. I believe that
> > someone else suggested a composite index on the QUESTIONNUMBER and
> > ANSWER columns - that appears to be a very good idea to reduce the
> > 37,000,000 rows that are entering the hash join (only 2% of the rows
> > would be passed on to the hash join).
>
> > Charles Hooper
> > PC Support Specialist
> > K&M Machine-Fabricating, Inc.- Hide quoted text -
>
> > - Show quoted text -
>
> It may be that the index OP already has in place (residentID
> +questionNumber+answer ) is the best as table access will be
> avoided ... I would test both options and see which one performs
> better with his data. It may also help to create an MV for
> SELECT questionNumber, answer
> FROM testTable
> WHERE residentId = 1486674

I overlooked that the OP created an index on RESIDENTID, QUESTIONNUMBER, and ANSWER columns. It is true that the index helped avoid a table access of up to 80 rows, but those 100 rows are very small compared to the 37,000,000 rows that are being returned from the TEST1 table. Also true that Oracle _could_ use an index skip scan, if forced to do so, to make use of the IDXE index to avoid retrieving the 37,000,000 rows. As David Fitzjarrell pointed out, the OP is asking Oracle to compare with all 3,000,000 people except one, so that would likely be a very inefficient access method considering that the average row size appears to be about 11 bytes (many rows retrieved per each 8KB block read).

I would suggest comparing the performance with one index on the RESIDENTID column and a second, composite index on the QUESTIONNUMBER and ANSWER columns. I am not sure that I understand the materialized view suggestion - should the OP create 3,000,000 materialized views - one for each possible value of RESIDENTID?

Charles Hooper
PC Support Specialist
K&M Machine-Fabricating, Inc. Received on Fri Mar 16 2007 - 06:25:02 CDT