Re: improve cardinality estimate

From: Jonathan Lewis <jlewisoracle_at_gmail.com>
Date: Fri, 19 Mar 2021 09:03:32 +0000
Message-ID: <CAGtsp8=Y8tem8fAWaA0wXipNmkwYVRbhR56yLR8ZJjKJY6BTbQ_at_mail.gmail.com>



Imaginfe a few models of the data - stick with date only to make it easier to think:
e.g. a table with 365 rows, the business processing_date ranges from 1st Jan to 31st Dec, the order_processing_date (opd) also ranges from 1st Jan to 31st Dec. In how many cases is opd > bpd ? Scenario 1
For two rows (1st Jan, 31st Dec) they match for every other row opd = bpd + 1
Scenario 2
For two rows (1st Jan, 31st Dec) they match for every other row opd = bpd - 1
Scenario 3
For every row opd = bpd.

So looking at the stats for two separate columns the optimizer has no idea of the relationship.
Now consider the same scenarios and work out opd - bpd

Scenario 1:  2 rows = 0, 363 rows = +1
Scenario 2:  2 rows = 0, 363 rows = -1
Scenario 3:  365 rows = 0

The pattern is now very oobvious to the optimizer.

The same principal applies with timestamps - you get a range of non-integral values, but it's easy for the optimizer to estimate the fraction where it's greater than zero, zero, or less than zero.

Two general points about your partitioning: you ought to have an explicit to_date() formatting option with your '01-01-1000'; and if you really need to represent a null through a faked value you ought to get it as close to the real data as possible to minimise the probability of the optimizer producing very bad estimates for range-based predicates.

It's also worth thinking about the impact on interval partitioning - if you've set this up interval partitioned by day then you MAY already have used up about 365,000 of the available partitions (1000 years * 365 days) which MIGHT show up some unexpected side effects on (e.g.) the dictionary cache or library cache. (It would also be quite interesting to see what Oracle does about dates in the 11 days when the switch from the Julian to the Gregorian calendars took place.)

Regards
Jonathan Lewis

On Fri, 19 Mar 2021 at 07:59, Laurentiu Oprea <laurentiu.oprea06_at_gmail.com> wrote:

> order_procesing_date and business_procesing_date are timestamps
> partition_date is date virtual column as nvl(
> trunc(business_procesing_date), 01-01-1000)
>
> I`m curios why rewriting instead of a>=b to a-b>=0 helped to get closer to
> cardinality estimate
>
> În joi, 18 mar. 2021 la 16:31, Jonathan Lewis <jlewisoracle_at_gmail.com> a
> scris:
>
>>
>> Are all these dates date-only, or is there a time component allowed ?
>> Is this one partition per day.
>>
>> From what you've said, and assuming YES as the answer to the questions
>> above, creating a virtual column on (order_processing_date -
>> business_processing_date) looks like an obvious thing to do. And then
>> create a (frequency) histogram on the virtual column so that the optimizer
>> can see there are (usually?) only 3 distinct values and the one you're
>> interested in is rare.
>>
>> Regards
>> Jonathan Lewis
>>
>>
>> On Thu, 18 Mar 2021 at 13:52, Laurentiu Oprea <
>> laurentiu.oprea06_at_gmail.com> wrote:
>>
>>> Regarding parallel I`m relaying my affirmation of the fact that I tested
>>> with hint /*+ dynamic_sampling(8) */ (the DS level with parallel hint) and
>>> gives the same estimate.
>>>
>>> Yes there is a relation order_procesing_date between
>>> business_procesing_date -1 and business_procesing_date + 1 and there is no
>>> fooling like null values or year 1900/9999
>>>
>>> The pattern is constant , grouping by partition_date for the last 7
>>> days it ranges from 0 to 130k (out of a total of 24M) rows filtered by
>>> condition order_procesing_date > = business_procesing_date .
>>>
>>> Also the columns are not virtual columns (strange expressions)
>>>
>>> În joi, 18 mar. 2021 la 15:14, Jonathan Lewis <jlewisoracle_at_gmail.com>
>>> a scris:
>>>
>>>> The change for parallel processing is one I'd have to think about - it
>>>> may be something to do with the way the predicates are rewritten for the
>>>> parallel query slaves - so I'll postpone thinking about that bit.
>>>>
>>>> The questions to ask first of all are:
>>>> a) what's the business logic. is there an approximate relationship
>>>> between order_processing_date and business_processing date (e.g. the former
>>>> is usually about 3 days before the latter).
>>>> b) has the optimizer been fooled by a "silly null" effect - like "not
>>>> yet processed" = "31-Dec-9999"
>>>> c) how much does the pattern change over time - e.g. if the partition
>>>> date is sysdate - 1 is this a partition where a lot of rows have not yet
>>>> had business_processing_date set, but if partition date = sysdate - 7 then
>>>> maybe lots of rows now have a business_processing_date set.
>>>>
>>>> Regards
>>>> Jonathan Lewis
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, 18 Mar 2021 at 12:26, Laurentiu Oprea <
>>>> laurentiu.oprea06_at_gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I recently saw a query with a wrong execution plan and looks to be
>>>>> caused by an overestimate of a particular where clause. Ex:
>>>>>
>>>>> select
>>>>> count(*)
>>>>> from
>>>>> orders
>>>>> where
>>>>> partition_date = sysdate-1 and
>>>>> order_procesing_date > = business_procesing_date
>>>>>
>>>>> the count itself is = 42K
>>>>> the count itself without second condition ~ 24M
>>>>> the cardinality estimate ~ 24M
>>>>> the cardinality estimate if I add parallel hint: 1M (this is an
>>>>> interesting one and looks to be related with the fact that parallel queries
>>>>> get DS level 8)
>>>>> the cardinality estimate if I rewrite the condition :
>>>>> order_procesing_date - business_procesing_date >= interval '0' minute
>>>>> :1.4M
>>>>>
>>>>> I haven't tried yet but I suspect that if I create extended stats on
>>>>> expression ( order_procesing_date - business_procesing_date) can get me
>>>>> better results
>>>>>
>>>>> The question is what is the logic behind beter cardinality estimates
>>>>> if I rewrite the where clause?
>>>>> What will be the best approach for this query?
>>>>>
>>>>> Thank you.
>>>>>
>>>>

--
http://www.freelists.org/webpage/oracle-l
Received on Fri Mar 19 2021 - 10:03:32 CET

Original text of this message