RE: scalar subquery costing

Date: Tue, 28 May 2019 19:46:15 +0000

Jonathan,

What do you think it would be lambda if the cost follows the Poisson distribution?

As an alternative, I'd like to offer a formula which approximately matches the data from your experiment.

The linearity for NDV <= 10923 suggests that 10923 values returned by the scalar subquery can be cached simultaneously. Therefore, there isn't any extra cost when matching the cached value in this case.

For NDV > 10923 the probability p of finding the value in the cache can be calculated as follows:

p = 10923/NDV

Consequently, the probability of not finding the result in the cache is:

1-p = 1 - 10923/NDV

The total number of cache misses is:

cache_misses = num_rows * (1 - p) = num_rows * ( 1 - 10923 / NDV )

The total cost of the cache misses can be calculated as follows:

cache_misses_cost = cache_misses * cost_of_scalar_subquery_single_execution.

The total cost is approximately:

total_cost ~ cache_misses_cost + cost_of_filling_the_cache

The cost of filling the cache is the cost to fetch 10923 NDVs. Since we don't have the exact value I'll use the cost of 11000 NDVs from your experiment.

cost_of_filling_the_cache ~ cache_misses_cost + cost(11000)

In our case:

num_rows = 100000
cost_of_scalar_subquery_single_execution = 3

=>

total_cost(NDV) ~ 100000 * ( 1 - 10923/NDV ) * 3 + 34945 = 300000 * ( 1 - 10923/NDV ) + 34945

Let's calculate the cost for different values accross the range:

```total_cost(12) ~ 61870 vs. 59769
total_cost(24) ~ 198407 vs. 196303
total_cost(36) ~ 243920 vs. 241814

```

As we can see, this formula calculates the cost with ~ 97% precision.

I can think of the following refinements:

```1. First and foremost, try out "selection with replacement" expected value calculation. I think "with" replacement would be appropriate here as the values come back after have been evicted from cache.
2. Use cost(10923) instead of the approximation cost(11000).
3. Use the value of cost(10923) from the optimizer trace, to increase the precision.

```

Furthermore, I'd need to automate these calculation to verify the hypothesis across different datasets.

I'm, though, still bothered by the fact that the cost remains constant for NDV > num_rows/2 , as I couldn't think of any plausible explanation

What do you think?

Best regards,

-----Original Message-----
From: Jonathan Lewis <jonathan_at_jlcomp.demon.co.uk> Sent: Dienstag, 28. Mai 2019 17:32

With a couple more tests is looks like

1. There's a magic value of 10,922 rather than 10% where the linear increase in cost stops - this seemed to be fixed for a few tests where the number of rows in table varied from 100,000 to 500,000 - there may be other limits for other ranges of table sizes.
2. From 10,923 to 50% of the rows in the table the change in cost looked roughly Poisson - sudden sharp increase for a small increase in num_distinct then long slow tail.
3. The cost stops growing after num_distinct reaches 50% of num_rows.

Regards
JOnathan Lewis

To: Jonathan Lewis; ORACLE-L
Subject: RE: scalar subquery costing

If we consider cost as a function of the number of distinct values (NDV), we can clearly identify three different subsets of its domain:

1. NDV <= 0.1 * rows_in_table. A purely linear function, probably under the assumption that *all* of the scalar subquery values are being cached.
2. 0.1 * rows_in_table < NDV <= 0.5 * rows_in_table. It's a logarithmically-shaped function with the asymptote ~ f(rows_in_table). The rational is probably: with large NDVs it will be much less likely to find its scalar subquery result in the cache. A pitfall here is that the function seems too steep at the beginning - a small increase in NDV will lead to a huge leap in cost.
3. NDV > 0.5 * rows_in_table. f(NDV) = 267300. A possible assumption in this case: no cache hits at all. Just the value seems a bit odd as the real cost should be by around 10% higher if the results aren't cached.

Best regards,

-----Original Message-----
From: Jonathan Lewis <jonathan_at_jlcomp.demon.co.uk> Sent: Dienstag, 28. Mai 2019 10:22
To: ORACLE-L (oracle-l_at_freelists.org) <oracle-l_at_freelists.org>; Noveljic Nenad <nenad.noveljic_at_vontobel.com> Subject: Re: scalar subquery costing

FYI - a quick test reporting costs for your query as I fake the column stats on t_100K to vary Statement N corresponds to N * 1,000 distinct values, with low_value 1 and high_value N * 1,000. I've picked the cost, cpu_cost, and io_cost for the top line of the execution plan in each case.

Interesting point(a) somewhere between 10 and 12 thousand distinct values, and another when num_distinct = 0.5 * rows in table.

```STATEMENT_ID                      IO_COST    IO_DIFF   CPU_COST   CPU_DIFF       COST
------------------------------ ---------- ---------- ---------- ---------- ----------
1                                   3068             252344831                  3094
2                                   6068       3000  487952031  235607200       6119
3                                   9068       3000  723559231  235607200       9143
4                                  12068       3000  959166431  235607200      12167
5                                  15068       3000 1194773631  235607200      15192
6                                  18068       3000 1430380831  235607200      18216
7                                  21068       3000 1665988031  235607200      21240
8                                  24068       3000 1901595231  235607200      24265
9                                  27068       3000 2137202431  235607200      27289
10                                 30068       3000 2372809631  235607200      30314
11                                 34945       4877 2755835542  383025911      35230
12                                 59769      24824 4705425626 1949590084      60256
13                                 80774      21005 6355078773 1649653147      81432
14                                 98779      18005 7769067185 1413988412      99583
15                                114383      15604 8994523809 1225456624     115314
16                                128036      13653 1.0067E+10 1072274546     129078
17                                140083      12047 1.1013E+10  946124599     141223
18                                150792      10709 1.1854E+10  840999644     152019
19                                160373       9581 1.2606E+10  752473365     161678
20                                168996       8623 1.3284E+10  677226029     170371
21                                176798       7802 1.3896E+10  612728312     178236
22                                183891       7093 1.4453E+10  557025738     185387
23                                190366       6475 1.4962E+10  508588717     191915
24                                196303       5937 1.5428E+10  466206324     197900
25                                201764       5461 1.5857E+10  428909819     203406
26                                206805       5041 1.6253E+10  395916755     208488
27                                211473       4668 1.6620E+10  366589588     213193
28                                215807       4334 1.6960E+10  340404618     217563
29                                219843       4036 1.7277E+10  316928437     221631
30                                223609       3766 1.7573E+10  295799875     225428
31                                227133       3524 1.7849E+10  276716012     228981
32                                230436       3303 1.8109E+10  259421261     232311
33                                233539       3103 1.8353E+10  243698760     235439
34                                236460       2921 1.8582E+10  229363540     238383
35                                239213       2753 1.8798E+10  216257051     241159
36                                241814       2601 1.9002E+10  204242770     243781
37                                244274       2460 1.9196E+10  193202621     246261
38                                246604       2330 1.9379E+10  183034062     248610
39                                248815       2211 1.9552E+10  173647700     250840
40                                250916       2101 1.9717E+10  164965315     252957
41                                252914       1998 1.9874E+10  156918226     254971
42                                254817       1903 2.0024E+10  149445929     256890
43                                256631       1814 2.0166E+10  142494957     258719
44                                258363       1732 2.0302E+10  136017913     260465
45                                260018       1655 2.0432E+10  129972672     262133
46                                261601       1583 2.0556E+10  124321686     263729
47                                263117       1516 2.0675E+10  119031402     265257
48                                264569       1452 2.0790E+10  114071760     266721
49                                265963       1394 2.0899E+10  109415770     268126
50                                267300       1337 2.1004E+10  105039139     269474
51                                267300          0 2.1004E+10          0     269474
52                                267300          0 2.1004E+10          0     269474
...
98                                267300          0 2.1004E+10          0     269474
99                                267300          0 2.1004E+10          0     269474
100                               267300          0 2.1004E+10          0     269474

```

Regards
Jonathan Lewis

From: oracle-l-bounce_at_freelists.org <oracle-l-bounce_at_freelists.org> on behalf of Noveljic Nenad <nenad.noveljic_at_vontobel.com> Sent: 24 May 2019 16:46:30
To: ORACLE-L (oracle-l_at_freelists.org) Subject: scalar subquery costing

The calculated cost for the following query is 262K (on Oracle 12.2):

select /*+ qb_name(QB_MAIN) */
(
select /*+ qb_name(QB_SUBQ) */ count(*)

```              from t_1k
where t_1k.n1 = t_100k.n1
```

)
from t_100k ;

Plan Table
```--------------------------------------+-----------------------------------+
| Id  | Operation           | Name    | Rows  | Bytes | Cost  | Time      |
--------------------------------------+-----------------------------------+
| 0   | SELECT STATEMENT    |         |       |       |  262K |           |
| 1   |  SORT AGGREGATE     |         |     1 |     4 |       |           |
| 2   |   TABLE ACCESS FULL | T_1K    |     1 |     4 |     3 |  00:00:01 |
```
| 3 | TABLE ACCESS FULL | T_100K | 98K | 488K | 69 | 00:00:01 |
```--------------------------------------+-----------------------------------+
```
Predicate Information:

2 - filter("T_1K"."N1"=:B1)

From the optimizer trace we can get more precise cardinalities and costs:

Final cost for query block QB_SUBQ (#0) - All Rows Plan:   Best join order: 1
Cost: 3.009811 Degree: 1 Card: 1.000000 Bytes: 4.000000

Final cost for query block QB_MAIN (#0) - All Rows Plan:   Best join order: 1
Cost: 268174.664510 Degree: 1 Card: 100000.000000 Bytes: 500000.000000

Table: T_100K Alias: T_100K
Card: Original: 100000.000000 Rounded: 100000 Computed: 100000.000000 Non Adjusted: 100000.000000 ...

Access Path: TableScan
Cost: 68.697001 Resp: 68.697001 Degree: 0       Cost_io: 68.000000 Cost_cpu: 16737631

In theory, the total cost should be caculated as follows. cardinality(QB_MAIN) * cost(QB_SUBQ) + cost(full table scan T_100K) = 100000 * 3.009811 + 68.69701 = 301049.79701

How the optimizer came up with 268174.664510?

Best regards,

Please consider the environment before printing this e-mail. Bitte denken Sie an die Umwelt, bevor Sie dieses E-Mail drucken.

Important Notice

Important Notice

```--