The "select with/without replacement" was just another guess about the formula. An argument for "select with replacement" being relevant somewhere is that you might need to consider the other end of the subquery where you have 1,000 distinct values, and you want to estimate how many of them might have been selected by the time 100,000 executions of the subquery have executed. (Because that's how many of them could have been cached).
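For reference, the "select with replacement" estimate described above can be sketched as follows. This is only an illustration of the textbook formula, assuming every one of the distinct values is equally likely on each execution; the function name `expected_distinct` is made up for this sketch, and nothing here is confirmed to be the formula the optimizer actually uses.

```python
def expected_distinct(ndv: int, draws: int) -> float:
    """Expected number of distinct values seen after `draws` uniform
    draws WITH replacement from a population of `ndv` distinct values.

    Assumption (not from the optimizer trace): each value is equally likely.
    """
    # Probability that one particular value is never drawn: (1 - 1/ndv) ** draws
    return ndv * (1.0 - (1.0 - 1.0 / ndv) ** draws)


if __name__ == "__main__":
    # 1,000 distinct values, 100,000 executions of the subquery
    print(expected_distinct(1000, 100_000))
    # Same population, but only 1,000 executions
    print(expected_distinct(1000, 1000))
```

Under this model practically all 1,000 values would have been seen (and so could have been cached) long before the 100,000th execution, whereas after only 1,000 executions roughly 63% of them would have been.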

Yes, you're right (as always). The cost indeed depends on the number of distinct values in t_100k.

The costs for the two boundary cases are:
1. NDV=100k: 262K
2. NDV=1: 72

It should be "select without replacement", shouldn't it? That's because the t_100k values from step 3 "are not returned into the urn" once they are matched with t_1k. But I'll do more tests and try to figure out exactly which formula is used.

Would you happen to know why the optimizer calculated 262K for NDV=100k? I mean there shouldn't be any caching there as the values don't repeat.

The tables are created as follows:

create table t_1k ( n1 integer ) ;

create table t_100k ( n1 integer ) ;

insert into t_1k
select level
from dual
connect by level <= 1000;

insert into t_100k
select level
from dual
connect by level <= 100000;

commit ;

begin
  dbms_stats.gather_table_stats ( null, 'T_1K') ;
  dbms_stats.gather_table_stats ( null, 'T_100K') ;
end ;
/

The calculated cost for the following query is 262K (on Oracle 12.2):

select /*+ qb_name(QB_MAIN) */
  (
    select /*+ qb_name(QB_SUBQ) */ count(*)
    from t_1k
    where t_1k.n1 = t_100k.n1
  )
from t_100k ;

Plan Table
```
--------------------------------------+-----------------------------------+
| Id  | Operation           | Name    | Rows  | Bytes | Cost  | Time      |
--------------------------------------+-----------------------------------+
| 0   | SELECT STATEMENT    |         |       |       |  262K |           |
| 1   |  SORT AGGREGATE     |         |     1 |     4 |       |           |
| 2   |   TABLE ACCESS FULL | T_1K    |     1 |     4 |     3 |  00:00:01 |
| 3   |  TABLE ACCESS FULL  | T_100K  |   98K |  488K |    69 |  00:00:01 |
--------------------------------------+-----------------------------------+
```
Predicate Information:

2 - filter("T_1K"."N1"=:B1)

From the optimizer trace we can get more precise cardinalities and costs:

Final cost for query block QB_SUBQ (#0) - All Rows Plan:
  Best join order: 1
  Cost: 3.009811 Degree: 1 Card: 1.000000 Bytes: 4.000000

Final cost for query block QB_MAIN (#0) - All Rows Plan:
  Best join order: 1
  Cost: 268174.664510 Degree: 1 Card: 100000.000000 Bytes: 500000.000000

Table: T_100K Alias: T_100K
Card: Original: 100000.000000 Rounded: 100000 Computed: 100000.000000 Non Adjusted: 100000.000000 ...

Access Path: TableScan
Cost: 68.697001 Resp: 68.697001 Degree: 0       Cost_io: 68.000000 Cost_cpu: 16737631

In theory, the total cost should be calculated as follows:

cardinality(QB_MAIN) * cost(QB_SUBQ) + cost(full table scan T_100K) = 100000 * 3.009811 + 68.697001 = 301049.797001

How did the optimizer come up with 268174.664510?
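To put a number on the gap, here is a small sketch that redoes the arithmetic with the figures from the trace above and then inverts it. It assumes, purely for illustration, that the optimizer's total is still "executions * cost(QB_SUBQ) + cost(full table scan T_100K)"; the variable names are my own, and the linear model itself is a guess rather than something the trace confirms.

```python
# Figures copied from the optimizer trace quoted in this post.
card_main = 100_000            # Card of QB_MAIN
cost_subq = 3.009811           # final cost of QB_SUBQ
cost_scan = 68.697001          # full table scan of T_100K
cost_reported = 268174.664510  # final cost of QB_MAIN

# Naive model: one subquery execution per row of T_100K.
cost_naive = card_main * cost_subq + cost_scan

# Hypothetical: number of subquery executions that would explain the
# reported cost if the same linear model applied (an assumption).
implied_execs = (cost_reported - cost_scan) / cost_subq

print(f"naive cost:         {cost_naive:.6f}")
print(f"reported cost:      {cost_reported:.6f}")
print(f"implied executions: {implied_execs:.2f}")
```

Under that assumption the reported cost corresponds to roughly 89,077 executions of the subquery rather than 100,000, i.e. about 89% of the rows of T_100K.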

Best regards,

Nenad

