[Advanced] The optimizer could not to it better ?

From: Hugo <hugo_at_nospam.invalid>
Date: Fri, 10 Apr 2009 10:51:38 +0200
Message-ID: <49df0bec$0$5079$426a34cc_at_news.free.fr>



Hi there,

This post may be clumsy, it's not so easy to be concise and precise at the same time, but I really need an advice here.

I crosspost with comp.databases.theory since I'm not sure the problem is related to MySQL only.

*Summary*

I have a remark regarding the "ref" join type and the "in" operator. In my opinion it appears to have some important limitations. But since a similar behavior occurs with other RDMS such as Oracle, I guess there is nothing that could be done to improve that scenario ?

*Presentation*

Server version: 5.1.32-community-log MySQL Community Server (GPL) Engine: InnoDB

I have two tables, entity and positionedelement:

> desc entity;

 Field Type Null Key Default Extra

  • ------------ ------- ------ ---------- -------- uidEntity int(11) NO PRI (null) gid bigint(20) NO UNI (null) entityName varchar(255) NO (null)

> show /*formatted for 80 col*/ create index from entity;

Non_unique Key_name Seq_in_index Column_name Cardinality ----------- ------------ ------------- ------------- ------------

0            PRIMARY       1              uidEntity     3199989
1            fullIdx       1              gid           3199989
1            fullIdx       2              uidEntity     3199989

> desc positionedelement;

 Field                Type        Null     Key     Default     Extra
 -------------------  ----------  -------  ------  ----------  --------
 idPositionedElement  int(11)     NO       PRI     (null)
 startPosition        int(11)     NO               (null)
 endPosition          int(11)     NO               (null)
 uidEntity            int(11)     NO       MUL     (null)
 idKnowledgeSet       int(11)     NO       MUL     (null)

> show /*formatted for 80 col*/ create index from positionedelement;

Non_unique Key_name Seq_in_index Column_name Cardinality

----------  ----------  -------------  -------------------  -----------
0           PRIMARY     1              idPositionedElement  19014004
1           pEntityIdx  1              uidEntity            6338001
1           pEntityIdx  2              idKnowledgeSet       19014004
1           ksIdx       1              idKnowledgeSet       1358143

The idea is that positionedelement references an entity (using uidEntity).

Some hints that may help understand further:

  1. The "join repartition" is very disparate, ie some entities are not referenced by any positionedelement, while some other may have 200.000 references.
  2. The mean expected matching value for a random value of uidEntity is around 2 or 3:

mysql> select count(*) from entity\G

*************************** 1. row ***************************
count(*): 3264185

3) For some strange reason, there is no actual FK constraint between positionedelement.uidEntity and entity.uidEntity but I'm not sure it have any influence here.

Ok now some queries:

> explain select count(0) from entity where uidEntity = 62\G

  • 1. row *************************** id: 1 select_type: SIMPLE table: entity type: const possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: const rows: 1 Extra: Using index

> explain select count(0) from entity where uidEntity in (62,64)\G

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: entity
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 2
        Extra: Using where; Using index

Ok, uidEntity is the PK, no problem here, there will be 2 rows returned.

> explain select count(0) from positionedelement where uidEntity=(62)\G

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: positionedelement
         type: ref
possible_keys: pEntityIdx
          key: pEntityIdx
      key_len: 4
          ref: const
         rows: 24840
        Extra: Using index

> select count(0) from positionedelement where uidEntity = (62)\G

*************************** 1. row ***************************
count(0): 98431

> explain select count(0) from positionedelement where uidEntity=(63)\G

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: positionedelement
         type: ref
possible_keys: pEntityIdx
          key: pEntityIdx
      key_len: 4
          ref: const
         rows: 1
        Extra: Using index

> select count(0) from positionedelement where uidEntity = (63)\G

*************************** 1. row ***************************
count(0): 0

> explain select count(0) from positionedelement where uidEntity=(64)\G

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: positionedelement
         type: ref
possible_keys: pEntityIdx
          key: pEntityIdx
      key_len: 4
          ref: const
         rows: 13364
        Extra: Using index

> select count(0) from positionedelement where uidEntity = (64)\G

*************************** 1. row ***************************
count(0): 14909

> explain select count(0) from positionedeleent where uidEntity in (62,64)\G

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: positionedelement
         type: range
possible_keys: pEntityIdx
          key: pEntityIdx
      key_len: 4
          ref: NULL
         rows: 38204
        Extra: Using where; Using index

So everything is correct up to now. The optimizer is able to get a good evaluation for different value of uidEntity, and is even able to do it with an in operator, which is really great !

*The problem*

The previous query is a real use case for me, but I also need to get some extra informations stored in the entity table (like the entityName column). So the idea is just to make a straight inner join and let it be:

> explain select count(0) from positionedelement pe inner join entity e
on e.uidEntity = pe.uidEntity where pe.uidEntity in (62,64)\G

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: e
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 2
        Extra: Using where; Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: pe
         type: ref
possible_keys: pEntityIdx
          key: pEntityIdx
      key_len: 4
          ref: basf.e.uidEntity
         rows: 3
        Extra: Using index

> select count(0) from positionedelement pe inner join entity e on
e.uidEntity = pe.uidEntity where pe.uidEntity in (62,64)\G

*************************** 1. row ***************************
count(0): 113340

Geez !

The execution plan stay the same *BUT* the cost evaluation is totally different, and in some extend, completely wrong. The optimizer takes the average matching rows (3 here), instead of the more precise evaluation is able to do when there is no join as seen previously (38204)

And this cause lot's of havoc in my application where this query is a sub part of a more complex one, that will be poorly executed because of the somehow wrong evaluation.

Note that it works perfectly with the "=" operator like here:

> explain select count(0) from positionedelement pe inner join entity e
on e.uidEntity = pe.uidEntity where pe.uidEntity = (62)\G

  • 1. row *************************** id: 1 select_type: SIMPLE table: e type: const possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: const rows: 1 Extra: Using index
  • 2. row *************************** id: 1 select_type: SIMPLE table: pe type: ref possible_keys: pEntityIdx key: pEntityIdx key_len: 4 ref: const rows: 24840 Extra: Using index 2 rows in set (0.05 sec)

I may understand that using the "in" operator is somehow more challenging for the optimizer than a simple "=". But since it's able to get it right without the join, it really is disappointing that it can't handle a simple ref join, don't you think ?

Is it a structural consequence of the way the engine is built that can't be overcome, or do you rather consider that as a temporary limitation that may disappear in future release of MySQL ?

Am I missing something that may help me around this problem ? For the moment I force the execution plan using an hint, which is risky if the actual data repartition completely change and invalidate this forced plan.

The only workaround I could think of, is to iterate on the values in the in, and for each of them using the "=" operator. This is not very satisfying, since I may sometimes have 40 to 200 values...

Thanks for any help in understanding this problem.

(fu2)

-- 
Hugo
Received on Fri Apr 10 2009 - 10:51:38 CEST

Original text of this message