Re: Query Challenge

From: Aamir Baig <abaig_at_etilize.com>
Date: 27 May 2003 04:10:18 -0700
Message-ID: <18456fd3.0305270310.299d356d_at_posting.google.com>


Thank you Cim for your reply,

The distinct cardinality of EP.KEY would be around 3000. The distinct cardinality of E.EID would be around 1 million. Each EID in E would have values for I want to say approx 100 keys, so expected cardinality of EP.Value would be around 100 million. (Not unique of-course, two EIDs might have the same value for a given key).

This generalized problem has a lot of applications in data searching, e.g If I have a catalog of IT products with every atomic detail of a product and I want to search it, the same scenario would be applicable.

I tried the value,key index too, no difference at all, and it shouldn't theoretically cause the index is formed by concatenating key-value or value-key, in each case, the number of resulting nodes should be the same. Keys are normally numbers (ints) and values are text.

Since the number of keys as you can see are not small, the star-join approach you mentioned I believe won't work too well. Also, the number of keys vary as the data gets updated.

Regards,
Aamir
On Tuesday, May 27, 2003, at 02:43 PM, cimarron_at_taylors.org wrote:

Just some thoughts about your problem,

What is the cardinality of the EP.KEY and EP.VALUE columns?

Does a search on a single VALUE run fast enough? e.g.

  Select E.*
    From Entity E, EntityProperties EP
   Where E.EID = EP.EID

     AND EP.VALUE = value1 
     AND EP.KEY = key1

Is there any correlation between the values in EP.VALUE for different EP.KEYS?

If not, then you may find an index on (value, key) to be more effective than one on (key,value).

Have you considered partitioning your EntityProperties into separate tables, each with a distinct subset of the KEY values, and then joining them?

If you have very few actual values for EP.KEY, you could use Codd's RM/T model and assign one table for each key value. This means a tuple (ID,K,V) in EntityProperties becomes a tuple (ID,V) in the K table. Your query would become

  Select E.*
    From Entity E, KEY1 K1, KEY2 K2, KEY3 K3, KEY4 K4

   Where E.EID = K1.EID AND K1.VALUE = value1 AND
         E.EID = K2.EID AND K2.VALUE = value2 AND
         E.EID = K3.EID AND K3.VALUE = value3 AND
         E.EID = K4.EID AND K4.VALUE = value4

Many optimizers will recognize this as a star-join.

If you try this scheme, I would also recommend you index the KEY tables on (value, eid) if the cardinality of your values is high.

This should run much faster as each of the indices on the KEY tables should be much smaller than the one on the EntityProperties table.

Hope this helps,

Cim

abaig_at_etilize.com (Aamir Baig) wrote in message news:<18456fd3.0305262003.d50122f_at_posting.google.com>...
> Hello All,
> I am trying to build a parametric search engine on a DB that does not
> allow sub-selects (MySQL, I wouldn't want to use sub-selects either
> for performance reasons). The problem I believe can be essentially
> generalized as follows:
>
> Have a table named Entity (PK EntityID) and another table called
> EntityProperties (PK EPID, Key varchar, Value varchar). EP stores
> key/value proeprties of E.
>
> What I would like to be able to do is to return a list of Entities
> that match a given set of key/value pairs. The query in itself is
> simple:
>
> Lets say I have 4 key value pairs (key1, value1) , (key2, value2),
> (key3, value3), (key4, value4) entered for search. The resulting
> search query would look like:
>
> Select E.*
> From
> Entity E, EntityProperties EP
> Where
> E.EID = EP.EID AND
> (
> (EP.KEY = key1 AND EP.VALUE = value1) OR
> (EP.KEY = key2 AND EP.VALUE = value2) OR
> (EP.KEY = key3 AND EP.VALUE = value3) OR
> (EP.KEY = key4 AND EP.VALUE = value4) OR
> )
> GROUP BY E.EID
> Having count(E.EID) >= 4;
>
> The count is 4, as there are 4 matching conditions required.
>
> The problem comes in when I have like 200,000 rows in E and 2 million
> rows in EP, the query above just crawls to like 20 secs average.
>
> So I create a 2 column index on EP (Key, value), and the query does
> become faster, comes down to around 7 secs avg.
>
> But a) This is still not fast enough, I would like it to be under a
> second, and b) I expect my EP to grow upto around 20-30 millions rows,
> so it certainly won't scale too well.
>
> I did think about creating key-value pair keywords, creating a column
> in E to store them and then create a FULL-TEXT index and use a BOOLEAN
> MODE search, however, In the key/value match conditions I use above, I
> might also need comparison operators..i-e something like:
> EP.KEY = key1 and EP.VALUE > value1
> and that the current mysql full-text implementation does not allow.
>
> I was also wondering if hash indexes can be used somehow, But I have
> no idea a) how they work in and b) whether they would be useful.
>
> Another issue that I have not yet mentioned is that the key/value
> pairs can be duplicated . I-e an Entity can have multiple values for a
> key. This makes the count() > = no of conditions check invalid. Now,
> I'm really breaking my brain trying to figure out what's the
> accurate/fastest way to do this.
>
> I would appreciate any advice/guidance in optimizing this query or a
> totally different solution that accomplishes the same result and
> showing me how to do this right.
>
> Thank you,
> Aamir Baig
Received on Tue May 27 2003 - 13:10:18 CEST

Original text of this message