Re: Indexing a char column

From: Mladen Gogala <mgogala_at_yahoo.com>
Date: Fri, 13 Mar 2015 18:50:33 -0400
Message-ID: <550369B9.6050402_at_yahoo.com>



Comparing large VARCHAR2 columns is a tricky issue. I would suggest using the DBMS_OBFUSCATION_TOOLKIT.MD5 function, which will almost certainly return different checksums for different input strings (MD5 collisions are theoretically possible but vanishingly unlikely for this purpose; on newer releases DBMS_CRYPTO.HASH is the preferred replacement). Oracle Text indexes may be another viable solution.
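A minimal sketch of that approach, assuming a table SOME_TABLE with a long column BIG_COL (both names hypothetical). Note that the packaged function cannot be referenced directly in a function-based index, so it needs a DETERMINISTIC PL/SQL wrapper:

```sql
-- Hypothetical wrapper; names are illustrative, not from the thread.
CREATE OR REPLACE FUNCTION md5_of (p_val IN VARCHAR2)
  RETURN VARCHAR2 DETERMINISTIC
IS
BEGIN
  -- RAW overload of MD5, hex-encoded so the result is index-friendly.
  RETURN RAWTOHEX(
           DBMS_OBFUSCATION_TOOLKIT.MD5(input => UTL_RAW.CAST_TO_RAW(p_val)));
END;
/

-- Function-based index on the checksum instead of the long column itself.
CREATE INDEX some_table_md5_ix ON some_table (md5_of(big_col));

-- Queries must then compare through the same expression to use the index:
SELECT * FROM some_table WHERE md5_of(big_col) = md5_of(:search_value);
```

The index entries stay small (32 hex characters) regardless of how long the column values are, at the cost of an extra function call per comparison.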
Regards,

On 03/13/2015 06:41 PM, Cee Pee wrote:
>
> Thanks to everyone who answered. I have been too busy before to get
> back quickly. Jonathan was spot on when he said it was a web
> application. Sorry about not being open with table names; restrictions
> with the project. Here is the plan with autotrace:
>
> Execution Plan
> ----------------------------------------------------------
> Plan hash value: 2021814760
>
> ------------------------------------------------------------------------------------------------
> | Id  | Operation                     | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
> ------------------------------------------------------------------------------------------------
> |   0 | SELECT STATEMENT              |                |     1 |   206 |  4723   (2)| 00:00:57 |
> |   1 |  HASH UNIQUE                  |                |     1 |   206 |  4723   (2)| 00:00:57 |
> |*  2 |   HASH JOIN                   |                |     1 |   206 |  4722   (2)| 00:00:57 |
> |   3 |    TABLE ACCESS BY INDEX ROWID| XA             |     1 |   103 |     4   (0)| 00:00:01 |
> |*  4 |     INDEX RANGE SCAN          | I1394550123266 |     1 |       |     3   (0)| 00:00:01 |
> |   5 |    TABLE ACCESS FULL          | XA             |   888K|   87M |  4707   (1)| 00:00:57 |
> ------------------------------------------------------------------------------------------------
>
> Predicate Information (identified by operation id):
> ---------------------------------------------------
>
> 2 - access("XA"="XA")
> 4 - access("X"='1234567891011')
>
>
> The stats are a bit old; the table has close to a million rows now.
> Testing is paused temporarily for an unrelated reason, and I have not
> yet been able to find out how big the table will grow.
>
> There are two indexes currently on the table: a single-column index on
> a number column (not in this query), and a two-column index on column
> 'X' and a third column. Jonathan's idea of building an index on both
> columns in the query so as to skip the table looks brilliant to me,
> especially given that the query is executed so many times.
>
> The column values all share the same first 10 characters. Here is
> an example of a couple of rows:
>
> siteAlias:G5TdJiXR4fRpJM2yvcdSy9JDZzsRcrjBsp8hBTfxThLYvT04KnkW!1121631113!1415190289822
> siteAlias:RRcQJgXSblWT0MJt9BSC13RMTPy7JrhGpRGr7ZbrlWlFnqBgfcn0!1121631113!1415190290627
> siteAlias:flBxJhXSJmGtcZk68T3ZvnZfMJqBWW970vvQ6qMXRWKnGQlB9v3R!1121631113!1415190290628
>
> Can you please explain why the first 7 characters being the same
> would mess up the cardinality estimates? Is the column not indexed
> to its whole length?
>
>
>
>
> On Wed, Mar 11, 2015 at 4:01 AM, Jonathan Lewis
> <jonathan_at_jlcomp.demon.co.uk <mailto:jonathan_at_jlcomp.demon.co.uk>> wrote:
>
>
> Given the nature of your query, the first thing to do is check the
> execution plan. If the query can be made efficient by indexing
> then you will need two indexes on the table, one on the X column
> to drive the subquery and one on the XA column to be the target of
> the values identified by the subquery.
>
> As it stands it's possible (if the XA index already exists) that
> the tablescan is the outer select and the optimizer is using the
> subquery as a filter subquery (or maybe semi-join) through an
> index on XA.
>
> If the table is only 1M rows, and you're adding 20 to 40 rows in
> a few seconds, then you seem to be growing the table at the rate
> of around 1M rows per week (ballpark) - so you must also have a
> process that is deleting those rows; I'd guess that this table is
> probably keeping track of web-connections in some way and that
> rapid access to this table is important to the web-users; that
> being the case, and given the "near uniqueness" of the column I
> think I would index it; if your modified SQL is close to the truth
> I'd index (x, xa) so that the subquery could drive through an
> index range scan and avoid the table. The one special thing I
> would consider is whether or not to create a reverse key index:
> when I say this, I'm guessing that the table is a fairly fixed
> size with balanced inserts and deletes, and I'm guessing that the
> X column might be a constructed value that starts with a time or
> sequence-based component; if the latter (particularly) is not true
> and the data values arrive in a completely random order then
> reverse key won't be of any benefit; if my guesses happen to be
> right you'll be minimising the impact of a bug in the handling of
> leaf-block splits. (
> https://jonathanlewis.wordpress.com/2009/09/19/index-itls/ )
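>
> In SQL terms, using the anonymised names from the thread (adjust to
> the real schema), the two variants above would be something like:
>
> ```sql
> -- Sketch only; table/column names come from the anonymised example.
> -- Covers the subquery so it can be answered from the index alone:
> CREATE INDEX xa_x_xa_ix ON xa (x, xa);
>
> -- Reverse key variant, if the leading part of the values really is
> -- time- or sequence-based and leaf-block contention is a concern:
> CREATE INDEX xa_x_xa_ix ON xa (x, xa) REVERSE;
> ```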
>
> The fact that the column averages 85 to 90 characters is a little
> undesirable - but not a disaster. On a related note, though, if
> many of the values look similar across the first 6 or 7 characters
> the optimizer could get really messed up with its cardinality
> estimates and produce massive over-estimates of the expected row
> counts.
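>
> A quick way to see what the optimizer is working with (a query
> against the standard dictionary view; the table/column names are the
> thread's anonymised ones). In 11.2 histogram endpoint values only
> reflect roughly the first 32 bytes of a string, which is why long
> shared prefixes can distort the estimates:
>
> ```sql
> SELECT num_distinct, density, histogram
> FROM   user_tab_col_statistics
> WHERE  table_name  = 'XA'
> AND    column_name = 'XA';
> ```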
>
>
> Regards
> Jonathan Lewis
> http://jonathanlewis.wordpress.com
> _at_jloracle
>>
>>
>>
>> On 10.03.2015 17:52, Cee Pee wrote:
>>>
>>> It is a varchar2 column with max length of 150. I checked
>>> and the lengths of strings stored currently vary from 85 to
>>> 90 characters distributed evenly across the values in about
>>> 1M rows. The sql in modified form with table name and col
>>> name changed:
>>>
>>> SELECT DISTINCT x FROM xa WHERE xa IN
>>> (SELECT xa FROM xa WHERE x IN ('123456789101112'))
>>>
>>> The table name and column name in the predicates are the same; the
>>> subquery selects from the same user.table as the main query. The
>>> same SQL is executed several hundred times during the peak hours.
>>>
>>>
>>> On 10.03.2015 12:22, Cee Pee wrote:
>>>
>>> List
>>>
>>> I see a table being hit and queried a few thousand
>>> times over peak hours. The table has more than a
>>> million rows and grows while operational. I did not
>>> measure the growth fully, but based on monitoring
>>> for several minutes it is adding about 20 to 40
>>> rows in a few seconds when I was checking via
>>> SQL*Plus. I don't see this growth most of the time,
>>> though. It is a small table with 6 columns, and the
>>> app is a web application. There is also another SQL
>>> statement, run constantly, that accesses rows based
>>> on one of the char columns, which is 150 characters
>>> long; there is no index on that column, which is
>>> causing a tablescan. Is this column a bad candidate
>>> for indexing? Is there a rule-of-thumb length for
>>> char columns above which adding an index is
>>> considered moot? v11.2.
>>>
>>> CP.
>>>
>>>
>>>
>>> --
>>>
>>>
>>>
>>>
>>>
>>> ---
>>> Diese E-Mail wurde von Avast Antivirus-Software auf
>>> Viren geprüft.
>>> http://www.avast.com
>>>
>>> --
>>> http://www.freelists.org/webpage/oracle-l
>>>
>>>
>>>
>
>
>

-- 
Mladen Gogala
Oracle DBA
http://mgogala.freehostia.com


--
http://www.freelists.org/webpage/oracle-l
Received on Fri Mar 13 2015 - 23:50:33 CET
