Re: Creating Oracle Text Indexes

From: Bobby Z. <vladimir.zakharychev_at_gmail.com>
Date: Thu, 5 Mar 2009 10:29:37 -0800 (PST)
Message-ID: <adff2a9a-8424-4dba-a8cc-43323e82c21e_at_a12g2000yqm.googlegroups.com>



On Mar 5, 2:20 am, Jeremy <jeremy0..._at_gmail.com> wrote:
> Hi this is on 10gR2 - we have a table with around 240,000 documents
> (word, pdf, html).
>
> We have a text index created on it using the following settings
> (generated by executing the procedure):
>
> ---------------------------------------------------------
>
> create or replace procedure wd_rebuild_cv_text_index_blob
>     (p_tablespace   in      varchar2 default 'WD_I')
> is
>
> begin
>
>   begin
>     execute immediate 'begin ctx_ddl.drop_preference
> (''wd_cand_doc_lob_store''); end; ';
>   exception
>     when others then
>       null;
>   end;
>
>   begin
>     execute immediate 'drop index wd_cand_doc_lob_text';
>   exception
>     when others then
>       null;
>   end;
>
>   execute immediate 'begin ctx_ddl.create_preference
> (''wd_cand_doc_lob_store'', ''BASIC_STORAGE''); end; ';
>
>   execute immediate 'begin ctx_ddl.set_attribute
> (''wd_cand_doc_lob_store'', ''I_TABLE_CLAUSE'',
>                         ''tablespace '||p_tablespace||' storage (initial
> 500m next 10m maxextents 200)''); end; ';
>   execute immediate 'begin ctx_ddl.set_attribute
> (''wd_cand_doc_lob_store'', ''K_TABLE_CLAUSE'',
>                         ''tablespace '||p_tablespace||' storage (initial
> 1K)''); end; ';
>   execute immediate 'begin ctx_ddl.set_attribute
> (''wd_cand_doc_lob_store'', ''R_TABLE_CLAUSE'',
>                         ''tablespace '||p_tablespace||' storage (initial
> 1K)''); end;';
>   execute immediate 'begin ctx_ddl.set_attribute
> (''wd_cand_doc_lob_store'', ''N_TABLE_CLAUSE'',
>                         ''tablespace '||p_tablespace||' storage (initial
> 1K)''); end;';
>   execute immediate 'begin ctx_ddl.set_attribute
> (''wd_cand_doc_lob_store'', ''I_INDEX_CLAUSE'',
>                         ''tablespace '||p_tablespace||' storage (initial
> 100m next 10m maxextents 500)''); end; ';
>   execute immediate 'create index wd_cand_doc_lob_text
>                      on wd_cand_doc_lob (doc_blob)
>                      indextype is ctxsys.context
>                      parameters ( ''TRANSACTIONAL SYNC(EVERY "SYSDATE+
> 15/1440") storage wd_cand_doc_lob_store'' )';
> exception
>   when others then
>     dbms_output.put_line('error');
>     dbms_output.put_line(sqlerrm(sqlcode));
> end;
> /
> sho err
> ---------------------------------------------------------
>
> We have run
>
> SELECT AVG(COUNT(*))
> FROM DR$wd_web_pages_imt$I
> GROUP BY TOKEN_TEXT HAVING COUNT(*) > 1;
>
> The documentation says anything beyond 10 would be an indication that
> you should refresh the index. On the database the score is a staggering
> 31.8.
>
> Using ctx_report.index_stats we see:
>
> indexed documents:                                            234,324
> allocated docids:                                             238,845
> $I rows:                                                   16,635,689
>
> total size of $I data:                        371,884,061 (354.66 MB)
>
> $I rows:                                                   16,635,689
> estimated $I rows if optimal:                               1,976,853
> estimated row fragmentation:                                     88 %
>
> garbage docids:                                                 4,521
> estimated garbage size:                           5,842,571 (5.57 MB)
>
> now this does not look good.
>
> Index query performance is extremely poor.
>
> Could anyone suggest some "best practise" for how we might be able to
> set the various preferences to create the indexes with minimal
> fragmentation?
>
> The server on which we are running this is running Linux RHEL4 with 4GB
> RAM.
>
> --
> jeremy

First of all, the stats from SELECT AVG(...) seem to be for a different index - WD_WEB_PAGES_IMT. Not sure what you ran ctx_report.index_stats() against, but if you just built your index, you doubtly would end up with fragmentation and garbage percentage like cited. For example, there should be zero, or around zero, garbage docids for new index. I don't think the issue is with index parameters. At least your index parameters seem fine to me, except the part where you limit number of extents for $I table and its index (what if one day your index will grow beyond that size?) and specify initial extent sizes of 1K (Oracle will allocate more anyway,) but storage does not affect text index fragmentation anyway.

Did you consider running ctx_ddl.optimize_index() at REBUILD or FULL level regularly (especially if your data is volatile and index fragmentation/garbage percentage grows quickly?) Rebuilding the index by dropping and creating it every so often makes it unavailable for the duration of "rebuild", which might not please your users, and consumes a lot of resources (mostly CPU, because Text needs to put every document being indexed through external filter program and then parse and process resulting XML); ctx_ddl.optimize_index (index_name,'REBUILD') is online and is much cheaper.

Regards,

   Vladimir M. Zakharychev
   N-Networks, makers of Dynamic PSP(tm)    http://www.dynamicpsp.com Received on Thu Mar 05 2009 - 12:29:37 CST

Original text of this message