Marcelo Ochoa

Subscribe to Marcelo Ochoa feed
This place is a notepad for sharing experience, code and other related stuff to Oracle, Java and XML technologies.Marcelo Ochoa
Updated: 23 hours 44 min ago

New release of Lucene Domain Index based on Lucene 2.9.0

Wed, 2009-09-30 06:47
A new binary distribution of Lucene Domain Index ( for Oracle 10g/11g has been released.
Lucene Domain Index is integration of Lucene Project running inside the Oracle JVM and integrated to the SQL layer by adding a new index type.
This new version uses latest Lucene 2.9.0 core libraries and introduces some of the changes on API.
Here complete list of changes:
  • Tested with Oracle 11gR2, 11gR1 and 10.2 databases.
  • DefaultUserDataStore do a SAX parsing to get text nodes and attributes from an XMLType value.
  • A SimpleLRUCache is used to load rowids and his associated Lucene doc id, this reduce memory consumption when querying very big tables. A new parameter has been added, CachedRowIdSize by default 10000 to control the size of the LRU cache.
  • Lucene Domain Index core was updated to use TopFieldCollector and to avoid computation time when lscore() is not used.
  • Two new parameter has been added NormalizeScore which control when to track the Max Score and when querying, both parameters are consequence of new Lucene Collector API and boost the performance when querying.
  • A table alias L$MT is defined for the master table associated to the index to be used in complex queries associating columns from master tables and columns from dependant tables.
Full documentation is at Google Doc.
Download binary version for Oracle 10g and 11g.
One of the biggest changed introduced into Lucene core libraries are in the TopCollector API, it introduces optimizations when you don't need to compute the score, track maximum score and preserve the Lucene document ID in order.
This changes are reflected in Lucene Domain Index through the usage of lscore() ancilliary operator. For example:
create table emails (
emailFrom VARCHAR2(256),
emailTo VARCHAR2(256),
subject VARCHAR2(4000),
emailDate DATE,
bodyText CLOB)
and an index created as:
create index emailbodyText on emails(bodyText) indextype is lucene.LuceneIndex
parameters('Analyzer:org.apache.lucene.analysis.StopAnalyzer;ExtraCols:emailDate "emailDate",subject "subject",emailFrom "emailFrom",emailTo "emailTo"');
-- required to Sort by subject
alter index emailbodyText parameters('FormatCols:subject(NOT_ANALYZED),emailFrom(NOT_ANALYZED),emailTo(NOT_ANALYZED)');
-- do not track max score
alter index emailbodyText parameters('NormalizeScore:false');
this queries reflect when score is computed or not.
SELECT /*+ DOMAIN_INDEX_SORT */ lscore(1) sc,subject
FROM emails where emailfrom like '' and
lcontains(bodytext,'subject:lucene injection',1)>0
order by lscore(1) DESC;
  • Lucene score is computed and do not track maximum score, the result will be returned using relevance order descendant, default Lucene ordering no extra effort.
FROM emails
where lcontains(bodytext,'subject:lucene injection','subject:ASC',1)>0;
  • Lucene score is not computed and obviously the maximum score is not tracked.
Note that if we are querying using the optimizer hint DOMAIN_INDEX_SORT and lcontains(..,'subject:ASC',..) the score value is not relevant the result will be returned using the column subject ascending.
Similar result can be obtained using order by SQL, but to do that the RDBMS first collects all the rows that match to lcontains() operator and then do an order by, the difference in this simple test is an optimizer cost 3 over 2.
The NormalizeScore parameter is by default true, to get back compatibility, applications which assume an score in range 0..1 will run without any change, but if you want to get faster response time by avoiding max score computation you can change it by using alter index parameters DDL command.
To see which really means NormalizeScore parameter look a this example using WikiPedia English dump:
select /*+ FIRST_ROWS DOMAIN_INDEX_SORT */ lscore(1),
extractValue(object_value,'/page/revision/timestamp') "revisionDate"
from pages where lcontains(object_value,'rownum:[1 TO 2] AND sport','revisionDate:ASC',1)>0;
with NormalizeScore:false returns:
1.56775963306427001953125 SCCA 25-FEB-02 PM +00:00
1.79172527790069580078125 Sports utility vehicle 04-APR-02 PM +00:00
with NormalizeScore:true returns:
0.875 SCCA 25-FEB-02 PM +00:00
1 Sports utility vehicle 04-APR-02 PM +00:00
Both results are logically because we are looking for a result order by revisionDate:ASC, but the score computed is different.
Well, I will be at Oracle Open World 09, if anybody want to know more about this kind of internal implementation I will talk at the Oracle Unconference and obviously we can meet at the OTN Lounge, see you there Marcelo.

Unleash Oracle ODCI API - OOW09 Voting Session

Wed, 2009-06-17 07:28
Oracle Open World Voting session is a new way to create the conference session agenda.
I have submited two speaker session, one named "Unleash Oracle ODCI API" that is ready for voting at Oracle Mix comunity.
Oracle Data Cartridge API is provided to implement many powerful functionality such as new Domain Indexes, pipeline tables, and aggregated functions.
The presentation will include an introduction to this API showing many of his features using as example the code of a Lucene Domain Index which is a mix between Java running inside the OJVM and Oracle Object types.
Lucene Domain Index is an open source project which integrates the Apache Lucene IR library as a new Domain Index, providing features has free text searching, faceting, highlighting, filtering at index level, multi table/column indexes and more for 10g/11g databases.
Basically I would like to introduce this exciting API which allows developers to interact directly with the RDBMS engine and adding some examples in Java that are not included into the Oracle Documentation.
Well if you want to see this session at OOW09 please click here, see you there....