Oracle PL/SQL String Function Overhead [message #330008]
Fri, 27 June 2008 05:15
harrysmall3
Good morning,
I was recently surprised by the processing time I encountered performing string comparisons on data stored in table arrays retrieved from a bulk collection, and I'm looking to be pointed in the right research direction to decide whether a process I own can be better optimized.
Below is the background of the process situation; the topics I am looking for direction in studying are:
(1) Do Oracle built-in string comparison and editing functions such as INSTR, SUBSTR, REPLACE and TRANSLATE differ noticeably in efficiency, i.e. in the number of machine instructions they execute? How much overhead is there in a FOR loop?
For example, if two 10-character strings were compared for the existence of any four consecutive characters of one in the other, what is the trade-off between:
an outer loop over each 4-character SUBSTR() of var1, checking it against each 4-character SUBSTR() of var2 in an inner loop, versus
a single loop where each 4-character chunk of var1 is checked against var2 via INSTR()? (A minimal sketch of both approaches follows the questions below.)
(2) Is there any difference in processing efficiency between a CASE statement and an IF/ELSE construct?
(3) Is it worth considering Java or C/C++ as compiled stored procedures for string-function efficiency?
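Here is the kind of thing I mean - a toy sketch of the two variants, both checking whether any 4-character run of var1 appears in var2 (values are just placeholders):

DECLARE
    var1    VARCHAR2(10) := 'ABCDEFGHIJ';
    var2    VARCHAR2(10) := 'XXDEFGXXXX';
    l_found BOOLEAN;
BEGIN
    -- Approach A: outer/inner loops comparing 4-character SUBSTRs
    l_found := FALSE;
    FOR i IN 1 .. LENGTH(var1) - 3 LOOP
        FOR j IN 1 .. LENGTH(var2) - 3 LOOP
            IF SUBSTR(var1, i, 4) = SUBSTR(var2, j, 4) THEN
                l_found := TRUE;
            END IF;
            EXIT WHEN l_found;
        END LOOP;
        EXIT WHEN l_found;
    END LOOP;
    DBMS_OUTPUT.PUT_LINE('Approach A found: ' || CASE WHEN l_found THEN 'Y' ELSE 'N' END);

    -- Approach B: one loop, letting INSTR scan var2 for each 4-character chunk of var1
    l_found := FALSE;
    FOR i IN 1 .. LENGTH(var1) - 3 LOOP
        IF INSTR(var2, SUBSTR(var1, i, 4)) > 0 THEN
            l_found := TRUE;
        END IF;
        EXIT WHEN l_found;
    END LOOP;
    DBMS_OUTPUT.PUT_LINE('Approach B found: ' || CASE WHEN l_found THEN 'Y' ELSE 'N' END);
END;
/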
-----background:
We take health insurance claims and try to match the data on the claim with the correct practitioner record on our provider database so that the claim can be repriced using contract information on the provider record.
The inherent problem with claim matching is that data can be missing or incorrect on the claim, and the practitioner can have multiple entries/contracts on the lookup database.
When a claim cannot be automatically matched it falls through to a secondary process, which is the focus of my attention. This process tries to make "fuzzy" matches to retrieve a group of "candidate" possible matches on the provider DB, and determines that the provider is out-of-network if (1) no fuzzy match produces any candidate hits, or (2) for each candidate record the contract is not active.
If we find at least one potential provider record with an active contract, we send the claim to a queue for manual intervention. I have been working on improving the intelligence to reduce the number of records requiring manual intervention, but we had to back out our last enhancement due to performance degradation.
The stored procedure had a series of cursors based on data combinations to join the claim record to the provider record - different variations of name and location data. The criteria grew to incorporate string functions in the lookup to do a variety of character-combination scans of the claim provider names against the provider DB names.
My approach was to remove all name-parsing functions and other data-lookup functions from the cursors and consolidate them into one query on only direct field matches, taking full advantage of all indexing.
That was a godsend: the database I/O time was lightning quick.
Without the name portion of the join criteria, in the worst case where the zip code was not present I would be gathering every provider record in a city and state. The database I/O was still phenomenal; even for NY, NY I could get 30,000+ records into a bulk collection in a few seconds.
Now comes the bottleneck: sweeping my table array to apply the fuzzy string-compare logic formerly embedded in the SQL was enormously expensive. After much reworking to ensure minimal work is done in the iterations, I'm still losing the performance improvement on the outlier scenarios above. In the case of the 30,000-record result set from NY, it takes about 15 seconds to do an INSTR() comparison of every four-character SUBSTR() of a string against the value stored in a variable.
My next idea from the DB side is to build a parallel pipelined table function that pipes each row of the cursor collection through the routine doing the string work, to gain a little time back.
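Roughly what I have in mind (the type, function and table names below are just placeholders):

CREATE OR REPLACE TYPE provider_rec_t AS OBJECT (
    provider_id NUMBER,
    last_name   VARCHAR2(60)
);
/
CREATE OR REPLACE TYPE provider_tab_t AS TABLE OF provider_rec_t;
/
CREATE OR REPLACE FUNCTION match_providers (
    p_cur        SYS_REFCURSOR,
    p_claim_name VARCHAR2
) RETURN provider_tab_t
  PIPELINED
  PARALLEL_ENABLE (PARTITION p_cur BY ANY)
IS
    l_id   NUMBER;
    l_name VARCHAR2(60);
BEGIN
    LOOP
        FETCH p_cur INTO l_id, l_name;
        EXIT WHEN p_cur%NOTFOUND;
        -- the fuzzy string work moves here, out of the driving SQL
        IF INSTR(l_name, SUBSTR(p_claim_name, 1, 4)) > 0 THEN
            PIPE ROW (provider_rec_t(l_id, l_name));
        END IF;
    END LOOP;
    RETURN;
END;
/
-- Called along the lines of:
-- SELECT * FROM TABLE(match_providers(
--            CURSOR(SELECT provider_id, last_name FROM candidate_providers), :claim_name));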
Any ideas or tips on directions for research - and any sharing of personal experience with this situation - would be much appreciated!
Thank you
Harry
Re: Oracle PL/SQL String Function Overhead [message #333487 is a reply to message #333455]
Fri, 11 July 2008 21:50
rleishman
Another consideration is Domain Indexes.
I've never done it, but I understand that you can create an index and a matching function that would improve this.
The idea is that the Domain Index stores all of the possible four-character substrings in a secondary table and indexes them. Then, when you pass in your string to match, it splits it into 4-character substrings and performs lookups on the secondary table.
The good bit is that the Domain Index will automatically maintain the secondary table for you.
There's a fair bit of effort in this; you can start learning with the Data Cartridge Guide - http://download.oracle.com/docs/cd/B19306_01/appdev.102/b14289/toc.htm
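As a very rough illustration of the idea only - this is a plain secondary table maintained by hand, whereas the whole point of a domain index is that Oracle maintains it for you through the ODCI interfaces (table and column names here are made up):

CREATE TABLE provider_name_grams (
    provider_id NUMBER      NOT NULL,
    gram        VARCHAR2(4) NOT NULL
);

CREATE INDEX provider_name_grams_ix ON provider_name_grams (gram, provider_id);

-- Every 4-character substring of every provider name (60 = assumed max name length).
INSERT INTO provider_name_grams (provider_id, gram)
SELECT p.provider_id, SUBSTR(p.last_name, n.pos, 4)
FROM   providers p
       CROSS JOIN (SELECT LEVEL AS pos FROM dual CONNECT BY LEVEL <= 60) n
WHERE  n.pos <= LENGTH(p.last_name) - 3;

-- Lookup: split the incoming name the same way and join on the indexed grams.
SELECT DISTINCT g.provider_id
FROM   provider_name_grams g
       JOIN (SELECT SUBSTR(:claim_name, LEVEL, 4) AS gram
             FROM   dual
             CONNECT BY LEVEL <= LENGTH(:claim_name) - 3) c
       ON   c.gram = g.gram;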
Ross Leishman
Re: Oracle PL/SQL String Function Overhead [message #333631 is a reply to message #330008]
Sun, 13 July 2008 16:51
coleing
Pattern matching is not Oracle's forte, I'm afraid.
We used some software called the Trillium parsing engine for pattern matching - way faster than Oracle.
It was even faster to export the data to flat files, process it in Trillium to get matching scores for all possible candidates, and then load the results back up (millions of rows - we're talking about parsing half a million applications to match against a 60,000,000-person database).
Regular expressions would be great if Oracle were good at them, but Oracle is slower at evaluating regular expressions than a user-defined function doing the same job. In fact, we converted a regular expression in Oracle to a user-defined function and it ran 12 times faster. The regular expression was there to find a postcode in a string, so nothing too over the top.
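I can't post the real code, but the shape of the user-defined replacement was something along these lines (heavily simplified - it only handles one postcode shape, and the function name is made up):

CREATE OR REPLACE FUNCTION find_postcode (p_str IN VARCHAR2)
    RETURN VARCHAR2
IS
    -- Collapse every letter to 'A' and every digit to '9', then look for the
    -- simplified shape 'A9 9AA' in the resulting signature string.
    l_up  VARCHAR2(4000) := UPPER(p_str);
    l_sig VARCHAR2(4000);
    l_pos PLS_INTEGER;
BEGIN
    l_sig := TRANSLATE(l_up,
                       'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789',
                       'AAAAAAAAAAAAAAAAAAAAAAAAAA9999999999');
    l_pos := INSTR(l_sig, 'A9 9AA');
    IF l_pos > 0 THEN
        RETURN SUBSTR(l_up, l_pos, 6);
    END IF;
    RETURN NULL;
END;
/
-- The regular-expression version it replaced looked more like:
--   REGEXP_SUBSTR(addr, '[A-Z]{1,2}[0-9][0-9A-Z]? ?[0-9][A-Z]{2}')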
Re: Oracle PL/SQL String Function Overhead [message #340012 is a reply to message #330008]
Mon, 11 August 2008 00:22
harrysmall3
I think I found something promising for a solution.
Well, it's been almost a month since I ran into this string-function overhead issue. We finally got to acceptable performance by compensating with additional boolean variables to track the matching status of each record so we could skip some extraneous checks - which, of course, has twisted the code into a tangled maze of horror.
I decided that when I get to the full system rewrite there had to be a better approach all around. Regardless of performance, the extent of procedural logic required to implement fuzzy pattern matching would make maintenance, modification and "learnability" of the process too expensive to be an acceptable solution.
I looked for another "way" toward parsing besides a procedural algorithm. Here is what I am putting together as a proof of concept for a prototype of the new process, which I will post here once it is tested.
#1 - Circumvent using PL/SQL to parse data, and make the strings query-able with SQL. I hadn't explored the object-oriented aspects of Oracle and had limited experience in this area, because I never had a necessity for it. I was quite enthused to find that Oracle's OO implementation is exposed through functions and procedures, and that object references are virtually transparent compared with the usage of regular datatypes.
I had coded with Oracle objects as required for table-function usage, but never thought about them in any other light than as containers to pipe back bulk collections. And I never thought of table functions serving any purpose other than processing SQL - accepting cursors and returning records.
To make single-record and field/variable data available for SQL processing, I reversed that model: data can be piped back from any object source.
The data fields will be passed to a TYPE object whose constructor functions, with minimal SUBSTR execution, split them into parts and store them in an array, which is then passed to a table function.
Our "instring substring" moving pattern scan, which iterated through the substrings of string1 checking for their existence at any INSTR position of string2, could now become a straight SELECT join.
#2 - Parse by creating new datatypes to replace procedural character algorithms. The splitting of data into components could happen in levels, where the pieces could be cast into new types that inherit from a base type. I had ideas like "ALPHABETIC" - only letters; "ALPHANUM" - only letters and numbers; "VARIANT" - as in VB: characters, numbers and symbols... then combinations like "SNUM" and "SCHAR" for, say, number-symbol, with further delineation of number-symbol-mixed, number-symbol-leading, trailing, etc.
This could be priceless for what I have to do, but the mystery will once again be performance.
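A first stab at the kind of classifier behind those types might look like this (just a sketch of the idea, not the eventual type hierarchy):

CREATE OR REPLACE FUNCTION classify_token (p_str IN VARCHAR2)
    RETURN VARCHAR2
IS
    -- Letters collapse to 'A', digits to '9'; anything else passes through unchanged.
    l_sig VARCHAR2(4000) :=
        TRANSLATE(UPPER(p_str),
                  'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789',
                  'AAAAAAAAAAAAAAAAAAAAAAAAAA9999999999');
BEGIN
    IF    l_sig IS NULL                              THEN RETURN 'EMPTY';
    ELSIF REPLACE(l_sig, 'A') IS NULL                THEN RETURN 'ALPHABETIC';
    ELSIF REPLACE(l_sig, '9') IS NULL                THEN RETURN 'NUMERIC';
    ELSIF REPLACE(REPLACE(l_sig, 'A'), '9') IS NULL  THEN RETURN 'ALPHANUM';
    ELSE                                                  RETURN 'VARIANT';
    END IF;
END;
/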
I think there would also be an advantage in memory being released, as each object can be destroyed when the next record is to be processed?
I'll post my performance analysis when completed.
Harry