fast operation [message #410726]
Mon, 29 June 2009 23:50
maheshmhs

Hello all,
INSERT INTO zzz_emp_similarity
(
sn,
empgroupname,
companyname,
similarity,
field1
)
SELECT
ROWNUM AS sn,
a.EmpGroupName,
b.companyname,
fn_string_similarity(a.EmpGroupName, b.companyname) AS similarity,
b.field1
FROM zzz_emp_nomatch a,
hr_company_fields b
WHERE fn_string_similarity(a.EmpGroupName,b.companyname) > 90;
I have this code. fn_string_similarity is a function I created that measures the similarity between two strings based on Levenshtein distance.
zzz_emp_nomatch has around 500 records, and hr_company_fields has about 13 million records. This query has already been running for 4 days. Is there a better way of writing it (other than using a parallel hint) so that it takes less time?
Thank you
Re: fast operation [message #410731 is a reply to message #410726]
Tue, 30 June 2009 00:29
Michel Cadot
The problem is not the query, it is the function, which is executed for each row of the Cartesian product, that is 500 * 13 million = 6.5 billion times.
Even if the function takes only 0.1 ms per call, the total is 0.0001 * (500 * 13,000,000) / 86,400 > 7.5 days.
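To plug a measured number into that estimate, you can time the function in isolation. A rough SQL*Plus sketch (the literal strings and the 100,000-call count are made up; only fn_string_similarity comes from your post):

SET TIMING ON
-- 100,000 calls with varying input (LEVEL is appended so a DETERMINISTIC
-- declaration or result caching cannot skew the timing); divide the elapsed
-- time by 100,000 to get the per-call cost to use in the formula above.
SELECT COUNT(fn_string_similarity('ACME CORP', 'ACME CORPORATION ' || LEVEL))
FROM   dual
CONNECT BY LEVEL <= 100000;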
Regards
Michel
Re: fast operation [message #416203 is a reply to message #410726]
Fri, 31 July 2009 02:47
easymatica
maheshmhs wrote on Tue, 30 June 2009 07:50:
| This query has already been running for 4 days. Is there a better way of writing this query (other than using a parallel hint) so that it takes less time?

Can there be a better approach to organizing the data structure so that it takes less time? Even fast code running against a huge amount of data performs slowly.
Re: fast operation [message #416887 is a reply to message #416203]
Wed, 05 August 2009 00:30
Kevin Meade
The most obvious solution is to use some simple heuristics in your WHERE clause to reduce the number of comparisons done.
If you can come up with some simple rules that let you avoid certain combinations of rows, you can save a lot of time.
For example, based on the lengths of the text strings you may be able to skip comparisons that obviously cannot exceed your > 90 threshold, without making the function call at all; a sketch follows.
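A hedged sketch of that idea, assuming fn_string_similarity is roughly 100 * (1 - distance / length of the longer string); if your formula differs, adjust the 0.1 factor. Because the Levenshtein distance is at least the difference in lengths, a similarity above 90 is impossible when the lengths differ by more than about 10% of the longer string, and the CASE guarantees the function is never called for those pairs:

SELECT ROWNUM AS sn,
       a.empgroupname,
       b.companyname,
       fn_string_similarity(a.empgroupname, b.companyname) AS similarity,
       b.field1
FROM   zzz_emp_nomatch   a,
       hr_company_fields b
WHERE  CASE
         WHEN ABS(LENGTH(a.empgroupname) - LENGTH(b.companyname))
              <= 0.1 * GREATEST(LENGTH(a.empgroupname), LENGTH(b.companyname))
         THEN fn_string_similarity(a.empgroupname, b.companyname)
         ELSE 0
       END > 90;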
Also, some distance algorithms are matrix based. Maybe you can find some combination of Oracle built-in functions that does the math for you and thus avoid calling your home-grown PL/SQL function to do the job; I believe some versions of Oracle now offer such support.
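One built-in worth checking, assuming your version ships it (UTL_MATCH is documented from 11g, and its scoring may not match your own function exactly, so compare results on a sample first), is UTL_MATCH.EDIT_DISTANCE_SIMILARITY, which returns a Levenshtein-based similarity on a 0-100 scale:

SELECT a.empgroupname,
       b.companyname,
       UTL_MATCH.EDIT_DISTANCE_SIMILARITY(a.empgroupname, b.companyname) AS similarity
FROM   zzz_emp_nomatch   a,
       hr_company_fields b
WHERE  UTL_MATCH.EDIT_DISTANCE_SIMILARITY(a.empgroupname, b.companyname) > 90;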
Lastly, you may want to think about changing your logic. For example, there is no need to actually compute the distance, only to know whether it exceeds some value. If your distance calculation occurs in a loop or in steps, as most algorithms do, then you can stop the process once it exceeds your limit. Modify your function to accept the limit and short-circuit once it is reached.
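A minimal sketch of that short-circuit idea (not your original fn_string_similarity, which I have not seen; fn_edit_distance_limited and p_limit are hypothetical names). It works on the raw edit distance, so the caller would translate the > 90 similarity threshold into a maximum allowed distance:

-- Levenshtein distance with an early exit: once every cell in the current
-- row exceeds p_limit, no later row can come back under it, so give up
-- and return NULL ("no match possible").
CREATE OR REPLACE FUNCTION fn_edit_distance_limited (
    p_s1    IN VARCHAR2,
    p_s2    IN VARCHAR2,
    p_limit IN PLS_INTEGER
) RETURN PLS_INTEGER
IS
    TYPE t_row IS TABLE OF PLS_INTEGER INDEX BY PLS_INTEGER;
    l_prev t_row;
    l_curr t_row;
    l_len1 PLS_INTEGER := NVL(LENGTH(p_s1), 0);
    l_len2 PLS_INTEGER := NVL(LENGTH(p_s2), 0);
    l_min  PLS_INTEGER;
    l_cost PLS_INTEGER;
BEGIN
    -- the distance is at least the length difference: cheap pre-check
    IF ABS(l_len1 - l_len2) > p_limit THEN
        RETURN NULL;
    END IF;
    FOR j IN 0 .. l_len2 LOOP
        l_prev(j) := j;
    END LOOP;
    FOR i IN 1 .. l_len1 LOOP
        l_curr(0) := i;
        l_min     := i;
        FOR j IN 1 .. l_len2 LOOP
            l_cost := CASE WHEN SUBSTR(p_s1, i, 1) = SUBSTR(p_s2, j, 1)
                           THEN 0 ELSE 1 END;
            l_curr(j) := LEAST(l_curr(j - 1) + 1,        -- insertion
                               l_prev(j) + 1,            -- deletion
                               l_prev(j - 1) + l_cost);  -- substitution
            l_min := LEAST(l_min, l_curr(j));
        END LOOP;
        -- short circuit: every value in this row already exceeds the limit
        IF l_min > p_limit THEN
            RETURN NULL;
        END IF;
        l_prev := l_curr;
    END LOOP;
    RETURN CASE WHEN l_prev(l_len2) <= p_limit THEN l_prev(l_len2) END;
END fn_edit_distance_limited;
/

Because each row of the matrix depends only on the previous row, the minimum of the current row is a lower bound on the final distance, which is what makes the early return safe.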
Good luck, Kevin
Re: fast operation [message #416929 is a reply to message #416926]
Wed, 05 August 2009 03:57
cookiemonster
elaiyavel wrote on Wed, 05 August 2009 09:44:
| Create a function-based index for your string comparison utility.
| A function-based index would provide improved performance.
| Thx.
Since the inputs to the function come from two different tables, that's not possible.
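To illustrate (the index name is hypothetical): a function-based index can only be built over a deterministic expression of columns from the table being indexed, so something like the following is possible, but fn_string_similarity(a.EmpGroupName, b.companyname) also needs a column from zzz_emp_nomatch, which no index on hr_company_fields can see.

CREATE INDEX hr_company_upper_idx
    ON hr_company_fields (UPPER(companyname));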
Re: fast operation [message #416992 is a reply to message #416929]
Wed, 05 August 2009 09:40
Kevin Meade
You really need to find out where the cost is coming from, then fix that area.
1) if the cost is context switching from SQL to PL/SQL, then your solution may be to rewrite your code without the function call, using only SQL.
2) if the cost is accessing the data, then your solution will likely be tuning the underlying SQL.
3) if the cost is the algorithm of the function, then you need a better algorithm.
Construct various tests to figure out where the cost is, then proceed based on what you learn. You will be more or less successful given your talent for detective work. Consider an approach like this (a rough timing sketch for the first two tests follows the list):
1) run the SQL to get the full cost of the query.
2) run the SQL without the function call; the difference from (1) gives you the cost of the function call.
3) remove the data from the tables used by the function by truncating them, then make your call to understand the cost of the SQL.
4) put the data back into the tables and run the query, but call the function with constant inputs so that it always pulls the same data. This will help you understand the cost of the algorithm.
5) examine your data and see if you can rank it by difficulty level for your function: low/medium/high (use more groups if you want to fine-tune). Then get the average cost for each ranking. This will help you understand whether the data exhibits one or two difficult cases or whether all the data is expensive to process.
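A rough SQL*Plus sketch of the first two tests (the 100,000-row sample size is arbitrary; table and column names are those from the original query):

SET TIMING ON

-- test 1: full cost, function included, on a bounded sample of the join
SELECT COUNT(CASE WHEN fn_string_similarity(empgroupname, companyname) > 90
                  THEN 1 END)
FROM  (SELECT a.empgroupname, b.companyname
       FROM   zzz_emp_nomatch   a,
              hr_company_fields b
       WHERE  ROWNUM <= 100000);

-- test 2: the same sample without the function call; the difference in
-- elapsed time approximates the cost of 100,000 function calls
SELECT COUNT(*)
FROM  (SELECT a.empgroupname, b.companyname
       FROM   zzz_emp_nomatch   a,
              hr_company_fields b
       WHERE  ROWNUM <= 100000);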
Good luck, Kevin