Home » SQL & PL/SQL » SQL & PL/SQL » Identifying Similar Set of Rows in a table
Identifying Similar Set of Rows in a table [message #212087] Wed, 03 January 2007 14:10 Go to next message
desigan
Messages: 8
Registered: October 2006
Junior Member
Hi All,

I need to identify set of rows in a table, that are similar in few columns. How could this be done? Does Oracle has a built-in (analytic?) function to do that?

Example
Let us consider an Address table with the following columns - EmpId, StreetNumber, Street and Town.

An employee can have more than one address. I got to identify all the employees for whom the all the addresses match. Let us consider that the following records are there in the table:

Emp1 10 Street1 Town1
Emp1 20 Street2 Town2
Emp1 30 Street3 Town3

Emp2 99 Street2 Town9

Emp3 10 Street1 Town1
Emp3 20 Street2 Town2
Emp3 30 Street3 Town3

Emp4 20 Street2 Town2
Emp4 30 Street3 Town3

From the above rows, we can say that Emp1 and Emp3 are same in their address as all their addresses and the number of addresses match. However, Emp1 and Emp4 are not the same (for them only two addresses match).

So, I got to identify similar set of rows for each Employee that matches completely (both the number of addresses and the column values for the addresses)

Regards,
Desigan
Re: Identifying Similar Set of Rows in a table [message #212180 is a reply to message #212087] Thu, 04 January 2007 03:39 Go to previous messageGo to next message
sap_arul
Messages: 27
Registered: November 2005
Location: BANGALORE
Junior Member
Hi,
Please find the solutions for your problem with an example.
The following Query will give the your required output.


SQL> DESC emp_addd
Name Null? Type
------------------------------- -------- ----
EMPNAME VARCHAR2(10)
ADDRESS VARCHAR2(20)
TOWN VARCHAR2(20)

SQL> SELECT * FROM emp_addd;

EMPNAME ADDRESS TOWN
---------- -------------------- --------------------
Emp1 10 Street1 Town1
Emp1 20 Street2 Town2
Emp1 30 Street3 Town3
Emp2 99 Street2 Town9
Emp3 10 Street1 Town1
Emp3 20 Street2 Town2
Emp3 30 Street3 Town3
Emp4 20 Street2 Town2
Emp4 30 Street3 Town3


SQL> select DISTINCT a.empname,a.address,a.town from emp_addd a,emp_addd b where
2 a.empname<>b.empname and not exists
3 (
4 select address,town from emp_addd x where x.empname=a.empname
5 minus
6 select address,town from emp_addd y where y.empname=b.empname
7 )
8 and not exists(
9 select address,town from emp_addd x where x.empname=b.empname
10 minus
11 select address,town from emp_addd y where y.empname=a.empname
12 );

EMPNAME ADDRESS TOWN
---------- -------------------- --------------------
Emp1 10 Street1 Town1
Emp1 20 Street2 Town2
Emp1 30 Street3 Town3
Emp3 10 Street1 Town1
Emp3 20 Street2 Town2
Emp3 30 Street3 Town3

6 rows selected.

SQL>

regards
Arul.
Re: Identifying Similar Set of Rows in a table [message #212289 is a reply to message #212180] Thu, 04 January 2007 09:19 Go to previous messageGo to next message
desigan
Messages: 8
Registered: October 2006
Junior Member
Thanks for the reply Arul.

I have written some similar query using IN and count(*). But, unfortunately the table I am operating with has 25 million rows and the query gets slow as the number of records increases.

Is there any other Oracle Built in function that can help us to achieve this?

Regards,
Desigan
Re: Identifying Similar Set of Rows in a table [message #212298 is a reply to message #212087] Thu, 04 January 2007 10:33 Go to previous messageGo to next message
srinivnp
Messages: 136
Registered: January 2006
Location: stlouis MO USA
Senior Member

curamgph>desc foo;
Name Null? Type
----------------------------------------- -------- ----------------------------
EMPID VARCHAR2(5)
STREETNUMBER NUMBER(5)
STREET VARCHAR2(10)
TOWN VARCHAR2(10)

curamgph>select * from foo;

EMPID STREETNUMBER STREET TOWN
----- ------------ ---------- ----------
Emp1 10 Street1 Town1
Emp1 20 Street2 Town2
Emp1 30 Street3 Town3
Emp2 99 Street2 Town9
Emp3 10 Street1 Town1
Emp3 20 Street2 Town2
Emp3 30 Street3 Town3
Emp4 20 Street2 Town2
Emp4 30 Street3 Town3



==================

Quote:
select empid,cstr from (
select empid , cstr, count( distinct empid )
over (partition by (cstr) ) as grr
from (
select empid , rn,cstr ,cnt
from foo
model
main grr
partition by (empid )
dimension by ( row_number() over (partition by empid
order by streetnumber,street,town) -1 as rn )
measures (streetnumber,street,town, cast (null as varchar2(4000)) as cstr,
count (distinct streetnumber||street||town )
over (partition by empid ) -1 as cnt )
rules iterate(999999) until (iteration_number=cnt[iteration_number])
(
cstr[iteration_number] order by streetnumber,street,town =
case when iteration_number=0 then to_char(null)
else cstr[cv()-1]||'=' end
|| streetnumber[cv()]||'='|| street[cv()]||'='||town [cv()]
)
) where rn=cnt
) where grr > 1
/



EMPID CSTR
----- ------------------------------
Emp1 10=Street1=Town1=20=Street2=To
wn2=30=Street3=Town3

Emp3 10=Street1=Town1=20=Street2=To
wn2=30=Street3=Town3



If you want the "common" empids to appear in one row , just pivot this output once again on output column CSTR.

You can parallellize this query if you have multiple cpus and PQ is enabled.

Srini




Re: Identifying Similar Set of Rows in a table [message #212346 is a reply to message #212298] Thu, 04 January 2007 19:42 Go to previous messageGo to next message
rleishman
Messages: 3724
Registered: October 2005
Location: Melbourne, Australia
Senior Member
This should be fairly efficient, but there is an infinitesimal chance that it will wrongly return a non-identical combinations. I wouldn't let this worry you though, by rough calculation I reckon the probability is somewhere in the vicinity of 13,611,294,664,161,032,540,679,908,859,307 to 1.

WITH addrhash AS (
  SELECT EmpId
  ,      sum(ora_hash(StreetNumber)) || '#' || 
         sum(ora_hash(Street)) || '#' ||
         sum(ora_hash(Town)) || '#' ||
         sum(ora_hash(StreetNumber||Street||Town)) AS hashval
  FROM address
  GROUP by EmpId
)
SELECT a1.EmpId, a2.EmpId
FROM addrhash a1
JOIN addrhash a2 ON (
    a2.hashval = a1.hashval 
AND a2.EmpId > a1.EmpId
)


Ross Leishman
Re: Identifying Similar Set of Rows in a table [message #212371 is a reply to message #212346] Fri, 05 January 2007 00:13 Go to previous messageGo to next message
Frank
Messages: 7880
Registered: March 2000
Senior Member
rleishman wrote on Fri, 05 January 2007 02:42
by rough calculation I reckon the probability is somewhere in the vicinity of 13,611,294,664,161,032,540,679,908,859,307 to 1.

Hm, watch out for sperm whales and petunias!
Re: Identifying Similar Set of Rows in a table [message #212380 is a reply to message #212087] Fri, 05 January 2007 01:13 Go to previous messageGo to next message
moorthygs
Messages: 3
Registered: January 2007
Location: Chennai
Junior Member
Hi,
Please find the solutions for your problem with an example.
The following Query will give the your required output.


SQL> DESC emp_addd
Name Null? Type
------------------------------- -------- ----
EMPNAME VARCHAR2(10)
ADDRESS VARCHAR2(20)
TOWN VARCHAR2(20)

SQL> SELECT * FROM emp_addd;

EMPNAME ADDRESS TOWN
---------- -------------------- --------------------
Emp1 10 Street1 Town1
Emp1 20 Street2 Town2
Emp1 30 Street3 Town3
Emp2 99 Street2 Town9
Emp3 10 Street1 Town1
Emp3 20 Street2 Town2
Emp3 30 Street3 Town3
Emp4 20 Street2 Town2
Emp4 30 Street3 Town3


SQL> select DISTINCT a.empname,a.address,a.town from emp_addd a,emp_addd b where
2 a.empname<>b.empname and not exists
3 (
4 select address,town from emp_addd x where x.empname=a.empname
5 minus
6 select address,town from emp_addd y where y.empname=b.empname
7 )
8 and not exists(
9 select address,town from emp_addd x where x.empname=b.empname
10 minus
11 select address,town from emp_addd y where y.empname=a.empname
12 );

EMPNAME ADDRESS TOWN
---------- -------------------- --------------------
Emp1 10 Street1 Town1
Emp1 20 Street2 Town2
Emp1 30 Street3 Town3
Emp3 10 Street1 Town1
Emp3 20 Street2 Town2
Emp3 30 Street3 Town3

6 rows selected.
Re: Identifying Similar Set of Rows in a table [message #212386 is a reply to message #212380] Fri, 05 January 2007 01:47 Go to previous messageGo to next message
Frank
Messages: 7880
Registered: March 2000
Senior Member
...which is EXACTLY the same as sap_arul suggested. This solution was turned down as non-performant.
Re: Identifying Similar Set of Rows in a table [message #212433 is a reply to message #212380] Fri, 05 January 2007 05:11 Go to previous messageGo to next message
ebrian
Messages: 2794
Registered: April 2006
Senior Member
Now that's what I call plagiarism at it's finest.
Re: Identifying Similar Set of Rows in a table [message #212558 is a reply to message #212371] Fri, 05 January 2007 17:21 Go to previous messageGo to next message
rleishman
Messages: 3724
Registered: October 2005
Location: Melbourne, Australia
Senior Member
Frank wrote on Fri, 05 January 2007 17:13
rleishman wrote on Fri, 05 January 2007 02:42
by rough calculation I reckon the probability is somewhere in the vicinity of 13,611,294,664,161,032,540,679,908,859,307 to 1.

Hm, watch out for sperm whales and petunias!


Good point. Before you use my SQL, check the data for any flats in Islington.
Re: Identifying Similar Set of Rows in a table [message #212591 is a reply to message #212558] Sat, 06 January 2007 02:07 Go to previous messageGo to next message
Littlefoot
Messages: 20897
Registered: June 2005
Location: Croatia, Europe
Senior Member
Account Moderator
Are you talking about the Chaos Theory? ./fa/1600/0/
Re: Identifying Similar Set of Rows in a table [message #212592 is a reply to message #212591] Sat, 06 January 2007 02:24 Go to previous messageGo to next message
Frank
Messages: 7880
Registered: March 2000
Senior Member
Know your classics!

42. Does that ring a bell?
Re: Identifying Similar Set of Rows in a table [message #212593 is a reply to message #212592] Sat, 06 January 2007 02:29 Go to previous messageGo to next message
Littlefoot
Messages: 20897
Registered: June 2005
Location: Croatia, Europe
Senior Member
Account Moderator
Hitchhikers Guide to the Galaxy?
Re: Identifying Similar Set of Rows in a table [message #212620 is a reply to message #212593] Sat, 06 January 2007 10:12 Go to previous messageGo to next message
Frank
Messages: 7880
Registered: March 2000
Senior Member
./fa/1964/0/
Re: Identifying Similar Set of Rows in a table [message #212941 is a reply to message #212087] Mon, 08 January 2007 21:13 Go to previous messageGo to next message
vadimtro
Messages: 8
Registered: December 2006
Junior Member
Given

table A
set# element
---- -------
1    a
1    b
1    c
2    a
2    b
2    c
3    a
3    b


you would like to find all the sets that have identical elements, right? In this example the expected answer is:

set1 set2
---- ----
1    2


This is relational division type of query. Check up the very last exercise in

http://www.orafaq.com/forum/fa/1915/0/


Alternative solution can leverage oracle collections. Or, as other posts suggested, you can concatenate the elements into a string in the right order and compare strings.





[Updated on: Mon, 08 January 2007 21:15]

Report message to a moderator

Re: Identifying Similar Set of Rows in a table [message #213223 is a reply to message #212941] Tue, 09 January 2007 19:39 Go to previous messageGo to next message
rleishman
Messages: 3724
Registered: October 2005
Location: Melbourne, Australia
Senior Member
I don't think the Relational Division examples in that link help us. If we were looking for all employees with the same rows as EMP1, it would be fine. But we are trying to find employees with the same rows as ANY OTHER employee. This calls for a cartesian join in the suggested Relational Division algorithm, which is not scalable.

Similarly, the object relational features of Oracle to compare sets will also call for a cartesian join (compare the set of details for every employee to every other employee).

Ross Leishman
Re: Identifying Similar Set of Rows in a table [message #213442 is a reply to message #213223] Wed, 10 January 2007 15:50 Go to previous messageGo to next message
vadimtro
Messages: 8
Registered: December 2006
Junior Member
select a1.set#, a2.set# from A a1, A a2
where a1.element = a2.element
group by a1.set#, a2.set#
having count(*) = (select count(*) from A a where a.set#=a1.set#)
and count(*) = (select count(*) from A a where a.set#=a2.set#)

Where is Cartesian product?
Re: Identifying Similar Set of Rows in a table [message #213466 is a reply to message #213442] Wed, 10 January 2007 19:51 Go to previous messageGo to next message
rleishman
Messages: 3724
Registered: October 2005
Location: Melbourne, Australia
Senior Member
Load 25 million rows into your table with frequent element matches and see how it goes.

The cartesian product is not on the full table - just on subsets.

If you have 1000 rows with the same element, then all 1000 of them will join to the other 1000. ie. 1,000,000 results.

Performance using this query would only be acceptable if matching elements were scarce. As soon as they became more common, the result-set would blow out (as described above), and the correlated sub-queries in the HAVING clause would be executed too many times.

Simply, this solution is not scalable, which is what the OP asked for.

Ross Leishman
Re: Identifying Similar Set of Rows in a table [message #213467 is a reply to message #213466] Wed, 10 January 2007 21:16 Go to previous message
vadimtro
Messages: 8
Registered: December 2006
Junior Member
Agreed in principle. There is a lot of research on set containment join query efficiency, e.g.
http://acm.org/sigmod/sigmod03/eproceedings/papers/r06p02.pdf
but I doubt it could match hash-value based comparison.

Note, that your method in spirit is not really that different from collection based methods. You add hash values, while collection based method faithfully gather all the elements. Clearly, you are more efficient when it turns up to comparing scalar values.

BTW, the hash-based method is also in this chapter -- in the "Symmetric Difference" section.
Previous Topic: Comparing records and merging them into one record (merged)
Next Topic: Refresh Path error in Materialized view
Goto Forum:
  


Current Time: Wed Dec 07 10:49:32 CST 2016

Total time taken to generate the page: 0.11764 seconds