Re: Identifying pairs as groups

From: Michel Cadot <micadot{at}altern{dot}org>
Date: Tue, 23 Mar 2004 16:52:00 +0100
Message-ID: <40605d07$0$316$626a14ce@news.free.fr>

"Charlie Edwards" <charlie3101_at_hotmail.com> a écrit dans le message de news:217ac5a8.0403230410.5ca3d15f_at_posting.google.com...
> Daniel Morgan <damorgan_at_x.washington.edu> wrote in message news:<1079973490.55117_at_yasure>...
> >
> > Someone might be able to if you could clearly state the rules: Something
> > you have not done. Examples are fine ... but rules stated in clear,
> > descriptive, declarative sentences, are essential. I have no idea from
> > what you've written what exactly defines a pair or a group. I could
> > guess ... but that is not sufficient.
>
>
> OK, point taken. I thought I was simplifying things, but obviously
> not (I bet you'd guess right, though :-)). I'll start at the
> beginning...
>
> Some legacy data defines processes and the tasks that make up the
> procedures. So a process might be "Log Complaint" and the activities
> within it might be "Answer phone", "Enter customer details", "Log
> details", "Inform supervisor"
>
> So the actual data of the child table looks like this:
>
> PROCESS_CODE TASK_CODE TASK_COUNT
> ------------ --------- ----------
> 4268 231 1
> 4268 244 1
> 4268 257 1
> 4268 268 2
> 4268 272 1
> 4268 276 1
> 4270 231 1
> 4270 244 1
> 4270 248 1
> 4270 256 1
> 4270 272 1
> 4270 276 1
> 4273 231 1
> 4273 244 1
> 4273 257 1
> 4273 268 2
> 4273 272 1
> 4273 276 1
> 4285 231 1
> 4285 256 1
> 4285 269 1
> 4285 270 1
> 4285 272 1
>
> Now over a period of time, "new" proceeses have been created that have
> actually been duplicates of existing one. The defintion of a
> duplicate is if each process in a pair has identical tasks and task
> count. So in the above example process 4268 and 4273 are duplicates.
>
> So in order to identify duplicates (which is what I've been asked to
> do), I wrote the following SQL to identify "pairs" of duplicates:
>
> select x.process_code code1,
> y.process_code code2
> from
> (select b.process_count,
> a.process_code,
> a.task_code,
> a.task_count
> from ag a,
> (select process_code,count(*) process_count
> from ag group by process_code) b
> where a.process_code = b.process_code) x,
> (select b.process_count,
> a.process_code,
> a.task_code,
> a.task_count
> from ag a,
> (select process_code,count(*) process_count
> from ag group by process_code) b
> where a.process_code = b.process_code) y
> where x.process_code <> y.process_code
> and x.process_count = y.process_count
> and x.task_code = y.task_code
> and x.task_count = y.task_count
> group by x.process_code,
> y.process_code
> having count(*) = max(x.process_count)
> /
>
> This identifies each duplicate process and the process(es) that it is
> a duplicate of. So in the above example, I'd get
>
> CODE1 CODE2
> --------- ---------
> 4268 4273
> 4273 4268
>
> However, I found that in many cases there were more than 2 processes
> that had identical tasks. So again, in the above example, if 4300 had
> the same tasks as 4268 and 4273, I'd get the following:
>
> CODE1 CODE2
> --------- ---------
> 4268 4273
> 4268 4300
> 4273 4268
> 4273 4300
> 4300 4268
> 4300 4273
>
> So this brings me to my original problem. I've identified that 4268
> is a duplicate of 4273 and that 4268 is a duplicate of 4300 (and also
> that 4273 is a duplicate of 4300, obviously). But what I want to do
> is say that 4268, 4273 and 4300 are all part of the same group of
> duplicates.
>
> Of course maybe I'm barking up the wrong tree with my idea of
> identifying pairs, but any suggestions would be welcome.
>
> Thanks
>
> CE

I don't a database to check it but something like that might work:

with
myview1 as (

select x.process_code code1,

y.process_code code2
from

   (select b.process_count,
           a.process_code,
           a.task_code,
           a.task_count
    from   ag a,
           (select process_code,count(*) process_count
              from ag group by process_code) b

where a.process_code = b.process_code) x,

    (select b.process_count,
            a.process_code,
            a.task_code,
            a.task_count
     from   ag a,
            (select process_code,count(*) process_count
               from ag group by process_code) b
     where a.process_code = b.process_code) y
    where x.process_code < y.process_code
      and x.process_count = y.process_count
      and x.task_code = y.task_code
      and x.task_count = y.task_count

group by x.process_code, y.process_code having count(*) = max(x.process_count) ),
myview2 as (

select substr(groups,1,

                 decode(instr(groups,' ',2),0,length(groups)+1)-1) father,

lvl, groups||' '||code2 groups
from (

      select level lvl, sys_connect_by_path(code1,' ') groups, code2
      from myview1
      connect by code1 = prior code2
      start with ( select code1 from myview1 a
                   where not exists ( select 1 from myview1 b where b.code2 = a.code1 ) )

)
select a.groups
from myview2 a
where a.lvl = (select max(b.lvl) from myview2 b where b.father=a.father) /

Regards
Michel Cadot Received on Tue Mar 23 2004 - 09:52:00 CST