Re: sql to delete rows from joining a table to itself

From: Sebastian Kolski <sebastian.kolski_at_gmail.com>
Date: Tue, 30 Jun 2009 21:22:31 +0200
Message-ID: <h2dolq$mcg$1_at_achot.icm.edu.pl>



Shakespeare wrote:
> Mark D Powell wrote:

>> On Jun 29, 8:03 am, Ed Prochak <edproc..._at_gmail.com> wrote:
>>> On Jun 28, 7:51 pm, Jeff Calico <jeffcal..._at_gmail.com> wrote:
>>>
>>>> Hello everyone
>>>> I am trying to delete rows from a massive data table that are "almost"
>>>> duplicates. I can do a nice query to find them when I join the table
>>>> with itself, but I can't figure out how to form a delete statement
>>>> out of this.
>>>> Query:
>>>> SELECT a.*
>>>> FROM Trafficdata a, Trafficdata b
>>>> WHERE a.id = b.id
>>>> AND a.date = b.date
>>>> AND a.flag = 'Q'
>>>> AND b.flag = 'Q'
>>>> AND a.dstop = b.dstop
>>>> AND (a.batch != b.batch
>>>>      OR (a.batch = b.batch AND a.pos_marker < b.pos_marker));
>>>> One idea I tried after searching the web was
>>>> DELETE FROM ( <above query> );
>>>> but that gives ORA-01752 "cannot delete from view without exactly one
>>>> key-preserved table".
>>>> Any help would be greatly appreciated!
>>>> Jeff C.
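
(For reference: ORA-01752 is Oracle's updatable-join-view restriction.
DELETE FROM (subquery) is treated as a delete against a join view, and
that is only allowed when exactly one table in the join is key-preserved,
i.e. each row of the view maps to exactly one row of that table. A
self-join on non-unique columns preserves neither copy. A sketch of a
case where it does work, using a hypothetical lookup table "batches"
whose primary key is batch_id:

    DELETE FROM (SELECT a.*
                 FROM Trafficdata a
                 JOIN batches b ON a.batch = b.batch_id
                 WHERE b.status = 'OBSOLETE');

Here Trafficdata is key-preserved because batches.batch_id is unique, so
each view row maps to exactly one Trafficdata row.)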
>>> Do you have a primary key on this Trafficdata table?
>>> Assuming your results are the rows you really want to delete, then a
>>> delete of this form should work:
>>>
>>> DELETE FROM Trafficdata WHERE pk IN (SELECT a.pk from ...) ;
>>>
>>> I suggest you do this AFTER doing a backup.
>>>
>>> Ed
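
Applied to Jeff's query, Ed's pattern would look something like this
(untested sketch; it assumes Trafficdata has a primary key column pk and
reuses the column names from the original post):

    DELETE FROM Trafficdata
    WHERE pk IN (SELECT a.pk
                 FROM Trafficdata a, Trafficdata b
                 WHERE a.id = b.id
                 AND a.date = b.date
                 AND a.flag = 'Q'
                 AND b.flag = 'Q'
                 AND a.dstop = b.dstop
                 AND (a.batch != b.batch
                      OR (a.batch = b.batch
                          AND a.pos_marker < b.pos_marker)));

If there is no single-column key, rowid works the same way: select
a.rowid in the subquery and use WHERE rowid IN (...).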
>>
>> Assuming you want to keep one copy of each row, the following will work:
>>
>> delete from table_name
>> where (key_list, rowid) in
>>       ( select key_list, rowid from table_name
>>         minus
>>         select key_list, min(rowid) from table_name
>>         group by key_list
>>       )
>> /
>>
>> HTH -- Mark D Powell --
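
With Jeff's columns filled in, Mark's pattern might read like this (a
sketch; it assumes a duplicate is defined by id, date and dstop among the
'Q' rows, and note Shakespeare's point below that min(rowid) keeps an
arbitrary copy rather than the one with the largest pos_marker):

    delete from Trafficdata
    where (id, date, dstop, rowid) in
          ( select id, date, dstop, rowid
            from Trafficdata where flag = 'Q'
            minus
            select id, date, dstop, min(rowid)
            from Trafficdata where flag = 'Q'
            group by id, date, dstop
          )
    /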
> 
> Looks like you are skipping the condition that the record with the
> largest pos_marker should be preserved, rather than the one with the
> lowest rowid. Furthermore, if a.batch != b.batch, both records would be
> deleted in the original query (if it worked).
> 
> And is there no mutating table problem here? Won't this lead to 
> 'snapshot too old' problems with large tables? (I'm not sure about that)
> 
> Shakespeare
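
If the rule really is "keep the row with the largest pos_marker per
id/date/dstop among the 'Q' rows", an analytic function expresses that
directly. A sketch (untested; column names taken from Jeff's post; ties
on pos_marker would need an extra tiebreaker in the ORDER BY):

    DELETE FROM Trafficdata
    WHERE rowid IN
          ( SELECT rid
            FROM ( SELECT rowid AS rid,
                          ROW_NUMBER() OVER
                            (PARTITION BY id, date, dstop
                             ORDER BY pos_marker DESC) AS rn
                   FROM Trafficdata
                   WHERE flag = 'Q' )
            WHERE rn > 1 );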

I had a similar problem: a large table with many rows to delete. My
solution was to write a short program that would connect to the db, open
a full table scan, and read all the important columns plus the rowids.
While reading I built an in-memory list of the good rows to keep
(actually it was a hash map); for each row read I checked whether a row
with the same important columns was already on the list: if not, I added
it; if yes, I checked whether it should replace the row on the list or be
deleted itself. On a second list I kept the rowids of the rows to be
deleted, and every time its size reached 10k I sent its rowids to a
separate deleter thread, which would just loop through them and issue
"delete from x where rowid = y" for each one.
So the total cost of that solution was one full table scan plus a number
of single-row deletes by rowid.
Since I was committing every 10k deletes, "snapshot too old" was not a
problem: I could restart the job at any time, and after each restart
there would be fewer rows in the table.
Anyway, in my case the "cleaning" job duration dropped from 120h (it was
written as a PL/SQL loop in the beginning) to 5h for a table with ~150m
rows (25m after cleanup), and the bottleneck is now the redo log writer,
i.e. how fast it can log the changes.
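
In single-threaded PL/SQL the core of the idea would look roughly like
the sketch below. It assumes the column names from Jeff's post and that
"better" simply means a larger pos_marker; my real program was an
external multi-threaded one, and an associative array holding every key
may not fit in PGA memory for a table this size.

    DECLARE
      TYPE t_best IS RECORD (rid ROWID,
                             pos Trafficdata.pos_marker%TYPE);
      TYPE t_map  IS TABLE OF t_best INDEX BY VARCHAR2(200);
      TYPE t_rids IS TABLE OF ROWID INDEX BY PLS_INTEGER;

      best    t_map;   -- best row seen so far for each id|date|dstop key
      b       t_best;
      victims t_rids;  -- rowids queued for deletion
      n       PLS_INTEGER := 0;
      k       VARCHAR2(200);

      PROCEDURE flush IS
      BEGIN
        FORALL i IN 1 .. n
          DELETE FROM Trafficdata WHERE rowid = victims(i);
        COMMIT;        -- commit per batch, so the job is restartable
        n := 0;
      END;
    BEGIN
      FOR r IN (SELECT rowid AS rid, id, t.date AS dt, dstop, pos_marker
                FROM Trafficdata t
                WHERE flag = 'Q') LOOP
        k := r.id || '|' || r.dt || '|' || r.dstop;
        IF NOT best.EXISTS(k) THEN
          b.rid := r.rid;                 -- first row for this key: keep it
          b.pos := r.pos_marker;
          best(k) := b;
        ELSIF r.pos_marker > best(k).pos THEN
          n := n + 1;                     -- new row wins: old keeper dies
          victims(n) := best(k).rid;
          b.rid := r.rid;
          b.pos := r.pos_marker;
          best(k) := b;
        ELSE
          n := n + 1;                     -- current keeper wins
          victims(n) := r.rid;
        END IF;
        IF n >= 10000 THEN
          flush;                          -- delete in 10k batches
        END IF;
      END LOOP;
      flush;
    END;
    /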

Sebastian K.
