Re: Remove dupes with CLOB columns involved

From: Vladimir M. Zakharychev <vladimir.zakharychev_at_gmail.com>
Date: Tue, 23 Jun 2009 09:10:29 -0700 (PDT)
Message-ID: <c327bebc-4816-4e02-b405-e95d351f13a2_at_a36g2000yqc.googlegroups.com>



On Jun 23, 5:40 pm, Mark D Powell <Mark.Pow..._at_eds.com> wrote:
> On Jun 23, 4:37 am, "Vladimir M. Zakharychev"
>
>
>
> <vladimir.zakharyc..._at_gmail.com> wrote:
> > On Jun 23, 11:45 am, "Álvaro G. Vicario"
>
> > <alvaro.NOSPAMTH..._at_demogracia.com> wrote:
> > > John Hurley escribió:
>
> > > > On Jun 22, 11:25 am, "Álvaro G. Vicario"
> > > > <alvaro.NOSPAMTH..._at_demogracia.com> wrote:
> > > >> I've created a GLOBAL TEMPORARY TABLE where one of the columns stores
> > > >> the result of concatenating a variable (and unknown) number of
> > > >> VARCHAR2(100) cells. I didn't want to hit the 4,000 character limit when
> > > >> the source table has more than 40 rows so I created such temp column as
> > > >> CLOB. But when I joined my temporary table with some other regular
> > > >> tables I got an "ORA-00932: inconsistent datatypes: expected - got CLOB"
> > > >> error message. The problem seems to be that it's a "SELECT DISTINCT"
> > > >> query and I've read in docs that it's a known limit. Replacing DISTINCT
> > > >> with GROUP BY doesn't change this.
>
> > > >> Looking at the documentation, I appeared to me that TO_CHAR() could be a
> > > >> solution but it isn't: it fixes ORA-00932 but I get "ORA-22835: Buffer
> > > >> too small for CLOB to CHAR or BLOB to RAW conversion (actual: 4001,
> > > >> maximum: 4000)" when the CLOB contains more than 4000 bytes. Some forums
> > > >> suggest DBMS_LOB.SUBSTR() but it also fails for 4000+ (ORA-06502:
> > > >> PL/SQL: numeric or value error: character string buffer too small
> > > >> ORA-06512: at line 1).
>
> > > >> I just need to run a query where some of the columns are CLOB and get
> > > >> rid of duplicated rows. Is there an easy approach I could use?
>
> > > >> It's a PHP/5.2+OCI8 app running on top of Oracle 10g Express Edition in
> > > >> a Windows XP box. Thank you in advance,
> > > > Could you give an overview of what exactly you are trying to
> > > > accomplish?
>
> > > > Perhaps a redesign of the database tables involved might be a better
> > > > idea.  While there's nothing exactly wrong with BLOBs or CLOBs ...
> > > > they can get unwieldy and quite complicated unless planned and tested
> > > > from the beginnings of design phases.
>
> > > It's part of a module that prints custom reports (defined by users).
> > > Among (many) other things, each reportable object has a list of
> > > attributes. A simplified example:
>
> > > object:
> > >         object_id
> > >         object_name
>
> > > attribute:
> > >         attribute_id
> > >         object_id
> > >         attribute_name
>
> > > I'm emulating MySQL's GROUP_CONCAT() function so I get one row per
> > > object with all the attributes in one cell:
>
> > > object_name, attribute_list
> > > 'object_1', 'attr_1_1; attr_1_2; attr_1_3'
> > > 'object_2', 'attr_2_1; attr_2_2'
>
> > > It happened that the code I'm using [1][2] implies some complex
> > > operations (CONNECT BY and the like) and was crashing Oracle Express
> > > Edition if used as subquery in the main query. So I decided to run these
> > > queries first and store the result in a temporary table. Since the
> > > attribute table can have lots of rows I created attribute_list as CLOB
> > > to prevent truncation errors.
>
> > > So far I'm considering two possibilities:
>
> > > - Get over with it and just truncate large data (after all, 4,000 bytes
> > > of attributes might not be a real life possibility).
> > > - Fetch all rows and remove dupes in the application side (the PHP code
> > > might not be optimal but it's easy to write).
>
> > > [1]http://halisway.blogspot.com/2006/08/oracle-groupconcat-updated-again...
> > > [2]http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:2...
>
> > > --
> > > -- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
> > > -- My site about web programming: http://borrame.com
> > > -- My satin-humor site: http://www.demogracia.com
> > > --
>
> > Another way to solve this: just compute a cryptographic hash on
> > attribute list (MD5 or SHA1, whichever you prefer.) Regardless how
> > long your attribute lists are, their hashes are fixed length and are
> > perfectly comparable. Something similar to this might do:
>
> > select distinct dbms_crypto.hash(clob_col, 2 /* dbms_crypto.hash_md5 */)
> >   from my_gtt_with_clob
>
> > Hth,
> >    Vladimir M. Zakharychev
> >    N-Networks, makers of Dynamic PSP(tm)
> >    http://www.dynamicpsp.com
>
> Computing a checksum/hash for comparison purposes is one method that I
> think has merit but you might just want to consider using the
> dbms_lob.compare procedure to compare the lob values for being equal.
>
> HTH -- Mark D Powell --

Indeed, but how would you group by it, which is what the OP wants? Crypto hashes fit perfectly for this particular task (though it's not going to be cheap on I/O and CPU.)
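For what it's worth, the OP also mentioned deduplicating on the application side, and the same hash trick works there too. Here is a minimal Python sketch of the idea (the table data and column names are illustrative, not from the thread): compare fixed-length digests instead of the arbitrarily long attribute lists.

```python
import hashlib

# Rows as (object_name, attribute_list), where attribute_list may be
# arbitrarily long (the CLOB case). Sample data is made up.
rows = [
    ("object_1", "attr_1_1; attr_1_2; attr_1_3"),
    ("object_1", "attr_1_1; attr_1_2; attr_1_3"),  # exact duplicate
    ("object_2", "attr_2_1; attr_2_2"),
]

seen = set()
unique_rows = []
for name, attr_list in rows:
    # MD5 gives a fixed 16-byte digest regardless of input length,
    # so the dedup key stays small and cheap to compare.
    digest = hashlib.md5(attr_list.encode("utf-8")).digest()
    key = (name, digest)
    if key not in seen:
        seen.add(key)
        unique_rows.append((name, attr_list))

print(unique_rows)
```

As in the SQL version, a hash collision would silently merge two distinct rows; with MD5 that is astronomically unlikely for accidental data, but worth knowing about.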

Regards,

   Vladimir M. Zakharychev
   N-Networks, makers of Dynamic PSP(tm)
   http://www.dynamicpsp.com