Re: Regular expression help required please

From: Tim X <timx_at_nospam.dev.null>
Date: Wed, 29 Sep 2010 09:35:52 +1000
Message-ID: <8739st4f1j.fsf_at_puma.rapttech.com.au>



Kevin S <searlek_at_googlemail.com> writes:

> Firstly I realise that I typed the wrong regular expression in my
> question which won't help so apologies for that.
>
> Perhaps it would be easier if I say that my aim is to strip out
> everything except alphanumeric characters and whitespaces from strings
> to get over an error we are getting in an Oracle 8 database.
>
> We can reproduce the error as follows where step 4 results in the
> 'ORA-01401: inserted value too large for column' error even though
> it's the right length.
>
> 1 DROP TABLE FIFTYCHAR;
> 2 CREATE TABLE FIFTYCHAR(FIFTY_CHAR_COL VARCHAR2(50));
> 3 SELECT length('• 123456789012345678901234567890123456789012345678')
> FROM dual;
> 4 INSERT INTO FIFTYCHAR(FIFTY_CHAR_COL)
> VALUES('• 123456789012345678901234567890123456789012345678');
> 5 DROP TABLE FIFTYCHAR;
>
> I think I have found an initial solution to my proposed solution.
>
> SELECT regexp_replace('H\el£lo World','[^[:alnum:][:space:]]', '')
> FROM dual;
>
> will strip out the alpha numerics while leaving white spaces.
>
> It's not ideal as it strips out £ signs and all punctuation, however
> it's characters that fall under the punctuation category (Bullet
> points, £ signs etc) which are causing the problem.
>
> Would anyone have any suggestions as to how to only strip out
> troublesome characters?

Just a side note. I suspect you may be solving the wrong problem.

If I understand your issue, the problem is that your getting errors that the data is too long to fit in the varchar2 column, but in the variable has the correct number of characters.

This is almost certainly due to you having mixed character encodings in your data. Most likely, a different database character encoding from your clients. If you look at the docs for some of the sql functions dealing with things like calculating the length of data, you will se that the varchar2 related versions report the number of characters, not the number of bytes. This is why it looks like your string is the right length., but generates the exception you are seeing.

Trying to 'fix' this with regular expressions is unmaintainable as you would need to filter out all multibyte characters, not just the punctuation ones you have noticed. Also note that some regexpression implementations don't handle mixed encodings well without adding additional switches (I'm not sure about Oracle's implementation).

Read up about Oracle's support for internationalisation, look at nvarchr2 et.al. and check what the default encodiing is for both your database and clients. Note also that there are some sql functions that you can pass a string to which will return more informative/accurate information regarding the actaul characters and their 'real' length.

Tim

-- 
tcross (at) rapttech dot com dot au
Received on Tue Sep 28 2010 - 18:35:52 CDT

Original text of this message