OraFAQ Forum: SQL & PL/SQL » Regex Delete Repeating Words in a String

Home » SQL & PL/SQL » SQL & PL/SQL » Regex Delete Repeating Words in a String (Oracle SQL Developer v2.1.1.64)

Show: Today's Messages :: Polls :: Message Navigator
E-mail to friend

Regex Delete Repeating Words in a String [message #502310]

Tue, 05 April 2011 14:20

supercox
Messages: 4
Registered: April 2011
Location: NOVA

Junior Member

I am trying to delete duplicate or repeating words in a string using regular expressions (regex) and an Oracle database.

I have Googled quite a bit on this topic and was unable to track down something that fit the bill or remotely came close to paying the tip.

Any assistance would be helpful.

A pattern would be: word word start with new

The final string should look like this: word start with new

Basically all the strings begin with the two words repeating.

Thank you
Roger

[Updated on: Tue, 05 April 2011 14:27]

Report message to a moderator

Re: Regex Delete Repeating Words in a String [message #502312 is a reply to message #502310]

Tue, 05 April 2011 14:26

BlackSwan
Messages: 26766
Registered: January 2009
Location: SoCal

Senior Member

It would be helpful if you provided DDL (CREATE TABLE ...) for tables involved.
It would be helpful if you provided DML (INSERT INTO ...) for test data.
It would be helpful if you followed Posting Guidelines - http://www.orafaq.com/forum/t/88153/0/
It would be helpful if you provided expected/desired results & a detailed explanation how & why the test data gets transformed or organized.

Report message to a moderator

Re: Regex Delete Repeating Words in a String [message #502313 is a reply to message #502312]

Tue, 05 April 2011 14:48

supercox
Messages: 4
Registered: April 2011
Location: NOVA

Junior Member

I am trying to use the String Comparison to compare a filename and document title to determine their match percentage.

The specific field is a file and path name to the specific files.
The folder name, which is always repeated as the first portion of the file name, is separated by a forward slash ( / ), which I have removed prior to getting to this point.

Here is an example filename and document title:
Path/Filename: word/word_v901_test_certification_and_evaluation_memo.pdf
Title: Word v9.0.1 Test Certification and Evaluation Memo

I've already removed the slash (/), underscores (_), and file extensions (.pdf) so that I could compare the filename with the specific document name.
Additionally, I am using a LOWER command to ensure they are evaluated as lowercase characters, as well as removed all the periods.
The repeating words are the last piece of the puzzle to remove before I can use the String Comparison solution (#msg_502307) to evaluate their percentage of similarity.

If I wanted to replace the specific repeating words listed above, I could do this:
UPDATE admin.testpercentcompare SET document_title = REPLACE (document_title, 'word word ', 'word ') WHERE document_title LIKE '%word word %';

The repeating words are actually different on many of the 3500 rows, which is why I am seeking a Regex solution to the problem.

Report message to a moderator

Re: Regex Delete Repeating Words in a String [message #502314 is a reply to message #502313]

Tue, 05 April 2011 14:56

BlackSwan
Messages: 26766
Registered: January 2009
Location: SoCal

Senior Member

blah, blah, blah, blah

It would be helpful if you followed Posting Guidelines - http://www.orafaq.com/forum/t/88153/0/
It would be helpful if you followed Posting Guidelines - http://www.orafaq.com/forum/t/88153/0/
It would be helpful if you followed Posting Guidelines - http://www.orafaq.com/forum/t/88153/0/
It would be helpful if you followed Posting Guidelines - http://www.orafaq.com/forum/t/88153/0/

Report message to a moderator

Re: Regex Delete Repeating Words in a String [message #502328 is a reply to message #502314]

Tue, 05 April 2011 23:08

Barbara Boehmer
Messages: 9106
Registered: November 2002
Location: California, USA

Senior Member

Since it is always the first word that is repeated, you could just use substr and instr to remove the first word, as demonstrated below.

SCOTT@orcl_11gR2> SELECT percentage_of_matching_words
  2  	      (SUBSTR (file_name, INSTR (file_name, ' ') + 1),
  3  	       title)
  4  FROM   (SELECT 'word word v901 test certification and evaluation memo' file_name,
  5  		    'word v901 test certification and evaluation memo' title
  6  	     FROM   DUAL)
  7  /

PERCENTAGE_OF_MATCHING_WORDS(SUBSTR(FILE_NAME,INSTR(FILE_NAME,'')+1),TITLE)
---------------------------------------------------------------------------
                                                                        100

1 row selected.

SCOTT@orcl_11gR2>

Report message to a moderator

Re: Regex Delete Repeating Words in a String [message #502331 is a reply to message #502328]

Tue, 05 April 2011 23:31

Barbara Boehmer
Messages: 9106
Registered: November 2002
Location: California, USA

Senior Member

Alternatively, using regexp_replace:

SCOTT@orcl_11gR2> SELECT percentage_of_matching_words
  2  	      (REGEXP_REPLACE (file_name, '^[a-z0-9]+ ', '', 1, 1),
  3  	       title)
  4  FROM   (SELECT 'word word v901 test certification and evaluation memo' file_name,
  5  		    'word v901 test certification and evaluation memo' title
  6  	     FROM   DUAL)
  7  /

PERCENTAGE_OF_MATCHING_WORDS(REGEXP_REPLACE(FILE_NAME,'^[A-Z0-9]+','',1,1),TITLE
--------------------------------------------------------------------------------
                                                                             100

1 row selected.

SCOTT@orcl_11gR2>

Report message to a moderator

Re: Regex Delete Repeating Words in a String [message #502479 is a reply to message #502331]

Wed, 06 April 2011 10:37

supercox
Messages: 4
Registered: April 2011
Location: NOVA

Junior Member

Thank you Barbara! It worked like a charm.
The piece that I needed was: '^[a-z0-9]+ ', '', 1, 1)

Here is the code that I used to remove the duplicates from the table:

UPDATE admin.testpercentcompare SET document_title = REGEXP_REPLACE (document_title, '^[a-z0-9]+ ', '', 1, 1);

When I finally removed to duplicates, I was able to see some of the filenames and document titles achieve 100% comparison!
I appreciate your help in this!

Report message to a moderator

Re: Regex Delete Repeating Words in a String [message #502491 is a reply to message #502479]

Wed, 06 April 2011 13:13

Barbara Boehmer
Messages: 9106
Registered: November 2002
Location: California, USA

Senior Member

I don't know what other options you may have considered or how big your data set is or whether it is static or changes frequently, but some other options that you might be interested in are the functions in the utl_match package and Oracle Text. If there may be slight differences in spelling of the words or portions of the words, then these will find those and take them into account when scoring. If you have a large data set, Oracle Text uses an index, so it can be a lot faster and has a lot of features available. You can also combine methods, so that you can use one that is fast to obtain a small result set, then use another one that is more detailed to order that smaller data set. It might help to have some more details, such as where the files and titles come from and the need for comparison.

Report message to a moderator

Previous Topic:	Package
Next Topic:	DBMS_DATAPUMP

Goto Forum:

-=] Back to Top [=-

[ Syndicate this forum (XML) ] [

]

Current Time: Sun Aug 03 15:02:54 CDT 2025