Regex Delete Repeating Words in a String [message #502310] |
Tue, 05 April 2011 14:20  |
 |
supercox
Messages: 4 Registered: April 2011 Location: NOVA
|
Junior Member |
|
|
I am trying to delete duplicate or repeating words in a string using regular expressions (regex) and an Oracle database.
I have Googled quite a bit on this topic and was unable to track down something that fit the bill or remotely came close to paying the tip.
Any assistance would be helpful.
A pattern would be: word word start with new
The final string should look like this: word start with new
Basically all the strings begin with the two words repeating.
Thank you
Roger
[Updated on: Tue, 05 April 2011 14:27] Report message to a moderator
|
|
|
|
Re: Regex Delete Repeating Words in a String [message #502313 is a reply to message #502312] |
Tue, 05 April 2011 14:48   |
 |
supercox
Messages: 4 Registered: April 2011 Location: NOVA
|
Junior Member |
|
|
I am trying to use the String Comparison to compare a filename and document title to determine their match percentage.
The specific field is a file and path name to the specific files.
The folder name, which is always repeated as the first portion of the file name, is separated by a forward slash ( / ), which I have removed prior to getting to this point.
Here is an example filename and document title:
Path/Filename: word/word_v901_test_certification_and_evaluation_memo.pdf
Title: Word v9.0.1 Test Certification and Evaluation Memo
I've already removed the slash (/), underscores (_), and file extensions (.pdf) so that I could compare the filename with the specific document name.
Additionally, I am using a LOWER command to ensure they are evaluated as lowercase characters, as well as removed all the periods.
The repeating words are the last piece of the puzzle to remove before I can use the String Comparison solution (#msg_502307) to evaluate their percentage of similarity.
If I wanted to replace the specific repeating words listed above, I could do this:
UPDATE admin.testpercentcompare SET document_title = REPLACE (document_title, 'word word ', 'word ') WHERE document_title LIKE '%word word %';
The repeating words are actually different on many of the 3500 rows, which is why I am seeking a Regex solution to the problem.
|
|
|
|
Re: Regex Delete Repeating Words in a String [message #502328 is a reply to message #502314] |
Tue, 05 April 2011 23:08   |
 |
Barbara Boehmer
Messages: 9106 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
Since it is always the first word that is repeated, you could just use substr and instr to remove the first word, as demonstrated below.
SCOTT@orcl_11gR2> SELECT percentage_of_matching_words
2 (SUBSTR (file_name, INSTR (file_name, ' ') + 1),
3 title)
4 FROM (SELECT 'word word v901 test certification and evaluation memo' file_name,
5 'word v901 test certification and evaluation memo' title
6 FROM DUAL)
7 /
PERCENTAGE_OF_MATCHING_WORDS(SUBSTR(FILE_NAME,INSTR(FILE_NAME,'')+1),TITLE)
---------------------------------------------------------------------------
100
1 row selected.
SCOTT@orcl_11gR2>
|
|
|
Re: Regex Delete Repeating Words in a String [message #502331 is a reply to message #502328] |
Tue, 05 April 2011 23:31   |
 |
Barbara Boehmer
Messages: 9106 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
Alternatively, using regexp_replace:
SCOTT@orcl_11gR2> SELECT percentage_of_matching_words
2 (REGEXP_REPLACE (file_name, '^[a-z0-9]+ ', '', 1, 1),
3 title)
4 FROM (SELECT 'word word v901 test certification and evaluation memo' file_name,
5 'word v901 test certification and evaluation memo' title
6 FROM DUAL)
7 /
PERCENTAGE_OF_MATCHING_WORDS(REGEXP_REPLACE(FILE_NAME,'^[A-Z0-9]+','',1,1),TITLE
--------------------------------------------------------------------------------
100
1 row selected.
SCOTT@orcl_11gR2>
|
|
|
|
Re: Regex Delete Repeating Words in a String [message #502491 is a reply to message #502479] |
Wed, 06 April 2011 13:13  |
 |
Barbara Boehmer
Messages: 9106 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
I don't know what other options you may have considered or how big your data set is or whether it is static or changes frequently, but some other options that you might be interested in are the functions in the utl_match package and Oracle Text. If there may be slight differences in spelling of the words or portions of the words, then these will find those and take them into account when scoring. If you have a large data set, Oracle Text uses an index, so it can be a lot faster and has a lot of features available. You can also combine methods, so that you can use one that is fast to obtain a small result set, then use another one that is more detailed to order that smaller data set. It might help to have some more details, such as where the files and titles come from and the need for comparison.
|
|
|