Home » SQL & PL/SQL » SQL & PL/SQL » Words occurrence (oracle 10.2)
Words occurrence [message #580734] Wed, 27 March 2013 10:16 Go to next message
guddu_12
Messages: 175
Registered: April 2012
Location: UK
Senior Member
Dear All,

My requirment is to find the top 20 most occured words in the perticular text column accross the full table. i WILL have desc column which is plain text where user wites the essay. i have below 2 data example where i have 2 find which word occur most of the time. i have more than 1 million data where i have to perfor this calculation

E.g
INCIDENT_NO INITIAL_SD_FAULT_DESC
70000030 late entering service on the outer rail causing a 4 minute delay. After speaking to the Hainault DDM and the Hainault DMT it has been established that 70000031 E105 on entering into platform by one cars lenght applied emergency brakes to stop the train due to a person under his train. Train stopped with two cars still outside the platform. I received a call from Line Controllor at 0823 hrs that a person has jumped in front of train 105 at Greenford. I informed Ian Williamson Train Operations Manager. We traveled on a special taxi with a spare Train Operator at 0839hrs and reached Greenford Station at 0859hrs. Emergency Services ( Police, London Ambulance, London Fire Brigade and ERU)

CREATE TABLE tbl_incd(
  INCIDENT_NO            NUMBER ,
  SD_DESC  VARCHAR2(4000 BYTE)
)


data for the above table
Insert into tbl_incd   ( INCIDENT_NO, SD_DESC)
 Values
   (70000031,'E105 on entering into platform by one cars lenght applied emergency brakes to stop the train due to a person under his train. Train stopped with two cars still outside the platform. I received a call from Line Controllor at 0823 hrs that a person has jumped in front of train 105 at Greenford. I informed Ian Williamson Train Operations Manager. We traveled on a special taxi with a spare Train Operator at 0839hrs and reached Greenford Station at 0859hrs. Emergency Services ( Police, London Ambulance, London Fire Brigade and ERU) were already on site. When spoken to Line Controlllor stated he received a normal radio call from E105 to inform him that he has got a person under his train. At 0826 Line Controllor had traction current switched off between Northolt and Greenford. Train service was suspended from North Acton to West Ruislip in both directions (east and west). I could not speak to the Train Operator as Police was in the process of interviewing him and he was too traumatised to speak. Trauma Support was arranged from Loughton as we had no one available at West Ruislip and White City. Trauma Support spoke to T/OP over the telephone and Train Operator decided to travel in a special taxi which was booked for him on his own, although a Station Staff was was arranged to travel with him. During the process a spare T/OP stayed with this T/OP. When spoken to Mr. Clack Operations Security Manager stated he was traveling on this train from West Ruislip and made a telephone call to Network Operations Centre at 0826hrs to inform them of the situation. Mr. Clack travelled to the leading cab and confirmed that although the T/OP was badly shaken but he carried out his duties in a professinal manner by placing Short Circuiting Devise in front of the train and second SCD was placed on the rear of the track by Station Supervisor after traction current was discharged. Passengers were detrained by operation butterfly cock. Les Blaxwall Operations Standards Manager was also travelling on this train and offered his help to T/OP. Mr. Blaxwall assisted in the wrong direction move and stayed till the end of the operation to assist. Wrong Direction move was carried out by T/OP Martin to assist emergency services for the track search. The deceased was an elderly of Asian origin approximately of 65yrs of age wearing white shirt, beige trauser and white trainers. When CCTV was viewed he was seen pacing up and down on the east bound platform near the waiting room. On approach of T105 he jumped in front of the train at 0824hrs and train stopped with two cars outside the station. The body was removed from platform at 0940hrs by ERU and taken in a body bag to the secure room at Greenford Station. Track was accessed by BTP to carry out track search for any trace of identification at 0932hrs and gave all clear at 0942hrs. TOM Williamson assisted in closing the train doors by operating the butterfly cock to close position. Traction Current was switched on at 0948hrs and T105 was worked empty to North Acton and reversed to Ruislip Depot. All clear was given at 0951hrs and after station search, Greenford Station was re opened at 0955hrs. E142 was the first train to run in service on the east bound. Body was removed at 1013hrs from Greenford Station and taken to Uxbridge by the police. Trauma Support was offered to Station supervisor by his DSM, but he declined the offer. Service Manager and Line Controllor were kept informed by myself. I would like to thank all those who were involved for assisting to recover the service, especially the station staff.');
Insert into tbl_incd   ( INCIDENT_NO,SD_DESC)
 Values
   ( 70000032'PART OF TRAIN WITHDRAWAL Train 17 had a 6 min delayed departure from White City due to the train having a defective speedometer. Train was changed over and dud train was sent to Ruislip Depot for the attention of the DDMs staff. Ruislip DDM confirmed that the speedometer was defective. This was traced to a defective ATP controller.');
Insert into tbl_incd   ( INCIDENT_NO, SD_DESC)
 Values
   ( 70000033,' had a 1 min delayed arrival at Ealing Broadway due to the train having problems at signal EAB2058. The T/op stated that the train was in ATO and was stopped at the red signal (2058). The signal cleared and the train moved forward, the signal went straight back to red and the train stopped. This happened once again. The train did not pass the signal. T/op contacted the Line controller via radio and the signal was cleared, with the train proceeding into the platform at Ealing Broadway. The train was taken to Hainault for the attention of the DDM and a download. Train download and Tracker net confirm the T/op`s account of events. Due to the serious implications of events, an investigation is being undertaken by the Duty Engineer at Wood Lane. The T/op carried out his duties in a professional and correct manner and informed the passengers of the situation as and when required.');
COMMIT;


Any clue will be great help


[EDITED by LF: fixed topic title typo; was "occurance"]

[Updated on: Sat, 30 March 2013 03:08] by Moderator

Report message to a moderator

Re: words occurance [message #580743 is a reply to message #580734] Wed, 27 March 2013 12:48 Go to previous messageGo to next message
dariyoosh
Messages: 532
Registered: March 2009
Location: Iran / France
Senior Member
I tried the following based on your sample data (assuming that the space character is the delimiter among words). I use a row generator based on the space character as delimiter.

SET TIMING ON;
COLUMN word FORMAT a50;
SET PAGESIZE 100;

SELECT *
FROM
(
    SELECT word, COUNT(*) AS word_count
    FROM
    (
        SELECT  REGEXP_SUBSTR
                (
                    sd_desc, 
                    '([^ ])+', 
                    1, 
                    t2.COLUMN_VALUE
                ) AS word
        FROM tbl_incd t1 
        CROSS JOIN
            TABLE
            (
                CAST
                (
                    MULTISET
                    (
                        SELECT LEVEL
                        FROM DUAL
                        CONNECT BY LEVEL <= 
                            REGEXP_COUNT(t1.sd_desc, '([^ ])+')
                    )
                    AS SYS.odciNumberList
                )
            ) t2
    )
    GROUP BY word
    ORDER BY word_count DESC
)
WHERE ROWNUM <= 20;
SET TIMING OFF;



Which gives the following output

WORD						   WORD_COUNT
-------------------------------------------------- ----------
the							   49
was							   33
to							   33
and							   32
at							   20
a							   19
train							   18
of							   17
in							   13
by							   12
on							   10
Train							    9
he							    8
from							    8
his							    7
The							    7
Ruislip 						    6
for							    6
with							    6
Station 						    6

20 rows selected.

Elapsed: 00:00:01.54

SQL> 




The problem with this solution: Is Very Slow !!! Sad (in particular for a table of 1000000 rows)



Regards,
Dariyoosh
Re: words occurance [message #580831 is a reply to message #580743] Thu, 28 March 2013 08:20 Go to previous messageGo to next message
guddu_12
Messages: 175
Registered: April 2012
Location: UK
Senior Member
This doesn't work in 10g.
Re: words occurance [message #580855 is a reply to message #580831] Thu, 28 March 2013 12:49 Go to previous messageGo to next message
dariyoosh
Messages: 532
Registered: March 2009
Location: Iran / France
Senior Member
Then you can read Row Generator topic which provides different methods based on different oracle versions.


Regards,
Dariyoosh
Re: words occurance [message #580857 is a reply to message #580734] Thu, 28 March 2013 13:07 Go to previous messageGo to next message
Solomon Yakobson
Messages: 2033
Registered: January 2010
Senior Member
Context index is where I would go:

create index tbl_incd_ctx_idx1
  on tbl_incd(sd_desc)
     indextype is ctxsys.context
/

Index created.

SQL>


Now:

select  token_text,
        token_count
  from  dr$tbl_incd_ctx_idx1$i
/

Query results

SY.

[EDITED by LF: applied [spoiler] tags]

[Updated on: Sat, 30 March 2013 09:20] by Moderator

Report message to a moderator

Re: words occurance [message #580859 is a reply to message #580857] Thu, 28 March 2013 13:30 Go to previous messageGo to next message
Solomon Yakobson
Messages: 2033
Registered: January 2010
Senior Member
I forgot about stop list:

begin
ctx_ddl.create_stoplist('mystop','BASIC_STOPLIST');
end;
/
create index tbl_incd_ctx_idx1
  on tbl_incd(sd_desc)
     indextype is ctxsys.context
     parameters('stoplist mystop transactional')
                
/
select  token_text,
        token_count
  from  dr$tbl_incd_ctx_idx1$i
/

Query results

Now you see words like THE, ALL...

SY.

[EDITED by LF: applied [spoiler] tags]

[Updated on: Sat, 30 March 2013 09:22] by Moderator

Report message to a moderator

Re: words occurance [message #580863 is a reply to message #580859] Thu, 28 March 2013 15:07 Go to previous messageGo to next message
joy_division
Messages: 4515
Registered: February 2005
Location: East Coast USA
Senior Member
Solomon Yakobson wrote on Thu, 28 March 2013 14:30
I forgot about stop list:

begin
ctx_ddl.create_stoplist('mystop','BASIC_STOPLIST');
end;
/
create index tbl_incd_ctx_idx1
  on tbl_incd(sd_desc)
     indextype is ctxsys.context
     parameters('stoplist mystop transactional')
                
/
select  token_text,
        token_count
  from  dr$tbl_incd_ctx_idx1$i
/


Now you see words like THE, ALL...


Cool. I learned something new today.
But...something is amiss. For example, the word train appears way more than 3 times that your output shows. Did you use the same input as the OP?
Re: words occurance [message #580938 is a reply to message #580863] Fri, 29 March 2013 20:18 Go to previous messageGo to next message
Solomon Yakobson
Messages: 2033
Registered: January 2010
Senior Member
joy_division wrote on Thu, 28 March 2013 16:07
But...something is amiss. For example, the word train appears way more than 3 times that your output shows. Did you use the same input as the OP?


You are absolutely right, I lied Embarassed. Column TOKEN_COUNT in DR$indexname$I table is number of documents token appears in. And word train appears in all three rows of the table. So how do we get word count? We can issue Oracle Text CONTAINS operator query against source table with count as score algorithm using DR$indexname$I table as list of words we are counting. And, btw, there was no need to create stop list. We can use predefined CTXSYS.EMPTY_STOPLIST:

create index tbl_incd_ctx_idx1
  on tbl_incd(sd_desc)
     indextype is ctxsys.context
     parameters('stoplist ctxsys.empty_stoplist transactional')
/


Now:

select  token_text,
        sum(score(0)) cnt
  from  tbl_incd,
        dr$tbl_incd_ctx_idx1$i
  where contains(
                 sd_desc,
                 '<query>' ||
                 '   <textquery lang="ENGLISH" grammar="CONTEXT">' ||
                 '"' || token_text || '"' ||
                 '   </textquery>' ||
                 '   <score datatype="INTEGER" algorithm="COUNT"/>' ||
                 '</query>',
                 0
                ) > 0
  group by token_text
  order by cnt desc,
           token_text
/

Query results

SY.

[EDITED by LF: applied [spoiler] tags]

[Updated on: Sat, 30 March 2013 09:23] by Moderator

Report message to a moderator

Re: words occurance [message #580962 is a reply to message #580938] Sat, 30 March 2013 08:32 Go to previous message
Solomon Yakobson
Messages: 2033
Registered: January 2010
Senior Member
One more thing. Example I posted does case-insensitive search. Well, to be precise it is index, not search that is case-insensitive. If we want to use case-sensitive search we need to create case-sensitive index:

exec ctx_ddl.create_preference('my_lexer','BASIC_LEXER')
exec ctx_ddl.set_attribute('my_lexer','MIXED_CASE','YES')
create index tbl_incd_ctx_idx1
  on tbl_incd(sd_desc)
     indextype is ctxsys.context
     parameters('stoplist ctxsys.empty_stoplist lexer my_lexer transactional');


Now same search query:

select  token_text,
        sum(score(0)) cnt
  from  tbl_incd,
        dr$tbl_incd_ctx_idx1$i
  where contains(
                 sd_desc,
                 '<query>' ||
                 '   <textquery lang="ENGLISH" grammar="CONTEXT">' ||
                 '"' || token_text || '"' ||
                 '   </textquery>' ||
                 '   <score datatype="INTEGER" algorithm="COUNT"/>' ||
                 '</query>',
                 0
                ) > 0
  group by token_text
  order by cnt desc,
           token_text
/

Query results

SY.

[EDITED by LF: applied [spoiler] tags]

[Updated on: Sat, 30 March 2013 09:23] by Moderator

Report message to a moderator

Previous Topic: Can I get this required result.?
Next Topic: Bulk Collect
Goto Forum:
  


Current Time: Mon Sep 22 17:55:34 CDT 2014

Total time taken to generate the page: 0.10349 seconds