Home » SQL & PL/SQL » SQL & PL/SQL » Extracting values acquired via UTL_TCP (Oracle 10g XE, MS Windows XP SP3)
Extracting values acquired via UTL_TCP [message #467650] Mon, 26 July 2010 07:32 Go to next message
Littlefoot
Messages: 19815
Registered: June 2005
Location: Croatia, Europe
Senior Member
Account Moderator
Using UTL_TCP package (for the first time; maybe there's smarter way to do that), I captured contents of a certain web page. The page contains (along with some text, images, etc.), a list of values I'd like to extract. Something like this:
PRVI miravira pond          23.7.2010  102,2221
NEKI miravira pond          23.7.2010  105,0996
DRUGI miravira pond         22.7.2010  101,3789

The result (output) of the PL/SQL procedure contains several hundreds of lines. I narrowed the output to a part which I'm interested in. It looks like this (a screenshot, because [code] or [pre] tags won't allow me to paint text, while the others don't preserve formatting):

./fa/8081/0/

Red values are what I'm looking for. The first one represents a date (July 23rd 2010), and the second one is amount (105,0996).

Now, what's the problem: the above mess doesn't look the same every day (probably because the rest of data that appears on a web page changes too). Therefore, SUBSTR (with hard-coded positions) that seems to be working today is wrong tomorrow. Moreover, IF condition I used might not provide desired part of a web page every time.

Here's the code (modified a little bit, unimportant for what matters):
declare
   c utl_tcp.connection;
   n number;
   buffer varchar2(255);
   
   l_x_dat varchar2(10); 
   l_x_izn varchar2(10);
begin
   c := utl_tcp.open_connection('proxy_name', 1234);
   n := utl_tcp.write_line(c, 'GET http://www.some_site.hr/  HTTP/1.0');
   n := utl_tcp.write_line(c);

   begin
     loop
       n := utl_tcp.read_text(c, buffer, 255);

       if instr(buffer, 'NEKI miravira pond') > 0  --> narrowing the output
       then
          dbms_output.put_line(buffer);            --> it produces a screenshot posted above
          l_x_dat := substr(buffer, 176, 10);      --> a date value
          l_x_izn := substr(buffer, 208, 10);      --> an amount
       end if;          
     end loop;
     
  exception
    when utl_tcp.end_of_input then
      null;
  end;
  
  utl_tcp.close_connection(c);

end; 


L_X_DAT and L_X_IZN are values I need. SUBSTR parameters, as I said, are far from being good. Basically, I need a suggestion for a piece of code that goes into the IF - END IF structure.

Is there any smart way to extract those two values? What (Oracle) technology could be used here? Or should I just try to, somehow, set SUBSTR values correctly (dynamically)?

If there was a way to uniquely fetch those values (instead of getting the whole web page and digging for what I'm interested in), it would be extraordinary.

Could someone point me into the right direction, please?
  • Attachment: utl_tcp.PNG
    (Size: 6.31KB, Downloaded 785 times)
Re: Extracting values acquired via UTL_TCP [message #467654 is a reply to message #467650] Mon, 26 July 2010 07:48 Go to previous messageGo to next message
ThomasG
Messages: 3114
Registered: April 2005
Location: Heilbronn, Germany
Senior Member
One possibility:

- Search for the first <td> and the first </td> after that with insrt, and then put that string into an XMLType.

- Extract the data with extractValue somewhat like :

SELECT extractValue(col,'/td/span/text()'),
       extractValue(col,'/td/b/text()')
FROM ( 
  SELECT NEW xmltype('<td><span>foo</span><b>bar</b></td>') col FROM dual
) tab;

Re: Extracting values acquired via UTL_TCP [message #467655 is a reply to message #467650] Mon, 26 July 2010 07:49 Go to previous messageGo to next message
Michel Cadot
Messages: 59809
Registered: March 2007
Location: Nanterre, France, http://...
Senior Member
Account Moderator
You have to get the complete definition of the structure of the page and specification what you want (with words and not image as PL/SQL can't process image) then you can start to think about a code.
There what yiu say is that somewhere in the page there is 2 values you want that are at any place and can have any value. Good luck with these specifications.

Regards
Michel
Re: Extracting values acquired via UTL_TCP [message #467658 is a reply to message #467654] Mon, 26 July 2010 07:52 Go to previous messageGo to next message
ThomasG
Messages: 3114
Registered: April 2005
Location: Heilbronn, Germany
Senior Member
Also: You could check if the entire page is valid XML. Then it might be possible to get the values directly with the correct XPath search path.

It might be possible to specify something like "the next <td> after Neki Meravia Pond"

PS: UTL_HTTP might be of interest.

[Updated on: Mon, 26 July 2010 08:01]

Report message to a moderator

Re: Extracting values acquired via UTL_TCP [message #467660 is a reply to message #467658] Mon, 26 July 2010 08:27 Go to previous messageGo to next message
Alessandro Rossi
Messages: 166
Registered: September 2008
Location: Rome
Senior Member
There are different ways to do what you're asking for, but they're all quite complex to implement and I don't think you may find a complete and exhaust explanation on Oracle Forums.

If the web page is an XHTML page (witch is an XML document) you can use Oracle built in XML features through the XMLTABLE function and a good knowledge of XML query technique.

If not, you should find, or develop(using antlr for example), an HTML parser/interpret and then instruct it to find the values you're looking for.

Bye Alessandro
Re: Extracting values acquired via UTL_TCP [message #467675 is a reply to message #467660] Mon, 26 July 2010 10:48 Go to previous messageGo to next message
Littlefoot
Messages: 19815
Registered: June 2005
Location: Croatia, Europe
Senior Member
Account Moderator
Huh, yes - I thought that "XML"-something will be in your answer(s). That's something I'm not familiar with - yet. Anyway, this is a good reason to start reading about it.

As of Michel's remark about specifications: well, I'd rather if it was simpler. The page I'm working with looks like a stock exchange report (hrportfolio). *Somewhere* on the page is a "date" and a "value":

./fa/8088/0/

The page changes - new funds open, some of them close, and values I need are always there, but not always at the same place.

My last attempt (after creating a topic) was on Thomas' first message's path - find a "point" that doesn't change (using the INSTR function), then extract the value. Today - it worked. I don't know will it work tomorrow (as my yesterday's attempt didn't work today). Gee ...

OK then, thank you for the guidelines, all of you. I'll do some reading and, hopefully, make it work.
Re: Extracting values acquired via UTL_TCP [message #467677 is a reply to message #467675] Mon, 26 July 2010 11:29 Go to previous messageGo to next message
Michel Cadot
Messages: 59809
Registered: March 2007
Location: Nanterre, France, http://...
Senior Member
Account Moderator
Nevertheless there are some points that are fixed in the page, at least the ones you want extract the value, for instance you want the second and third value in the row which contains "Hi-growth" in the first cell of a table with title "Equity funds" (maybe).
So if you can get and post a html version of the page, we can extract only the wanted table and then it is a xml and then we can easily extract the values.

This was what I have in mind in my first post.

Regards
Michel
Re: Extracting values acquired via UTL_TCP [message #467689 is a reply to message #467675] Mon, 26 July 2010 13:04 Go to previous messageGo to next message
Alessandro Rossi
Messages: 166
Registered: September 2008
Location: Rome
Senior Member
Littlefoot wrote on Mon, 26 July 2010 17:48
Huh, yes - I thought that "XML"-something will be in your answer(s). That's something I'm not familiar with - yet. Anyway, this is a good reason to start reading about it.

....

I'll do some reading and, hopefully, make it work.



If so this tutorial may interest you. It's not too long and it should give enough explanations to solve your problem.
Re: Extracting values acquired via UTL_TCP [message #467784 is a reply to message #467650] Tue, 27 July 2010 05:43 Go to previous messageGo to next message
Michel Cadot
Messages: 59809
Registered: March 2007
Location: Nanterre, France, http://...
Senior Member
Account Moderator
The hard part was the unstability of the page to normalize it to a valid XML data:
drop type fonds;
drop type fond;
create or replace type fond as object (
  fond_name  varchar2(30),
  value_date varchar2(10),
  value      varchar2(20),
  currency   varchar2(8)
  )
/
create or replace type fonds is table of fond
/
set define #
create or replace function get_fond (p_fond in varchar2) 
return fonds pipelined
is
  l_page clob;

  procedure get_page 
    -- Retrieve the page from url and record it into l_page variable
  is 
    l_pieces utl_http.html_pieces;
  begin
    -- If proxy is used the following statement must be replaced by:
    -- l_pieces := utl_http.request_pieces (
    --               'http://www.hrportfolio.com/',
    --               proxy => 'user:password@proxy_url:proxy_port'
    --             );
    l_pieces := utl_http.request_pieces('http://www.hrportfolio.com/');
    dbms_lob.createtemporary (lob_loc => l_page,
                              cache   => TRUE,
                              dur     => dbms_lob.call);
    for i in 1..l_pieces.count loop
      l_page := l_page || l_pieces(i);
    end loop;
  end;

  procedure get_table 
    -- Retrieve and keep only the wanted table from l_page
  is
  begin
    -- Table is retrieved from its id, tabelaTec1, that can be anywhere
    -- in the <TABLE> tag which can can have any case.
    -- First remove all before and including the <TABLE> tag of wanted table
    -- (any character case)
    l_page := substr(l_page, 
                     regexp_instr(l_page, 
                                  '<table [^>]+ id=["]{0,1}tabelaTec1["]{0,1}[^>]+',
                                  1, 1, 1, 'i'));
    -- Reinsert the <TABLE> tag at the beginning
    l_page := '<TABLE' || l_page;
    -- Remove all after the first </TABLE> tag (no nested table inside our one)
    l_page := substr(l_page, 1, regexp_instr(l_page, '</table>', 1, 1, 1, 'i'));
  end;

  procedure normalize
    --Normalize the l_page content to be a simple "TABLE" XML page
  is
  begin
    -- Replace any contiguous space string by a single space
    l_page := regexp_replace(l_page, '[[:space:]]+', ' ');
    -- Replace "&nbsp;" string by a space (Oracle seems to not like "&nbsp;")
    l_page := replace(l_page, '&nbsp;', ' ');
    -- Remove IMG element (any character case)
    l_page := regexp_replace(l_page, '<IMG[^>]+>', '', 1, 0, 'i');
    -- Remove all attributes in tags
    l_page := regexp_replace(l_page, '[[:alpha:]]+=[^>]+', '');
    -- Remove <SUP> parts (references to footnote in page) (any character case)
    l_page := regexp_replace(l_page, '<SUP *>[[:digit:]]+</SUP *>', '', 1, 0, 'i');
    -- Remove <A> tags (any character case)
    l_page := regexp_replace(l_page, '<[/]{0,1}A *>', '', 1, 0, 'i');
    -- Remove <DIV> tags (any character case)
    l_page := regexp_replace(l_page, '<[/]{0,1}DIV *>', '', 1, 0, 'i');
    -- Put remaining tags in upper case as Oracle XML query is case sensitive
    l_page := replace(l_page, '</table>', '</TABLE>');
    l_page := replace(l_page, '<tbody>', '<TBODY>');
    l_page := replace(l_page, '</tbody>', '</TBODY>');
    l_page := regexp_replace(l_page, 'td *>', 'TD>');
    l_page := regexp_replace(l_page, 'tr *>', 'TR>');
  end;

  procedure print (p_string in clob)
    -- Display a string in lines of "line_lg" width
  is 
    line_lg constant pls_integer := 250; -- Lines of 250 characters max.
    lg      constant pls_integer := length(p_string);
    i       pls_integer := 1;
  begin
    while i <= lg loop
      dbms_output.put_line(substr(p_string, i, line_lg));
      i := i + line_lg;
    end loop;
  end;

begin
  -- Get the wanted table and normalize it at simple XML page
  get_page;
  get_table;
  normalize;
  -- Retrieve the wanted data for the matched fonds
  for rec in (
    select extractvalue(value(tr), '//TD[position()=1]') "FOND",
           extractvalue(value(tr), '//TD[position()=5]') "DATE",
           extractvalue(value(tr), '//TD[position()=6]') "VALUE",
           extractvalue(value(tr), '//TD[position()=7]') "CURRENCY"
    from ( select xmltype(l_page) val from dual ) tt,
         table(xmlsequence(extract(tt.val, '/TABLE/TBODY/TR'))) tr
    where lower(extractvalue(value(tr), '//TD[position()=1]')) like 
           '%'||lower(p_fond)||'%'
    order by 1
  ) loop
    pipe row (fond(rec.fond, rec.date, rec.value, rec.currency));
  end loop;
exception
  when no_data_needed then
    -- Nothing to do in this case
    null;
  when others then  -- for debugging only
    print ('>>> l_page content when exception occured <<<');
    print (l_page);
    print ('>>> End of l_page content <<<');
    print (' ');
    raise;
end;
/
show error
set define &

SQL> select * from table(get_fond('growth'));
FOND_NAME                      VALUE_DATE VALUE                CURRENCY
------------------------------ ---------- -------------------- --------
HI-growth                      26.07.     8,0564               €

1 row selected.

SQL> select * from table(get_fond('KD'));
FOND_NAME                      VALUE_DATE VALUE                CURRENCY
------------------------------ ---------- -------------------- --------
KD Energija                    26.07.     9,5609               kn
KD Nova Europa                 26.07.     6,4640               kn
KD Prvi izbor                  26.07.     12,0338              kn
KD Victoria                    26.07.     14,5112              kn

4 rows selected.

Regards
Michel

[Updated on: Mon, 13 September 2010 13:04]

Report message to a moderator

Re: Extracting values acquired via UTL_TCP [message #467789 is a reply to message #467784] Tue, 27 July 2010 06:10 Go to previous messageGo to next message
Littlefoot
Messages: 19815
Registered: June 2005
Location: Croatia, Europe
Senior Member
Account Moderator
Now, Michel, that's something wonderful! Thank you so much for your time and all the effort you put into creating a solution.

My attempt (after doing some research, based on links and code already posted) was far much simpler and - most probably - as much worse. A problem with testing is that page refreshes only once a day so ... one has to be patient.

It'll take some time until I understand what you did. Once again, thank you!

[EDIT] No, it won't take some time, but probably my whole life.

[Updated on: Tue, 27 July 2010 06:17]

Report message to a moderator

Re: Extracting values acquired via UTL_TCP [message #467997 is a reply to message #467789] Wed, 28 July 2010 08:41 Go to previous messageGo to next message
Michel Cadot
Messages: 59809
Registered: March 2007
Location: Nanterre, France, http://...
Senior Member
Account Moderator
@Littlefoot,

I added some comments inside the code to make it easier to understand and improve (usage of proxy, change line length in "print" procedure, this procedure is only used in case of exception).
It was tested in 10.2.0.4 and seems to still work today. Wink

Regards
Michel

[Updated on: Wed, 28 July 2010 08:45]

Report message to a moderator

Re: Extracting values acquired via UTL_TCP [message #468086 is a reply to message #467997] Wed, 28 July 2010 16:18 Go to previous message
Littlefoot
Messages: 19815
Registered: June 2005
Location: Croatia, Europe
Senior Member
Account Moderator
Thank you, that's very nice and - above all - helpful. Your comments are more than welcome; looking at code, I was puzzled, asking myself why did he do that?!?. Now it is much easier to understand it.

Developing a "simple" solution by my own, it seems that we share some problems (such as Oracle not liking &nbsp). Proud to say - my procedure did the job two consequential days, which is an improvement. However, I'm not satisfied with it; I'm afraid that if certain strings aren't where they were, it will fail. Therefore, I'm more than interested in your code.

This:Michel
So if you can get and post a html version of the page, we can extract only the wanted table and then it is a xml and then we can easily extract the values.
raised my attention; I thought that it is really cool, but I couldn't imagine how the heck could one extract only the wanted table ...

There's so many things I don't know, that it's a shame. Pointing to a right direction is helpful, presenting a solution is unmeasurable. I'm speechless.
Previous Topic: SELECT INTO not working
Next Topic: replace zero with null values
Goto Forum:
  


Current Time: Thu Nov 27 07:27:33 CST 2014

Total time taken to generate the page: 0.20923 seconds