Home » SQL & PL/SQL » SQL & PL/SQL » Extracting values acquired via UTL_TCP (Oracle 10g XE, MS Windows XP SP3)
| Extracting values acquired via UTL_TCP [message #467650] |
Mon, 26 July 2010 07:32  |
 |
Littlefoot
Messages: 17256 Registered: June 2005 Location: Croatia, Europe
|
Senior Member Account Moderator |
|
|
Using UTL_TCP package (for the first time; maybe there's smarter way to do that), I captured contents of a certain web page. The page contains (along with some text, images, etc.), a list of values I'd like to extract. Something like this:
PRVI miravira pond 23.7.2010 102,2221
NEKI miravira pond 23.7.2010 105,0996
DRUGI miravira pond 22.7.2010 101,3789
The result (output) of the PL/SQL procedure contains several hundreds of lines. I narrowed the output to a part which I'm interested in. It looks like this (a screenshot, because [code] or [pre] tags won't allow me to paint text, while the others don't preserve formatting):

Red values are what I'm looking for. The first one represents a date (July 23rd 2010), and the second one is amount (105,0996).
Now, what's the problem: the above mess doesn't look the same every day (probably because the rest of data that appears on a web page changes too). Therefore, SUBSTR (with hard-coded positions) that seems to be working today is wrong tomorrow. Moreover, IF condition I used might not provide desired part of a web page every time.
Here's the code (modified a little bit, unimportant for what matters):
declare
c utl_tcp.connection;
n number;
buffer varchar2(255);
l_x_dat varchar2(10);
l_x_izn varchar2(10);
begin
c := utl_tcp.open_connection('proxy_name', 1234);
n := utl_tcp.write_line(c, 'GET http://www.some_site.hr/ HTTP/1.0');
n := utl_tcp.write_line(c);
begin
loop
n := utl_tcp.read_text(c, buffer, 255);
if instr(buffer, 'NEKI miravira pond') > 0 --> narrowing the output
then
dbms_output.put_line(buffer); --> it produces a screenshot posted above
l_x_dat := substr(buffer, 176, 10); --> a date value
l_x_izn := substr(buffer, 208, 10); --> an amount
end if;
end loop;
exception
when utl_tcp.end_of_input then
null;
end;
utl_tcp.close_connection(c);
end;
L_X_DAT and L_X_IZN are values I need. SUBSTR parameters, as I said, are far from being good. Basically, I need a suggestion for a piece of code that goes into the IF - END IF structure.
Is there any smart way to extract those two values? What (Oracle) technology could be used here? Or should I just try to, somehow, set SUBSTR values correctly (dynamically)?
If there was a way to uniquely fetch those values (instead of getting the whole web page and digging for what I'm interested in), it would be extraordinary.
Could someone point me into the right direction, please?
-
Attachment: utl_tcp.PNG
(Size: 6.31KB, Downloaded 630 times)
|
|
|
|
| Re: Extracting values acquired via UTL_TCP [message #467654 is a reply to message #467650] |
Mon, 26 July 2010 07:48   |
ThomasG
Messages: 2893 Registered: April 2005 Location: Heilbronn, Germany
|
Senior Member |
|
|
One possibility:
- Search for the first <td> and the first </td> after that with insrt, and then put that string into an XMLType.
- Extract the data with extractValue somewhat like :
SELECT extractValue(col,'/td/span/text()'),
extractValue(col,'/td/b/text()')
FROM (
SELECT NEW xmltype('<td><span>foo</span><b>bar</b></td>') col FROM dual
) tab;
|
|
|
|
|
|
|
|
|
|
| Re: Extracting values acquired via UTL_TCP [message #467675 is a reply to message #467660] |
Mon, 26 July 2010 10:48   |
 |
Littlefoot
Messages: 17256 Registered: June 2005 Location: Croatia, Europe
|
Senior Member Account Moderator |
|
|
Huh, yes - I thought that "XML"-something will be in your answer(s). That's something I'm not familiar with - yet. Anyway, this is a good reason to start reading about it.
As of Michel's remark about specifications: well, I'd rather if it was simpler. The page I'm working with looks like a stock exchange report (hrportfolio). *Somewhere* on the page is a "date" and a "value":

The page changes - new funds open, some of them close, and values I need are always there, but not always at the same place.
My last attempt (after creating a topic) was on Thomas' first message's path - find a "point" that doesn't change (using the INSTR function), then extract the value. Today - it worked. I don't know will it work tomorrow (as my yesterday's attempt didn't work today). Gee ...
OK then, thank you for the guidelines, all of you. I'll do some reading and, hopefully, make it work.
|
|
|
|
|
|
|
|
| Re: Extracting values acquired via UTL_TCP [message #467784 is a reply to message #467650] |
Tue, 27 July 2010 05:43   |
 |
Michel Cadot
Messages: 54712 Registered: March 2007 Location: Nanterre, France, http://...
|
Senior Member Account Moderator |
|
|
The hard part was the unstability of the page to normalize it to a valid XML data:
drop type fonds;
drop type fond;
create or replace type fond as object (
fond_name varchar2(30),
value_date varchar2(10),
value varchar2(20),
currency varchar2(8)
)
/
create or replace type fonds is table of fond
/
set define #
create or replace function get_fond (p_fond in varchar2)
return fonds pipelined
is
l_page clob;
procedure get_page
-- Retrieve the page from url and record it into l_page variable
is
l_pieces utl_http.html_pieces;
begin
-- If proxy is used the following statement must be replaced by:
-- l_pieces := utl_http.request_pieces (
-- 'http://www.hrportfolio.com/',
-- proxy => 'user:password@proxy_url:proxy_port'
-- );
l_pieces := utl_http.request_pieces('http://www.hrportfolio.com/');
dbms_lob.createtemporary (lob_loc => l_page,
cache => TRUE,
dur => dbms_lob.call);
for i in 1..l_pieces.count loop
l_page := l_page || l_pieces(i);
end loop;
end;
procedure get_table
-- Retrieve and keep only the wanted table from l_page
is
begin
-- Table is retrieved from its id, tabelaTec1, that can be anywhere
-- in the <TABLE> tag which can can have any case.
-- First remove all before and including the <TABLE> tag of wanted table
-- (any character case)
l_page := substr(l_page,
regexp_instr(l_page,
'<table [^>]+ id=["]{0,1}tabelaTec1["]{0,1}[^>]+',
1, 1, 1, 'i'));
-- Reinsert the <TABLE> tag at the beginning
l_page := '<TABLE' || l_page;
-- Remove all after the first </TABLE> tag (no nested table inside our one)
l_page := substr(l_page, 1, regexp_instr(l_page, '</table>', 1, 1, 1, 'i'));
end;
procedure normalize
--Normalize the l_page content to be a simple "TABLE" XML page
is
begin
-- Replace any contiguous space string by a single space
l_page := regexp_replace(l_page, '[[:space:]]+', ' ');
-- Replace " " string by a space (Oracle seems to not like " ")
l_page := replace(l_page, ' ', ' ');
-- Remove IMG element (any character case)
l_page := regexp_replace(l_page, '<IMG[^>]+>', '', 1, 0, 'i');
-- Remove all attributes in tags
l_page := regexp_replace(l_page, '[[:alpha:]]+=[^>]+', '');
-- Remove <SUP> parts (references to footnote in page) (any character case)
l_page := regexp_replace(l_page, '<SUP *>[[:digit:]]+</SUP *>', '', 1, 0, 'i');
-- Remove <A> tags (any character case)
l_page := regexp_replace(l_page, '<[/]{0,1}A *>', '', 1, 0, 'i');
-- Remove <DIV> tags (any character case)
l_page := regexp_replace(l_page, '<[/]{0,1}DIV *>', '', 1, 0, 'i');
-- Put remaining tags in upper case as Oracle XML query is case sensitive
l_page := replace(l_page, '</table>', '</TABLE>');
l_page := replace(l_page, '<tbody>', '<TBODY>');
l_page := replace(l_page, '</tbody>', '</TBODY>');
l_page := regexp_replace(l_page, 'td *>', 'TD>');
l_page := regexp_replace(l_page, 'tr *>', 'TR>');
end;
procedure print (p_string in clob)
-- Display a string in lines of "line_lg" width
is
line_lg constant pls_integer := 250; -- Lines of 250 characters max.
lg constant pls_integer := length(p_string);
i pls_integer := 1;
begin
while i <= lg loop
dbms_output.put_line(substr(p_string, i, line_lg));
i := i + line_lg;
end loop;
end;
begin
-- Get the wanted table and normalize it at simple XML page
get_page;
get_table;
normalize;
-- Retrieve the wanted data for the matched fonds
for rec in (
select extractvalue(value(tr), '//TD[position()=1]') "FOND",
extractvalue(value(tr), '//TD[position()=5]') "DATE",
extractvalue(value(tr), '//TD[position()=6]') "VALUE",
extractvalue(value(tr), '//TD[position()=7]') "CURRENCY"
from ( select xmltype(l_page) val from dual ) tt,
table(xmlsequence(extract(tt.val, '/TABLE/TBODY/TR'))) tr
where lower(extractvalue(value(tr), '//TD[position()=1]')) like
'%'||lower(p_fond)||'%'
order by 1
) loop
pipe row (fond(rec.fond, rec.date, rec.value, rec.currency));
end loop;
exception
when no_data_needed then
-- Nothing to do in this case
null;
when others then -- for debugging only
print ('>>> l_page content when exception occured <<<');
print (l_page);
print ('>>> End of l_page content <<<');
print (' ');
raise;
end;
/
show error
set define &
SQL> select * from table(get_fond('growth'));
FOND_NAME VALUE_DATE VALUE CURRENCY
------------------------------ ---------- -------------------- --------
HI-growth 26.07. 8,0564 €
1 row selected.
SQL> select * from table(get_fond('KD'));
FOND_NAME VALUE_DATE VALUE CURRENCY
------------------------------ ---------- -------------------- --------
KD Energija 26.07. 9,5609 kn
KD Nova Europa 26.07. 6,4640 kn
KD Prvi izbor 26.07. 12,0338 kn
KD Victoria 26.07. 14,5112 kn
4 rows selected.
Regards
Michel
[Updated on: Mon, 13 September 2010 13:04] Report message to a moderator
|
|
|
|
| Re: Extracting values acquired via UTL_TCP [message #467789 is a reply to message #467784] |
Tue, 27 July 2010 06:10   |
 |
Littlefoot
Messages: 17256 Registered: June 2005 Location: Croatia, Europe
|
Senior Member Account Moderator |
|
|
Now, Michel, that's something wonderful! Thank you so much for your time and all the effort you put into creating a solution.
My attempt (after doing some research, based on links and code already posted) was far much simpler and - most probably - as much worse. A problem with testing is that page refreshes only once a day so ... one has to be patient.
It'll take some time until I understand what you did. Once again, thank you!
[EDIT] No, it won't take some time, but probably my whole life.
[Updated on: Tue, 27 July 2010 06:17] Report message to a moderator
|
|
|
|
| Re: Extracting values acquired via UTL_TCP [message #467997 is a reply to message #467789] |
Wed, 28 July 2010 08:41   |
 |
Michel Cadot
Messages: 54712 Registered: March 2007 Location: Nanterre, France, http://...
|
Senior Member Account Moderator |
|
|
@Littlefoot,
I added some comments inside the code to make it easier to understand and improve (usage of proxy, change line length in "print" procedure, this procedure is only used in case of exception).
It was tested in 10.2.0.4 and seems to still work today. 
Regards
Michel
[Updated on: Wed, 28 July 2010 08:45] Report message to a moderator
|
|
|
|
| Re: Extracting values acquired via UTL_TCP [message #468086 is a reply to message #467997] |
Wed, 28 July 2010 16:18  |
 |
Littlefoot
Messages: 17256 Registered: June 2005 Location: Croatia, Europe
|
Senior Member Account Moderator |
|
|
Thank you, that's very nice and - above all - helpful. Your comments are more than welcome; looking at code, I was puzzled, asking myself why did he do that?!?. Now it is much easier to understand it.
Developing a "simple" solution by my own, it seems that we share some problems (such as Oracle not liking  ). Proud to say - my procedure did the job two consequential days, which is an improvement. However, I'm not satisfied with it; I'm afraid that if certain strings aren't where they were, it will fail. Therefore, I'm more than interested in your code.
This:MichelSo if you can get and post a html version of the page, we can extract only the wanted table and then it is a xml and then we can easily extract the values. raised my attention; I thought that it is really cool, but I couldn't imagine how the heck could one extract only the wanted table ...
There's so many things I don't know, that it's a shame. Pointing to a right direction is helpful, presenting a solution is unmeasurable. I'm speechless.
|
|
|
|
Goto Forum:
Current Time: Wed Jun 19 13:19:07 CDT 2013
Total time taken to generate the page: 0.11562 seconds
|