Converting HTML format text to Plain Text (Oracle 10g)

Re: Converting HTML format text to Plain Text [message #436402 is a reply to message #436397]
Wed, 23 December 2009 06:37
JRowbottom
Messages: 5933 Registered: June 2006 Location: Sunny North Yorkshire, ho...
Senior Member
If your question can be restated as 'How do I get the text from this string that is not enclosed by <...>', then one way of doing it would be:
with src as (select '<html><head></head><body>Welcome</body></html>' html from dual)
select translate(regexp_substr(html,'>[^<>]+<'),' <>',' ') from src;
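For readers who want to try the same idea outside the database, here is a minimal Java sketch of it. The class and method names (FirstTextBetweenTags, firstText) are made up for illustration and are not part of the solution above; the pattern is the same '>[^<>]+<', with a capture group standing in for the TRANSLATE call that strips the delimiters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: hypothetical class and method names.
final class FirstTextBetweenTags {
    // Same idea as regexp_substr(html, '>[^<>]+<'): the first run of
    // characters that sits between a '>' and the next '<'.
    private static final Pattern TEXT = Pattern.compile(">([^<>]+)<");

    // Returns the first text run, or null when there is none.
    static String firstText(String html) {
        Matcher m = TEXT.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // prints "Welcome"
        System.out.println(firstText("<html><head></head><body>Welcome</body></html>"));
    }
}
```

Like the SQL version, this only returns the first piece of text, and it stops at the first '<' it meets, whether or not that '<' starts a tag.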

Re: Converting HTML format text to Plain Text [message #436423 is a reply to message #436408]
Wed, 23 December 2009 08:50
JRowbottom
Messages: 5933 Registered: June 2006 Location: Sunny North Yorkshire, ho...
Senior Member
Here's an expansion of my previous solution, which will extract each piece of text in the HTML into a separate field:
with src as (select '<html><head>Header Text</head><body>Body Text</body></html>' html from dual)
,fields as (select html
                  ,regexp_replace(regexp_replace(html
                                                ,'<[^<>]+>'
                                                ,'<')
                                 ,'<+'
                                 ,'<') data
                  ,length(regexp_replace(regexp_replace(regexp_replace(html
                                                                      ,'<[^<>]+>'
                                                                      ,'<')
                                                       ,'<+'
                                                       ,'<')
                                        ,'[^<]+'
                                        ,''))-1 num_fields
            from src)
select regexp_substr(data,'[^<]+',1,level) field
from fields
connect by level <= num_fields;
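The two steps in that query (turn every tag into a single '<' delimiter, then pick out the pieces between delimiters) can be sketched in Java as follows. This is only an illustration under the same assumptions as the query; the names TextFields and fields are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: hypothetical class and method names.
final class TextFields {
    static List<String> fields(String html) {
        // Step 1: replace every <...> tag with a single '<' delimiter.
        String data = html.replaceAll("<[^<>]+>", "<");
        // Step 2: split on runs of '<' and keep the non-empty pieces.
        List<String> out = new ArrayList<>();
        for (String piece : data.split("<+")) {
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [Header Text, Body Text]
        System.out.println(fields("<html><head>Header Text</head><body>Body Text</body></html>"));
    }
}
```

Because '<' doubles as the delimiter, any literal '<' in the text also splits a field, which is a limitation shared with the SQL version.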

Re: Converting HTML format text to Plain Text [message #436984 is a reply to message #436834]
Wed, 30 December 2009 03:20
JRowbottom
Messages: 5933 Registered: June 2006 Location: Sunny North Yorkshire, ho...
Senior Member
I thought about that, but HTML doesn't require every opening tag to have a closing tag: the closing tag of <P> is optional, and tags like <BR> have no closing tag at all. If you feed something like that into an XML parser, I'd expect it to be very unhappy.
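A quick way to see this is to hand such markup to a standard XML parser. The sketch below uses the JDK's own javax.xml.parsers API; the class and method names (XmlWellFormedCheck, isWellFormedXml) are hypothetical, and the sample strings are made up for the demonstration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch only: hypothetical class and method names.
final class XmlWellFormedCheck {
    // Returns true when the text parses as well-formed XML, false otherwise.
    static boolean isWellFormedXml(String text) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(text)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (Exception e) {
            throw new IllegalStateException(e); // parser setup or I/O problem
        }
    }

    public static void main(String[] args) {
        // Every tag closed: acceptable to an XML parser.
        System.out.println(isWellFormedXml("<html><body><p>one</p><p>two</p></body></html>"));
        // <P> left open, as HTML allows: rejected by an XML parser.
        System.out.println(isWellFormedXml("<html><body><P>one<P>two</body></html>"));
    }
}
```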

Re: Converting HTML format text to Plain Text [message #445835 is a reply to message #436984]
Thu, 04 March 2010 05:34
ramoradba
Messages: 2456 Registered: January 2009 Location: AndhraPradesh,Hyderabad,I...
Senior Member
Yes, and there is more to it.
See the cases below:
SQL> SELECT str_html('<br><p><tr>14>13</td></tr></table><p>') FROM dual;
STR_HTML('<BR><P><TR>14>13</TD></TR></TABLE><P>')
--------------------------------------------------------------------------------
1413
SQL> SELECT str_html('<br><p><tr>14<13</td></tr></table><p>') FROM dual;
STR_HTML('<BR><P><TR>14<13</TD></TR></TABLE><P>')
--------------------------------------------------------------------------------
14
SQL> SELECT str_html('<br><p><tr>12<14<13</td></tr></table><p>') FROM dual;
STR_HTML('<BR><P><TR>12<14<13</TD></TR></TABLE><P>')
--------------------------------------------------------------------------------
12
SQL> with src as (select '<html><head>Header Text</head><body>Body Text</body></html>' html from dual)
     ,fields as (select html
                       ,regexp_replace(regexp_replace(html
                                                     ,'<[^<>]+>'
                                                     ,'<')
                                      ,'<+'
                                      ,'<') data
                       ,length(regexp_replace(regexp_replace(regexp_replace(html
                                                                           ,'<[^<>]+>'
                                                                           ,'<')
                                                            ,'<+'
                                                            ,'<')
                                             ,'[^<]+'
                                             ,''))-1 num_fields
                 from src)
     select regexp_substr(data,'[^<]+',1,level) field
     from fields
     connect by level <= num_fields;
FIELD
-----------------------
Header Text
Body Text
SQL> with src as (select '<br><p><tr>12<14<13</td></tr></table><p>' html from dual)
     ,fields as (select html
                       ,regexp_replace(regexp_replace(html
                                                     ,'<[^<>]+>'
                                                     ,'<')
                                      ,'<+'
                                      ,'<') data
                       ,length(regexp_replace(regexp_replace(regexp_replace(html
                                                                           ,'<[^<>]+>'
                                                                           ,'<')
                                                            ,'<+'
                                                            ,'<')
                                             ,'[^<]+'
                                             ,''))-1 num_fields
                 from src)
     select regexp_substr(data,'[^<]+',1,level) field
     from fields
     connect by level <= num_fields;
FIELD
----------
12
14
13
SQL> with src as (select '<html><head></head><body>Welcome</body></html>' html from dual)
     select translate(regexp_substr(html,'>[^<>]+<'),' <>',' ') from src;
TRANSLA
-------
Welcome
SQL> with src as (select '<br><p><tr>12<14<13</td></tr></table><p>' html from dual)
     select translate(regexp_substr(html,'>[^<>]+<'),' <>',' ') from src;
TR
--
12
SQL>
But the actual output should be "12<14<13".
So the answers provided so far work only under some conditions.
Does anyone have a solution for this kind of situation?
Sriram
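One heuristic that keeps a bare "<" or ">" in the output is to strip only what is shaped like a tag: a "<", an optional "/", a letter, and then everything up to the next ">". The Java sketch below illustrates that idea. It is an assumption about what counts as a tag, not a real HTML parser, and the names (TagLikeStripper, stripTags) are hypothetical. It ignores comments, DOCTYPE declarations and entities, and it still misfires on text such as "a<b>c", which really does look like a tag.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: hypothetical class and method names.
final class TagLikeStripper {
    // Assumed tag shape: '<', optional '/', a letter, then anything up to '>'.
    private static final Pattern TAG = Pattern.compile("</?[A-Za-z][^<>]*>");

    // Removes everything that matches the assumed tag shape.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // prints 12<14<13
        System.out.println(stripTags("<br><p><tr>12<14<13</td></tr></table><p>"));
        // prints 14>13
        System.out.println(stripTags("<br><p><tr>14>13</td></tr></table><p>"));
    }
}
```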

Re: Converting HTML format text to Plain Text [message #446517 is a reply to message #445989]
Tue, 09 March 2010 00:22
ramoradba
Messages: 2456 Registered: January 2009 Location: AndhraPradesh,Hyderabad,I...
Senior Member
I even tried this:
SQL> CREATE OR REPLACE PACKAGE html
IS
  PROCEDURE convertToText ( html_in IN VARCHAR2, plain_text OUT VARCHAR2 )
  IS
  language java
  name 'html.convertToText( java.lang.String, java.lang.String[] )';

  FUNCTION to_text ( html_in IN VARCHAR2 )
  RETURN VARCHAR2
  IS
  language java
  name 'html.to_text( java.lang.String ) return java.lang.String';

END html;
/
Package created.
SQL> CREATE OR REPLACE AND COMPILE
JAVA SOURCE NAMED "html"
AS
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.html.HTMLEditorKit;
import java.io.*;

public class html extends Object
{
  public static void convertToText( java.lang.String p_in,
                                    java.lang.String[] p_out )
  throws IOException, BadLocationException
  {
    // test for null inputs to avoid java.lang.NullPointerException when input is null
    if ( p_in != null )
    {
      HTMLEditorKit kit = new HTMLEditorKit();
      Document doc = kit.createDefaultDocument();
      kit.read(new StringReader(p_in), doc, 0);
      p_out[0] = doc.getText(0, doc.getLength());
    }
    else p_out[0] = null;
  }

  public static String to_text(String p_in)
  throws IOException, BadLocationException
  {
    // test for null inputs to avoid java.lang.NullPointerException when input is null
    if (p_in != null)
    {
      HTMLEditorKit kit = new HTMLEditorKit();
      Document doc = kit.createDefaultDocument();
      kit.read(new StringReader(p_in), doc, 0);
      return doc.getText(0, doc.getLength());
    }
    else return null;
  }
}
/
Java created.
SQL> with data as (select '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<TITLE> New Document </TITLE>
<META NAME="Generator" CONTENT="EditPlus">
<META NAME="Author" CONTENT="">
<META NAME="Keywords" CONTENT="">
<META NAME="Description" CONTENT="">
</HEAD>
<BODY>
This is for testing that 11<12<13
</BODY>
</HTML>' page from dual)
select html.to_text( page ) AS plain_text
FROM data
/
PLAIN_TEXT
--------------------------------------------------
This is for testing that 111213
SQL>
But the output should be "This is for testing that 11<12<13", i.e. the function should return the content exactly as a browser would display it.
Any suggestions?
Sriram
[Updated on: Tue, 09 March 2010 00:24]

Re: Converting HTML format text to Plain Text [message #446521 is a reply to message #446517]
Tue, 09 March 2010 01:29
_jum
Messages: 577 Registered: February 2008
Senior Member
If you manage to build valid, well-formed XML, you could use the XML functions as @kevin advised:
WITH DATA AS
(SELECT XMLTYPE ('
<HTML>
<HEAD>
<TITLE> New Document </TITLE>
</HEAD>
<BODY>This is for testing that 11>12>13</BODY>
</HTML>') xml_data
FROM dual)
SELECT
EXTRACTVALUE (xml_data, '//BODY') htmltxt
FROM DATA;
htmltxt
------------------------------------------
This is for testing that 11>12>13
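The same lookup can be tried outside the database with the JDK's XML and XPath APIs. The sketch below is a rough Java analogue of the EXTRACTVALUE call, not what Oracle does internally; the names BodyTextViaXPath and bodyText are hypothetical. As in the SQL, it only works when the input is well-formed XML, so a literal ">" in the text is accepted but a literal "<" is not.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch only: hypothetical class and method names.
final class BodyTextViaXPath {
    // Returns the text of the first BODY element.
    // Throws IllegalArgumentException when the input is not well-formed XML.
    static String bodyText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate("//BODY", doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<HTML><HEAD><TITLE> New Document </TITLE></HEAD>"
                + "<BODY>This is for testing that 11>12>13</BODY></HTML>";
        // prints: This is for testing that 11>12>13
        System.out.println(bodyText(xml));
    }
}
```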
[Updated on: Tue, 09 March 2010 01:32]

Re: Converting HTML format text to Plain Text [message #446528 is a reply to message #446524]
Tue, 09 March 2010 02:40
_jum
Messages: 577 Registered: February 2008
Senior Member
The XMLType constructor has a parameter (the fourth one, wellformed) that lets you skip the well-formedness check, but EXTRACTVALUE has no such option, so I doubt this is of use to you:
WITH DATA AS
(SELECT XMLTYPE ('
<HTML>
<HEAD>
<TITLE> New Document </TITLE>
</HEAD>
<BODY>This is for testing that 11>12>13<14</BODY>
</HTML>',NULL,0,0) xml_data
FROM dual)
SELECT * FROM data
-->ORA-31011: ...
WITH DATA AS
(SELECT XMLTYPE ('
<HTML>
<HEAD>
<TITLE> New Document </TITLE>
</HEAD>
<BODY>This is for testing that 11>12>13<14</BODY>
</HTML>',NULL,0,1) xml_data
FROM dual)
SELECT * FROM data
--> no error
[Updated on: Tue, 09 March 2010 02:48]

Re: Converting HTML format text to Plain Text [message #446583 is a reply to message #446524]
Tue, 09 March 2010 07:02
Frank
Messages: 7901 Registered: March 2000
Senior Member
ramoradba wrote on Tue, 09 March 2010 09:01
<BODY>
< This is Sample HTML, To display 10 < 12 > 11
</BODY>
I cannot use XMLType, as we don't know which tags are involved in the user's input.

This is not HTML, this is gibberish.
What would this mean, according to you?
< This is Sample HTML, To display a < p > c
</BODY>
According to me (and a lot of browsers), the "< p >" bit means the start of a new paragraph.
The leading "<" makes it all even worse.

Re: Converting HTML format text to Plain Text [message #446678 is a reply to message #446583]
Tue, 09 March 2010 23:05
ramoradba
Messages: 2456 Registered: January 2009 Location: AndhraPradesh,Hyderabad,I...
Senior Member
Quote: This is not HTML, this is gibberish.
I know what HTML is. I don't know anything about this "gibberish"; maybe that is your language.
Quote: < This is Sample HTML, To display a < p > c
</BODY>
1) We had some issues with individual "<" and ">" characters. That's why we put some special characters in the body and tested against our requirement.
Quote: According to me (and a lot of browsers), the "< p >" bit means the start of a new paragraph.
The leading "<" makes it all even worse.
2) <p> starts a new paragraph; < p > does not, OK?
If the input contains < p >, then per our requirement it should be displayed as it is.
SQL> select replace(html.to_text(replace('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<TITLE> New Document </TITLE>
<META NAME="Generator" CONTENT="EditPlus">
<META NAME="Author" CONTENT="">
<META NAME="Keywords" CONTENT="">
<META NAME="Description" CONTENT="">
</HEAD>
<BODY>
< This is Sample HTML, To display 10 < 12 > 11
</BODY>
</HTML>','< ','~')),'~','< ') AS plain_text
FROM dual
/
PLAIN_TEXT
--------------------------------------------------
< This is Sample HTML, To display 10 < 12 > 11
SQL> select replace(html.to_text(replace('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<TITLE> New Document </TITLE>
<META NAME="Generator" CONTENT="EditPlus">
<META NAME="Author" CONTENT="">
<META NAME="Keywords" CONTENT="">
<META NAME="Description" CONTENT="">
</HEAD>
<BODY>
< < This is Sample HTML, To display a < p > c
</BODY>
</HTML>','< ','~')),'~','< ') AS plain_text
FROM dual
/
PLAIN_TEXT
--------------------------------------------------
< < This is Sample HTML, To display a < p > c
SQL> select replace(html.to_text(replace('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<TITLE> New Document </TITLE>
<META NAME="Generator" CONTENT="EditPlus">
<META NAME="Author" CONTENT="">
<META NAME="Keywords" CONTENT="">
<META NAME="Description" CONTENT="">
</HEAD>
<BODY>
< < This is Sample HTML, To display a <p> c
</BODY>
</HTML>','< ','~')),'~','< ') AS plain_text
FROM dual
/
PLAIN_TEXT
--------------------------------------------------
< < This is Sample HTML, To display a
c
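The protect-and-restore trick used in these queries (hide every "< " behind a placeholder, strip the tags, then put "< " back) can be sketched in plain Java as follows. Two things here are assumptions made for the illustration: naiveToText is a simple stand-in for html.to_text, not the HTMLEditorKit code, and the placeholder is a control character rather than '~', since '~' would be turned into "< " if the original text already contained one.

```java
// Illustrative sketch only: hypothetical class and method names.
final class ProtectBareLessThan {
    // Placeholder assumed never to occur in real input (a hypothetical choice).
    private static final String SENTINEL = "\u0001";

    // Stand-in for html.to_text: removes anything shaped like <...>.
    static String naiveToText(String html) {
        return html.replaceAll("<[^<>]*>", "");
    }

    // Hides "< " before stripping tags, then restores it afterwards.
    static String toTextKeepingSpacedLt(String html) {
        String hidden = html.replace("< ", SENTINEL);
        return naiveToText(hidden).replace(SENTINEL, "< ");
    }

    public static void main(String[] args) {
        String body = "<BODY>< This is Sample HTML, To display 10 < 12 > 11</BODY>";
        // Without protection, "< 12 >" is mistaken for a tag and removed.
        System.out.println(naiveToText(body));
        // With protection, the spaced "<" characters survive.
        System.out.println(toTextKeepingSpacedLt(body));
    }
}
```

The trick only protects a "<" that is followed by a space; a bare "<" directly followed by text, as in 11<12<13, is not covered by it.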
Criticising is not the answer to everything. No one is born a genius.
Is there anything in the forum guidelines saying that moderators can do anything they like (insulting, criticising, etc.)?
IMO, giving hints is perfect.
Thank you,
Sriram

Re: Converting HTML format text to Plain Text [message #446706 is a reply to message #446678]
Wed, 10 March 2010 01:10
Frank
Messages: 7901 Registered: March 2000
Senior Member
First of all, I am posting as a member, just like anyone else, so this "what can moderators do" stuff does not make sense.
Second, the topic title is "Converting HTML format text to Plain Text", so I am telling you that your input is not valid HTML. HTML has no notion of whitespace, for example, so in HTML, < p > and <p> are considered equal.
Third, if you can't stand being criticized, you will love this new thing called the Internet...
Fourth: please show me where I insulted you.