htmlparser-user Mailing List for HTML Parser (Page 37)
Brought to you by:
derrickoswald
|
From: Jay K. <jy...@eq...> - 2006-05-30 20:15:49
|
Let me describe in more detail the problem with using StringBean as a
NodeVisitor.
Here is my code snippet:
    private class TestVisitor extends StringBean {
        @Override
        public void visitStringNode(Text text) {
            System.out.println("text=" + text.getText());
        }
    }
    TestVisitor visitor = new TestVisitor();
    visitor.setCollapse(false);
    htmlParser.visitAllNodesWith(visitor);
If I feed it the sample HTML below, the visitStringNode() method does
not detect the second 'AAAAA' as one word; instead, it splits it into
two words ('AAA' and 'AA'), which is basically the same problem that I
described in the first email.
Please let me know.
Thanks,
Jay
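Jay's two requirements (count 'AAA<font>AA</font>' as one word, and report where each word starts in the source, as asked in the earlier mail quoted below) can be illustrated without the library. The following is only a minimal sketch using plain JDK string handling, not the StringBean or NodeVisitor approach; the class name WordOffsets and the method find are hypothetical. It assumes every tag is zero-width, which suits inline tags such as <font> but would wrongly join words across block-level tags, and it does not handle '>' inside attribute values, comments, or scripts. Offsets are relative to the string passed in, so they will not equal the 58/70/108 in the mail, which depend on the exact whitespace of the original file.

```java
import java.util.ArrayList;
import java.util.List;

public class WordOffsets {

    /** Returns the source offsets at which `word` starts, treating tags as zero-width. */
    static List<Integer> find(String html, String word) {
        StringBuilder text = new StringBuilder();        // characters outside of tags
        List<Integer> sourceOffset = new ArrayList<>();  // text index -> offset in html
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                text.append(c);
                sourceOffset.add(i);
            }
        }

        List<Integer> matches = new ArrayList<>();
        if (word.isEmpty()) {
            return matches;
        }
        int from = 0;
        int idx;
        while ((idx = text.indexOf(word, from)) >= 0) {
            int end = idx + word.length();
            // Accept only whole words: no letter or digit directly before or after.
            boolean startsWord = idx == 0 || !Character.isLetterOrDigit(text.charAt(idx - 1));
            boolean endsWord = end == text.length() || !Character.isLetterOrDigit(text.charAt(end));
            if (startsWord && endsWord) {
                matches.add(sourceOffset.get(idx));
            }
            from = idx + 1;
        }
        return matches;
    }

    public static void main(String[] args) {
        String html = "<p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>";
        System.out.println(find(html, "AAAAA")); // prints [3, 15, 52]
    }
}
```

The same index-mapping idea should carry over to a visitor-based version: append each text node to one buffer and remember the node's start position for every appended character.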
-----Original Message-----
From: htm...@li...
[mailto:htm...@li...] On Behalf Of Jay
Kim
Sent: Tuesday, May 30, 2006 10:45 AM
To: htm...@li...
Subject: [Htmlparser-user] RE: [Htmlparser-user] Finding a whole word
Derrick,
Thank you so much for your quick response, and for getting back to me
with the solution.
Now that I'm able to correctly count the number of words that appear in
an HTML file, my next task is to find the offset (start position) of
each word. I'm guessing that I probably have to use NodeVisitor with
StringBean, but I'd like to get some guidelines before I dig into the
APIs.
So, for the following sample HTML:
<HTML>
<head>
<title>Test HTML</title>
</head>
<body>
<p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
</body>
</HTML>
If I search for 'AAAAA', I want to get three matches with their starting
positions (offsets), such as:
Match 1 offset = 58
Match 2 offset = 70
Match 3 offset = 108
Could you show me how to achieve this?
Thanks a lot,
Jay
-----Original Message-----
From: htm...@li...
[mailto:htm...@li...] On Behalf Of
Derrick Oswald
Sent: Monday, May 29, 2006 4:45 AM
To: htm...@li...
Subject: Re: [Htmlparser-user] Finding a whole word
Jay
The text you want can be obtained with the StringBean if Collapse is
false.
When collapse is true, there is a bug in the StringBean.
I've logged this as bug #1496863 StringBean collapse() adds extra
whitespace
<http://sourceforge.net/tracker/index.php?func=detail&aid=1496863&group_id=24399&atid=381399>
so you can track it.
Derrick
Jay Kim wrote:
> Hi,
>
> I'm trying to get the word count using htmlparser, but it doesn't seem
> to be able to handle the following example.
>
> Let's say the source html looks like this:
>
> <HTML>
>
> <head>
>
> <title>Test HTML</title>
>
> </head>
>
> <body>
>
> <p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
>
> </body>
>
> </HTML>
>
> And, if you load it in a browser, you'll see the word 'AAAAA' three
> times.
>
> But, if you parse this html, it returns following nodes:
>
> AAAAA BBBBB AAA AA BBBBB AAAAA
>
> So, it breaks down the second 'AAAAA' into two words because of the
> font tag in the middle. And, the word count from the parsed text would
> be "2".
>
> Is there any way that I can get the same text/string/word that I see
> on the browser?
>
> Thanks,
>
> Jay
>
_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user
|
|
From: Jay K. <jy...@eq...> - 2006-05-30 17:45:49
|
Derrick,
Thank you so much for your quick response, and for getting back to me
with the solution.
Now that I'm able to correctly count the number of words that appear in
an HTML file, my next task is to find the offset (start position) of
each word. I'm guessing that I probably have to use NodeVisitor with
StringBean, but I'd like to get some guidelines before I dig into the
APIs.
So, for the following sample HTML:
<HTML>
<head>
<title>Test HTML</title>
</head>
<body>
<p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
</body>
</HTML>
If I search for 'AAAAA', I want to get three matches with their starting
positions (offsets), such as:
Match 1 offset = 58
Match 2 offset = 70
Match 3 offset = 108
Could you show me how to achieve this?
Thanks a lot,
Jay
-----Original Message-----
From: htm...@li... [mailto:htm...@li...] On Behalf Of Derrick Oswald
Sent: Monday, May 29, 2006 4:45 AM
To: htm...@li...
Subject: Re: [Htmlparser-user] Finding a whole word
Jay
The text you want can be obtained with the StringBean if Collapse is
false.
When collapse is true, there is a bug in the StringBean.
I've logged this as bug #1496863 StringBean collapse() adds extra
whitespace
<http://sourceforge.net/tracker/index.php?func=detail&aid=1496863&group_id=24399&atid=381399>
so you can track it.
Derrick
Jay Kim wrote:
> Hi,
>
> I'm trying to get the word count using htmlparser, but it doesn't seem
> to be able to handle the following example.
>
> Let's say the source html looks like this:
>
> <HTML>
> <head>
> <title>Test HTML</title>
> </head>
> <body>
> <p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
> </body>
> </HTML>
>
> And, if you load it in a browser, you'll see the word 'AAAAA' three
> times.
>
> But, if you parse this html, it returns following nodes:
>
> AAAAA BBBBB AAA AA BBBBB AAAAA
>
> So, it breaks down the second 'AAAAA' into two words because of the
> font tag in the middle. And, the word count from the parsed text would
> be "2".
>
> Is there any way that I can get the same text/string/word that I see
> on the browser?
>
> Thanks,
>
> Jay
|
|
From: Derrick O. <Der...@Ro...> - 2006-05-30 11:04:51
|
Nilius,
I'm surprised it works as you've coded it.
I would have thought you would need to perform
parser.parse (null);
before the getEncoding ();
Otherwise it would still be set to the default encoding.
Derrick
Nilius Fabian wrote:
>This is more a feature request than a question, since our solution
>seems to work.
>
>We are parsing existing html files, directly from the file system.
>It seems to be quite complicated to handle charset/unicode issues
>correctly.
>
>One of the basic problems is that Parser.createParser doesn't take a
>byte[] as argument, but a String.
>
>To transfer a File (which is basically a byte[]) to a String I do need
>to know the charset/encoding. To know this, I would like to use
>parser.getEncoding(), to read the meta tags (Content-Type). So, in the
>sample code attached, we are reading the html file twice: once with
>plain ascii encoding (which should be OK for the HTML HEAD), once with
>the encoding then provided by the HTML parser.
>
>It would be great if the html parser could handle byte[] and would sort
>out the encoding stuff itself (some guessing might also be done, e.g.
>handle BOMs (byte order marks)).
>
>Thanks
>
>Fabian
>
>
>
> String readHtmlFile (String fileName) throws IOException, UnsupportedEncodingException
> {
>   String source;
>   String result;
>
>   try
>   {
>     source = _readFile (fileName, null);
>   }
>   catch (UnsupportedEncodingException e)
>   {
>     throw new RuntimeException ("Programming error: Default encoding unsupported?", e);
>   }
>
>   Parser parser = Parser.createParser (source, null);
>
>   String sourceCodepage;
>
>   String encoding = parser.getEncoding ();
>
>   try
>   {
>     sourceCodepage = _readFile (fileName, encoding);
>     result = sourceCodepage;
>   }
>   catch (UnsupportedEncodingException e)
>   {
>     System.err.println ("Unsupported HTML encoding \"" + encoding
>       + "\", using default.");
>     result = source;
>   }
>
>   return result;
> }
>
>
> String _readFile (String fileName, String codepage)
>   throws FileNotFoundException, UnsupportedEncodingException, IOException
> {
>   File file = new File (fileName);
>   long length = file.length ();
>   char[] buffer = new char[(int) length];
>
>   FileInputStream fileInputStream = new FileInputStream (file);
>
>   InputStreamReader inputStreamReader;
>   if (codepage == null)
>   {
>     inputStreamReader = new InputStreamReader (fileInputStream);
>   }
>   else
>   {
>     inputStreamReader = new InputStreamReader (fileInputStream, codepage);
>   }
>
>   BufferedReader bufferedReader = new BufferedReader (inputStreamReader);
>
>   int noCharRead = bufferedReader.read (buffer, 0 /* offset */, (int) length);
>   return new String (buffer, 0, noCharRead);
> }
|
|
From: Nilius F. <Fab...@co...> - 2006-05-30 09:39:39
|
This is more a feature request than a question, since our solution
seems to work.
We are parsing existing html files, directly from the file system.
It seems to be quite complicated to handle charset/unicode issues
correctly.
One of the basic problems is that Parser.createParser doesn't take a
byte[] as argument, but a String.
To transfer a File (which is basically a byte[]) to a String I do need
to know the charset/encoding. To know this, I would like to use
parser.getEncoding(), to read the meta tags (Content-Type). So, in the
sample code attached, we are reading the html file twice: once with
plain ascii encoding (which should be OK for the HTML HEAD), once with
the encoding then provided by the HTML parser.
It would be great if the html parser could handle byte[] and would sort
out the encoding stuff itself (some guessing might also be done, e.g.
handle BOMs (byte order marks)).
Thanks
Fabian
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import org.htmlparser.Parser;

String readHtmlFile (String fileName) throws IOException, UnsupportedEncodingException
{
  String source;
  String result;

  // First pass: read with the platform default encoding.
  try
  {
    source = _readFile (fileName, null);
  }
  catch (UnsupportedEncodingException e)
  {
    throw new RuntimeException ("Programming error: Default encoding unsupported?", e);
  }

  Parser parser = Parser.createParser (source, null);
  String encoding = parser.getEncoding ();

  // Second pass: re-read the file with the encoding reported by the parser.
  try
  {
    result = _readFile (fileName, encoding);
  }
  catch (UnsupportedEncodingException e)
  {
    System.err.println ("Unsupported HTML encoding \"" + encoding
      + "\", using default.");
    result = source;
  }

  return result;
}

String _readFile (String fileName, String codepage)
  throws FileNotFoundException, UnsupportedEncodingException, IOException
{
  File file = new File (fileName);
  long length = file.length ();
  char[] buffer = new char[(int) length];

  FileInputStream fileInputStream = new FileInputStream (file);
  InputStreamReader inputStreamReader;
  if (codepage == null)
  {
    inputStreamReader = new InputStreamReader (fileInputStream);
  }
  else
  {
    inputStreamReader = new InputStreamReader (fileInputStream, codepage);
  }

  BufferedReader bufferedReader = new BufferedReader (inputStreamReader);
  try
  {
    int noCharRead = bufferedReader.read (buffer, 0 /* offset */, (int) length);
    // read() returns -1 at end of stream; treat that as an empty result.
    return new String (buffer, 0, Math.max (noCharRead, 0));
  }
  finally
  {
    bufferedReader.close ();
  }
}
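The byte[]-based detection Fabian asks for can be sketched with the JDK alone. This is not how HTMLParser handles encodings; it is a hedged illustration of the idea in the mail: look for a byte order mark first, then decode the start of the file as ISO-8859-1 (which maps every byte to a character, so it cannot fail) and search for a charset in a meta tag. The class name HtmlCharsetSniffer, the 4096-byte window, and the regular expression are all assumptions made for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlCharsetSniffer {

    // Matches charset=... inside a <meta> tag, e.g. content="text/html; charset=windows-1254".
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Guesses the charset from a BOM or a meta tag, else returns the fallback. */
    static Charset detect(byte[] bytes, Charset fallback) {
        // 1. Byte order marks.
        if (startsWith(bytes, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(bytes, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(bytes, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;

        // 2. Meta tag in the first few kilobytes, decoded byte-for-byte.
        int headLength = Math.min(bytes.length, 4096);
        String head = new String(bytes, 0, headLength, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: fall through to the fallback.
            }
        }
        return fallback;
    }

    /** Decodes the bytes with the detected charset and drops a leading BOM character. */
    static String decode(byte[] bytes, Charset fallback) {
        String text = new String(bytes, detect(bytes, fallback));
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    private static boolean startsWith(byte[] bytes, int... prefix) {
        if (bytes.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((bytes[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] page = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=windows-1252\"></head>"
                + "<body>caf\u00e9</body></html>").getBytes(cp1252);
        System.out.println(detect(page, StandardCharsets.UTF_8));
        System.out.println(decode(page, StandardCharsets.UTF_8));
    }
}
```

With this shape the file is read from disk only once; the decoded String could then be handed to Parser.createParser as in the code above.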
|
|
From: Derrick O. <Der...@Ro...> - 2006-05-29 11:45:06
|
Jay
The text you want can be obtained with the StringBean if Collapse is
false.
When collapse is true, there is a bug in the StringBean.
I've logged this as bug #1496863 StringBean collapse() adds extra
whitespace
<http://sourceforge.net/tracker/index.php?func=detail&aid=1496863&group_id=24399&atid=381399>
so you can track it.
Derrick
Jay Kim wrote:
> Hi,
>
> I'm trying to get the word count using htmlparser, but it doesn't seem
> to be able to handle the following example.
>
> Let's say the source html looks like this:
>
> <HTML>
> <head>
> <title>Test HTML</title>
> </head>
> <body>
> <p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
> </body>
> </HTML>
>
> And, if you load it in a browser, you'll see the word 'AAAAA' three
> times.
>
> But, if you parse this html, it returns following nodes:
>
> AAAAA BBBBB AAA AA BBBBB AAAAA
>
> So, it breaks down the second 'AAAAA' into two words because of the
> font tag in the middle. And, the word count from the parsed text would
> be "2".
>
> Is there any way that I can get the same text/string/word that I see
> on the browser?
>
> Thanks,
>
> Jay
|
|
From: Jay K. <jy...@eq...> - 2006-05-28 19:10:50
|
Hi,
I'm trying to get the word count using htmlparser, but it doesn't seem
to be able to handle the following example.
Let's say the source html looks like this:
<HTML>
<head>
<title>Test HTML</title>
</head>
<body>
<p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
</body>
</HTML>
And, if you load it in a browser, you'll see the word 'AAAAA' three
times.
But, if you parse this html, it returns following nodes:
AAAAA BBBBB AAA AA BBBBB AAAAA
So, it breaks down the second 'AAAAA' into two words because of the font
tag in the middle. And, the word count from the parsed text would be
"2".
Is there any way that I can get the same text/string/word that I see on
the browser?
Thanks,
Jay
|
|
From: Derrick O. <Der...@Ro...> - 2006-05-22 12:26:37
|
I would subclass StringFilter or RegexFilter and override its accept
method to something like:
    public boolean accept (Node node)
    {
        String string;
        boolean ret;
        ret = false;
        if (node instanceof Remark)
        {
            string = ((Remark)node).getText ();
            if (!getCaseSensitive ())
                string = string.toUpperCase (getLocale ());
            ret = (-1 != string.indexOf (mUpperPattern));
        }
        return (ret);
    }
Then you could get the two remarks (your example indicates they have the
same text) with:
    NodeList list = parser.extractAllNodesThatMatch (new RemarkFilter ("-- this is a comment --"));
From the two Remark nodes returned you should be able to navigate to the
text between them with something like:
    Remark first = (Remark)(list.elementAt (0));
    NodeList siblings = first.getParent ().getChildren ();
    for (int i = 0; i < siblings.size (); i++)
        if (siblings.elementAt (i) == first)
            mytext = siblings.elementAt (i + 1).toHtml ();
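For comparison only, the same "text between two identical comments" result can be sketched with a plain JDK regular expression instead of the filter and sibling navigation described above. This swaps in a different technique: it does no real HTML parsing and assumes both comments have exactly the same text and are not nested inside scripts or attribute values. The class name BetweenComments and the method between are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BetweenComments {

    /**
     * Returns the trimmed text between the first two comments whose body is
     * exactly commentText, or null if no such pair exists.
     */
    static String between(String html, String commentText) {
        // Quote the marker so characters in the comment text are matched literally.
        String marker = Pattern.quote("<!--" + commentText + "-->");
        Matcher m = Pattern.compile(marker + "(.*?)" + marker, Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<!-- this is a comment -->\n  aaaaaaaaa\n<!-- this is a comment -->";
        System.out.println(between(html, " this is a comment ")); // prints aaaaaaaaa
    }
}
```

The node-based approach is the better fit when the content between the comments contains markup that should stay a parsed node rather than a raw string.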
Jiang Zhiguo wrote:
> Hi,ALL
> if I want to get the content in comment tag;
> like this:
> <!-- this is a comment -->
> * aaaaaaaaa*
> <!-- this is a comment -->
> I want to get the *"aaaaaaaaa"*
> how to do it?
> thanks!
|
|
From: Jiang Z. <jia...@ip...> - 2006-05-22 04:06:28
|
Hi,ALL
    if I want to get the content in comment tag;
    like this:
     <!-- this is a comment -->
      aaaaaaaaa
     <!-- this is a comment -->
     I want to get the "aaaaaaaaa"
     how to do it?
     thanks!
|
|
From: Smarty <zm...@gm...> - 2006-05-21 16:14:21
|
Thank you very much. The toHtml method worked great.
On 5/19/06, Derrick Oswald <Der...@ro...> wrote:
>
> If you want to replace all tags with plain text use the StringBean class
> to return the text contents of a page.
>
> If it's just the LinkTag nodes you want to replace, it's a bit more
> complicated.
>
> Perhaps the easiest way would be to create your own derived LinkTag
> class and override its toHtml() method to just return the toHtml() or
> toPlainTextString() of all its children, without the enclosing <A> and
> </A>. You would then register an instance of this custom tag with a
> PrototypicalNodeFactory and pass the factory to the parser via
> setNodeFactory(). Then when you print the toHtml() of the NodeList
> returned from the parser, the overridden method is called and your tag
> gets to do its thing in the midst of the page.
>
> Smarty wrote:
>
> > Hi everyone.
> >
> > Could you please tell me the easiest way to transform all LinkTag
> > nodes from a NodeList to Text nodes that contain the text of the
> > link's children?
> >
> > From an html standpoint this should convert
> >
> > This is a <a href=''><b>link</b></a>...
> >
> > to
> >
> > This is a link...
> >
> > or
> >
> > This is a <a href=''><b>link</b></a>...
> >
> > to
> >
> > This is a <b>link</b>...
> >
> > (both ways are ok for me)
> >
> > Thanks for your time,
> > Ovidiu Dan
--
Ovidiu Dan, Technical Staff, Playfuls.com
|
|
From: Derrick O. <Der...@Ro...> - 2006-05-19 12:01:13
|
If you want to replace all tags with plain text use the StringBean class
to return the text contents of a page.
If it's just the LinkTag nodes you want to replace, it's a bit more
complicated.
Perhaps the easiest way would be to create your own derived LinkTag
class and override its toHtml() method to just return the toHtml() or
toPlainTextString() of all its children, without the enclosing <A> and
</A>. You would then register an instance of this custom tag with a
PrototypicalNodeFactory and pass the factory to the parser via
setNodeFactory(). Then when you print the toHtml() of the NodeList
returned from the parser, the overridden method is called and your tag
gets to do its thing in the midst of the page.
Smarty wrote:
> Hi everyone.
>
> Could you please tell me the easiest way to transform all LinkTag
> nodes from a NodeList to Text nodes that contain the text of the
> link's children?
>
> From an html standpoint this should convert
>
> This is a <a href=''><b>link</b></a>...
>
> to
>
> This is a link...
>
> or
>
> This is a <a href=''><b>link</b></a>...
>
> to
>
> This is a <b>link</b>...
>
> (both ways are ok for me)
>
> Thanks for your time,
> Ovidiu Dan
|
|
From: Smarty <zm...@gm...> - 2006-05-17 15:19:12
|
Hi everyone.
Could you please tell me the easiest way to transform all LinkTag nodes
from a NodeList to Text nodes that contain the text of the link's
children?
From an html standpoint this should convert
This is a <a href=''><b>link</b></a>...
to
This is a link...
or
This is a <a href=''><b>link</b></a>...
to
This is a <b>link</b>...
(both ways are ok for me)
Thanks for your time,
Ovidiu Dan
|
|
From: Tiago F. <tia...@gm...> - 2006-05-17 14:02:46
|
Thanks, it's working.
Here is the complete code:
    StringBean sb = new StringBean();
    Parser parser = new Parser();
    parser.setInputHTML(origem); // String with html
    parser.visitAllNodesWith (sb);
    sb.setLinks (false);
    String s = sb.getStrings();
On 5/17/06, Ian Macfarlane <ian...@gm...> wrote:
> You want this:
>
> Parser parser = new Parser();
> parser.setInputHTML(source);
> NodeList nodes = parser.parse(null);
>
> (catching the ParserException it will throw)
>
> Ian
>
> On 5/17/06, Derrick Oswald <der...@ro...> wrote:
> > I believe you can get the Parser the StringExtractor is using and then use
> > setInputHTML() to make your string the source.
> >
> > Tiago Fischer <tia...@gm...> wrote:
> >
> > Hi all!
> >
> > I have a html document in a string, i want to strip html tags and so on...
> > But, the StringExtractor only accept files and links.
> >
> > var "origem" is a string with a html document:
> > try {
> > StringExtractor temp = new StringExtractor(origem);
> > origem = temp.extractStrings(false);
> > } catch (org.htmlparser.util.ParserException e) {
> > new Log("ParserException ("+e.getMessage()+")");
> > origem = e.getMessage();
> > }
> >
> > Any idea?
> > Tks!
> > []s
> > FlycKER
|
|
From: Ian M. <ian...@gm...> - 2006-05-17 09:13:23
|
You want this:
    Parser parser = new Parser();
    parser.setInputHTML(source);
    NodeList nodes = parser.parse(null);
(catching the ParserException it will throw)
Ian
On 5/17/06, Derrick Oswald <der...@ro...> wrote:
> I believe you can get the Parser the StringExtractor is using and then use
> setInputHTML() to make your string the source.
>
> Tiago Fischer <tia...@gm...> wrote:
>
> Hi all!
>
> I have a html document in a string, i want to strip html tags and so on...
> But, the StringExtractor only accept files and links.
>
> var "origem" is a string with a html document:
> try {
> StringExtractor temp = new StringExtractor(origem);
> origem = temp.extractStrings(false);
> } catch (org.htmlparser.util.ParserException e) {
> new Log("ParserException ("+e.getMessage()+")");
> origem = e.getMessage();
> }
>
> Any idea?
> Tks!
> []s
> FlycKER
|
|
From: Derrick O. <der...@ro...> - 2006-05-16 23:31:05
|
I believe you can get the Parser the StringExtractor is using and then use setInputHTML() to make your string the source.
Tiago Fischer <tia...@gm...> wrote:
Hi all!
I have a html document in a string, i want to strip html tags and so on...
But, the StringExtractor only accept files and links.
var "origem" is a string with a html document:
try {
StringExtractor temp = new StringExtractor(origem);
origem = temp.extractStrings(false);
} catch (org.htmlparser.util.ParserException e) {
new Log("ParserException ("+e.getMessage()+")");
origem = e.getMessage();
}
Any idea?
Tks!
[]s
FlycKER
|
|
From: Tiago F. <tia...@gm...> - 2006-05-16 20:21:30
|
Hi all!
I have a html document in a string, i want to strip html tags and so on...
But, the StringExtractor only accept files and links.
var "origem" is a string with a html document:
try {
    StringExtractor temp = new StringExtractor(origem);
    origem = temp.extractStrings(false);
} catch (org.htmlparser.util.ParserException e) {
    new Log("ParserException ("+e.getMessage()+")");
    origem = e.getMessage();
}
Any idea?
Tks!
[]s
FlycKER
|
|
From: Ian M. <ian...@gm...> - 2006-05-15 09:35:28
|
Dear Victor,
The API for HTMLParser is found at
http://htmlparser.sourceforge.net/javadoc/ and includes some examples
(though they are a bit old).
For stop word deletion, you'll have to do that separately with your own
list of stop words.
Ian
On 5/12/06, Victor Egea Hernando <pul...@ya...> wrote:
> Hello all Java htmlparser users and developers,
>
> I want to get the individual words from the string extractor output and
> delete stop words (to, for, the, etc.) for a search engine, and also to
> get the individual links.
>
> Can you help me find the source code or the API functions to do this?
>
> Thank you.
> Sorry for my poor English.
> __________________________________________________
> Regards Something in the WWW.
|
|
From: Victor E. H. <pul...@ya...> - 2006-05-12 14:53:10
|
Hello all Java htmlparser users and developers,
I want to get the individual words from the string extractor output and delete stop words
(to, for, the, etc.) for a search engine, and also to get the individual links.
Can you help me find the source code or the API functions to do this?
Thank you.
Sorry for my poor English.
__________________________________________________
Regards Something in the WWW.
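The stop-word step, which Ian's reply above suggests handling outside the parser, might look roughly like the following once the page text has been extracted to a String. This is a sketch with a hypothetical class name (StopWordFilter) and a deliberately tiny, made-up stop list; a real search engine would use a fuller list for its language.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordFilter {

    // Hypothetical, deliberately small stop list for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "for", "of", "the", "to"));

    /** Splits text into lower-case words and drops the stop words. */
    static List<String> keywords(String text) {
        List<String> result = new ArrayList<>();
        // Split on any run of characters that are neither letters nor digits.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(keywords("The parser is used to extract text for the search engine"));
        // prints [parser, is, used, extract, text, search, engine]
    }
}
```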
|
|
From: Derrick O. <Der...@Ro...> - 2006-05-09 03:17:45
|
Randy,
These are available in CVS under /htmlparser/docs/wiki/index.php
Beware: Some of them are pretty old and misleading.
Derrick
Randy Paries wrote:
> hello,
> there were some samples at
> http://htmlparser.sourceforge.net/wiki/index.php/SamplePrograms
> that i used a lot.
>
> I need to create my own TagNode
>
> looking for a simple example on how to create one
>
> thanks
> Randy
|
|
From: Randy P. <rtp...@gm...> - 2006-05-09 02:59:20
|
hello,
there were some samples at
http://htmlparser.sourceforge.net/wiki/index.php/SamplePrograms
that i used a lot.
I need to create my own TagNode
looking for a simple example on how to create one
thanks
Randy
|
|
From: Basar O. K. <bas...@gm...> - 2006-05-08 07:02:52
|
Hi,
I am new to htmlparser. I need to read data from a table structure. Here is a small piece of HTML source as an example.
<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1254"></head><body>
<TABLE>
<TR>
<TD align=left width=81> <FONT size=2>MyData1 </FONT></TD>
<TD align=left width=249> <FONT size=2>MyData2 </FONT></TD>
<TD align=right > <FONT size=2>MyData3 </FONT></TD>
<TD align=right width=250><FONT size=2>MyData4 </FONT></TD>
<TD align=right width=250><FONT size=2>MyData5 </FONT></TD>
<TD align=right width=250><FONT size=2>MyData6 </FONT></TD>
</TR>
</TABLE>
<BR>MyData7
</body></html>
Here, I need to read the MyData values. Can you send me a small piece of Java code to parse something like that?
--
Basar Ozgur Kahraman
|
|
From: Subramanya S. <sa...@cs...> - 2006-05-08 03:21:58
|
Riaz,
For now, check
http://cvs.sourceforge.net/viewcvs.py/newsrack/newsrack/WEB-INF/classes/news_rack/archiver/HTMLFilter.java?view=markup
This is a CVS version of code written for precisely this task; it does a lot of what you want. A couple of samples of the output are at:
http://floss.sarai.net/newsrack/DisplayNewsItem.do?ni=5.5.2006%2Frediff.business%2Fni9.05tata2.htm
http://floss.sarai.net/newsrack/DisplayNewsItem.do?ni=2.5.2006%2Fsify.finance%2Fni3.fullstory.php_id%3D14195512
I originally wrote this code using the built-in JDK Swing parser, but someone else working on this project (newsrack) helped me migrate it to HTMLParser. I will be checking in a newer version of this code in a couple of days. If you plan to use this code, please credit 'Subramanya Sastry' and 'Jaikishan Jalan'. At this time, the code for the entire project is being released under the GPL. In future, other licences (Apache) will be incorporated. I would also appreciate any improvements you make to the code.
Thanks, Subbu.
> Riaz,
>
> You will probably need to use a filter to pick out the content you want.
> Run the FilterBuilder tool (bin/filterbuilder) and create a filter that
> gets the content you want.
> It has a little help and a tutorial to get you going.
> Then use the filter code generated by the tool and pass it to a
> FilterBean, which has a convenience method, called getText() I think,
> that will apply a StringBean to the results of the filter.
>
> Derrick
>
> Riaz uddin wrote:
>
> > Hi,
> >
> > I have this code snippet from htmlparser.sourceforge.net for StringBean:
> >
> > StringBean sb = new StringBean ();
> > sb.setLinks (false);
> > sb.setReplaceNonBreakingSpaces (true);
> > sb.setCollapse (true);
> > sb.setURL ("http://news.yahoo.com/s/ap/20060507/ap_on_re_mi_ea/iraq;_ylt=AoeY5mkiWMfGQ8KbE6W5xxas0NUE;_ylu=X3oDMTA2Z2szazkxBHNlYwN0bQ--"); // the HTTP is performed here
> > String s = sb.getStrings ();
> >
> > How can I get rid of other text and get only the news content from
> > this URL?
> > The unwanted text (links) such as 'Home', 'U.S.', etc. appears in
> > the output.
>
> -------------------------------------------------------
> Using Tomcat but need to do more? Need to support web services, security?
> Get stuff done quickly with pre-integrated technology to make your job easier
> Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
> http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
> _______________________________________________
> Htmlparser-user mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
|
|
From: Derrick O. <Der...@Ro...> - 2006-05-08 00:03:48
|
Riaz,
You will probably need to use a filter to pick out the content you want.
Run the FilterBuilder tool (bin/filterbuilder) and create a filter that
gets the content you want.
It has a little help and a tutorial to get you going.
Then use the filter code generated by the tool and pass it to a
FilterBean, which has a convenience method, called getText() I think,
that will apply a StringBean to the results of the filter.
Derrick
Riaz uddin wrote:
> Hi,
>
> I have this code snippet from htmlparser.sourceforge.net for StringBean:
>
> StringBean sb = new StringBean ();
> sb.setLinks (false);
> sb.setReplaceNonBreakingSpaces (true);
> sb.setCollapse (true);
> sb.setURL ("http://news.yahoo.com/s/ap/20060507/ap_on_re_mi_ea/iraq;_ylt=AoeY5mkiWMfGQ8KbE6W5xxas0NUE;_ylu=X3oDMTA2Z2szazkxBHNlYwN0bQ--"); // the HTTP is performed here
> String s = sb.getStrings ();
>
> How can I get rid of other text and get only the news content from
> this URL?
> The unwanted text (links) such as 'Home', 'U.S.', etc. appears in
> the output.
|
|
From: Derrick O. <Der...@Ro...> - 2006-05-07 23:59:56
|
Sue,
The FormTag has a couple of methods to get INPUT and TEXTAREA tags. You
should use something like those:
    /**
     * Get the list of input fields.
     * @return Input elements in the form.
     */
    public NodeList getFormInputs()
    {
        return (searchFor (InputTag.class, true));
    }
but search for SelectTag.class and call the method getSelects(). If you
want a specific named SELECT, then you should use something like the
equivalent methods for INPUT and TEXTAREA:
    /**
     * Get the input tag in the form corresponding to the given name
     * @param name The name of the input tag to be retrieved
     * @return Tag The input tag corresponding to the name provided
     */
    public InputTag getInputTag (String name)
    {
        InputTag inputTag;
        boolean found;
        String inputTagName;

        inputTag = null;
        found = false;
        for (SimpleNodeIterator e = getFormInputs().elements();
                e.hasMoreNodes() && !found;)
        {
            inputTag = (InputTag)e.nextNode();
            inputTagName = inputTag.getAttribute("NAME");
            if (inputTagName!=null && inputTagName.equalsIgnoreCase(name))
                found=true;
        }
        if (found)
            return (inputTag);
        else
            return (null);
    }
but use your new getSelects() method.
Derrick
sue asdic wrote:
> Hi
>
> I want to extract the select tag from the form. How do I do it?
>
> For example:
>
> The page looks like this, and I want to find out that form1 contains one
> select tag and form2 contains one select tag too!
>
> <html>
>
> <form name = form1>
> <input type=text name="usrname">
> <select name="sex">
> <option value="female" selected>female</option>
> <option value="male">male</option>
> </select>
> </form>
>
> <form name = form2>
> <input type=text name="email">
> <select name="lang">
> <option value="english" selected>english</option>
> <option value="french">french</option>
> </select>
> </form>
>
> </html>
>
> Regards
> sue
|
|
From: Riaz u. <ru...@ya...> - 2006-05-07 18:59:21
|
Hi,
I have this code snippet from htmlparser.sourceforge.net for StringBean:
StringBean sb = new StringBean ();
sb.setLinks (false);
sb.setReplaceNonBreakingSpaces (true);
sb.setCollapse (true);
sb.setURL ("http://news.yahoo.com/s/ap/20060507/ap_on_re_mi_ea/iraq;_ylt=AoeY5mkiWMfGQ8KbE6W5xxas0NUE;_ylu=X3oDMTA2Z2szazkxBHNlYwN0bQ--"); // the HTTP is performed here
String s = sb.getStrings ();
How can I get rid of the other text and get only the news content from this URL?
The unwanted text (links) such as 'Home', 'U.S.', etc. appears in the output.
|