htmlparser-user Mailing List for HTML Parser (Page 37)

Let me describe in more detail the problem with using StringBean as a
NodeVisitor.
Here is my code snippet:

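	// StringBean is itself a NodeVisitor, so it can be passed to the parser.
	private class TestVisitor extends StringBean {
		@Override
		public void visitStringNode(Text text) {
			// Called once per text node, so text split by a tag arrives in pieces.
			System.out.println("text=" + text.getText());
		}
	}

	TestVisitor visitor = new TestVisitor();
	visitor.setCollapse(false);
	htmlParser.visitAllNodesWith(visitor);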

If I feed it the sample HTML below, the visitStringNode() method does
not see the second 'AAAAA' as one word. Instead, it receives it as two
pieces ('AAA' and 'AA'), which is basically the same problem that I
described in my first email.
Please let me know.
Thanks,

Jay

-----Original Message-----
From: htm...@li... [mailto:htm...@li...] On Behalf Of Jay Kim
Sent: Tuesday, May 30, 2006 10:45 AM
To: htm...@li...
Subject: [Htmlparser-user] RE: [Htmlparser-user] Finding a whole word

Derrick,

Thank you so much for your quick response and for getting back to me
with the solution.
Now that I'm able to correctly count the words that appear in an HTML
file, my next task is to find the offset (start position) of each word.
I'm guessing that I probably have to use NodeVisitor with StringBean,
but I'd like to get some guidelines before I dig into the APIs.
So, for the following sample HTML:

<HTML>
<head>
<title>Test HTML</title>
</head>
<body>
<p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
</body>
</HTML>

If I search for 'AAAAA', I want to get three matches with their starting
positions (offsets), such as:
	Match 1 offset = 58
	Match 2 offset = 70
	Match 3 offset = 108

Could you show me how to achieve this?
Thanks a lot,

Jay
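
A minimal sketch of one way to get such offsets, written in plain Java
rather than against the htmlparser API. The class and method names
(WordOffsets, findOffsets) are hypothetical. It assumes simple markup with
no comments, scripts, or character entities: it copies the characters that
lie outside tags into a buffer, remembers where each one came from in the
source, and maps whole-word matches back to source positions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of htmlparser.
class WordOffsets {
    /** Returns 0-based source offsets at which `word` starts as a whole word in the tag-free text. */
    static List<Integer> findOffsets(String html, String word) {
        StringBuilder text = new StringBuilder();
        // sourceIndex.get(i) is the position in `html` of text.charAt(i).
        List<Integer> sourceIndex = new ArrayList<>();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;                 // start skipping a tag
            } else if (c == '>' && inTag) {
                inTag = false;                // tag ends; nothing is inserted in its place
            } else if (!inTag) {
                text.append(c);
                sourceIndex.add(i);
            }
        }
        List<Integer> offsets = new ArrayList<>();
        Matcher m = Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(text);
        while (m.find()) {
            offsets.add(sourceIndex.get(m.start()));
        }
        return offsets;
    }
}
```

Because tags are dropped without inserting whitespace, 'AAA' and 'AA' on
either side of the font tag join into one word, and the reported offset is
that of the first 'A'. The offsets are 0-based and depend on the line
endings of the input, so they need not equal the figures quoted above.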

-----Original Message-----
From: htm...@li... [mailto:htm...@li...] On Behalf Of Derrick Oswald
Sent: Monday, May 29, 2006 4:45 AM
To: htm...@li...
Subject: Re: [Htmlparser-user] Finding a whole word

Jay
The text you want can be obtained with the StringBean if Collapse is
false.

When collapse is true, there is a bug in the StringBean.
I've logged this as bug #1496863 StringBean collapse() adds extra
whitespace
<http://sourceforge.net/tracker/index.php?func=detail&aid=1496863&group_id=24399&atid=381399>
so you can track it.
Derrick

Jay Kim wrote:

> Hi,
>
> I'm trying to get the word count using htmlparser, but it doesn't seem
> to be able to handle the following example.
>
> Let's say the source html looks like this:
>
> <HTML>
>
> <head>
>
> <title>Test HTML</title>
>
> </head>
>
> <body>
>
> <p>AAAAA BBBBB AAA<font color='red'>AA</font> BBBBB AAAAA</p>
>
> </body>
>
> </HTML>
>
> And, if you load it in a browser, you'll see the word 'AAAAA' three
> times.
>
> But, if you parse this html, it returns the following nodes:
>
> AAAAA BBBBB AAA AA BBBBB AAAAA
>
> So, it breaks the second 'AAAAA' into two words because of the
> font tag in the middle. And, the count of 'AAAAA' in the parsed text
> would be "2".
>
> Is there any way that I can get the same text/string/word that I see
> on the browser?
>
> Thanks,
>
> Jay
>
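
The difference described here, an inline tag that does not break a word
versus a block tag that does, can be sketched in plain Java as follows.
This is only an illustration under stated assumptions, not how htmlparser
or StringBean works: the class name VisibleWordCount and the short list of
"block" tags are hypothetical, and comments, scripts, and entities are not
handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of htmlparser.
class VisibleWordCount {
    // Assumed, deliberately incomplete list of tags that a browser renders as a break.
    private static final Pattern BLOCK_TAG = Pattern.compile(
        "(?i)</?(p|div|br|li|ul|ol|table|tr|td|th|h[1-6]|title|head|body|html)\\b[^>]*>");
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    /** Block tags become a space; every remaining (inline) tag is removed with no space. */
    static String visibleText(String html) {
        String spaced = BLOCK_TAG.matcher(html).replaceAll(" ");
        return ANY_TAG.matcher(spaced).replaceAll("");
    }

    /** Counts whole-word occurrences of `word` in the approximated visible text. */
    static int countWord(String html, String word) {
        Matcher m = Pattern.compile("\\b" + Pattern.quote(word) + "\\b")
                           .matcher(visibleText(html));
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }
}
```

With this approach the font tag contributes no whitespace, so
'AAA<font color='red'>AA</font>' reads as one word, while text in two
separate <p> elements stays separate.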

_______________________________________________
Htmlparser-user mailing list
Htm...@li...
https://lists.sourceforge.net/lists/listinfo/htmlparser-user

htmlparser-user — The user mailing list for users of the htmlparser library