Re: [Htmlparser-user] HTMLParser 1.6 : Unexpected behavior in getNext/getPrevSibling()
From: Ian M. <ian...@gm...> - 2005-12-07 13:27:00
Well, there are two things going on:
Firstly, look at this code:
Node prevSibling = string; //.getPreviousSibling();
while(prevSibling != null) {
    System.out.println("Prev Sibling " + prevSibling);
    prevSibling = prevSibling.getPreviousSibling();
}
It sets the variable 'prevSibling' to the current node (the trailing
//.getPreviousSibling(); is commented out, so it does nothing).
On the first pass the loop prints that node (getPreviousSibling() has
not been called yet), so it prints the current text node.
After that, it exits the loop, because getPreviousSibling() returned
null and prevSibling is now null.
Why? Because this is the node structure (the formatting might not come
out right, I'll also explain below):
<body>
  <TABLE WIDTH="651" CELLPADDING="0" CELLSPACING="0" BORDER="0">
    <TR VALIGN="TOP">
      <TD BGCOLOR="#FFFFFF" ALIGN="LEFT">
        <FONT face="helvetica, arial" size="1">
          <IMG SRC="http://www.comics.com/comics/dilbert/daily_dilbert/images/bullet2.gif" WIDTH="14" HEIGHT="11" ALT="" BORDER="0">
          <A HREF="https://members.comics.com/members/registration/showDilbertLogin.do?aid=1" target="_blank">
            Unsubscribe
          </A>
          <A HREF="https://members.comics.com/members/registration/showDilbertLogin.do?aid=1" target="_blank">
            Modify
          </A>
        </FONT>
      </TD>
    </TR>
  </TABLE>
</body>
OK, basically the situation is that the A tag is a CompositeTag, i.e.
it can have children. In this case, the text node you have found is a
_child_ of the A tag, not a sibling of it, and since it is the A tag's
only child it has no siblings at all. If you want the previous
sibling of the A tag enclosing that text node, call
string.getParent().getPreviousSibling().
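To make the child/sibling distinction concrete, here is a minimal, self-contained sketch. It does not use HTMLParser at all: SimpleNode, SiblingDemo and buildExample() are made-up names that only mimic the getParent()/getPreviousSibling()/getNextSibling() idea, and the tree is hand-built to mirror the FONT subtree above. The real parser may also produce whitespace text nodes between tags, which this sketch leaves out.

```java
import java.util.ArrayList;
import java.util.List;

public class SiblingDemo {

    /** Hypothetical stand-in for a parsed node; NOT HTMLParser's Node interface. */
    static final class SimpleNode {
        final String label;
        SimpleNode parent;
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String label) {
            this.label = label;
        }

        /** Appends a child, records its parent, and returns this node for chaining. */
        SimpleNode add(SimpleNode child) {
            child.parent = this;
            children.add(child);
            return this;
        }

        SimpleNode getParent() {
            return parent;
        }

        /** Returns the node just before this one under the same parent, or null. */
        SimpleNode getPreviousSibling() {
            if (parent == null) {
                return null;
            }
            int i = parent.children.indexOf(this);
            return i > 0 ? parent.children.get(i - 1) : null;
        }

        /** Returns the node just after this one under the same parent, or null. */
        SimpleNode getNextSibling() {
            if (parent == null) {
                return null;
            }
            int i = parent.children.indexOf(this);
            return i < parent.children.size() - 1 ? parent.children.get(i + 1) : null;
        }
    }

    /** Builds FONT -> [IMG, A -> "Unsubscribe", A -> "Modify"]; returns the "Unsubscribe" text node. */
    static SimpleNode buildExample() {
        SimpleNode font = new SimpleNode("FONT");
        SimpleNode img = new SimpleNode("IMG");
        SimpleNode unsubscribeText = new SimpleNode("Unsubscribe");
        SimpleNode unsubscribeLink = new SimpleNode("A").add(unsubscribeText);
        SimpleNode modifyLink = new SimpleNode("A").add(new SimpleNode("Modify"));
        font.add(img).add(unsubscribeLink).add(modifyLink);
        return unsubscribeText;
    }

    public static void main(String[] args) {
        SimpleNode text = buildExample();

        // The text node is the only child of its A tag, so it has no siblings.
        System.out.println("text prev sibling is null: " + (text.getPreviousSibling() == null));

        // Step up to the enclosing A tag first, then ask for its siblings.
        SimpleNode link = text.getParent();
        System.out.println("A prev sibling: " + link.getPreviousSibling().label);
        System.out.println("A next sibling: " + link.getNextSibling().label);
    }
}
```

The point of the sketch is only the navigation order: go up one level with getParent() before walking sideways, because siblings are always looked up among the children of the same parent.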
Hope that helps
Ian
On 12/7/05, Madhur Kumar Tanwani <mad...@gm...> wrote:
> Hi,
> I'm facing a problem using HTMLParser 1.6 (integration release) to parse an HTML document, described here.
> I'm using the getNextSibling and getPrevSibling functions from the new Node interface to go back and forward from a text node.
>
> The snippet of the HTML page causing the problem is here (table tag inserted into a body tag).
>
>
> ><body>
> ><TABLE WIDTH="651" CELLPADDING="0" CELLSPACING="0" BORDER="0"> <TR VALIGN="TOP"> <TD BGCOLOR="#FFFFFF" ALIGN="LEFT"> <FONT face="helvetica, arial" size="1">
> ><IMG SRC="http://www.comics.com/comics/dilbert/daily_dilbert/images/bullet2.gif" WIDTH="14" HEIGHT="11" ALT="" BORDER="0">
> ><A HREF="https://members.comics.com/members/registration/showDilbertLogin.do?aid=1" target="_blank"> Unsubscribe </A>/
> ><A HREF="https://members.comics.com/members/registration/showDilbertLogin.do?aid=1" target="_blank"
> >> Modify </A></FONT></TD></TR></TABLE></body>
>
>
>
> The code that I am using is as follows :- (in my custom visitor class)
>
> >public void visitStringNode(Text string) {
> >    if(string.getText().contains("Unsubscribe")) {
> >        Node prevSibling = string; //.getPreviousSibling();
> >        while(prevSibling != null) {
> >            System.out.println("Prev Sibling " + prevSibling);
> >            prevSibling = prevSibling.getPreviousSibling();
> >        }
> >
> >        Node nextSibling = string;
> >        while(nextSibling != null) {
> >            System.out.println("Next Sibling " + nextSibling);
> >            nextSibling = nextSibling.getNextSibling();
> >        }
> >    }
> >}
>
>
> However the output that is seen when the code runs is as follows :-
>
>
> >String : Unsubscribe
> >Prev Sibling Txt (389[3,100],402[3,113]): Unsubscribe
> >Next Sibling Txt (389[3,100],402[3,113]): Unsubscribe
>
>
> I expected that the parser would treat the <A> tag and the <IMG> just before the text "Unsubscribe"
> as siblings and would return those.
>
> Please could you tell me where I'm going wrong? Or is it that the parser is not correctly getting the siblings?
>
> Thanks,
>
>
> --
> Madhur Kumar Tanwani
> "If opportunity knocks only once then build more doors"......