Thread: [Htmlparser-user] HTMLParser 1.6 : Unexpected behavior in getNext/getPrevSibling()
From: Madhur K. T. <mad...@gm...> - 2005-12-07 12:21:24
|
Hi,

I'm facing a problem using HTMLParser 1.6 (integration release) to parse an HTML document, described here. I'm using the getNextSibling and getPreviousSibling methods from the new Node interface to move back and forward from a text node.

The snippet of the HTML page causing the problem is here (table tag inserted into a body tag).

><body>
><TABLE WIDTH="651" CELLPADDING="0" CELLSPACING="0" BORDER="0"> <TR VALIGN="TOP"> <TD BGCOLOR="#FFFFFF" ALIGN="LEFT"> <FONT face="helvetica, arial" size="1">
><IMG SRC="http://www.comics.com/comics/dilbert/daily_dilbert/images/bullet2.gif" WIDTH="14" HEIGHT="11" ALT="" BORDER="0">
><A HREF="https://members.comics.com/members/registration/showDilbertLogin.do?aid=1" target="_blank"> Unsubscribe </A>/
><A HREF="https://members.comics.com/members/registration/showDilbertLogin.do?aid=1" target="_blank"
>> Modify </A></FONT></TD></TR></TABLE></body>

The code that I am using is as follows (in my custom visitor class):

>public void visitStringNode(Text string) {
>    if(string.getText().contains("Unsubscribe")) {
>        Node prevSibling = string; //.getPreviousSibling();
>        while(prevSibling != null) {
>            System.out.println("Prev Sibling " + prevSibling);
>            prevSibling = prevSibling.getPreviousSibling();
>        }
>
>        Node nextSibling = string;
>        while(nextSibling != null) {
>            System.out.println("Next Sibling " + nextSibling);
>            nextSibling = nextSibling.getNextSibling();
>        }
>    }
>}

However, the output that is seen when the code runs is as follows:

>String : Unsubscribe
>Prev Sibling Txt (389[3,100],402[3,113]): Unsubscribe
>Next Sibling Txt (389[3,100],402[3,113]): Unsubscribe

I expected that the parser would treat the <A> tag and the <IMG> just before the text "Unsubscribe" as siblings and would return those.

Please could you tell me where I'm going wrong? Or is it that the parser is not correctly getting the siblings?

Thanks,

--
Madhur Kumar Tanwani
"If opportunity knocks only once then build more doors"......
|
|
From: Ian M. <ian...@gm...> - 2005-12-07 13:27:00
|
Well, there are two things going on.
Firstly, look at this code:
Node prevSibling = string; //.getPreviousSibling();
while(prevSibling != null) {
    System.out.println("Prev Sibling " + prevSibling);
    prevSibling = prevSibling.getPreviousSibling();
}
It sets the variable 'prevSibling' to the current node (the trailing
//.getPreviousSibling(); is commented out, so it does nothing).
Next it prints that variable (getPreviousSibling() has not been called yet),
so it prints the current node.
After that, getPreviousSibling() returns null, so the loop exits.
Why? Because this is the node structure (the formatting might not come
out right, so I'll also explain below):
<body>
    <TABLE WIDTH="651" CELLPADDING="0" CELLSPACING="0" BORDER="0">
        <TR VALIGN="TOP">
            <TD BGCOLOR="#FFFFFF" ALIGN="LEFT">
                <FONT face="helvetica, arial" size="1">
                    <IMG SRC="http://www.comics.com/comics/dilbert/daily_dilbert/images/bullet2.gif" WIDTH="14" HEIGHT="11" ALT="" BORDER="0">
                    <A HREF="https://members.comics.com/members/registration/showDilbertLogin.do?aid=1" target="_blank">
                        Unsubscribe
                    </A>
                    <A HREF="https://members.comics.com/members/registration/showDilbertLogin.do?aid=1" target="_blank">
                        Modify
                    </A>
                </FONT>
            </TD>
        </TR>
    </TABLE>
</body>
OK, basically the situation is that the A tag is a CompositeTag, i.e.
it can have children. In this case, the text node you have found is a
_child_ of the A tag, not a sibling of it. If you want the previous
sibling of the A tag enclosing that text node, call
getParent().getPreviousSibling() on the text node.
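
To make the child-versus-sibling point concrete, here is a minimal sketch. It uses a toy ToyNode class written for this illustration, not the org.htmlparser Node interface; only the method names getParent, getPreviousSibling and getNextSibling are borrowed from the discussion above. A real parse may also contain whitespace text nodes between tags, which the toy tree leaves out.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a parsed HTML tree; NOT the org.htmlparser API.
class SiblingDemo {
    static class ToyNode {
        final String label;
        ToyNode parent;
        final List<ToyNode> children = new ArrayList<>();

        ToyNode(String label) { this.label = label; }

        // Appends a child and returns this node so calls can be chained.
        ToyNode add(ToyNode child) {
            child.parent = this;
            children.add(child);
            return this;
        }

        ToyNode getParent() { return parent; }

        // Siblings are the neighbouring entries in the parent's child list.
        ToyNode getPreviousSibling() {
            if (parent == null) return null;
            int i = parent.children.indexOf(this);
            return i > 0 ? parent.children.get(i - 1) : null;
        }

        ToyNode getNextSibling() {
            if (parent == null) return null;
            int i = parent.children.indexOf(this);
            return i < parent.children.size() - 1 ? parent.children.get(i + 1) : null;
        }

        @Override public String toString() { return label; }
    }

    // Builds FONT -> [IMG, A#1 -> [text], A#2 -> [text]] and returns the "Unsubscribe" text node.
    static ToyNode buildUnsubscribeText() {
        ToyNode text = new ToyNode("Txt: Unsubscribe");
        ToyNode font = new ToyNode("FONT");
        font.add(new ToyNode("IMG"))
            .add(new ToyNode("A#1").add(text))
            .add(new ToyNode("A#2").add(new ToyNode("Txt: Modify")));
        return text;
    }

    public static void main(String[] args) {
        ToyNode text = buildUnsubscribeText();
        System.out.println(text.getPreviousSibling());             // null: the text is the only child of its A tag
        System.out.println(text.getParent().getPreviousSibling()); // IMG
        System.out.println(text.getParent().getNextSibling());     // A#2
    }
}
```

The text node reports no siblings because it is alone inside its A tag; stepping up to the parent first is what reaches IMG and the second A.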
Hope that helps
Ian
> _______________________________________________
> Htmlparser-user mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
>
|
|
From: Madhur K. T. <mad...@gm...> - 2005-12-09 05:57:31
Attachments:
Code_snippet_for_prevTag_nextTag.java
|
Hey,
Thanks Ian! Great! That was a clear-cut explanation.
OK, so to suit my situation, at least, I've designed and implemented code
snippets that get the previous and next node. I've attached the code
to this mail.
I've tested the code with many HTML pages and it works fine. In case it is
useful, the code is free to use by anybody, anywhere, but I expect that
you would preserve the ownership details.
Please, if possible, could anyone comment on the code with criticism or
suggestions? One probably important addition is that I could start
supporting filters in the functions (something like "get me the previous
link node only").
I'm not sure of the procedures and standards, but if this code, with
whatever tweaks are required, could make it into some version of HTMLParser,
I'll be obliged. I did not post it to the HTMLParser dev mailing list, since I
think that it would be too early to announce the code.
So, HTMLParser users, I need your comments and suggestions.
Looking forward to comments,
Thanks,
--
__________________________
Madhur Kumar Tanwani
mad...@gm...
Ph.: 0253-5614792.
__________________________
Always remember that you are absolutely unique. Just like everyone else.
|
|
Re: Tweaked : Re: [Htmlparser-user] HTMLParser 1.6 : Unexpected behavior in getNext/getPrevSibling()
From: Ian M. <ian...@gm...> - 2005-12-09 15:10:17
|
That looks something like a depth-first search algorithm for fetching
next and previous nodes.
I've already volunteered the possibility of breadth-first traversal to
the project, so we just have to see if the people who lead the project
would like to accept it, then both could be contributed.
By the way, the code would deal with Nodes rather than Tags (the
logic is tree traversal), so you wouldn't want to check whether it was a
tag or not (what you'd do instead is get the next node, and loop until it
matches whatever you want it to match).
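
As a rough sketch of that "get the next node, then loop until it matches" idea, assuming a toy ToyNode class invented for this illustration (it is not the org.htmlparser Node interface, and nextNode and findNext are hypothetical names):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy tree for illustration only; NOT the org.htmlparser API.
class DepthFirstDemo {
    static class ToyNode {
        final String label;
        ToyNode parent;
        final List<ToyNode> children = new ArrayList<>();

        ToyNode(String label, ToyNode... kids) {
            this.label = label;
            for (ToyNode k : kids) { k.parent = this; children.add(k); }
        }

        ToyNode nextSibling() {
            if (parent == null) return null;
            int i = parent.children.indexOf(this);
            return i + 1 < parent.children.size() ? parent.children.get(i + 1) : null;
        }
    }

    // Next node in document (pre-order, depth-first) order, or null at the end of the tree.
    static ToyNode nextNode(ToyNode current) {
        if (!current.children.isEmpty()) return current.children.get(0);
        // No children: climb until some ancestor (or the node itself) has a next sibling.
        for (ToyNode n = current; n != null; n = n.parent) {
            ToyNode sibling = n.nextSibling();
            if (sibling != null) return sibling;
        }
        return null;
    }

    // Keeps stepping forward until a node satisfies the caller's test; null if none does.
    static ToyNode findNext(ToyNode current, Predicate<ToyNode> wanted) {
        ToyNode n = nextNode(current);
        while (n != null && !wanted.test(n)) n = nextNode(n);
        return n;
    }

    // Builds FONT -> [IMG, A -> [Unsubscribe], A -> [Modify]] and returns the "Unsubscribe" node.
    static ToyNode sampleText() {
        ToyNode text = new ToyNode("Unsubscribe");
        new ToyNode("FONT",
            new ToyNode("IMG"),
            new ToyNode("A", text),
            new ToyNode("A", new ToyNode("Modify")));
        return text;
    }

    public static void main(String[] args) {
        ToyNode text = sampleText();
        System.out.println(nextNode(text).label);                            // A
        System.out.println(findNext(text, n -> n.children.isEmpty()).label); // Modify
    }
}
```

The traversal itself never asks what kind of node it is looking at; the caller's predicate carries that decision, which is also where a filter such as "links only" would plug in.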
I envisaged these methods:
getNextSibling(Node currentNode, Node rootNode, boolean depthFirst)
getPreviousSibling(Node currentNode, Node rootNode, boolean depthFirst)
and as the depth-first is likely to be the more common use-case, wrappers:
getNextSibling(Node currentNode, Node rootNode)
getPreviousSibling(Node currentNode, Node rootNode)
and indeed, as the entire document is likely to be what we are searching:
getNextSibling(Node currentNode, boolean depthFirst)
getPreviousSibling(Node currentNode, boolean depthFirst)
getNextSibling(Node currentNode)
getPreviousSibling(Node currentNode)
Though those last ones would have to wait till the getNext/Previous
node methods could deal with documents with multiple root nodes.
Either that, or there ought to be a DocumentNode that holds the entire
document. I'm not yet sure what the best way is.
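
To illustrate what a depthFirst flag with an explicit root node could mean, here is a minimal sketch. It assumes a toy ToyNode class and hypothetical order and next helpers; none of this is existing HTMLParser code. It simply lists the nodes under the root in the chosen order and returns the entry after the current one.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy tree for illustration only; NOT the org.htmlparser API.
class TraversalOrderDemo {
    static class ToyNode {
        final String label;
        final List<ToyNode> children = new ArrayList<>();

        ToyNode(String label, ToyNode... kids) {
            this.label = label;
            for (ToyNode k : kids) children.add(k);
        }
    }

    // Lists every node under root in pre-order (depthFirst) or level order (breadth-first).
    static List<ToyNode> order(ToyNode root, boolean depthFirst) {
        List<ToyNode> out = new ArrayList<>();
        Deque<ToyNode> pending = new ArrayDeque<>();
        pending.add(root);
        while (!pending.isEmpty()) {
            ToyNode n = pending.removeFirst();
            out.add(n);
            if (depthFirst) {
                // Push children on the front, last child first, so the first child is visited next.
                for (int i = n.children.size() - 1; i >= 0; i--) pending.addFirst(n.children.get(i));
            } else {
                // Queue children at the back so each level is finished before the next starts.
                pending.addAll(n.children);
            }
        }
        return out;
    }

    // Node following current in the chosen order, or null if current is last or not under root.
    static ToyNode next(ToyNode current, ToyNode root, boolean depthFirst) {
        List<ToyNode> all = order(root, depthFirst);
        int i = all.indexOf(current);
        return (i >= 0 && i + 1 < all.size()) ? all.get(i + 1) : null;
    }

    public static void main(String[] args) {
        ToyNode a1 = new ToyNode("A1", new ToyNode("Unsubscribe"));
        ToyNode a2 = new ToyNode("A2", new ToyNode("Modify"));
        ToyNode font = new ToyNode("FONT", new ToyNode("IMG"), a1, a2);
        System.out.println(next(a1, font, true).label);  // Unsubscribe (descends into A1 first)
        System.out.println(next(a1, font, false).label); // A2 (finishes the level first)
    }
}
```

Rebuilding the full list on every call is linear in the tree size, so this is only meant to show the two orderings; a previous-node lookup would read the entry before the current one in the same list.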
By the way, there is an inefficiency in your code that you'd want to
change, in addition to changing Tag to the generic Node.
Instead of:
if(tempNode.getNextSibling() != null) {
    nextNode = tempNode.getNextSibling();
    break;
}
it's more efficient to do this:
tempNode2 = tempNode.getNextSibling();
if(tempNode2 != null) {
    nextNode = tempNode2;
    break;
}
That way it only calls getNextSibling once at that point, not twice.
Kind regards,
Ian Macfarlane
|