Re: Tweaked : Re: [Htmlparser-user] HTMLParser 1.6 : Unexpected behavior in getNext/getPrevSibling()

That looks something like a depth-first search algorithm for fetching
next and previous nodes.

I've already volunteered the possibility of breadth-first traversal to
the project, so we just have to see if the people who lead the project
would like to accept it, then both could be contributed.

By the way, the code would deal with Nodes rather than Tags (the
logic is tree traversal), so you wouldn't want to check whether it was a
tag or not. What you'd do instead is get the next node, and loop until it
matches whatever you wanted it to match.
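
To illustrate, here is a minimal sketch of "get next node, and loop until
it matches". It assumes a stand-in SimpleNode class and helper names
(next, nextMatching) that I made up for the example; none of this is the
HTMLParser API, and it uses newer Java syntax than the library targets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class NodeWalkSketch {
    /** Hypothetical stand-in for a parser node; not the HTMLParser Node interface. */
    static class SimpleNode {
        final String name;
        SimpleNode parent;
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String name) { this.name = name; }

        SimpleNode add(SimpleNode child) {
            child.parent = this;
            children.add(child);
            return this;
        }

        /** The following child of the same parent, or null if there is none. */
        SimpleNode nextSibling() {
            if (parent == null) return null;
            int i = parent.children.indexOf(this);
            return i + 1 < parent.children.size() ? parent.children.get(i + 1) : null;
        }
    }

    /** Next node in depth-first (document) order, or null at the end of the tree. */
    static SimpleNode next(SimpleNode node) {
        // Descend first: the first child comes straight after its parent.
        if (!node.children.isEmpty()) return node.children.get(0);
        // Otherwise climb until some ancestor (or the node itself) has a next sibling.
        for (SimpleNode n = node; n != null; n = n.parent) {
            SimpleNode sibling = n.nextSibling();  // fetched once, then reused
            if (sibling != null) return sibling;
        }
        return null;
    }

    /** Keeps stepping with next() until the caller's test accepts a node. */
    static SimpleNode nextMatching(SimpleNode start, Predicate<SimpleNode> wanted) {
        SimpleNode n = next(start);
        while (n != null && !wanted.test(n)) n = next(n);
        return n;
    }

    public static void main(String[] args) {
        SimpleNode img = new SimpleNode("IMG");
        SimpleNode a = new SimpleNode("A").add(img);
        SimpleNode text = new SimpleNode("text");
        new SimpleNode("P").add(a).add(text);
        // Starting from the image inside the link, look for the next text node.
        SimpleNode found = nextMatching(img, n -> n.name.equals("text"));
        System.out.println(found == null ? "none" : found.name);
    }
}
```

The traversal itself never asks what kind of node it is looking at; only
the caller's test does, which is where a filter could plug in later.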

I envisaged these methods:

getNextSibling(Node currentNode, Node rootNode, boolean depthFirst)
getPreviousSibling(Node currentNode, Node rootNode, boolean depthFirst)

and, as depth-first is likely to be the more common use case, wrappers:

getNextSibling(Node currentNode, Node rootNode)
getPreviousSibling(Node currentNode, Node rootNode)

and indeed, as the entire document is likely to be what we are searching:

getNextSibling(Node currentNode, boolean depthFirst)
getPreviousSibling(Node currentNode, boolean depthFirst)
getNextSibling(Node currentNode)
getPreviousSibling(Node currentNode)

Though those last ones would have to wait till the getNext/Previous
node methods could deal with documents with multiple root nodes.
Either that, or there ought to be a DocumentNode that holds the entire
document. I'm not yet sure what the best way is.
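
As a rough sketch of how those overloads could delegate to one another,
assuming a stand-in TreeNode class and the name getNextNode (both
hypothetical, not HTMLParser code): the three-argument form does the work
and the wrappers only fill in defaults. The root-less wrappers climb to the
topmost ancestor, so they only cover the single-root case described above.
The full form rebuilds the whole visiting order on every call, which is
simple but not efficient.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class TraversalOverloadsSketch {
    /** Hypothetical stand-in node; not the HTMLParser Node interface. */
    static class TreeNode {
        final String name;
        TreeNode parent;
        final List<TreeNode> children = new ArrayList<>();

        TreeNode(String name) { this.name = name; }

        TreeNode add(TreeNode child) {
            child.parent = this;
            children.add(child);
            return this;
        }
    }

    /** Full form: the node visited after current when walking the tree under root. */
    static TreeNode getNextNode(TreeNode current, TreeNode root, boolean depthFirst) {
        List<TreeNode> order = new ArrayList<>();
        Deque<TreeNode> pending = new ArrayDeque<>();
        pending.add(root);
        while (!pending.isEmpty()) {
            TreeNode n = pending.removeFirst();
            order.add(n);
            if (depthFirst) {
                // Stack behaviour: push children in reverse so the first child is visited next.
                for (int i = n.children.size() - 1; i >= 0; i--) {
                    pending.addFirst(n.children.get(i));
                }
            } else {
                // Queue behaviour: children wait behind everything already pending.
                pending.addAll(n.children);
            }
        }
        int i = order.indexOf(current);
        return (i >= 0 && i + 1 < order.size()) ? order.get(i + 1) : null;
    }

    /** Wrapper: depth-first by default. */
    static TreeNode getNextNode(TreeNode current, TreeNode root) {
        return getNextNode(current, root, true);
    }

    /** Wrapper: search the whole tree; assumes a single topmost ancestor. */
    static TreeNode getNextNode(TreeNode current, boolean depthFirst) {
        return getNextNode(current, topOf(current), depthFirst);
    }

    /** Wrapper: whole tree, depth-first. */
    static TreeNode getNextNode(TreeNode current) {
        return getNextNode(current, true);
    }

    /** Climbs parent links to the topmost ancestor. */
    static TreeNode topOf(TreeNode n) {
        while (n.parent != null) n = n.parent;
        return n;
    }

    public static void main(String[] args) {
        TreeNode title = new TreeNode("TITLE");
        TreeNode head = new TreeNode("HEAD").add(title);
        TreeNode body = new TreeNode("BODY").add(new TreeNode("P"));
        new TreeNode("HTML").add(head).add(body);
        System.out.println(getNextNode(head).name);        // depth-first neighbour
        System.out.println(getNextNode(head, false).name); // breadth-first neighbour
    }
}
```

The getPrevious variants would mirror this by taking the element before
the current node in the same visiting order.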

By the way, there is an inefficiency in your code that you'd want to
change, in addition to changing Tag to generic Node.

Instead of:

if (tempNode.getNextSibling() != null) {
    nextNode = tempNode.getNextSibling();
    break;
}

it's more efficient to do this:

tempNode2 = tempNode.getNextSibling();
if (tempNode2 != null) {
    nextNode = tempNode2;
    break;
}

That way it only calls getNextSibling at that point once, not twice.

Kind regards,

Ian Macfarlane

On 12/9/05, Madhur Kumar Tanwani <mad...@gm...> wrote:
> Hey,
>     Thanks Ian!! great!! That was a clear cut explanation... cool!!
>
> Ok.. so to suit my situation, at least, I've designed and implemented code
> snippets, which would get the Previous and Next Node. I've attached code
> for the same with this mail.
>
> I've tested the code with many HTML pages. It works fine. In case
> useful, the code is free to use, by anybody anywhere, but I expect that
> you would preserve the ownership details.
>
> Please, if possible, could anyone comment on the code with criticism or
> suggestions. One probably important thing is that I could start
> supporting filters in the function (something like get me the previous
> link node only).
>
> I'm not sure of the procedures and standards but if this code with
> whatever tweaks required could make it to some version of HTML parser,
> I'll be obliged. I did not post it to the HTML Dev mailing list, since I
> think that it would be too early to announce the code.
>
> So, HTMLParser Users, I need your comments and suggestions.
> Looking forward to comments,
>
> Thanks,
>
> Ian Macfarlane wrote:
>
> >After that, it exits the loop, because prevSibling is now null.
> >
> >Why? Because this is the node structure (the formatting might not come
> >out right, I'll also explain below):
> >
> >On 12/7/05, Madhur Kumar Tanwani <mad...@gm...> wrote:
> >
> >
> >>>String :  Unsubscribe
> >>>Prev Sibling Txt (389[3,100],402[3,113]):  Unsubscribe
> >>>Next Sibling Txt (389[3,100],402[3,113]):  Unsubscribe
> >>>
> >>>
> >>I expected that the parser would treat the <A> tag and the <IMG> just before the text "Unsubscribe"
> >>as siblings and would return those.
> >>
> >>
>
> --
> __________________________
> Madhur Kumar Tanwani
> mad...@gm...
> Ph.: 0253-5614792.
> __________________________
>  Always remember that you are absolutely unique. Just like everyone else.