[Htmlparser-user] Composite Tag Scanning

Hi,

I am using HTML Parser to build a DOM from HTML documents, with the following code:
Parser parser = new Parser("http://www.yahoo.com/");
NodeList rootNodes = parser.parse(null);

This code works fine and generates a DOM, returning all the root nodes in that NodeList. But while traversing the tree, I found that many nodes have a flat structure.

E.g., for an 'h1' node, the parser creates TEXT and /h1 as children of h1. But for a 'b' node, it creates TEXT and /b as siblings of the 'b' node instead of nesting them as it does for 'h1'.

I figured out that CompositeTag nodes are parsed by the CompositeTagScanner and built into a tree-like hierarchy, while other tags are parsed by TagScanner. By definition, a composite tag is any tag with an end tag, so I expected 'b' to be handled the same way, but it is not.

Can someone tell me how to make HTML Parser treat every node as a composite tag? As far as I can tell, only tags that extend CompositeTag, such as HeadingTag and TableTag, use CompositeTagScanner. Is there a way to force all nodes to be treated as composite? Or is there any other workaround that would make my tree structure consistent (not dependent on whether a tag is an h1 or a b)?
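
To make the structure I am after concrete, here is a minimal, library-independent sketch of the kind of post-processing workaround I have in mind: it walks a flat sibling list and nests each start tag's content and end tag under it, using a stack of open tags. Node, nest and the VOID list are hypothetical names of my own, not part of HTML Parser, and the sketch assumes every non-void start tag has a matching end tag (an unclosed tag would swallow its following siblings).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class FlatToTree {
    // Hypothetical stand-in for a parser node; NOT an HTML Parser class.
    static final class Node {
        final String name;                          // e.g. "b", "/b" or "TEXT"
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
        boolean isEndTag() { return name.startsWith("/"); }
        boolean isText() { return name.equals("TEXT"); }
        @Override public String toString() {
            return children.isEmpty() ? name : name + children;
        }
    }

    // Assumed (incomplete) list of tags that never take an end tag.
    static final Set<String> VOID =
            new HashSet<>(Arrays.asList("br", "img", "hr", "input", "meta", "link"));

    // Rebuilds a nested structure from a flat list of sibling nodes.
    static List<Node> nest(List<Node> flat) {
        List<Node> roots = new ArrayList<>();
        Deque<Node> open = new ArrayDeque<>();      // stack of currently open tags
        for (Node n : flat) {
            if (n.isEndTag()) {
                String target = n.name.substring(1);
                boolean matched = false;
                for (Node o : open) {
                    if (o.name.equals(target)) { matched = true; break; }
                }
                if (matched) {
                    // Discard unclosed inner tags, then attach the end tag to its owner.
                    while (!open.peek().name.equals(target)) open.pop();
                    open.pop().children.add(n);
                    continue;
                }
                // A stray end tag falls through and is kept as a plain leaf.
            }
            (open.isEmpty() ? roots : open.peek().children).add(n);
            // Only start tags that are not already nested by the parser get opened.
            if (!n.isEndTag() && !n.isText() && !VOID.contains(n.name)
                    && n.children.isEmpty()) {
                open.push(n);
            }
        }
        return roots;
    }
}
```

With this, a flat list such as b, TEXT, /b would come back as a single 'b' root holding TEXT and /b as children, which is the same shape the parser already produces for 'h1'.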

Regards,
Anurag.
