HTML Parser / Discussion / Help: Filtering HTML Email Removing Tags & Attribs

Hi guys,
I've just started using HTMLParser and I'm trying to do something similar to what's discussed in this thread:

http://sourceforge.net/forum/forum.php?thread_id=1452561&forum_id=77089

I have a couple of questions if anyone has time to give me a few tips.

I want to filter specific tags and attributes out of incoming HTML email, to tidy it up for a webmail application.

1. The following code works well for tag classes in most situations, but not for tags that don't have a parent.

import java.util.ArrayList;
import java.util.List;

import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.nodes.TagNode;
import org.htmlparser.tags.*;
import org.htmlparser.util.NodeList;

Parser parser = new Parser("C:\\webroot\\tmp\\htmltests\\tb.html");
NodeList root = parser.parse(null);

// One NodeClassFilter per tag class that should be stripped
List<NodeClassFilter> filters = new ArrayList<NodeClassFilter>();
filters.add(new NodeClassFilter(HeadTag.class));
filters.add(new NodeClassFilter(ScriptTag.class));
filters.add(new NodeClassFilter(StyleTag.class));
filters.add(new NodeClassFilter(FrameTag.class));
filters.add(new NodeClassFilter(FrameSetTag.class));
filters.add(new NodeClassFilter(FormTag.class));
filters.add(new NodeClassFilter(BaseHrefTag.class));
filters.add(new NodeClassFilter(ObjectTag.class));
filters.add(new NodeClassFilter(AppletTag.class));
filters.add(new NodeClassFilter(MetaTag.class));
filters.add(new NodeClassFilter(ImageTag.class));
filters.add(new NodeClassFilter(DoctypeTag.class));
filters.add(new NodeClassFilter(ProcessingInstructionTag.class));

for (NodeClassFilter filter : filters) {
    System.out.println("---- REMOVING NODE CLASS TAGS ---- " + filter.getMatchClass());
    // Collect every matching node, searching the whole tree recursively
    NodeList nl = root.extractAllNodesThatMatch(filter, true);
    System.out.println("found " + nl.size() + " tags");

    for (int i = 0; i < nl.size(); i++) {
        TagNode node = (TagNode) nl.elementAt(i);
        System.out.println(node);
        // Fails here when the node has no parent (e.g. DoctypeTag)
        if (node.getParent().getChildren().remove(node)) {
            System.out.println("removed node");
        }
    }
}

This throws an exception on DoctypeTag.class. How can I remove a tag if it doesn't have a parent?
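
To make the problem concrete, here is a minimal sketch of the kind of guard I think is needed: take the node out of its parent's child list when it has a parent, and out of the top-level list otherwise. SimpleNode and RemoveSketch are hypothetical stand-ins written only for illustration, not HTMLParser classes, so this shows the idea rather than the library's actual behaviour.

```java
import java.util.ArrayList;
import java.util.List;

public class RemoveSketch {
    /** Hypothetical stand-in for a parsed node; not an HTMLParser class. */
    static class SimpleNode {
        final String name;
        SimpleNode parent;                                 // null for top-level nodes
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String name) { this.name = name; }

        /** Appends a child, records its parent, and returns this node for chaining. */
        SimpleNode add(SimpleNode child) {
            child.parent = this;
            children.add(child);
            return this;
        }
    }

    /** Removes node from its parent's children, or from root when it has no parent. */
    static boolean remove(List<SimpleNode> root, SimpleNode node) {
        List<SimpleNode> siblings = (node.parent == null) ? root : node.parent.children;
        boolean removed = siblings.remove(node);
        if (removed) {
            node.parent = null;                            // the node is now detached
        }
        return removed;
    }

    public static void main(String[] args) {
        SimpleNode doctype = new SimpleNode("!DOCTYPE");   // top level, no parent
        SimpleNode script = new SimpleNode("script");
        SimpleNode html = new SimpleNode("html").add(script);

        List<SimpleNode> root = new ArrayList<>();
        root.add(doctype);
        root.add(html);

        System.out.println(remove(root, doctype));         // true
        System.out.println(remove(root, script));          // true
        System.out.println(root.size() + " " + html.children.size()); // 1 0
    }
}
```

In HTMLParser terms I assume this would mean testing node.getParent() for null and calling root.remove(node) on the top-level NodeList in that case, but I haven't confirmed that this is the intended way.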

2. I'm also using TagNameFilter to remove other tags such as IFRAME and FONT, but it doesn't remove the end tag. What's the best way to remove both the start and end tags while leaving the text inside them intact?
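
What I'm after is essentially an "unwrap": drop the tag itself and promote its children into its place. A rough sketch of that behaviour follows, again using a hypothetical SimpleNode model (with made-up tag, text, unwrap and render helpers) rather than real HTMLParser types, so how it maps onto the library is an open assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UnwrapSketch {
    /** Hypothetical stand-in node: either a tag with children or a text leaf. */
    static class SimpleNode {
        final String name;                                 // tag name, or "#text"
        final String text;                                 // non-null only for text leaves
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String name, String text) {
            this.name = name;
            this.text = text;
        }
    }

    static SimpleNode tag(String name, SimpleNode... kids) {
        SimpleNode n = new SimpleNode(name, null);
        n.children.addAll(Arrays.asList(kids));
        return n;
    }

    static SimpleNode text(String s) {
        return new SimpleNode("#text", s);
    }

    /** Replaces every tagName element with its own children, at any depth. */
    static void unwrap(List<SimpleNode> siblings, String tagName) {
        int i = 0;
        while (i < siblings.size()) {
            SimpleNode n = siblings.get(i);
            if (n.name.equalsIgnoreCase(tagName)) {
                siblings.remove(i);
                siblings.addAll(i, n.children);
                // Do not advance: the promoted children at index i still need checking.
            } else {
                unwrap(n.children, tagName);
                i++;
            }
        }
    }

    /** Serialises the tree back to markup (no attributes in this sketch). */
    static String render(List<SimpleNode> nodes) {
        StringBuilder sb = new StringBuilder();
        for (SimpleNode n : nodes) {
            if (n.text != null) {
                sb.append(n.text);
            } else {
                sb.append('<').append(n.name).append('>')
                  .append(render(n.children))
                  .append("</").append(n.name).append('>');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<SimpleNode> root = new ArrayList<>();
        root.add(tag("p", text("Hello "), tag("font", text("big "), tag("b", text("world")))));

        unwrap(root, "font");
        System.out.println(render(root));                  // <p>Hello big <b>world</b></p>
    }
}
```

The FONT element disappears, start tag and end tag together, while its text and the nested B element stay where they were.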

3. The HTML Parser docs say that it can be used to tidy up HTML. Does this mean it has similar functionality to jTidy, or should I run jTidy over the email before I strip out the tags with HTMLParser?

Any hints would be appreciated.
