Hi guys,
I've just started to use HTMLparser and I'm trying to do something similar to what's discussed in this thread:
http://sourceforge.net/forum/forum.php?thread_id=1452561&forum_id=77089
I have a couple of questions if anyone has time to give me a few tips.
I want to filter out specific tags and attributes from incoming HTML email so as to tidy it up for a webmail application.
1. Using the following code works well for tag classes in most situations, but not for tags that don't have a parent.
import java.util.ArrayList;
import java.util.List;

import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.nodes.TagNode;
import org.htmlparser.tags.*;
import org.htmlparser.util.NodeList;

Parser parser = new Parser("C:\\webroot\\tmp\\htmltests\\tb.html");
NodeList root = parser.parse(null);

// One NodeClassFilter per tag class to strip
List<NodeClassFilter> filters = new ArrayList<NodeClassFilter>();
filters.add(new NodeClassFilter(HeadTag.class));
filters.add(new NodeClassFilter(ScriptTag.class));
filters.add(new NodeClassFilter(StyleTag.class));
filters.add(new NodeClassFilter(FrameTag.class));
filters.add(new NodeClassFilter(FrameSetTag.class));
filters.add(new NodeClassFilter(FormTag.class));
filters.add(new NodeClassFilter(BaseHrefTag.class));
filters.add(new NodeClassFilter(ObjectTag.class));
filters.add(new NodeClassFilter(AppletTag.class));
filters.add(new NodeClassFilter(MetaTag.class));
filters.add(new NodeClassFilter(ImageTag.class));
filters.add(new NodeClassFilter(DoctypeTag.class));
filters.add(new NodeClassFilter(ProcessingInstructionTag.class));

for (NodeClassFilter filter : filters) {
    System.out.println("---- REMOVING NODE CLASS TAGS ----" + filter.getMatchClass());
    // true = also search recursively through child nodes
    NodeList nl = root.extractAllNodesThatMatch(filter, true);
    System.out.println("found " + nl.size() + " tags");
    for (int i = 0; i < nl.size(); i++) {
        TagNode node = (TagNode) nl.elementAt(i);
        System.out.println(node);
        // This is the line that fails for tags without a parent (e.g. the doctype)
        if (node.getParent().getChildren().remove(node)) {
            System.out.println("removed node");
        }
    }
}
This throws an exception on DoctypeTag.class. How can I remove a tag if it doesn't have a parent?
2. I'm also using the TagNameFilter to remove other tags like IFRAME and FONT, but it doesn't remove the end tag. What's the best way to do that and leave the text within the tag intact?
3. The HTML Parser docs say that it can be used to tidy up the HTML. Does this mean it has similar functionality to jTidy, or should I run jTidy over the HTML before I parse out the tags with HTMLparser?
Any hints would be appreciated.
1. The root list you have contains the doctype tag, so you'll need to remove it from root.
2. Hmmm, it should, because the tag name of </FONT> is FONT. Maybe try searching for and removing "/FONT" tag names.
3. I don't think anyone can answer that for you, since it depends on what you want to fix. Automatic correction code in HTML Parser is rather limited, so you should probably use JTidy.
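Point 1 can be sketched roughly as below. This is only an illustration under stated assumptions: Node here is a hypothetical stand-in class, not an HTML Parser type, and root is a plain java.util.List standing in for the root NodeList. The idea is simply to check for a missing parent and, in that case, remove the node from the root list instead of from the parent's children.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RemoveFromTree {
    /** Hypothetical stand-in for a parsed node; NOT an HTML Parser class. */
    static class Node {
        final String name;
        Node parent;                                  // null for top-level nodes
        final List<Node> children = new ArrayList<>();

        Node(String name) { this.name = name; }

        Node add(Node child) {
            child.parent = this;
            children.add(child);
            return this;
        }
    }

    /** Removes node from its parent's children, or from root if it has no parent. */
    static boolean remove(List<Node> root, Node node) {
        if (node.parent == null) {
            return root.remove(node);                 // top-level node, e.g. the doctype
        }
        return node.parent.children.remove(node);
    }

    public static void main(String[] args) {
        Node doctype = new Node("!DOCTYPE");
        Node script = new Node("SCRIPT");
        Node html = new Node("HTML").add(script);
        List<Node> root = new ArrayList<>(Arrays.asList(doctype, html));

        System.out.println(remove(root, doctype));    // true: taken out of the root list
        System.out.println(remove(root, script));     // true: taken out of HTML's children
        System.out.println(root.size() + " top-level node(s) left");
    }
}
```

In the loop from the question, the equivalent change would presumably be to test node.getParent() for null before calling getChildren() on it, and to call remove on root in that case.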
Thanks Derrick, it's definitely not removing the end tags using a TagNameFilter. I tried searching for /tagname and it doesn't find anything.
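One way to think about the end-tag problem is sketched below. It assumes (not verified against the HTML Parser source here) that the filter in use only matches start tags, so the end tag has to be matched separately by name. Token is a hypothetical stand-in class, not an HTML Parser type; the point is only that matching on the tag name without looking at the end-tag flag drops both <FONT ...> and </FONT> while leaving the text between them intact.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StripTagPairs {
    /** Hypothetical stand-in for one parsed item; NOT an HTML Parser class. */
    static class Token {
        final String raw;       // text exactly as it appeared in the page
        final String tagName;   // upper-case name without a leading slash; null for plain text
        final boolean endTag;   // true for </FONT>, false for <FONT ...>

        Token(String raw, String tagName, boolean endTag) {
            this.raw = raw;
            this.tagName = tagName;
            this.endTag = endTag;
        }

        static Token text(String s) { return new Token(s, null, false); }
        static Token open(String raw, String name) { return new Token(raw, name, false); }
        static Token close(String name) { return new Token("</" + name + ">", name, true); }
    }

    /** Drops start AND end tags whose name is listed; plain text is always kept. */
    static List<Token> strip(List<Token> tokens, Set<String> names) {
        List<Token> kept = new ArrayList<>();
        for (Token t : tokens) {
            // endTag is deliberately not consulted, so both halves of a pair match
            boolean unwanted = t.tagName != null && names.contains(t.tagName);
            if (!unwanted) {
                kept.add(t);
            }
        }
        return kept;
    }

    static String toHtml(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) {
            sb.append(t.raw);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Token> page = Arrays.asList(
            Token.open("<P>", "P"),
            Token.text("Hello "),
            Token.open("<FONT size=2>", "FONT"),
            Token.text("world"),
            Token.close("FONT"),
            Token.close("P"));
        Set<String> names = new HashSet<>(Arrays.asList("FONT", "IFRAME"));
        System.out.println(toHtml(strip(page, names)));   // prints <P>Hello world</P>
    }
}
```

With the real library, the comparable step would be a custom filter that compares the tag name for every tag node, start or end, rather than relying on TagNameFilter alone.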