
Filtering HTML Email Removing Tags & Attribs

Help
2006-04-28
2013-04-27
  • Jason Sheedy

    Jason Sheedy - 2006-04-28

    Hi Guys,
    I've just started to use HTML Parser and I'm trying to do something similar to what's discussed in this thread:

    http://sourceforge.net/forum/forum.php?thread_id=1452561&forum_id=77089

    I have a couple of questions if anyone has time to give me a few tips.

    I want to filter out specific tags and attributes from incoming HTML email to tidy it up for a webmail application.

    1. Using the following code works well for tag classes in most situations, but not for tags that don't have a parent.

    Parser parser = new Parser("C:\\webroot\\tmp\\htmltests\\tb.html");
    NodeList root = parser.parse(null);

    // NodeClassFilter's
    ArrayList FilterArray = new ArrayList();
    FilterArray.add(new NodeClassFilter(HeadTag.class));
    FilterArray.add(new NodeClassFilter(ScriptTag.class));
    FilterArray.add(new NodeClassFilter(StyleTag.class));
    FilterArray.add(new NodeClassFilter(FrameTag.class));
    FilterArray.add(new NodeClassFilter(FrameSetTag.class));
    FilterArray.add(new NodeClassFilter(FormTag.class));
    FilterArray.add(new NodeClassFilter(BaseHrefTag.class));
    FilterArray.add(new NodeClassFilter(ObjectTag.class));
    FilterArray.add(new NodeClassFilter(AppletTag.class));
    FilterArray.add(new NodeClassFilter(MetaTag.class));
    FilterArray.add(new NodeClassFilter(ImageTag.class));
    FilterArray.add(new NodeClassFilter(DoctypeTag.class));
    FilterArray.add(new NodeClassFilter(ProcessingInstructionTag.class));

    for (int j=0; j < FilterArray.size(); j++) {
        System.out.println("---- REMOVING NODE CLASS TAGS ----" + ((NodeClassFilter) FilterArray.get(j)).getMatchClass());
        NodeList nl = root.extractAllNodesThatMatch((NodeClassFilter) FilterArray.get(j), true);
        System.out.println("found " + nl.size() + " tags");
        
        for(int i=0; i < nl.size(); i++) {
            TagNode node = (TagNode) nl.elementAt(i);
            System.out.println(node);
            if(node.getParent().getChildren().remove(node)) {
                System.out.println("removed node");
            }
        }
    }

    This throws an exception on DoctypeTag.class. How can I remove it if it doesn't have a parent?

    2. I'm also using the TagNameFilter to remove other tags like IFRAME and FONT, but it doesn't remove the end tag. What's the best way to do that and leave the text within the tag intact?

    3. The HTML Parser docs say that it can be used to tidy up HTML. Does this mean it has similar functionality to JTidy, or should I run JTidy over the HTML before I parse out the tags with HTML Parser?

    Any hints would be appreciated.

     
    • Derrick Oswald

      Derrick Oswald - 2006-04-28

      1. The root list you have contains the doctype tag, so you'll need to remove it from root.

      2. Hmmm, it should, because the tag name of </FONT> is FONT. Maybe try searching for and removing "/FONT" tag names.

      3. I don't think anyone can answer that for you, since it depends on what you want to fix. Automatic correction code in HTML Parser is rather limited, so you should probably use JTidy.
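
      As a rough illustration of point 1: a node found by a recursive search may be a top-level node with no parent, in which case it has to be removed from the root list rather than from a parent's children. The sketch below uses a hypothetical SketchNode class and plain java.util lists as stand-ins, not the HTML Parser classes, so the names SketchNode, TopLevelRemoval, remove, collect and removeAll are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a parsed node; this is NOT the HTML Parser API.
class SketchNode {
    final String name;
    SketchNode parent;                                  // null for top-level nodes such as a doctype
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String name) {
        this.name = name;
    }

    // Attach a child and record this node as its parent.
    SketchNode add(SketchNode child) {
        child.parent = this;
        children.add(child);
        return this;
    }
}

class TopLevelRemoval {
    // Remove a node from its parent's children, or from the root list when it has no parent.
    static boolean remove(List<SketchNode> root, SketchNode node) {
        if (node.parent == null) {
            return root.remove(node);
        }
        boolean removed = node.parent.children.remove(node);
        if (removed) {
            node.parent = null;
        }
        return removed;
    }

    // Collect every node with the given name, searching recursively.
    static void collect(List<SketchNode> nodes, String name, List<SketchNode> out) {
        for (SketchNode n : nodes) {
            if (n.name.equalsIgnoreCase(name)) {
                out.add(n);
            }
            collect(n.children, name, out);
        }
    }

    // Remove all nodes with the given name and return how many were removed.
    static int removeAll(List<SketchNode> root, String name) {
        List<SketchNode> matches = new ArrayList<>();
        collect(root, name, matches);   // collect first so no list is modified while being iterated
        int count = 0;
        for (SketchNode n : matches) {
            if (remove(root, n)) {
                count++;
            }
        }
        return count;
    }
}
```

      With the real library, the equivalent would presumably be a null check on node.getParent() before falling back to root.remove(node), but that has not been tested here.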

       
    • Jason Sheedy

      Jason Sheedy - 2006-05-02

      Thanks Derrick, it's definitely not removing the end tags using a TagNameFilter. I tried searching for /tagname and it doesn't find anything.
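
      For question 2, one way to picture the goal (dropping both the start tag and the end tag while keeping the enclosed text) is as a filter over a flat sequence of tokens. This is only a sketch on a hypothetical Token model. It does not use TagNameFilter or any other HTML Parser class, it ignores attributes, and every name in it is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TagStripSketch {
    enum Kind { START_TAG, END_TAG, TEXT }

    // Hypothetical token: a tag name for tags, raw text for TEXT. Attributes are not modeled.
    static final class Token {
        final Kind kind;
        final String value;

        Token(Kind kind, String value) {
            this.kind = kind;
            this.value = value;
        }
    }

    // Drop start and end tags whose name is in 'names' (upper case); keep everything else.
    static List<Token> strip(List<Token> tokens, Set<String> names) {
        List<Token> kept = new ArrayList<>();
        for (Token t : tokens) {
            boolean isTag = t.kind != Kind.TEXT;
            if (isTag && names.contains(t.value.toUpperCase(Locale.ROOT))) {
                continue;   // skip the tag itself, but not the text between the pair
            }
            kept.add(t);
        }
        return kept;
    }

    // Turn the tokens back into markup.
    static String render(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) {
            switch (t.kind) {
                case START_TAG:
                    sb.append('<').append(t.value).append('>');
                    break;
                case END_TAG:
                    sb.append("</").append(t.value).append('>');
                    break;
                default:
                    sb.append(t.value);
            }
        }
        return sb.toString();
    }
}
```

      Stripping FONT from the token sequence for <P><FONT>Hello</FONT></P> renders as <P>Hello</P>. Because the start and end tags are matched by name independently, the text between them is left intact.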

       
