HTML Parser / Discussion / Help: Simple HTML Transformation.

Clive Haworth - 2006-03-01

Hi. I'm stuck. Here is an HTML snippet from a multi-part MIME email:

<html>
<body>
<p>
Hello
<footer><p>Footer Text</footer>
<p>
Again
</body>
</html>

I want to remove the 'footer' tag (and its text node), which could occur anywhere in the doc. So I try something like:

addNewFooter(Part part) {

    Page page = new Page(part.getInputStream(), encoding);
    Lexer lexer = new Lexer(page);
    Parser parser = new Parser(lexer);
    try {
        log.info("Stripping HTML footer");
        NodeList complete = parser.parse(null);
        NodeList stripped = complete.extractAllNodesThatMatch(new NotFilter(new TagNameFilter("footer")));

//        part.setContent(complete.toHtml(), "text/html; charset=" + encoding);
        log.info(complete.toHtml());
    } ....

This doesn't work, and it isn't clear to me how I should do it.

Next I want to add a new 'footer' tag just before the body close:

<html>
<body>
<p>
Hello
<p>
Again
<footer><p>Footer Text</footer>
</body>
</html>

I have no idea how to do this. I can see that you add nodes to node lists, but how exactly in this case?

Do I create a node like this ..

            TagNode footerTag = new TagNode();
            footerTag.setTagName("footer");
            TextNode footerText = new TextNode("Footer Text");
            footerText.setParent(footerTag);

... and how do I stick it in the correct place in the tree (node list)?

Thanks for any help you can offer.

Clive

- Derrick Oswald - 2006-03-02
  
  removing:
  
  The NotFilter will get you every node except the footer nodes, but only as a flat (linear) list rather than as a tree.
  
  I would filter for the footer nodes and remove them from their parent:
  
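  NodeList footers = complete.extractAllNodesThatMatch (new TagNameFilter ("footer"), true); // true = search recursively
  for (int i = 0; i < footers.size (); i++)
  {
      Node footer = footers.elementAt (i);
      footer.getParent ().getChildren ().remove (footer); // detach from the parent's child list
  }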
  
  adding:
  
  The footerText needs to be added to the footerTag's children list:
  footerTag.getChildren ().add (footerText);
  
  Adding the footer just before the end of the <html> tag works the same way: a simple add() puts it at the end of the children list:
  
  HtmlTag html;
  
  ... get the html tag somehow
  html.getChildren ().add (my_new_footer);
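
  The "adding" steps above can be sketched with a tiny self-contained tree. This is only an illustration of the idea: the Node class, its tag()/text() factories and the AppendFooterSketch wrapper are hypothetical stand-ins, not HTML Parser classes. The point it shows is that the text node goes into the footer's child list, and the footer goes at the end of its new parent's child list, so it is printed just before that parent's closing tag.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mini tree for illustration only; NOT the HTML Parser API.
class AppendFooterSketch {

    static class Node {
        final String tagName;   // null for a text node
        final String text;      // null for a tag node
        Node parent;
        final List<Node> children = new ArrayList<>();

        private Node(String tagName, String text) {
            this.tagName = tagName;
            this.text = text;
        }

        static Node tag(String name) { return new Node(name, null); }
        static Node text(String content) { return new Node(null, content); }

        // Appends the child at the end of this node's child list and links it back.
        Node add(Node child) {
            child.parent = this;
            children.add(child);
            return this;
        }

        String toHtml() {
            if (tagName == null) return text;
            StringBuilder sb = new StringBuilder("<" + tagName + ">");
            for (Node c : children) sb.append(c.toHtml());
            return sb.append("</").append(tagName).append(">").toString();
        }
    }

    // Body with the two paragraphs from the question.
    static Node sampleBody() {
        return Node.tag("body")
                .add(Node.tag("p").add(Node.text("Hello")))
                .add(Node.tag("p").add(Node.text("Again")));
    }

    // The footer tag owns its text node as a child.
    static Node newFooter(String content) {
        return Node.tag("footer").add(Node.text(content));
    }

    public static void main(String[] args) {
        Node body = sampleBody();
        body.add(newFooter("Footer Text"));   // lands last, just before </body>
        System.out.println(body.toHtml());
        // prints <body><p>Hello</p><p>Again</p><footer>Footer Text</footer></body>
    }
}
```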
  
- Clive Haworth - 2006-03-02
  
  Ah. Got it. I thought extractAllNodesThatMatch() extracted a list unrelated to the main node list. It is basically just selecting which nodes to act on - maybe selectAllNodesThatMatch() would have been clearer, but what's in a name? Thanks for the help.
  
- Clive Haworth - 2006-03-02
  
  Nope. Not what I thought. Consider the following code:
  
          try {
              Parser parser = new Parser("file:///clive.html");
              NodeList root = parser.parse(null);
              NodeList divs = root.extractAllNodesThatMatch(new NodeClassFilter(Div.class), true);
  
              System.out.println("found " + divs.size() + " div tags");
  
              for(int i = 0; i < divs.size(); i++) {
                  TagNode div = (TagNode) divs.elementAt(i);
                  String id = div.getAttribute("id");
                  if(id != null && id.equals("__footer__")) {
                      System.out.println("found footer: " + div);
                      if(divs.remove(div)) {
                          System.out.println("removed node");
                      }
                  }
              }
              System.out.println(root.toHtml());
          } catch(ParserException e) {
              e.printStackTrace();
          }
  
  This removes the div node from the divs list, not from the root list; the two are evidently different lists.
  All I want to do is print out the HTML (all of it) without the footer div.
  
  If I try:
  
      root.remove(div)
  
  in place of:
  
      divs.remove(div)
  
  It doesn't find it (returns false) ....
  
  How exactly do I do this?
  
  Regards
  Clive
  
- Derrick Oswald - 2006-03-02
  
  You need:
  
  div.getParent ().getChildren ().remove (div);
  
  You need to go to the next node up the tree and remove the div node from
  its list of children.
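
  To see why removing from the search result is not enough, here is a minimal self-contained sketch. It assumes a hypothetical Node class with a collect() search method (neither is part of HTML Parser); the search returns a separate flat list, so only a removal through the parent's child list changes what the tree prints.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mini tree for illustration only; NOT the HTML Parser API.
class RemoveFromParentSketch {

    static class Node {
        final String name;
        Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String name) { this.name = name; }

        Node add(Node child) {
            child.parent = this;
            children.add(child);
            return this;
        }

        // Depth-first search: matching descendants are copied into a separate flat list.
        void collect(String wanted, List<Node> out) {
            for (Node child : children) {
                if (child.name.equals(wanted)) out.add(child);
                child.collect(wanted, out);
            }
        }

        String toHtml() {
            StringBuilder sb = new StringBuilder("<" + name + ">");
            for (Node child : children) sb.append(child.toHtml());
            return sb.append("</").append(name).append(">").toString();
        }
    }

    static Node sampleTree() {
        Node body = new Node("body").add(new Node("p")).add(new Node("div"));
        return new Node("html").add(body);
    }

    public static void main(String[] args) {
        Node html = sampleTree();
        List<Node> divs = new ArrayList<>();
        html.collect("div", divs);
        Node div = divs.get(0);

        divs.remove(div);                  // only shrinks the result list
        System.out.println(html.toHtml()); // prints <html><body><p></p><div></div></body></html>

        div.parent.children.remove(div);   // detaches the node from the tree itself
        System.out.println(html.toHtml()); // prints <html><body><p></p></body></html>
    }
}
```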
  
- Clive Haworth - 2006-03-03
  
  Great. Now I get it! Thanks.
  