HTML Parser / Discussion / Help: Filtering a HTML Document

Saleh matani - 2003-12-04

I have a lot of HTML documents and need to filter JavaScript links (and maybe more) out of them.
Does anybody have an example of how to do this?

Q2: The parser is normally used to parse a URL or a local HTML document. I would like to know whether I can give the parser an HTML string and define a filter, so that I get back HTML code that does not contain the tags I have filtered out.

Thanks :)

- Saleh matani - 2003-12-05
  
  How can I filter this HTML code:
  <html>
  <head>
  <title>test</title>
  </head>
  <body>
  <p><a href="http://www.google.de">http://www.google.de</a></p>
  <p>text</p>
  <p>text</p>
  <table border="1" width="100%">
  <tr>
      <td width="50%">table</td>
      <td width="50%">table</td>
  </tr>
  </table>
  <form method="POST" action="--WEBBOT-SELF--">
  <p><input type="button" value="Button" name="B3"></p>
  </form>
  <p> the text</p>
  </body>
  </html>
  
  into this HTML code:
  
  <p>text</p>
  <p>text</p>
  <table border="1" width="100%">
  <tr>
      <td width="50%">table</td>
      <td width="50%">table</td>
  </tr>
  </table>
  <p> </p>
  <p> the text</p>
  
  -----------------------------------------
  
  That means: parse the HTML page, remove the html tag, the title tag, the form tags and the link tags, and return as the result what is between <body> and </body>.
  
  - Derrick Oswald - 2003-12-06
    
    You might try the NodeVisitor pattern. Create a class that implements NodeVisitor and overrides some of the methods:
    
    class MyVisitor implements NodeVisitor
    {
        boolean inbody = false;
    
        public void visitTag (Tag tag)
        {
            if (inbody)
                if (!tag.getTagName().equals("A"))
                    System.out.println (tag.toHtml ());
            if (tag.getTagName().equals("BODY"))
                inbody = true;
        }
    
        public void visitEndTag (Tag tag)
        {
            if (tag.getTagName().equals("BODY"))
                inbody = false;
            if (inbody)
                System.out.println (tag.toHtml ());
        }
    }
    
    Then run through all the nodes with:
        parser.visitAllNodesWith(new MyVisitor ());
    
    - Derrick Oswald - 2003-12-06
      
      Oops, the class should say "extends NodeVisitor", not "implements NodeVisitor".
      
- Saleh matani - 2003-12-07
  
  Thank you for the help. That works, but it does not return the text together with the tags I need. I am getting back just the tags!
  
  I am getting this:
  
  <p></p>
  <p></p>
  <table border="1" width="100%">
  <tr>
  <td width="50%"></td>
  <td width="50%"></td>
  </tr>
  </table>
  <p></p>
  <p> </p>
  
  It should be this:
  
  <p>text</p>
  <p>text</p>
  <table border="1" width="100%">
  <tr>
  <td width="50%">table</td>
  <td width="50%">table</td>
  </tr>
  </table>
  <p> </p>
  <p> the text</p>
  
- Derrick Oswald - 2003-12-12
  
  Add this method:
  
      public void visitStringNode (StringNode stringNode)
      {
          System.out.println (stringNode.toHtml ());
      }
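
For anyone who wants to see the whole idea in one place, here is a minimal, self-contained sketch of the flag-based filtering discussed in this thread. It deliberately does not use the HTML Parser library: `SketchVisitor`, `BodyFilterSketch`, and the regex tokenizer are hypothetical stand-ins written only for illustration, and the skip list (A and FORM, dropped together with their contents) is an assumption based on the request above. The naive tokenizer does not handle comments, scripts, or malformed markup, so treat it as a demonstration of the three callbacks rather than a replacement for a real parser.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the three visitor callbacks from the thread.
// This is NOT the HTML Parser library's API; it only mirrors the idea.
interface SketchVisitor {
    void visitTag(String name, String rawTag);
    void visitEndTag(String name, String rawTag);
    void visitText(String text);
}

class BodyFilterSketch implements SketchVisitor {
    // Elements dropped together with their contents (assumed list).
    private static final Set<String> SKIP = new HashSet<>(Arrays.asList("A", "FORM"));
    // Naive tokenizer: group 1 = "/" for end tags, group 2 = tag name, group 3 = text run.
    private static final Pattern TOKEN =
        Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>|([^<]+)");

    private final StringBuilder out = new StringBuilder();
    private boolean inBody = false; // true between <body> and </body>
    private int skipDepth = 0;      // > 0 while inside a skipped element

    public void visitTag(String name, String rawTag) {
        if (name.equals("BODY")) { inBody = true; return; }
        if (SKIP.contains(name)) { skipDepth++; return; }
        if (inBody && skipDepth == 0) out.append(rawTag);
    }

    public void visitEndTag(String name, String rawTag) {
        if (name.equals("BODY")) { inBody = false; return; }
        if (SKIP.contains(name)) { if (skipDepth > 0) skipDepth--; return; }
        if (inBody && skipDepth == 0) out.append(rawTag);
    }

    // Without this callback only the tags are emitted and the text is lost.
    public void visitText(String text) {
        if (inBody && skipDepth == 0) out.append(text);
    }

    // Tokenizes an HTML string and feeds each token to the visitor.
    static String filter(String html) {
        BodyFilterSketch v = new BodyFilterSketch();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(3) != null) {
                v.visitText(m.group(3));
            } else {
                String name = m.group(2).toUpperCase(Locale.ROOT);
                if (m.group(1).isEmpty()) v.visitTag(name, m.group());
                else v.visitEndTag(name, m.group());
            }
        }
        return v.out.toString();
    }

    public static void main(String[] args) {
        String html = "<html><body><p><a href=\"x\">link</a></p>"
                    + "<p>text</p></body></html>";
        System.out.println(filter(html));
    }
}
```

Because `filter` takes a plain string and returns a string, the sketch also illustrates the shape asked for in Q2: HTML text in, filtered HTML text out. Collecting the output in a `StringBuilder` instead of printing it makes the result easy to reuse.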
  