
What about unknown tags?

Forum: Help
Creator: Norb
Created: 2006-02-08
Updated: 2013-04-27
  • Norb

    Norb - 2006-02-08

    I would like to remove HTML tags from a text, but I need to keep any combination of comparison operators (e.g. “1 < 2 < 3” or “Monday < Tuesday and Friday > Wednesday”). In other words, I need to keep unknown HTML tags.
    So how can I differentiate nodes that look like tags but are not HTML tags?

     
    • sidhu

      sidhu - 2006-02-08

      This will be taken care of by the parser itself.

       
    • Norb

      Norb - 2006-02-08

      I am not satisfied with the parser's job. Let's have a look at this example:

      import org.htmlparser.Node;
      import org.htmlparser.lexer.Lexer;
      import org.htmlparser.nodes.TextNode;

      public class parseTags {

          public static void main(String[] args) throws Exception {
              String myHtml = "<span>blabla</span><BR>and other blabla<H1>Big</H1><font face=\"arial\" color=\"RED\" size=\"2\"><b>font and so on</b></font> plus <mytag>and now, 2<3, but <hidden text> and also <Why can I see this ?> more <<<<<<<(7) and >>>>>>>>>(9)";

              // Collect only the text nodes; anything the lexer treats as a tag is skipped.
              String textDescription = "";
              Lexer lex = new Lexer(myHtml);
              Node nono = lex.nextNode();
              while (nono != null) {
                  if (nono instanceof TextNode) {
                      textDescription += nono.getText();
                  }
                  nono = lex.nextNode();
              }
              System.out.println(textDescription);
          }
      }

      This returns:
      "blablaand other blablaBigfont and so on plus and now, 2<3, but  and also  more <<<<<<<(7) and >>>>>>>>>(9)"

      So, "<mytag>", "<hidden text>" and "<Why can I see this ?>" are missing.

      Where am I wrong?

       
    • sidhu

      sidhu - 2006-02-13

      Dear Norb,
      My solutions may not be efficient, but here are two that can help you.
      1) Register all the tags you want handled. The Javadoc explains how to create your own custom tags, e.g.:
      import org.htmlparser.tags.CompositeTag;

      public class MyFontTag extends CompositeTag
      {
          // Not used in this example.
          public static StringBuffer mBuffer = new StringBuffer();

          /**
           * The set of names handled by this tag.
           */
          private static final String[] mIds = new String[] {"FONT", "H1", "SPAN", "BR", "B"};

          /**
           * The set of tag names that cause this tag to finish.
           */
          private static final String[] mEndTagEnders = new String[] {"BODY", "HTML", "TABLE", "TD", "TR", "FONT"};

          /**
           * Create a new tag.
           */
          public MyFontTag()
          {
              setThisScanner(mDefaultCompositeScanner);
          }

          public String[] getEndTagEnders()
          {
              return (mEndTagEnders);
          }

          public String[] getEnders()
          {
              return (mEndTagEnders);
          }

          /**
           * Return the set of names handled by this tag.
           * @return The names to be matched that create tags of this type.
           */
          public String[] getIds()
          {
              return (mIds);
          }
      }
      Now, in your program:
      import org.htmlparser.Node;
      import org.htmlparser.Parser;
      import org.htmlparser.PrototypicalNodeFactory;
      import org.htmlparser.lexer.Lexer;
      import org.htmlparser.tags.CompositeTag;
      import org.htmlparser.util.NodeIterator;

      public class parseTags {

          public static void main(String[] args) throws Exception {
              String myHtml = "<span>blabla</span><BR>and other blabla<H1>Big</H1><font face=\"arial\" color=\"RED\" size=\"2\"><b>font and so on</b></font> plus <mytag>and now, 2<3, but <hidden text> and also <Why can I see this ?> more <<<<<<<(7) and >>>>>>>>>(9)";

              Lexer lex = new Lexer(myHtml);
              Parser parser = new Parser(lex);

              // Register the custom tag so the names it lists are parsed as MyFontTag.
              PrototypicalNodeFactory factory = new PrototypicalNodeFactory();
              factory.registerTag(new MyFontTag());
              parser.setNodeFactory(factory);

              for (NodeIterator e = parser.elements(); e.hasMoreNodes();) {
                  Node node = e.nextNode();
                  if (node instanceof CompositeTag) {
                      // Composite tags: print only their text content.
                      System.out.println(node.toPlainTextString());
                  } else {
                      // Everything else is printed as it appeared in the input.
                      System.out.println(node.toHtml());
                  }
              }
          }
      }
      2) You can create a table of HTML tags and check each tag name against it, since the problem is caused by the parser's relaxed handling of tags.
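
For illustration, option 2 could be sketched roughly as follows. This is only a minimal sketch and it does not use the HTML Parser library: it swaps in a plain regular expression to find tag-like spans and a hypothetical table (`KNOWN_TAGS`) of names to strip. The class name `KnownTagStripper`, the method `stripKnownTags`, and the contents of the table are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KnownTagStripper {

    // Hypothetical table of known HTML tag names (lower case); extend as needed.
    private static final Set<String> KNOWN_TAGS =
            new HashSet<String>(Arrays.asList("span", "br", "h1", "font", "b"));

    // Matches something shaped like an opening or closing tag: "<name ...>" or "</name ...>".
    // A '<' that is not directly followed by a letter (e.g. "2<3" or "1 < 2") never matches.
    private static final Pattern TAG_LIKE =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)\\b[^<>]*>");

    // Removes tag-like spans whose name is in KNOWN_TAGS and keeps all other text verbatim.
    static String stripKnownTags(String html) {
        Matcher m = TAG_LIKE.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1).toLowerCase(Locale.ROOT);
            // Known tag: replace with nothing. Unknown "tag": put the matched text back.
            String replacement = KNOWN_TAGS.contains(name) ? "" : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String myHtml = "<span>blabla</span><BR>and other blabla<H1>Big</H1> plus <mytag>and now, 2<3";
        System.out.println(stripKnownTags(myHtml));
    }
}
```

With the table above, Norb's example string should keep "<mytag>", "<hidden text>" and "<Why can I see this ?>" while dropping the span, BR, H1, font and b tags. One limitation of this sketch: text such as "a<b and c>d" still looks like a known tag (b) and would be removed, so the table only helps when comparison operators are surrounded by spaces or followed by non-letters.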

       
    • Norb

      Norb - 2006-02-16

      Ok, that's what I was afraid of. Thanks a lot.

       
