Whitespaces in parser output

Help
2006-03-09
2013-04-27
    SpecialAgentX - 2006-03-09

    Hey guys,

    I just want to traverse all the HTML elements in my HTML file and print only the names of the tags. Later on, I would like to use this as a DOM inspector. But when I traverse the tree with my function (below), the output (the NodeList) contains a lot of whitespace, and I don't know how to get rid of it.

    Code:

        public static void getText(Parser parser) throws ParserException {
           
            // get the whole document
            NodeList list = parser.parse (null);
           
            // start at the first top-level node
            Node node = list.elementAt(0);

            // traverse all top-level nodes
            for ( int i = 0; i < list.size(); i++ ) {
                node = list.elementAt(i);
                printChild(node);
            }
           
        }
       
        public static void printChild(Node node) {
            // check that the node's text is not just a single escape character
            if ( node.getText().compareTo("\n") != 0 &&
                 node.getText().compareTo("\t") != 0 &&
                 node.getText().compareTo("\b") != 0 &&
                 node.getText().compareTo("\f") != 0 &&
                 node.getText().compareTo("\r") != 0 &&
                 node.getText().compareTo(" ") != 0)
                // print out the content of the node
                System.out.println(node.getText());
           
            // check if the node has children
            if ( node.getChildren() != null ) {
                // if yes, traverse all children
               
                NodeList nodelist = node.getChildren();
                Node newNode;
                         
                for ( int i = 0; i < nodelist.size(); i++ ) {
                    newNode = nodelist.elementAt(i);
                    printChild(newNode);
                }
            }
        }

     
      SpecialAgentX - 2006-03-09

      BTW, here is the program output for the site www.heise.de
      (the output starts after the first double quote and ends before the last one):

      "

      html
      head

      Site Navigation Bar

      link rel="copyright" title="Heise Zeitschriften Verlag" href="/kontakt/impressum.shtml"
      link rel="start"  title="Start" href="/"
      link rel="author" title="Kontakt" href="mailto:kontakt%40heise.de?subject=heise%20online"
      link rel="home" title="home:heise online" href="/"

      "

      As you can see, there are still many blank lines, although I check for all the whitespace characters.

      (Each line is stored in its own Node!)
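
One possible cause, assuming the text nodes hold runs of several whitespace characters (for example "\n\n" or "\r\n\t") rather than a single one: compareTo("\n") only matches a string that is exactly one newline, so longer runs slip through the check and get printed as blank lines. Below is a minimal sketch of a check that handles runs of any length. WhitespaceFilter and its methods are hypothetical names made up for illustration, not part of the HTML Parser API, and the sample strings are only stand-ins for what node.getText() might return.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the HTML Parser library.
class WhitespaceFilter {

    // True if the text is empty or consists only of whitespace characters,
    // no matter how many of them there are.
    static boolean isWhitespaceOnly(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Keeps only the texts that contain at least one non-whitespace character.
    static List<String> dropWhitespaceOnly(List<String> texts) {
        List<String> kept = new ArrayList<>();
        for (String text : texts) {
            if (!isWhitespaceOnly(text)) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed sample values standing in for node.getText() results.
        List<String> texts = Arrays.asList(
                "html", "\n\n", "head", "\r\n\t  ", " Site Navigation Bar ");
        for (String text : dropWhitespaceOnly(texts)) {
            // trim() also removes whitespace around the remaining text
            System.out.println(text.trim());
        }
    }
}
```

In printChild, the six compareTo checks could then be replaced by a single call such as !WhitespaceFilter.isWhitespaceOnly(node.getText()). Note that Character.isWhitespace does not treat "\b" as whitespace, so that case would need its own check if it matters.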

       
