HTML Parser / Discussion / Help: Whitespaces in parser output

Whitespaces in parser output

Forum: Help

Creator: SpecialAgentX

Created: 2006-03-09

Updated: 2013-04-27

SpecialAgentX - 2006-03-09

Hey guys,

I just want to traverse all the HTML elements in my HTML file and print only the names of the HTML tags. Later on, I would like to use this as a DOM inspector. But when I traverse the tree with my function (below), there is a lot of whitespace in the output (in the NodeList), and I don't know how to remove it.

Code:

    public static void getText(Parser parser) throws ParserException {

        // get the whole document
        NodeList list = parser.parse (null);

        // set pointer to the root
        Node node = list.elementAt(0);

        // traverse all top-level nodes of the document
        for ( int i = 0; i < list.size(); i++ ) {
            node = list.elementAt(i);
            printChild(node);
        }

    }

    public static void printChild(Node node) {
        // skip nodes whose text is exactly one whitespace or escape character
        if ( node.getText().compareTo("\n") != 0 &&
             node.getText().compareTo("\t") != 0 &&
             node.getText().compareTo("\b") != 0 &&
             node.getText().compareTo("\f") != 0 &&
             node.getText().compareTo("\r") != 0 &&
             node.getText().compareTo(" ") != 0)
            // print out the content of the node
            System.out.println(node.getText());

        // check if the node has children
        if ( node.getChildren() != null ) {
            // if yes, traverse all children

            NodeList nodelist = node.getChildren();
            Node newNode;

            for ( int i = 0; i < nodelist.size(); i++ ) {
                newNode = nodelist.elementAt(i);
                printChild(newNode);
            }
        }
    }

- SpecialAgentX - 2006-03-09
  
  BTW: Here is the program output for the site www.heise.de
  (the output starts at the first double quote and ends at the last one):
  
  "
  
  html
  head
  
  Site Navigation Bar
  
  link rel="copyright" title="Heise Zeitschriften Verlag" href="/kontakt/impressum.shtml"
  link rel="start" title="Start" href="/"
  link rel="author" title="Kontakt" href="mailto:kontakt%40heise.de?subject=heise%20online"
  link rel="home" title="home:heise online" href="/"
  
  "
  
  As you can see, there are still many line breaks, even though I check for all the whitespace characters.
  
  (Each line is stored in its own Node!)
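
One possible explanation, offered as a guess rather than a confirmed diagnosis: the checks in printChild only match text that is exactly one character long, so a node holding "\r\n", "\n\n", or a newline followed by indentation still gets printed. The sketch below shows a whitespace-only test on plain Java strings. The class name WhitespaceFilter and the helpers isBlank and dropBlank are hypothetical names for illustration, not part of the HTML Parser library, and the sample strings only imitate the output above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class WhitespaceFilter {

    // True when the text is null, empty, or consists only of characters
    // that String.trim() strips (code points <= U+0020), e.g. "\r\n" or "\n\t ".
    static boolean isBlank(String text) {
        return text == null || text.trim().isEmpty();
    }

    // Returns the trimmed, non-blank entries in their original order.
    static List<String> dropBlank(List<String> texts) {
        List<String> kept = new ArrayList<>();
        for (String text : texts) {
            if (!isBlank(text)) {
                kept.add(text.trim());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical node texts standing in for what getText() might return.
        List<String> texts = Arrays.asList(
                "\n", "html", "\r\n", "head", "\n\t  ", " Site Navigation Bar ");
        for (String line : dropBlank(texts)) {
            System.out.println(line);
        }
    }
}
```

Applied to the function above, the six compareTo checks could probably be replaced by a single condition such as `!node.getText().trim().isEmpty()`. Since trim() strips every character up to and including the space character, it also covers the "\b" and "\f" cases from the original checks.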
  