Hey guys,
I just want to traverse all the HTML elements in my HTML file and print only the names of the HTML tags. Later on, I would like to use this as a DOM inspector. But when I traverse the tree with my function (below), there is a lot of whitespace in the output (in the NodeList), and I don't know how to get rid of it.
Code:
public static void getText(Parser parser) throws ParserException {
    // get the whole document
    NodeList list = parser.parse(null);
    // traverse all top-level nodes of the document
    for (int i = 0; i < list.size(); i++) {
        Node node = list.elementAt(i);
        printChild(node);
    }
}

public static void printChild(Node node) {
    // skip nodes whose text is exactly one whitespace or escape character
    if (node.getText().compareTo("\n") != 0 &&
        node.getText().compareTo("\t") != 0 &&
        node.getText().compareTo("\b") != 0 &&
        node.getText().compareTo("\f") != 0 &&
        node.getText().compareTo("\r") != 0 &&
        node.getText().compareTo(" ") != 0) {
        // print out the content of the node
        System.out.println(node.getText());
    }
    // check if the node has children
    if (node.getChildren() != null) {
        // if yes, traverse all children recursively
        NodeList nodelist = node.getChildren();
        for (int i = 0; i < nodelist.size(); i++) {
            Node newNode = nodelist.elementAt(i);
            printChild(newNode);
        }
    }
}
BTW: Here is the program output for the site www.heise.de
(the output starts at the first double quote and ends at the last one):
"
html
head
Site Navigation Bar
link rel="copyright" title="Heise Zeitschriften Verlag" href="/kontakt/impressum.shtml"
link rel="start" title="Start" href="/"
link rel="author" title="Kontakt" href="mailto:kontakt%40heise.de?subject=heise%20online"
link rel="home" title="home:heise online" href="/"
"
As you can see, there are so many line breaks, although I checked for all the whitespace characters.
(Each line is stored in a Node!)
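A possible explanation, offered as a sketch and not a confirmed diagnosis: each check above only matches a node whose text is exactly one character long, so a text node holding several whitespace characters in a row (for example "\n\n", or a newline followed by indentation) passes every comparison and still gets printed. Trimming the text first and testing whether anything is left would cover those cases. The example below works on plain strings instead of the library's Node type, and the class and method names (NodeTextFilter, isWhitespaceOnly, firstToken) are hypothetical, made up for illustration only.

```java
class NodeTextFilter {

    // True when the text is empty or contains nothing but whitespace and
    // control characters (code points up to U+0020), however many there are.
    static boolean isWhitespaceOnly(String text) {
        return text.trim().isEmpty();
    }

    // Returns the first whitespace-delimited token of the text,
    // e.g. "link" for: link rel="start" title="Start" href="/"
    static String firstToken(String text) {
        String trimmed = text.trim();
        for (int i = 0; i < trimmed.length(); i++) {
            if (Character.isWhitespace(trimmed.charAt(i))) {
                return trimmed.substring(0, i);
            }
        }
        return trimmed;
    }

    public static void main(String[] args) {
        // Assumed sample values, modelled on the output shown in the post.
        String[] texts = {
            "html",
            "\n\n",
            "head",
            "\n\t",
            "link rel=\"start\" title=\"Start\" href=\"/\""
        };
        for (String text : texts) {
            // A two-character string never equals a one-character string,
            // which is why the original single-character checks let it through.
            if (!isWhitespaceOnly(text)) {
                System.out.println(firstToken(text));
            }
        }
    }
}
```

In printChild, the same idea would mean replacing the six compareTo checks with a single test on node.getText().trim(). Note that firstToken cannot tell a tag from ordinary text or a comment (it would return "Site" for "Site Navigation Bar"), so for printing tag names only, checking the node's type through the library would be more robust, if the version in use offers a way to do that.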