Thread: [Htmlparser-user] Getting all tags??
From: Thomas Z. <li...@th...> - 2006-04-24 18:04:04
Dear list,

I'm very new to htmlparser and am having some trouble with the documentation. All I need is a small program that extracts *all* HTML tags from an HTML document.

I looked at the docs and found this example:

Typical usage of the parser is:

    Parser parser = new Parser ("http://whatever");
    NodeList list = parser.parse ();
    // do something with your list of nodes.

But when I try NodeList list = parser.parse(), it tells me that it needs a "NodeFilter filter" as an argument. I don't need any filter; I want all the tags in the document. How can I do this?

Thank you very much for your effort!

Best regards,
Tom
From: Derrick O. <Der...@Ro...> - 2006-04-24 22:13:40
Sorry about that. I fixed the documentation. Just supply a null:

    NodeList list = parser.parse (null);

Note that the tags will be nested, so the list is only as long as the number of outermost enclosing tags, usually just one, i.e. <HTML>.

If you want the nodes in simple sequential order without nesting, use the lexer:

    Parser parser = new Parser ("http://whatever");
    Lexer lexer = parser.getLexer ();
    Node node;
    while (null != (node = lexer.nextNode ()))
        ... do something with the node
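To illustrate the point about nesting, here is a minimal sketch that uses only the Java standard library. `SketchTag`, `allTagNames` and `sampleDocument` are hypothetical names made up for this illustration and are not part of the HTML Parser API. The top-level list holds a single enclosing tag, and every tag is reached only by walking the children recursively.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: a hand-built tree, NOT the HTML Parser library's node classes.
final class NestedTagsSketch {

    // Hypothetical stand-in for a parsed tag with nested child tags.
    static final class SketchTag {
        final String name;
        final List<SketchTag> children;

        SketchTag(String name, SketchTag... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    // Collects every tag name depth-first, i.e. in document order.
    static List<String> allTagNames(List<SketchTag> roots) {
        List<String> names = new ArrayList<>();
        for (SketchTag root : roots) {
            collect(root, names);
        }
        return names;
    }

    private static void collect(SketchTag tag, List<String> out) {
        out.add(tag.name);
        for (SketchTag child : tag.children) {
            collect(child, out);
        }
    }

    // Sample tree: <HTML> encloses <HEAD><TITLE> and <BODY><P>.
    static List<SketchTag> sampleDocument() {
        SketchTag html = new SketchTag("HTML",
                new SketchTag("HEAD", new SketchTag("TITLE")),
                new SketchTag("BODY", new SketchTag("P")));
        return Arrays.asList(html);
    }

    public static void main(String[] args) {
        List<SketchTag> roots = sampleDocument();
        System.out.println(roots.size());        // prints 1 (only the enclosing tag)
        System.out.println(allTagNames(roots));  // prints [HTML, HEAD, TITLE, BODY, P]
    }
}
```

With the real library, the same idea would mean descending into each node's children rather than only iterating over the top-level list. The lexer approach avoids the recursion altogether.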
From: Thomas Z. <li...@th...> - 2006-04-25 19:04:58
Dear Derrick,

thank you for your help ;-)

Maybe I can ask another question. I have this code:

    Parser parser = new Parser("/gb/testfiles/abraham/fabeln/antwort.htm");
    Lexer lexer = parser.getLexer();
    Node node;
    String s;
    while(null != lexer.nextNode()){
        node = lexer.nextNode();
        s = node.toPlainTextString();
        System.out.println(s);
    }

It works, but it prints the content of the tags, not the names of the tags. I just need to know which tags are used in the document.

Thank you very much!

Greetings,
Tom
From: Derrick O. <Der...@Ro...> - 2006-04-26 01:07:23
You will need to cast it to a tag if possible and use getTagName ():

    if (node instanceof Tag)
        System.out.println (((Tag)node).getTagName ());
From: Thomas Z. <li...@th...> - 2006-04-26 18:20:11
Step by step, I'll get it ... ;-)

Now this code produces no output. Am I still doing something wrong?

    Parser parser = new Parser("/gb/testfiles/abraham/fabeln/antwort.htm");
    Lexer lexer = parser.getLexer();
    Node node;
    String s;
    while(null != lexer.nextNode()){
        node = lexer.nextNode();
        if(node instanceof Tag){
            System.out.println(((Tag)node).getTagName());
        } // if
    }

Greetings, and thank you again. I hope that someday I'll manage it on my own ...

Best regards,
Tom
From: Derrick O. <Der...@Ro...> - 2006-04-29 10:57:23
It's not clear why you aren't getting any output. The same loop is in the Lexer mainline:

    manager = Page.getConnectionManager ();
    lexer = new Lexer (manager.openConnection (args[0]));
    while (null != (node = lexer.nextNode (false)))
        System.out.println (node.toString ());

The guard on the if statement should be satisfied for anything that looks like a tag, i.e. <XXX>.
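One possible explanation, which is not confirmed anywhere in this thread: the posted loop calls nextNode() once in the while condition and again in the body, so every other node is thrown away. If tags and text happen to alternate in the document, only the text nodes ever reach the instanceof check. The sketch below reproduces that effect with the Java standard library only. `TokenSource`, `looksLikeTag` and the sample tokens are hypothetical stand-ins and not the HTML Parser API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustration only: simulates a lexer with a list of strings.
final class DoubleAdvanceSketch {

    // Hypothetical stand-in for a lexer: returns the next token, or null at the end.
    static final class TokenSource {
        private final Iterator<String> tokens;

        TokenSource(List<String> tokens) {
            this.tokens = tokens.iterator();
        }

        String nextNode() {
            return tokens.hasNext() ? tokens.next() : null;
        }
    }

    // Crude stand-in for "node instanceof Tag".
    static boolean looksLikeTag(String token) {
        return token.startsWith("<");
    }

    // Mirrors the shape of the posted loop: two nextNode() calls per iteration.
    static List<String> tagsWithDoubleAdvance(List<String> tokens) {
        TokenSource lexer = new TokenSource(tokens);
        List<String> tags = new ArrayList<>();
        while (null != lexer.nextNode()) {   // this token is consumed and discarded
            String node = lexer.nextNode();  // may be null if the input ends here
            if (node != null && looksLikeTag(node)) {
                tags.add(node);
            }
        }
        return tags;
    }

    // Assign-and-test form: exactly one nextNode() call per iteration.
    static List<String> tagsWithSingleAdvance(List<String> tokens) {
        TokenSource lexer = new TokenSource(tokens);
        List<String> tags = new ArrayList<>();
        String node;
        while (null != (node = lexer.nextNode())) {
            if (looksLikeTag(node)) {
                tags.add(node);
            }
        }
        return tags;
    }

    // Sample input in which tags and text strictly alternate.
    static List<String> sampleTokens() {
        return Arrays.asList("<P>", "first", "<B>", "bold", "</B>", " rest");
    }

    public static void main(String[] args) {
        System.out.println(tagsWithDoubleAdvance(sampleTokens()));  // prints []
        System.out.println(tagsWithSingleAdvance(sampleTokens()));  // prints [<P>, <B>, </B>]
    }
}
```

If this is indeed the cause, writing the loop in the assign-and-test form shown earlier in the thread, while (null != (node = lexer.nextNode ())), and dropping the second call in the body should make the tag names appear. It would also fit the earlier observation that the loop printed only text content.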