htmlparser-user Mailing List for HTML Parser (Page 38)
Brought to you by:
derrickoswald
From: Riaan S. <ud...@un...> - 2006-05-04 12:37:22
|
Hi,

I was reading through the email archives and saw the posts with this subject. I'm facing the same problem, and for me too, setting these properties:

    System.setProperty ("sun.net.client.defaultReadTimeout", "7000");
    System.setProperty ("sun.net.client.defaultConnectTimeout", "7000");

does not have any effect. Has anyone found a solution to this problem, or faced anything similar?

Regards,
Riaan
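One thing that may be worth trying, as a sketch rather than a confirmed fix for this thread: since Java 5, `java.net.URLConnection` lets you set the connect and read timeouts on the individual connection instead of relying on the `sun.net.client.*` system properties. The class name `TimeoutConnection` and the helper `open` below are made up for illustration. Whether your HTML Parser version accepts a pre-configured connection (for example through a `Parser` constructor taking a `URLConnection`) is an assumption to check against its Javadoc.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper, not part of HTML Parser.
final class TimeoutConnection {

    /**
     * Creates a URLConnection with explicit timeouts in milliseconds.
     * openConnection() only creates the object; no network traffic
     * happens until the connection is actually used.
     */
    static URLConnection open(String url, int connectMs, int readMs) {
        try {
            URLConnection connection = URI.create(url).toURL().openConnection();
            connection.setConnectTimeout(connectMs); // limit for establishing the connection
            connection.setReadTimeout(readMs);       // limit for each blocking read
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection c = open("http://example.com/", 7000, 7000);
        System.out.println(c.getConnectTimeout() + " / " + c.getReadTimeout());
    }
}
```

The connection returned here could then be handed to whatever reads the page, so the timeouts travel with the connection rather than with JVM-wide settings.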
From: Derrick O. <Der...@Ro...> - 2006-05-03 10:55:37
|
You have bigger problems than that. The security model of the sandbox will prohibit the parser from looking at any pages that are outside of its site. But assuming you know what you are doing, parsing either strings or streams...

Package up your applet and the classes extracted from htmlparser.jar into another big jar and point the archive attribute in your applet tag to the aggregate jar.

Riaz uddin wrote:
> Hi,
> I am really stuck with this.
> I have a program StringExtract.java which uses HTMLParser and it works
> fine. I am trying to bring the output of this program to the webpage
> through an applet.
>
> In order to do this, I am trying to instantiate StringExtract in the
> applet class (sumdisplay), something like this:
>
>     StringExtract yahoosum = new StringExtract();
>
> But the applet cannot be instantiated and it displays the following
> error along with other errors:
>
>     java.lang.NoClassDefFoundError: org/htmlparser/util/NodeIterator
>
> How can I solve this problem? Please help.
From: Riaz u. <ru...@ya...> - 2006-05-02 17:05:15
|
Hi,

I am really stuck with this. I have a program StringExtract.java which uses HTMLParser and it works fine. I am trying to bring the output of this program to the webpage through an applet.

In order to do this, I am trying to instantiate StringExtract in the applet class (sumdisplay), something like this:

    StringExtract yahoosum = new StringExtract();

But the applet cannot be instantiated and it displays the following error along with other errors:

    java.lang.NoClassDefFoundError: org/htmlparser/util/NodeIterator

How can I solve this problem? Please help.
From: Derrick O. <Der...@Ro...> - 2006-04-29 10:57:23
|
It's not clear why you aren't getting any output. The same loop is in the Lexer mainline:

    manager = Page.getConnectionManager ();
    lexer = new Lexer (manager.openConnection (args[0]));
    while (null != (node = lexer.nextNode (false)))
        System.out.println (node.toString ());

The guard on the if statement should be satisfied for anything that looks like a tag, i.e. <XXX>.

Thomas Zastrow wrote:
> Derrick Oswald wrote:
>
>> You will need to cast it to a tag if possible and use getTagName ():
>>     if (node instanceof Tag)
>>         System.out.println (((Tag)node).getTagName ());
>
> Step by step, I'll get it ... ;-)
>
> Now, this code produces no output. Am I still doing something wrong?
>
>     Parser parser = new Parser("/gb/testfiles/abraham/fabeln/antwort.htm");
>     Lexer lexer = parser.getLexer();
>     Node node;
>     String s;
>     while(null != lexer.nextNode()){
>         node = lexer.nextNode();
>         if(node instanceof Tag){
>             System.out.println(((Tag)node).getTagName());
>         } // if
>     }
>
> Greetings and thank you again, I hope that sometime I'll manage it on
> my own ...
>
> Best regards,
>
> Tom
>
> _______________________________________________
> Htmlparser-user mailing list
> Htm...@li...
> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
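One difference between the two loops in this exchange is visible from the code alone: the quoted loop calls `nextNode()` once in the `while` condition and again in the body, so the node fetched by the condition is thrown away and only every second node is inspected. The Lexer mainline assigns inside the condition and therefore sees every node. If tags and text nodes happen to alternate in the input, the doubled call could land only on text nodes, which might explain the missing output, though that is a guess. The sketch below reproduces both loop shapes with a hypothetical stand-in, `TokenSource`, instead of the real `Lexer`, so it runs without the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class NextNodeLoop {

    /** Hypothetical stand-in for a lexer: returns the next token, or null at the end. */
    static final class TokenSource {
        private final Iterator<String> tokens;

        TokenSource(String... tokens) {
            this.tokens = Arrays.asList(tokens).iterator();
        }

        String nextNode() {
            return tokens.hasNext() ? tokens.next() : null;
        }
    }

    /** Loop shape from the question: two calls per pass, so every other token is skipped. */
    static List<String> everyOther(TokenSource source) {
        List<String> seen = new ArrayList<>();
        while (null != source.nextNode()) {      // this token is discarded
            String node = source.nextNode();     // only this one is looked at
            if (node != null) {
                seen.add(node);
            }
        }
        return seen;
    }

    /** Loop shape from the Lexer mainline: one call per pass, assigned in the condition. */
    static List<String> all(TokenSource source) {
        List<String> seen = new ArrayList<>();
        String node;
        while (null != (node = source.nextNode())) {
            seen.add(node);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(everyOther(new TokenSource("<html>", "text1", "<body>", "text2")));
        System.out.println(all(new TokenSource("<html>", "text1", "<body>", "text2")));
    }
}
```

With alternating tag and text tokens, `everyOther` only ever collects the text tokens, while `all` collects all four.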
From: ywj<yw...@ya...> - 2006-04-27 09:57:28
|
hi,
    there're javascript links like "next page" and "previous page" in some web pages.  when clicked, a form will be submitted and returned the next page. is it possible to get the next page while parsing those javascript links in the first page?
From: Thomas Z. <li...@th...> - 2006-04-26 18:20:11
|
Derrick Oswald wrote:
>
> You will need to cast it to a tag if possible and use getTagName ():
>     if (node instanceof Tag)
>         System.out.println (((Tag)node).getTagName ());

Step by step, I'll get it ... ;-)

Now, this code produces no output. Am I still doing something wrong?

    Parser parser = new Parser("/gb/testfiles/abraham/fabeln/antwort.htm");
    Lexer lexer = parser.getLexer();
    Node node;
    String s;
    while(null != lexer.nextNode()){
        node = lexer.nextNode();
        if(node instanceof Tag){
            System.out.println(((Tag)node).getTagName());
        } // if
    }

Greetings and thank you again, I hope that sometime I'll manage it on my own ...

Best regards,

Tom
From: Derrick O. <Der...@Ro...> - 2006-04-26 01:07:23
|
You will need to cast it to a tag if possible and use getTagName ():

    if (node instanceof Tag)
        System.out.println (((Tag)node).getTagName ());

Thomas Zastrow wrote:
> Derrick Oswald wrote:
>
>> Sorry about that. I fixed the documentation. Just supply a null...
>>     NodeList list = parser.parse (null);
>> Note that the tags will be nested so the list is only as long as the
>> count of enclosing tags, usually just one, i.e. <HTML>.
>>
>> If you want nodes in a simple sequential order without nesting, use
>> the lexer...
>>     Parser parser = new Parser ("http://whatever");
>>     Lexer lexer = parser.getLexer ();
>>     Node node;
>>     while (null != (node = lexer.nextNode ()))
>>         ... do something with the node
>>
> Dear Derrick,
>
> thank you for your help ;-)
>
> So, maybe I can ask another question ... I got this code:
>
>     Parser parser = new Parser("/gb/testfiles/abraham/fabeln/antwort.htm");
>     Lexer lexer = parser.getLexer();
>     Node node;
>     String s;
>     while(null != lexer.nextNode()){
>         node = lexer.nextNode();
>         s = node.toPlainTextString();
>         System.out.println(s);
>     }
>
> It works fine, but it prints the content of the tags, not the names of
> the tags. I just need to know which tags are used in the document...
>
> Thank you very much!
>
> Greetings,
>
> Tom
From: Thomas Z. <li...@th...> - 2006-04-25 19:04:58
|
Derrick Oswald wrote:
> Sorry about that. I fixed the documentation. Just supply a null...
>     NodeList list = parser.parse (null);
> Note that the tags will be nested so the list is only as long as the
> count of enclosing tags, usually just one, i.e. <HTML>.
>
> If you want nodes in a simple sequential order without nesting, use
> the lexer...
>     Parser parser = new Parser ("http://whatever");
>     Lexer lexer = parser.getLexer ();
>     Node node;
>     while (null != (node = lexer.nextNode ()))
>         ... do something with the node
>

Dear Derrick,

thank you for your help ;-)

So, maybe I can ask another question ... I got this code:

    Parser parser = new Parser("/gb/testfiles/abraham/fabeln/antwort.htm");
    Lexer lexer = parser.getLexer();
    Node node;
    String s;
    while(null != lexer.nextNode()){
        node = lexer.nextNode();
        s = node.toPlainTextString();
        System.out.println(s);
    }

It works fine, but it prints the content of the tags, not the names of the tags. I just need to know which tags are used in the document...

Thank you very much!

Greetings,

Tom
From: Derrick O. <Der...@Ro...> - 2006-04-24 22:13:40
|
Sorry about that. I fixed the documentation. Just supply a null...

    NodeList list = parser.parse (null);

Note that the tags will be nested so the list is only as long as the count of enclosing tags, usually just one, i.e. <HTML>.

If you want nodes in a simple sequential order without nesting, use the lexer...

    Parser parser = new Parser ("http://whatever");
    Lexer lexer = parser.getLexer ();
    Node node;
    while (null != (node = lexer.nextNode ()))
        ... do something with the node

Thomas Zastrow wrote:
> Dear list,
>
> I'm very new to the htmlparser and have some problems with the
> documentation ... I need nothing more than a little program which
> extracts *all* HTML tags of an HTML document.
>
> I took a look at the docs and found this example:
>
> Typical usage of the parser is:
>
>     Parser parser = new Parser ("http://whatever");
>     NodeList list = parser.parse ();
>     // do something with your list of nodes.
>
> But when I try NodeList list = parser.parse(), it tells me that it
> needs a "NodeFilter filter" as argument. But I don't need any
> filter, I want all tags in the doc ... how can I do this?
>
> Thank you very much for your labour!
>
> Best regards,
>
> Tom
From: Thomas Z. <li...@th...> - 2006-04-24 18:04:04
|
Dear list,

I'm very new to the htmlparser and have some problems with the documentation ... I need nothing more than a little program which extracts *all* HTML tags of an HTML document.

I took a look at the docs and found this example:

Typical usage of the parser is:

    Parser parser = new Parser ("http://whatever");
    NodeList list = parser.parse ();
    // do something with your list of nodes.

But when I try NodeList list = parser.parse(), it tells me that it needs a "NodeFilter filter" as argument. But I don't need any filter, I want all tags in the doc ... how can I do this?

Thank you very much for your labour!

Best regards,

Tom
From: Brandy Y. <ye...@gm...> - 2006-04-23 00:31:45
|
Hello, all

I'm a newbie to Htmlparser. I have a question that came up when I wrote my first sample using Htmlparser, something to show the HTML structure.

When I use "getParent()" to get the parent of a text node, some tags such as "<b>" and "<i>" are not treated as its parent node.

The html to be parsed:

    <html>
    <title>test.html</title>
    <body>
    <b>
    <h1>
    content
    </h1>
    </font>
    </body>
    </html>

and the parent nodes of "content" are: h1, body, html (but NO b).

Is this the expected behaviour? I found that HeadingTag (h1, h2...) was not treated as a parent node either in Htmlparser 1.5.
From: Derrick O. <Der...@Ro...> - 2006-04-22 11:15:40
|
Brandy,

As your example illustrates, <B> is not often closed by a </B> which causes some grief for the parser. For this heuristic reason, not all possible tags are registered as CompositeTag nodes, which is what gives the 'parent/child' nesting relationship.

The heading tags were just added recently in version 1.6, which alters the heuristic a bit, but seems to be acceptable to most people.

Derrick

Brandy Ye wrote:
> Hello, all
>
> I'm a newbie of Htmlparser. I have a question when I wrote my
> first sample using Htmlparser, something to show the html structures.
>
> When I use "getParent()" to get the parent of a text node, some tags
> such as "<b>" and "<i>" are not treated as its parent node.
>
> The html to be parsed:
>
>     <html>
>     <title>test.html</title>
>     <body>
>     <b>
>     <h1>
>     content
>     </h1>
>     </font>
>     </body>
>     </html>
>
> and the parent nodes of "content" are: h1, body, html (but NO b).
>
> Is it the expected behaviour? I found headingTag (h1,h2...) was not
> treated as parent node too in Htmlparser1.5.
>
> Thanks in advance!
From: Brandy Y. <ye...@gm...> - 2006-04-22 06:41:22
|
Hello, all

I'm a newbie to Htmlparser. I have a question that came up when I wrote my first sample using Htmlparser, something to show the HTML structure.

When I use "getParent()" to get the parent of a text node, some tags such as "<b>" and "<i>" are not treated as its parent node.

The html to be parsed:

    <html>
    <title>test.html</title>
    <body>
    <b>
    <h1>
    content
    </h1>
    </font>
    </body>
    </html>

and the parent nodes of "content" are: h1, body, html (but NO b).

Is this the expected behaviour? I found that HeadingTag (h1, h2...) was not treated as a parent node either in Htmlparser 1.5.

Thanks in advance!
From: Derrick O. <Der...@Ro...> - 2006-04-21 01:17:54
|
It looks like a double/single quote nesting problem. The quoting is correct from what I can see, but the lexer is probably getting confused trying to nest the single quotes inside the double quotes... it's only a stupid program.

Qingyi Gu wrote:
> Hi,
>
> Sorry, please ignore my last email. I double-checked line by line. The
> problem is on the line after these two. The following is the line
> which breaks the parser.
>
>     ss = "onclick='SetValue(" + aVal + ")'" + "id='List_" + index + "' theNumber='"+ aVal +"'>";
>
> I double-checked it a couple of times. If I remove this line, I get
> what I want. I don't know what's wrong with this line.
>
> Thanks,
> J
From: Qingyi Gu <q_z...@ya...> - 2006-04-20 20:10:34
|
Hi,

Sorry, please ignore my last email. I double-checked line by line. The problem is on the line after these two. The following is the line which breaks the parser.

    ss = "onclick='SetValue(" + aVal + ")'" + "id='List_" + index + "' theNumber='"+ aVal +"'>";

I double-checked it a couple of times. If I remove this line, I get what I want. I don't know what's wrong with this line.

Thanks,
J
From: Qingyi Gu <q_z...@ya...> - 2006-04-20 19:33:01
|
Hey,

I have been using this utility for a while, and everything has worked fine so far. But recently I ran into a problem. I have a new page to parse, and I cannot get the right head tag info. The reason is that there is a JavaScript block, and inside the script there are some lines like the ones below:

    var spanStart = "<span class='span1'>";
    var spanEnd = "</span>";

If I remove them, the parser works correctly. Does anybody know of a workaround for this problem?

Thanks.

BR,
Jenny
From: Derrick O. <Der...@Ro...> - 2006-04-20 12:01:39
|
The parse should proceed correctly and give you one META tag. There is no need for a HTML tag. The redirection will need to be coded by you though, as there isn't any support in the parser to do that other than what comes for free with the FollowRedirects of a normal URLConnection.

Eugeny N Dzhurinsky wrote:
> Hi!
>
> I facing a problem: some pages consists just of single line:
>
>     <META HTTP-EQUIV=Refresh CONTENT="0; URL=/internet/index.jsp">
>
> so there is no HTML tag, and thus HTMLParser seems not parses this page at
> all. Is there any way to parse such pages and extract tags?
>
> I'm asking because all browsers works fine with such pages, they seems to
> parse them, extract redirection URI and process redirects.
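The redirect handling mentioned above could be sketched roughly as follows, using only the JDK. It assumes you have already pulled the CONTENT attribute string out of the META tag (for instance with the tag's `getAttribute ("CONTENT")`) and know the URL of the page it came from. The class `MetaRefresh`, the method `target`, and the regular expression are illustrative and are not part of HTML Parser; real-world CONTENT values vary more than this pattern allows.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTML Parser.
final class MetaRefresh {

    // Matches "<delay>; URL=<target>" with an optionally quoted target, ignoring case.
    private static final Pattern CONTENT = Pattern.compile(
            "^\\s*\\d+\\s*[;,]\\s*url\\s*=\\s*['\"]?([^'\"]+?)['\"]?\\s*$",
            Pattern.CASE_INSENSITIVE);

    /**
     * Returns the absolute redirect target named in a META refresh CONTENT
     * value, resolved against the page's own URL, or null if the value
     * carries no URL part (a plain "reload after N seconds").
     */
    static URI target(URI pageUrl, String content) {
        Matcher m = CONTENT.matcher(content);
        if (!m.matches()) {
            return null;
        }
        return pageUrl.resolve(m.group(1).trim());
    }

    public static void main(String[] args) {
        URI page = URI.create("http://example.com/start/page.html");
        System.out.println(target(page, "0; URL=/internet/index.jsp"));
    }
}
```

The returned URI can then be fetched and parsed like any other page; guarding against redirect loops is left to the caller.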
From: Eugeny N D. <bo...@re...> - 2006-04-20 08:19:38
|
Hi!

I'm facing a problem: some pages consist of just a single line:

    <META HTTP-EQUIV=Refresh CONTENT="0; URL=/internet/index.jsp">

so there is no HTML tag, and thus HTMLParser doesn't seem to parse this page at all. Is there any way to parse such pages and extract tags?

I'm asking because all browsers work fine with such pages; they seem to parse them, extract the redirection URI and process the redirects.

-- 
Eugene N Dzhurinsky
From: Wen <log...@ya...> - 2006-04-18 05:47:36
|
Hi,

I have code like this:

    Parser parser = new Parser();
    try{
        parser.setURL("http://www.pbs.org/journeyintoamazonia");
    }catch(ParserException pe){
        pe.printStackTrace();
    }

However, the try-catch block doesn't seem to catch the exception. I still get the following exception, and execution stops:

    org.htmlparser.util.ParserException: Connection timed out: connect;
    java.net.ConnectException: Connection timed out: connect
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(Unknown Source)
        at java.net.PlainSocketImpl.connectToAddress(Unknown Source)

Does anyone know what's happening?

P.S. Thanks Derrick. Your solution for my last question is working fine. :)

Wen
From: Derrick O. <der...@ro...> - 2006-04-18 01:05:25
|
You should be able to use a TagNameFilter and then get the SRC attribute.

    ((Tag)parser.parse (new TagNameFilter ("EMBED")).elementAt (0)).getAttribute ("SRC");

Wen <log...@ya...> wrote:
> Hi,
>
> I'm trying to get the src link in embed tag.
>
> EX.
>     <EMBED src=http://gorillamask.net/Media/skatingdog.wmv width=500 height=400 type="text/plain; charset=UTF-8">
>
> I checked the library, there is no a tag called EmbedTag, so I can't use
> NodeFilter. Does any one have ideas on how to get it.
>
> Thanks in advanced.
>
> Wen
From: Wen <log...@ya...> - 2006-04-18 00:33:17
|
Hi,

I'm trying to get the src link in an embed tag.

EX.

    <EMBED src=http://gorillamask.net/Media/skatingdog.wmv width=500 height=400 type="text/plain; charset=UTF-8">

I checked the library and there is no tag class called EmbedTag, so I can't use NodeFilter. Does anyone have ideas on how to get it?

Thanks in advance.

Wen
From: Bastian H. <ho...@fm...> - 2006-04-17 20:09:28
|
Hi,

thank you for your help. Now it works for me, and I've attached the source code of the tag class to this mail in case anybody else needs it. It's really simple and I didn't spend much time on the terminating tags (I just copied them from LinkTag), so bugs may appear there. I'll write some JUnit test cases and we'll see...

Bastian
From: Derrick O. <Der...@Ro...> - 2006-04-17 11:48:04
|
Bastian,

The CODE tag has not been added as a subclass of CompositeTag, so you're getting the default behaviour -- just a simple NodeTag that has the name CODE. Perhaps the 'phrase elements' (EM, STRONG, DFN, CODE, SAMP, KBD, VAR, CITE, ABBR, and ACRONYM, see http://www.w3.org/TR/html4/struct/text.html#h-9.2.1) should be added.

You can raise this as a request for enhancement (RFE), or you can do this yourself by copying another tag based on CompositeTag and editing it a bit, and then registering the new tag with the PrototypicalNodeFactory:

    PrototypicalNodeFactory factory = new PrototypicalNodeFactory ();
    factory.registerTag (new MyCodeTag ());
    parser.setNodeFactory (factory);

See for example PrototypicalNodeFactory.registerTags(). The problem becomes detecting when the tag doesn't have a </CODE> like it should, so getEnders() and getEndTagEnders() should probably have all the block level tag names.

Derrick

Bastian Hoesch wrote:
> Hello,
>
> given this text string
>
>     "<html><body><a href="xy">test</a></body></html>"
>
> HTMLParser creates this nodelist:
>
>     Tag (0[0,0],6[0,6]): html
>       Tag (6[0,6],12[0,12]): body
>         Tag (12[0,12],25[0,25]): a href="xy"
>           Txt (25[0,25],29[0,29]): test
>           End (29[0,29],33[0,33]): /a
>         End (33[0,33],40[0,40]): /body
>       End (40[0,40],47[0,47]): /html
>
> So, the text "test" is child element of the tag node for the element
> <A>. I like this behaviour and I think thats correct way to do that.
>
> But:
>
> from this text string
>
>     "<html><body><code>test</code></body></html>"
>
> the parser creates the following node list:
>
>     Tag (0[0,0],6[0,6]): html
>       Tag (6[0,6],12[0,12]): body
>         Tag (12[0,12],18[0,18]): code
>         Txt (18[0,18],22[0,22]): test
>         End (22[0,22],29[0,29]): /code
>         End (29[0,29],36[0,36]): /body
>       End (36[0,36],43[0,43]): /html
>
> so, the text "test" is not a child element of the tag <code>.
> Why does this happen? Is it a bug or feature?
>
> Thank you for your help,
>
> greetings
> Bastian Hoesch
From: Bastian H. <ho...@fm...> - 2006-04-17 10:21:28
|
Hello,

given this text string

    "<html><body><a href="xy">test</a></body></html>"

HTMLParser creates this node list:

    Tag (0[0,0],6[0,6]): html
      Tag (6[0,6],12[0,12]): body
        Tag (12[0,12],25[0,25]): a href="xy"
          Txt (25[0,25],29[0,29]): test
          End (29[0,29],33[0,33]): /a
        End (33[0,33],40[0,40]): /body
      End (40[0,40],47[0,47]): /html

So the text "test" is a child element of the tag node for the element <A>. I like this behaviour and I think that's the correct way to do it.

But from this text string

    "<html><body><code>test</code></body></html>"

the parser creates the following node list:

    Tag (0[0,0],6[0,6]): html
      Tag (6[0,6],12[0,12]): body
        Tag (12[0,12],18[0,18]): code
        Txt (18[0,18],22[0,22]): test
        End (22[0,22],29[0,29]): /code
        End (29[0,29],36[0,36]): /body
      End (36[0,36],43[0,43]): /html

so the text "test" is not a child element of the tag <code>. Why does this happen? Is it a bug or a feature?

Thank you for your help,

greetings
Bastian Hoesch
From: Bram <br...@av...> - 2006-04-13 08:13:37
|
Hello Derrick,

> The mini <A/> tag in your test case is considered a content-less XML
> tag (see isEmptyXmlTag() in TagNode).
> This causes the parser to set the end tag reference to be the start tag
> - this is done so that there will always be an end tag, which may be a
> bad design decision, but it was thought to be better than inventing a
> non-existent tag, or returning null.
>
> When you add an HREF attribute in this case, the only attribute (the tag
> name with the slash) is no longer the last attribute and hence
> isEmptyXmlTag returns false, but the end tag reference is still pointing
> to the start tag, which causes the recursion.

Thanks for the explanation.

> I guess the add attribute code could be smarter and detect this
> pathological situation, but I'm wondering if that's the real solution or
> just a band-aid.

That sounds like a band-aid indeed, but IMHO it's a viable way to prevent this problem until something better has been worked out.

> Was this discovered in the wild? Why is the XML syntax used for an
> empty link? Is this an XML file? Perhaps an XML parser would be a
> better choice.

Apparently I over-simplified the problem I occasionally encounter. I tried to use the exact same test case on the URL that gave me the problems, and now it worked flawlessly. After a bit of tinkering I noticed that I had changed the name of the attribute (from 'onclick' to 'href') while trying to isolate the problem yesterday. Changing it back to 'onclick' makes the test fail. This can be explained by the fact that the broken <a> tag already had an 'href' attribute, and simply changing an existing attribute doesn't trigger the problem. This is demonstrated in the newly attached JUnit test.

Sadly, '<a ... />' tags do occur in the wild, so from a user's point of view it would be very good to prevent this from happening.

Bram

-- 
Light travels faster than sound. This is why some people appear bright until you hear them speak.