Thread: [Htmlparser-user] Not CompositeNode ("ADDRESS", "CENTER" TAG, etc...)
From: <ka...@ex...> - 2006-02-07 06:46:36
Hi, all.

I parsed an HTML file and created a DOM using
HTMLParser Version 1.6 (Integration Build Nov 12, 2005).

The "P" tag has the "P" end tag as a child.
(The same is true of "HEAD", "TITLE", "BODY", etc.)

On the other hand, there are two "ADDRESS" tags ("ADDRESS" and "/ADDRESS")
on the same level in the DOM.
(The same thing happens with the "CENTER" tag.)

I expected the ADDRESS tag to behave like the "P" tag, but it does not.

What is the reason?

How can I make the parser recognize the ADDRESS tag as a single
CompositeTag?

Thank you, all. Sorry for my poor English.

---------code-----------
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class SampleHTMLParserJ {

    /**
     * HTMLParser sample
     *
     * @param args
     */
    public static void main(String[] args) {
        try {
            Parser parser = new Parser("file:///D:/data/test03.html");
            // A null filter returns all top-level nodes.
            NodeList list = parser.parse(null);
            // Print the first top-level node (the Html tag) and its children.
            Node node = list.elementAt(0);
            System.out.println(node);
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}
---------stdout-----------
Tag (0[0,0],57[0,57]): Html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ja"
Txt (57[0,57],60[1,1]): \n
Tag (60[1,1],66[1,7]): head
Txt (66[1,7],70[2,2]): \n
Tag (70[2,2],77[2,9]): title
Txt (77[2,9],88[2,20]): title title
End (88[2,20],96[2,28]): /title
Txt (96[2,28],99[3,1]): \n
End (99[3,1],106[3,8]): /head
Txt (106[3,8],109[4,1]): \n
Tag (109[4,1],115[4,7]): body
Txt (115[4,7],121[6,2]): \n\n
Tag (121[6,2],130[6,11]): address
Txt (130[6,11],137[6,18]): My name
End (137[6,18],147[6,28]): /address
Txt (147[6,28],151[7,2]): \n
Tag (151[7,2],159[7,10]): CENTER
Txt (159[7,10],165[7,16]): CENTER
End (165[7,16],174[7,25]): /CENTER
Txt (174[7,25],178[8,2]): \n
Tag (178[8,2],181[8,5]): p
Tag (181[8,5],220[8,44]): img src="welcome.gif" alt="welcome" /
End (220[8,44],224[8,48]): /p
Txt (224[8,48],230[10,2]): \n\n
Tag (230[10,2],234[10,6]): h1
Txt (234[10,6],238[10,10]): main
End (238[10,10],243[10,15]): /h1
Txt (243[10,15],247[11,2]): \n
Tag (247[11,2],253[11,8]): hr /
Txt (253[11,8],256[12,1]): \n
End (256[12,1],263[12,8]): /body
Txt (263[12,8],265[13,0]): \n
End (265[13,0],272[13,7]): /html
---------html-----------
<Html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ja">
 <head>
  <title>title title</title>
 </head>
 <body>

  <address>My name</address>
  <CENTER>CENTER</CENTER>
  <p><img src="welcome.gif" alt="welcome" /></p>

  <h1>main</h1>
  <hr />
 </body>
</html>
------------------
From: <ka...@ex...> - 2006-02-07 07:14:24
Hi, all.

I found a way that works.

I created an AddressTag.java that is almost a copy of ParagraphTag.java,
and added code like this:

    PrototypicalNodeFactory factory = new PrototypicalNodeFactory();
    factory.registerTag (new AddressTag());
    parser.setNodeFactory (factory);

It works, but should I create many more HTML tag classes this way?

I assume classes for most HTML tags already exist somewhere.
How can I get them?

Thank you, all.
From: Derrick O. <Der...@Ro...> - 2006-02-07 18:34:13
Those tags are omitted because, heuristically, the tighter rule that assumes all tags are composite tags fails to parse correctly, given the bad HTML out in the wild.

You are welcome to try replacing the default tag (see PrototypicalNodeFactory.setTagPrototype()) with a composite tag that ends with a matching slash name, but my guess is that it will parse very poorly.

加藤 千典 wrote:

>I created an AddressTag.java that is almost a copy of ParagraphTag.java,
>and added code like this:
>
>    PrototypicalNodeFactory factory = new PrototypicalNodeFactory();
>    factory.registerTag (new AddressTag());
>    parser.setNodeFactory (factory);
>
>It works, but should I create many more HTML tag classes this way?
>
>I assume classes for most HTML tags already exist somewhere.
>How can I get them?
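
The behaviour discussed in this thread can be illustrated with a toy model in plain Java. This is only a sketch under stated assumptions: it is not HTMLParser's implementation and uses none of its classes, and the names CompositeSketch, build, render and names are hypothetical, made up for this example. Input is assumed to be already split into tokens. Only tag names in the "composite" set collect children up to their matching end tag; every other tag stays a flat sibling, as ADDRESS and CENTER do in the output above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model only: NOT HTMLParser's implementation, and none of its classes are used.
class CompositeSketch {

    /** A tag, end tag, or text node; only composite tags ever get children. */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label) { this.label = label; }
    }

    /** Builds a tree from pre-split tokens such as "<p>", "text", "</p>". */
    static List<Node> build(List<String> tokens, Set<String> compositeNames) {
        return parse(tokens, new int[] {0}, compositeNames, null);
    }

    private static List<Node> parse(List<String> tokens, int[] pos,
                                    Set<String> compositeNames, String closing) {
        List<Node> out = new ArrayList<>();
        while (pos[0] < tokens.size()) {
            String token = tokens.get(pos[0]++);
            if (token.startsWith("</")) {
                String name = token.substring(2, token.length() - 1).toUpperCase(Locale.ROOT);
                out.add(new Node("End:" + name));
                if (name.equals(closing)) {
                    return out; // the matching end tag becomes the last child
                }
            } else if (token.startsWith("<")) {
                String name = token.substring(1, token.length() - 1).toUpperCase(Locale.ROOT);
                Node tag = new Node("Tag:" + name);
                if (compositeNames.contains(name)) {
                    // Composite: swallow everything up to the matching end tag.
                    tag.children.addAll(parse(tokens, pos, compositeNames, name));
                }
                out.add(tag);
            } else {
                out.add(new Node("Txt:" + token));
            }
        }
        return out; // input ran out (possibly before the end tag was seen)
    }

    /** Renders siblings separated by spaces, children in square brackets. */
    static String render(List<Node> nodes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < nodes.size(); i++) {
            if (i > 0) sb.append(' ');
            Node n = nodes.get(i);
            sb.append(n.label);
            if (!n.children.isEmpty()) {
                sb.append('[').append(render(n.children)).append(']');
            }
        }
        return sb.toString();
    }

    static Set<String> names(String... n) {
        return new HashSet<>(Arrays.asList(n));
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList(
                "<address>", "My name", "</address>", "<p>", "hello", "</p>");
        // Only P is composite: ADDRESS and /ADDRESS stay on the same level.
        System.out.println(render(build(doc, names("P"))));
        // Adding ADDRESS to the set nests it, like registering one more tag class.
        System.out.println(render(build(doc, names("P", "ADDRESS"))));
        // Bad HTML: an unclosed CENTER, if composite, swallows what follows it.
        List<String> bad = Arrays.asList("<center>", "text", "<p>", "para", "</p>");
        System.out.println(render(build(bad, names("P", "CENTER"))));
    }
}
```

In this model, adding a name to the set plays the role that registering an extra tag class plays in the second message, and the last call shows one way an "everything is composite" rule can go wrong on markup with a missing end tag. The real library has further rules for ending tags that this sketch leaves out.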