Thread: Re: [Htmlparser-user] Parsing Partial HTML text
From: Derrick O. <der...@ro...> - 2007-09-27 01:15:32
|
The tbody tag you are getting is a generic tag - because the parser doesn't know about tbody.
Hence it has no children because it is not a composite node.
You can make your own tbody composite node as described here: http://htmlparser.sourceforge.net/faq.html#composite

----- Original Message ----
From: "mic...@no..." <mic...@no...>
To: htm...@li...
Sent: Wednesday, September 26, 2007 10:30:45 AM
Subject: [Htmlparser-user] Parsing Partial HTML text

I am having trouble parsing html tagged text. It seems that I can retrieve a node but that element does not have the child nodes as expected.

    String table =
        "<tbody>\n" +
        "<tr>\n" +
        "<td><span>brain_normal_GSM80627</span></td>\n" +
        "<td><span>normal</span></td>\n" +
        "<td><span>cerebral cortex</span></td>\n" +
        "<td><span>brain</span></td>\n" +
        "</tr>\n" +
        "</tbody>\n";

    Parser parser = new Parser(new Lexer(table));
    try {
        Node tBodyNode = parser.extractAllNodesThatMatch(new TagNameFilter("tbody")).elementAt(0);
        System.out.println(tBodyNode.getChildren()); // Prints null <---------------
    } catch (ParserException e) {
        e.printStackTrace(); //To change body of catch statement use File | Settings | File Templates.
    }

Does HTML Parser not handle text input or partial html files well?
|
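The generic-versus-composite distinction described in the reply above can be modelled in a few lines. This is only a toy sketch under stated assumptions: GenericTag, CompositeTag, Factory and registerComposite here are hypothetical names invented for illustration and are not the HTML Parser classes; the real mechanism is the one the FAQ link describes. The idea shown is simply that a factory hands out a childless generic tag for any name it has not been told about, and a child-collecting tag once the name is registered.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

// Toy model only: these classes are hypothetical and are NOT the HTML Parser API.
class NodeFactorySketch {

    /** A tag the factory does not know about: it never collects children. */
    static class GenericTag {
        final String name;
        GenericTag(String name) { this.name = name; }
        // Mirrors the "prints null" symptom from the question.
        List<GenericTag> getChildren() { return null; }
    }

    /** A tag registered as composite: it owns a child list. */
    static class CompositeTag extends GenericTag {
        private final List<GenericTag> children = new ArrayList<>();
        CompositeTag(String name) { super(name); }
        void addChild(GenericTag child) { children.add(child); }
        @Override List<GenericTag> getChildren() { return children; }
    }

    /** Looks tag names up in a registry and falls back to GenericTag. */
    static class Factory {
        private final Map<String, Supplier<GenericTag>> registry = new HashMap<>();

        void registerComposite(String name) {
            String key = name.toUpperCase(Locale.ROOT);
            registry.put(key, () -> new CompositeTag(key));
        }

        GenericTag create(String name) {
            String key = name.toUpperCase(Locale.ROOT);
            Supplier<GenericTag> prototype = registry.get(key);
            return prototype != null ? prototype.get() : new GenericTag(key);
        }
    }

    public static void main(String[] args) {
        Factory factory = new Factory();
        factory.registerComposite("tr"); // assume TR is known out of the box in this toy

        // TBODY is unregistered, so the factory returns a generic, childless tag.
        System.out.println(factory.create("tbody").getChildren()); // prints "null"

        // The fix in this model: register the tag, then children can be attached.
        factory.registerComposite("tbody");
        CompositeTag tbody = (CompositeTag) factory.create("tbody");
        tbody.addChild(factory.create("tr"));
        System.out.println(tbody.getChildren().size()); // prints "1"
    }
}
```

In this model the partial HTML input is not the problem; the missing registration is, which matches the explanation given in the reply.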
From: Derrick O. <der...@ro...> - 2007-09-27 12:03:36
|
This has been on the Request For Enhancement list for a couple of years now.
RFE #888158 block & inline tag differentiation <http://sourceforge.net/tracker/index.php?func=detail&aid=888158&group_id=24399&atid=381402>
RFE #1395202 support for well known tags like b,i,u,iframe <http://sourceforge.net/tracker/index.php?func=detail&aid=1395202&group_id=24399&atid=381402>

Adding the equivalent of composite tags isn't too difficult, but each tag has semantics that are more subtle.
For example, MAP and AREA would need accessors like MapTag.getName(), AreaTag.getShape() and AreaTag.getCoords().

... it's just a little code

_______________________________________________
Htmlparser-user mailing list
https://lists.sourceforge.net/lists/listinfo/htmlparser-user
|
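To make the remark about per-tag semantics concrete, here is a rough sketch of what accessors such as MapTag.getName(), AreaTag.getShape() and AreaTag.getCoords() might look like. Everything below is an assumption for illustration: the Tag base class and its attribute map are hypothetical stand-ins, not the HTML Parser classes, and the coordinate parsing assumes plain integer pixel values (HTML also allows percentage lengths, which this sketch does not handle). The "rect" default for a missing shape attribute follows the HTML 4.01 definition of AREA.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: not the HTML Parser classes, just the shape of the accessors.
class MapAreaSketch {

    /** Minimal stand-in for a tag: attributes with case-insensitive names. */
    static class Tag {
        private final Map<String, String> attributes = new LinkedHashMap<>();

        void setAttribute(String key, String value) {
            attributes.put(key.toLowerCase(Locale.ROOT), value);
        }

        String getAttribute(String key) {
            return attributes.get(key.toLowerCase(Locale.ROOT));
        }
    }

    static class MapTag extends Tag {
        /** The name that an image's usemap attribute refers to. */
        String getName() { return getAttribute("name"); }
    }

    static class AreaTag extends Tag {
        /** A missing shape attribute is treated as "rect". */
        String getShape() {
            String shape = getAttribute("shape");
            return shape == null ? "rect" : shape.toLowerCase(Locale.ROOT);
        }

        /** Parses "x1,y1,x2,y2" style coords; returns an empty array if absent. */
        int[] getCoords() {
            String raw = getAttribute("coords");
            if (raw == null || raw.trim().isEmpty()) {
                return new int[0];
            }
            String[] parts = raw.split(",");
            int[] coords = new int[parts.length];
            for (int i = 0; i < parts.length; i++) {
                // Assumption: integer pixel values only.
                coords[i] = Integer.parseInt(parts[i].trim());
            }
            return coords;
        }
    }

    public static void main(String[] args) {
        AreaTag area = new AreaTag();
        area.setAttribute("COORDS", "0, 0, 82, 126");
        System.out.println(area.getShape());         // prints "rect" (default)
        System.out.println(area.getCoords().length); // prints "4"
    }
}
```

Even in this small model the subtlety shows up: the accessor has to decide on defaults, case handling and value parsing, which goes beyond registering the tag as composite.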
From: Madhur K. T. <mad...@gm...> - 2007-09-27 12:06:34
|
:)
So what does it take for an outsider like me - I mean off the developer list - to work on that? Is that possible? Or allowed?

--
Madhur Kumar Tanwani <http://madhurtanwani.googlepages.com>
Gebo <http://feeds.feedburner.com/%7Er/Gebo/%7E6/1>

****************************************************************
* Imagine if every Thursday your shoes exploded if you tied them the *usual way*.
This happens to us all the time with *computers*, and nobody thinks of complaining. Johnson
****************************************************************
|
From: Derrick O. <der...@ro...> - 2007-09-27 12:27:24
|
Yes it's possible and allowed... it's open source.
Two obvious choices...
You can join the developer community and have access to subversion,
or you can submit a patch (an overlay of files and changes) to be incorporated.
Your choice.
|
From: Madhur K. T. <mad...@gm...> - 2007-09-27 12:57:39
|
That is cool - I've been using HtmlParser for about 2 years now. I've done pretty basic things, though. The deepest I've been in HtmlParser's code is to find out how CompositeTagScanner worked - that was the time when I wanted to remove all the virtual end tags that the parser added to the content.

I think the developer community option seems to be more exciting. I'll check on the site for more details on joining the dev side and will get back to you. So what - does the project have a set of induction docs or what?

Thanks,

--
Madhur Kumar Tanwani <http://madhurtanwani.googlepages.com>
|
From: Madhur K. T. <mad...@gm...> - 2007-09-27 04:20:12
|
Hi Derrick,
I understand that the parser wraps unknown/unregistered tags in generic tag classes. But do you think, with so many normal tags not registered with HtmlParser, it would be time to get them incorporated in the set of tags the node factory registers? I mean, the HTML tags - MAP, AREA, TBODY, THEAD etc. - are very common in HTML content and I think it would be good to have them registered, so that people can work on them.

Would like to hear your comments on the same.
Thanks,

--
Madhur Kumar Tanwani <http://madhurtanwani.googlepages.com>
|