Re: [Htmlparser-user] Failure parsing html with StringBean
From: Derrick O. <Der...@Ro...> - 2006-06-06 11:57:12
Mark,

A newline is only inserted in the output if the tag breaks the normal flow of text. The list of tags that do this is taken from the HTML specification and is encoded in the org.htmlparser.nodes.TagNode class as the breakTags list. The StringBean processing is driven by the tags that are encountered: if it doesn't see a tag that causes a break, no newline is emitted.

Since the text $[weblogEnabled$ is outside of any TD tag in a table, an argument could be made that it shouldn't print at all. But if your browser prints something and inserts a newline, an argument could also be made to change the StringBean so that it assumes a break is pending *after* tags that break the flow, and outputs newlines accordingly. I fear this would cause more problems than it solves, though.

Presumably this 'dollar text' will be substituted by some server-side processing into a real <TD>xxxx</TD> section; perhaps the parser should be applied after this processing.

Derrick

Mark Stark wrote:

>I added a System.out before collapsing the string and got the following hint:
>
>Txt (3664[96,78],3672[96,86]): Personen
>Txt (3676[96,90],3683[97,2]): \t\t\t\n\t\t
>Txt (3688[97,7],3697[98,7]): \n \t
>Txt (3712[98,22],3720[100,2]): \n\t\t\n\t\t
>Txt (3797[100,79],3805[100,87]): Projekte
>Txt (3809[100,91],3816[101,2]): \t\t\t\n\t\t
>Txt (3821[101,7],3830[102,7]): \n \t
>Txt (3846[102,23],3850[103,2]): \n\t\t
>Txt (3858[103,10],3880[103,32]): Organisationseinheiten
>Txt (3889[103,41],3900[105,2]): \n\t\t\t\t\t\n\t\t
>Txt (3905[105,7],3934[107,7]): \n\t\t$[weblogEnabled$\n \t
>
>The output from these lines after collapse() is
>Personen
>Projekte
>Organisationseinheiten $[weblogEnabled$
>
>The "failure" (I don't know if it's a failure at all) should be in
>collapse()
>
>
>Mark Stark wrote:
>
>>hi,
>>
>>I am using StringBean to extract strings from a given HTML source. This
>>code causes htmlparser to recognize only one connected string
>>
>><td class="yes">
>>  <strong>Organisationseinheiten</strong>
>></td>
>>  $[weblogEnabled$
>><td class="no">
>>
>>returned: Organisationseinheiten $[weblogEnabled$
>>
>>But it should be
>>
>>Organisationseinheiten
>>
>>$[weblogEnabled$
>>
>>Can someone give me a hint which part of StringBean causes this?
>>
>>thanks a lot
>>
>>_______________________________________________
>>Htmlparser-user mailing list
>>Htm...@li...
>>https://lists.sourceforge.net/lists/listinfo/htmlparser-user
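
For readers following the thread, here is a rough, self-contained model of the behaviour discussed above. It is not the StringBean source and does not use the htmlparser API. The class name BreakSketch, the extract method, the breakAfterEndTag flag and the short BREAK_TAGS set are all hypothetical, chosen only for illustration. The sketch assumes a newline is emitted when an opening break tag is seen, and the flag switches on the "break pending after the tag" alternative that the reply mentions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class BreakSketch {
    // Illustrative subset only (an assumption); the library keeps its own list.
    private static final Set<String> BREAK_TAGS =
            new HashSet<>(Arrays.asList("TD", "TH", "P", "BR", "DIV", "LI"));

    // Walks the markup, copies text with whitespace collapsed, and starts a
    // new line at break tags. With breakAfterEndTag set, closing break tags
    // also start a new line.
    static String extract(String html, boolean breakAfterEndTag) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < html.length()) {
            int lt = html.indexOf('<', i);
            if (lt < 0) lt = html.length();
            appendCollapsed(out, html.substring(i, lt));
            if (lt == html.length()) break;
            int gt = html.indexOf('>', lt);
            if (gt < 0) break; // unterminated tag: stop scanning
            String tag = html.substring(lt + 1, gt).trim();
            boolean end = tag.startsWith("/");
            String name = (end ? tag.substring(1) : tag).trim()
                    .split("\\s+")[0].toUpperCase(Locale.ROOT);
            if (BREAK_TAGS.contains(name) && (!end || breakAfterEndTag)) {
                newline(out);
            }
            i = gt + 1;
        }
        return out.toString().trim();
    }

    // Appends text, squeezing each whitespace run into a single space and
    // dropping whitespace at the start of the output or of a line.
    private static void appendCollapsed(StringBuilder out, String text) {
        for (char c : text.toCharArray()) {
            if (Character.isWhitespace(c)) {
                int n = out.length();
                if (n > 0 && out.charAt(n - 1) != ' ' && out.charAt(n - 1) != '\n') {
                    out.append(' ');
                }
            } else {
                out.append(c);
            }
        }
    }

    // Ends the current line unless the output is empty or already at a line start.
    private static void newline(StringBuilder out) {
        int n = out.length();
        while (n > 0 && out.charAt(n - 1) == ' ') {
            out.setLength(--n);
        }
        if (n > 0 && out.charAt(n - 1) != '\n') {
            out.append('\n');
        }
    }

    public static void main(String[] args) {
        String html = "<td class=\"yes\">\n  <strong>Organisationseinheiten</strong>\n</td>\n"
                + "  $[weblogEnabled$\n<td class=\"no\">";
        System.out.println(extract(html, false)); // both strings on one line
        System.out.println("---");
        System.out.println(extract(html, true));  // one string per line
    }
}
```

With the flag off, the stray text after </td> stays on the same line as the cell content, which mirrors the result reported in this thread. With the flag on, the two strings land on separate lines, at the cost of extra breaks wherever a closing break tag is followed by more text.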