Re: [Htmlparser-user] Failure parsing html with StringBean
From: Derrick O. <Der...@Ro...> - 2006-06-06 23:26:31
If you don't care how many carriage returns are present in the output, just output one after processing each tag in visitTag() and visitEndTag().

Mark Stark wrote:

> Thanks Derrick,
>
> I have to add that I've removed the breaksFlow() statement. I add a
> carriage return after every segment (the text between brackets). I later
> save the segments in a file (key - value).
>
> My intention is to extract all strings from a given HTML file, write them
> into a file, and replace these strings with other values (translation).
>
> The problem is that if "Organisationseinheiten $[weblogEnabled$" is
> recognized as one connected segment, it is not possible to replace it in
> a second run with the translation. Is that understandable? :)
>
> P.S.: It is important that the parser can parse the templates with these
> $$ substitutions.
>
> Thanks a lot
>
> Derrick Oswald wrote:
>
>> Mark,
>>
>> A newline is only inserted in the output if the tag breaks the normal
>> flow of text.
>> The list of tags that do this is from the HTML specification and is
>> encoded in the org.htmlparser.nodes.TagNode class as the breakTags list.
>>
>> The StringBean processing is driven by the tags that are encountered. If
>> it doesn't see a tag that causes a break, none is emitted.
>>
>> Since the text $[weblogEnabled$ is outside of any TD tag in a table, an
>> argument could be made that it shouldn't print at all, but if your
>> browser prints something and it inserts a newline, an argument could
>> also be made to change the operation of the StringBean to assume that a
>> break is pending *after* tags that break the flow, and output newlines
>> accordingly. I fear this would cause more problems than it solves though.
>>
>> Presumably this 'dollar text' will be substituted by some server side
>> processing into a real <TD>xxxx</TD> section; perhaps the parser should
>> be applied after this processing.
>>
>> Derrick
>>
>> Mark Stark wrote:
>>
>>> I added a System.out before collapsing the string and got the
>>> following hint:
>>>
>>> Txt (3664[96,78],3672[96,86]): Personen
>>> Txt (3676[96,90],3683[97,2]): \t\t\t\n\t\t
>>> Txt (3688[97,7],3697[98,7]): \n \t
>>> Txt (3712[98,22],3720[100,2]): \n\t\t\n\t\t
>>> Txt (3797[100,79],3805[100,87]): Projekte
>>> Txt (3809[100,91],3816[101,2]): \t\t\t\n\t\t
>>> Txt (3821[101,7],3830[102,7]): \n \t
>>> Txt (3846[102,23],3850[103,2]): \n\t\t
>>> Txt (3858[103,10],3880[103,32]): Organisationseinheiten
>>> Txt (3889[103,41],3900[105,2]): \n\t\t\t\t\t\n\t\t
>>> Txt (3905[105,7],3934[107,7]): \n\t\t$[weblogEnabled$\n \t
>>>
>>> The output from these lines after collapse() is
>>> Personen
>>> Projekte
>>> Organisationseinheiten $[weblogEnabled$
>>>
>>> The "failure" (I don't know if it's a failure at all) should be in
>>> collapse().
>>>
>>> Mark Stark wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm using StringBean to extract strings from a given HTML source. This
>>>> code causes htmlparser to recognize only one connected string:
>>>>
>>>> <td class="yes">
>>>>    <strong>Organisationseinheiten</strong>
>>>> </td>
>>>>    $[weblogEnabled$
>>>> <td class="no">
>>>>
>>>> returned: Organisationseinheiten $[weblogEnabled$
>>>>
>>>> But it should be
>>>>
>>>> Organisationseinheiten
>>>>
>>>> $[weblogEnabled$
>>>>
>>>> Can someone give me a hint which part of StringBean causes this?
>>>>
>>>> Thanks a lot
>>>>
>>>> _______________________________________________
>>>> Htmlparser-user mailing list
>>>> Htm...@li...
>>>> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
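
To illustrate the suggestion at the top of this message (emit a break after every tag rather than only after flow-breaking ones), here is a minimal standalone sketch in plain Java. It does not call the HTMLParser library at all: the class SegmentExtractor and its extract() method are hypothetical names made up for this example, and the tag scanning is deliberately naive (it assumes no '<' or '>' inside text or attribute values). With the real library the same idea would presumably live in the visitTag() and visitEndTag() overrides mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of HTMLParser.
class SegmentExtractor {

    /**
     * Splits markup into text segments, treating EVERY tag (start or end)
     * as a segment boundary. Whitespace inside a segment is collapsed to
     * single spaces, and empty segments are dropped.
     */
    static List<String> extract(String html) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                if (c == '>') {
                    inTag = false;            // end of the tag
                }
            } else if (c == '<') {
                inTag = true;                 // a tag starts: close the segment
                flush(current, segments);
            } else {
                current.append(c);            // ordinary text
            }
        }
        flush(current, segments);             // trailing text, if any
        return segments;
    }

    // Collapses whitespace and stores the segment if anything is left.
    private static void flush(StringBuilder current, List<String> segments) {
        String text = current.toString().trim().replaceAll("\\s+", " ");
        if (!text.isEmpty()) {
            segments.add(text);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        String html = "<td class=\"yes\">\n"
                + "   <strong>Organisationseinheiten</strong>\n"
                + "</td>\n"
                + "   $[weblogEnabled$\n"
                + "<td class=\"no\">";
        for (String segment : extract(html)) {
            System.out.println(segment);
        }
    }
}
```

Run on the snippet from the original question, this keeps Organisationseinheiten and $[weblogEnabled$ as two separate segments, because the </td> tag now acts as a boundary even though the template text sits outside any TD. The trade-off is that inline tags such as <strong> or <a> also split the text, which may or may not be wanted for translation keys.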