Re: [Htmlparser-user] Failure parsing html with StringBean
From: Mark S. <htm...@ey...> - 2006-06-06 12:25:15
Thanks Derrick,

I should add that I've removed the breaksFlow() statement. I add a carriage return after every segment (text between brackets) and later save the segments in a file (key - value).

My intention is to extract all strings from a given HTML file, write them into a file, and replace these strings with other values (translation). The problem is that if "Organisationseinheiten $[weblogEnabled$" is recognized as one connected segment, it cannot be replaced with the translation in a second run. Does that make sense? :)

P.S.: It is important that the parser can parse the templates with these $$ substitutions.

Thanks a lot

Derrick Oswald wrote:
> Mark,
>
> A newline is only inserted in the output if the tag breaks the normal
> flow of text.
> The list of tags that do this is from the HTML specification and is
> encoded in the org.htmlparser.nodes.TagNode class as the breakTags list.
>
> The StringBean processing is driven by the tags that are encountered. If
> it doesn't see a tag that causes a break, none is emitted.
>
> Since the text $[weblogEnabled$ is outside of any TD tag in a table, an
> argument could be made that it shouldn't print at all, but if your
> browser prints something and it inserts a newline, an argument could
> also be made to change the operation of the StringBean to assume that a
> break is pending *after* tags that break the flow, and output newlines
> accordingly. I fear this would cause more problems than it solves though.
>
> Presumably this 'dollar text' will be substituted by some server side
> processing into a real <TD>xxxx</TD> section; perhaps the parser should
> be applied after this processing.
>
> Derrick
>
> Mark Stark wrote:
>
>> I added a System.out before collapsing the string and got the following output:
>>
>> Txt (3664[96,78],3672[96,86]): Personen
>> Txt (3676[96,90],3683[97,2]): \t\t\t\n\t\t
>> Txt (3688[97,7],3697[98,7]): \n \t
>> Txt (3712[98,22],3720[100,2]): \n\t\t\n\t\t
>> Txt (3797[100,79],3805[100,87]): Projekte
>> Txt (3809[100,91],3816[101,2]): \t\t\t\n\t\t
>> Txt (3821[101,7],3830[102,7]): \n \t
>> Txt (3846[102,23],3850[103,2]): \n\t\t
>> Txt (3858[103,10],3880[103,32]): Organisationseinheiten
>> Txt (3889[103,41],3900[105,2]): \n\t\t\t\t\t\n\t\t
>> Txt (3905[105,7],3934[107,7]): \n\t\t$[weblogEnabled$\n \t
>>
>> The output from these lines after collapse() is
>> Personen
>> Projekte
>> Organisationseinheiten $[weblogEnabled$
>>
>> The "failure" (I don't know if it is a failure at all) should be in
>> collapse().
>>
>> Mark Stark wrote:
>>
>>> Hi,
>>>
>>> I am using StringBean to extract strings from a given HTML source. This
>>> code causes htmlparser to recognize only one connected string:
>>>
>>> <td class="yes">
>>> <strong>Organisationseinheiten</strong>
>>> </td>
>>> $[weblogEnabled$
>>> <td class="no">
>>>
>>> returned: Organisationseinheiten $[weblogEnabled$
>>>
>>> But it should be
>>>
>>> Organisationseinheiten
>>>
>>> $[weblogEnabled$
>>>
>>> Can someone give me a hint which part of StringBean causes this?
>>>
>>> Thanks a lot
>>>
>>> _______________________________________________
>>> Htmlparser-user mailing list
>>> Htm...@li...
>>> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
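
The behaviour discussed in this thread can be illustrated with a small stand-alone Java sketch. It does not use or reproduce the htmlparser code. collapse below is only a simplified model, assuming that whitespace runs are squeezed to a single space and that no newline is emitted when no flow-breaking tag is seen. splitAtPlaceholders is a hypothetical post-processing step that is not part of StringBean, and the placeholder shape ("$[", a name, "$") is an assumption drawn from the single example $[weblogEnabled$. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; none of these names come from htmlparser.
final class SegmentSketch {

    // Assumed placeholder shape: "$[", a name without whitespace or '$', then "$".
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\[[^$\\s]+\\$");

    /** Simplified model: append text nodes, squeezing each whitespace run to one space. */
    static String collapse(List<String> textNodes) {
        StringBuilder out = new StringBuilder();
        for (String text : textNodes) {
            for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                if (!Character.isWhitespace(c)) {
                    out.append(c);
                } else if (out.length() > 0 && out.charAt(out.length() - 1) != ' ') {
                    out.append(' '); // at most one space, and none at the very start
                }
            }
        }
        return out.toString().trim();
    }

    /** Hypothetical workaround: cut a collapsed segment so each placeholder stands alone. */
    static List<String> splitAtPlaceholders(String segment) {
        List<String> parts = new ArrayList<>();
        Matcher m = PLACEHOLDER.matcher(segment);
        int last = 0;
        while (m.find()) {
            addIfNotBlank(parts, segment.substring(last, m.start()));
            parts.add(m.group());
            last = m.end();
        }
        addIfNotBlank(parts, segment.substring(last));
        return parts;
    }

    private static void addIfNotBlank(List<String> parts, String s) {
        String trimmed = s.trim();
        if (!trimmed.isEmpty()) {
            parts.add(trimmed);
        }
    }

    public static void main(String[] args) {
        // The last three Txt nodes from the debug output in this thread.
        String merged = collapse(Arrays.asList(
                "Organisationseinheiten",
                "\n\t\t\t\t\t\n\t\t",
                "\n\t\t$[weblogEnabled$\n \t"));
        System.out.println(merged);
        System.out.println(splitAtPlaceholders(merged));
    }
}
```

With the three text nodes from the log, the model yields the single segment "Organisationseinheiten $[weblogEnabled$", and the split step turns it into two entries that could be written to the key-value file separately. Whether splitting after extraction is preferable to changing how StringBean emits newlines depends on the templates; the sketch only shows that the placeholder can be isolated without touching the parser.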