Re: [Htmlparser-user] Failure parsing html with StringBean
From: Mark S. <htm...@ey...> - 2006-06-07 10:41:30
Thank you, this works fine :)

This is not an htmlparser question, but do you know how to run multiple
test classes with JUnit4TestAdapter? At the moment I only have:

    return new JUnit4TestAdapter(SegmentFindingVisitorTest.class);

Derrick Oswald wrote:
> If you don't care how many carriage returns are present in the output,
> just output one after processing each tag in visitTag() and visitEndTag().
>
> Mark Stark wrote:
>
>> Thanks Derrick,
>>
>> I have to add that I've removed the breaksFlow() statement. I add a
>> carriage return after every segment (the text between tags). I later
>> save the segments in a file (key - value).
>>
>> My intention is to extract all strings from a given HTML file, write
>> them into a file, and replace these strings with other values
>> (translation).
>>
>> The problem is that if "Organisationseinheiten $[weblogEnabled$" is
>> recognized as one connected segment, it is not possible to replace it
>> with the translation in a second run. Is that understandable? :)
>>
>> P.S.: It is important that the parser can parse the templates with
>> these $$ substitutions.
>>
>> Thanks a lot
>>
>> Derrick Oswald wrote:
>>
>>> Mark,
>>>
>>> A newline is only inserted in the output if the tag breaks the normal
>>> flow of text.
>>> The list of tags that do this is from the HTML specification and is
>>> encoded in the org.htmlparser.nodes.TagNode class as the breakTags list.
>>>
>>> The StringBean processing is driven by the tags that are encountered. If
>>> it doesn't see a tag that causes a break, none is emitted.
>>>
>>> Since the text $[weblogEnabled$ is outside of any TD tag in a table, an
>>> argument could be made that it shouldn't print at all, but if your
>>> browser prints something and it inserts a newline, an argument could
>>> also be made to change the operation of the StringBean to assume that a
>>> break is pending *after* tags that break the flow, and output newlines
>>> accordingly. I fear this would cause more problems than it solves though.
>>>
>>> Presumably this 'dollar text' will be substituted by some server side
>>> processing into a real <TD>xxxx</TD> section, perhaps the parser should
>>> be applied after this processing.
>>>
>>> Derrick
>>>
>>> Mark Stark wrote:
>>>
>>>> I added a System.out before collapsing the string and got the
>>>> following hint:
>>>>
>>>> Txt (3664[96,78],3672[96,86]): Personen
>>>> Txt (3676[96,90],3683[97,2]): \t\t\t\n\t\t
>>>> Txt (3688[97,7],3697[98,7]): \n \t
>>>> Txt (3712[98,22],3720[100,2]): \n\t\t\n\t\t
>>>> Txt (3797[100,79],3805[100,87]): Projekte
>>>> Txt (3809[100,91],3816[101,2]): \t\t\t\n\t\t
>>>> Txt (3821[101,7],3830[102,7]): \n \t
>>>> Txt (3846[102,23],3850[103,2]): \n\t\t
>>>> Txt (3858[103,10],3880[103,32]): Organisationseinheiten
>>>> Txt (3889[103,41],3900[105,2]): \n\t\t\t\t\t\n\t\t
>>>> Txt (3905[105,7],3934[107,7]): \n\t\t$[weblogEnabled$\n \t
>>>>
>>>> The output from these lines after collapse() is
>>>> Personen
>>>> Projekte
>>>> Organisationseinheiten $[weblogEnabled$
>>>>
>>>> The "failure" (I don't know if it is a failure at all) should be in
>>>> collapse().
>>>>
>>>> Mark Stark wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm using StringBean to extract strings from a given HTML source. This
>>>>> code causes htmlparser to recognize only one connected string:
>>>>>
>>>>> <td class="yes">
>>>>> <strong>Organisationseinheiten</strong>
>>>>> </td>
>>>>> $[weblogEnabled$
>>>>> <td class="no">
>>>>>
>>>>> returned: Organisationseinheiten $[weblogEnabled$
>>>>>
>>>>> But it should be
>>>>>
>>>>> Organisationseinheiten
>>>>>
>>>>> $[weblogEnabled$
>>>>>
>>>>> Can someone give me a hint which part of StringBean causes this?
>>>>>
>>>>> Thanks a lot
>>>>>
>>>>> _______________________________________________
>>>>> Htmlparser-user mailing list
>>>>> Htm...@li...
>>>>> https://lists.sourceforge.net/lists/listinfo/htmlparser-user
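
For readers following the thread, the difference between the reported output and the "break after every tag" suggestion can be illustrated with a small standalone sketch. This is not StringBean's code and does not use the htmlparser API: the class SegmentSketch and its methods are hypothetical names, tags are matched with a naive regular expression, and whitespace collapsing is reduced to "runs of whitespace become one space". It only models why the two strings join when no break is emitted between them, and how treating every tag as a break separates them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; a simplified model, not the htmlparser implementation.
class SegmentSketch {
    // Naive tag matcher (assumption: no '>' inside attribute values).
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Collapse every run of whitespace into a single space and trim the ends.
    public static String collapse(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // Models the reported behaviour: no tag emits a break, so all text
    // between the tags ends up in one collapsed string.
    public static String withoutBreaks(String html) {
        return collapse(TAG.matcher(html).replaceAll(" "));
    }

    // Models the suggestion in the thread: treat every tag as a break,
    // which yields one segment per non-empty run of text.
    public static List<String> breakAfterEveryTag(String html) {
        List<String> segments = new ArrayList<>();
        for (String piece : TAG.split(html)) {
            String segment = collapse(piece);
            if (!segment.isEmpty()) {
                segments.add(segment);
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        String html = "<td class=\"yes\">\n"
                + "<strong>Organisationseinheiten</strong>\n"
                + "</td>\n"
                + "$[weblogEnabled$\n"
                + "<td class=\"no\">";

        // One connected string, as in the original report.
        System.out.println(withoutBreaks(html));

        // Two separate segments, one per line.
        for (String segment : breakAfterEveryTag(html)) {
            System.out.println(segment);
        }
    }
}
```

With the HTML fragment from the first message, withoutBreaks() gives the single string "Organisationseinheiten $[weblogEnabled$", while breakAfterEveryTag() gives "Organisationseinheiten" and "$[weblogEnabled$" as separate segments, which is the shape needed for a key - value translation file.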