Thread: [Htmlparser-user] Failure parsing html with StringBean
From: Mark S. <htm...@ey...> - 2006-06-06 10:30:03
Hi,

I'm using StringBean to extract strings from a given HTML source. The following markup causes htmlparser to return the two pieces of text as one connected string:

    <td class="yes">
        <strong>Organisationseinheiten</strong>
    </td>
    $[weblogEnabled$
    <td class="no">

Returned:

    Organisationseinheiten $[weblogEnabled$

But it should be:

    Organisationseinheiten

    $[weblogEnabled$

Can someone give me a hint as to which part of StringBean causes this?

Thanks a lot

From: Mark S. <htm...@ey...> - 2006-06-06 10:59:57
I added a System.out call before collapsing the string and got the following hint:

    Txt (3664[96,78],3672[96,86]): Personen
    Txt (3676[96,90],3683[97,2]): \t\t\t\n\t\t
    Txt (3688[97,7],3697[98,7]): \n \t
    Txt (3712[98,22],3720[100,2]): \n\t\t\n\t\t
    Txt (3797[100,79],3805[100,87]): Projekte
    Txt (3809[100,91],3816[101,2]): \t\t\t\n\t\t
    Txt (3821[101,7],3830[102,7]): \n \t
    Txt (3846[102,23],3850[103,2]): \n\t\t
    Txt (3858[103,10],3880[103,32]): Organisationseinheiten
    Txt (3889[103,41],3900[105,2]): \n\t\t\t\t\t\n\t\t
    Txt (3905[105,7],3934[107,7]): \n\t\t$[weblogEnabled$\n \t

The output from these lines after collapse() is:

    Personen
    Projekte
    Organisationseinheiten $[weblogEnabled$

The "failure" (I don't know if it is a failure at all) should be in collapse().

From: Derrick O. <Der...@Ro...> - 2006-06-06 11:57:12
Mark,

A newline is only inserted in the output if the tag breaks the normal flow of text. The list of tags that do this is taken from the HTML specification and is encoded in the org.htmlparser.nodes.TagNode class as the breakTags list.

The StringBean processing is driven by the tags that are encountered. If it doesn't see a tag that causes a break, none is emitted.

Since the text $[weblogEnabled$ is outside of any TD tag in a table, an argument could be made that it shouldn't print at all. But if your browser prints something and inserts a newline, an argument could also be made to change the StringBean so that it assumes a break is pending *after* tags that break the flow, and outputs newlines accordingly. I fear this would cause more problems than it solves, though.

Presumably this 'dollar text' will be substituted by some server-side processing into a real <TD>xxxx</TD> section; perhaps the parser should be applied after this processing.

Derrick

From: Mark S. <htm...@ey...> - 2006-06-06 12:25:15
Thanks Derrick,

I should add that I've removed the breaksFlow() check. I add a carriage return after every segment (the text between brackets). Later I save the segments in a file (key - value).

My intention is to extract all strings from a given HTML file, write them into a file, and replace these strings with other values (translation).

The problem is that if "Organisationseinheiten $[weblogEnabled$" is recognized as one connected segment, it is not possible to replace it with the translation in a second run. Is that understandable? :)

P.S.: It is important that the parser can process the templates containing these $...$ substitutions.

Thanks a lot

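The second pass described above (looking up each extracted segment and substituting its translation) could be sketched roughly as follows. This is only an illustration under assumptions: the class and method names are hypothetical, the key-value file is modelled as an in-memory map, and plain String.replace is used, which would also touch matching text inside tags or attributes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class TranslationSketch {
    // Replace every extracted segment that occurs verbatim in the template
    // with its translation. Keys that do not occur are simply ignored.
    static String translate(String template, Map<String, String> translations) {
        String result = template;
        for (Map.Entry<String, String> entry : translations.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String template =
            "<td><strong>Organisationseinheiten</strong></td> $[weblogEnabled$";

        // Hypothetical translation table (key = extracted segment).
        Map<String, String> translations = new LinkedHashMap<>();
        translations.put("Organisationseinheiten", "Organizational units");

        System.out.println(translate(template, translations));
    }
}
```

A merged key such as "Organisationseinheiten $[weblogEnabled$" never occurs verbatim in the template, because tags sit between the two pieces, so the lookup finds nothing to replace. That is why the segments need to be extracted separately.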
From: Derrick O. <Der...@Ro...> - 2006-06-06 23:26:31
If you don't care how many carriage returns are present in the output, just output one after processing each tag in visitTag() and visitEndTag().

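To make the difference between the two behaviours in this thread concrete, here is a small stand-alone model. It deliberately does not use htmlparser: the regex-based tag scanner, the class name, and the abbreviated BREAK_TAGS set are all assumptions for illustration, and the "break only on flow-breaking start tags" rule is a simplified reading of the explanation given earlier, not the library's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SegmentSketch {
    // Matches any start or end tag and captures the tag name.
    private static final Pattern TAG =
        Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Hypothetical, abbreviated set of flow-breaking tags.
    private static final Set<String> BREAK_TAGS =
        new HashSet<>(Arrays.asList("TD", "TH", "P", "DIV", "BR", "LI"));

    // breakAfterEveryTag = false: break only at flow-breaking start tags.
    // breakAfterEveryTag = true:  break at every start and end tag.
    static List<String> segments(String html, boolean breakAfterEveryTag) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            append(current, html.substring(pos, m.start()));
            boolean startTag = html.charAt(m.start() + 1) != '/';
            String name = m.group(1).toUpperCase(Locale.ROOT);
            if (breakAfterEveryTag || (startTag && BREAK_TAGS.contains(name))) {
                flush(current, out);
            }
            pos = m.end();
        }
        append(current, html.substring(pos));
        flush(current, out);
        return out;
    }

    // Collapse runs of whitespace and join text pieces with a single space.
    private static void append(StringBuilder current, String text) {
        String collapsed = text.trim().replaceAll("\\s+", " ");
        if (collapsed.isEmpty()) return;
        if (current.length() > 0) current.append(' ');
        current.append(collapsed);
    }

    // Emit the pending segment, if any, and start a new one.
    private static void flush(StringBuilder current, List<String> out) {
        if (current.length() > 0) {
            out.add(current.toString());
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        String html = "<td class=\"yes\"> <strong>Organisationseinheiten</strong> </td>"
            + " $[weblogEnabled$ <td class=\"no\">";
        System.out.println(segments(html, false)); // [Organisationseinheiten $[weblogEnabled$]
        System.out.println(segments(html, true));  // [Organisationseinheiten, $[weblogEnabled$]
    }
}
```

In the first mode the closing </td> does not end the segment, so the dollar text is appended to the preceding cell text. In the second mode every tag ends the pending segment, which yields the two separate strings asked for in the original question.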
From: Mark S. <htm...@ey...> - 2006-06-07 10:41:30
Thank you, this works fine :)

This is not an htmlparser question, but do you know how to run multiple test classes with JUnit4TestAdapter?

    return new JUnit4TestAdapter(SegmentFindingVisitorTest.class);

From: Derrick O. <Der...@Ro...> - 2006-06-07 11:55:54
There was the concept of a suite in JUnit 3.8; maybe it's something like that.

From: Mark S. <htm...@ey...> - 2006-06-07 12:53:15
Do you have any idea how to recursively pass a list of files in a directory to the StringBean or any given visitor?

From: Ian M. <ian...@gm...> - 2006-06-07 22:13:24
The File class in Java has a method that gets you a list of all File objects in that directory. The rest should be easy.

Ian
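A minimal sketch of that idea, using java.io.File.listFiles() and plain recursion. The class name, the collect helper, and the ".html" suffix filter are assumptions made for illustration; what is done with each file (feeding it to the StringBean or to a visitor) is left as a comment.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class HtmlFileWalker {
    // Recursively collect all regular files below dir whose name ends
    // with the given suffix (compared case-insensitively).
    static List<File> collect(File dir, String suffix) {
        List<File> found = new ArrayList<>();
        File[] entries = dir.listFiles(); // null if dir is not a readable directory
        if (entries == null) {
            return found;
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                found.addAll(collect(entry, suffix));
            } else if (entry.getName().toLowerCase(Locale.ROOT).endsWith(suffix)) {
                found.add(entry);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        for (File file : collect(root, ".html")) {
            // Hand each file to the StringBean or visitor of your choice here.
            System.out.println(file.getAbsolutePath());
        }
    }
}
```

The null check matters because listFiles() returns null rather than an empty array when the path is not a directory or cannot be read.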