Re: [Htmlparser-user] Failure parsing html with StringBean

Thanks Derrick,

I should add that I have removed the breaksFlow() check. I now add a
carriage return after every segment (the text between the tag brackets)
and later save the segments in a key-value file.

My intention is to extract all strings from a given HTML file, write them
into a file, and replace these strings with other values (translation).

The problem is that if "Organisationseinheiten $[weblogEnabled$" is
recognized as one connected segment, it cannot be replaced with the
translation in a second run. Does that make sense? :)
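
One possible workaround outside the parser is to split each extracted
segment on the placeholder after extraction. The following is only a
minimal sketch: PlaceholderSplitter is a hypothetical helper, not part of
htmlparser, and it assumes that every placeholder starts with "$[", ends
at the next "$", and contains no whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of htmlparser. */
final class PlaceholderSplitter {

    // Assumption: a placeholder looks like "$[name$" and has no whitespace inside.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\[[^$\\s]+\\$");

    /** Splits one extracted segment into plain-text parts and placeholders. */
    static List<String> split(String segment) {
        List<String> parts = new ArrayList<>();
        Matcher m = PLACEHOLDER.matcher(segment);
        int last = 0;
        while (m.find()) {
            // Text before the placeholder becomes its own segment.
            addIfNotBlank(parts, segment.substring(last, m.start()));
            // The placeholder itself is kept unchanged as a separate segment.
            parts.add(m.group());
            last = m.end();
        }
        // Whatever follows the last placeholder (or the whole text if none matched).
        addIfNotBlank(parts, segment.substring(last));
        return parts;
    }

    private static void addIfNotBlank(List<String> parts, String text) {
        String trimmed = text.trim();
        if (!trimmed.isEmpty()) {
            parts.add(trimmed);
        }
    }

    public static void main(String[] args) {
        for (String part : split("Organisationseinheiten $[weblogEnabled$")) {
            System.out.println(part);
        }
    }
}
```

With this, the translatable text and the placeholder end up as separate
entries, so only the first one would need a key in the translation file.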

P.S.: It is important that the parser can parse the templates with these
$$ substitutions.

thanks a lot

Derrick Oswald wrote:
> Mark,
> 
> A newline is only inserted in the output if the tag breaks the normal 
> flow of text.
> The list of tags that do this is from the HTML specification and is 
> encoded in the org.htmlparser.nodes.TagNode class as the breakTags list.
> 
> The StringBean processing is driven by the tags that are encountered. If 
> it doesn't see a tag that causes a break, none is emitted.
>  
> Since the text $[weblogEnabled$ is outside of any TD tag in a table, an 
> argument could be made that it shouldn't print at all, but if your 
> browser prints something and it inserts a newline, an argument could 
> also be made to change the operation of the StringBean to assume that a 
> break is pending *after* tags that break the flow, and output newlines 
> accordingly. I fear this would cause more problems than it solves though.
> 
> Presumably this 'dollar text' will be substituted by some server side 
> processing into a real <TD>xxxx</TD> section, so perhaps the parser should 
> be applied after this processing.
> 
> Derrick
> 
> Mark Stark wrote:
> 
>> I added a System.out call before collapsing the string and got the following output:
>>
>> Txt (3664[96,78],3672[96,86]): Personen
>> Txt (3676[96,90],3683[97,2]): \t\t\t\n\t\t
>> Txt (3688[97,7],3697[98,7]): \n      \t
>> Txt (3712[98,22],3720[100,2]): \n\t\t\n\t\t
>> Txt (3797[100,79],3805[100,87]): Projekte
>> Txt (3809[100,91],3816[101,2]): \t\t\t\n\t\t
>> Txt (3821[101,7],3830[102,7]): \n      \t
>> Txt (3846[102,23],3850[103,2]): \n\t\t
>> Txt (3858[103,10],3880[103,32]): Organisationseinheiten
>> Txt (3889[103,41],3900[105,2]): \n\t\t\t\t\t\n\t\t
>> Txt (3905[105,7],3934[107,7]): \n\t\t$[weblogEnabled$\n      \t
>>
>> The output from these lines after collapse() is
>> Personen
>> Projekte
>> Organisationseinheiten $[weblogEnabled$
>>
>> The "failure" (I don't know if it is a failure at all) should be in
>> collapse().
>>
>>
>> Mark Stark wrote:
>>  
>>
>>> Hi,
>>>
>>> I am using StringBean to extract strings from a given HTML source. This
>>> code causes htmlparser to recognize only one connected string:
>>>
>>> <td class="yes">
>>> 	<strong>Organisationseinheiten</strong>				
>>> </td>
>>> 	$[weblogEnabled$
>>> <td class="no">
>>>
>>> returned: Organisationseinheiten $[weblogEnabled$
>>>
>>> But it should be
>>>
>>> Organisationseinheiten
>>>
>>> $[weblogEnabled$
>>>
>>> Can someone give me a hint which part of StringBean causes this?
>>>
>>> thanks a lot
>>>
>>>
>>>
>>> _______________________________________________
>>> Htmlparser-user mailing list
>>> Htm...@li...
>>> https://lists.sourceforge.net/lists/listinfo/htmlparser-user