Thread: [Htmlparser-user] Failure parsing html with StringBean


[Htmlparser-user] Failure parsing html with StringBean

From: Mark S. <htm...@ey...> - 2006-06-06 10:30:03

Hi,

I'm using StringBean to extract strings from a given HTML source. This
code causes htmlparser to recognize only one connected string:

<td class="yes">
	<strong>Organisationseinheiten</strong>				
</td>
	$[weblogEnabled$
<td class="no">

returned: Organisationseinheiten $[weblogEnabled$

But it should be

Organisationseinheiten

$[weblogEnabled$

Can someone give me a hint which part of StringBean causes this?

thanks a lot

Re: [Htmlparser-user] Failure parsing html with StringBean

From: Mark S. <htm...@ey...> - 2006-06-06 10:59:57

I added a System.out before collapsing the string and got the following hint:

Txt (3664[96,78],3672[96,86]): Personen
Txt (3676[96,90],3683[97,2]): \t\t\t\n\t\t
Txt (3688[97,7],3697[98,7]): \n      \t
Txt (3712[98,22],3720[100,2]): \n\t\t\n\t\t
Txt (3797[100,79],3805[100,87]): Projekte
Txt (3809[100,91],3816[101,2]): \t\t\t\n\t\t
Txt (3821[101,7],3830[102,7]): \n      \t
Txt (3846[102,23],3850[103,2]): \n\t\t
Txt (3858[103,10],3880[103,32]): Organisationseinheiten
Txt (3889[103,41],3900[105,2]): \n\t\t\t\t\t\n\t\t
Txt (3905[105,7],3934[107,7]): \n\t\t$[weblogEnabled$\n      \t

The output from these lines after collapse() is:
Personen
Projekte
Organisationseinheiten $[weblogEnabled$

The "failure" (I don't know if it's a failure at all) should be in
collapse().
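
A minimal sketch of why the two strings end up joined, assuming collapse()
reduces every run of whitespace (including newlines and tabs) to a single
space. CollapseSketch and its methods are hypothetical stand-ins written for
illustration; this is not the library's code.

```java
final class CollapseSketch {
    /** Appends text to out, turning every run of whitespace into one space. */
    static void collapse(StringBuilder out, String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                // Emit at most one space, and never at the very start.
                int n = out.length();
                if (n > 0 && out.charAt(n - 1) != ' ') {
                    out.append(' ');
                }
            } else {
                out.append(c);
            }
        }
    }

    /** Collapses a sequence of text nodes into one string. */
    static String join(String... textNodes) {
        StringBuilder out = new StringBuilder();
        for (String node : textNodes) {
            collapse(out, node);
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // The last three text nodes from the dump above.
        System.out.println(join(
                "Organisationseinheiten",
                "\n\t\t\t\t\t\n\t\t",
                "\n\t\t$[weblogEnabled$\n      \t"));
        // prints: Organisationseinheiten $[weblogEnabled$
    }
}
```

Under this assumption the newlines between the two text nodes are treated like
any other whitespace, so nothing in collapse() itself can separate them.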



Re: [Htmlparser-user] Failure parsing html with StringBean

From: Derrick O. <Der...@Ro...> - 2006-06-06 11:57:12

Mark,

A newline is only inserted in the output if the tag breaks the normal 
flow of text.
The list of tags that do this is from the HTML specification and is 
encoded in the org.htmlparser.nodes.TagNode class as the breakTags list.

The StringBean processing is driven by the tags that are encountered. If 
it doesn't see a tag that causes a break, none is emitted.
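
A simplified model of this tag-driven behaviour, using only the Java standard
library. It assumes that a break is emitted only when an opening tag from the
break list is encountered; the BREAK_TAGS set below is an illustrative subset
chosen for this sketch, and the regex tokenizer is a stand-in for the real
parser. None of these names come from the library.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BreakTagSketch {
    // Illustrative subset only (assumption); the real list lives in the library.
    static final Set<String> BREAK_TAGS = Set.of("TD", "TH", "P", "DIV", "BR", "LI");

    // Matches one tag; group 1 is "/" for an end tag, group 2 is the tag name.
    static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    static String extract(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(html);
        int pos = 0;
        while (m.find()) {
            appendText(out, html.substring(pos, m.start()));
            boolean opening = m.group(1).isEmpty();
            String name = m.group(2).toUpperCase(Locale.ROOT);
            if (opening && BREAK_TAGS.contains(name)) {
                newline(out); // only a flow-breaking opening tag emits a break
            }
            pos = m.end();
        }
        appendText(out, html.substring(pos));
        return out.toString().trim();
    }

    // Appends text with whitespace runs collapsed to single spaces.
    static void appendText(StringBuilder out, String text) {
        String t = text.trim().replaceAll("\\s+", " ");
        if (t.isEmpty()) return;
        int n = out.length();
        if (n > 0 && out.charAt(n - 1) != '\n') out.append(' ');
        out.append(t);
    }

    // Appends a newline unless the output is empty or already ends in one.
    static void newline(StringBuilder out) {
        int n = out.length();
        if (n > 0 && out.charAt(n - 1) != '\n') out.append('\n');
    }
}
```

With the snippet from the first message, no break-causing opening tag sits
between the two text nodes (only `</td>`), so this model also returns
"Organisationseinheiten $[weblogEnabled$" on one line.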

Since the text $[weblogEnabled$ is outside of any TD tag in a table, an 
argument could be made that it shouldn't print at all, but if your 
browser prints something and it inserts a newline, an argument could 
also be made to change the operation of the StringBean to assume that a 
break is pending *after* tags that break the flow, and output newlines 
accordingly. I fear this would cause more problems than it solves though.

Presumably this 'dollar text' will be substituted by some server-side 
processing into a real <TD>xxxx</TD> section. Perhaps the parser should 
be applied after this processing.

Derrick


Re: [Htmlparser-user] Failure parsing html with StringBean

From: Mark S. <htm...@ey...> - 2006-06-06 12:25:15

Thanks Derrick,

I should add that I've removed the breaksFlow() statement. I add a
carriageReturn after every segment (the text between brackets), and later
save the segments in a file (key - value).

My intention is to extract all strings from a given HTML file, write them
into a file, and replace these strings with other values (translation).

The problem is that if "Organisationseinheiten $[weblogEnabled$" is
recognized as one connected segment, it is not possible to replace it in
a second run with the translation. Is it understandable? :)
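
The replacement problem can be sketched as follows, assuming the second run
does a plain verbatim search for each extracted segment in the original
template. TranslateSketch, the sample template, and the sample translation are
all made up for illustration.

```java
import java.util.Map;

final class TranslateSketch {
    /** Replaces every extracted segment that occurs verbatim in the template. */
    static String translate(String template, Map<String, String> translations) {
        String result = template;
        for (Map.Entry<String, String> e : translations.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String template =
                "<td><strong>Organisationseinheiten</strong></td>\n$[weblogEnabled$";

        // Joined key: never occurs verbatim, because tags sit between the parts.
        String joined = translate(template,
                Map.of("Organisationseinheiten $[weblogEnabled$", "Organisational units"));
        System.out.println(joined.equals(template)); // prints: true (nothing replaced)

        // Separate key: found and replaced; the placeholder is left alone.
        System.out.println(translate(template,
                Map.of("Organisationseinheiten", "Organisational units")));
    }
}
```

The joined key fails because the markup between the two parts is not part of
the extracted string, which is why each segment needs to be its own key.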

P.S.: It is important that the parser can process the templates with these
$$ substitutions.

thanks a lot




Re: [Htmlparser-user] Failure parsing html with StringBean

From: Derrick O. <Der...@Ro...> - 2006-06-06 23:26:31

If you don't care how many carriage returns are present in the output, 
just output one after processing each tag in visitTag() and visitEndTag().
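
A rough sketch of that idea, using only the Java standard library so it runs
without htmlparser on the classpath. The callbacks are named after the visitor
methods mentioned above, but EveryTagBreaks, visitText(), and the regex
tokenizer are hypothetical stand-ins for this example, not the library's API.
With the real library the same change would go into the visitTag() and
visitEndTag() overrides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EveryTagBreaks {
    private final List<String> segments = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();

    // Every tag, opening or closing, ends the current segment.
    void visitTag(String name) { carriageReturn(); }
    void visitEndTag(String name) { carriageReturn(); }

    // Text is appended with whitespace runs collapsed to single spaces.
    void visitText(String text) {
        String t = text.trim().replaceAll("\\s+", " ");
        if (t.isEmpty()) return;
        if (current.length() > 0) current.append(' ');
        current.append(t);
    }

    // Closes the current segment; empty segments are dropped.
    private void carriageReturn() {
        if (current.length() > 0) {
            segments.add(current.toString());
            current.setLength(0);
        }
    }

    /** Tokenizes html with a simple regex and drives the callbacks. */
    static List<String> segments(String html) {
        EveryTagBreaks v = new EveryTagBreaks();
        Matcher m = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>").matcher(html);
        int pos = 0;
        while (m.find()) {
            v.visitText(html.substring(pos, m.start()));
            if (m.group(1).isEmpty()) {
                v.visitTag(m.group(2));
            } else {
                v.visitEndTag(m.group(2));
            }
            pos = m.end();
        }
        v.visitText(html.substring(pos));
        v.carriageReturn();
        return v.segments;
    }
}
```

For the snippet from the first message this yields two segments,
"Organisationseinheiten" and "$[weblogEnabled$", because `</strong>` and
`</td>` now end a segment as well.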


Re: [Htmlparser-user] Failure parsing html with StringBean

From: Mark S. <htm...@ey...> - 2006-06-07 10:41:30

Thank you, this works fine :)

This is not an htmlparser question, but do you know how to run multiple
test classes with JUnit4TestAdapter? So far I have:

return new JUnit4TestAdapter(SegmentFindingVisitorTest.class);


Re: [Htmlparser-user] JUnit 4

From: Derrick O. <Der...@Ro...> - 2006-06-07 11:55:54

There was the concept of a suite in JUnit 3.8; maybe it's something 
like that.


Re: [Htmlparser-user] Failure parsing html with StringBean

From: Mark S. <htm...@ey...> - 2006-06-07 12:53:15

Do you have any idea how to recursively pass all the files in a directory
to the StringBean or any given visitor?


Re: [Htmlparser-user] Failure parsing html with StringBean

From: Ian M. <ian...@gm...> - 2006-06-07 22:13:24

The File class in Java has a method, listFiles(), that returns all File
objects in a directory. The rest should be easy.
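
A minimal sketch of the recursion, assuming the templates can be recognized by
a .html or .htm extension (that filter, and the HtmlFileWalker name, are
assumptions made for this example). Each returned File can then be handed to
whatever does the extraction.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class HtmlFileWalker {
    /** Collects all files below dir whose names end in .html or .htm. */
    static List<File> collect(File dir) {
        List<File> found = new ArrayList<>();
        File[] entries = dir.listFiles(); // null if dir is not a readable directory
        if (entries == null) {
            return found;
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                found.addAll(collect(entry)); // descend into subdirectories
            } else {
                String name = entry.getName().toLowerCase(Locale.ROOT);
                if (name.endsWith(".html") || name.endsWith(".htm")) {
                    found.add(entry);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        for (File f : collect(root)) {
            System.out.println(f.getPath());
        }
    }
}
```

The null check matters because listFiles() returns null rather than an empty
array when the path is not a directory or cannot be read.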

Ian
