Re: [Htmlparser-developer] HTML Comments/Remarks

Hi Somik,

Somik Raha wrote:

>>Thanks for the help.  I think I would like to see the
>>toPlainTextString() method remain.  Although I'm not quite sure of the
>>difference between HTMLRemarkNode.toString and
>>HTMLRemarkNode.toPlainTextString.
>>    
>>
>
>This is actually based on your suggestion (eons back..) -
>toPlainTextString() is the uniform way of getting string representation of a
>page - meaningful and hopefully semantic data. I think you'd probably want
>to use toPlainTextString() instead of toString() - as toString() always
>gives some output for all the tags, while toPlainTextString() works only for
>specific ones like string nodes, link text and strings inside forms. It was
>also enabled earlier for comments, but was taken out last week. I am
>thinking of putting it back in. What this will mean is that if folks have
>commented tags - you will get that sort of data in your string filter. I
>think you can live with that (?)
>
>Also - I am thinking of a better approach - wherein, should one require pure
>strings within a comment, one could create a new parser, that operates on
>the contents of the string node (it would be an interesting approach to
>try..)
>
I'm not sure that I'm following you. But then it's late here ....

It would seem that, whatever other considerations there might be, one
would want to have some method on HTMLRemarkNode that allows you to grab
the pure unadulterated text of the remark without anything else.

The HTMLRemarkNode.toString() method I'm using now seems to be prepending
the string "Comment Tag :" to the string that is returned.
It's nice to have convenience methods to pretty print things.  But
shouldn't the two default methods on any node be to:

1. return the original HTML
2. return the text appearing within it that is not a default part of the tag

Naturally there will be variation depending on the node, but it seems
odd to have prettified print responses as the default (maybe they're not
and I'm just getting confused). Ideally the prettified output would be
requested with a parameter or a special method like prettyPrint().
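
To make the idea concrete, here is a rough sketch of the split I have in
mind. The names SketchNode, SketchRemarkNode and toHtml() are made up for
illustration and are not the actual HTMLParser classes; only
toPlainTextString() and prettyPrint() echo names from this thread, and the
"Comment Tag :" label is simply what I see in the current toString() output.

```java
// Hypothetical sketch: these types are NOT the real HTMLParser classes.
interface SketchNode {
    /** Returns the original HTML for this node. */
    String toHtml();

    /** Returns only the text inside the node, without any tag syntax. */
    String toPlainTextString();

    /** Returns a labelled, human-readable form, e.g. for debugging. */
    String prettyPrint();
}

final class SketchRemarkNode implements SketchNode {
    // Everything that appeared between "<!--" and "-->" in the page.
    private final String text;

    SketchRemarkNode(String text) {
        this.text = text;
    }

    @Override
    public String toHtml() {
        return "<!--" + text + "-->";
    }

    @Override
    public String toPlainTextString() {
        return text;
    }

    @Override
    public String prettyPrint() {
        // The label is only added when pretty printing is asked for.
        return "Comment Tag : " + text;
    }
}
```

With something like this, a caller who only wants the remark text never has
to strip a label off the front of the result.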

I'm not sure what the downside is to having a toPlainTextString() call 
in the HTMLRemarkNode.  Remember, I don't have such a wonderful
understanding of the HTMLParser itself.  For example I'm not sure what 
you mean when you say that the remark text data would appear in your 
string filter.  I'm not sure what a string filter is ...   At the moment 
it seems I have to explicitly check for HTMLRemarkNodes and then process 
them if I want to ....
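
For what it's worth, the explicit check I mean looks roughly like the
following. DemoNode, DemoRemarkNode and RemarkCollector are hypothetical
stand-ins, not the real HTMLParser types, so treat it as a sketch of the
pattern only:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins, NOT the real HTMLParser classes.
class DemoNode {
    final String text;

    DemoNode(String text) {
        this.text = text;
    }
}

class DemoRemarkNode extends DemoNode {
    DemoRemarkNode(String text) {
        super(text);
    }
}

final class RemarkCollector {
    /** Collects the text of remark nodes only, skipping every other node. */
    static List<String> remarkTexts(List<DemoNode> nodes) {
        List<String> out = new ArrayList<>();
        for (DemoNode node : nodes) {
            // The explicit type check: remarks are singled out by hand.
            if (node instanceof DemoRemarkNode) {
                out.add(node.text);
            }
        }
        return out;
    }
}
```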

CHEERS> SAM