Re: [Htmlparser-developer] character encoding (for Derrick)
From: Derrick O. <Der...@ro...> - 2003-03-30 21:19:53
Somik,
The capability to setEncoding() is already there, so API users who know
the character set can set it before parsing (caveat: there is no test
case for this, so it may not work). This can happen before or after the
connection is opened, but in the latter case will cause an input stream
reset. In the former case the setting will be overwritten by the
incoming HTTP and HTML header values if they are there and differ from
what's set. One possible enhancement would be to not allow the headers
to override the character set if it's been set via the API, which
assumes the user knows what they are doing. It's most likely that users
don't know the character set though, either from the HTTP or HTML
header, so automatic handling by default is best.
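
A minimal sketch of the possible enhancement described above, in which a character set set via the API is not overridden by the headers. EncodingPolicy, onHeaderCharset() and the ISO-8859-1 default are hypothetical names and assumptions for illustration only; this is not the parser's actual code or API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of the htmlparser API. */
public class EncodingPolicy {
    // Assumed default when nothing else is known.
    private Charset charset = StandardCharsets.ISO_8859_1;
    // True once the API user has chosen a character set explicitly.
    private boolean setByUser = false;

    /** Called by an API user who knows the character set. */
    public void setEncoding(Charset cs) {
        charset = cs;
        setByUser = true;
    }

    /**
     * Called when an HTTP or HTML header announces a character set.
     * Returns true if the setting changed, meaning the input stream
     * would have to be reset and re-read with the new character set.
     */
    public boolean onHeaderCharset(Charset announced) {
        if (setByUser || announced.equals(charset)) {
            return false; // keep the current setting, no reset needed
        }
        charset = announced; // automatic handling: the header wins
        return true;
    }

    public Charset getEncoding() {
        return charset;
    }

    public static void main(String[] args) {
        EncodingPolicy automatic = new EncodingPolicy();
        System.out.println(automatic.onHeaderCharset(StandardCharsets.UTF_8));

        EncodingPolicy pinned = new EncodingPolicy();
        pinned.setEncoding(StandardCharsets.UTF_16);
        System.out.println(pinned.onHeaderCharset(StandardCharsets.UTF_8));
    }
}
```

With nothing set, the header value is adopted and a reset is signalled; once setEncoding() has been called, the header is ignored and no reset is needed.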
I noted that four tests are failing after last week's integration:
testScriptCodeExtraction
testScriptCodeExtractionWithMultipleQuotes
testStringBeanListener
testThreadSafety
I'm not sure about the others, but the bean listener test is failing
because the handling of tables has changed. The test is to show that
extracted text contains the link URLs when the links property is set to
true. It would now have to dig into the table tags to find the link.
I'm looking at collectInto() but can't see how to collect string and
link tags so the links can be inserted in context into the text (where
they are found). I'm also wrestling with the issue of handling
<pre></pre>, since collectInto() doesn't seem to be able to give that
kind of information. I guess collectInto() is too blunt a tool.
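
To make the requirement concrete, here is a rough sketch of the kind of traversal that would be needed: a depth-first walk that emits text and link URLs in document order, reaches links nested inside table tags, and leaves whitespace alone inside <pre>. The Node class, the extract() method and the "<url>" output format are assumptions made up for this illustration; they are not the parser's classes and not how collectInto() works.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical node model for illustration; not the htmlparser classes. */
public class InContextExtractor {
    static class Node {
        final String tag;   // null for a text node
        final String value; // text for a text node, href for an "a" tag, else null
        final List<Node> children = new ArrayList<>();

        Node(String tag, String value) {
            this.tag = tag;
            this.value = value;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Appends the text under n to out, in document order. */
    static void extract(Node n, boolean links, boolean inPre, StringBuilder out) {
        if (n.tag == null) {
            // Inside <pre> keep whitespace as is; elsewhere collapse runs of it.
            out.append(inPre ? n.value : n.value.replaceAll("\\s+", " "));
            return;
        }
        boolean pre = inPre || n.tag.equals("pre");
        for (Node child : n.children) {
            extract(child, links, pre, out); // recursion digs into tables too
        }
        if (links && n.tag.equals("a")) {
            // The URL is emitted right after the link text, where it was found.
            out.append('<').append(n.value).append('>');
        }
    }

    public static void main(String[] args) {
        Node link = new Node("a", "http://example.com/").add(new Node(null, "home"));
        Node table = new Node("table", null)
            .add(new Node("tr", null).add(new Node("td", null).add(link)));
        StringBuilder out = new StringBuilder();
        extract(table, true, false, out);
        System.out.println(out);
    }
}
```

The point of the sketch is that both the links flag and the <pre> state travel down the recursion with each node, which is the context a flat collection of matching tags does not carry.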
Derrick
Somik Raha wrote:
> Hi Derrick,
> Continuing our earlier discussion, I've had an idea: instead of
> re-establishing an input stream, suppose we assume that the parser can
> be initialized with a character set, and we use that.
> We could have both strategies in there.
>
> By the way, quite a few steps are failing - I'm guessing that you're
> actively working on those - let me know if there are any issues if I make
> an integration release this week (in case you don't finish).
>
> Regards,
> Somik