Re: [Htmlparser-developer] character encoding (for Derrick)
From: Derrick O. <Der...@ro...> - 2003-03-30 21:19:53
Somik,

The capability to setEncoding() is already there, so API users who know the character set can set it before parsing (caveat: there is no test case for this, so it may not work). This can happen before or after the connection is opened, but in the latter case it will cause an input stream reset. In the former case the setting will be overwritten by the incoming HTTP and HTML header values if they are present and differ from what was set.

One possible enhancement would be to not allow the headers to override the character set if it has been set via the API, which assumes the user knows what they are doing. It's most likely that users don't know the character set, though, either from the HTTP or HTML header, so automatic handling by default is best.

I noted that four tests are failing after last week's integration:

testScriptCodeExtraction
testScriptCodeExtractionWithMultipleQuotes
testStringBeanListener
testThreadSafety

I'm not sure about the others, but the bean listener test is failing because the handling of tables has changed. The test is meant to show that the extracted text contains the link URLs when the links property is set to true. It would now have to dig into the table tags to find the link. I'm looking at collectInto() but can't see how to collect string and link tags so that the links can be inserted into the text in context (where they are found). I'm also wrestling with the issue of handling <pre></pre>, since collectInto() doesn't seem to be able to give that kind of information. I guess collectInto() is too blunt a tool.

Derrick

Somik Raha wrote:
> Hi Derrick,
> Continuing our earlier discussion, I've had an idea: instead of
> re-establishing an input stream, suppose we assume that the parser can
> be initialized with a character set, and we use that.
> We could have both strategies in there.
>
> By the way, quite a few steps are failing. I'm guessing that you're
> actively working on those. Let me know if there are any issues if I make
> an integration release this week (in case you don't finish).
>
> Regards,
> Somik
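
The enhancement floated in this thread (a character set supplied through the API is not overridden by the HTTP or HTML headers, while automatic detection stays the default) could look roughly like the sketch below. It is only an illustration, not the parser's code: `EncodingResolver`, `charsetFrom`, and `resolve` are hypothetical names, checking the HTTP header before the HTML meta value is an assumption, and so is the ISO-8859-1 fallback.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the HTMLParser API. */
public class EncodingResolver {
    /** Assumed fallback when nothing declares a character set. */
    public static final String DEFAULT_CHARSET = "ISO-8859-1";

    /** Matches the charset parameter of a Content-Type value. */
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in e.g. "text/html; charset=UTF-8", or null if none. */
    public static String charsetFrom(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /**
     * Picks the character set to decode with.
     * apiCharset is what the caller set explicitly (null if unset); it wins outright.
     * Otherwise the HTTP header is consulted, then the HTML meta value (assumed order).
     */
    public static String resolve(String apiCharset, String httpContentType, String metaContentType) {
        if (apiCharset != null) {
            return apiCharset; // the user is assumed to know what they are doing
        }
        String fromHttp = charsetFrom(httpContentType);
        if (fromHttp != null) {
            return fromHttp;
        }
        String fromMeta = charsetFrom(metaContentType);
        if (fromMeta != null) {
            return fromMeta;
        }
        return DEFAULT_CHARSET;
    }

    public static void main(String[] args) {
        // Explicit setting beats a conflicting HTTP header.
        System.out.println(resolve("Shift_JIS", "text/html; charset=UTF-8", null));
        // No explicit setting: fall back to automatic handling from the headers.
        System.out.println(resolve(null, "text/html", "text/html; charset=windows-1252"));
    }
}
```

With this shape, users who never set an encoding still get the automatic behaviour, and only an explicit setting switches the header override off.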