Re: [Htmlparser-developer] character encoding (for Derrick)
From: Somik R. <so...@ya...> - 2003-03-30 23:25:06
> It's most likely that users don't know the character set though, either
> from the HTTP or HTML header, so automatic handling by default is best.

I agree.

> I noted that four tests are failing after last week's integration:
> testStringBeanListener

This one has been failing for a while.

> I'm not sure about the others, but the bean listener test is failing
> because the handling of tables has changed. The test is to show that
> extracted text contains the link URLs when the links property is set to
> true. It would now have to dig into the table tags to find the link.
> I'm looking at collectInto() but can't see how to collect string and
> link tags so the links can be inserted in context into the text (where
> they are found). I'm also wrestling with the issue of handling
> <pre></pre>, since collectInto() doesn't seem to be able to give that
> kind of information.

I guess collectInto() is too blunt a tool. If you're trying to collect strings AND links and keep them in context, your best bet is to write your own visitor.

> testThreadSafety

Thanks for reporting this - on my end this one's passing. I had left one last variable in TagParser, and I thought it would affect thread safety. So I rigged up that test, but surprisingly it passed every time on my end. Can you send me the failure message? I might need to rework TagParser again.

> testScriptCodeExtraction
> testScriptCodeExtractionWithMultipleQuotes

You can ignore these two - they actually demonstrate a bug which I have no clue about, and I think there's little we can do about it. From my earlier integration release mail (last week):

Thanks are also due to Huang-Chun Yu for uncovering a serious bug with the script scanning mechanism. The parser can currently handle script tags like:

<script>
<!--
code here
-->
</script>

But when the tags are like:

<script>
code here
</script>

the parser is unable to identify the code and treats it like regular tags. Such pages are quite widespread and ought to be supported. I was curious if anyone has ideas on solving this - given the existing design - fresh ideas often lead to a better perspective.

Regards,
Somik
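
One possible direction for the script problem described above, offered only as a minimal sketch: once an opening script tag is seen, treat everything up to the next case-insensitive `</script` as raw text instead of scanning it for tags. The class and method names here (`ScriptBodySketch`, `extractScriptBody`, `indexOfIgnoreCase`) are hypothetical and are not part of the htmlparser API; the sketch also assumes the opening tag's attributes contain no `>` character.

```java
// Hypothetical illustration, not htmlparser code: extract the raw body of the
// first <script> element without interpreting its contents as tags.
final class ScriptBodySketch {

    // Case-insensitive indexOf that keeps indices valid for the original string.
    static int indexOfIgnoreCase(String text, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns the text between the first "<script ...>" and the next "</script",
    // or null when no complete script element is present.
    static String extractScriptBody(String html) {
        int open = indexOfIgnoreCase(html, "<script", 0);
        if (open < 0) {
            return null;
        }
        // Assumption: no '>' appears inside the opening tag's attribute values.
        int tagEnd = html.indexOf('>', open);
        if (tagEnd < 0) {
            return null;
        }
        int bodyStart = tagEnd + 1;
        int close = indexOfIgnoreCase(html, "</script", bodyStart);
        if (close < 0) {
            return null;
        }
        // Everything in between is returned verbatim, including '<' and '>'.
        return html.substring(bodyStart, close);
    }

    public static void main(String[] args) {
        String page = "<SCRIPT>if (a < b) document.write(\"<b>hi</b>\");</SCRIPT><p>text</p>";
        System.out.println(extractScriptBody(page));
    }
}
```

The point of the sketch is that the end of the script is found by searching for the closing tag alone, so markup-like text inside the code (with or without the `<!-- -->` wrapper) is never handed to the tag scanners. How this would fit into the existing scanner design is left open.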