Re: [Htmlparser-developer] character encoding (for Derrick)
From: Somik R. <so...@ya...> - 2003-03-30 23:25:06
> It's most likely that users
> don't know the character set though, either from the HTTP or HTML
> header, so automatic handling by default is best.
I agree.
> I noted that four tests are failing after last week's integration:
> testStringBeanListener
This one has been failing for a while.
> I'm not sure about the others, but the bean listener test is failing
> because the handling of tables has changed. The test is to show that
> extracted text contains the link URLs when the links property is set to
> true. It would now have to dig into the table tags to find the link.
> I'm looking at collectInto() but can't see how to collect string and
> link tags so the links can be inserted in context into the text (where
> they are found). I'm also wrestling with the issue of handling
> <pre></pre>, since collectInto() doesn't seem to be able to give that
> kind of information. I guess collectInto() is too blunt a tool.
If you're trying to collect strings AND links and keep them in context,
your best bet is to write your own visitor.
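To make the idea concrete, here is a rough sketch of such a visitor. The
node and visitor types below (Node, Visitor, Text, Link, Container,
Collector) are hypothetical stand-ins written for this example, not the
htmlparser classes, and the URL is a placeholder. The point is only that
a visitor sees nodes in document order, so it can append each URL right
where the link occurs, even when the link is nested inside table tags.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical node model for illustration only. None of these types are
// htmlparser classes; they just stand in for string, link and table nodes.
final class LinkContextSketch {

    interface Node { void accept(Visitor v); }

    interface Visitor {
        void visitText(Text text);
        void visitLink(Link link);
    }

    static final class Text implements Node {
        final String value;
        Text(String value) { this.value = value; }
        public void accept(Visitor v) { v.visitText(this); }
    }

    static final class Link implements Node {
        final String url;
        final List<Node> children;
        Link(String url, Node... children) {
            this.url = url;
            this.children = Arrays.asList(children);
        }
        public void accept(Visitor v) { v.visitLink(this); }
    }

    // Stands in for any composite tag (table, row, cell): it only forwards
    // the visitor to its children, in document order.
    static final class Container implements Node {
        final List<Node> children;
        Container(Node... children) { this.children = Arrays.asList(children); }
        public void accept(Visitor v) {
            for (Node child : children) child.accept(v);
        }
    }

    // Appends text as it is met and, if links is true, the URL right after
    // the link text, so each URL stays in context.
    static final class Collector implements Visitor {
        private final StringBuilder out = new StringBuilder();
        private final boolean links;
        Collector(boolean links) { this.links = links; }
        public void visitText(Text text) { out.append(text.value); }
        public void visitLink(Link link) {
            for (Node child : link.children) child.accept(this);
            if (links) out.append('<').append(link.url).append('>');
        }
        String result() { return out.toString(); }
    }

    static String extract(Node root, boolean links) {
        Collector collector = new Collector(links);
        root.accept(collector);
        return collector.result();
    }

    static Node samplePage() {
        // table > cell > "See " + link + " for details."
        return new Container(new Container(
                new Text("See "),
                new Link("http://example.com/docs", new Text("the docs")),
                new Text(" for details.")));
    }
}
```

Calling LinkContextSketch.extract(LinkContextSketch.samplePage(), true)
walks through both containers and places the URL directly after the link
text. Handling <pre></pre> could follow the same pattern with one more
visit method that switches whitespace handling on and off.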
> testThreadSafety
Thanks for reporting this. On my end this one is passing. I had left one
last variable in TagParser, and I thought it would affect thread safety,
so I rigged up that test, but surprisingly it passed every time on my
end. Can you send me the failure message? I might need to rework
TagParser again.
> testScriptCodeExtraction
> testScriptCodeExtractionWithMultipleQuotes
You can ignore these two. They demonstrate a bug that I have no clue
about, and I think there's little we can do about it. From my earlier
integration release mail (last week):
Thanks are also due to Huang-Chun Yu for uncovering a serious bug with
the script scanning mechanism. The parser can currently handle script
tags like:
<script>
<!--
code here
-->
</script>
But when the tags are like:
<script>
code here
</script>
the parser is unable to identify the code and treats it like regular
tags. Such pages are quite widespread and ought to be supported. I was
curious if anyone has ideas on solving this - given the existing
design - fresh ideas often lead to a better perspective.
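One possible direction, as a rough sketch only: once an opening script
tag is seen, stop tokenizing and take everything up to the next
"</script" as raw code, whether or not it is wrapped in <!-- -->. The
class and method names below (ScriptScanSketch, scriptBody,
indexOfIgnoreCase) are made up for this example and are not part of the
parser; how this would fit the existing scanner design is an open
question.

```java
// Hypothetical sketch, not htmlparser code: treat script content as raw
// text that ends at the next "</script", matched case-insensitively.
final class ScriptScanSketch {

    // Case-insensitive indexOf that works on the original string, so the
    // returned index is always valid for substring().
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns the raw text of the first script element, or null if there
    // is no complete one. Assumes no '>' inside the opening tag's
    // attribute values, which a real scanner would have to handle.
    static String scriptBody(String html) {
        int open = indexOfIgnoreCase(html, "<script", 0);
        if (open < 0) return null;
        int tagEnd = html.indexOf('>', open);
        if (tagEnd < 0) return null;
        int bodyStart = tagEnd + 1;
        int close = indexOfIgnoreCase(html, "</script", bodyStart);
        if (close < 0) return null;
        // Everything in between is code, even if it contains '<' or
        // markup-like strings such as "<b>hi</b>".
        return html.substring(bodyStart, close);
    }
}
```

With this approach, code such as document.write("<b>hi</b>") inside an
uncommented script block is returned as one string instead of being
scanned as tags. The known limitation is a literal "</script" inside the
code itself, which would end the block early.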
Regards,
Somik