Re: [Htmlparser-developer] character encoding (for Derrick)

> It's most likely that users
> don't know the character set though, either from the HTTP or HTML
> header, so automatic handling by default is best.

I agree.

> I noted that four tests are failing after last week's integration:

>     testStringBeanListener

This one has been failing for a while.

> I'm not sure about the others, but the bean listener test is failing
> because the handling of tables has changed.  The test is to show that
> extracted text contains the link URLs when the links property is set to
> true. It would now have to dig into the table tags to find the link.
> I'm looking at collectInto() but can't see how to collect string and
> link tags so the links can be inserted in context into the text (where
> they are found).  I'm also wrestling with the issue of handling
> <pre></pre>, since collectInto() doesn't seem to be able to give that
> kind of information. I guess collectInto() is too blunt a tool.

If you're trying to collect strings AND links and keep them in context,
your best bet is to write your own visitor.
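
As a rough illustration of what I mean (plain Java with made-up node
classes, NOT the htmlparser API; Text, Link, Container and Collector are
all hypothetical names), a visitor that walks the tree in document order
can append each link's URL right after the link text, however deeply the
link is nested inside table tags:

```java
import java.util.Arrays;
import java.util.List;

public class LinkTextSketch {
    /** Hypothetical visitor callbacks, one per node kind. */
    interface Visitor {
        void visitText(Text text);
        void visitLink(Link link);
    }

    /** Hypothetical node type; every node accepts a visitor. */
    interface Node {
        void accept(Visitor visitor);
    }

    static final class Text implements Node {
        final String value;
        Text(String value) { this.value = value; }
        public void accept(Visitor visitor) { visitor.visitText(this); }
    }

    static final class Link implements Node {
        final String url;
        final List<Node> children;
        Link(String url, Node... children) {
            this.url = url;
            this.children = Arrays.asList(children);
        }
        public void accept(Visitor visitor) {
            // Visit the link text first, then report the link itself.
            for (Node child : children) child.accept(visitor);
            visitor.visitLink(this);
        }
    }

    /** Stands in for any enclosing tag such as a table, row, or cell. */
    static final class Container implements Node {
        final List<Node> children;
        Container(Node... children) { this.children = Arrays.asList(children); }
        public void accept(Visitor visitor) {
            for (Node child : children) child.accept(visitor);
        }
    }

    /** Collects text and link URLs in the order they are encountered. */
    static final class Collector implements Visitor {
        final StringBuilder out = new StringBuilder();
        public void visitText(Text text) { out.append(text.value); }
        public void visitLink(Link link) { out.append('<').append(link.url).append('>'); }
    }

    static String extract(Node root) {
        Collector collector = new Collector();
        root.accept(collector);
        return collector.out.toString();
    }

    public static void main(String[] args) {
        Node page = new Container(
            new Text("See "),
            new Container(  // nesting that a flat collection pass would miss
                new Link("http://example.org/", new Text("the docs"))),
            new Text(" for details."));
        System.out.println(extract(page));
    }
}
```

Because the traversal itself decides when each callback fires, the same
approach could also track whether it is currently inside a <pre> block,
which is the kind of context collectInto() does not give you.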

>     testThreadSafety

Thanks for reporting this. On my end this one's passing. I had left one
last variable in TagParser, and I thought it would affect thread safety.
So I rigged up that test, but surprisingly it passed every time on my
end. Can you send me the failure message? I might need to rework
TagParser again.

>     testScriptCodeExtraction
>     testScriptCodeExtractionWithMultipleQuotes

You can ignore these two. They actually demonstrate a bug which I have
no clue about, and I think there's little we can do about it. From my
earlier integration release mail (last week):

Thanks are also due to Huang-Chun Yu for uncovering a serious bug with
the script scanning mechanism. The parser can currently handle script
tags like:

<script>
<!--
    code here
-->
</script>

But when the tags are like:
<script>
    code here
</script>

the parser is unable to identify the code and treats it like regular
tags. Such pages are quite widespread and ought to be supported. I was
curious if anyone has ideas on solving this, given the existing design.
Fresh ideas often lead to a better perspective.

Regards,
Somik