Re: [Htmlparser-user] Testing/feedback, question

Dear Claude,
>We have (my company) processed about 11 million HTML documents
>successfully (with the Swing parser), some of which we'll see tested
>again with the HTMLParser code in the next few weeks.

Great - this will be a great service to this project and its community.
Thank you very much.

>To date, we have only run a few simple tests with the HTMLParser code
>but it appears now that the library is writing to standard err. I would
>expect all errors to result in parser-specific exceptions that the
>calling application would be free to handle as it may see fit.

Hmm... although I agree with this, I have a question: what do you see being
written to standard err? My understanding is that when the parser fails, it
usually throws an exception all the way up, so if you wrap your parsing
block (the for loop) in a try-catch and catch a plain Exception, you
should be able to handle it yourself.
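
As a rough sketch of what I mean, assuming the failure surfaces as an
exception from the node loop: the names below (NodeSource,
SketchParserException, failingAt) are made up for illustration and are not
HTMLParser's actual classes, so substitute the real parser and its
exception type.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ParseLoopSketch {
    /** Hypothetical stand-in for a parser-specific checked exception. */
    static class SketchParserException extends Exception {
        SketchParserException(String message) { super(message); }
    }

    /** Hypothetical stand-in for the parser's node enumeration. */
    interface NodeSource {
        boolean hasMoreNodes() throws SketchParserException;
        String nextNode() throws SketchParserException;
    }

    /** Fake source that fails on the node at failIndex (-1 = never fail). */
    static NodeSource failingAt(final List<String> nodes, final int failIndex) {
        final Iterator<String> it = nodes.iterator();
        return new NodeSource() {
            private int index = 0;
            public boolean hasMoreNodes() { return it.hasNext(); }
            public String nextNode() throws SketchParserException {
                String node = it.next();
                if (index++ == failIndex) {
                    throw new SketchParserException("cannot parse node " + node);
                }
                return node;
            }
        };
    }

    /** Wraps the whole parsing loop so the caller decides what to do on failure. */
    static String parseAll(NodeSource source) {
        int processed = 0;
        try {
            while (source.hasMoreNodes()) {
                source.nextNode();
                processed++;
            }
            return "ok: " + processed + " nodes";
        } catch (SketchParserException e) {
            // Parser-specific failure: report it instead of letting it escape.
            return "parser error after " + processed + " nodes: " + e.getMessage();
        } catch (RuntimeException e) {
            // Anything unexpected (e.g. a NullPointerException inside the parser).
            return "unexpected failure after " + processed + " nodes: " + e;
        }
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("<html>", "<pre>", "</pre>");
        System.out.println(parseAll(failingAt(nodes, -1)));
        System.out.println(parseAll(failingAt(nodes, 2)));
    }
}
```

If something still shows up on standard err with a wrapper like this in
place, then the library is printing it directly rather than throwing, and
that is what I would like to track down.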

>Some of the data we are processing is not publicly available. The errors
>we have seen are issues with very large HTML files that were generated
>from log files. These are surprisingly common but offer a special
>challenge to HTML parsers in that they tend to contain large strings of
>log file information between <pre></pre> tags.

Sounds interesting. Even if we can't get the data that you tested with, we
could simulate an equivalent test case...
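
Something along these lines might do as a starting point. It is only a
sketch: the class name, the log line format, and the line count are all
invented here, and the real files may differ in ways that matter (line
length, stray markup inside the log text, and so on).

```java
class LargePreFixture {
    /**
     * Builds a synthetic HTML document with the given number of
     * made-up log lines inside a single <pre> block.
     */
    static String build(int lines) {
        StringBuilder sb = new StringBuilder();
        sb.append("<html><body><pre>\n");
        for (int i = 0; i < lines; i++) {
            // Invented log format, only meant to produce a lot of text.
            sb.append("line ").append(i).append(": GET /index.html 200\n");
        }
        sb.append("</pre></body></html>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // One million lines gives a document of roughly 30 MB.
        String html = build(1000000);
        System.out.println("generated " + html.length() + " characters");
    }
}
```

The resulting string could be fed to the parser from a test, scaling the
line count up until the problem reproduces.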

>We'll probably be running about 1 or 2 million files through the parser
>this week. I will try to report problems and get set up to build the
>library so that I can offer more specific class/line-based
>feedback/fixes.

Cool. Looking forward to it.

Cheers,
Somik