Dear Claude,
>We have (my company) processed about 11 million HTML documents
>successfully (with the Swing parser), some of which we'll see tested
>again with the HTMLParser code in the next few weeks.
Great - this will be a great service to this project and its community.
Thank you very much.
>To date, we have only run a few simple tests with the HTMLParser code
>but it appears now that the library is writing to standard err. I would
>expect all errors to result in parser-specific exceptions that the
>calling application would be free to handle as it may see fit.
Hmm, although I agree with this, I have a question: what do you see being
written to standard err? My understanding is that when the parser crashes,
it usually throws an exception all the way up, so if you wrap your parsing
block (the for loop) in a try-catch and catch a general exception, you
would be able to handle it yourself.
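
Here is a minimal sketch of what I mean. It does not use the real HTMLParser classes: SketchParserException and parseAll are hypothetical stand-ins for the parser-specific exception and the parsing for loop, so treat the names as assumptions. The part that should carry over is wrapping the loop in a try-catch and temporarily redirecting System.err so you can see exactly what gets written there.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ParseLoopSketch {
    /** Hypothetical stand-in for a parser-specific exception (not the library's class). */
    static class SketchParserException extends Exception {
        SketchParserException(String message) { super(message); }
    }

    /** Hypothetical stand-in for the parsing for loop: fails on a node named "BAD". */
    static int parseAll(List<String> nodes) throws SketchParserException {
        int count = 0;
        for (Iterator<String> it = nodes.iterator(); it.hasNext();) {
            String node = it.next();
            if (node.equals("BAD")) {
                // Simulates a library that logs to stderr before throwing.
                System.err.println("warning: malformed node");
                throw new SketchParserException("could not parse node " + count);
            }
            count++;
        }
        return count;
    }

    /** Runs the loop, capturing stderr output and any exception into one report line. */
    static String runAndReport(List<String> nodes) {
        PrintStream originalErr = System.err;
        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        System.setErr(new PrintStream(captured, true));
        String outcome;
        try {
            outcome = "parsed " + parseAll(nodes) + " nodes";
        } catch (Exception e) {
            // Catching the general Exception type also catches parser-specific subclasses.
            outcome = "caught " + e.getClass().getSimpleName() + ": " + e.getMessage();
        } finally {
            System.setErr(originalErr); // always restore the real stderr
        }
        return outcome + " | stderr: " + captured.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(runAndReport(Arrays.asList("html", "body", "BAD")));
    }
}
```

If you run your real parsing loop inside something like runAndReport, the captured text would tell us which messages reach standard err without an exception being thrown.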
>Some of the data we are processing is not publicly available. The errors
>we have seen are issues with very large HTML files that were generated
>from log files. These are surprisingly common but offer a special
>challenge to HTML parsers in that they tend to contain large strings of
>log file information between <pre></pre> tags.
Sounds interesting. Even if we can't get the data that you tested with, we
could simulate an equivalent test case...
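
Something along these lines might do as a starting point. It is only a sketch: the class name LargePreFixture, the output file name, and the log line format are all made up, since I have not seen your data. It writes one HTML page whose <pre> block holds a large number of synthetic log lines.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LargePreFixture {
    /** Builds an HTML page whose pre block holds the given number of synthetic log lines. */
    static String build(int lines) {
        StringBuilder html = new StringBuilder("<html><body><pre>\n");
        for (int i = 0; i < lines; i++) {
            // Made-up log format; the real files will look different.
            html.append("2002-01-01 00:00:00 [INFO] request ").append(i).append(" completed\n");
        }
        return html.append("</pre></body></html>\n").toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file name; one million lines gives a file of tens of megabytes.
        Path out = Paths.get("large-pre.html");
        Files.write(out, build(1_000_000).getBytes(StandardCharsets.UTF_8));
        System.out.println("wrote " + Files.size(out) + " bytes to " + out);
    }
}
```

Varying the line count should show whether the problem depends on the size of the <pre> content alone.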
>We'll probably be running about 1 or 2 million files through the parser
>this week. I will try to report problems and get set up to build the
>library so that I can offer more specific class/line-based
>feedback/fixes.
Cool. Looking forward to it.
Cheers,
Somik