Thread: [Htmlparser-developer] new i/o subsystem

Marc, James, Somik, Joshua, Amit, et al.,

I've just dropped some speed fixes to the lexer package, the new low 
level i/o subsystem I've been working on.
It now appears to be 10% to 50% faster at getting raw nodes than the 
NodeReader/parserHelpers were.
It's not complete:
    - it needs an EndNode class for speed and memory reasons
    - I backed off multi-threading for speed
    - character set detection isn't really working yet
    - there's no constructor taking a file name
But the next logical step is probably integration into the real parser 
to run against real test cases.
However, I think this will cause a *lot* of unit tests to fail.
There are a number of reasons for this:
    - attributes will have case preserved; I think I've gotten around 
this temporarily with a switch in the ParserTestCase class
    - whitespace is preserved; a lot of this has to do with the 
different handling of line endings
    - the order of attributes in tags is preserved, so toHtml() output 
is completely different
    - the count of nodes may be altered by the whitespace nodes; this 
may require changing the ParserTestCase counting strategy
    - remark nodes store all the text, even the dashes
    - I mostly only paid attention to the HTML specification, real HTML 
is somewhat more exotic
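
To make the attribute case and order points concrete, here is a minimal 
sketch. It does not use the htmlparser API: AttributeOrderDemo, render() 
and normalized() are hypothetical names, and the upper-cased, sorted map 
is only a stand-in for a parser that normalizes attribute names and loses 
their order.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names come from htmlparser.
class AttributeOrderDemo {

    /** Renders a start tag, emitting attributes in the map's iteration order. */
    public static String render(String name, Map<String, String> attributes) {
        StringBuilder out = new StringBuilder("<").append(name);
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            out.append(' ').append(e.getKey())
               .append("=\"").append(e.getValue()).append('"');
        }
        return out.append('>').toString();
    }

    /**
     * Assumed stand-in for a normalizing parser: attribute names are
     * upper-cased and the original order is replaced by sorted order.
     */
    public static Map<String, String> normalized(Map<String, String> attributes) {
        Map<String, String> sorted = new TreeMap<>();
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            sorted.put(e.getKey().toUpperCase(Locale.ROOT), e.getValue());
        }
        return sorted;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, i.e. the order in the source.
        Map<String, String> asWritten = new LinkedHashMap<>();
        asWritten.put("src", "logo.gif");
        asWritten.put("Alt", "Logo");

        System.out.println(render("img", asWritten));             // case and order kept
        System.out.println(render("img", normalized(asWritten))); // case and order lost
    }
}
```

A test that compares the rendered strings literally passes against one 
form and fails against the other, even though both describe the same tag.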
All these failing tests will need labour-intensive manual attention to 
get them passing again.
In other words, once this is integrated there's no turning back.
As with any animal that's having its spine replaced, there's bound to 
be a bit of pain.
So, before that happens, the code should go through a period of severe 
code review.
That's what open source is about, right?
So if you have some time, please go over the lexer package with a 
fine-tooth comb.
Add more test cases to the lexerTests package.
Take a look at the toString() output (see testReal in LexerTests for 
example).
Optimize the hell out of it.
Bounce it around and see what methods would make you happy. Then add them.
I'm thinking, two weeks minimum, so this period would span at least two 
integration builds.
The first one will be August 24th, so if you don't have CVS access 
you'll need to start with that build.

OK, let's have at 'er folks!

Derrick
