Marc, James, Somik, Joshua, Amit, et al.,
I've just dropped some speed fixes to the lexer package, the new
low-level I/O subsystem I've been working on.
It now appears to be 10% to 50% faster at getting raw nodes than the
NodeReader/parserHelpers were.
It's not complete:
- it needs an EndNode class for speed and memory reasons
- I backed off on multi-threading, for speed reasons
- character set detection isn't really working yet
- there's no constructor taking a file name
But the next logical step is probably integration into the real parser
to run against real test cases.
However, I think this will cause a *lot* of unit tests to fail.
There are a number of reasons for this:
- attributes will have case preserved; I think I've gotten around
this temporarily with a switch in the ParserTestCase class
- whitespace is preserved; a lot of this has to do with the
different handling of line endings
- the order of attributes in tags is preserved, so toHtml() output
is completely different
- the count of nodes may be altered by the whitespace nodes, which
may require changing the ParserTestCase counting strategy
- remark nodes store all the text, even the dashes
- I mostly paid attention only to the HTML specification; real HTML
is somewhat more exotic
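
To make the first two kinds of failure concrete, here is a rough sketch of how a test helper might tolerate case differences in attribute names and skip whitespace-only nodes when counting. Everything in it is an assumption for illustration: the class LenientCompare and its methods are hypothetical, nodes are stood in for by their raw text, and none of this is the actual ParserTestCase switch or the lexer API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of the parser,
// the lexer package, or ParserTestCase.
public class LenientCompare {

    // Count the nodes a test cares about, skipping whitespace-only text.
    // Each node is represented here simply by its raw text.
    public static int countSignificant(List<String> rawNodes) {
        int count = 0;
        for (String raw : rawNodes) {
            // trim() strips spaces, tabs, CR and LF, so a node made up
            // only of line endings and indentation becomes empty.
            if (!raw.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Compare attribute names without regard to case, since the lexer
    // preserves whatever case the page used.
    public static boolean sameAttributeName(String expected, String actual) {
        return expected.toLowerCase(Locale.ROOT)
                .equals(actual.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("<p>", "\r\n  ", "text", "</p>");
        System.out.println(countSignificant(nodes));           // prints 3
        System.out.println(sameAttributeName("HREF", "href")); // prints true
    }
}
```

A helper like this only hides the differences from the tests; whether the counting strategy should ignore whitespace nodes at all is exactly the kind of thing worth settling during review.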
All these failing tests will need labour-intensive manual attention to
get them passing again.
In other words, once this is integrated there's no turning back.
As with any animal that's having its spine replaced, there's bound to
be a bit of pain.
So, before that happens, the code should go through a period of severe
code review.
That's what open source is about, right?
So if you have some time, please go over the lexer package with a
fine-tooth comb.
Add more test cases to the lexerTests package.
Take a look at the toString() output (see testReal in LexerTests for
an example).
Optimize the hell out of it.
Bounce it around and see what methods would make you happy. Then add them.
I'm thinking two weeks minimum, so this period would span at least two
integration builds.
The first one will be August 24th, so if you don't have CVS access
you'll need to start with that.
OK, let's have at 'er folks!
Derrick