[Htmlparser-developer] new i/o subsystem
From: Derrick O. <Der...@ro...> - 2003-08-21 06:03:09
Marc, James, Somik, Joshua, Amit, et al.,

I've just dropped some speed fixes to the lexer package, the new low-level i/o subsystem I've been working on. It now appears to be 10% to 50% faster at getting raw nodes than the NodeReader/parserHelpers were.

It's not complete:
- it needs an EndNode class for speed and memory reasons
- I backed off multi-threading for speed
- character set detection isn't really working yet
- there's no constructor taking a file name

But the next logical step is probably integration into the real parser to run against real test cases. However, I think this will cause a *lot* of unit tests to fail. There are a number of reasons for this:
- attributes will have case preserved; I think I've gotten around this temporarily with a switch in the ParserTestCase class
- whitespace is preserved; a lot of this has to do with the different handling of line endings
- the order of attributes in tags is preserved, so toHtml() output is completely different
- the count of nodes may be altered by the whitespace nodes; this may require changing the ParserTestCase counting strategy
- remark nodes store all the text, even the dashes
- I mostly only paid attention to the HTML specification; real HTML is somewhat more exotic

All these failing tests will need labour-intensive manual attention to detail to get the tests correct again. In other words, once this is integrated there's no turning back. As with any animal that's having its spine replaced, there's bound to be a bit of pain.

So, before that happens, the code should go through a period of severe code review. That's what open source is about, right? So if you have some time, please go over the lexer package with a fine-tooth comb. Add more test cases to the lexerTests package. Take a look at the toString() output (see testReal in LexerTests for example). Optimize the hell out of it. Bounce it around and see what methods would make you happy. Then add them.
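To make the attribute case and ordering points above concrete, here is a minimal sketch of how a test could compare attributes without depending on name case or on order. It assumes attributes are available as a plain name-to-value map. The AttributeCompare class and its methods are hypothetical and not part of the lexer or of ParserTestCase.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical test helper; not part of the htmlparser code base. */
final class AttributeCompare {

    /** Returns a copy with lower-cased attribute names, sorted by name. Values are untouched. */
    public static Map<String, String> normalize(Map<String, String> attributes) {
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            result.put(entry.getKey().toLowerCase(Locale.ROOT), entry.getValue());
        }
        return result;
    }

    /** True when both maps hold the same attributes, ignoring name case and order. */
    public static boolean sameAttributes(Map<String, String> expected,
                                         Map<String, String> actual) {
        return normalize(expected).equals(normalize(actual));
    }

    public static void main(String[] args) {
        // What an older test might expect: lower-case names in one order.
        Map<String, String> expected = new LinkedHashMap<>();
        expected.put("href", "index.html");
        expected.put("class", "nav");

        // What a case- and order-preserving lexer might report for the same tag.
        Map<String, String> actual = new LinkedHashMap<>();
        actual.put("CLASS", "nav");
        actual.put("HREF", "index.html");

        System.out.println(sameAttributes(expected, actual)); // prints "true"
    }
}
```

A comparison like this only covers attributes; tests that compare toHtml() output character by character would still need their expected strings updated by hand.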
I'm thinking two weeks minimum, so this period would span at least two integration builds. The first one will be August 24th, so if you don't have CVS access you'll need to start with that.

OK, let's have at 'er folks!

Derrick