htmlparser-developer Mailing List for HTML Parser (Page 9)


Derrick, these changes sound great!  Thank you so much for putting so
much work into creating a top notch lexer package.  I'll definitely put
some time into going over your code, and I'll definitely help with
testing out the integration once it gets underway.

Marc

-----Original Message-----
From: Derrick Oswald [mailto:Der...@ro...]
Sent: Wednesday, August 20, 2003 7:56 PM
To: htm...@li...
Subject: [Htmlparser-developer] new i/o subsystem

Marc, James, Somik, Joshua, Amit, et al.

I've just dropped some speed fixes to the lexer package, the new
low-level i/o subsystem I've been working on.
It now appears to be 10% to 50% faster at getting raw nodes than the
NodeReader/parserHelpers were.
It's not complete:
    - it needs an EndNode class for speed and memory reasons
    - I backed off multi-threading for speed
    - character set detection isn't really working yet
    - there's no constructor taking a file name
But the next logical step is probably integration into the real parser
to run against real test cases.
However, I think this will cause a *lot* of unit tests to fail.
There are a number of reasons for this:
    - attributes will have case preserved, I think I've gotten around
this temporarily with a switch in the ParserTestCase class
    - whitespace is preserved, a lot of this has to do with the
different line endings handling
    - the order of attributes in tags is preserved, so toHtml() output
is completely different
    - the count of nodes may be altered by the whitespace nodes, this
may require changing the ParserTestCase counting strategy
    - remark nodes store all the text, even the dashes
    - I mostly only paid attention to the HTML specification, real HTML
is somewhat more exotic
All these failing tests will need labour-intensive manual attention to
detail to get the tests correct again.
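
To make the failure modes above concrete, here is a minimal sketch of how a test helper might compare attributes without depending on name case or order, and count nodes while skipping whitespace-only text. It is only an illustration under stated assumptions: the class LenientCompare and its methods are hypothetical, they work on plain maps and strings, and none of them are part of the htmlparser or lexer code.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical test helpers; none of these names exist in htmlparser. */
final class LenientCompare {
    private LenientCompare() {}

    /** Lower-cases attribute names and sorts them, so name case and order stop mattering. */
    static Map<String, String> normalize(Map<String, String> attributes) {
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            out.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        return out;
    }

    /** True when both maps hold the same attributes, ignoring name case and order. */
    static boolean sameAttributes(Map<String, String> a, Map<String, String> b) {
        return normalize(a).equals(normalize(b));
    }

    /** Counts node texts, skipping those that are empty or whitespace only. */
    static int countSignificant(List<String> nodeTexts) {
        int count = 0;
        for (String text : nodeTexts) {
            if (!text.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }
}
```

Attribute values are still compared exactly; only the names are case-folded. Whether skipping whitespace nodes is the right counting strategy for ParserTestCase is an open question, not something this sketch settles.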
In other words, once this is integrated there's no turning back.
As with any animal that's having its spine replaced, there's bound to
be a bit of pain.
So, before that happens, the code should go through a period of severe
code review.
That's what open source is about right?
So if you have some time, please go over the lexer package with a
fine-tooth comb.
Add more test cases to the lexerTests package.
Take a look at the toString() output (see testReal in LexerTests for
example).
Optimize the hell out of it.
Bounce it around and see what methods would make you happy. Then add them.
I'm thinking, two weeks minimum, so this period would span at least two
integration builds.
The first one will be August 24th, so if you don't have CVS access
you'll need to start with that.

OK, let's have at 'er folks!

Derrick




htmlparser-developer — The developer mailing list of the htmlparser project