Re: [Htmlparser-developer] RE: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTag
From: Derrick O. <Der...@ro...> - 2003-05-28 22:32:24
Marc,

I've been thinking about your problem and I think I have a solution: I'll rewrite the node reader.

OK, that's the bottom line. I've said before that the lowest level should return a contiguous stream of nodes that keep the original characters (not case converted) and include the formatting, like line endings and other whitespace, so that toHtml() gives you exactly the same page that you started with. I should make a picture, but see if you can follow me here.

The lowest level is a byte stream, right off the wire. This needs to support mark and reset in case the character set changes.

The second level is a character stream, after applying the decoding for a particular charset.

The third level is a string, which is a char array. The chars are copied from the second level, so that can be discarded, but only after the entire stream has been drained. If we want to do threaded access to the socket to provide for parallel parsing while reading, the characters need to be kept around to create whole new strings.

The fourth level is a stream of tags. Instead of keeping substrings, though, the tags just keep a character position, start and end, within the entire page, like a cursor, and a pointer to a new 'Page' object. That way, as the Page reads more bytes from the stream, it accumulates more characters, which make a bigger string that represents the page read so far, and there's nothing preventing the older strings from being garbage collected.

The upper case thing goes away since the tags point to the original characters via their offsets. The end of line thing goes away because the reader just treats a newline like any other whitespace. So what you have after a parse is a single (very large) string with a parallel stream of tag objects holding a whole bunch of cursors pointing into the string.

I've experimented with reading all the characters up front and that breaks 67 test cases.
If you erroneously substitute "\n" for "\r\n" (or vice versa) there are only 47 failing cases left. The reset on character set change test case is one of them. If you erroneously consume newlines at the front of string nodes, the number of failing tests is only 33. And if you erroneously return no string node when that consumption leaves nothing in the string, there are only 15 failing cases. These would have to be examined in detail for correctness, according to the HTML spec.

So it's doable. I just have to find the time.

For now, just include the entire original ScriptScanner.scan() code in a base class for your script scanners so that the evil CompositeTagScanner.scan() is overridden.

Derrick

Marc wrote:
>Here are the main things that the new ScriptScanner does that breaks my code:
>1) acts very strangely when it encounters "\" at a newline (doesn't just get rid of the newline, but it starts repeating the entire line about 6 times)
>2) uppercases and auto-closes tags that aren't in quotes
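
A minimal sketch of the cursor idea described in the message above, for illustration only. It assumes a single in-memory page and a deliberately naive split at '<' and '>' (no handling of quotes, comments or scripts). The names Page, Cursor and PageCursorSketch, and all of their methods, are hypothetical and are not htmlparser's API. The point is only that nodes which store offsets rather than copied substrings reproduce the original characters, including case and line endings.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical 'Page': accumulates decoded characters exactly as they arrive. */
final class Page {
    private final StringBuilder text = new StringBuilder();

    /** Append newly decoded characters unchanged (no case folding, no newline rewriting). */
    void append(CharSequence chars) {
        text.append(chars);
    }

    int length() {
        return text.length();
    }

    /** The original characters in [start, end). */
    String getText(int start, int end) {
        return text.substring(start, end);
    }
}

/** Hypothetical node: only a reference to the page plus start/end offsets. */
final class Cursor {
    final Page page;
    final int start;
    final int end;
    final boolean isTag;

    Cursor(Page page, int start, int end, boolean isTag) {
        this.page = page;
        this.start = start;
        this.end = end;
        this.isTag = isTag;
    }

    /** Reads the text back from the page, so nothing is lost or converted. */
    String toHtml() {
        return page.getText(start, end);
    }
}

final class PageCursorSketch {
    /** Naive scan: every character of the page ends up in exactly one node. */
    static List<Cursor> scan(Page page) {
        List<Cursor> nodes = new ArrayList<>();
        int n = page.length();
        String all = page.getText(0, n);
        int pos = 0;
        while (pos < n) {
            boolean isTag = all.charAt(pos) == '<';
            int end;
            if (isTag) {
                int close = all.indexOf('>', pos);
                end = (close < 0) ? n : close + 1; // unterminated tag runs to end of page
            } else {
                int open = all.indexOf('<', pos);
                end = (open < 0) ? n : open;       // text runs up to the next tag
            }
            nodes.add(new Cursor(page, pos, end, isTag));
            pos = end;
        }
        return nodes;
    }

    /** Concatenate the nodes in order to rebuild the page. */
    static String toHtml(List<Cursor> nodes) {
        StringBuilder out = new StringBuilder();
        for (Cursor c : nodes) {
            out.append(c.toHtml());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<HTML>\r\n<Body onLoad=init()>Hi\n</body>";
        Page page = new Page();
        page.append(html);
        List<Cursor> nodes = scan(page);
        System.out.println("nodes: " + nodes.size());
        System.out.println("round trip identical: " + toHtml(nodes).equals(html));
    }
}
```

Because whitespace between tags becomes an ordinary node and tag text is never rewritten, the mixed "\r\n" and "\n" line endings and the unquoted, mixed-case attribute in the sample input survive the round trip. A real reader would also need the charset reset and incremental reading described in the message, which this sketch leaves out.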