Re: [Htmlparser-developer] RE: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTag

Marc,

I've been thinking about your problem and I think I have a solution.
I'll re-write the node reader.

OK, that's the bottom line, but I've said before that the lowest level 
should return a contiguous stream of nodes, that have the original 
characters (not case converted) and include the formatting like line 
endings and other whitespace so that toHtml() gives you the exact same 
page that you started with.

I should make a picture, but see if you can follow me here.

The lowest level is a byte stream, right off the wire. This needs to 
support mark and reset in case the character set changes.
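
A minimal sketch of that mark-and-reset idea, using only standard java.io classes rather than anything from the parser itself. The names CharsetResetDemo, decodeAll and readPage are made up for illustration, and how the "declared" charset is discovered (say, from a META tag) is assumed to happen elsewhere:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper, not part of htmlparser.
final class CharsetResetDemo {

    /** Reads every remaining character from the stream using the given charset. */
    static String decodeAll(InputStream in, Charset charset) {
        try {
            Reader reader = new InputStreamReader(in, charset);
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Decodes with a guessed charset first. If the page turns out to declare
     * a different one, rewinds to the mark and decodes the same bytes again.
     */
    static String readPage(byte[] wire, Charset guess, Charset declared) {
        try {
            // The byte array stands in for the socket stream.
            InputStream bytes = new BufferedInputStream(new ByteArrayInputStream(wire));
            bytes.mark(wire.length + 1); // remember the start of the page
            String text = decodeAll(bytes, guess);
            if (!guess.equals(declared)) {
                bytes.reset(); // back to the first byte
                text = decodeAll(bytes, declared);
            }
            return text;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The read limit passed to mark() has to cover everything read before the charset change is noticed, which is why the sketch uses the whole page length.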

The second level is a character stream, after applying the decoding for 
a particular charset.

The third level is a string, which is a char array. The chars are copied 
from the second level, so that can be discarded, but only after the 
entire stream has been drained. If we want to do threaded access to the 
socket to provide for parallel parsing while reading, the characters 
need to be kept around to create whole new strings.

The fourth level is a stream of tags. Instead of keeping substrings 
though, the tags just keep character position, start and end, within the 
entire page, like a cursor, and a pointer to a new 'Page' object. That 
way as the Page reads more bytes from the stream, it accumulates more 
characters, which make a bigger string that represents the page read so 
far, and there's nothing preventing the older strings from being garbage 
collected.
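
Roughly what I have in mind, as a sketch only. SketchPage and NodeCursor are placeholder names, not existing classes, and a real Page would fill itself from the character stream instead of having text appended by hand:

```java
import java.util.Locale;

/** Placeholder for the 'Page' object: accumulates the characters read so far. */
final class SketchPage {
    private final StringBuilder text = new StringBuilder();

    /** Called as more characters arrive from the decoded stream. */
    void append(CharSequence chars) {
        text.append(chars);
    }

    String substring(int start, int end) {
        return text.substring(start, end);
    }

    int length() {
        return text.length();
    }
}

/** A node that keeps only its position in the page, never a copy of the text. */
final class NodeCursor {
    private final SketchPage page;
    private final int start; // inclusive offset into the page
    private final int end;   // exclusive offset into the page

    NodeCursor(SketchPage page, int start, int end) {
        this.page = page;
        this.start = start;
        this.end = end;
    }

    /** The original characters, case and whitespace untouched. */
    String toHtml() {
        return page.substring(start, end);
    }

    /** Upper-cased view for name matching; the page text is not modified. */
    String upperCased() {
        return toHtml().toUpperCase(Locale.ROOT);
    }
}
```

If the cursors are contiguous, concatenating toHtml() over all of them gives back the page exactly as it was read, line endings included, and a cursor created early stays valid as the page grows.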

The upper case thing goes away since the tags point to the original 
characters via their offsets. The end of line thing goes away because 
the reader just treats a newline as any other whitespace.

So what you have after a parse is a single (very large) string with a 
parallel stream of tag objects with a whole bunch of cursors pointing 
into the string.

I've experimented with reading all the characters up front and that 
breaks 67 test cases. If you erroneously substitute "\n" for "\r\n" (or 
vice versa) there are only 47 failed cases left. The reset on character 
set change test case is one of them.  If you erroneously consume 
newlines at the front of string nodes the number of failing tests is 
only 33. And if you erroneously return no string nodes if that 
consumption leaves nothing left in the string, there are only 15 failing 
cases. These would have to be examined in detail for correctness, 
according to the HTML spec.

So it's doable.
I just have to find the time.
For now just include the entire original ScriptScanner.scan() code in a 
base class for your script scanners so that the evil 
CompositeTagScanner.scan() is overridden.

Derrick

Marc wrote:

>Here are the main things that the new ScriptScanner does that breaks my code:
>1) acts very strangely when it encounters "\" at a newline (doesn't just get rid of the newline, but it starts repeating the entire line about 6 times)
>2) uppercases and auto-closes tags that aren't in quotes

Re: [Htmlparser-developer] RE: [Htmlparser-cvs] htmlparser/src/org/htmlparser/scanners CompositeTagScanner.java,1.52,1.53 ScriptScanner.java,1.21,1.22