Re: [Htmlparser-developer] HTMLReader design needs to be modified (dev opinion solicited)
From: Leslie R. <le...@op...> - 2002-12-07 22:40:47
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <title></title> </head> <body> <br> <br> Somik Raha wrote:<br> <blockquote type="cite" cite="mid00ad01c29dc1$47df4a20$2303a440@kurukshetra"> <meta http-equiv="Content-Type" content="text/html; "> <meta content="MSHTML 6.00.2800.1106" name="GENERATOR"> <style></style> <div><font face="Arial" size="2">Hi Folks, </font></div> <div><font face="Arial" size="2"> We've come up with an interesting problem - there was a request by Steve Harrington recently that we support multiple-sequential parsing, i.e. use the same parser object multiple times to parse instead of creating a new one each time.</font></div> <div> </div> <div><font face="Arial" size="2"> Unfortunately this has caused us to play around with the reader and try to mark and reset streams. This is not such a good idea as for large streams there is no guarantee that a reset will work. Leslie suggests that we note this in the javadoc, and roll back this feature.</font></div> </blockquote> My initial notion was to do that, which works for any size stream but does break backward compat.<br> I have also tested the length() method approach on pages as large as 60K bytes and all is well.<br> To decide this issue for myself, I went to the code in both htmlparser and the Sun sources. Here is what I found.<br> <br> The most typical use of the feature at hand would be the use of a StringReader wrapped by the HTMLReader, like:<br> String s = "&lt;html&gt;.....&lt;/html&gt;";<br> StringReader sr = new StringReader(s);<br> HTMLReader hr = new HTMLReader(sr, s.length());<br> <br> Looking at the mark() and reset() implementations in StringReader we find that they do almost nothing: mark() only records the current position (the readAheadLimit is ignored) and reset() returns to it.<br> Not surprising since the StringReader depends on the String (reference held in a member) for storage and there really is no "buffering" per se, since the String itself is obviously entirely in memory. 
The mark() and reset() are really just to keep the protocol consistent with the super-class chain.<br> <br> Looking at HTMLReader, it too does no buffering and likewise imposes the mark() and reset() limit only because of the super-class protocol. In neither class does the mark actually create any buffering or impose any overhead -- we could just as easily hardcode the mark to MaxInt with no space or time penalty at all.<br> <br> However, if HTMLReader were used to wrap another sort of reader, such as a BufferedReader, conditions could be different, as BufferedReader will keep up to "readAheadLimit" characters buffered in a char[]. This is pretty good, but not good enough to use MaxInt! ;-) so just changing 5000 to MaxInt is not what we want.<br> <br> But the real problem is not time and space performance, it's error recovery. What we really need to avoid is throwing away a finished parse when the reset() throws an exception, which is precisely what happens in the current release version for all strings longer than 5000 characters.<br> <br> I have implemented both fixes [1. a length() method in the reader plus its use in the parser, and 2. just pull the mark/reset out to the caller] and they each function as expected, with the caveat that the second method is not backward compat with respect to reusing the parser on a user-supplied reader.<br> <br> I prefer the second, non-compat, approach on architectural grounds. If the creator of the reader knows the length of data, which is the most common case, then it (the creator) can do the mark and reset where needed with absolute certainty. 
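<br> <br> A minimal sketch of that caller-side mark and reset, using only java.io and the StringReader case described above. It is not htmlparser code: the class name CallerMarkReset and the helpers drain() and parseTwice() are hypothetical, with drain() standing in for one parse pass.<br>

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class CallerMarkReset {

    // Hypothetical stand-in for one parse pass: reads to end of stream
    // and returns the number of characters seen.
    static int drain(Reader in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    // The caller created the reader and knows the data length, so it
    // places the mark and performs the reset itself between two passes.
    static int[] parseTwice(String s) {
        try (Reader in = new StringReader(s)) {
            in.mark(s.length());   // StringReader ignores the limit; it only has to be non-negative
            int first = drain(in);
            in.reset();            // rewind to the marked position, here the start of the string
            int second = drain(in);
            return new int[] { first, second };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] counts = parseTwice("some page text held entirely in memory");
        System.out.println(counts[0] + " " + counts[1]);
    }
}
```

Both passes see the whole string because the mark lives with the code that owns the data. If the caller wrapped its source in a BufferedReader instead, the limit would no longer be ignored, and the caller would need to pass a limit larger than the input length for the reset to stay valid.<br> <br>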
On the other hand, if the creator of the reader does not know the data length, then it is in every bit as good a position to suggest a length as htmlparser is, and nothing can be gained by delegating to htmlparser.<br> <br> <br> <br> <blockquote type="cite" cite="mid00ad01c29dc1$47df4a20$2303a440@kurukshetra"> <div><font face="Arial" size="2"> </font></div> <div> </div> <div><font face="Arial" size="2"> Our complete bug report and discussion is at <a href="https://sourceforge.net/tracker/index.php?func=detail&aid=649133&group_id=24399&atid=381399">https://sourceforge.net/tracker/index.php?func=detail&aid=649133&group_id=24399&atid=381399</a></font></div> <div> </div> <div><font face="Arial" size="2"> The bug id is #649133. A discussion of this bug is in order, and it would be good if developers can participate with their views. </font></div> <div><font face="Arial" size="2"> Steve --> It will be good to hear your views on this.</font></div> <div> </div> <div><font face="Arial" size="2">Regards,</font></div> <div><font face="Arial" size="2">Somik</font></div> </blockquote> <br> <pre class="moz-signature" cols="$mailwrapcol">-- Leslie Rohde <a class="moz-txt-link-freetext" href="mailto:le...@op...">mailto:le...@op...</a> <a class="moz-txt-link-freetext" href="http://www.optitext.com">http://www.optitext.com</a> </pre> <br> </body> </html>