Re: [Htmlparser-developer] HTMLReader design needs to be modified (dev opinion solicited)


<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <title></title>
</head>
<body>
<br>
<br>
Somik Raha wrote:<br>
<blockquote type="cite"
 cite="mid00ad01c29dc1$47df4a20$2303a440@kurukshetra">  
  <div><font face="Arial" size="2">Hi Folks,&nbsp;&nbsp;&nbsp; </font></div>

  <div><font face="Arial" size="2">&nbsp;&nbsp;&nbsp; We've come up with an  interesting
problem - there was a request by Steve Harrington recently that we  support
multiple-sequential parsing, i.e. use the same parser object multiple  times
to parse instead of creating a new one each time.</font></div>

  <div>&nbsp;</div>

  <div><font face="Arial" size="2">&nbsp;&nbsp;&nbsp; Unfortunately this has caused us  to
play around with the reader and try to mark and reset streams. This is not
 such a good idea as for large streams there is no guarantee that a reset
will  work. Leslie suggests that we note this in the javadoc, and roll back
this  feature.</font></div>
</blockquote>
My initial notion was to do that, which works for any size stream but does
break backward compatibility.<br>
I have also tested the length() method approach on pages as large as 60K
bytes and all is well.<br>
To decide this issue for myself, I went to the code in both htmlparser and
the Sun sources. &nbsp;Here is what I found.<br>
 <br>
 The most typical use of the feature at hand would be the use of a StringReader
wrapped by the HTMLReader, like:<br>
 String s = "&lt;html&gt;.....&lt;/html&gt;";<br>
 StringReader sr = new StringReader(s);<br>
 HTMLReader hr = new HTMLReader(sr, s.length());<br>
 <br>
 Looking at the mark() and reset() implementations in StringReader we find
that they do almost nothing: mark() ignores its readAheadLimit argument (beyond
rejecting a negative value) and merely records the current position, and reset()
just returns to it.<br>
Not surprising since the StringReader depends on the String (reference held
in a member) for storage and there really is no "buffering" per se, since
the String itself is obviously entirely in memory. &nbsp;The readAheadLimit argument
is really just there to keep the protocol consistent with the super-class chain.<br>
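<br>
A minimal sketch of this behaviour, assuming only the standard java.io.StringReader
(the class and method names below are illustrative, not part of htmlparser): the
read-ahead limit passed to mark() is far smaller than the data, yet reset() still
rewinds to the start.<br>
<pre>
```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Illustrative helper, not htmlparser code.
class StringReaderMarkDemo {
    // Marks the start with a read-ahead limit of 1, reads the whole string,
    // resets, and returns whatever a second full read produces.
    static String rereadAfterReset(String s) {
        try {
            StringReader sr = new StringReader(s);
            sr.mark(1);                   // limit far smaller than the data
            char[] buf = new char[s.length()];
            sr.read(buf, 0, buf.length);  // first pass consumes everything
            sr.reset();                   // back to the marked position
            int n = sr.read(buf, 0, buf.length);
            return new String(buf, 0, Math.max(n, 0));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Prints the text recovered by the second pass.
        System.out.println(rereadAfterReset("0123456789"));
    }
}
```
</pre>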
 <br>
 Looking at HTMLReader, it too does no buffering and likewise imposes the
mark() and reset() limit only because of the super-class protocol. &nbsp;In neither
class does the mark actually create any buffering or impose any overhead
-- we could just as easily hardcode the mark to MaxInt with no space or time
penalty at all.<br>
 <br>
 However, if HTMLReader were used to wrap another sort of BufferedReader,
conditions &nbsp;could be different as BufferedReader will keep up to "readAheadLimit"
characters buffered in a char[]. &nbsp;This is pretty good, but not good enough
to use MaxInt! &nbsp;;-) &nbsp;so just changing 5000 to MaxInt is not what we want.<br>
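<br>
To make the BufferedReader case concrete, here is a hedged sketch using only
java.io classes. The class name, the tiny buffer size of 4 and the limits are
assumptions chosen so the effect shows up on a ten-character string; the Javadoc
only says that a reset() after reading past the limit may fail, so the failure
in the first call reflects the usual OpenJDK behaviour rather than a guarantee.<br>
<pre>
```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Illustrative helper, not htmlparser code.
class ReadAheadLimitDemo {
    // Marks with the given limit, reads toRead characters one at a time,
    // then reports whether reset() threw an IOException.
    static boolean resetFails(String s, int bufferSize, int limit, int toRead) {
        BufferedReader br = new BufferedReader(new StringReader(s), bufferSize);
        try {
            br.mark(limit);
            for (int i = 0; i != toRead; i++) {
                br.read();
            }
        } catch (IOException e) {
            throw new IllegalStateException("unexpected failure before reset", e);
        }
        try {
            br.reset();
            return false;  // mark was still valid
        } catch (IOException e) {
            return true;   // mark was invalidated by reading past the limit
        }
    }

    public static void main(String[] args) {
        String s = "0123456789";
        // Limit smaller than the amount read.
        System.out.println(resetFails(s, 4, 2, 10));
        // Limit larger than the amount read.
        System.out.println(resetFails(s, 4, s.length() + 1, 10));
    }
}
```
</pre>
The second call also shows the cost side: to honour a limit larger than its
buffer, BufferedReader has to hold that many characters in memory.<br>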
<br>
But the real problem is not time and space performance, it's error recovery.
&nbsp;What we really need to avoid is throwing away a finished parse when the
reset() throws an exception, which is precisely what happens in the current
release version for all strings longer than 5000 characters.<br>
<br>
I have implemented both fixes [1. a length() method in the reader plus its use in the parser,
and 2. just pull the mark/reset out to the caller] and they each function
as expected, with the caveat that the second method is not backward compatible
with respect to reusing the parser on a user-supplied reader.<br>
<br>
I prefer the second, non-compat, approach on architectural grounds. &nbsp;If the
creator of the reader knows the length of data, which is the most common
case, then it (the creator) can do the mark and reset where needed with absolute
certainty. &nbsp;On the other hand, if the creator of the reader does not know
the data length, then it is in every bit as good a position to suggest a
length as htmlparser is, and nothing can be gained by delegating to htmlparser.<br>
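<br>
A rough sketch of what the caller-side approach could look like when the data
length is known. This is not the htmlparser API: consume() is a hypothetical
stand-in for one parse pass, and CallerManagedMark and parseTwice are names made
up for the illustration.<br>
<pre>
```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Illustrative helper, not htmlparser code.
class CallerManagedMark {
    // Hypothetical stand-in for one parse pass: reads to end of stream
    // and returns the text it saw.
    static String consume(Reader r) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    // The creator of the reader knows the data length, so it sets the mark
    // itself, runs one pass, resets, and runs a second pass.
    static String[] parseTwice(String s) {
        try {
            BufferedReader br = new BufferedReader(new StringReader(s));
            // One more than the data length, so the read that detects
            // end of stream still stays inside the limit.
            br.mark(s.length() + 1);
            String first = consume(br);
            br.reset();
            String second = consume(br);
            return new String[] { first, second };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] passes = parseTwice("0123456789");
        System.out.println(passes[0].equals(passes[1]));
    }
}
```
</pre>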
<br>
 <br>
<br>
<blockquote type="cite"
 cite="mid00ad01c29dc1$47df4a20$2303a440@kurukshetra">
  <div><font face="Arial" size="2"> </font></div>

  <div>&nbsp;</div>

  <div><font face="Arial" size="2">&nbsp;&nbsp;&nbsp; Our complete bug report and  discussion
is at <a
 href="https://sourceforge.net/tracker/index.php?func=detail&amp;aid=649133&amp;group_id=24399&amp;atid=381399">https://sourceforge.net/tracker/index.php?func=detail&amp;aid=649133&amp;group_id=24399&amp;atid=381399</a></font></div>

  <div>&nbsp;</div>

  <div><font face="Arial" size="2">&nbsp;&nbsp;&nbsp; The bug id is #649133. A  discussion
of this bug is in order, and it would be good if developers can  participate
with their views. </font></div>

  <div><font face="Arial" size="2">&nbsp;&nbsp;&nbsp; Steve --&gt; It will be good to  hear
your views on this.</font></div>

  <div>&nbsp;</div>

  <div><font face="Arial" size="2">Regards,</font></div>

  <div><font face="Arial" size="2">Somik</font></div>
</blockquote>
<br>
<pre class="moz-signature" cols="$mailwrapcol">-- 
Leslie Rohde
<a class="moz-txt-link-freetext" href="mailto:le...@op...">mailto:le...@op...</a>
<a class="moz-txt-link-freetext" href="http://www.optitext.com">http://www.optitext.com</a>
</pre>
<br>
</body>
</html>