Thread: [Htmlparser-developer] HTMLReader design needs to be modified (dev opinion solicited)

Brought to you by: derrickoswald

[Htmlparser-developer] HTMLReader design needs to be modified (dev opinion solicited)

From: Somik R. <so...@ya...> - 2002-12-07 07:20:42

Hi Folks,
    We've come up with an interesting problem - there was a request by
Steve Harrington recently that we support multiple-sequential parsing,
i.e. use the same parser object multiple times to parse instead of
creating a new one each time.

    Unfortunately, this has led us to experiment with marking and
resetting streams in the reader. That is not a good idea, because for
large streams there is no guarantee that a reset will work. Leslie
suggests that we note this in the javadoc and roll back this feature.

    Our complete bug report and discussion is at
https://sourceforge.net/tracker/index.php?func=detail&aid=649133&group_id=24399&atid=381399

    The bug id is #649133. A discussion of this bug is in order, and it
would be good if developers can participate with their views.
    Steve --> It will be good to hear your views on this.

Regards,
Somik

Re: [Htmlparser-developer] HTMLReader design needs to be modified (dev opinion solicited)

From: Leslie R. <le...@op...> - 2002-12-07 22:40:47

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <title></title>
</head>
<body>
<br>
<br>
Somik Raha wrote:<br>
<blockquote type="cite"
 cite="mid00ad01c29dc1$47df4a20$2303a440@kurukshetra">  
  <meta http-equiv="Content-Type" content="text/html; ">
 
  <meta content="MSHTML 6.00.2800.1106" name="GENERATOR">
 
  <style></style>  
  <div><font face="Arial" size="2">Hi Folks,&nbsp;&nbsp;&nbsp; </font></div>
 
  <div><font face="Arial" size="2">&nbsp;&nbsp;&nbsp; We've come up with an  interesting
problem - there was a request by Steve Harrington recently that we  support
multiple-sequential parsing, i.e. use the same parser object multiple  times
to parse instead of creating a new one each time.</font></div>
 
  <div>&nbsp;</div>
 
  <div><font face="Arial" size="2">&nbsp;&nbsp;&nbsp; Unfortunately this has caused us  to
play around with the reader and try to mark and reset streams. This is not
 such a good idea as for large streams there is no guarantee that a reset
will  work. Leslie suggests that we note this in the javadoc, and roll back
this  feature.</font></div>
</blockquote>
My initial notion was to do that, which works for any size stream but does
break backward compat.<br>
I have also tested the length() method approach on pages as large as 60K
bytes and all is well.<br>
To decide this issue for myself, I went to the code in both htmlparser and
the Sun sources. &nbsp;Here is what I found.<br>
 <br>
 The most typical use of the feature at hand would be the use of a StringReader
wrapped by the HTMLReader, like:<br>
 String s = "&lt;html&gt;.....&lt;/html&gt;";<br>
 StringReader sr = new StringReader(s);<br>
 HTMLReader hr = new HTMLReader(sr, s.length());<br>
 <br>
 Looking at the mark() and reset() implementations in StringReader we find
that they do almost nothing: mark() only records the current position and
ignores its readAheadLimit argument.<br>
Not surprising since the StringReader depends on the String (reference held
in a member) for storage and there really is no "buffering" per se, since
the String itself is obviously entirely in memory. &nbsp;The mark() and reset()
are really just to keep the protocol consistent with the super-class chain.<br>
 <br>
 Looking at HTMLReader, it too does no buffering and likewise imposes the
mark() and reset() limit only because of the super-class protocol. &nbsp;In neither
class does the mark actually create any buffering or impose any overhead
-- we could just as easily hardcode the mark to MaxInt with no space or time
penalty at all.<br>
 <br>
 However, if HTMLReader were used to wrap another sort of BufferedReader,
conditions could be different, as BufferedReader will keep up to "readAheadLimit"
characters buffered in a char[]. &nbsp;This is pretty good, but not good enough
to use MaxInt! &nbsp;;-) &nbsp;so just changing 5000 to MaxInt is not what we want.<br>
<br>
But the real problem is not time and space performance, it's error recovery.
&nbsp;What we really need to avoid is throwing away a finished parse when the
reset() throws an exception, which is precisely what happens in the current
release version for all strings longer than 5000 characters.<br>
<br>
I have implemented both fixes [1. a length() method in the reader plus its use
in the parser, and 2. just pull the mark/reset out to the caller] and they each
function as expected, with the caveat that the second method is not backward
compatible with respect to reusing the parser on a user-supplied reader.<br>
<br>
I prefer the second, non-compat, approach on architectural grounds. &nbsp;If the
creator of the reader knows the length of data, which is the most common
case, then it (the creator) can do the mark and reset where needed with absolute
certainty. &nbsp;On the other hand, if the creator of the reader does not know
the data length, then it is in every bit as good a position to suggest a
length as htmlparser is, and nothing can be gained by delegating to htmlparser.<br>
<br>
 <br>
<br>
<blockquote type="cite"
 cite="mid00ad01c29dc1$47df4a20$2303a440@kurukshetra">
  <div><font face="Arial" size="2"> </font></div>
 
  <div>&nbsp;</div>
 
  <div><font face="Arial" size="2">&nbsp;&nbsp;&nbsp; Our complete bug report and  discussion
is at <a
 href="https://sourceforge.net/tracker/index.php?func=detail&amp;aid=649133&amp;group_id=24399&amp;atid=381399">https://sourceforge.net/tracker/index.php?func=detail&amp;aid=649133&amp;group_id=24399&amp;atid=381399</a></font></div>
 
  <div>&nbsp;</div>
 
  <div><font face="Arial" size="2">&nbsp;&nbsp;&nbsp; The bug id is #649133. A  discussion
of this bug is in order, and it would be good if developers can  participate
with their views. </font></div>
 
  <div><font face="Arial" size="2">&nbsp;&nbsp;&nbsp; Steve --&gt; It will be good to  hear
your views on this.</font></div>
 
  <div>&nbsp;</div>
 
  <div><font face="Arial" size="2">Regards,</font></div>
 
  <div><font face="Arial" size="2">Somik</font></div>
</blockquote>
<br>
<pre class="moz-signature" cols="$mailwrapcol">-- 
Leslie Rohde
<a class="moz-txt-link-freetext" href="mailto:le...@op...">mailto:le...@op...</a>
<a class="moz-txt-link-freetext" href="http://www.optitext.com">http://www.optitext.com</a>
</pre>
<br>
</body>
</html>

Re: [Htmlparser-developer] HTMLReader design needs to be modified (dev opinion solicited)

From: Somik R. <so...@ya...> - 2002-12-09 01:38:31

Hi Leslie,
> I prefer the second, non-compat, approach on architectural grounds.  If
> the creator of the reader knows the length of data, which is the most
> common case, then it (the creator) can do the mark and reset where
> needed with absolute certainty.  On the other hand, if the creator of
> the reader does not know the data length, then it is in every bit as
> good a position to suggest a length as htmlparser is, and nothing can be
> gained by delegating to htmlparser.

Just to add to the picture: the reset is done when a call is made to
the elements() method, because we wish to position the parser back at
the beginning of the stream. Now, it just might be that this is not
possible, in which case we'd throw an exception. For the user to handle
an exception and create a new parser object or move the mark himself in
the catch code is an unnecessary complication, don't you think?
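
To make the failure case concrete, here is a minimal sketch using only
java.io classes, not htmlparser itself. The class name, the method name
and the tiny buffer size are my own choices for illustration. It marks a
BufferedReader, reads to the end, and reports whether reset() still
works. With a read-ahead limit smaller than what was read, the mark is
invalidated and reset() throws an IOException.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class ResetFailureSketch {

    // Marks the start of the input, reads everything, then tries to reset.
    // Returns true if reset() succeeded, false if it threw an IOException.
    static boolean resetSurvives(String text, int bufferSize, int readAheadLimit) {
        try (BufferedReader br = new BufferedReader(new StringReader(text), bufferSize)) {
            br.mark(readAheadLimit);
            while (br.read() != -1) {
                // consume the whole input, as a parse pass would
            }
            br.reset(); // throws if the mark was invalidated along the way
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.append('x');
        }
        String text = sb.toString();

        // Limit larger than the input: the mark is kept.
        System.out.println(resetSurvives(text, 16, 200));
        // Limit smaller than the input: the mark is dropped when the buffer refills.
        System.out.println(resetSurvives(text, 16, 8));
    }
}
```

The small buffer is only there so the effect shows up on a short string.
With the default buffer size, the same thing happens once the input
outgrows both the buffer and the limit.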

The whole idea of putting it there was to make it simpler to parse through
a given html page again and again using the same parser object.
But if that is leading to other complications, it might just be better
to take it out and expect that the parser object will need to be created
every time. Of course, if we can handle all of it in the parser, then I'd
think it's worth it, but a middle approach might just benefit neither
side.

What are your thoughts?

Regards,
Somik

[Htmlparser-developer] Re: HTMLReader design needs to be modified (dev opinion solicited)

From: Stephen J. H. <Ste...@tr...> - 2002-12-09 18:06:19

I already created a workaround, so it doesn't kill me.

I just hated to have to spend the time to make a new connection to the
source I am scraping since the pipe I am using is small.

I would be fine with it the way it is, provided the docs are updated.

Thanks for looking into this.

--stephen

Somik Raha wrote:

> Hi Folks,    We've come up with an interesting problem - there was a
> request by Steve Harrington recently that we support
> multiple-sequential parsing, i.e. use the same parser object multiple
> times to parse instead of creating a new one each time.
> Unfortunately this has caused us to play around with the reader and
> try to mark and reset streams. This is not such a good idea as for
> large streams there is no guarantee that a reset will work. Leslie
> suggests that we note this in the javadoc, and roll back this
> feature.     Our complete bug report and discussion is at
> https://sourceforge.net/tracker/index.php?func=detail&aid=649133&group_id=24399&atid=381399
> The bug id is #649133. A discussion of this bug is in order, and it
> would be good if developers can participate with their views.    Steve
> --> It will be good to hear your views on this. Regards,Somik

Re: [Htmlparser-developer] Re: HTMLReader design needs to be modified (dev opinion solicited)

From: Leslie R. <le...@op...> - 2002-12-09 18:18:40

Stephen J. Harrington wrote:

> I already created a work around, so it doesn't kill me.
>
> I just hated to have to spend the time to make a new connection to the 
> source I am scraping since the pipe I am using is small.
>
Don't make a new connection.  Just do a mark(10000) right after the
Reader is opened, call parse, and do a Reader.reset() before calling
parse again.  The connection will remain, and the BufferedReader will
hold onto the html string between calls to parse.  The number 10000 is
an example only -- you'll have to provide a value large enough to
accommodate whatever stream length you expect, or the subsequent reset
will fail.
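
A minimal sketch of that sequence, using only java.io classes. The
class and method names are made up for illustration, and drain() is a
stand-in for the actual parse call, which is not shown here. A
StringReader plays the role of the network stream so the example runs
on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class MarkResetSketch {

    // Stand-in for one parse pass: consumes the reader and returns what it saw.
    static String drain(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    // Reads the same source twice through one BufferedReader,
    // with the caller doing the mark and the reset.
    static String[] readTwice(Reader source, int readAheadLimit) {
        try (BufferedReader br = new BufferedReader(source)) {
            br.mark(readAheadLimit);   // right after the reader is opened
            String first = drain(br);  // first "parse"
            br.reset();                // back to the mark, same underlying source
            String second = drain(br); // second "parse"
            return new String[] { first, second };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body>hello</body></html>";
        String[] passes = readTwice(new StringReader(html), 10000);
        System.out.println(passes[0].equals(passes[1]));
    }
}
```

As in the advice above, the limit passed to mark() has to cover
everything read before the reset, so 10000 is only a placeholder.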

> I would be fine with it the way it is, provided the docs are updated.
>
> Thanks for looking into this.
>
> --stephen
>
> Somik Raha wrote:
>
>> Hi Folks,    We've come up with an interesting problem - there was a 
>> request by Steve Harrington recently that we support 
>> multiple-sequential parsing, i.e. use the same parser object multiple 
>> times to parse instead of creating a new one each time.     
>> Unfortunately this has caused us to play around with the reader and 
>> try to mark and reset streams. This is not such a good idea as for 
>> large streams there is no guarantee that a reset will work. Leslie 
>> suggests that we note this in the javadoc, and roll back this 
>> feature.     Our complete bug report and discussion is at 
>> https://sourceforge.net/tracker/index.php?func=detail&aid=649133&group_id=24399&atid=381399
>> The bug id is #649133. A discussion of this bug is in order, and it 
>> would be good if developers can participate with their views.    
>> Steve --> It will be good to hear your views on this. Regards,Somik
>

-- 
Leslie Rohde
mailto:le...@op...
http://www.optitext.com