Thread: [Htmlparser-developer] HTMLReader design needs to be modified (dev opinion solicited)
From: Somik R. <so...@ya...> - 2002-12-07 07:20:42
Hi Folks,

We've come up with an interesting problem - there was a request by Steve Harrington recently that we support multiple-sequential parsing, i.e. use the same parser object multiple times to parse instead of creating a new one each time.

Unfortunately this has caused us to play around with the reader and try to mark and reset streams. This is not such a good idea as for large streams there is no guarantee that a reset will work. Leslie suggests that we note this in the javadoc, and roll back this feature.

Our complete bug report and discussion is at https://sourceforge.net/tracker/index.php?func=detail&aid=649133&group_id=24399&atid=381399

The bug id is #649133. A discussion of this bug is in order, and it would be good if developers can participate with their views.

Steve --> It will be good to hear your views on this.

Regards,
Somik
From: Leslie R. <le...@op...> - 2002-12-07 22:40:47
Somik Raha wrote:
> Hi Folks,
>
> We've come up with an interesting problem - there was a request by Steve Harrington recently that we support multiple-sequential parsing, i.e. use the same parser object multiple times to parse instead of creating a new one each time.
>
> Unfortunately this has caused us to play around with the reader and try to mark and reset streams. This is not such a good idea as for large streams there is no guarantee that a reset will work. Leslie suggests that we note this in the javadoc, and roll back this feature.

My initial notion was to do that, which works for any size stream but does break backward compatibility.
I have also tested the length() method approach on pages as large as 60K bytes and all is well.
To decide this issue for myself, I went to the code in both htmlparser and the Sun sources. Here is what I found.

The most typical use of the feature at hand would be a StringReader wrapped by the HTMLReader, like:

    String s = "<html>.....</html>";
    StringReader sr = new StringReader(s);
    HTMLReader hr = new HTMLReader(sr, s.length());

Looking at the mark() and reset() implementations in StringReader we find that they do nothing at all.
Not surprising, since the StringReader depends on the String (reference held in a member) for storage and there really is no "buffering" per se, since the String itself is obviously entirely in memory. The mark() and reset() are really just to keep the protocol consistent with the super-class chain.

Looking at HTMLReader, it too does no buffering and likewise imposes the mark() and reset() limit only because of the super-class protocol. In neither class does the mark actually create any buffering or impose any overhead -- we could just as easily hardcode the mark to MaxInt with no space or time penalty at all.

However, if HTMLReader were used to wrap another sort of BufferedReader, conditions could be different, as BufferedReader will keep up to "readAheadLimit" characters buffered in a char[]. This is pretty good, but not good enough to use MaxInt! ;-) So just changing 5000 to MaxInt is not what we want.

But the real problem is not time and space performance, it's error recovery. What we really need to avoid is throwing away a finished parse when the reset() throws an exception, which is precisely what happens in the current release version for all strings longer than 5000 characters.

I have implemented both fixes [1. a length() method in the reader plus its use in the parser, and 2. just pull the mark/reset out to the caller] and they each function as expected, with the caveat that the second method is not backward compatible with respect to reusing the parser on a user-supplied reader.

I prefer the second, non-compatible, approach on architectural grounds. If the creator of the reader knows the length of data, which is the most common case, then it (the creator) can do the mark and reset where needed with absolute certainty. On the other hand, if the creator of the reader does not know the data length, then it is in every bit as good a position to suggest a length as htmlparser is, and nothing can be gained by delegating to htmlparser.

> Our complete bug report and discussion is at https://sourceforge.net/tracker/index.php?func=detail&aid=649133&group_id=24399&atid=381399
>
> The bug id is #649133. A discussion of this bug is in order, and it would be good if developers can participate with their views.
>
> Steve --> It will be good to hear your views on this.
>
> Regards,
> Somik

--
Leslie Rohde
mailto:le...@op...
http://www.optitext.com
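
For illustration, the difference between the two reader types can be reproduced with the standard java.io classes alone. This is a minimal sketch, not htmlparser code: HTMLReader is left out, and the class name MarkLimitDemo and its helper methods are hypothetical. It assumes the documented java.io behaviour that StringReader ignores the readAheadLimit argument, while BufferedReader may refuse reset() once the number of characters read reaches that limit, which is why the caller passes a limit one larger than the known length.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical demo class; not part of htmlparser.
final class MarkLimitDemo {

    // Reads the reader to end-of-stream and returns everything it saw.
    static String drain(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Marks the reader, drains it, resets it, and drains it again.
    // Returns the text of the second pass, or null if an IOException
    // occurred (for example because reset() was refused).
    static String secondPass(Reader in, int readAheadLimit) {
        try {
            in.mark(readAheadLimit);
            drain(in);
            in.reset();
            return drain(in);
        } catch (IOException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String s = "<html><body>hello</body></html>";

        // StringReader: the limit is ignored, so even a tiny value works.
        System.out.println(secondPass(new StringReader(s), 1));

        // BufferedReader: the caller knows the length, so it can pick a
        // limit that is guaranteed to be large enough.
        System.out.println(
            secondPass(new BufferedReader(new StringReader(s)), s.length() + 1));

        // BufferedReader with a limit that is too small: reset() may be refused.
        System.out.println(
            secondPass(new BufferedReader(new StringReader(s)), 1));
    }
}
```

The second call is the caller-owned variant of the mark/reset: whoever creates the reader knows the data length and can therefore choose the limit with certainty.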
From: Somik R. <so...@ya...> - 2002-12-09 01:38:31
Hi Leslie,

> I prefer the second, non-compat, approach on architectural grounds. If the creator of the reader knows the length of data, which is the most common case, then it (the creator) can do the mark and reset where needed with absolute certainty. On the other hand, if the creator of the reader does not know the data length, then it is in every bit as good a position to suggest a length as htmlparser is, and nothing can be gained by delegating to htmlparser.

Just to add to the picture: the reset is done when a call is made to the elements() method, because we wish to position the parser back at the beginning of the stream. Now, it just might be that this is not possible, in which case we'd throw an exception. For the user to handle an exception and create a new parser object, or move the mark himself in the catch code, is an unnecessary complication, don't you think?

The whole idea of putting it there was to make it simpler to parse through a given html page again and again using the same parser object.

But if that is leading to other complications, it might just be better to take it out and expect that the parser object will need to be created every time. Of course, if we can handle all of it in the parser, then I'd think it's worth it, but a middle approach might just benefit neither side.

What are your thoughts?

Regards,
Somik
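
To make the catch-side handling concrete, here is a rough sketch of what a caller would have to write if a failed reset surfaced as an exception. It uses only java.io; the class RewindOrReopen, the method rewind, and the Supplier-based reopen callback are hypothetical names for illustration and are not htmlparser API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Supplier;

// Hypothetical helper; not part of htmlparser.
final class RewindOrReopen {

    // Tries to rewind the reader to its mark. If reset() is refused
    // (never marked, mark invalidated, or mark unsupported), falls back
    // to a fresh reader obtained from the supplier.
    static Reader rewind(Reader current, Supplier<Reader> reopen) {
        try {
            current.reset();
            return current;
        } catch (IOException e) {
            return reopen.get();
        }
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body>hi</body></html>";
        Supplier<Reader> reopen = () -> new BufferedReader(new StringReader(page));

        // Never marked: BufferedReader.reset() throws, so the fallback is used.
        Reader unmarked = new BufferedReader(new StringReader(page));
        System.out.println(rewind(unmarked, reopen) == unmarked);

        // Marked before reading: reset() succeeds and the same reader is reused.
        Reader marked = new BufferedReader(new StringReader(page));
        marked.mark(page.length() + 1);
        System.out.println(rewind(marked, reopen) == marked);
    }
}
```

The fallback branch is exactly the extra work in question: the caller needs some way to reopen the source, which for a network page means a second connection.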
From: Stephen J. H. <Ste...@tr...> - 2002-12-09 18:06:19
I already created a work around, so it doesn't kill me.

I just hated to have to spend the time to make a new connection to the source I am scraping since the pipe I am using is small.

I would be fine with it the way it is, provided the docs are updated.

Thanks for looking into this.

--stephen

Somik Raha wrote:

> Hi Folks,
>
> We've come up with an interesting problem - there was a request by Steve Harrington recently that we support multiple-sequential parsing, i.e. use the same parser object multiple times to parse instead of creating a new one each time.
>
> Unfortunately this has caused us to play around with the reader and try to mark and reset streams. This is not such a good idea as for large streams there is no guarantee that a reset will work. Leslie suggests that we note this in the javadoc, and roll back this feature.
>
> Our complete bug report and discussion is at https://sourceforge.net/tracker/index.php?func=detail&aid=649133&group_id=24399&atid=381399
>
> The bug id is #649133. A discussion of this bug is in order, and it would be good if developers can participate with their views.
>
> Steve --> It will be good to hear your views on this.
>
> Regards,
> Somik
From: Leslie R. <le...@op...> - 2002-12-09 18:18:40
Stephen J. Harrington wrote:

> I already created a work around, so it doesn't kill me.
>
> I just hated to have to spend the time to make a new connection to the source I am scraping since the pipe I am using is small.

Don't make a new connection. Just do a mark(10000) right after the Reader is opened, call parse, and do a Reader.reset() before calling parse again. The connection will remain, and the BufferedReader will hold onto the html string between calls to parse. The number 10000 is an example only -- you'll have to provide a value large enough to accommodate whatever stream length you expect, or the subsequent reset will fail.

> I would be fine with it the way it is, provided the docs are updated.
>
> Thanks for looking into this.
>
> --stephen
>
> Somik Raha wrote:
>
>> Hi Folks,
>>
>> We've come up with an interesting problem - there was a request by Steve Harrington recently that we support multiple-sequential parsing, i.e. use the same parser object multiple times to parse instead of creating a new one each time.
>>
>> Unfortunately this has caused us to play around with the reader and try to mark and reset streams. This is not such a good idea as for large streams there is no guarantee that a reset will work. Leslie suggests that we note this in the javadoc, and roll back this feature.
>>
>> Our complete bug report and discussion is at https://sourceforge.net/tracker/index.php?func=detail&aid=649133&group_id=24399&atid=381399
>>
>> The bug id is #649133. A discussion of this bug is in order, and it would be good if developers can participate with their views.
>>
>> Steve --> It will be good to hear your views on this.
>>
>> Regards,
>> Somik

--
Leslie Rohde
mailto:le...@op...
http://www.optitext.com
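
The mark, parse, reset, parse sequence described above might look roughly like the following. This is a sketch under stated assumptions, not htmlparser code: a ByteArrayInputStream stands in for the network connection, the hypothetical countTagOpens method stands in for "call parse", and the class name ReparseWorkaround is invented for the example. The read-ahead limit of 10000 is the example value from the message and must exceed the number of characters actually read.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the workaround; no htmlparser classes are used.
final class ReparseWorkaround {

    // Placeholder for a parse pass: counts '<' characters until end-of-stream.
    static int countTagOpens(BufferedReader in) throws IOException {
        int count = 0;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '<') {
                count++;
            }
        }
        return count;
    }

    // Marks right after the reader is opened, runs one pass, resets,
    // and runs a second pass over the same underlying stream.
    // Returns the two counts as {first, second}.
    static int[] parseTwice(InputStream source, int readAheadLimit) {
        try {
            BufferedReader in = new BufferedReader(
                new InputStreamReader(source, StandardCharsets.UTF_8));
            in.mark(readAheadLimit);
            int first = countTagOpens(in);
            in.reset();  // throws IOException if the mark is no longer valid
            int second = countTagOpens(in);
            return new int[] { first, second };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] html = "<html><body><p>x</p></body></html>"
            .getBytes(StandardCharsets.UTF_8);
        int[] counts = parseTwice(new ByteArrayInputStream(html), 10000);
        System.out.println(counts[0] + " " + counts[1]);
    }
}
```

Because the BufferedReader keeps the marked characters in memory, the second pass never touches the underlying stream, so the connection is read only once.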