Thread: RE: [Htmlparser-developer] Re: [Htmlparser-user] Another Ill-Formed Example
From: Claude D. <CD...@ar...> - 2002-08-07 15:48:15
You are not only talented but very kind! Thanks.

BTW: I was giving some thought to the calls that take place in HTMLEnumeration. As far as I could tell, many internal calls were made twice, by virtue of the hasMoreNodes/nextHTMLNode pattern. An alternate pattern is repeated calls to nextHTMLNode which should stop when a null response is returned. This pattern is used by the BufferedReader.readLine method, by the JDBC ResultSet.next method, etc. Based on the simple observation that calls to hasMoreNodes AND nextHTMLNode run some of the same underlying code, it seems that the speed of the parser could be positively influenced by reducing the interface to a single call. Any thoughts?

-----Original Message-----
From: Somik Raha [mailto:so...@ya...]
Sent: Tuesday, August 06, 2002 9:56 PM
To: htm...@li...
Cc: htm...@li...
Subject: [Htmlparser-developer] Re: [Htmlparser-user] Another Ill-Formed Example

Hi Claude,

This has been handled, related to the earlier fix. All potential infinite loops have been removed, and there will be no more hangings - only HTMLParserExceptions from now on. There will be a release having all these fixes this weekend.

Regards,
Somik

----- Original Message -----
From: Claude Duguay <mailto:CD...@ar...>
To: htm...@li...
Sent: Wednesday, August 07, 2002 3:35 AM
Subject: [Htmlparser-user] Another Ill-Formed Example

Here's some markup we found in another document that causes the HTMLParser to hang.

"<TITLE>KRP VALIDATION<PROCESS/TITLE>"

So far, we've had 4 documents cause our process to come to a grinding halt. I would much prefer a policy of exception throwing to hangs asap, followed by consideration of whether unusual markup can be handled more elegantly in a subsequent phase.

Thanks to everyone, as always.
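
The two iteration styles contrasted in the message above can be put side by side in a minimal Java sketch. Only the method names hasMoreNodes and nextHTMLNode come from the thread; ListNodeSource, IterationStyles, nextNodeOrNull, and the use of String as the node type are hypothetical stand-ins, not the htmlparser API. For reference, BufferedReader.readLine signals the end of input with null, while ResultSet.next returns false; both are single-call styles.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a node source; not the real htmlparser API.
class ListNodeSource {
    private final Iterator<String> it;

    ListNodeSource(List<String> nodes) {
        this.it = nodes.iterator();
    }

    // Two-call style: ask first, then fetch.
    boolean hasMoreNodes() {
        return it.hasNext();
    }

    String nextHTMLNode() {
        return it.next();
    }

    // Single-call style: returns null once the input is exhausted,
    // in the manner of BufferedReader.readLine.
    String nextNodeOrNull() {
        return it.hasNext() ? it.next() : null;
    }
}

class IterationStyles {
    // Enumeration-style loop: two calls per node.
    static List<String> collectTwoCall(ListNodeSource src) {
        List<String> out = new ArrayList<>();
        while (src.hasMoreNodes()) {
            out.add(src.nextHTMLNode());
        }
        return out;
    }

    // readLine-style loop: one call per node, null marks the end.
    static List<String> collectSingleCall(ListNodeSource src) {
        List<String> out = new ArrayList<>();
        String node;
        while ((node = src.nextNodeOrNull()) != null) {
            out.add(node);
        }
        return out;
    }
}
```

Both loops visit the same nodes in the same order; the difference is whether the caller tests a boolean or checks for null.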
From: Claude D. <CD...@ar...> - 2002-08-08 15:58:20
Based on your description there is a risk that calling hasMoreNodes without calling nextHTMLNode a few times in a row will not have the desired API semantics. If the parsing takes place in the call to hasMoreNodes, then the parser moves forward, regardless of whether the nextHTMLNode method was called. This suggests that the method should be called something else, more indicative of this behavior, or the behavior should be changed.

-----Original Message-----
From: Somik Raha [mailto:so...@ya...]
Sent: Thu 8/8/2002 12:07 AM
To: htm...@li...
Cc:
Subject: Re: [Htmlparser-developer] Re: [Htmlparser-user] Another Ill-Formed Example

Hi Claude,

Thanks for the kind words.

BTW: I was giving some thought to the calls that take place in HTMLEnumeration. As far as I could tell, many internal calls were made twice, by virtue of the hasMoreNodes/nextHTMLNode pattern. An alternate pattern is repeated calls to nextHTMLNode which should stop when a null response is returned. This pattern is used by the BufferedReader.readLine method, by the JDBC ResultSet.next method, etc. Based on the simple observation that calls to hasMoreNodes AND nextHTMLNode run some of the same underlying code, it seems that the speed of the parser could be positively influenced by reducing the interface to a single call. Any thoughts?

I am not so sure this would be a good idea, because then we'd have to compromise on the API. Users would have to check for null values; the iterator interface is also a popular one, and we have a familiarity factor here.

As far as optimization goes, nextHTMLNode doesn't do any parsing; it simply returns the node that was parsed internally when hasMoreNodes() was called. So the only speedup would be the reduction of a call - I am not so sure that this would be the best place for such a speedup.

By the way, talking about speedups, the last release and the next one should see some tweaks - and the performance ought to have gotten better. Are you still doing the performance testing? Any results to share?

Cheers,
Somik
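
One way the behavior could be changed, as suggested in the message above, is to cache a single look-ahead node so that repeated calls to hasMoreNodes do not advance the parser. The sketch below is an assumption about how that might look, not the library's implementation: LookaheadEnumeration is a hypothetical class, String stands in for a node, and a plain list iterator stands in for the parsing step.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch; the String "nodes" and the token iterator stand in
// for whatever the real parser produces (assumed to be non-null).
class LookaheadEnumeration {
    private final Iterator<String> tokens; // stand-in for the parsing step
    private String pending;                // node parsed but not yet handed out

    LookaheadEnumeration(List<String> tokens) {
        this.tokens = tokens.iterator();
    }

    // Parses at most one node ahead. Repeated calls without an intervening
    // nextHTMLNode() reuse the cached node instead of advancing the parser.
    boolean hasMoreNodes() {
        if (pending == null && tokens.hasNext()) {
            pending = tokens.next();
        }
        return pending != null;
    }

    // Hands out the cached node, parsing one first if hasMoreNodes()
    // was not called beforehand.
    String nextHTMLNode() {
        if (!hasMoreNodes()) {
            throw new NoSuchElementException("no more nodes");
        }
        String node = pending;
        pending = null;
        return node;
    }
}
```

With this shape, hasMoreNodes is a pure query from the caller's point of view, which matches the contract of java.util.Enumeration and java.util.Iterator.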
From: Somik R. <so...@ya...> - 2002-08-10 08:17:21
Hi Claude,

You've again raised a good point. I will look into this for next week's release.

Regards
Somik

----- Original Message -----
From: Claude Duguay
To: htm...@li...
Sent: Friday, August 09, 2002 12:58 AM
Subject: RE: [Htmlparser-developer] Re: [Htmlparser-user] Another Ill-Formed Example

Based on your description there is a risk that calling hasMoreNodes without calling nextHTMLNode a few times in a row will not have the desired API semantics. If the parsing takes place in the call to hasMoreNodes, then the parser moves forward, regardless of whether the nextHTMLNode method was called. This suggests that the method should be called something else, more indicative of this behavior, or the behavior should be changed.

-----Original Message-----
From: Somik Raha [mailto:so...@ya...]
Sent: Thu 8/8/2002 12:07 AM
To: htm...@li...
Cc:
Subject: Re: [Htmlparser-developer] Re: [Htmlparser-user] Another Ill-Formed Example

Hi Claude,

Thanks for the kind words.

BTW: I was giving some thought to the calls that take place in HTMLEnumeration. As far as I could tell, many internal calls were made twice, by virtue of the hasMoreNodes/nextHTMLNode pattern. An alternate pattern is repeated calls to nextHTMLNode which should stop when a null response is returned. This pattern is used by the BufferedReader.readLine method, by the JDBC ResultSet.next method, etc. Based on the simple observation that calls to hasMoreNodes AND nextHTMLNode run some of the same underlying code, it seems that the speed of the parser could be positively influenced by reducing the interface to a single call. Any thoughts?

I am not so sure this would be a good idea, because then we'd have to compromise on the API. Users would have to check for null values; the iterator interface is also a popular one, and we have a familiarity factor here.

As far as optimization goes, nextHTMLNode doesn't do any parsing; it simply returns the node that was parsed internally when hasMoreNodes() was called. So the only speedup would be the reduction of a call - I am not so sure that this would be the best place for such a speedup.

By the way, talking about speedups, the last release and the next one should see some tweaks - and the performance ought to have gotten better. Are you still doing the performance testing? Any results to share?

Cheers,
Somik
From: Somik R. <so...@ya...> - 2002-08-08 07:14:19
Hi Claude,

Thanks for the kind words.

BTW: I was giving some thought to the calls that take place in HTMLEnumeration. As far as I could tell, many internal calls were made twice, by virtue of the hasMoreNodes/nextHTMLNode pattern. An alternate pattern is repeated calls to nextHTMLNode which should stop when a null response is returned. This pattern is used by the BufferedReader.readLine method, by the JDBC ResultSet.next method, etc. Based on the simple observation that calls to hasMoreNodes AND nextHTMLNode run some of the same underlying code, it seems that the speed of the parser could be positively influenced by reducing the interface to a single call. Any thoughts?

I am not so sure this would be a good idea, because then we'd have to compromise on the API. Users would have to check for null values; the iterator interface is also a popular one, and we have a familiarity factor here.

As far as optimization goes, nextHTMLNode doesn't do any parsing; it simply returns the node that was parsed internally when hasMoreNodes() was called. So the only speedup would be the reduction of a call - I am not so sure that this would be the best place for such a speedup.

By the way, talking about speedups, the last release and the next one should see some tweaks - and the performance ought to have gotten better. Are you still doing the performance testing? Any results to share?

Cheers,
Somik
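
The division of work described in the message above (parsing inside hasMoreNodes, a plain return from nextHTMLNode) can be modeled with a small counter. This is a hypothetical model built from that description alone, not the library's source: CountingEnumeration, the parseCalls field, and the String node type are all assumptions made for illustration.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical model of the behavior described in the thread;
// it is not the library's source code.
class CountingEnumeration {
    private final Iterator<String> input; // stand-in for the document being parsed
    private String current;               // node produced by the last hasMoreNodes()
    int parseCalls = 0;                   // how many parse steps have run

    CountingEnumeration(List<String> input) {
        this.input = input.iterator();
    }

    // All parsing work happens here, as described in the thread.
    boolean hasMoreNodes() {
        if (!input.hasNext()) {
            current = null;
            return false;
        }
        current = input.next(); // stand-in for parsing one node
        parseCalls++;
        return true;
    }

    // No parsing: returns whatever hasMoreNodes() last produced.
    String nextHTMLNode() {
        return current;
    }
}
```

Under this model a loop over three nodes runs three parse steps, so merging the two methods would save one method call per node rather than a second parse. The model also advances on every call to hasMoreNodes, so two such calls in a row drop a node.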