Re: [larm-dev] Jakarta HTTPClient
From: otisg <ot...@ur...> - 2003-06-15 23:47:42
Clemens,

Thanks for looking through the sources and summarizing things for us.
Do you think it would be easier and maybe faster if we reported your
observations to httpclient-dev and suggested the alternative approach
that you described below?

Also, in which sources are the caches that you mentioned? JDK 1.4.* or
HTTPClient? If in HTTPClient, is that in the CVS version or some
released version? If it's in a released version, then maybe we should
check the CVS version. I've been on httpclient-dev for a long time, and
although I don't actively monitor the list, I seem to recall seeing
mentions, or maybe bugs in Bugzilla, related to HTTP requests that use
IP addresses instead of host names.

If this cache stuff is in JDK 1.4.*, then maybe we should see what 1.5
brings when it comes out. I heard that it should be out this fall. That
may be worth waiting for.

Also, with all the LARM things, I think we should try not to get stuck
on 'details' (this is not a detail in the long run, but I think you want
to try to put more pieces together before thinking about how to improve
individual components, tune them, etc.).
(Please don't take this comment as harsh criticism, I'm trying to be
constructive here :))

Otis

---- On Sun, 15 Jun 2003, Clemens Marschner (Cle...@in...) wrote:

> I looked at the source code of Jakarta's HTTPClient for use in the
> crawler. It seems OK, except that it creates a lot of objects on the
> way until a page is loaded.
>
> The main thing I don't like is that it opens a java.net.Socket using
> the host name.
>
> In the socket class this host name is resolved into an IP address
> using InetAddress.getByName. This method uses a very awkward caching
> mechanism that, it seems to me, becomes a bottleneck if hundreds or
> thousands of hosts are in the cache.
>
> getByName calls getAllByName0(), which performs a
> getCachedAddress(host) lookup. This method first performs a linear (!)
> scan through the whole cache, builds a Vector of entries that are
> expired, and then does a second linear scan through that vector to
> remove these expired entries. And all this is done in a synchronized
> section. Since this is done as a side effect at each cache lookup, it
> will be done for each connection opened by the crawler.
>
> In short, this won't work. We have to look up the IP address
> ourselves, using a mechanism that can cope with hundreds of host names
> without blocking other threads. Then the host name and the IP address
> have to be provided to the HTTP class. The socket has to be opened via
> the IP address and the HTTP header has to contain the host name.
>
> This will end up in a rewrite of the HTTPClient....
>
> If we want to use NIO for this, it will again be a different
> situation. I suppose we have to write the HTTPClient from scratch
> some day.
>
> Clemens
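
The workaround outlined in the quoted message (resolve the host name
ourselves through a cache that does not serialize all threads, connect
by IP address, and still send the host name in the HTTP Host header)
might look roughly like the Java sketch below. It is not code from
HTTPClient or LARM: the names CrawlerDns, Lookup, resolve, requestHead
and the TTL parameter are hypothetical, and it relies on
java.util.concurrent and lambdas, which are not available on the
JDK 1.4 discussed in this thread.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical helper for illustration; not part of HTTPClient or LARM. */
final class CrawlerDns {

    /** Pluggable lookup so the cache can be exercised without a network. */
    interface Lookup {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    /** One cached address together with its expiry time. */
    private static final class Entry {
        final InetAddress address;
        final long expiresAtMillis;

        Entry(InetAddress address, long expiresAtMillis) {
            this.address = address;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentMap<String, Entry> cache = new ConcurrentHashMap<>();
    private final Lookup lookup;
    private final long ttlMillis;

    CrawlerDns(Lookup lookup, long ttlMillis) {
        this.lookup = lookup;
        this.ttlMillis = ttlMillis;
    }

    /** Assumed default: fall back to the JDK resolver on cache misses only. */
    static CrawlerDns usingJdkResolver(long ttlMillis) {
        return new CrawlerDns(InetAddress::getByName, ttlMillis);
    }

    /**
     * Returns the cached address for the host, resolving it on a miss or
     * after expiry. A hit is a single hash lookup with no global lock and
     * no scan over other entries. Two threads may occasionally resolve the
     * same host at once; the later result simply overwrites the earlier one.
     */
    InetAddress resolve(String host) {
        String key = host.toLowerCase(Locale.ROOT);
        long now = System.currentTimeMillis();
        Entry cached = cache.get(key);
        if (cached != null && cached.expiresAtMillis > now) {
            return cached.address;
        }
        try {
            InetAddress address = lookup.resolve(key);
            cache.put(key, new Entry(address, now + ttlMillis));
            return address;
        } catch (UnknownHostException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a minimal request head that names the host, not the IP. */
    static String requestHead(String host, String path) {
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }
}
```

A fetcher thread would then open the connection with
new Socket(dns.resolve(host), 80) and write requestHead(host, path) to
the socket's output stream, so the server still sees the host name.
Note that in this sketch a cache miss still goes through
InetAddress.getByName when usingJdkResolver is used; only repeated
lookups of the same host bypass the JDK cache.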