[larm-dev] Jakarta HTTPClient
From: Clemens M. <Cle...@in...> - 2003-06-15 23:19:47
I looked at the source code of Jakarta's HTTPClient for use in the crawler. It seems OK, except that it creates a lot of objects on the way until a page is loaded.

The main thing I don't like is that it opens a java.net.Socket using the host name. In the Socket class this host name is resolved into an IP address using InetAddress.getByName. This method uses an awkward caching mechanism that looks to me like it will become a bottleneck once hundreds or thousands of hosts are in the cache. getByName calls getAllByName0(), which performs a getCachedAddress(host) lookup. This method first performs a linear (!) scan through the whole cache, builds a Vector of the entries that are expired, and then does a second linear scan through that Vector to remove these expired entries. All of this is done in a synchronized section. Since it happens as a side effect of each cache lookup, it will be done for each connection opened by the crawler.

In short, this won't work. We have to look up the IP address ourselves, using a mechanism that can cope with hundreds of host names without blocking other threads. Then both the host name and the IP address have to be provided to the HTTP class: the socket has to be opened via the IP address, and the HTTP header has to contain the host name.

This will end up in a rewrite of the HTTPClient... If we want to use NIO for this, the situation will be different again. I suppose we will have to write the HTTPClient from scratch some day.

Clemens
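
The approach described in the message (resolve the address ourselves, open the socket via the IP address, send the host name in the HTTP header) could be sketched roughly as follows. This is only an illustration, not code from HTTPClient or LARM: the class HostResolver, its Lookup interface, and the methods resolve, buildGet and open are all hypothetical names. The sketch also assumes a modern JDK (ConcurrentHashMap, lambdas) and leaves out cache expiry entirely.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical helper; not part of HTTPClient or LARM. */
class HostResolver {

    /** Pluggable lookup, so a crawler can supply its own DNS client. */
    interface Lookup {
        InetAddress resolve(String host) throws IOException;
    }

    private final ConcurrentMap<String, InetAddress> cache = new ConcurrentHashMap<>();
    private final Lookup lookup;

    HostResolver(Lookup lookup) {
        this.lookup = lookup;
    }

    /**
     * Hash lookup instead of a linear scan, and no global lock.
     * Two threads may resolve the same host at the same time;
     * the first result stored wins. No expiry in this sketch.
     */
    InetAddress resolve(String host) throws IOException {
        String key = host.toLowerCase(Locale.ROOT);
        InetAddress cached = cache.get(key);
        if (cached != null) {
            return cached;
        }
        InetAddress fresh = lookup.resolve(key);
        InetAddress raced = cache.putIfAbsent(key, fresh);
        return raced != null ? raced : fresh;
    }

    /** The request carries the host name, even though we connect by IP. */
    static String buildGet(String host, String path) {
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    /** Opens the socket from an already resolved address, not a host name. */
    Socket open(String host, int port, int timeoutMillis) throws IOException {
        InetAddress address = resolve(host);
        Socket socket = new Socket();
        socket.connect(new InetSocketAddress(address, port), timeoutMillis);
        return socket;
    }
}
```

A caller would construct the resolver once, share it between crawler threads, and write buildGet(host, path) to the stream of the socket returned by open(host, 80, timeout). Note that passing InetAddress::getByName as the Lookup would still go through the JDK's own cache on every miss, so a real crawler would presumably plug in a separate DNS implementation there.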