|
From: Alain F. <ala...@ri...> - 2002-05-03 11:38:31
|
Hi,
I'm testing the latest htdig-3.2.0b4 snapshot (20020428).
I think that there is a problem with the htdig behavior and the
server_wait_time.
In Retriever.cc:
---- cut here ----
// No HTTP connections available, so we change server and pause
if (max_connection_requests == 1)
    server->delay();    // This will pause if needed
                        // and reset the time
---- cut here ----
You can see that htdig waits for server_wait_time only if
max_connection_requests is equal to 1.
If max_connection_requests has any other value, htdig never waits for
server_wait_time before opening a new HTTP connection to the
same web server.
Since the htdig HTTP client supports the persistent connection mechanism, I
think the delay() function has to be called just before this loop:
---- Retriever.cc ----
// I think the delay function has to be called here
server->delay();
while ( ( (max_connection_requests == -1) ||
          (count < max_connection_requests) ) &&
        (ref = server->pop()) && noSignal)
{
    // stuff
    ...
}
---- Retriever.cc ----
Maybe I'm wrong, and the server_wait_time directive is not meant to work
with the persistent connection mechanism ...
Alain FORCIOLI http://www.VigiNews.com/
http://www.risc.fr/ ala...@ri...
http://www.april.org/ afo...@ap...
GPG Public Key: http://www.risc.fr/~af/gpg.txt
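
[Editor's sketch] The difference described in the message above can be modelled with a small counter. This is only an illustration under stated assumptions, not htdig code: count_delays, Policy and the single-server loop are hypothetical names, and the exact position of the snapshot's delay() call relative to the loop is assumed. Only the condition (max_connection_requests == 1) and the loop test are taken from the quoted excerpts.

```cpp
#include <cassert>

// Hypothetical model, not htdig source: which pausing rule is in effect.
enum class Policy { Snapshot, DelayBeforeEachBatch };

// Counts how often delay() would be called while fetching `docs`
// documents from one server. -1 means "no limit per connection",
// as in the loop condition quoted from Retriever.cc.
int count_delays(Policy policy, int max_connection_requests, int docs)
{
    assert(docs >= 0);
    assert(max_connection_requests == -1 || max_connection_requests > 0);

    int delays = 0;
    int remaining = docs;
    while (remaining > 0) {
        // Proposed change: pause before every batch of requests.
        if (policy == Policy::DelayBeforeEachBatch)
            ++delays;

        int count = 0;
        while ((max_connection_requests == -1 ||
                count < max_connection_requests) && remaining > 0) {
            --remaining;    // stands in for fetching one document
            ++count;
        }

        // Snapshot behaviour: pause only when the limit is exactly 1.
        if (policy == Policy::Snapshot && max_connection_requests == 1)
            ++delays;
    }
    return delays;
}
```

In this model, a limit of 5 requests per connection and 10 documents gives no pause at all under the snapshot rule, and two pauses under the proposed rule.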
|
|
From: Gabriele B. <an...@ti...> - 2002-05-03 22:28:51
|
>Gabriele can correct me on this point since this is his code. However, I
>think the problem is that you don't really want to have a wait time with a
>persistent connection. If we wait on a persistent connection, the server
>will kill it off before we can reconnect.
>
>Now perhaps we should argue that server_wait_time in this case should
>apply to a wait *after* we make max_connection_requests in a row.
Yeah,
I think your proposal is fine. If we enable persistent connections,
server_wait_time partially loses its meaning; I'd rather vote for ignoring
it when pcs are on. Or maybe as Geoff says, after the max requests in a row
are reached.
What d'u think about this?
-Gabriele
--
Gabriele Bartolini - Web Programmer
Current Location: Prato, Tuscany, Italy
an...@ti... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> find bin/laden -name osama -exec rm {} \;
-
Important:
--------------
I've experienced problems when receiving e-mail sent to the
address: an...@us.... I think I lost much of it.
So if you sent me a message, and I never replied to you,
that's probably the reason. Please update your address book to
this one: an...@ti.... Sorry and thank you!
|
|
From: Alain F. <ala...@ri...> - 2002-05-06 08:19:12
|
On Fri, 2002-05-03 at 23:29, Gabriele Bartolini wrote:
> >Now perhaps we should argue that server_wait_time in this case should
> >apply to a wait *after* we make max_connection_requests in a row.
>
> Yeah,
>
> I think your proposal is fine. If we enable persistent connections,
> server_wait_time partially loses its meaning; I'd rather vote for ignoring
> it when pcs are on. Or maybe as Geoff says, after the max requests in a row
> are reached.
>
> What d'u think about this?

In my view, traversing an HTTP server using pcs without waiting
server_wait_time after reaching the max_connection_requests value is the
same as traversing an HTTP server without server_wait_time and without pcs.
The result is a high load on the HTTP server(s), especially if indexing
takes place on a local network.

With or without pcs, I think server_wait_time must be applied after each
connection (or after max_connection_requests requests) if needed.

--
Alain FORCIOLI http://www.VigiNews.com/
http://www.risc.fr/ ala...@ri...
http://www.april.org/ afo...@ap...
GPG Public Key: http://www.risc.fr/~af/gpg.txt
|
|
From: Geoff H. <ghu...@ws...> - 2002-05-09 05:17:15
|
> I'd rather vote for ignoring it when pcs are on. Or maybe as Geoff
> says, after the max requests in a row are reached.

No one else seems to have much opinion on this matter. But as I said
before, I think if server_wait_time is set, you *should* wait at some
point. To me, it makes logical sense that if the user has set
max_connection_requests, then we should follow that and wait afterwards.
If we *never* wait when persistent connections are enabled, we won't be
good network citizens and won't enable ourselves to "throttle back" on a
webserver.

In a multiple-server context, there will be at least *some* wait while
htdig switches to the next server after max_connection_requests. I feel
that enabling server_wait_time at this point is consistent and will
prevent server melt-down.

-Geoff
|
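
[Editor's sketch] The throttling argument above can be put in numbers. The function below is a rough estimate under assumptions that are not confirmed by the thread: total_wait_seconds is a hypothetical name, the pause is assumed to happen only between consecutive connections to a single server, and a value of -1 is treated as "no per-connection limit" (as in the loop quoted earlier), so the limit is never reached.

```cpp
#include <cassert>

// Hypothetical estimate, not htdig code: seconds spent pausing when
// server_wait_time is applied each time max_connection_requests is
// reached and more documents remain to be fetched.
long total_wait_seconds(int docs, int max_connection_requests,
                        int server_wait_time)
{
    assert(docs >= 0 && server_wait_time >= 0);
    assert(max_connection_requests == -1 || max_connection_requests > 0);

    if (docs == 0 || max_connection_requests == -1)
        return 0;   // nothing to fetch, or the limit is never reached

    // Number of connections needed, rounded up.
    long connections =
        (static_cast<long>(docs) + max_connection_requests - 1)
        / max_connection_requests;

    // One pause between each pair of consecutive connections.
    return (connections - 1) * server_wait_time;
}
```

For example, 100 documents at 10 requests per connection with a 5 second server_wait_time gives 9 pauses, or 45 seconds of enforced idle time, which is the "throttle back" the user asked for.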
|
From: Gilles D. <gr...@sc...> - 2002-05-09 15:05:39
|
According to Geoff Hutchison:
> > I'd rather vote for ignoring it when pcs are on. Or maybe as Geoff
> > says, after the max requests in a row are reached.
>
> No one else seems to have much opinion on this matter. But as I said
> before, I think if server_wait_time is set, you *should* wait at some
> point. To me, it makes logical sense that if the user has set
> max_connection_requests, then we should follow that and wait afterwards.
> If we *never* wait when persistent connections are enabled, we won't be
> good network citizens and won't enable ourselves to "throttle back" on a
> webserver.
>
> In a multiple-server context, there will be at least *some* wait while
> htdig switches to the next server after max_connection_requests. I feel
> that enabling server_wait_time at this point is consistent and will
> prevent server melt-down.

Actually, I do have an opinion on the matter, and a strong one at that,
but haven't had a chance to voice it. Basically, Geoff, I agree with you.
Quite emphatically so, to the point of wondering why the issue is even
subject to debate. I can see NO OTHER logical way of doing this, and
certainly haven't seen any good argument for ignoring server_wait_time
between connections.

It just doesn't make sense, if the user specifies a delay between
connections, not to use that delay just because you're transferring more
than one document through that connection. The whole point of persistent
connections, and the delay, is to allow the user some control over how
much of a load htdig puts on the server, and to get the data from the
server to htdig as quickly and efficiently as is practical without bogging
things down. Respecting BOTH server_wait_time and max_connection_requests
is the only reasonable way I can see to give the user that control.

If htdig is currently ignoring server_wait_time, for any reason, then I'd
say it's a bug, not a feature, and should be fixed.

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB R3E 3J7 (Canada)
|
|
From: Gabriele B. <g.b...@co...> - 2002-05-15 07:47:16
|
Ciao guys,
<done> :-P
Now after we reach the maximum number of requests per connection we
wait for the exact number of seconds specified by server_wait_time.
Give it a look, please.
Ciao
-Gabriele
P.S.: I am going back to Australia (Melbourne, Perth, Sydney) for a few
weeks (if there's somebody over there who may find my presence useful - I
wonder why - please let me know!) :-)
--
Gabriele Bartolini - Computer Programmer
U.O. Rete Civica - Comune di Prato - Prato - Italia - Europa
g.b...@co... | http://www.po-net.prato.it/
The nice thing about Windows is - It does not just crash,
it displays a dialog box and lets you press 'OK' first.
|
|
From: Gilles D. <gr...@sc...> - 2002-05-15 15:05:57
|
According to Gabriele Bartolini:
> Ciao guys,
>
> <done> :-P
>
> Now after we reach the maximum number of requests per connection we
> wait for the exact number of seconds specified by server_wait_time.
>
> Give it a look, please.

OK, the logic in Retriever.cc looks good to me, but I haven't actually
tested it. I have a question, though. How does the information about the
value of max_connection_requests get communicated from the Retriever class
to the HtHTTP class, or perhaps more specifically, when/where does the
connection get explicitly closed when Retriever reaches the number of
requests allowed for one connection? I don't see it in the code. I imagine
that when there's more than one server being indexed, there's an implicit
request to close the connection because you're going from one server to
another. What about if there's only one server, though? The way I see it,
if there's only one server and no specified delay, the Retriever class
will just make one request after another without ever closing the
connection. Am I missing something, or have I spotted a problem?

On another note, I did a cvs diff to see exactly what you changed in
Retriever.cc. Apart from the one-line code change, and the addition of
about 6 or 7 lines of comments, I noticed changes to the indentation of a
few dozen lines of code. When I look at the file in the editor, I can't
seem to figure out what tab spacing I should be using to get the code
properly indented. Some parts make sense with a tab spacing of 8 (the
default), while some, but not all, of the changed lines seem better at a
tab spacing of 4. No one tab spacing value seems to work for the whole
file.

So, what tab spacing do you use in your editor? To put the question to the
whole list of developers, what do all of you feel we should standardize
on? My own preference is 8, because that seems to be the most widely used
standard, and what most of the code seems to be tabbed at right now*.
However, I'll go with the flow if most of you prefer something else. I
feel we really must pick a standard and stick with it, though, because
this seemingly random indentation that the code is slowly but surely
taking on is making it extremely hard to read and walk through. I think
manual code walkthroughs are a powerful way of finding bugs, and anything
that hinders this process will be a real detriment to the maintainability
of the code. So, what's it going to be?

* I should point out the difference between tab spacing and indent
spacing. Tab spacing is the number of columns between each tab stop, so
that a single tab character (0x09) advances the cursor to the next tab
stop. Indent spacing, on the other hand, is the amount by which each level
of nesting is indented. Most of the htdig code seems to be set up with tab
spacing of 8 and indent spacing of 4, so that alternating levels of
indentation require either a tab or 4 spaces. There are some variations on
this, though. E.g. Geoff's code seems to use indent spacing of 2, but with
an extra level of indentation for braces. Consistency in indent spacing
would be desirable, but not absolutely necessary, while consistent tab
spacing is much more important, otherwise the indentation can get (and has
gotten) terribly messed up and difficult to follow.

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB R3E 3J7 (Canada)
|
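
[Editor's sketch] The tab spacing versus indent spacing distinction in the footnote above can be checked with a few lines of code. The helper below is hypothetical (indent_columns is not part of htdig); it only computes the column at which the first non-blank character of a line is displayed for a given tab spacing.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not htdig code: display column of the first
// non-whitespace character, with tab stops every `tab_width` columns.
int indent_columns(const std::string& line, int tab_width)
{
    assert(tab_width > 0);
    int col = 0;
    for (char c : line) {
        if (c == '\t')
            col += tab_width - (col % tab_width);   // jump to next tab stop
        else if (c == ' ')
            ++col;
        else
            break;                                  // first real character
    }
    return col;
}
```

With a tab spacing of 8, the three lines "    a", "\tb" and "\t    c" start at columns 4, 8 and 12, i.e. three distinct nesting levels at an indent spacing of 4. With a tab spacing of 4, the same lines start at columns 4, 4 and 8, so the first two levels collapse into one, which is the kind of breakage described above.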