I have been trying to crawl one of my sites and have run into a problem when the server returns a 302 response that redirects from an http page to an https page.
I have persistent connections switched on, and pavuk is re-using the existing connection on port 80 rather than making a new one on port 443.
This is caused by the redirect-handling code in doc_download_helper(). This code correctly notices that the port number is different from the previous request and so calls abs_close_socket(). However, abs_close_socket() calls should_leave_persistent(), which returns the wrong answer because docp->is_persistent is still set (it gets cleared immediately after abs_close_socket() returns).
I found that moving the clearing of this field above where abs_close_socket() gets called fixes the problem.
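To make the ordering problem concrete, here is a minimal standalone sketch. It is not pavuk code: the struct and the sketch_* functions are hypothetical stand-ins, and it assumes that should_leave_persistent() keeps the socket whenever the persistent flag is set (the real function may well check more than that).

```c
#include <stdbool.h>

/* Hypothetical stand-in for pavuk's document structure; only the two
 * pieces of state needed for the illustration are modelled. */
typedef struct {
    bool is_persistent;  /* connection is flagged for re-use */
    bool socket_open;    /* simulated socket state */
} sketch_doc;

/* Simplified stand-in for should_leave_persistent(): ask to keep the
 * socket whenever the document is still flagged as persistent.
 * (Assumption: the real check may involve more conditions.) */
static bool sketch_should_leave_persistent(const sketch_doc *d)
{
    return d->is_persistent;
}

/* Simplified stand-in for abs_close_socket(): the socket is only closed
 * when the persistence check does not ask to keep it. */
static void sketch_close_socket(sketch_doc *d)
{
    if (!sketch_should_leave_persistent(d))
        d->socket_open = false;
}

/* Original order: close first, clear the flag afterwards.
 * Returns whether the socket is still open after redirect handling. */
static bool redirect_close_then_clear(sketch_doc *d)
{
    sketch_close_socket(d);      /* flag still set, so nothing is closed */
    d->is_persistent = false;
    return d->socket_open;
}

/* Patched order: clear the flag first, then close.
 * Returns whether the socket is still open after redirect handling. */
static bool redirect_clear_then_close(sketch_doc *d)
{
    d->is_persistent = false;
    sketch_close_socket(d);      /* flag cleared, so the socket is closed */
    return d->socket_open;
}
```

Starting from a document with both fields set to true, the first function leaves the simulated socket open (the stale port 80 connection that then gets re-used), while the second closes it so that a new connection has to be made.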
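Here is the diff that fixed the problem for me (the two files were given to diff in reverse order, so the "-" line shows my fixed version and the "+" line shows the original 0.9.35 code):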
--- pavuk-0.9.35.sheltond/src/doc.c 2010-03-22 17:26:44.000000000 +0000
+++ pavuk-0.9.35/src/doc.c 2004-11-03 06:51:11.000000000 +0000
@@ -882,19 +882,19 @@
       {
         http_handle_redirect(docu, resp->ret_code);
         http_response_free(resp);
         if(docu->is_persistent)
         {
-          docu->is_persistent = FALSE;
           if(docu->doc_url->moved_to &&
             ((url_get_port(docu->doc_url) !=
                 url_get_port(docu->doc_url->moved_to))
               || strcmp(url_get_site(docu->doc_url),
                 url_get_site(docu->doc_url->moved_to))))
           {
             abs_close_socket(docu, TRUE);
           }
+          docu->is_persistent = FALSE;
         }
         return -1;
       }
       http_response_free(resp);
Same patch, hopefully formatted in a readable way.