How do I set a proxy in phpcrawl?
I want to connect through a proxy server that requires authentication.
Hi,
sorry, but proxy support is currently not implemented in phpcrawl.
It works for me - perhaps it will for you too.
Change line #150 in phpcrawlerpagerequest.class.php to:
$s = @fsockopen ("PROXYURL", "PROXYPORT", $e, $t, $this->socket_mean_timeout);
with your own values in place of the capitalized placeholders.
Adding proxy support with authentication requires the following modifications to phpcrawlerpagerequest.class.php (version 0.71):
OLD CODE (starting at line 148):
===
// Open socket-connection
if ($url_disallowed == false)
{
  $s = @fsockopen($host_str, $port, $e, $t, $this->socket_mean_timeout);
}
else
{
  return false; // Return false if the URL was completely ignored
}

if ($s == false) // Connection-error
{
  $error_string = $t;
  $error_code = $e;
  if ($t == "" && $e == "")
  {
    $error_code = 0;
    $error_string = "Couldn't connect to server";
  }
}
else
{
  $header_found = false; // will be set to true once the header of the page was extracted

  // Build header to send
  $headerlines_to_send = "GET ".$path.$file.$query." HTTP/1.0\r\n";
  $headerlines_to_send .= "HOST: ".$host."\r\n";

  // Referer
  if ($referer_url != "")
  {
    $headerlines_to_send .= "Referer: $referer_url\r\n";
  }

  // Cookies
  if ($this->handle_cookies == true)
  {
    $cookie_string = PHPCrawlerUtils::buildHeaderCookieString($this->cookies, $host);
  }

  if (isset($cookie_string))
  {
    $headerlines_to_send .= "Cookie: ".$cookie_string."\r\n";
  }

  // Authentication
  if (count($authentication) > 0)
  {
    $auth_string = base64_encode($authentication["username"].":".$authentication["password"]);
    $headerlines_to_send .= "Authorization: Basic ".$auth_string."\r\n";
  }

  // Rest of header
  $headerlines_to_send .= "User-Agent: ".str_replace("\n", "", $this->user_agent_string)."\r\n";
  $headerlines_to_send .= "Connection: close\r\n";
  $headerlines_to_send .= "\r\n";
===
NEW CODE:
// Open socket-connection
if ($url_disallowed == false)
{
  // Connect to the proxy instead of the target host
  // (replace $PROXY_URL_OR_IP and $PROXY_PORT with your proxy's address and port)
  $s = @fsockopen($PROXY_URL_OR_IP, $PROXY_PORT, $e, $t, $this->socket_mean_timeout);
}
else
{
  return false; // Return false if the URL was completely ignored
}

if ($s == false) // Connection-error
{
  $error_string = $t;
  $error_code = $e;
  if ($t == "" && $e == "")
  {
    $error_code = 0;
    $error_string = "Couldn't connect to server";
  }
}
else
{
  $header_found = false; // will be set to true once the header of the page was extracted

  // Build header to send
  // When talking to a proxy, request the full URL instead of just the path
  #$headerlines_to_send = "GET ".$path.$file.$query." HTTP/1.0\r\n";
  $headerlines_to_send = "GET $url_to_crawl HTTP/1.0\r\n";
  $headerlines_to_send .= "HOST: ".$host."\r\n";

  // Referer
  if ($referer_url != "")
  {
    $headerlines_to_send .= "Referer: $referer_url\r\n";
  }

  // Cookies
  if ($this->handle_cookies == true)
  {
    $cookie_string = PHPCrawlerUtils::buildHeaderCookieString($this->cookies, $host);
  }

  if (isset($cookie_string))
  {
    $headerlines_to_send .= "Cookie: ".$cookie_string."\r\n";
  }

  // Authentication
  if (count($authentication) > 0)
  {
    $auth_string = base64_encode($authentication["username"].":".$authentication["password"]);
    $headerlines_to_send .= "Authorization: Basic ".$auth_string."\r\n";
  }

  // Rest of header
  $headerlines_to_send .= "User-Agent: ".str_replace("\n", "", $this->user_agent_string)."\r\n";
  // Proxy credentials, base64-encoded as "username:password"
  $headerlines_to_send .= "Proxy-Authorization: Basic ".base64_encode("username:password")."\r\n";
  $headerlines_to_send .= "Connection: close\r\n";
  $headerlines_to_send .= "\r\n";
===
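The Proxy-Authorization value is nothing magic: it is just the Base64 encoding of "username:password", exactly like ordinary HTTP Basic authentication. A quick sanity check of the encoding (shown in Python purely for illustration; the credentials are made-up placeholders, as in the patch above):

```python
import base64

# Hypothetical example credentials -- substitute your real proxy login.
username = "username"
password = "password"

# HTTP Basic scheme: base64("user:pass"), sent as "Proxy-Authorization: Basic <token>"
token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
header = f"Proxy-Authorization: Basic {token}"
print(header)  # → Proxy-Authorization: Basic dXNlcm5hbWU6cGFzc3dvcmQ=
```

Note that Base64 is an encoding, not encryption - anyone who can read the traffic can recover the credentials.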
So you're changing three things:
1. You're connecting to the proxy instead of directly to the site.
2. You're changing the GET request to use the full $url_to_crawl.
3. You're passing in the encoded username/password for proxy authorization.
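Put together, the three changes amount to assembling a proxy-style request like the following sketch (Python for illustration only; the proxy address, URL, and credentials are all made-up values):

```python
import base64

# Hypothetical values -- replace with your proxy address and credentials.
proxy_host, proxy_port = "proxy.example.com", 3128   # change 1: connect here, not to the site
url_to_crawl = "http://www.example.com/page.html"
host = "www.example.com"
auth_token = base64.b64encode(b"username:password").decode("ascii")

# change 2: the request line carries the absolute URL, not just the path
request = f"GET {url_to_crawl} HTTP/1.0\r\n"
request += f"HOST: {host}\r\n"
request += "User-Agent: Example-Crawler\r\n"
# change 3: proxy credentials travel in a Proxy-Authorization header
request += f"Proxy-Authorization: Basic {auth_token}\r\n"
request += "Connection: close\r\n"
request += "\r\n"  # blank line terminates the header block

# With a real proxy you would now open a socket to (proxy_host, proxy_port)
# and send `request`, exactly as the patched fsockopen() code does.
print(request)
```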