Hi,
I'm using PHPCrawl and this tool is amazing - very useful for me. I have a problem though: on one page the crawler gets into an infinite loop.
I'm using Windows 7 and the latest XAMPP (just for local testing). I'm trying to process https://www.olay.pl/Pages/SecureJoinClubOlay_PL.aspx.
The configuration of PHPCrawl is pretty much standard; the problem, I think, is with its code or with the HTTP server: during the download it never signals EOF or a timeout. After the full page source has been downloaded, the server just keeps sending empty "lines", and the crawler can't figure out that it has reached the end of the document (I don't get any timeouts either).
Looking at PHPCrawlerHTTPRequest.class.php:
$status = socket_get_status($this->socket);
We have two checks:
// Check for EOF
if ($status["eof"] == true) $stop_receving = true;
// Socket timed out
if ($status["timed_out"] == true)
Both of them fail. These checks fail as well:
// Check if content-length stated in the header is reached
if ($this->lastResponseHeader->content_length == $bytes_received)
{
$stop_receving = true;
}
// Check if contentsize-limit is reached
if ($this->content_size_limit > 0 && $this->content_size_limit <= $bytes_received)
{
$stop_receving = true;
}
The content-size limit is useless here, because the "lines" that arrive after the end of the document are empty, so $bytes_received never grows to reach the limit. The response header does not contain a Content-Length either, so there is no way to stop the crawler (aside from the page execution time limit).
Is there any way to detect this and terminate the download without getting stuck in an infinite loop?
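To make the question concrete, this is roughly the kind of defensive loop I have in mind (just a sketch of mine, not phpcrawl code; the host, the limits and the variable names are made up):

<?php
// Rough sketch: a receive loop that cannot spin forever, even when the
// server never signals EOF or a timeout.
$socket = fsockopen("ssl://www.olay.pl", 443, $errno, $errstr, 5);
stream_set_timeout($socket, 5);
fwrite($socket, "GET /Pages/SecureJoinClubOlay_PL.aspx HTTP/1.0\r\n" .
                "Host: www.olay.pl\r\n\r\n");

$empty_reads = 0;
$document = "";
while (true)
{
  $line = fgets($socket, 1024);
  $status = socket_get_status($socket);

  // Stop on a real EOF or on a socket timeout.
  if (feof($socket) || $status["eof"] == true || $status["timed_out"] == true) break;

  // Guard against the endless empty "lines" described above: give up
  // after 100 consecutive reads that return no data.
  if ($line === false || $line === "")
  {
    if (++$empty_reads > 100) break;
  }
  else
  {
    $empty_reads = 0;
    $document .= $line;
  }
}
fclose($socket);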
Hi!
That's really strange; I've never encountered such a problem.
I will take a look at it in the next few days.
Could you please post your setup? (PHP-version and phpcrawl-version).
Right now I don't have any idea how to determine that there's no more content to be sent by the server - what else could indicate it other than socket EOF and/or content-length detection?
Will open a bug in the tracker for this.
Thanks for the report!
Bug report:
http://sourceforge.net/tracker/?func=detail&aid=3601482&group_id=89439&atid=590146
Hi again,
I just got around to doing a short test with this problem.
I tested it on Linux (Ubuntu) with the newest version of phpcrawl 0.81 and I don't get any problems with that page.
So this must have something to do with Win7 and/or XAMPP.
I had exactly the same problem. It seems to happen only on HTTPS sites. I was testing with https://www.casasvitoria.com.br/. I tested on Windows 7 and Windows XP.
After some research, I found the solution. There is a note about the eof field returned by the PHP function stream_get_meta_data (or socket_get_status) at http://php.net/manual/pt_BR/function.stream-get-meta-data.php:
eof (bool) - TRUE if the stream has reached end-of-file. Note that for socket streams this member can be TRUE even when unread_bytes is non-zero. To determine if there is more data to be read, use feof() instead of reading this item.
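As an illustration of what the manual describes, a standalone snippet along these lines (my own test code, not part of phpcrawl) compares feof() with the metadata flag on an SSL stream:

<?php
// Test snippet: compare feof() with stream_get_meta_data()'s "eof" flag
// while reading an HTTPS response over an ssl:// socket.
$fp = fsockopen("ssl://www.casasvitoria.com.br", 443, $errno, $errstr, 10);
stream_set_timeout($fp, 10);
fwrite($fp, "GET / HTTP/1.0\r\nHost: www.casasvitoria.com.br\r\n\r\n");

while (!feof($fp))
{
  fread($fp, 1024);
  $meta = stream_get_meta_data($fp);
  // On the problematic HTTPS connections, $meta["eof"] stays false even
  // after the server is done, while feof() reports the end correctly.
  printf("feof: %d  meta[eof]: %d  timed_out: %d\n",
         (int)feof($fp), (int)$meta["eof"], (int)$meta["timed_out"]);
}
fclose($fp);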
So in the function readResponseContentChunk in the file PHPCrawlerHTTPRequest.class.php, replace the line:
if ($status["eof"] == true)
with:
if (feof($this->socket) == true || $status["eof"] == true)
Maybe it could be only:
if (feof($this->socket) == true)
but I'm not sure.
Now it's working.
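For context, here is roughly where the patched check sits in the receive loop (a simplified sketch; the loop structure and the $chunk variable are approximations of mine, not the exact phpcrawl source):

// Simplified sketch of the termination check after the patch:
while ($stop_receving == false)
{
  $chunk = fread($this->socket, 1024);
  $status = socket_get_status($this->socket);

  // feof() is the reliable signal on SSL streams; $status["eof"] is
  // kept as a fallback.
  if (feof($this->socket) == true || $status["eof"] == true)
  {
    $stop_receving = true;
  }
}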
Rodrigo Sibin Lichti
rodsibin@gmail.com