Hi,
I'm using PHPCrawl and this tool is amazing - very useful for me. I have a problem though: on one page the crawler gets into an infinite loop.
I'm using Windows 7 and the latest XAMPP (just for local testing). I'm trying to process https://www.olay.pl/Pages/SecureJoinClubOlay_PL.aspx.
The configuration of PHPCrawl is pretty much standard; the problem, I think, is with its code or with the HTTP server: during the download it never signals EOF or a timeout. After the full page source has been downloaded, the server just keeps sending empty "lines", and the crawler can't figure out that it has reached the end of the document (I don't get any timeouts either).
Looking at PHPCrawlerHTTPRequest.class.php:
$status = socket_get_status($this->socket);
We have two checks:
// Check for EOF
if ($status["eof"] == true) $stop_receving = true;
// Socket timed out
if ($status["timed_out"] == true)
Both of them fail. These checks fail as well:
// Check if content-length stated in the header is reached
if ($this->lastResponseHeader->content_length == $bytes_received)
{
$stop_receving = true;
}
// Check if contentsize-limit is reached
if ($this->content_size_limit > 0 && $this->content_size_limit <= $bytes_received)
{
$stop_receving = true;
}
The content-size limit is useless here, because the "lines" that arrive after the end of the document are empty, so $bytes_received never grows to reach the limit. The response header does not contain a Content-Length either, so there is no way to stop the crawler (aside from the page execution time limit).
Is there any way to detect this and terminate the download without getting stuck in an infinite loop?
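To make the question concrete, this is roughly the kind of defensive loop I have in mind (just a sketch of mine, not phpcrawl code; the host, the limits and the variable names are made up):

<?php
// Rough sketch: a receive loop that cannot spin forever, even when the
// server never signals EOF or a timeout.
$socket = fsockopen("ssl://www.olay.pl", 443, $errno, $errstr, 5);
stream_set_timeout($socket, 5);
fwrite($socket, "GET /Pages/SecureJoinClubOlay_PL.aspx HTTP/1.0\r\n" .
                "Host: www.olay.pl\r\n\r\n");

$empty_reads = 0;
$document = "";
while (true)
{
  $line = fgets($socket, 1024);
  $status = socket_get_status($socket);

  // Stop on a real EOF or on a socket timeout.
  if (feof($socket) || $status["eof"] == true || $status["timed_out"] == true) break;

  // Guard against the endless empty "lines" described above: give up
  // after 100 consecutive reads that return no data.
  if ($line === false || $line === "")
  {
    if (++$empty_reads > 100) break;
  }
  else
  {
    $empty_reads = 0;
    $document .= $line;
  }
}
fclose($socket);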
Hi!
That's really strange; I've never encountered such a problem.
I will take a look at it in the next few days.
Could you please post your setup? (PHP-version and phpcrawl-version).
Right now I don't have any idea how to determine that there's no more content to be sent by the server - what else could indicate it other than socket EOF and/or content-length detection?
Will open a bug in the tracker for this.
Thanks for the report!
Bug report:
http://sourceforge.net/tracker/?func=detail&aid=3601482&group_id=89439&atid=590146
Hi again,
I just got around to doing a short test with this problem.
I tested it on Linux (Ubuntu) with the newest version of phpcrawl 0.81 and I don't get any problems with that page.
So this must have something to do with Win7 and/or XAMPP.
I had exactly the same problem. It seems to happen only on HTTPS sites. I was testing with https://www.casasvitoria.com.br/. I tested on Windows 7 and Windows XP.
After some research, I found the solution. There is a note about the eof field returned by the PHP function stream_get_meta_data (or socket_get_status) at http://php.net/manual/pt_BR/function.stream-get-meta-data.php:
eof (bool) - TRUE if the stream has reached end-of-file. Note that for socket streams this member can be TRUE even when unread_bytes is non-zero. To determine if there is more data to be read, use feof() instead of reading this item.
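As an illustration of what the manual describes, a standalone snippet along these lines (my own test code, not part of phpcrawl) compares feof() with the metadata flag on an SSL stream:

<?php
// Test snippet: compare feof() with stream_get_meta_data()'s "eof" flag
// while reading an HTTPS response over an ssl:// socket.
$fp = fsockopen("ssl://www.casasvitoria.com.br", 443, $errno, $errstr, 10);
stream_set_timeout($fp, 10);
fwrite($fp, "GET / HTTP/1.0\r\nHost: www.casasvitoria.com.br\r\n\r\n");

while (!feof($fp))
{
  fread($fp, 1024);
  $meta = stream_get_meta_data($fp);
  // On the problematic HTTPS connections, $meta["eof"] stays false even
  // after the server is done, while feof() reports the end correctly.
  printf("feof: %d  meta[eof]: %d  timed_out: %d\n",
         (int)feof($fp), (int)$meta["eof"], (int)$meta["timed_out"]);
}
fclose($fp);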
So in the function readResponseContentChunk in the file PHPCrawlerHTTPRequest.class.php, replace the line:
if ($status["eof"] == true)
with:
if (feof($this->socket) == true || $status["eof"] == true)
Maybe it could be only:
if (feof($this->socket) == true)
but I'm not sure.
Now it's working.
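For context, here is roughly where the patched check sits in the receive loop (a simplified sketch; the loop structure and the $chunk variable are approximations of mine, not the exact phpcrawl source):

// Simplified sketch of the termination check after the patch:
while ($stop_receving == false)
{
  $chunk = fread($this->socket, 1024);
  $status = socket_get_status($this->socket);

  // feof() is the reliable signal on SSL streams; $status["eof"] is
  // kept as a fallback.
  if (feof($this->socket) == true || $status["eof"] == true)
  {
    $stop_receving = true;
  }
}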
Rodrigo Sibin Lichti
rodsibin@gmail.com