Infinite loop for site without EOF/timeout

Help
2013-01-18
2014-07-26
  • Nobody/Anonymous

    Hi,

    I'm using PHPCrawl and this tool is amazing - veeery useful for me. I have a problem though: on one page the crawler gets into an infinite loop.

    I'm using Windows 7 and the latest XAMPP (just for local testing). I'm trying to process https://www.olay.pl/Pages/SecureJoinClubOlay_PL.aspx.

    My PHPCrawl configuration is pretty much standard, but the problem, I think, lies in its code or in the HTTP server - during the download the server never sends EOF and never times out. After the full page source has been transferred it just keeps sending empty "lines", so the crawler can't figure out that it has reached the end of the document (and I don't get any timeouts either).
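
    To illustrate, here is a stripped-down stand-alone script showing the same read-loop pattern (just a sketch of what I believe happens - not PHPCrawl's actual code):

          <?php
          // Open a TLS connection and request the problematic page
          $socket = stream_socket_client("ssl://www.olay.pl:443", $errno, $errstr, 10);
          stream_set_timeout($socket, 5);

          fwrite($socket, "GET /Pages/SecureJoinClubOlay_PL.aspx HTTP/1.0\r\n".
                          "Host: www.olay.pl\r\n\r\n");

          while (true)
          {
            // After the document has ended, this just returns empty "lines"
            $line = fgets($socket, 1024);

            $status = stream_get_meta_data($socket);
            if ($status["eof"] == true) break;       // never becomes true in my case
            if ($status["timed_out"] == true) break; // never fires either
          }
          ?>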

    Looking at PHPCrawlerHTTPRequest.class.php:
          $status = socket_get_status($this->socket);

    We have two checks:
          // Check for EOF
          if ($status["eof"] == true) $stop_receving = true;

          // Socket timed out
          if ($status["timed_out"] == true)

    Both of them fail.

    Those checks too:
          // Check if content-length stated in the header is reached
          if ($this->lastResponseHeader->content_length == $bytes_received)
          {
            $stop_receving = true;
          }
         
          // Check if contentsize-limit is reached
          if ($this->content_size_limit > 0 && $this->content_size_limit <= $bytes_received)
          {
            $stop_receving = true;
          }

    The content-size limit doesn't help either, because the "lines" arriving after the end of the document are empty, so the byte counter stops growing. And the response header doesn't contain a Content-Length, so there is no way to stop the crawler (aside from PHP's max execution time).

    Is there any way to catch this and terminate the download without running into an infinite loop?
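
    The only stopgap I can think of so far is to treat a long run of consecutive empty reads as the end of the document - something like this (untested sketch, the threshold is arbitrary and the processing of $line is simplified):

          $empty_reads = 0;

          while ($stop_receving == false)
          {
            $line = fgets($this->socket, 1024);

            if ($line === false || $line === "")
            {
              // Nothing arrived - count it and give up after a while
              $empty_reads++;
              if ($empty_reads >= 100) $stop_receving = true;
              usleep(10000); // don't busy-spin while waiting
            }
            else
            {
              $empty_reads = 0;
              // ... process $line as before ...
            }
          }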

     
  • Nobody/Anonymous

    Hi!

    That's really strange - I've never encountered such a problem.
    I will take a look at it in the next few days.

    Could you please post your setup? (PHP-version and phpcrawl-version).

    Right now I don't have any idea how to determine that there's no more content to be sent by the
    server - what else could indicate it, other than socket EOF and/or content-length detection?

    Will open a bug in the tracker for this.

    Thanks for the report!

     
  • Uwe Hunfeld - 2013-04-08

    Hi again,

    I just got around to doing a short test on this problem.

    I just tested it on Linux (Ubuntu) with the newest version of phpcrawl 0.81, and I don't get any
    problems with that page.

    So this must have something to do with Win7 and/or XAMPP.

     
  • Anonymous - 2014-07-26

    I had exactly the same problem. It seems to happen only on HTTPS sites. I was testing with https://www.casasvitoria.com.br/, on both Windows 7 and Windows XP.

    After some research, I found the solution. There is a note about the "eof" entry returned by the PHP function stream_get_meta_data (or its alias socket_get_status) at http://php.net/manual/pt_BR/function.stream-get-meta-data.php:

    eof (bool) - TRUE if the stream has reached end-of-file. Note that for socket streams this member can be TRUE even when unread_bytes is non-zero. To determine if there is more data to be read, use feof() instead of reading this item.
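
    In other words, on a socket stream the two indicators can disagree - roughly like this (illustrative only):

          $status = stream_get_meta_data($this->socket);
          var_dump($status["eof"]);      // can stay FALSE on these HTTPS streams
          var_dump(feof($this->socket)); // per the manual, the reliable check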

    So in the function readResponseContentChunk in the file PHPCrawlerHTTPRequest.class.php, replace the line:
          if ($status["eof"] == true)
    with:
          if (feof($this->socket) == true || $status["eof"] == true)

    Maybe it could be just:
          if (feof($this->socket) == true)
    but I'm not sure.

    Now it's working.
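
    For context, this is roughly how the patched check sits in the chunk-reading loop (everything apart from the quoted if-line is paraphrased, so treat it as a sketch):

          while ($stop_receving == false)
          {
            $line = fgets($this->socket, 1024);
            $source_chunk .= $line;
            $bytes_received += strlen($line);

            $status = stream_get_meta_data($this->socket);

            // Patched EOF check: feof() also catches the HTTPS case in which
            // $status["eof"] never becomes true
            if (feof($this->socket) == true || $status["eof"] == true)
            {
              $stop_receving = true;
            }

            // Socket timed out
            if ($status["timed_out"] == true)
            {
              $stop_receving = true;
            }
          }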

    Rodrigo Sibin Lichti
    rodsibin@gmail.com

     

