
#88 getProcessReport() - user_abort and abort_reason not returning correct values

open
5
2015-04-08
2015-03-30
Anonymous
No

Hi,

When you
- Return a negative value in the overridden method handleDocumentInfo
- And the URL you're currently trying to crawl returns a 404
- And the URL you're currently trying to crawl is the first URL you passed in (e.g. the main page of a website)

The function getProcessReport() will return:
- user_abort = false
- abort_reason = 1 // PHPCrawlerAbortReasons::ABORTREASON_PASSEDTHROUGH

I think it should instead return the following (though I'm not sure whether the current behavior is by design):
- user_abort = true
- abort_reason = 4 // PHPCrawlerAbortReasons::ABORTREASON_USERABORT

In my project I need the report to show that the user aborted the crawling process, even when the process has also ended because there are no more URLs to process. So I've edited line 606 of PHPCrawler.class.php (where the problem is located) to this:

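<?php

// Only report "passed through" if the user has not already aborted the process.
// Note: "!==" is used here, because "!$x === Y" is parsed as "(!$x) === Y".
if ($this->CrawlerStatusHandler->getCrawlerStatus()->abort_reason !== PHPCrawlerAbortReasons::ABORTREASON_USERABORT) {
    $this->CrawlerStatusHandler->updateCrawlerStatus(null, PHPCrawlerAbortReasons::ABORTREASON_PASSEDTHROUGH);
}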

(excuse the formatting)
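
One thing to watch for in a guard of this kind is PHP operator precedence: `!` binds more tightly than `===`, so `!$reason === X` compares a boolean with an integer and is always false, whereas `$reason !== X` does the intended check. The sketch below is a standalone illustration, not code from PHPCrawl; the two helper function names are made up for this example, and the constant values 1 and 4 are the ones quoted in this report.

```php
<?php
// Values as quoted in the report above (PASSEDTHROUGH = 1, USERABORT = 4).
const ABORTREASON_PASSEDTHROUGH = 1;
const ABORTREASON_USERABORT = 4;

// Hypothetical helper, for illustration only.
function guardWithNegation($abortReason): bool
{
    // Parsed as (!$abortReason) === 4; a boolean is never identical to an int.
    return !$abortReason === ABORTREASON_USERABORT;
}

// Hypothetical helper, for illustration only.
function guardWithNotIdentical($abortReason): bool
{
    // True whenever the stored reason is anything other than a user abort.
    return $abortReason !== ABORTREASON_USERABORT;
}

foreach ([null, ABORTREASON_PASSEDTHROUGH, ABORTREASON_USERABORT] as $reason) {
    var_dump(guardWithNegation($reason), guardWithNotIdentical($reason));
}
```

With the negated form the guarded update would never run at all, so a crawl that finishes normally would presumably no longer be marked as passed through either.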

Just wanted to let you know. Could you tell me whether this is by design or a bug?

Discussion

  • Uwe Hunfeld

    Uwe Hunfeld - 2015-03-31

    Hi,

    thanks a lot for this detailed report!

    I think you are right, in that special case the crawler should report a user-abort.

    Or, to be more precise: the crawler should report BOTH, because that's the case here: there's a user abort AND, at the same time, there's nothing more to do (passed through, no more URLs in the queue).

    Problem: Right now the crawler isn't capable of reporting multiple abort-reasons.

    What do you think?

     
  • Anonymous

    Anonymous - 2015-04-08

    Hi, excuse me for the late response.

    The crawler should indeed report both, but since that is not possible at the moment (and would probably take some time to implement), I'd rather have it return ABORTREASON_USERABORT than ABORTREASON_PASSEDTHROUGH.

    This is because the negative return in handleDocumentInfo happens before the crawler is done crawling the URL/website, so ABORTREASON_USERABORT should have priority.

     

    Last edit: Anonymous 2015-04-08
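
The two ideas discussed above (recording several abort reasons, and letting a user abort take priority when only one can be reported) could be sketched roughly as follows. This is not part of PHPCrawl: the class `AbortReasonLog` and its methods are hypothetical names invented for this example, and only the constant values 1 and 4 come from the report.

```php
<?php
// Values as quoted in the report above; everything else here is hypothetical.
const ABORTREASON_PASSEDTHROUGH = 1;
const ABORTREASON_USERABORT = 4;

final class AbortReasonLog
{
    /** @var int[] every distinct reason recorded, in order of arrival */
    private $reasons = [];

    /** Remember a reason; duplicates are ignored. */
    public function record(int $reason): void
    {
        if (!in_array($reason, $this->reasons, true)) {
            $this->reasons[] = $reason;
        }
    }

    /** All recorded reasons, for callers that can handle more than one. */
    public function all(): array
    {
        return $this->reasons;
    }

    /** The single reason to report: a user abort wins over anything else. */
    public function primary(): ?int
    {
        if (in_array(ABORTREASON_USERABORT, $this->reasons, true)) {
            return ABORTREASON_USERABORT;
        }
        return $this->reasons[0] ?? null;
    }
}

$log = new AbortReasonLog();
$log->record(ABORTREASON_USERABORT);     // handleDocumentInfo() returned a negative value
$log->record(ABORTREASON_PASSEDTHROUGH); // and the URL queue is empty as well

var_dump($log->primary() === ABORTREASON_USERABORT); // bool(true)
var_dump($log->all());
```

Keeping the full list alongside a single "primary" value would let existing code that reads one abort_reason keep working, while newer code could inspect every reason.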
