Hi,
When you
- return a negative value from the overridden handleDocumentInfo() method,
- and the URL you're currently trying to crawl returns a 404,
- and that URL is the first one you input (for example, the main page of a website),
then the function getProcessReport() will return:
- user_abort = false
- abort_reason = 1 // PHPCrawlerAbortReasons::ABORTREASON_PASSEDTHROUGH
When in fact it should (I think, though I'm not sure whether this behavior is by design) return:
- user_abort = true
- abort_reason = 4 // PHPCrawlerAbortReasons::ABORTREASON_USERABORT
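For reference, here's a minimal way to reproduce it; this is only a sketch, the require path and the start URL are placeholders, and any site whose very first page answers with a 404 should do:

```php
<?php
// Adjust the path to wherever your PHPCrawl installation lives.
require_once("libs/PHPCrawler.class.php");

class AbortingCrawler extends PHPCrawler
{
    function handleDocumentInfo($DocInfo)
    {
        // A negative return value tells the crawler to abort (user-abort).
        return -1;
    }
}

$crawler = new AbortingCrawler();
// Placeholder: assume this very first URL responds with a 404.
$crawler->setURL("http://www.example.com/");
$crawler->go();

$report = $crawler->getProcessReport();
// Observed: user_abort = false, abort_reason = 1 (ABORTREASON_PASSEDTHROUGH)
// Expected: user_abort = true,  abort_reason = 4 (ABORTREASON_USERABORT)
var_dump($report->user_abort, $report->abort_reason);
```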
In my project I need it to show that the user aborted the crawling process, even when the process has already ended because there are no more URLs to process, so I've edited line 606 of PHPCrawler.class.php (where the problem is located) to this:
(excuse the formatting)
Just wanted to let you know. Could you perhaps tell me whether this is by design or a bug?
Anonymous
Hi,
thanks a lot for this detailed report!
I think you are right: in that special case the crawler should report a user-abort.
Or to be even more concrete: the crawler should report BOTH, because that's the case here; there's a user-abort and, at the same time, there's nothing more to do (passed-through / no more URLs in the queue).
Problem: Right now the crawler isn't capable of reporting multiple abort-reasons.
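Just to sketch what I mean (this is NOT existing PHPCrawler code, the class and function names are only a rough idea): the process-report could carry a list of reasons in addition to the single value, e.g.

```php
<?php
// Hypothetical sketch only, not actual PHPCrawler code.
// (Uses the PHPCrawlerAbortReasons constants from the library.)
// Idea: collect every abort-reason that applies instead of just one.

class HypotheticalProcessReport
{
    public $abort_reason;             // existing single value, kept for compatibility
    public $user_abort = false;
    public $abort_reasons = array();  // new: every reason that applied
}

function buildReport($user_requested_abort, $queue_is_empty)
{
    $report = new HypotheticalProcessReport();

    if ($user_requested_abort)
    {
        $report->user_abort = true;
        $report->abort_reasons[] = PHPCrawlerAbortReasons::ABORTREASON_USERABORT;     // 4
    }

    if ($queue_is_empty)
    {
        $report->abort_reasons[] = PHPCrawlerAbortReasons::ABORTREASON_PASSEDTHROUGH; // 1
    }

    // Keep the old field filled with the highest-priority reason.
    if (count($report->abort_reasons) > 0)
    {
        $report->abort_reason = $report->abort_reasons[0];
    }

    return $report;
}
```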
What do you think?
Hi, excuse me for the late response.
The crawler should indeed report both, but since that is not possible at the moment (and it would probably take some time to implement), I'd rather have it return ABORTREASON_USERABORT than ABORTREASON_PASSEDTHROUGH.
This is because the negative return from handleDocumentInfo() happens before the crawler is done with the URL/website, so ABORTREASON_USERABORT should take priority.
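In code, what I have in mind is simply to check for the user-abort first; this is a rough sketch of the ordering only, and the function and variable names are illustrative, not the actual ones around line 606 of PHPCrawler.class.php:

```php
<?php
// Illustrative only; not the real internals of PHPCrawler.class.php.
function determineAbortReason($user_requested_abort, $queue_is_empty)
{
    // Check the user-abort first so it wins over "nothing left to do".
    if ($user_requested_abort)
    {
        return PHPCrawlerAbortReasons::ABORTREASON_USERABORT;      // 4
    }

    if ($queue_is_empty)
    {
        return PHPCrawlerAbortReasons::ABORTREASON_PASSEDTHROUGH;  // 1
    }

    return null; // keep crawling
}
```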
Last edit: Anonymous 2015-04-08