But you can do it easily yourself in your extended crawler class, something like this:

```php
class MyCrawler extends PHPCrawler
{
    // Number of redirect responses (301/302) seen so far
    protected $redirect_count = 0;

    function handleDocumentInfo($DocInfo)
    {
        // ...

        // Count every redirect response
        if ($DocInfo->http_status_code == "301" || $DocInfo->http_status_code == "302") {
            $this->redirect_count++;
        }

        // A 200 response means a real page was reached
        if ($DocInfo->http_status_code == "200") {
            echo "Redirects to this URL: " . $this->redirect_count;
        }

        // ...
    }
}
```
Last edit: Anonymous 2013-06-07
I couldn't see the relevant info in the headerInfo or pageInfo objects.
Hi Ed Eliot,
no, there isn't such a property.
Fabulous, thanks.
Isn't the number of redirects just going to keep incrementing with every page request, giving you an invalid result? I.e.:
Start
Get page A, 2 redirects
$redirect_count eq 2
Follow link to page B, 1 redirect
$redirect_count eq 3 (bad)
On the second page the count should be 1.
Last edit: Pan European 2013-06-28
Yes, you are right.
I thought the question was how many redirects it took to get to the FIRST real page.
If you are running in single-process mode, you could simply reset the counter after a status code of 200 occurred (am I right?).
If you are using multi-process mode, this gets difficult, since the pages don't get crawled in a "straight" order (i.e. process 1 gets the first redirect, process 4 gets the second redirect, and process 2 gets the final page after other processes have requested completely different pages in the meantime).
Right now no "easy" solution for the latter situation comes to mind, sorry.
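
For the single-process case, the reset idea could look roughly like the sketch below. It is only an illustration: RedirectCounter and handleStatus() are made-up names, not part of PHPCrawl, and the status codes are fed in by hand in crawl order instead of coming from handleDocumentInfo(). It assumes each redirect chain is requested back to back, which is exactly what multi-process mode does not guarantee.

```php
<?php
// Hypothetical stand-in for the counting logic; not part of PHPCrawl.
class RedirectCounter
{
    private $redirect_count = 0;

    // Takes status codes in crawl order. Returns the number of redirects
    // that preceded a 200 response, or null for any other status code.
    public function handleStatus($http_status_code)
    {
        if ($http_status_code == 301 || $http_status_code == 302) {
            $this->redirect_count++;
            return null;
        }
        if ($http_status_code == 200) {
            $count = $this->redirect_count;
            $this->redirect_count = 0; // reset so the next page starts at 0
            return $count;
        }
        return null;
    }
}

// Replay of the scenario from the thread:
// page A is reached after 2 redirects, page B after 1.
$counter = new RedirectCounter();
$results = array();
foreach (array(301, 302, 200, 301, 200) as $status) {
    $count = $counter->handleStatus($status);
    if ($count !== null) {
        $results[] = $count;
    }
}
echo implode(", ", $results), "\n"; // prints "2, 1"
```

In a real crawler the same two branches would sit inside handleDocumentInfo(), using $DocInfo->http_status_code as in the snippet earlier in the thread. Note that a chain ending in something other than 200 (a 404, say) would leave the counter untouched here; resetting on every non-redirect status would be one way around that.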