I have a domain like "http://www.foo.de", and this website has a link like "http://www.hamburg.foo.de/", but the crawler cannot follow it. It only follows links ending in .pdf or .html, like "http://www.foo.de/file.html".
My settings are:
$crawler = flx_Crawler::getInstance();
$crawler->setFollowMode(1);
$crawler->setFollowRedirects(TRUE);
$crawler->setFollowRedirectsTillContent(TRUE);
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->addContentTypeReceiveRule("#application/pdf#");
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
$crawler->addURLFilterRule("#\.(css|js)$# i");
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
$crawler->enableCookieHandling(true);
$crawler->setURL($url);
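(A side note on the two addURLFilterRule() patterns: PHPCrawl filter rules are ordinary PCRE patterns, so they can be sanity-checked directly with preg_match(). In particular, an unescaped dot matches any character, not just a literal "."; a small standalone sketch with made-up URLs:)

```php
<?php
// Sketch: PHPCrawl filter rules are plain PCRE patterns, so preg_match()
// can be used to check what they really match. Example URLs are invented.
$loose  = '#.(jpg|jpeg|gif|png)$#i';   // "." matches ANY character
$strict = '#\.(jpg|jpeg|gif|png)$#i';  // "\." matches a literal dot only

// The loose pattern also filters out URLs that merely END in "jpg" etc.:
var_dump(preg_match($loose,  'http://www.foo.de/show-jpg'));  // int(1)
var_dump(preg_match($strict, 'http://www.foo.de/show-jpg'));  // int(0)
var_dump(preg_match($strict, 'http://www.foo.de/photo.jpg')); // int(1)
```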
Any suggestions?
Thanks a lot in advance.
Regards,
Jorge von Rudno
Your setup looks good; it should work as expected as far as I can see.
To give it a test: could you try it again with follow-mode 0? Does it follow "http://www.hamburg.foo.de/" then? If so, the crawler seems to treat "http://www.hamburg.foo.de/" as a different host than foo.de.
I'm not sure, maybe this is a bug then. Is it actually a different "host"?
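For reference, the host/domain distinction can be checked with PHP's built-in parse_url(); a minimal standalone sketch using the placeholder URLs from this thread:

```php
<?php
// Compare the host parts of the two example URLs with PHP's built-in
// parse_url(). "foo.de" is the placeholder domain from the question.
$base = parse_url('http://www.foo.de');
$link = parse_url('http://www.hamburg.foo.de/');

// The hosts differ, so a strict same-host check rejects the link ...
$sameHost = ($base['host'] === $link['host']);                      // false

// ... even though both hosts end in the same registrable domain.
$sameDomain = (bool) preg_match('#(^|\.)foo\.de$#', $link['host']); // true
```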
I have followed your suggestion (setFollowMode(0)), and in fact the crawler can follow the link. Most likely the link "http://www.hamburg.foo.de/" goes to a different host. Is there a way to solve this situation?
Best regards.
Jorge von Rudno
OK, I think this is a bug: "http://www.hamburg.foo.de" belongs to the same domain as "www.foo.de",
so I think the crawler should definitely follow links like that with follow-mode 1!
I'll give it a test soon.
For a workaround you could simply set the follow-rules yourself without using the follow-mode.
Try something like this:
$crawler->setFollowMode(0);
$crawler->addURLFollowRule("#foo\.de#");
...
This lets the crawler follow every URL that contains the string "foo.de". You could of course refine the rule so that it won't follow URLs like "www.bla.com/dabadafoo.de".
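One way to tighten the substring rule is to anchor the pattern to the URL's host part, so that e.g. a "foo.de" appearing only in the path doesn't trigger it. A standalone sketch using preg_match() and the question's placeholder domain foo.de (the exact pattern is just a suggestion; test it against your real URLs):

```php
<?php
// Sketch: a follow rule anchored to the host part, so "foo.de" must be the
// registrable domain of the URL's host, not just a substring anywhere.
// (The domain "foo.de" is the placeholder from the question.)
$rule = '#^https?://([a-z0-9-]+\.)*foo\.de(/|$)#i';

echo (int) preg_match($rule, 'http://www.foo.de/file.html');     // 1: same domain
echo (int) preg_match($rule, 'http://www.hamburg.foo.de/');      // 1: subdomain
echo (int) preg_match($rule, 'http://www.bla.com/dabadafoo.de'); // 0: only in path
```

The anchored pattern can then be passed to addURLFollowRule() in place of the plain substring rule.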
Last edit: Anonymous 2014-02-24
I have implemented your suggestion and it has solved my problem. From your knowledge and your comments I gather you are part of the phpCrawl development team, so please tell me whether I should report this case as a bug.
Sorry, I forgot to say: thanks a lot for your help.
Regards.
Jorge von Rudno
Hi!
No problem.
I just opened a bug report for this:
https://sourceforge.net/p/phpcrawl/bugs/67/
Hi Jorge,
do you have an actual example (page) of this problem? (for testing)