I have a domain like "http://www.foo.de", and this website has a link like "http://www.hamburg.foo.de/", but the crawler cannot follow it. It only follows links ending in .pdf or .html, like "http://www.foo.de/file.html".
My settings are:
$crawler = flx_Crawler::getInstance();
$crawler->setFollowMode(1);
$crawler->setFollowRedirects(TRUE);
$crawler->setFollowRedirectsTillContent(TRUE);
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->addContentTypeReceiveRule("#application/pdf#");
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
$crawler->addURLFilterRule("#\.(css|js)$# i");
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);
$crawler->enableCookieHandling(true);
$crawler->setURL($url);
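(A side note on the two addURLFilterRule() patterns: PHPCrawl filter rules are ordinary PCRE patterns, so they can be sanity-checked directly with preg_match(). In particular, an unescaped dot matches any character, not just a literal "."; a small standalone sketch with made-up URLs:)

```php
<?php
// Sketch: PHPCrawl filter rules are plain PCRE patterns, so preg_match()
// can be used to check what they really match. Example URLs are invented.
$loose  = '#.(jpg|jpeg|gif|png)$#i';   // "." matches ANY character
$strict = '#\.(jpg|jpeg|gif|png)$#i';  // "\." matches a literal dot only

// The loose pattern also filters out URLs that merely END in "jpg" etc.:
var_dump(preg_match($loose,  'http://www.foo.de/show-jpg'));  // int(1)
var_dump(preg_match($strict, 'http://www.foo.de/show-jpg'));  // int(0)
var_dump(preg_match($strict, 'http://www.foo.de/photo.jpg')); // int(1)
```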
Any suggestions?
Thanks a lot in advance.
Regards,
Jorge von Rudno
Your setup looks good; it should work as expected as far as I can see.
To give it a test: could you try it again with follow-mode 0? Does it follow "http://www.hamburg.foo.de/" then? If so, the crawler seems to treat "http://www.hamburg.foo.de/" as a different host than foo.de.
I'm not sure, maybe this is a bug then. Is it actually a different "host"?
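For reference, the host/domain distinction can be checked with PHP's built-in parse_url(); a minimal standalone sketch using the placeholder URLs from this thread:

```php
<?php
// Compare the host parts of the two example URLs with PHP's built-in
// parse_url(). "foo.de" is the placeholder domain from the question.
$base = parse_url('http://www.foo.de');
$link = parse_url('http://www.hamburg.foo.de/');

// The hosts differ, so a strict same-host check rejects the link ...
$sameHost = ($base['host'] === $link['host']);                      // false

// ... even though both hosts end in the same registrable domain.
$sameDomain = (bool) preg_match('#(^|\.)foo\.de$#', $link['host']); // true
```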
I have followed your suggestion (setFollowMode(0)), and in fact the crawler can follow the link. Most likely the link "http://www.hamburg.foo.de/" goes to a different host. Is there a way to solve this situation?
Best regards.
Jorge von Rudno
OK, I think this is a bug: "http://www.hamburg.foo.de" belongs to the same domain as "www.foo.de",
so I think the crawler should definitely follow links like that with follow-mode 1!
I'll give it a test soon.
For a workaround you could simply set the follow-rules yourself without using the follow-mode.
Try something like this:
$crawler->setFollowMode(0);
$crawler->addURLFollowRule("#foo\.de#");
...
This lets the crawler follow every URL that contains the string "foo.de". You could of course refine the rule so that it won't follow URLs like "www.bla.com/dabadafoo.de".
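One way to tighten the substring rule is to anchor the pattern to the URL's host part, so that e.g. a "foo.de" appearing only in the path doesn't trigger it. A standalone sketch using preg_match() and the question's placeholder domain foo.de (the exact pattern is just a suggestion; test it against your real URLs):

```php
<?php
// Sketch: a follow rule anchored to the host part, so "foo.de" must be the
// registrable domain of the URL's host, not just a substring anywhere.
// (The domain "foo.de" is the placeholder from the question.)
$rule = '#^https?://([a-z0-9-]+\.)*foo\.de(/|$)#i';

echo (int) preg_match($rule, 'http://www.foo.de/file.html');     // 1: same domain
echo (int) preg_match($rule, 'http://www.hamburg.foo.de/');      // 1: subdomain
echo (int) preg_match($rule, 'http://www.bla.com/dabadafoo.de'); // 0: only in path
```

The anchored pattern can then be passed to addURLFollowRule() in place of the plain substring rule.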
Last edit: Anonymous 2014-02-24
I have implemented your suggestion and it has solved my problem. From your knowledge and your comments I gather you are part of the phpCrawl development team, so please tell me whether I should report this case as a bug.
Sorry, I forgot to say: thanks a lot for your help.
Regards.
Jorge von Rudno
Hi!
No problem.
I just opened a bug report for this:
https://sourceforge.net/p/phpcrawl/bugs/67/
Hi Jorge,
do you have an actual example (page) of this problem? (for testing)