Hi guys,
I'm writing because I want to crawl huge sites, but I see that it repeats URLs, so I would like to crawl only distinct URLs.
How can I do this?
Thanks for your help!!!!
What have you tried so far, and what exactly were your results? I think that would tell us why you're getting repeated URLs at all.
Hi James,
Thanks for your help.
I'm crawling sites and writing the URLs I get to MySQL, but when I count the stored URLs I see that some of them repeat several times.
In the handleDocumentInfo() method I added this condition:
function handleDocumentInfo($DocInfo)
{
    global $url_procesadas;

    // Only handle URLs that have not been processed yet in this run
    if (in_array($DocInfo->url, $url_procesadas) === false) {
        array_push($url_procesadas, $DocInfo->url);
        // ... do all I need with the url
    }
}
...
$crawler = new MyCrawler();
$crawler->setURL($url);

// Only receive documents with content type text/html
$crawler->addContentTypeReceiveRule("#text/html#");

// Skip links to static assets (note the escaped dot before the extensions)
$crawler->addURLFilterRule("#\.(css|js|jpg|jpeg|gif|png)$# i");

// Keep the URL cache in SQLite instead of memory (useful for huge sites)
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);

$crawler->enableCookieHandling(true);
$crawler->go();
But I'm not sure this is the best solution, because I don't know whether the crawler still makes the request and then just discards the content.
What do you think?
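For reference, duplicates can also be blocked on the MySQL side, independently of the in_array() check above. This is only a sketch under assumed names: the urls table, its layout, and the already-configured PDO connection $pdo are hypothetical and not part of the original post.

// Hypothetical schema (long URLs may need a separate hash column instead):
//   CREATE TABLE urls (url VARCHAR(255) NOT NULL, UNIQUE KEY uniq_url (url));
function storeUrl(PDO $pdo, $url)
{
    // INSERT IGNORE silently skips the row if the URL already exists,
    // so repeated calls with the same URL never create duplicate rows.
    $stmt = $pdo->prepare("INSERT IGNORE INTO urls (url) VALUES (:url)");
    $stmt->execute(array(':url' => $url));

    // rowCount() is 1 when the URL was new, 0 when it was a duplicate.
    return $stmt->rowCount() === 1;
}

With the UNIQUE key in place, a count on the table cannot be inflated by repeated deliveries of the same URL, whatever the crawler does.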
You might want to check that the HTTP status code is OK, but otherwise it looks good.
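A minimal sketch of that check, using PHPCrawl's documented $DocInfo->http_status_code field; the 200-only policy is an illustrative assumption, and the rest is the de-duplication code from the post above.

function handleDocumentInfo($DocInfo)
{
    global $url_procesadas;

    // Skip documents that did not come back with HTTP 200 OK
    if ($DocInfo->http_status_code != 200) {
        return;
    }

    // Keep the existing in-memory de-duplication
    if (in_array($DocInfo->url, $url_procesadas) === false) {
        array_push($url_procesadas, $DocInfo->url);
        // ... do all you need with the url
    }
}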