Hi guys,
I'm writing because I want to crawl huge sites, but I see that it repeats URLs, so I would like to crawl only distinct URLs.
How can I do this?
Thanks for your help!!!!
What have you tried so far, and what exactly were your results? I think that would tell us why you're getting repeated URLs at all.
Hi James,
Thanks for your help.
I'm crawling sites and writing the URLs I get to MySQL, but when I count the stored URLs I see that some of them repeat several times.
In the handleDocumentInfo() method I added this condition:
function handleDocumentInfo($DocInfo)
{
    global $url_procesadas;

    // Only handle URLs that have not been processed yet in this run
    if (in_array($DocInfo->url, $url_procesadas) === false) {
        array_push($url_procesadas, $DocInfo->url);
        // ... do all I need with the url
    }
}
...
$crawler = new MyCrawler();
$crawler->setURL($url);

// Only receive documents with content type text/html
$crawler->addContentTypeReceiveRule("#text/html#");

// Skip links to static assets (note the escaped dot before the extensions)
$crawler->addURLFilterRule("#\.(css|js|jpg|jpeg|gif|png)$# i");

// Keep the URL cache in SQLite instead of memory (useful for huge sites)
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE);

$crawler->enableCookieHandling(true);
$crawler->go();
But I'm not sure this is the best solution, because I don't know whether the crawler still makes the request and then just discards the content.
What do you think?
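For reference, duplicates can also be blocked on the MySQL side, independently of the in_array() check above. This is only a sketch under assumed names: the urls table, its layout, and the already-configured PDO connection $pdo are hypothetical and not part of the original post.

// Hypothetical schema (long URLs may need a separate hash column instead):
//   CREATE TABLE urls (url VARCHAR(255) NOT NULL, UNIQUE KEY uniq_url (url));
function storeUrl(PDO $pdo, $url)
{
    // INSERT IGNORE silently skips the row if the URL already exists,
    // so repeated calls with the same URL never create duplicate rows.
    $stmt = $pdo->prepare("INSERT IGNORE INTO urls (url) VALUES (:url)");
    $stmt->execute(array(':url' => $url));

    // rowCount() is 1 when the URL was new, 0 when it was a duplicate.
    return $stmt->rowCount() === 1;
}

With the UNIQUE key in place, a count on the table cannot be inflated by repeated deliveries of the same URL, whatever the crawler does.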
You might want to check that the HTTP status code is OK, but otherwise it looks good.
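A minimal sketch of that check, using PHPCrawl's documented $DocInfo->http_status_code field; the 200-only policy is an illustrative assumption, and the rest is the de-duplication code from the post above.

function handleDocumentInfo($DocInfo)
{
    global $url_procesadas;

    // Skip documents that did not come back with HTTP 200 OK
    if ($DocInfo->http_status_code != 200) {
        return;
    }

    // Keep the existing in-memory de-duplication
    if (in_array($DocInfo->url, $url_procesadas) === false) {
        array_push($url_procesadas, $DocInfo->url);
        // ... do all you need with the url
    }
}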