Anonymous
Hello!
I have been using your script and noticed a couple of things that would make it more convenient.
1. Crawl time
Around line 248:
<code>
// Additional infos for the override-function handlePageData()
$page_data["protocol"]=$protocol;
$page_data["host"]=$host;
$page_data["path"]=$path;
$page_data["file"]=$file;
$page_data["query"]=$query;
$page_data["url"]=$protocol.$host.$path.$file.$query;
/* Add the time at which the page was crawled */
$page_data["time"]=date('Y/m/d H:i:s',time());
return($page_data);
</code>
You can then use $page_data["time"] in your handlePageData($page_data) function.
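For example, here is a minimal sketch of what could be done with the new field. The helper name formatCrawlLogLine() and the sample values are my own assumptions and not part of the script; the idea is only that handlePageData() could pass its $page_data to something like this.

```php
<?php
// Hypothetical helper (not part of the crawler): builds one log line
// from the $page_data array, using the "url" and "time" keys.
function formatCrawlLogLine(array $page_data): string
{
    // Fall back to a placeholder if the "time" patch has not been applied.
    $time = isset($page_data["time"]) ? $page_data["time"] : "unknown time";
    $url  = isset($page_data["url"]) ? $page_data["url"] : "";
    return "[" . $time . "] " . $url;
}

// Sample data shaped like the array built around line 248 (values are made up).
$page_data = array(
    "url"  => "http://www.example.com/index.php?id=1",
    "time" => "2010/01/31 12:00:00",
);

echo formatCrawlLogLine($page_data), "\n";
```

Inside handlePageData($page_data) you would call the helper with the array the crawler passes in, instead of the sample array used here.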
2. Sleep function
To avoid being mistaken for a DoS attack, it is better to call the sleep function between requests.
Google's bot, for example, reportedly sends about one request per second.
Around line 390:
<code>
if ($content_found==false && $rd[0]!="" && $this->follow_redirects_till_content==true) {
PHPCrawlerUtils::addToArray($rd, $this->urls_to_crawl, $this->urls_to_crawl[$key], $this->referers_to_urls_to_crawl);
}
/* Sleep for 1 second before the next request */
sleep(1);
} // end of main-loop
</code>
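A possible variation on the fixed sleep(1), sketched under my own assumptions: wait only for whatever is left of the interval, so that a slow page does not add a full extra second. The helper remainingDelayMicros(), the $minInterval variable and the URL list are hypothetical and only stand in for the crawler's main loop.

```php
<?php
// Hypothetical helper: microseconds still to wait so that consecutive
// requests start at least $minInterval seconds apart.
function remainingDelayMicros(float $lastRequestAt, float $now, float $minInterval): int
{
    $elapsed   = $now - $lastRequestAt;
    $remaining = $minInterval - $elapsed;
    return $remaining > 0 ? (int) round($remaining * 1000000) : 0;
}

// Made-up list of URLs standing in for the crawler's main loop.
$urls          = array("http://www.example.com/a", "http://www.example.com/b");
$minInterval   = 1.0; // seconds between request starts (assumed value)
$lastRequestAt = 0.0; // no request has been made yet

foreach ($urls as $url) {
    $wait = remainingDelayMicros($lastRequestAt, microtime(true), $minInterval);
    if ($wait > 0) {
        usleep($wait); // pause only for the remaining part of the interval
    }
    $lastRequestAt = microtime(true);
    // ... the page would be requested here ...
    echo "requesting ", $url, "\n";
}
```

The first request goes out immediately and each later one waits until a second has passed since the previous request started. A plain sleep(1) as in the snippet above is simpler and is probably enough for most crawls.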
Best regards!!