I am looking to have example.php also output a file "sitemap.xml" each time the website is crawled, and if possible just overwrite it on each crawl.
It would be fine for the file simply to be written to the server/FTP directory where the script resides, or to the root.
Has anybody done this already? Or could anybody assist? I am trying to have example.php do this.
Any assistance would be much appreciated!
Hi again,
ok, here's a quick example of a sitemap-generator.
It may need some further tweaks (proper URL escaping and so on), but it should work.
<?php
// Include the phpcrawl main class
include("libs/PHPCrawler.class.php");

class SitemapGenerator extends PHPCrawler
{
  protected $sitemap_output_file;

  public function setSitemapOutputFile($file)
  {
    $this->sitemap_output_file = $file;

    if (file_exists($this->sitemap_output_file))
      unlink($this->sitemap_output_file);

    file_put_contents($this->sitemap_output_file,
                      "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n".
                      "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\r\n",
                      FILE_APPEND);
  }

  public function handleDocumentInfo($DocInfo)
  {
    // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>").
    if (PHP_SAPI == "cli") $lb = "\n";
    else $lb = "<br />";

    echo "Adding ".$DocInfo->url." to sitemap file".$lb;

    file_put_contents($this->sitemap_output_file,
                      "  <url>\r\n".
                      "    <loc>".$DocInfo->url."</loc>\r\n".
                      "  </url>\r\n",
                      FILE_APPEND);
    flush();
  }

  public function closeFile()
  {
    file_put_contents($this->sitemap_output_file, '</urlset>', FILE_APPEND);
  }
}

$crawler = new SitemapGenerator();
$crawler->setSitemapOutputFile("sitemap.xml"); // Set output-file
$crawler->setURL("www.php.net");
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
// ... apply all other options and rules to the crawler

$crawler->setPageLimit(10); // Just for testing

$crawler->goMultiProcessed(5); // Or use go() if you don't want multiple processes
$crawler->closeFile();
?>
If you have any questions just let me know.
thank you! What do you mean by "proper URL escaping"? Also, I will make all these additions available to the forum as much as possible, as I am sure this will help many others. And thank you!
Hi again!
Just take a look at the specifications for sitemap files over at http://www.sitemaps.org/protocol.html: URLs have to be entity-escaped (and there is some other stuff too, I think).
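For example, a minimal sketch of what that escaping could look like in PHP (the helper name and the example URL are just illustrations, not part of phpcrawl):

<?php
// Hypothetical helper: entity-escape a URL before writing it into <loc>.
// The sitemap protocol requires &, ', ", < and > to be escaped.
function escape_sitemap_url($url)
{
  // ENT_QUOTES also covers single quotes in addition to &, <, > and ".
  return htmlspecialchars($url, ENT_QUOTES);
}

echo escape_sitemap_url("http://www.example.com/page?a=1&b=2");
// Prints: http://www.example.com/page?a=1&amp;b=2
?>

In the SitemapGenerator above, you would wrap $DocInfo->url with a helper like this when building the <loc> line.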
I am still having an issue, as the example did not seem to work for me. This is the code I have that is working; I am just trying to add the part that puts the sitemap info into the sitemap.xml file. Right now it just creates the file.
<?php
// It may take a while to crawl a site ...
set_time_limit(10000);

// Include the phpcrawl main class
include("libs/PHPCrawler.class.php");

// Extend the class and override the handleDocumentInfo()-method
class MyCrawler extends PHPCrawler
{
  function handleDocumentInfo($DocInfo)
  {
    // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>").
    if (PHP_SAPI == "cli") $lb = "\n";
    else $lb = "<br />";

    // Print the URL and the HTTP-status-code
    echo "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")".$lb;

    // Print the referring URL
    echo "Referer-page: ".$DocInfo->referer_url.$lb;

    // Print if the content of the document was received or not
    if ($DocInfo->received == true)
      echo "Content received: ".$DocInfo->bytes_received." bytes".$lb;
    else
      echo "Content not received".$lb;

    // Now you should do something with the content of the actual
    // received page or file ($DocInfo->source), we skip it in this example
    echo $lb;
    flush();
  }
}

// Now, create an instance of your class, define the behaviour
// of the crawler (see class-reference for more options and details)
// and start the crawling-process.
$crawler = new MyCrawler();

// URL to crawl
$crawler->setURL("www.php.net");

// Only receive content of files with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->addStreamToFileContentType("#^((?!text/html).)*$#");

// Ignore links to pictures, don't even request pictures
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");

// Ignore links with '?' in them
$crawler->addNonFollowMatch("/\?/");

// Store and send cookie-data like a browser does
$crawler->enableCookieHandling(true);

// Set the traffic-limit to 1 MB (in bytes,
// for testing we don't want to "suck" the whole site)
$crawler->setTrafficLimit(1000 * 1024);

// That's enough, now here we go
$crawler->go();

// At the end, after the process is finished, we print a short
// report (see method getProcessReport() for more information)
$report = $crawler->getProcessReport();

if (PHP_SAPI == "cli") $lb = "\n";
else $lb = "<br />";

echo "Summary:".$lb;
echo "Links followed: ".$report->links_followed.$lb;
echo "Documents received: ".$report->files_received.$lb;
echo "Bytes received: ".$report->bytes_received." bytes".$lb;
echo "Process runtime: ".$report->process_runtime." sec".$lb;

// Note: this part is what isn't working yet. $DocInfo only exists inside
// handleDocumentInfo(), so it is undefined here and nothing useful is written.
$var = "This is a test";
$var2 = "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")";
$fh = fopen("sitemap.xml", "w+");
fwrite($fh, $var2);
fclose($fh);
?>
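One likely reason the sitemap part writes nothing: $DocInfo only exists inside handleDocumentInfo(). A minimal sketch of how the fopen/fwrite approach could be moved in there (assuming the MyCrawler class above, writing to the same sitemap.xml):

<?php
// Sketch only: inside MyCrawler's handleDocumentInfo($DocInfo) method,
// where $DocInfo is actually in scope, append one <url> entry per page.
function handleDocumentInfo($DocInfo)
{
  // ... the existing echo/report code ...

  // "a" opens sitemap.xml in append mode so entries accumulate.
  $fh = fopen("sitemap.xml", "a");
  fwrite($fh, "  <url>\r\n    <loc>".$DocInfo->url."</loc>\r\n  </url>\r\n");
  fclose($fh);
}
?>

The XML header and the closing </urlset> still have to be written once before and once after the crawl, as in the SitemapGenerator example above.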
Hi,
my provided code just had one ";" too many, this should work:
<?php
// Include the phpcrawl main class
include("libs/PHPCrawler.class.php");

class SitemapGenerator extends PHPCrawler
{
  protected $sitemap_output_file;

  public function setSitemapOutputFile($file)
  {
    $this->sitemap_output_file = $file;

    if (file_exists($this->sitemap_output_file))
      unlink($this->sitemap_output_file);

    file_put_contents($this->sitemap_output_file,
                      "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n".
                      "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\r\n",
                      FILE_APPEND);
  }

  public function handleDocumentInfo($DocInfo)
  {
    // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>").
    if (PHP_SAPI == "cli") $lb = "\n";
    else $lb = "<br />";

    echo "Adding ".$DocInfo->url." to sitemap file".$lb;

    file_put_contents($this->sitemap_output_file,
                      "  <url>\r\n".
                      "    <loc>".$DocInfo->url."</loc>\r\n".
                      "  </url>\r\n",
                      FILE_APPEND);
    flush();
  }

  public function closeFile()
  {
    file_put_contents($this->sitemap_output_file, '</urlset>', FILE_APPEND);
  }
}

$crawler = new SitemapGenerator();
$crawler->setSitemapOutputFile("sitemap.xml"); // Set output-file
$crawler->setURL("www.php.net");
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
// ... apply all other options and rules to the crawler

$crawler->setPageLimit(10); // Just for testing

$crawler->goMultiProcessed(5); // Or use go() if you don't want multiple processes
$crawler->closeFile();
?>
Hi, yes, exactly like that! Thank you very much… I've been trying all day and still no luck, and am quickly approaching crunch time.
Your assistance would be greatly appreciated.
Hi,
it's just not possible to paste code here without destroying parts of it (it always adds a ";" somewhere).
So here it goes again:
http://pastebin.com/EQDGtGA1
Should work correctly.
Best regards!
So I see that it creates "sitemap.xml"; however, it only puts in the XML header and nothing for the sitemapped URLs themselves…
Any idea? Thank you so much for your assistance…
Got it going now…is there a way to filter a URL by keyword or anything like that?
Hey,
did you read the docs at all? ;)
http://phpcrawl.cuab.de/classreferences/PHPCrawler/method_detail_tpl_method_addURLFollowRule.htm
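For instance, a short sketch based on that class reference (it assumes the SitemapGenerator class from earlier; the keyword "products" and the host are placeholders): addURLFollowRule() takes a Perl-style regular expression, and only URLs matching it will be followed:

<?php
include("libs/PHPCrawler.class.php");

$crawler = new SitemapGenerator();
$crawler->setSitemapOutputFile("sitemap.xml");
$crawler->setURL("www.example.com");

// Only follow URLs containing the keyword "products";
// the trailing " i" makes the match case-insensitive.
$crawler->addURLFollowRule("#products# i");

$crawler->go();
$crawler->closeFile();
?>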
Yes, got it all going. Just wondering now if there is a way to do it via fopen/fwrite/fclose instead of file_put_contents, since that is more optimal?
Yes, of course you can use fopen and fwrite to fill your sitemap file if you want to, but it doesn't matter which you use; file_put_contents does it just as well. Why should fopen/fwrite be more optimal?
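For what it's worth, the PHP manual describes file_put_contents() as identical to calling fopen(), fwrite() and fclose() successively, so the two variants below do the same work (the file name and entry are illustrative):

<?php
$entry = "  <url>\r\n    <loc>http://www.example.com/</loc>\r\n  </url>\r\n";

// Variant 1: one call, appending to the file.
file_put_contents("sitemap.xml", $entry, FILE_APPEND);

// Variant 2: the explicit fopen/fwrite/fclose equivalent ("a" = append mode).
$fh = fopen("sitemap.xml", "a");
fwrite($fh, $entry);
fclose($fh);
?>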