I am trying to have example.php also output a file "sitemap.xml" each time the website is crawled, and, if possible, just overwrite it on each crawl.
Writing the file to the directory where the script resides, or to the document root, is fine.
Has anybody done this already, or could anybody assist? I am trying to have example.php do this.
Any assistance would be much appreciated!
OK, here's a quick example of a sitemap generator.
It may need some further tweaks (proper URL escaping and so on), but it should work.
<?php
// Include the phpcrawl main class
include("libs/PHPCrawler.class.php");

class SitemapGenerator extends PHPCrawler
{
  protected $sitemap_output_file;

  public function setSitemapOutputFile($file)
  {
    $this->sitemap_output_file = $file;

    if (file_exists($this->sitemap_output_file))
      unlink($this->sitemap_output_file);

    file_put_contents($this->sitemap_output_file,
                      "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n".
                      "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\r\n",
                      FILE_APPEND);
  }

  public function handleDocumentInfo($DocInfo)
  {
    // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>").
    if (PHP_SAPI == "cli") $lb = "\n";
    else $lb = "<br />";

    echo "Adding ".$DocInfo->url." to sitemap file".$lb;

    file_put_contents($this->sitemap_output_file,
                      "  <url>\r\n".
                      "    <loc>".$DocInfo->url."</loc>\r\n".
                      "  </url>\r\n",
                      FILE_APPEND);
    flush();
  }

  public function closeFile()
  {
    file_put_contents($this->sitemap_output_file, '</urlset>', FILE_APPEND);
  }
}

$crawler = new SitemapGenerator();
$crawler->setSitemapOutputFile("sitemap.xml"); // Set output-file
$crawler->setURL("www.php.net");
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
// ... apply all other options and rules to the crawler
$crawler->setPageLimit(10); // Just for testing
$crawler->goMultiProcessed(5); // Or use go() if you don't want multiple processes
$crawler->closeFile();
?>
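For reference, the file the class above writes takes this shape (the URLs here are just illustrative examples of what a crawl of www.php.net might produce):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.php.net/</loc>
  </url>
  <url>
    <loc>http://www.php.net/docs.php</loc>
  </url>
</urlset>
```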
If you have any questions just let me know.
Thank you! What do you mean by "proper URL escaping"? Also, I will make all these additions available to the forum as much as possible, as I am sure this will help many others. And thank you!
Just take a look at the specification of sitemap files over at http://www.sitemaps.org/protocol.html: URLs have to be entity-escaped (and some other things, I think).
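To make that concrete, here is a minimal sketch of such escaping using PHP's built-in htmlspecialchars() (the helper name escapeSitemapUrl is made up for this example; ENT_XML1 requires PHP 5.4+):

```php
<?php
// Entity-escape a URL for use inside a sitemap <loc> element.
// htmlspecialchars() converts &, <, > and (with ENT_QUOTES) both quote
// characters, as the sitemaps.org protocol requires.
function escapeSitemapUrl($url)
{
    return htmlspecialchars($url, ENT_QUOTES | ENT_XML1);
}

echo escapeSitemapUrl("http://www.example.com/view?cat=books&title=O'Reilly");
// http://www.example.com/view?cat=books&amp;title=O&apos;Reilly
?>
```

In the generator above you would apply this to $DocInfo->url before writing the <loc> line.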
I am still having an issue, as the example did not seem to work for me. This is the code I have that is working; I am just trying to add the part that puts the sitemap info into the sitemap.xml file. Right now it just creates the file.
<?php
// It may take a while to crawl a site...
set_time_limit(10000);

// Include the phpcrawl main class
include("libs/PHPCrawler.class.php");

// Extend the class and override the handleDocumentInfo()-method
class MyCrawler extends PHPCrawler
{
  function handleDocumentInfo($DocInfo)
  {
    // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>").
    if (PHP_SAPI == "cli") $lb = "\n";
    else $lb = "<br />";

    // Print the URL and the HTTP status code
    echo "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")".$lb;

    // Print the referring URL
    echo "Referer-page: ".$DocInfo->referer_url.$lb;

    // Print whether the content of the document was received or not
    if ($DocInfo->received == true)
      echo "Content received: ".$DocInfo->bytes_received." bytes".$lb;
    else
      echo "Content not received".$lb;

    // Now you should do something with the content of the actual
    // received page or file ($DocInfo->source), we skip it in this example
    echo $lb;
    flush();
  }
}

// Now, create an instance of your class, define the behaviour
// of the crawler (see class reference for more options and details)
// and start the crawling process.
$crawler = new MyCrawler();

// URL to crawl
$crawler->setURL("www.php.net");

// Only receive content of files with content-type "text/html"
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->addStreamToFileContentType("#^((?!text/html).)*$#");

// Ignore links to pictures, don't even request pictures
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");

// Ignore links with '?' after
$crawler->addNonFollowMatch("/\?/");

// Store and send cookie-data like a browser does
$crawler->enableCookieHandling(true);

// Set the traffic-limit to 1 MB (in bytes,
// for testing we don't want to "suck" the whole site)
$crawler->setTrafficLimit(1000 * 1024);

// That's enough, now here we go
$crawler->go();

// At the end, after the process is finished, we print a short
// report (see method getProcessReport() for more information)
$report = $crawler->getProcessReport();

if (PHP_SAPI == "cli") $lb = "\n";
else $lb = "<br />";

echo "Summary:".$lb;
echo "Links followed: ".$report->links_followed.$lb;
echo "Documents received: ".$report->files_received.$lb;
echo "Bytes received: ".$report->bytes_received." bytes".$lb;
echo "Process runtime: ".$report->process_runtime." sec".$lb;

$var = "This is a test";
$var2 = "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")";
$fh = fopen("sitemap.xml", "w+");
fwrite($fh, $var2);
fclose($fh);
?>
my provided code just had one ";" too many, this should work:
<?php
// Include the phpcrawl main class
include("libs/PHPCrawler.class.php");

class SitemapGenerator extends PHPCrawler
{
  protected $sitemap_output_file;

  public function setSitemapOutputFile($file)
  {
    $this->sitemap_output_file = $file;

    if (file_exists($this->sitemap_output_file))
      unlink($this->sitemap_output_file);

    file_put_contents($this->sitemap_output_file,
                      "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n".
                      "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\r\n",
                      FILE_APPEND);
  }

  public function handleDocumentInfo($DocInfo)
  {
    // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>").
    if (PHP_SAPI == "cli") $lb = "\n";
    else $lb = "<br />";

    echo "Adding ".$DocInfo->url." to sitemap file".$lb;

    file_put_contents($this->sitemap_output_file,
                      "  <url>\r\n".
                      "    <loc>".$DocInfo->url."</loc>\r\n".
                      "  </url>\r\n",
                      FILE_APPEND);
    flush();
  }

  public function closeFile()
  {
    file_put_contents($this->sitemap_output_file, '</urlset>', FILE_APPEND);
  }
}

$crawler = new SitemapGenerator();
$crawler->setSitemapOutputFile("sitemap.xml"); // Set output-file
$crawler->setURL("www.php.net");
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
// ... apply all other options and rules to the crawler
$crawler->setPageLimit(10); // Just for testing
$crawler->goMultiProcessed(5); // Or use go() if you don't want multiple processes
$crawler->closeFile();
?>
Hi, yes, exactly like that! Thank you very much. I've been trying all day and still no luck, and I am quickly approaching crunch time.
Your assistance would be greatly appreciated.
Hi,
it's just not possible to paste code here without destroying parts of it (it always adds a ";" somewhere),
so here it goes again:
http://pastebin.com/EQDGtGA1
Should work correctly.
Best regards!
So I see that it creates "sitemap.xml"; however, it only writes the XML header and nothing for the actual sitemapped URLs.
Any idea? Thank you so much for your assistance.
Got it going now. Is there a way to filter a URL by keyword or anything like that?
Hey,
did you read the docs at all? ;)
http://phpcrawl.cuab.de/classreferences/PHPCrawler/method_detail_tpl_method_addURLFollowRule.htm
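For anyone else reading along: phpcrawl's follow/filter rules are plain PCRE patterns, so a keyword filter is just a regex. A small sketch (the crawler calls are shown as comments since they assume the PHPCrawler setup from the examples above; "products" is a made-up keyword):

```php
<?php
// Follow only URLs containing the keyword "products" (case-insensitive).
// The same PCRE syntax is used by addURLFilterRule() to *exclude* URLs instead.
$keyword_rule = "#products#i";

// With phpcrawl this would be applied as:
//   $crawler->addURLFollowRule($keyword_rule);  // only follow matching URLs
//   $crawler->addURLFilterRule("#login#i");     // or: skip URLs matching this

// The rule itself is ordinary preg_match() matching:
var_dump((bool) preg_match($keyword_rule, "http://example.com/products/42")); // bool(true)
var_dump((bool) preg_match($keyword_rule, "http://example.com/about"));       // bool(false)
?>
```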
Yes, got it all going. Just wondering now if there is a way to do it via fopen/fwrite/fclose vs. file_put_contents, since it is more optimal?
Yes, of course you can use fopen and fwrite to fill your sitemap file if you want to, but it doesn't matter which you use; file_put_contents does it just as well. Why should fopen/fwrite be more optimal?
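For comparison, file_put_contents() with the FILE_APPEND flag is essentially an fopen()/fwrite()/fclose() sequence in a single call. A minimal sketch using a temp file rather than a real sitemap:

```php
<?php
$file = tempnam(sys_get_temp_dir(), "sitemap_");

// One call: open, append, close.
file_put_contents($file, "  <url><loc>http://example.com/a</loc></url>\n", FILE_APPEND);

// Equivalent explicit sequence in append mode:
$fh = fopen($file, "a");
fwrite($fh, "  <url><loc>http://example.com/b</loc></url>\n");
fclose($fh);

// Both entries end up in the file, in order.
echo file_get_contents($file);

unlink($file);
?>
```

The one-call form saves you from tracking the file handle yourself, which is why the generator above uses it.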