example.php outputting to "sitemap...

2012-10-18
2013-04-09
  • Nobody/Anonymous

    I am trying to have example.php also output a file "sitemap.xml" each time the website is crawled, and if possible overwrite the previous file on each run.

    It is fine for the file to just be written to the server/FTP directory where the script resides, or to the root…

    Has anybody done this already, or could anybody assist? I am trying to have example.php do this.

    Any assistance would be much appreciated!

     
  • Nobody/Anonymous

    Hi, yes exactly like that! Thank you very much…I've been trying all day with no luck, and am quickly approaching crunch time.

    Your assistance would be greatly appreciated.

     
  • Nobody/Anonymous

    Hi again,

    OK, here's a quick example of a sitemap generator.
    It may need some further tweaks (proper URL escaping and so on), but it should work.

    <?php
    // Include the phpcrawl main class
    include("libs/PHPCrawler.class.php");
    class SitemapGenerator extends PHPCrawler 
    {
      protected $sitemap_output_file;
      
      public function setSitemapOutputFile($file)
      {
        $this->sitemap_output_file = $file;
        
        if (file_exists($this->sitemap_output_file)) unlink($this->sitemap_output_file);
        
        file_put_contents($this->sitemap_output_file,
                          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n".
                          "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\r\n",
                          FILE_APPEND);
      }
      
      public function handleDocumentInfo($DocInfo) 
      {
        // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>").
        if (PHP_SAPI == "cli") $lb = "\n";
        else $lb = "<br />";
        echo "Adding ".$DocInfo->url." to sitemap file".$lb;
        
        file_put_contents($this->sitemap_output_file,
                          " <url>\r\n".
                          "  <loc>".$DocInfo->url."</loc>\r\n".
                          " </url>\r\n",
                          FILE_APPEND);
        flush();
      }
      
      public function closeFile()
      {
        file_put_contents($this->sitemap_output_file, '</urlset>', FILE_APPEND);
      }
    }
    
    $crawler = new SitemapGenerator();
    $crawler->setSitemapOutputFile("sitemap.xml"); // Set output-file
    $crawler->setURL("www.php.net");
    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
    // ... apply all other options and rules to the crawler
    $crawler->setPageLimit(10); // Just for testing
    $crawler->goMultiProcessed(5); // Or use go() if you don't want multiple processes
    $crawler->closeFile();
    ?>
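
    One caveat: goMultiProcessed() starts several crawler processes that all append to the same file. Small FILE_APPEND writes are usually atomic on local filesystems, but to be safe you could additionally pass the LOCK_EX flag (just a defensive tweak; $xml_fragment below is a placeholder for whichever XML snippet is being written):

    file_put_contents($this->sitemap_output_file, $xml_fragment, FILE_APPEND | LOCK_EX);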
    

    If you have any questions just let me know.

     
  • Nobody/Anonymous

    Thank you!…What do you mean by "proper URL escaping"? Also, I will make all these additions available to the forum as much as possible, as I am sure this will help many others. Thank you!

     
  • Nobody/Anonymous

    Hi again!

    Just take a look at the sitemap file specification over at http://www.sitemaps.org/protocol.html: URLs have to be entity-escaped (and there are some other requirements, I think).
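
    For example, a minimal escaping sketch (the helper name escapeSitemapUrl is made up for illustration; htmlspecialchars() with ENT_QUOTES covers the five entities the protocol lists):

    <?php
    // Entity-escape a URL for use inside a <loc> element
    // (escapes & < > " ' as the sitemap protocol requires).
    function escapeSitemapUrl($url)
    {
      return htmlspecialchars($url, ENT_QUOTES, "UTF-8");
    }
    ?>

    In handleDocumentInfo() you would then write "  <loc>".escapeSitemapUrl($DocInfo->url)."</loc>\r\n" instead of the raw URL.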

     
  • Nobody/Anonymous

    I am still having an issue, as the example did not seem to work for me. This is the code I have that is working; I am just trying to add the part that puts the sitemap info into the sitemap.xml file. Right now it just creates the file.

    <?php
    // It may take a while to crawl a site ...
    set_time_limit(10000);
    // Include the phpcrawl main class
    include("libs/PHPCrawler.class.php");
    // Extend the class and override the handleDocumentInfo()-method 
    class MyCrawler extends PHPCrawler 
    {
      function handleDocumentInfo($DocInfo) 
      {
        // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>").
        if (PHP_SAPI == "cli") $lb = "\n";
        else $lb = "<br />";
        // Print the URL and the HTTP-status-Code
        echo "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")".$lb;
        
        // Print the refering URL
        echo "Referer-page: ".$DocInfo->referer_url.$lb;
        
        // Print whether the content of the document was received or not
        if ($DocInfo->received == true)
          echo "Content received: ".$DocInfo->bytes_received." bytes".$lb;
        else
          echo "Content not received".$lb; 
        
        // Now you should do something with the content of the actual
        // received page or file ($DocInfo->source), we skip it in this example 
        
        echo $lb;
        
        flush();
      } 
    }
    // Now, create a instance of your class, define the behaviour
    // of the crawler (see class-reference for more options and details)
    // and start the crawling-process. 
    $crawler = new MyCrawler();
    // URL to crawl
    $crawler->setURL("www.php.net");
    // Only receive content of files with content-type "text/html"
    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->addStreamToFileContentType("#^((?!text/html).)*$#");
    // Ignore links to pictures, dont even request pictures
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
    // Ignore links with '?' after
    $crawler->addNonFollowMatch("/\?/");
    // Store and send cookie-data like a browser does
    $crawler->enableCookieHandling(true);
    // Set the traffic-limit to 1 MB (in bytes,
    // for testing we dont want to "suck" the whole site)
    $crawler->setTrafficLimit(1000 * 1024);
    // Thats enough, now here we go
    $crawler->go();
    // At the end, after the process is finished, we print a short
    // report (see method getProcessReport() for more information)
    $report = $crawler->getProcessReport();
    if (PHP_SAPI == "cli") $lb = "\n";
    else $lb = "<br />";
        
    echo "Summary:".$lb;
    echo "Links followed: ".$report->links_followed.$lb;
    echo "Documents received: ".$report->files_received.$lb;
    echo "Bytes received: ".$report->bytes_received." bytes".$lb;
    echo "Process runtime: ".$report->process_runtime." sec".$lb;
    $var = "This is a test";
    // Note: $DocInfo only exists inside handleDocumentInfo(), so it is
    // undefined here in the global scope - this line won't print a real URL.
    $var2 = "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")";
    $fh = fopen("sitemap.xml", "w+");
    fwrite($fh, $var2);
    fclose($fh);
    ?>
    
     
  • Uwe Hunfeld - 2012-10-22

    Hi,

    my provided code just had one ";" too many; this should work:

    <?php
    // Include the phpcrawl main class
    include("libs/PHPCrawler.class.php");
    class SitemapGenerator extends PHPCrawler
    {
      protected $sitemap_output_file;
      
      public function setSitemapOutputFile($file)
      {
        $this->sitemap_output_file = $file;
        
        if (file_exists($this->sitemap_output_file)) unlink($this->sitemap_output_file);
        
        file_put_contents($this->sitemap_output_file, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n".
                                                      "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\r\n", FILE_APPEND);
      }
      
      public function handleDocumentInfo($DocInfo)
      {
        // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>").
        if (PHP_SAPI == "cli") $lb = "\n";
        else $lb = "<br />";
        
        echo "Adding ".$DocInfo->url." to sitemap file".$lb;
        
        file_put_contents($this->sitemap_output_file, " <url>\r\n".
                                                      "  <loc>".$DocInfo->url."</loc>\r\n".
                                                      " </url>\r\n", FILE_APPEND);
        
        flush();
      }
      
      public function closeFile()
      {
        file_put_contents($this->sitemap_output_file, '</urlset>', FILE_APPEND);
      } 
    }
    $crawler = new SitemapGenerator();
    $crawler->setSitemapOutputFile("sitemap.xml"); // Set output-file
    $crawler->setURL("www.php.net");
    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i"); 
    // ... apply all other options and rules to the crawler
    $crawler->setPageLimit(10); // Just for testing
    $crawler->goMultiProcessed(5); // Or use go() if you don't want multiple processes
    $crawler->closeFile();
    ?>
    
     
  • Uwe Hunfeld - 2012-10-22

    Hi,

    It's just not possible to paste code here without destroying parts of it (the forum always adds a ";" somewhere),
    so here it goes again:

    http://pastebin.com/EQDGtGA1

    Should work correctly.

    Best regards!

     
  • Nobody/Anonymous

    So I see that it creates "sitemap.xml"…however, it is only putting in the XML header and not any of the sitemapped URLs themselves….

    Any idea?…thank you so much for your assistance…

     
  • Nobody/Anonymous

    Got it going now…is there a way to filter URLs by keyword or anything like that?
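
    (For reference: addURLFilterRule() and addNonFollowMatch() take regular expressions, so a keyword filter is just a pattern. A made-up example, with "keyword" standing in for whatever string should be excluded:

    <?php
    // Don't follow any link whose URL contains "keyword" (case-insensitive).
    $crawler->addNonFollowMatch("#keyword# i");
    ?>
    )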

     
  • Nobody/Anonymous

    Yes, got it all going. Just wondering now if there is a way to do it via "fopen/fwrite/fclose" instead of file_put_contents…since that would be more optimal?

     
  • Nobody/Anonymous

    Yes, of course you can use fopen and fwrite to fill your sitemap file if you want to, but it doesn't really matter which you use; file_put_contents does the job as well. Why should fopen/fwrite be more optimal?
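
    If you prefer the handle-based version anyway, a sketch of the same class using fopen/fwrite could look like this (keeping the handle open in a property instead of reopening the file on every write; the property name $fh is made up here):

    <?php
    // Include the phpcrawl main class
    include("libs/PHPCrawler.class.php");

    class SitemapGenerator extends PHPCrawler
    {
      protected $fh;

      public function setSitemapOutputFile($file)
      {
        // Mode "w" truncates the file, so an old sitemap gets overwritten.
        $this->fh = fopen($file, "w");
        fwrite($this->fh, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n".
                          "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\r\n");
      }

      public function handleDocumentInfo($DocInfo)
      {
        fwrite($this->fh, " <url>\r\n  <loc>".$DocInfo->url."</loc>\r\n </url>\r\n");
      }

      public function closeFile()
      {
        fwrite($this->fh, "</urlset>");
        fclose($this->fh);
      }
    }
    ?>

    Note that with goMultiProcessed() every child process gets its own copy of the object, so a single shared file handle may not work as expected there; with go() it is fine.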

     
