example.php outputting to "sitemap...

2012-10-18
2013-04-09
  • Nobody/Anonymous

    I am looking to have example.php also output a file "sitemap.xml" each time the website is crawled, and if possible just have it overwrite itself on each crawl.

    Outputting the file to the server/FTP directory where the script resides, or to the root, would be fine.

    Has anybody done this already? Or could anybody assist? I am trying to have example.php do this.

    Any assistance would be much appreciated!

     
  • Nobody/Anonymous

    Hi, yes, exactly like that! Thank you very much. I've been trying all day and still no luck, and am quickly approaching crunch time.

    Your assistance would be greatly appreciated.

     
  • Nobody/Anonymous

    Hi again,

    OK, here's a quick example of a sitemap generator.
    It may need some further tweaks (proper URL escaping and so on), but it should work.

    <?php
    // Include the phpcrawl main class
    include("libs/PHPCrawler.class.php");
    class SitemapGenerator extends PHPCrawler 
    {
      protected $sitemap_output_file;
      
      public function setSitemapOutputFile($file)
      {
        $this->sitemap_output_file = $file;
        
        if (file_exists($this->sitemap_output_file)) unlink($this->sitemap_output_file);
        
        file_put_contents($this->sitemap_output_file,
                          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n".
                          "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\r\n",
                          FILE_APPEND);
      }
      
      public function handleDocumentInfo($DocInfo) 
      {
        // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>").
        if (PHP_SAPI == "cli") $lb = "\n";
        else $lb = "<br />";
        echo "Adding ".$DocInfo->url." to sitemap file".$lb;
        
        file_put_contents($this->sitemap_output_file,
                          " <url>\r\n".
                          "  <loc>".$DocInfo->url."</loc>\r\n".
                          " </url>\r\n",
                          FILE_APPEND);
        flush();
      }
      
      public function closeFile()
      {
        file_put_contents($this->sitemap_output_file, '</urlset>', FILE_APPEND);
      }
    }
    
    $crawler = new SitemapGenerator();
    $crawler->setSitemapOutputFile("sitemap.xml"); // Set output-file
    $crawler->setURL("www.php.net");
    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
    // ... apply all other options and rules to the crawler
    $crawler->setPageLimit(10); // Just for testing
    $crawler->goMultiProcessed(5); // Or use go() if you don't want multiple processes
    $crawler->closeFile();
    ?>
    

    If you have any questions just let me know.

     
  • Nobody/Anonymous

    Thank you! What do you mean by "proper URL escaping"? Also, I will make all these additions available to the forum as much as possible, as I am sure this will help many others. And thank you!

     
  • Nobody/Anonymous

    Hi again!

    Just take a look at the specification of sitemap files over at http://www.sitemaps.org/protocol.html: URLs have to be entity-escaped (and there's some other stuff too, I think).
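
    Concretely, the protocol says to entity-escape the characters &, ', ", > and < in URLs. Here is a minimal sketch of how that could look in PHP (the helper name escape_sitemap_url is just an example, not part of PHPCrawl):

    <?php
    // Entity-escape a URL for use inside a <loc> element.
    // htmlspecialchars() with ENT_QUOTES covers exactly the five
    // characters the sitemaps.org protocol lists: & ' " > <
    function escape_sitemap_url($url)
    {
      return htmlspecialchars($url, ENT_QUOTES);
    }

    echo escape_sitemap_url("http://www.example.com/view?id=1&lang=en");
    // Prints: http://www.example.com/view?id=1&amp;lang=en
    ?>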

     
  • Nobody/Anonymous

    I am still having an issue, as the example did not seem to work for me. This is the code I have that is working; I am just trying to add the part that puts the sitemap info into the sitemap.xml file. Right now it just creates the file.

    <?php
    // It may take a while to crawl a site ...
    set_time_limit(10000);
    // Include the phpcrawl main class
    include("libs/PHPCrawler.class.php");
    // Extend the class and override the handleDocumentInfo()-method 
    class MyCrawler extends PHPCrawler 
    {
      function handleDocumentInfo($DocInfo) 
      {
        // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>").
        if (PHP_SAPI == "cli") $lb = "\n";
        else $lb = "<br />";
        // Print the URL and the HTTP-status-Code
        echo "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")".$lb;
        
    // Print the referring URL
        echo "Referer-page: ".$DocInfo->referer_url.$lb;
        
    // Print whether the content of the document was received or not
        if ($DocInfo->received == true)
          echo "Content received: ".$DocInfo->bytes_received." bytes".$lb;
        else
          echo "Content not received".$lb; 
        
        // Now you should do something with the content of the actual
        // received page or file ($DocInfo->source), we skip it in this example 
        
        echo $lb;
        
        flush();
      } 
    }
    // Now, create an instance of your class, define the behaviour
    // of the crawler (see class-reference for more options and details)
    // and start the crawling-process. 
    $crawler = new MyCrawler();
    // URL to crawl
    $crawler->setURL("www.php.net");
    // Only receive content of files with content-type "text/html"
    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->addStreamToFileContentType("#^((?!text/html).)*$#");
    // Ignore links to pictures, don't even request pictures
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
    // Ignore links containing a '?'
    $crawler->addNonFollowMatch("/\?/");
    // Store and send cookie-data like a browser does
    $crawler->enableCookieHandling(true);
    // Set the traffic-limit to 1 MB (in bytes,
    // for testing we don't want to "suck" the whole site)
    $crawler->setTrafficLimit(1000 * 1024);
    // That's enough, now here we go
    $crawler->go();
    // At the end, after the process is finished, we print a short
    // report (see method getProcessReport() for more information)
    $report = $crawler->getProcessReport();
    if (PHP_SAPI == "cli") $lb = "\n";
    else $lb = "<br />";
        
    echo "Summary:".$lb;
    echo "Links followed: ".$report->links_followed.$lb;
    echo "Documents received: ".$report->files_received.$lb;
    echo "Bytes received: ".$report->bytes_received." bytes".$lb;
    echo "Process runtime: ".$report->process_runtime." sec".$lb;
    // Note: $DocInfo only exists inside handleDocumentInfo() above,
    // it is not defined in this scope - so nothing useful gets written here.
    $var = "This is a test";
    $var2 = "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")";
    $fh = fopen("sitemap.xml", "w+");
    fwrite($fh, $var2);
    fclose($fh);
    ?>
    
     
  • Uwe Hunfeld - 2012-10-22

    Hi,

    The code I provided just had a ";" too many. This should work:

    <?php
    // Include the phpcrawl main class
    include("libs/PHPCrawler.class.php");
    class SitemapGenerator extends PHPCrawler
    {
      protected $sitemap_output_file;
      
      public function setSitemapOutputFile($file)
      {
        $this->sitemap_output_file = $file;
        
        if (file_exists($this->sitemap_output_file)) unlink($this->sitemap_output_file);
        
        file_put_contents($this->sitemap_output_file, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n".
                                                      "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\r\n", FILE_APPEND);
      }
      
      public function handleDocumentInfo($DocInfo)
      {
        // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>").
        if (PHP_SAPI == "cli") $lb = "\n";
        else $lb = "<br />";
        
        echo "Adding ".$DocInfo->url." to sitemap file".$lb;
        
        file_put_contents($this->sitemap_output_file, " <url>\r\n".
                                                      "  <loc>".$DocInfo->url."</loc>\r\n".
                                                      " </url>\r\n", FILE_APPEND);
        
        flush();
      }
      
      public function closeFile()
      {
        file_put_contents($this->sitemap_output_file, '</urlset>', FILE_APPEND);
      } 
    }
    $crawler = new SitemapGenerator();
    $crawler->setSitemapOutputFile("sitemap.xml"); // Set output-file
    $crawler->setURL("www.php.net");
    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i"); 
    // ... apply all other options and rules to the crawler
    $crawler->setPageLimit(10); // Just for testing
    $crawler->goMultiProcessed(5); // Or use go() if you don't want multiple processes
    $crawler->closeFile();
    ?>
    
     
  • Uwe Hunfeld - 2012-10-22

    Hi,

    It seems it's just not possible to paste code here without destroying parts of it (a ";" always gets added somewhere), so here it goes again:

    http://pastebin.com/EQDGtGA1

    Should work correctly.

    Best regards!

     
  • Nobody/Anonymous

    So I see that it creates "sitemap.xml"; however, it is only writing the XML header, not any entries for the actual sitemapped URLs…

    Any idea? Thank you so much for your assistance…
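
    One thing worth checking (just an assumption, since the thread doesn't say how the script is being run): goMultiProcessed() only works from the PHP command line and needs the pcntl/posix extensions, so if the script runs through a web server the single-process go() call is the safe choice:

    <?php
    // Assumption: the script is run through a web server, where
    // goMultiProcessed() is not available. Use single-process mode:
    $crawler->go();
    $crawler->closeFile();
    ?>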

     
  • Nobody/Anonymous

    Got it going now. Is there a way to filter URLs by keyword, or anything like that?
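
    For reference, PHPCrawl's URL-rule methods take regular expressions, so a keyword filter works just like the image-extension filter used above. A minimal sketch (the keywords are only examples):

    <?php
    // Ignore every URL that contains the word "archive" (case-insensitive)
    $crawler->addURLFilterRule("#archive# i");

    // Or the other way around: only follow URLs that contain "news"
    $crawler->addURLFollowRule("#news# i");
    ?>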

     
  • Nobody/Anonymous

    Yes, got it all going. Just wondering now if there is a way to do it via fopen/fwrite/fclose instead of file_put_contents, since that might be more optimal?

     
  • Nobody/Anonymous

    Yes, of course you can use fopen() and fwrite() to fill your sitemap file if you want to, but it doesn't really matter which you use; file_put_contents() does the job just as well. Why would fopen/fwrite be more optimal?
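
    If you do want a persistent file handle instead of repeated file_put_contents() calls, here is a minimal sketch of the same generator built on fopen/fwrite/fclose (the class name SitemapGeneratorFh is just an example):

    <?php
    // Include the phpcrawl main class
    include("libs/PHPCrawler.class.php");

    class SitemapGeneratorFh extends PHPCrawler
    {
      protected $fh;

      public function setSitemapOutputFile($file)
      {
        // Mode "w" truncates the file, so the sitemap is overwritten on every crawl
        $this->fh = fopen($file, "w");
        fwrite($this->fh, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n".
                          "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\r\n");
      }

      public function handleDocumentInfo($DocInfo)
      {
        // Entity-escape the URL as the sitemaps.org protocol requires
        fwrite($this->fh, " <url>\r\n".
                          "  <loc>".htmlspecialchars($DocInfo->url, ENT_QUOTES)."</loc>\r\n".
                          " </url>\r\n");
      }

      public function closeFile()
      {
        fwrite($this->fh, "</urlset>");
        fclose($this->fh);
      }
    }
    ?>

    One caveat: with goMultiProcessed() the handleDocumentInfo() callbacks run in several processes at once, and concurrent fwrite() calls on one shared handle may interleave, so the append-per-call file_put_contents() approach of the example above is the safer one there.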

     
