example.php outputting to "sitemap...

2012-10-18
2013-04-09
  • Nobody/Anonymous

    I am trying to have example.php also output a file "sitemap.xml" each time the website is crawled, and if possible overwrite the previous file on each run.

    It is fine for the file to just be written to the server/FTP directory where the script resides, or to the root…

    Has anybody done this already, or could anybody assist? I am trying to have example.php do this.

    Any assistance would be much appreciated!

     
  • Nobody/Anonymous

    Hi, yes exactly like that! Thank you very much…I've been trying all day with no luck, and am quickly approaching crunch time.

    Your assistance would be greatly appreciated.

     
  • Nobody/Anonymous

    Hi again,

    OK, here's a quick example of a sitemap generator.
    It may need some further tweaks (proper URL escaping and so on), but it should work.

    <?php
    // Include the phpcrawl main class
    include("libs/PHPCrawler.class.php");
    class SitemapGenerator extends PHPCrawler 
    {
      protected $sitemap_output_file;
      
      public function setSitemapOutputFile($file)
      {
        $this->sitemap_output_file = $file;
        
        if (file_exists($this->sitemap_output_file)) unlink($this->sitemap_output_file);
        
        file_put_contents($this->sitemap_output_file,
                          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n".
                          "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\r\n",
                          FILE_APPEND);
      }
      
      public function handleDocumentInfo($DocInfo) 
      {
        // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>").
        if (PHP_SAPI == "cli") $lb = "\n";
        else $lb = "<br />";
        echo "Adding ".$DocInfo->url." to sitemap file".$lb;
        
        file_put_contents($this->sitemap_output_file,
                          " <url>\r\n".
                          "  <loc>".$DocInfo->url."</loc>\r\n".
                          " </url>\r\n",
                          FILE_APPEND);
        flush();
      }
      
      public function closeFile()
      {
        file_put_contents($this->sitemap_output_file, '</urlset>', FILE_APPEND);
      }
    }
    
    $crawler = new SitemapGenerator();
    $crawler->setSitemapOutputFile("sitemap.xml"); // Set output-file
    $crawler->setURL("www.php.net");
    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
    // ... apply all other options and rules to the crawler
    $crawler->setPageLimit(10); // Just for testing
    $crawler->goMultiProcessed(5); // Or use go() if you don't want multiple processes
    $crawler->closeFile();
    ?>
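
    One caveat: goMultiProcessed() starts several crawler processes that all append to the same file. Small FILE_APPEND writes are usually atomic on local filesystems, but to be safe you could additionally pass the LOCK_EX flag (just a defensive tweak; $xml_fragment below is a placeholder for whichever XML snippet is being written):

    file_put_contents($this->sitemap_output_file, $xml_fragment, FILE_APPEND | LOCK_EX);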
    

    If you have any questions just let me know.

     
  • Nobody/Anonymous

    Thank you!…What do you mean by "proper URL escaping"? Also, I will make all these additions available to the forum as much as possible, as I am sure this will help many others. Thank you!

     
  • Nobody/Anonymous

    Hi again!

    Just take a look at the sitemap file specification over at http://www.sitemaps.org/protocol.html: URLs have to be entity-escaped (and there are some other requirements, I think).
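
    For example, a minimal escaping sketch (the helper name escapeSitemapUrl is made up for illustration; htmlspecialchars() with ENT_QUOTES covers the five entities the protocol lists):

    <?php
    // Entity-escape a URL for use inside a <loc> element
    // (escapes & < > " ' as the sitemap protocol requires).
    function escapeSitemapUrl($url)
    {
      return htmlspecialchars($url, ENT_QUOTES, "UTF-8");
    }
    ?>

    In handleDocumentInfo() you would then write "  <loc>".escapeSitemapUrl($DocInfo->url)."</loc>\r\n" instead of the raw URL.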

     
  • Nobody/Anonymous

    I am still having an issue, as the example did not seem to work for me. This is the code I have that is working; I am just trying to add the part that puts the sitemap info into the sitemap.xml file. Right now it just creates the file.

    <?php
    // It may take a while to crawl a site ...
    set_time_limit(10000);
    // Include the phpcrawl main class
    include("libs/PHPCrawler.class.php");
    // Extend the class and override the handleDocumentInfo()-method 
    class MyCrawler extends PHPCrawler 
    {
      function handleDocumentInfo($DocInfo) 
      {
        // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>").
        if (PHP_SAPI == "cli") $lb = "\n";
        else $lb = "<br />";
        // Print the URL and the HTTP-status-Code
        echo "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")".$lb;
        
        // Print the refering URL
        echo "Referer-page: ".$DocInfo->referer_url.$lb;
        
        // Print whether the content of the document was received or not
        if ($DocInfo->received == true)
          echo "Content received: ".$DocInfo->bytes_received." bytes".$lb;
        else
          echo "Content not received".$lb; 
        
        // Now you should do something with the content of the actual
        // received page or file ($DocInfo->source), we skip it in this example 
        
        echo $lb;
        
        flush();
      } 
    }
    // Now, create a instance of your class, define the behaviour
    // of the crawler (see class-reference for more options and details)
    // and start the crawling-process. 
    $crawler = new MyCrawler();
    // URL to crawl
    $crawler->setURL("www.php.net");
    // Only receive content of files with content-type "text/html"
    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->addStreamToFileContentType("#^((?!text/html).)*$#");
    // Ignore links to pictures, dont even request pictures
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
    // Ignore links with '?' after
    $crawler->addNonFollowMatch("/\?/");
    // Store and send cookie-data like a browser does
    $crawler->enableCookieHandling(true);
    // Set the traffic-limit to 1 MB (in bytes,
    // for testing we dont want to "suck" the whole site)
    $crawler->setTrafficLimit(1000 * 1024);
    // Thats enough, now here we go
    $crawler->go();
    // At the end, after the process is finished, we print a short
    // report (see method getProcessReport() for more information)
    $report = $crawler->getProcessReport();
    if (PHP_SAPI == "cli") $lb = "\n";
    else $lb = "<br />";
        
    echo "Summary:".$lb;
    echo "Links followed: ".$report->links_followed.$lb;
    echo "Documents received: ".$report->files_received.$lb;
    echo "Bytes received: ".$report->bytes_received." bytes".$lb;
    echo "Process runtime: ".$report->process_runtime." sec".$lb;
    $var = "This is a test";
    // Note: $DocInfo only exists inside handleDocumentInfo(), so it is
    // undefined here in the global scope - this line won't print a real URL.
    $var2 = "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")";
    $fh = fopen("sitemap.xml", "w+");
    fwrite($fh, $var2);
    fclose($fh);
    ?>
    
     
  • Uwe Hunfeld - 2012-10-22

    Hi,

    my provided code just had one ";" too many; this should work:

    <?php
    // Include the phpcrawl main class
    include("libs/PHPCrawler.class.php");
    class SitemapGenerator extends PHPCrawler
    {
      protected $sitemap_output_file;
      
      public function setSitemapOutputFile($file)
      {
        $this->sitemap_output_file = $file;
        
        if (file_exists($this->sitemap_output_file)) unlink($this->sitemap_output_file);
        
        file_put_contents($this->sitemap_output_file, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n".
                                                      "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\r\n", FILE_APPEND);
      }
      
      public function handleDocumentInfo($DocInfo)
      {
        // Just detect linebreak for output ("\n" in CLI-mode, otherwise "<br>").
        if (PHP_SAPI == "cli") $lb = "\n";
        else $lb = "<br />";
        
        echo "Adding ".$DocInfo->url." to sitemap file".$lb;
        
        file_put_contents($this->sitemap_output_file, " <url>\r\n".
                                                      "  <loc>".$DocInfo->url."</loc>\r\n".
                                                      " </url>\r\n", FILE_APPEND);
        
        flush();
      }
      
      public function closeFile()
      {
        file_put_contents($this->sitemap_output_file, '</urlset>', FILE_APPEND);
      } 
    }
    $crawler = new SitemapGenerator();
    $crawler->setSitemapOutputFile("sitemap.xml"); // Set output-file
    $crawler->setURL("www.php.net");
    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i"); 
    // ... apply all other options and rules to the crawler
    $crawler->setPageLimit(10); // Just for testing
    $crawler->goMultiProcessed(5); // Or use go() if you don't want multiple processes
    $crawler->closeFile();
    ?>
    
     
  • Uwe Hunfeld - 2012-10-22

    Hi,

    It's just not possible to paste code here without destroying parts of it (the forum always adds a ";" somewhere),
    so here it goes again:

    http://pastebin.com/EQDGtGA1

    Should work correctly.

    Best regards!

     
  • Nobody/Anonymous

    So I see that it creates "sitemap.xml"…however, it is only putting in the XML header and not any of the sitemapped URLs themselves….

    Any idea?…thank you so much for your assistance…

     
  • Nobody/Anonymous

    Got it going now…is there a way to filter URLs by keyword or anything like that?
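
    (For reference: addURLFilterRule() and addNonFollowMatch() take regular expressions, so a keyword filter is just a pattern. A made-up example, with "keyword" standing in for whatever string should be excluded:

    <?php
    // Don't follow any link whose URL contains "keyword" (case-insensitive).
    $crawler->addNonFollowMatch("#keyword# i");
    ?>
    )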

     
  • Nobody/Anonymous

    Yes, got it all going. Just wondering now if there is a way to do it via "fopen/fwrite/fclose" instead of file_put_contents…since that would be more optimal?

     
  • Nobody/Anonymous

    Yes, of course you can use fopen and fwrite to fill your sitemap file if you want to, but it doesn't really matter which you use; file_put_contents does the job as well. Why should fopen/fwrite be more optimal?
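
    If you prefer the handle-based version anyway, a sketch of the same class using fopen/fwrite could look like this (keeping the handle open in a property instead of reopening the file on every write; the property name $fh is made up here):

    <?php
    // Include the phpcrawl main class
    include("libs/PHPCrawler.class.php");

    class SitemapGenerator extends PHPCrawler
    {
      protected $fh;

      public function setSitemapOutputFile($file)
      {
        // Mode "w" truncates the file, so an old sitemap gets overwritten.
        $this->fh = fopen($file, "w");
        fwrite($this->fh, "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\r\n".
                          "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\r\n");
      }

      public function handleDocumentInfo($DocInfo)
      {
        fwrite($this->fh, " <url>\r\n  <loc>".$DocInfo->url."</loc>\r\n </url>\r\n");
      }

      public function closeFile()
      {
        fwrite($this->fh, "</urlset>");
        fclose($this->fh);
      }
    }
    ?>

    Note that with goMultiProcessed() every child process gets its own copy of the object, so a single shared file handle may not work as expected there; with go() it is fine.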

     
