MySQL multithread

  • Nobody/Anonymous

    Hi Guys,
    Thanks for the script. Question: I'm finding that when I try to insert the contents into my DB with ->goMultiProcessed, the script hangs. It works fine with $crawler->go();

    TL;DR:
    Any ideas on how to insert data into the DB using $crawler->goMultiProcessed(10);?

     
  • Nobody/Anonymous

    Hi,

    could you post your script or explain how you open the DB connection and how you write
    to the MySQL database?

    Normally there shouldn't be any problems …

     
  • Nobody/Anonymous

    The usual details:

    $username = "xxxxx";
    $password = "xxxxxx";
    $host = "localhost";
    $database = "dbcccccc";
    mysql_connect($host,$username,$password) or die("Cannot connect to the database.<br>" . mysql_error());
    mysql_select_db($database) or die("Cannot select the database.<br>" . mysql_error());

    $sqlx = "INSERT INTO  table SET
    linksetid = '$linkuniq',
    ftimestamp = '$ntime',
    url = '$mylinks',
    anchor = '$anchor',
    level = '9',
    crawl_now = '2',
    ltype = '20'"
    $queryx = mysql_query($sqlx) or die("Cannot query the database.<br>" . mysql_error());

     
  • Nobody/Anonymous

    Hi again,

    I mean: where/when in your script do you open the DB connection, and where/when are you doing the insert statement?
    If you post your script, I'll take a look.

    And for your second problem, you should maybe escape your INSERT statement properly; this has nothing to do with phpcrawl itself (I guess).
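
    For example, a minimal sketch of an escaped version of the INSERT posted above, using mysql_real_escape_string() (same variable names as in your post; `table` is still a placeholder for the real table name):

    // Escape all externally supplied values before building the query
    $sqlx = "INSERT INTO `table` SET
    linksetid = '".mysql_real_escape_string($linkuniq)."',
    ftimestamp = '".mysql_real_escape_string($ntime)."',
    url = '".mysql_real_escape_string($mylinks)."',
    anchor = '".mysql_real_escape_string($anchor)."',
    level = '9',
    crawl_now = '2',
    ltype = '20'";
    mysql_query($sqlx) or die("Cannot query the database.<br>" . mysql_error());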

     
  • Nobody/Anonymous

    Thanks for your help. I realized this after posting, silly me.
    The insert statement goes here:

    class MyCrawler extends PHPCrawler
    {
      function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo)
      {
        $topanchor = $PageInfo->refering_linktext;
        $url = $PageInfo->url;

        {INSERT INTO DB HERE}

        // I also need the external links
        $linksfound = $PageInfo->links_found;

        foreach ($linksfound as $key => $value)
        {
          // get external links
          $mylinks = $value;
          $anchor = $value;

          {INSERT AGAIN INTO DB}

        } // end foreach
      } // end handler
    } // end extender

     
  • Nobody/Anonymous

    and where do you open the DB-connection (mysql_connect)?

     
  • Nobody/Anonymous

    I'm opening the connection using an include file at the top of the script. Another thing I've noticed: it is crawling in goMultiProcessed mode, but it takes ages when I include the insert into the DB as part of the script. If I comment out the insert statement, the script is fast. It also flies in single-thread mode with a DB. DB + goMultiProcessed = very slow.
    I'm on a dedicated box.

    <?php
    include("libs/PHPCrawler.class.php");
    include("../../conn.php");

    class MyCrawler extends PHPCrawler
    {
      function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo)
      {
        $topanchor = $PageInfo->refering_linktext;
        $url = $PageInfo->url;

        {INSERT INTO DB HERE}

        // I also need the external links
        $linksfound = $PageInfo->links_found;

        foreach ($linksfound as $key => $value)
        {
          // get external links
          $mylinks = $value;
          $anchor = $value;

          {INSERT AGAIN INTO DB}

        } // end foreach
      } // end handler
    } // end extender
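
    One common cause of this: goMultiProcessed() forks child processes, and a single MySQL connection opened in the parent (here via conn.php) ends up shared by all children, which can make queries stall or hang. A minimal sketch of giving every child its own connection instead, assuming your phpcrawl version provides the overridable initChildProcess() hook (check the docs of your release; credentials, table and URL below are placeholders):

    <?php
    include("libs/PHPCrawler.class.php");

    class MyCrawler extends PHPCrawler
    {
      // Runs once in every forked child before it starts crawling,
      // so each process gets its own MySQL connection.
      function initChildProcess()
      {
        mysql_connect("localhost", "user", "passwd"); // placeholder credentials
        mysql_select_db("mydb");                      // placeholder database
      }

      function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo)
      {
        $url = mysql_real_escape_string($PageInfo->url);
        mysql_query("INSERT INTO `table` SET url = '$url';")
          or die("Insert failed: " . mysql_error());
      }
    }

    $crawler = new MyCrawler();
    $crawler->setURL("www.example.com"); // placeholder URL
    $crawler->goMultiProcessed(10);
    ?>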

     
  • Uwe Hunfeld - 2012-08-17

    Hi!

    I just made a quick test with the script listed below (it works the same way as yours).
    I'm sorry, but I can't find any problem; there's no difference in process runtime between
    the script WITH the insert statement and without. Both take around 12 seconds over here.
    (I used PHP 5.3.2 and MySQL 5.1.61 on an Ubuntu 10.04.1 system for testing.)

    So again, sorry, but I don't know what the problem is with your script, server or MySQL database.

    This is the script i used:

    <?php
    // Include the phpcrawl mainclass
    include("libs/PHPCrawler.class.php");

    mysql_connect("localhost","root","passwd");
    mysql_select_db("test");

    class MyCrawler extends PHPCrawler
    {
      function handleDocumentInfo($DocInfo)
      {
        if (PHP_SAPI == "cli") $lb = "\n";
        else $lb = "<br />";

        echo "Page requested: ".$DocInfo->url." (".$DocInfo->http_status_code.")".$lb;
        mysql_query("INSERT INTO test SET url = '".$DocInfo->url."';");
      }
    }

    $crawler = new MyCrawler();
    $crawler->setURL("anyurl.com");
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
    $crawler->setPageLimit(100);
    $crawler->setWorkingDirectory("/dev/shm/");
    $crawler->goMultiProcessed(10);

    $report = $crawler->getProcessReport();

    if (PHP_SAPI == "cli") $lb = "\n";
    else $lb = "<br />";
       
    echo "Summary:".$lb;
    echo "Links followed: ".$report->links_followed.$lb;
    echo "Documents received: ".$report->files_received.$lb;
    echo "Bytes received: ".$report->bytes_received." bytes".$lb;
    echo "Process runtime: ".$report->process_runtime." sec".$lb;
    ?>
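
    For reference, the INSERT above assumes a `test` table with a `url` column; a guessed minimal schema, since the original isn't shown:

    // Hypothetical table layout matching the INSERT in the test script
    mysql_query("CREATE TABLE IF NOT EXISTS test (
                   id INT AUTO_INCREMENT PRIMARY KEY,
                   url VARCHAR(255) NOT NULL
                 )");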

    Did you use php CLI (console) or a browser?

     
  • Anonymous - 2014-12-28

    Hi, I can't figure out where to put the code. Please help.

     
