
Problem with resume crawl on (very) large websites

Forum: Help
Created: 2013-05-05
Last post: 2013-05-13
  • brundleseth

    brundleseth - 2013-05-05

    Hi there,

    First off, thanks for a superb product; it's been a pleasure using it for some time now ;-)

    I've successfully been using PhpCrawl to crawl websites with resume crawl (using the SQLite implementation); my setup roughly follows the resumption pattern from the docs, sketched at the end of this post. This is working very nicely for medium-sized websites.

    However, once I get into the hundreds of thousands of pages, it somehow fails upon restart.

    I.e. it starts all over when I restart it (which it normally does not, so I'm "assuming" it's not a coding issue on my side).

    I was wondering if there are perhaps any limitations on the DB size? In the cache folder, the urlcache.db3 file is well over 1 MB, but again, that does not seem that heavy.

    Any clues?

    It should be mentioned that I'm storing the actual results in MySQL. Would it be possible for me to implement a MySQL alternative to the SQLite cache? That would keep all my current data usable, as I would rather not crawl everything all over again ;-))

    Thank you for any help or pointers!!

    :)
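
    For reference, my setup roughly follows the resumption pattern from the docs, something like the sketch below (the enableResumption()/getCrawlerId()/resume() and setUrlCacheType() calls are as I understand them from the 0.8x documentation, and the URL and file paths are just placeholders):

        <?php
        // Rough sketch of my resumable-crawl setup (method names as I read them
        // in the 0.8x docs; URL and file paths are placeholders).
        require_once("libs/PHPCrawler.class.php");

        class MyCrawler extends PHPCrawler
        {
            function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
            {
                // This is where I store the crawled URLs/results in MySQL.
            }
        }

        $crawler = new MyCrawler();
        $crawler->setURL("http://www.example.com/");
        $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE); // SQLite URL cache
        $crawler->enableResumption();                                        // make the process resumable

        $id_file = "/tmp/crawler_id.tmp";
        if (!file_exists($id_file)) {
            // First run: remember the crawler ID so a later run can resume it
            file_put_contents($id_file, $crawler->getCrawlerId());
        } else {
            // Restart: resume the aborted process instead of starting over
            $crawler->resume(file_get_contents($id_file));
        }

        $crawler->go();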

     
  • Anonymous

    Anonymous - 2013-05-06

    Hi brundleseth!

    An SQLite DB file should only be limited by the space on the hard drive (or maybe other OS limitations).

    But 1 MB seems far too small for an SQLite DB containing hundreds of thousands of URLs, or did you mean 1 GB?

    And does it crash when you try to restart a process, or does it crash during the crawling process?

    This is a difficult one, I think, since a lot of factors could have something to do with your problem (OS, HD space, file system, limitations in the PHP PDO extension and so on).

    And yes, you can (easily) implement a MySQL URL cache (I think).
    There's a base class called "PHPCrawlerURLCacheBase". You have to extend this class and implement all its methods, that's it ;) A rough skeleton follows below.

    Otherwise I could put it on the list of feature requests if you want.
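
    Here is an untested skeleton of what I mean. The overridden method names are only examples from memory; the authoritative list of abstract methods is whatever PHPCrawlerURLCacheBase actually declares, and PHPCrawlerSQLiteURLCache is a good template to copy from (adjust the require path to your install):

        <?php
        // Sketch only: a MySQL-backed URL cache for PHPCrawl.
        // Method names below are illustrative -- mirror the abstract methods
        // that PHPCrawlerURLCacheBase really declares (see
        // PHPCrawlerSQLiteURLCache for a working reference).
        require_once("libs/UrlCache/PHPCrawlerURLCacheBase.class.php");

        class PHPCrawlerMySQLURLCache extends PHPCrawlerURLCacheBase
        {
            protected $PDO;

            public function __construct($dsn, $user, $pass)
            {
                // One table holding the queued URLs plus a "processed" flag
                $this->PDO = new PDO($dsn, $user, $pass);
                $this->PDO->exec("CREATE TABLE IF NOT EXISTS urlcache (
                                    id INT AUTO_INCREMENT PRIMARY KEY,
                                    url_hash CHAR(32) UNIQUE,
                                    url_descriptor BLOB,
                                    processed TINYINT DEFAULT 0)");
            }

            // Illustrative: enqueue a URL (the serialized URL-descriptor object).
            // In practice you would hash the URL string itself so duplicate links collapse.
            public function addURL($UrlDescriptor)
            {
                $blob = serialize($UrlDescriptor);
                $stmt = $this->PDO->prepare(
                    "INSERT IGNORE INTO urlcache (url_hash, url_descriptor) VALUES (?, ?)");
                $stmt->execute(array(md5($blob), $blob));
            }

            // Illustrative: fetch the next unprocessed URL and mark it as taken
            public function getNextUrl()
            {
                $row = $this->PDO->query(
                    "SELECT id, url_descriptor FROM urlcache WHERE processed = 0 LIMIT 1"
                )->fetch(PDO::FETCH_ASSOC);
                if ($row === false) return null;
                $this->PDO->exec("UPDATE urlcache SET processed = 1 WHERE id = " . (int)$row["id"]);
                return unserialize($row["url_descriptor"]);
            }

            // ...plus the remaining abstract methods from PHPCrawlerURLCacheBase
        }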

     

    Last edit: Anonymous 2013-05-06
  • brundleseth

    brundleseth - 2013-05-06

    The exact filesize for urlcache.db3 is 1,268,736 bytes :)

    I'm not sure why it stopped; I was running it from the CLI, so it could be that I closed it by accident (without noticing) or that it just crashed. As part of the script I've stored each URL in MySQL, and there are now 400k URLs.

    In any case, it then restarts the crawl (many hundreds of thousands of URLs) from scratch. My code will not overwrite the existing URLs, but it fails to get above those couple of hundred thousand URLs. And I positively know there are approx. 5 times as many as I've crawled so far.

    I know it's pretty hard to debug remotely like that; but given that it did not break during the first 300k URLs, I'm thinking it must be some system limitation? I'm on Ubuntu 12.04 LTS.

    If you would put the MySQL cache on the feature list then that would be a killer !!

     
  • Anonymous

    Anonymous - 2013-05-06

    The strange thing is the small size of the urlcache.db3 file, so it seems to be kind of empty. You can check this with the sqlite3 client: look into the DB file and see how many URLs are in there (a small PHP snippet for counting them is at the end of this post).

    Or is it the size of the file AFTER you restarted your script and aborted it after some seconds?

    So... did this happen just once (crash -> restart -> crawler begins from scratch again)?
    Or are you able to reproduce this behaviour?

    I'm using Ubuntu myself for my crawler projects and have never had this problem so far. And yes, it's hard to say or test what's going wrong over there since it only happens after hundreds of thousands of pages.

    You know... it may just be a corrupted sector on your hard drive that's causing the problem, or a corrupted filesystem or corrupted memory...

    Did you take a look in the system logfiles to see if they say anything about something like this, or about segfaults or similar?
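
    By the way, to count the URLs in urlcache.db3 without the sqlite3 client, a quick PDO one-off like the following should do. It assumes the PDO SQLite driver is installed; the table names are read from sqlite_master rather than hard-coded, and the file path is a placeholder:

        <?php
        // Peek into urlcache.db3: list its tables and their row counts.
        // Adjust the path to wherever your phpcrawl cache folder lives.
        $db = new PDO("sqlite:/path/to/cache/urlcache.db3");

        $tables = $db->query("SELECT name FROM sqlite_master WHERE type = 'table'")
                     ->fetchAll(PDO::FETCH_COLUMN);

        foreach ($tables as $table) {
            $count = $db->query("SELECT COUNT(*) FROM \"$table\"")->fetchColumn();
            echo $table . ": " . $count . " rows\n";
        }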

     
  • brundleseth

    brundleseth - 2013-05-13

    I'm still trying to debug this one - it's weird.

    The file size could easily be due to the "recrawl", that's true.

    But it crawls, and then stops/crashes before it's finished - and I'm hosted at Rackspace, so I doubt it's a bad hard drive, as they're running RAID 6 etc. :-/

     
  • Anonymous

    Anonymous - 2013-05-13

    Hmm, what can I do to help you?

    Are you sure it didn't simply finish?
    Does it crash at a specific point in the process, like always after URL #145230?

    Could you send me your project/script together with the URL you are trying to crawl?
    And maybe also the urlcache.db3 file from DIRECTLY after the crawler crashed
    (or even better, the entire phpcrawl_tmp folder)?

    Then I'll take a look.

     
