
Problem with resume crawl on (very) large websites

Forum: Help
Created: 2013-05-05
Last post: 2013-05-13
  • brundleseth

    brundleseth - 2013-05-05

    Hi there,

    First off, thanks for a superb product; it's been a pleasure using it for some time now ;-)

    I've successfully been using PhpCrawl to crawl websites with resume crawl (using the SQLite implementation); my setup roughly follows the resumption pattern from the docs, sketched at the end of this post. This is working very nicely for medium-sized websites.

    However, once I get into the hundreds of thousands of pages, it somehow fails upon restart.

    I.e. it starts all over when I restart it (which it normally does not, so I'm "assuming" it's not a coding issue on my side).

    I was wondering if there are perhaps any limitations on the DB size? In the cache folder, the urlcache.db3 file is well over 1 MB, but again, that does not seem that heavy.

    Any clues?

    It should be mentioned that I'm storing the actual results in MySQL. Would it be possible for me to implement a MySQL alternative to the SQLite cache? That would keep all my current data usable, as I would rather not crawl everything all over again ;-))

    Thank you for any help or pointers!!

    :)
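
    For reference, my setup roughly follows the resumption pattern from the docs, something like the sketch below (the enableResumption()/getCrawlerId()/resume() and setUrlCacheType() calls are as I understand them from the 0.8x documentation, and the URL and file paths are just placeholders):

        <?php
        // Rough sketch of my resumable-crawl setup (method names as I read them
        // in the 0.8x docs; URL and file paths are placeholders).
        require_once("libs/PHPCrawler.class.php");

        class MyCrawler extends PHPCrawler
        {
            function handleDocumentInfo(PHPCrawlerDocumentInfo $DocInfo)
            {
                // This is where I store the crawled URLs/results in MySQL.
            }
        }

        $crawler = new MyCrawler();
        $crawler->setURL("http://www.example.com/");
        $crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE); // SQLite URL cache
        $crawler->enableResumption();                                        // make the process resumable

        $id_file = "/tmp/crawler_id.tmp";
        if (!file_exists($id_file)) {
            // First run: remember the crawler ID so a later run can resume it
            file_put_contents($id_file, $crawler->getCrawlerId());
        } else {
            // Restart: resume the aborted process instead of starting over
            $crawler->resume(file_get_contents($id_file));
        }

        $crawler->go();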

     
  • Anonymous

    Anonymous - 2013-05-06

    Hi brundleseth!

    An SQLite DB file should only be limited by the space on the hard drive (or maybe other OS limitations).

    But 1 MB seems far too small for an SQLite DB containing hundreds of thousands of URLs, or did you mean 1 GB?

    And does it crash when you try to restart a process, or does it crash during the crawling process?

    This is a difficult one, I think, since a lot of factors could have something to do with your problem (OS, HD space, file system, limitations in the PHP PDO extension and so on).

    And yes, you can (easily) implement a MySQL URL cache (I think).
    There's a base class called "PHPCrawlerURLCacheBase". You have to extend this class and implement all its methods, that's it ;) A rough skeleton follows below.

    Otherwise I could put it on the list of feature requests if you want.
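
    Here is an untested skeleton of what I mean. The overridden method names are only examples from memory; the authoritative list of abstract methods is whatever PHPCrawlerURLCacheBase actually declares, and PHPCrawlerSQLiteURLCache is a good template to copy from (adjust the require path to your install):

        <?php
        // Sketch only: a MySQL-backed URL cache for PHPCrawl.
        // Method names below are illustrative -- mirror the abstract methods
        // that PHPCrawlerURLCacheBase really declares (see
        // PHPCrawlerSQLiteURLCache for a working reference).
        require_once("libs/UrlCache/PHPCrawlerURLCacheBase.class.php");

        class PHPCrawlerMySQLURLCache extends PHPCrawlerURLCacheBase
        {
            protected $PDO;

            public function __construct($dsn, $user, $pass)
            {
                // One table holding the queued URLs plus a "processed" flag
                $this->PDO = new PDO($dsn, $user, $pass);
                $this->PDO->exec("CREATE TABLE IF NOT EXISTS urlcache (
                                    id INT AUTO_INCREMENT PRIMARY KEY,
                                    url_hash CHAR(32) UNIQUE,
                                    url_descriptor BLOB,
                                    processed TINYINT DEFAULT 0)");
            }

            // Illustrative: enqueue a URL (the serialized URL-descriptor object).
            // In practice you would hash the URL string itself so duplicate links collapse.
            public function addURL($UrlDescriptor)
            {
                $blob = serialize($UrlDescriptor);
                $stmt = $this->PDO->prepare(
                    "INSERT IGNORE INTO urlcache (url_hash, url_descriptor) VALUES (?, ?)");
                $stmt->execute(array(md5($blob), $blob));
            }

            // Illustrative: fetch the next unprocessed URL and mark it as taken
            public function getNextUrl()
            {
                $row = $this->PDO->query(
                    "SELECT id, url_descriptor FROM urlcache WHERE processed = 0 LIMIT 1"
                )->fetch(PDO::FETCH_ASSOC);
                if ($row === false) return null;
                $this->PDO->exec("UPDATE urlcache SET processed = 1 WHERE id = " . (int)$row["id"]);
                return unserialize($row["url_descriptor"]);
            }

            // ...plus the remaining abstract methods from PHPCrawlerURLCacheBase
        }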

     

    Last edit: Anonymous 2013-05-06
  • brundleseth

    brundleseth - 2013-05-06

    The exact filesize for urlcache.db3 is 1,268,736 bytes :)

    I'm not sure why it stopped; I was running it from the CLI, so it could be that I closed it by accident (without noticing) or that it just crashed. As part of the script I've stored each URL in MySQL, and there are now 400k URLs.

    In any case, it then restarts the crawl (many hundreds of thousands of URLs) from scratch. My code will not overwrite the existing URLs, but it fails to get above those couple of hundred thousand URLs. And I positively know there are approx. 5 times as many as I've crawled so far.

    I know it's pretty hard to debug remotely like that; but given that it did not break during the first 300k URLs, I'm thinking it must be some system limitation? I'm on Ubuntu 12.04 LTS.

    If you would put the MySQL cache on the feature list then that would be a killer !!

     
  • Anonymous

    Anonymous - 2013-05-06

    The strange thing is the small size of the urlcache.db3 file, so it seems to be kind of empty. You can check this with the sqlite3 client: look into the DB file and see how many URLs are in there (a small PHP snippet for counting them is at the end of this post).

    Or is it the size of the file AFTER you restarted your script and aborted it after some seconds?

    So... did this happen just once (crash -> restart -> crawler begins from scratch again)?
    Or are you able to reproduce this behaviour?

    I'm using Ubuntu myself for my crawler projects and have never had this problem so far. And yes, it's hard to say or test what's going wrong over there since it only happens after hundreds of thousands of pages.

    You know... it may just be a corrupted sector on your hard drive that's causing the problem, or a corrupted filesystem or corrupted memory...

    Did you take a look in the system logfiles to see if they say anything about something like this, or about segfaults or similar?
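
    By the way, to count the URLs in urlcache.db3 without the sqlite3 client, a quick PDO one-off like the following should do. It assumes the PDO SQLite driver is installed; the table names are read from sqlite_master rather than hard-coded, and the file path is a placeholder:

        <?php
        // Peek into urlcache.db3: list its tables and their row counts.
        // Adjust the path to wherever your phpcrawl cache folder lives.
        $db = new PDO("sqlite:/path/to/cache/urlcache.db3");

        $tables = $db->query("SELECT name FROM sqlite_master WHERE type = 'table'")
                     ->fetchAll(PDO::FETCH_COLUMN);

        foreach ($tables as $table) {
            $count = $db->query("SELECT COUNT(*) FROM \"$table\"")->fetchColumn();
            echo $table . ": " . $count . " rows\n";
        }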

     
  • brundleseth

    brundleseth - 2013-05-13

    I'm still trying to debug this one - it's weird.

    The file size could easily be due to the "recrawl", that's true.

    But it crawls, and then stops/crashes before it's finished - and I'm hosted at Rackspace, so I doubt it's a bad hard drive, as they're running RAID 6 etc. :-/

     
  • Anonymous

    Anonymous - 2013-05-13

    Hmm, what can I do to help you?

    Are you sure it didn't simply finish?
    Does it crash at a specific point in the process, like always after URL #145230?

    Could you send me your project/script together with the URL you are trying to crawl?
    And maybe also the urlcache.db3 file from DIRECTLY after the crawler crashed
    (or even better, the entire phpcrawl_tmp folder)?

    Then I'll take a look.

     
