From: Igor S. <oz...@gr...> - 2000-08-21 20:15:25
Ramaseshan sent me this algorithm for the crawler:

    Instruction from CORD
    Start Crawling
    Talk to CREQ and get the set of URLs to crawl
    Add the URLs to the currently empty list of URLs to search
    Set the state of each URL to 'Ready to Crawl'
    For each URL, start a thread
    While the list of URLs to search is not empty,
    {
        Get the next URL in the list.
        Change the state of the URL to 'Crawling in Progress'
        // Before starting a crawl, each thread reads this state and picks up
        // the URL whose state is 'Ready to Crawl'
        // The following step can be avoided if the server sends only URLs
        // whose protocol is HTTP (FTP?)
        Check the URL to make sure its protocol is HTTP (FTP?)
        If not HTTP (FTP?) protocol
            Break
        Else
            See whether there's a robots.txt file at this site that includes
            a "Disallow" statement.
            If the document is disallowed for indexing
                Break
            Else
                Retrieve that document from the Web
                If unable to retrieve the document (determined by the timeouts)
                    Change the state of the URL to 'Unable to Crawl'
                    Break out of while loop
                End If
                Obey 'Meta Tag = Robots' and get all links in the document if allowed
                Resolve links (if present in the document) to get the absolute new URLs
                Store those new URLs in the database
                Compress the page
                Change the state of the URL to 'Crawling Completed'
            End If
        End If
    }
    If some of the URLs could not be crawled (unable to crawl, server down,
    page not found, etc.), return those URLs to the server for rescheduling

My comments on the algorithm:

> Talk to CREQ and get the set of URLs to crawl

[ozra] For the sake of modularity, I think that CCRW should be unaware of
CREQ's existence. It only needs to know the CDBP's interface. This interface
will provide operations such as getting a URL to crawl, storing the contents
of the pages, storing the newly found URLs, etc.

> Set the state of each URL to 'Ready to Crawl'

[ozra] When CORD schedules CCRW to run, CREQ will already have gotten new
URLs from the Server for crawling and will have marked them as such (or done
whatever is needed to make them ready for crawling).

> For each URL, start a thread

[ozra] Here is my suggestion on how we should handle the crawling threads.
When CORD starts CRW, it then executes either in its own thread, or perhaps
not. Let me know what you think about that. I think it is important to have
a mechanism that will effectively kill the crawling threads if we need to.
We could have a "magic number" for how many threads will crawl concurrently.
Say that magic number is 10. Prior to any crawling, 10 crawler threads will
be created. Then each thread operates in the loop that you described (there
is a rough sketch of this at the end of my comments):

> While the list of URLs to search is not empty,
> { ...
> }

[ozra] Then CRW waits for all threads to finish crawling before either the
main CRW thread dies or the function call that makes CRW run returns.

> Check the URL to make sure its protocol is HTTP (FTP?)
> If not HTTP (FTP?) protocol
>     Break

[ozra] I think we should have a protocol check-up. For now only HTTP will be
implemented, but we should make it modular enough that adding a new protocol
would be painless.

[ozra] I am all for using robots.txt and the robots meta tag during
crawling. When we attempt to crawl a page that is disallowed, we should mark
it "disallowed." But let's leave this task for the future. Don't worry about
it right now, unless you think it is necessary.
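To make the thread handling and the CDBP idea a little more concrete, here
is a very rough sketch of what the worker loop could look like. Every name
in it is made up for illustration (the Cdbp operations, httpFetch(),
extractAbsoluteLinks(), runCrawl()); the real thing would go through our
existing thread, mutex, and HTTP code rather than the standard library
threads used here for brevity, and 10 is just the magic number from the
example above.

#include <functional>
#include <optional>
#include <string>
#include <thread>
#include <vector>

// Placeholder for the CDBP interface; the real operations will be defined
// later.  Assumed to be safe to call from several threads at once.
struct Cdbp {
    // Hand out the next URL whose state is 'Ready to Crawl' (or nothing when
    // the list is exhausted) and flip its state to 'Crawling in Progress' so
    // no other thread picks it up.
    virtual std::optional<std::string> nextReadyUrl() = 0;
    virtual void markCompleted(const std::string& url) = 0;
    virtual void markUnableToCrawl(const std::string& url) = 0;
    virtual void storeNewUrls(const std::vector<std::string>& urls) = 0;
    virtual void storePage(const std::string& url, const std::string& body) = 0;
    virtual ~Cdbp() {}
};

// Dummy stand-ins so the sketch compiles on its own; the real versions come
// from the existing HTTP library and an HTML link parser.
bool httpFetch(const std::string& /*url*/, std::string& bodyOut) {
    bodyOut.clear();
    return false;                      // false = timeout, server down, etc.
}
std::vector<std::string> extractAbsoluteLinks(const std::string& /*baseUrl*/,
                                              const std::string& /*body*/) {
    return std::vector<std::string>(); // resolve relative links to absolute
}

// One worker thread: loop until CDBP has no more 'Ready to Crawl' URLs.
void crawlWorker(Cdbp& db) {
    while (std::optional<std::string> url = db.nextReadyUrl()) {
        // Protocol check-up: HTTP only for now, other protocols added later.
        if (url->compare(0, 7, "http://") != 0) {
            db.markUnableToCrawl(*url);
            continue;
        }
        // The robots.txt / meta-robots checks would slot in here later on.
        std::string body;
        if (!httpFetch(*url, body)) {
            db.markUnableToCrawl(*url);
            continue;
        }
        db.storeNewUrls(extractAbsoluteLinks(*url, body));
        db.storePage(*url, body);      // compression is Mikhail's side of it
        db.markCompleted(*url);
    }
}

// CCRW entry point: start the fixed pool, then wait for every worker to
// finish before returning control to CORD.
void runCrawl(Cdbp& db, int numThreads = 10) {
    std::vector<std::thread> pool;
    for (int i = 0; i < numThreads; ++i)
        pool.emplace_back(crawlWorker, std::ref(db));
    for (std::thread& t : pool)
        t.join();
}

The kill mechanism is not shown; a shared stop flag that nextReadyUrl()
checks (and that makes it return nothing) would be one simple way to get it.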
> If some of the URLs could not be crawled (unable to crawl, server down,
> page not found, etc.), return those URLs to the server for rescheduling

[ozra] Remember that this module will have no contact with the central
Server at all. To retrieve the URLs that are due for crawling, or to store
the contents of the crawled pages and the URLs found, the CDBP's interface
will be used. I know it is kind of abstract to you at this point because
that interface does not yet exist, but I think that's OK while you are
writing the p-code. I will assign someone to it soon. In fact, a lot of the
CDBP's interface will be figured out after your p-code is done.

Don't worry about how pages are stored or compressed. Mikhail is working on
archiving the crawled pages.

Just remember to use the existing code for the thread, mutex, and HTTP
operations. It can all be found in CVS. If you need additional functionality
in the HTTP library, contact Kosta, as he knows it best.

I know we will use time-outs when waiting for Server responses. Check to see
if that functionality is OK (there is a rough idea in the P.S. below in case
it is not).

Cheers,
igor.

--------------------------------------------------------------
Igor Stojanovski                     Grub.Org Inc.
Chief Technical Officer              5100 N. Brookline #830
                                     Oklahoma City, OK 73112
oz...@gr...                          Voice: (405) 917-9894
http://www.grub.org                  Fax: (405) 848-5477
--------------------------------------------------------------
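P.S. On the time-outs: if it turns out the existing HTTP code cannot give up
on a slow server by itself, a wrapper along these lines is one way a crawler
thread could still bail out and mark the URL 'Unable to Crawl'. This is only
a sketch under that assumption; blockingFetch() is a made-up stand-in for
whatever the library actually exposes, and a timeout built into the HTTP
code itself would be the cleaner fix.

#include <chrono>
#include <future>
#include <memory>
#include <string>
#include <thread>

// Stand-in for the library's blocking fetch call.
std::string blockingFetch(const std::string& /*url*/) {
    return std::string();
}

bool fetchWithTimeout(const std::string& url,
                      std::chrono::seconds limit,
                      std::string& bodyOut) {
    // Run the blocking fetch on its own thread and wait on a future with a
    // deadline.  On timeout the detached thread keeps running until the slow
    // fetch finally returns, which is wasteful but does not hold up the
    // crawler thread.
    auto prom = std::make_shared<std::promise<std::string>>();
    std::future<std::string> fut = prom->get_future();
    std::thread([prom, url]() {
        try {
            prom->set_value(blockingFetch(url));
        } catch (...) {
            prom->set_exception(std::current_exception());
        }
    }).detach();

    if (fut.wait_for(limit) != std::future_status::ready)
        return false;                  // timed out: treat as 'Unable to Crawl'
    try {
        bodyOut = fut.get();           // rethrows if the fetch itself failed
        return true;
    } catch (...) {
        return false;
    }
}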