From: Igor S. <oz...@gr...> - 2000-08-21 20:15:25
Ramaseshan sent me this algorithm for the crawler:

    Instruction from CORD
    Start Crawling
    Talk to CREQ and get the set of URLs to crawl
    Add the URLs to the currently empty list of URLs to search
    Set the state of each URL to 'Ready to Crawl'
    For each URL, start a thread
    While the list of URLs to search is not empty,
    {
        Get the next URL in the list.
        Change the state of the URL to 'Crawling in Progress'
        // Before starting a crawl, each thread reads this state and picks up
        // the URL whose state is 'Ready to Crawl'
        // The following step can be avoided if the server sends only URLs
        // whose protocol is HTTP (FTP?)
        Check the URL to make sure its protocol is HTTP (FTP?)
        If not HTTP (FTP?) protocol
            Break
        Else
            See whether there's a robots.txt file at this site that includes
            a "Disallow" statement.
            If the document is disallowed for indexing
                Break
            Else
                Retrieve that document from the Web
                If unable to retrieve the document (determined by the timeouts)
                    Change the state of the URL to 'Unable to Crawl'
                    Break out of while loop
                End If
                Obey 'Meta Tag = Robots' and get all links in the document if allowed
                Resolve links (if present in the document) to get the absolute new URLs
                Store those new URLs in the database
                Compress the page
                Change the state of the URL to 'Crawling Completed'
            End If
        End If
    }
    If some of the URLs could not be crawled (unable to crawl, server down,
    page not found, etc.), return those URLs to the server for rescheduling

My comments on the algorithm:

> Talk to CREQ and get the set of URLs to crawl

[ozra] For the sake of modularity, I think that CCRW should be unaware of
CREQ's existence. It only needs to know the CDBP's interface. This interface
will provide operations such as getting a URL to crawl, storing the contents
of the pages, storing the newly found URLs, etc.

> Set the state of each URL to 'Ready to Crawl'

[ozra] When CORD schedules CCRW to run, CREQ will already have gotten new
URLs from the Server for crawling and will have marked them as such (or done
whatever is needed to make them ready for crawling).

> For each URL, start a thread

[ozra] Here is my suggestion on how we should handle the crawling threads.
When CORD starts CRW, it then executes either in its own thread, or perhaps
not. Let me know what you think about that. I think it is important to have
a mechanism that will effectively kill the crawling threads if we need to.
We could have a "magic number" for how many threads will crawl concurrently.
Say that magic number is 10. Prior to any crawling, 10 crawler threads will
be created. Then each thread operates in the loop that you described (there
is a rough sketch of this at the end of my comments):

> While the list of URLs to search is not empty,
> { ...
> }

[ozra] Then CRW waits for all threads to finish crawling before either the
main CRW thread dies or the function call that makes CRW run returns.

> Check the URL to make sure its protocol is HTTP (FTP?)
> If not HTTP (FTP?) protocol
>     Break

[ozra] I think we should have a protocol check-up. For now only HTTP will be
implemented, but we should make it modular enough that adding a new protocol
would be painless.

[ozra] I am all for using robots.txt and the robots meta tag during
crawling. When we attempt to crawl a page that is disallowed, we should mark
it "disallowed." But let's leave this task for the future. Don't worry about
it right now, unless you think it is necessary.
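To make the thread handling and the CDBP idea a little more concrete, here
is a very rough sketch of what the worker loop could look like. Every name
in it is made up for illustration (the Cdbp operations, httpFetch(),
extractAbsoluteLinks(), runCrawl()); the real thing would go through our
existing thread, mutex, and HTTP code rather than the standard library
threads used here for brevity, and 10 is just the magic number from the
example above.

#include <functional>
#include <optional>
#include <string>
#include <thread>
#include <vector>

// Placeholder for the CDBP interface; the real operations will be defined
// later.  Assumed to be safe to call from several threads at once.
struct Cdbp {
    // Hand out the next URL whose state is 'Ready to Crawl' (or nothing when
    // the list is exhausted) and flip its state to 'Crawling in Progress' so
    // no other thread picks it up.
    virtual std::optional<std::string> nextReadyUrl() = 0;
    virtual void markCompleted(const std::string& url) = 0;
    virtual void markUnableToCrawl(const std::string& url) = 0;
    virtual void storeNewUrls(const std::vector<std::string>& urls) = 0;
    virtual void storePage(const std::string& url, const std::string& body) = 0;
    virtual ~Cdbp() {}
};

// Dummy stand-ins so the sketch compiles on its own; the real versions come
// from the existing HTTP library and an HTML link parser.
bool httpFetch(const std::string& /*url*/, std::string& bodyOut) {
    bodyOut.clear();
    return false;                      // false = timeout, server down, etc.
}
std::vector<std::string> extractAbsoluteLinks(const std::string& /*baseUrl*/,
                                              const std::string& /*body*/) {
    return std::vector<std::string>(); // resolve relative links to absolute
}

// One worker thread: loop until CDBP has no more 'Ready to Crawl' URLs.
void crawlWorker(Cdbp& db) {
    while (std::optional<std::string> url = db.nextReadyUrl()) {
        // Protocol check-up: HTTP only for now, other protocols added later.
        if (url->compare(0, 7, "http://") != 0) {
            db.markUnableToCrawl(*url);
            continue;
        }
        // The robots.txt / meta-robots checks would slot in here later on.
        std::string body;
        if (!httpFetch(*url, body)) {
            db.markUnableToCrawl(*url);
            continue;
        }
        db.storeNewUrls(extractAbsoluteLinks(*url, body));
        db.storePage(*url, body);      // compression is Mikhail's side of it
        db.markCompleted(*url);
    }
}

// CCRW entry point: start the fixed pool, then wait for every worker to
// finish before returning control to CORD.
void runCrawl(Cdbp& db, int numThreads = 10) {
    std::vector<std::thread> pool;
    for (int i = 0; i < numThreads; ++i)
        pool.emplace_back(crawlWorker, std::ref(db));
    for (std::thread& t : pool)
        t.join();
}

The kill mechanism is not shown; a shared stop flag that nextReadyUrl()
checks (and that makes it return nothing) would be one simple way to get it.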
> If some of the URLs could not be crawled (unable to crawl, server down,
> page not found, etc.), return those URLs to the server for rescheduling

[ozra] Remember that this module will have no contact with the central
Server at all. To retrieve the URLs that are due for crawling, or to store
the contents of the crawled pages and the URLs found, the CDBP's interface
will be used. I know it is kind of abstract to you at this point because
that interface does not yet exist, but I think that's OK while you are
writing the p-code. I will assign someone to it soon. In fact, a lot of the
CDBP's interface will be figured out after your p-code is done.

Don't worry about how pages are stored or compressed. Mikhail is working on
archiving the crawled pages.

Just remember to use the existing code for the thread, mutex, and HTTP
operations. It can all be found in CVS. If you need additional functionality
in the HTTP library, contact Kosta, as he knows it best.

I know we will use time-outs when waiting for Server responses. Check to see
if that functionality is OK (there is a rough idea in the P.S. below in case
it is not).

Cheers,
igor.

--------------------------------------------------------------
Igor Stojanovski                     Grub.Org Inc.
Chief Technical Officer              5100 N. Brookline #830
                                     Oklahoma City, OK 73112
oz...@gr...                          Voice: (405) 917-9894
http://www.grub.org                  Fax: (405) 848-5477
--------------------------------------------------------------
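P.S. On the time-outs: if it turns out the existing HTTP code cannot give up
on a slow server by itself, a wrapper along these lines is one way a crawler
thread could still bail out and mark the URL 'Unable to Crawl'. This is only
a sketch under that assumption; blockingFetch() is a made-up stand-in for
whatever the library actually exposes, and a timeout built into the HTTP
code itself would be the cleaner fix.

#include <chrono>
#include <future>
#include <memory>
#include <string>
#include <thread>

// Stand-in for the library's blocking fetch call.
std::string blockingFetch(const std::string& /*url*/) {
    return std::string();
}

bool fetchWithTimeout(const std::string& url,
                      std::chrono::seconds limit,
                      std::string& bodyOut) {
    // Run the blocking fetch on its own thread and wait on a future with a
    // deadline.  On timeout the detached thread keeps running until the slow
    // fetch finally returns, which is wasteful but does not hold up the
    // crawler thread.
    auto prom = std::make_shared<std::promise<std::string>>();
    std::future<std::string> fut = prom->get_future();
    std::thread([prom, url]() {
        try {
            prom->set_value(blockingFetch(url));
        } catch (...) {
            prom->set_exception(std::current_exception());
        }
    }).detach();

    if (fut.wait_for(limit) != std::future_status::ready)
        return false;                  // timed out: treat as 'Unable to Crawl'
    try {
        bodyOut = fut.get();           // rethrows if the fetch itself failed
        return true;
    } catch (...) {
        return false;
    }
}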