From: Igor S. <oz...@gr...> - 2000-09-15 17:30:44
To Rodrigo:

We need to add a table (or tables) to our database that will store the URLs along with some statistics about them. We also need a mechanism that uses those statistics for dispatching/scheduling URLs to the Clients to crawl. We have already talked about this issue. When Clients connect to the Server, they make a request, and the response is a list of URLs to crawl. This module should offer an interface that returns such a list and records the action appropriately in the DB. I would like you to work on this module. You should devise an algorithm that figures out how to schedule the URLs. Use some of the ideas we have already exchanged via our emails; I pasted an excerpt from an older email at the bottom of this message, and a few rough sketches of what I have in mind follow the list below. You must take several things into account in your model, though:

1) At the beginning, we will not have many Clients at our disposal, and our database will be overwhelmed with new URLs to be crawled. You must design the algorithm so that even when we have millions of URLs that have never been crawled, our Clients will still get back to the old pages, so that our database stays as up-to-date as possible. Remember, our goal is to have the most up-to-date search engine on the net.

2) In the future, we will provide a means to measure each Client's crawling performance (pages crawled per day), so that we can assign an appropriate number of URLs to each of them. Don't worry about this one for now.

3) We must also think about security. We may need to introduce a certain amount of redundancy in order to check whether we get good data from our Clients. For example, we may have 10% redundancy in crawling: if the data from two Clients does not match, a third Client may be assigned to crawl the page in question and help figure out which Client "cheated". Of course, the page may have changed in the short time between the two crawls, and we might wrongfully conclude that a Client is rogue. Anyway, I say don't worry about security for now; let's leave it for a later stage (the last sketch below just writes the idea down so we don't forget it).

4) The URL scheduling algorithm must be highly configurable and modular enough that we can add new capabilities to it easily.

5) There are many other things I haven't accounted for yet, like taking into account the proximity of Clients to sites when dispatching the URLs...
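To make the interface more concrete, here is a very rough sketch of the kind of table and dispatch call I am imagining. It is Python just for illustration; the table name, column names, and batch size are placeholders I made up, not a design. Handing out the oldest due URLs first is one simple way to address point 1.

import sqlite3
import time

# Placeholder schema -- every name here is up for discussion.
SCHEMA = """
CREATE TABLE IF NOT EXISTS crawl_url (
    url           TEXT PRIMARY KEY,
    last_crawled  INTEGER,           -- unix time of last crawl, NULL if never crawled
    next_due      INTEGER NOT NULL,  -- unix time when the URL should be crawled again
    interval_secs INTEGER NOT NULL,  -- current re-crawl interval in seconds
    assigned_to   TEXT               -- Client id while the URL is checked out, else NULL
)
"""

def dispatch_urls(conn, client_id, batch_size=100):
    """Hand a batch of due URLs to a Client and record the checkout in the DB.

    A Client may only crawl URLs returned by a call like this one -- that is
    the "permission from the Server, no exception" rule from the old message.
    """
    now = int(time.time())
    cur = conn.execute(
        "SELECT url FROM crawl_url "
        "WHERE next_due <= ? AND assigned_to IS NULL "
        "ORDER BY next_due LIMIT ?",      # oldest due URLs go out first
        (now, batch_size))
    urls = [row[0] for row in cur.fetchall()]
    conn.executemany(
        "UPDATE crawl_url SET assigned_to = ? WHERE url = ?",
        [(client_id, u) for u in urls])
    conn.commit()
    return urls

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    print(dispatch_urls(conn, "client-42"))   # empty list until URLs are loaded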
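And here is the adaptive re-crawl interval from the excerpt quoted further down, using Rodrigo's 1.5 factor. The two-week starting interval comes from that excerpt; the one-day floor and 90-day ceiling are my own guesses and should really be configuration values. It assumes the same placeholder table as in the sketch above.

DAY = 24 * 60 * 60
START_INTERVAL = 14 * DAY   # first re-crawl two weeks after discovery
MIN_INTERVAL = 1 * DAY      # assumed floor -- should come from configuration
MAX_INTERVAL = 90 * DAY     # assumed ceiling -- should come from configuration
FACTOR = 1.5                # Rodrigo's suggestion instead of doubling/halving

def next_interval(current_interval, page_changed):
    """Shrink the interval when the page changed, grow it when it did not."""
    if page_changed:
        new = current_interval / FACTOR
    else:
        new = current_interval * FACTOR
    return int(min(max(new, MIN_INTERVAL), MAX_INTERVAL))

def record_crawl_result(conn, url, page_changed, now):
    """Update the schedule for one URL after a Client reports back.

    Assumes the URL row already exists (inserted when the URL was first
    discovered, with next_due = now and interval_secs = START_INTERVAL).
    """
    (current,) = conn.execute(
        "SELECT interval_secs FROM crawl_url WHERE url = ?", (url,)).fetchone()
    interval = next_interval(current, page_changed)
    conn.execute(
        "UPDATE crawl_url SET last_crawled = ?, next_due = ?, "
        "interval_secs = ?, assigned_to = NULL WHERE url = ?",
        (now, now + interval, interval, url))
    conn.commit()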
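Finally, just so we don't lose the idea from point 3 even though it's deferred: a tiny sketch of the 10% redundancy check. The selection rate and the use of a content checksum for comparison are assumptions on my part, nothing more.

import hashlib
import random

REDUNDANCY_RATE = 0.10   # assumed fraction of dispatched URLs to double-crawl

def needs_verification_crawl():
    """Randomly select roughly 10% of dispatched URLs for a second Client."""
    return random.random() < REDUNDANCY_RATE

def reports_disagree(content_a, content_b):
    """Compare two Clients' copies of the same page (raw bytes) by checksum.

    A mismatch is only a hint, not proof of cheating -- the page may have
    changed between the two crawls -- so a mismatch should trigger a third,
    tie-breaking crawl rather than an immediate conclusion.
    """
    return hashlib.md5(content_a).hexdigest() != hashlib.md5(content_b).hexdigest()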
From an old message:

About dispatching/scheduling URLs to Clients:

[ozra] Dispatching (a term I borrowed from Robert) is a mechanism for scheduling URLs to Clients for crawling. Here is my suggestion on how to schedule the URLs. Every page that is crawled for the first time by our system is automatically scheduled to be crawled again in (say) two weeks. If in two weeks a Client crawls the page and finds that it has changed, the next crawl will be scheduled in one week, or half the previous interval; if next week the page has changed again, the interval will be halved again to 3 days, and so on. If, on the other hand, the page didn't change, we might double the next scheduled interval from two weeks to a month, and so on.

[Rodrigo] Hmmm, sounds good to me... just change doubling and halving to multiplying and dividing by 1.5; I guess that's a more appropriate value... Also, we have to consider the situation where a Client starts crawling a HUGE site (Geocities, for instance)... of course no one Client will crawl all of it, so we have to make it schedule the parts it doesn't... and develop a good scheme so that no two Clients will be crawling the same thing, and no pages will be left uncrawled...

[ozra] Let's not forget that for each URL that will be crawled, the Client needs to get "permission" from the Server. No exception.

---end msg---

Give me your thoughts on this.

Cheers,
ozra.

--------------------------------------------------------------
Igor Stojanovski                     Grub.Org Inc.
Chief Technical Officer              5100 N. Brookline #830
                                     Oklahoma City, OK 73112
oz...@gr...                          Voice: (405) 917-9894
http://www.grub.org                  Fax: (405) 848-5477
--------------------------------------------------------------