From: Igor S. <oz...@gr...> - 2000-09-25 19:11:16
Here is what I was thinking about scheduling. It is not too complicated to implement. Say these are the fields of the URL table: url_id, next_in_list, next_update, other fields... (they are unimportant in our case) ...where next_in_list is the url_id of the next URL in the list of URLs to be crawled in the same time interval. next_update is a calculated time interval which tells how much time must elapse before the URL is scheduled again, and which is recalculated by halving/doubling (or by a factor of 1.5) based on how often the page changed.

For example, let's say the first day grub Clients run we receive four new URLs (never crawled before), and we stick them in the URL table:

URL table:
url_id  next_in_list  next_update  in_crawl
U1      U2            8            0
U2      U3            8            0
U3      U4            8            0
U4      (null)        8            0

Once one of these URLs is crawled, if its contents changed (which is sure to happen for never-crawled pages), next_update will be halved to 4 for all of them.

Then in addition we need a table to represent a time line. Let's say that the smallest time interval is a week, and the time_id is computed as the number of weeks since the beginning of year 2000:

Timeline table:
time_id  front_url_id
0        U1

(Say that this week is the 30th.) When week 30 comes, a queue of URLs is assembled via the linked lists of URLs that the timeline table points to, using front_url_id. Here we have only time_id = 0, where zero is a special value meaning new URLs never crawled. Now it is week 30, but since we don't have an entry for time_id = 30, we go to time_id = 0 to get URLs never crawled before and build a crawling queue. It points to U1, and using next_in_list we can derive the queue:

QUEUE: U1, U2, U3, U4.

We also set the in_crawl flag to 1. This may be useful for cleaning up URLs that were never reported back from Clients. When Clients report back the results, in_crawl is set back to zero, and next_update is halved:

URL table:
url_id  next_in_list  next_update  in_crawl
U1      U2            4            0
U2      U3            4            0
U3      U4            4            0
U4      (null)        4            0

We create an entry for time_id = 34 (current week plus next_update -- 30 + 4) and now we have:

Timeline table:
time_id  front_url_id
0        (null)   -- we assume we never got any new URLs
34       U1

...34 is the week these URLs should be recrawled. Now week 34 comes along, and we derive a queue again, which would look the same (assuming that we never got any new URLs in the meantime):

QUEUE: U1, U2, U3, U4.

Say URLs U1 and U3 have been updated, and U2 and U4 have not. Then we would double/halve the next_update fields appropriately, and now there would be two linked lists (U1->U3 and U2->U4). Every URL in the table belongs to a linked list.

URL table:
url_id  next_in_list  next_update  in_crawl
U1      U3            2            0
U2      U4            8            0
U3      (null)        2            0
U4      (null)        8            0

Timeline table:
time_id  front_url_id
0        (null)
36       U1
42       U2

When week 36 comes, a queue containing U1 and U3 will be crawled, and so on. I think you get my point.

When we add a new URL, it is prepended to (inserted at the beginning of) the list under time_id = 0. Deleting URLs from the URL list should not be very difficult. We might add an extra flag to indicate DELETED, and any time a QUEUE of URLs to be crawled is created, another queue of URLs to be deleted can be built at the same time by looking at the deleted flag. This means that URLs will not be removed immediately, because they are part of the linked list. Of course, we could use a doubly linked list, but that is absolutely unnecessary.
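To make the mechanics concrete, here is a minimal sketch (Python with SQLite purely as a stand-in; the table and column names mirror the example above, but none of this is the real grub schema) of the two tables and the queue-assembly walk:

import sqlite3

# Placeholder schema -- SQLite and all identifiers here are only for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# URL table: each row is a node in a singly linked list via next_in_list.
cur.execute("""
    CREATE TABLE url_table (
        url_id       TEXT PRIMARY KEY,
        next_in_list TEXT,      -- url_id of the next URL due in the same interval
        next_update  INTEGER,   -- weeks to wait before the next recrawl
        in_crawl     INTEGER DEFAULT 0
    )""")

# Timeline table: one row per week, pointing at the front of that week's list.
# time_id = 0 is the special "never crawled" list.
cur.execute("""
    CREATE TABLE timeline (
        time_id      INTEGER PRIMARY KEY,   -- weeks since the beginning of 2000
        front_url_id TEXT
    )""")

# The four never-crawled URLs from the example, linked U1 -> U2 -> U3 -> U4.
cur.executemany("INSERT INTO url_table VALUES (?, ?, ?, 0)",
                [("U1", "U2", 8), ("U2", "U3", 8), ("U3", "U4", 8), ("U4", None, 8)])
cur.execute("INSERT INTO timeline VALUES (0, 'U1')")

def build_queue(week):
    """Walk the linked list for the given week (falling back to the
    never-crawled list at time_id = 0) and mark every URL as in_crawl."""
    row = cur.execute("SELECT front_url_id FROM timeline WHERE time_id = ?",
                      (week,)).fetchone()
    if row is None or row[0] is None:
        row = cur.execute(
            "SELECT front_url_id FROM timeline WHERE time_id = 0").fetchone()
    url_id = row[0] if row else None
    queue = []
    while url_id is not None:
        queue.append(url_id)
        cur.execute("UPDATE url_table SET in_crawl = 1 WHERE url_id = ?", (url_id,))
        url_id = cur.execute("SELECT next_in_list FROM url_table WHERE url_id = ?",
                             (url_id,)).fetchone()[0]
    return queue

print(build_queue(30))   # ['U1', 'U2', 'U3', 'U4']

Assembling the queue this way costs one small SELECT per URL in the week's list, which is exactly the flood of simple queries I worry about further down.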
At the beginning we will be overwhelmed with new URLs to be crawled, and if we don't think smart about it, we will either never update previously crawled URLs and only crawl new ones, or the other way around. That is why all new URLs are put under time_id = 0, so that our algorithm may combine crawling new URLs and old URLs. It would be great if (for example) 60% of the crawling were spent on crawling (and finishing) old URLs and 40% on newly found ones. In this way our database will both grow and stay up-to-date. But if crawling old URLs takes around 100% of our capacity or more, then we will have to sacrifice "up-to-dateness" in order to crawl new URLs. To cope with that problem -- say it is week 50 and we are backed up to week 35 -- our algorithm must be configurable so that we may decrease the number of new URLs being crawled. Plus, if we are really overwhelmed by incoming new URLs, we may schedule them in the future instead of making them due immediately, or even reschedule whole sublists of URLs, which is a very efficient and simple operation.

I am just not sure about one thing -- there will be a lot of simple SQL statements needed to traverse the linked lists, and I don't know how much of a burden that is. We will probably need hundreds of thousands to a million queries to build the URL queue (once grub takes off, of course). Besides this, another serious problem with this system may be: what if we lose part of the URL database (due to hardware failure, for example)? Then, chances are, most of the linked lists will be screwed up. Well, that's not so bad a problem after all (for our system, losing the URL list by itself is extremely terrible anyway).

Give me your thoughts.

Cheers,

ozra.

--------------------------------------------------------------
Igor Stojanovski                  Grub.Org Inc.
Chief Technical Officer           5100 N. Brookline #830
                                  Oklahoma City, OK 73112
oz...@gr...                       Voice: (405) 917-9894
http://www.grub.org               Fax: (405) 848-5477
--------------------------------------------------------------

-----Original Message-----
From: rda...@ca... [mailto:rda...@ca...] On Behalf Of Rodrigo Damazio
Sent: Friday, September 22, 2000 4:27 PM
To: Igor Stojanovski
Subject: Re: Storing and scheduling URLs

Igor Stojanovski wrote:

> To Rodrigo:
>
> We need to add a table (or tables) to our database that will store the URLs and some statistics along with them. We need a mechanism which will use those statistics for dispatching/scheduling URLs to the Clients to crawl. We have already talked about this issue.
>
> When Clients connect to the Server, they make a request. The response is a list of URLs for crawling. This module should offer an interface that will return such a list and record the action appropriately in the DB.
>
> I would like you to work on this module.
>
> You should devise an algorithm that will figure out how to schedule the URLs. Use some of the ideas we have already presented via our emails. I pasted an excerpt from an older email at the bottom of the msg.
>
> You must take into account several things in your model, though:
> 1) At the beginning, we will not have many Clients at our disposal, and our database will be overwhelmed with new URLs to be crawled. You must design this algorithm so that even when we have millions of URLs that were never crawled, our Clients will get back to those old pages, so that our database will stay as up-to-date as possible. Remember, our goal is to have the most up-to-date search engine on the net.
> 2) In the future, we will provide means to measure each Client's crawling performance (pages crawled/day), so that we can assign an appropriate number of URLs to each one of them. Don't worry about this one for now.
> 3) Also, we must think about security. We may need to introduce a certain amount of redundancy in order to check whether we get good data from our Clients. For example, we may have 10% redundancy in crawling. If the data from two Clients does not match, a third Client may be assigned to crawl the page in question, to figure out which Client "cheated". Of course, the page may have changed in the short time between the two Clients' crawls, and we may wrongfully conclude that a Client is rogue. Anyway, I say, don't worry about security for now. Let's leave this for a later stage.
> 4) The URL scheduling algorithm must be highly configurable and modular enough so that we may add new capabilities to it easily.
> 5) Many other things I haven't accounted for. Like, for example, taking into account the proximity of Clients to sites when dispatching the URLs...
>
> From an old message:
>
> About dispatching/scheduling URLs to Clients:
>
> [ozra] Dispatching (a term I borrowed from Robert) is a mechanism for scheduling URLs to Clients for crawling. Here is my suggestion on how to schedule the URLs. Every page that is crawled for the first time by our system is automatically scheduled to be crawled again in (say) two weeks. If in two weeks a Client crawls the page and finds that the page has changed, the next crawling time will be set to one week, or half the previous time; if the next week the page changed again, the time will be halved again to 3 days, and so on. If, on the other hand, a page didn't change, we might perhaps double the next scheduled time from two weeks to a month, etc.
>
> [Rodrigo] Hmmm, sounds good to me... just change doubling and halving to multiplying and dividing by 1.5, I guess that's a more proper value... also, we have to consider the situation where a client starts crawling a HUGE site (Geocities, for instance)... of course no one client will crawl all of it, so we have to make it schedule the parts it doesn't... and develop a good scheme so that no two clients will be crawling the same thing, and no pages will be left uncrawled...
>
> [ozra] Let's not forget that for each URL to be crawled, the Client needs to get "permission" from the Server. No exceptions.
>
> ---end msg---
>
> Give me your thoughts on this.
>
> Also, which one of your email addresses should I use now?

Use this one only... the old one will bounce messages...

About the algorithm, I agree, and there's one thing I think we should add - make updates (or perhaps even rating) of the most visited pages more frequent... so if a page is only visited once a year by someone (through our search, of course), it won't be updated every day or anything... To measure how often a page is visited, just add a redirect script instead of putting direct links to the pages...
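(A tiny sketch of that interval rule, assuming the 1.5 factor from the excerpt above plus an optional popularity weight fed by the redirect-script counts -- the function name, the threshold and the clamp are all made up for illustration, nothing we've agreed on:)

# Purely illustrative: recompute the recrawl interval when a Client reports back.
# FACTOR, the visit threshold and the clamp are assumptions, not agreed values.
FACTOR = 1.5
MIN_WEEKS, MAX_WEEKS = 1, 52

def next_interval(current_weeks, page_changed, weekly_visits=0):
    """Divide by 1.5 if the page changed, multiply by 1.5 if it did not,
    then pull popular pages (many redirect-script hits) toward shorter intervals."""
    weeks = current_weeks / FACTOR if page_changed else current_weeks * FACTOR
    if weekly_visits > 100:          # arbitrary threshold for a "popular" page
        weeks /= 2
    return max(MIN_WEEKS, min(MAX_WEEKS, round(weeks)))

# A changed, popular page scheduled at 8 weeks comes back in about 3 weeks:
print(next_interval(8, page_changed=True, weekly_visits=250))   # 3

The threshold and the clamp are only there to keep the example bounded; in practice they'd be configuration knobs.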
Anyway, you're asking for a high-complexity algorithm... I'll try to do it... Anyway, we have to organize our ideas on it better... We could always start with a random-picking algorithm, something like this: "take the list of all new URLs to be crawled, add it to the list of websites not recently updated, RANDOMLY mix it all, and start sending it to the clients"... Also, we gotta use cyclic queue reading, in a way that an entry is only removed when a client actually returns the crawled page instead of right when it's sent to a client, yet it won't be sent again to another client for a while (until the queue end has been reached, which means all URLs have been sent to the clients, and then it repeats)... BTW, what will we do if we have an empty queue?? Start updating everything again?? It's not likely to happen in the future, but it'll probably happen in the beginning...

Oh, actually, one little addition to the process above - the URLs to be updated will be interpolated with the new ones, so do a sort ONLY on the old ones so that the most recently updated will be the last in the sequence, that is, rearrange the URLs without changing the positions they got from the random mixing...

Max
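(For what it's worth, here is a rough sketch of the random-mix / cyclic-queue idea above, with made-up URL names and a purely in-memory queue -- an illustration, not a design:)

import random
from collections import deque

# Illustrative sketch of the random-mix / cyclic-queue idea; URL names are invented.
new_urls   = ["N1", "N2", "N3"]      # never crawled
stale_urls = ["U2", "U4"]            # old URLs due for an update
mixed = new_urls + stale_urls
random.shuffle(mixed)                # random mix of new and old

queue = deque(mixed)
outstanding = set()                  # handed to a client, not yet reported back

def next_for_client():
    """Hand out the front URL, then rotate it to the back instead of dropping it,
    so it gets retried later if the client never reports back."""
    url = queue[0]
    queue.rotate(-1)
    outstanding.add(url)
    return url

def report_back(url):
    """A client returned the crawled page: only now does the URL leave the queue."""
    outstanding.discard(url)
    if url in queue:
        queue.remove(url)

u = next_for_client()
report_back(u)
print(len(queue))   # one URL fewer than we started with

The interpolation trick (sorting only the old URLs so the most recently updated come last, without disturbing the random positions of the rest) is left out here to keep the sketch short.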