From: Igor S. <oz...@gr...> - 2000-09-27 01:27:55
--------------------------------------------------------------
Igor Stojanovski
Grub.Org Inc.
Chief Technical Officer
5100 N. Brookline #830
Oklahoma City, OK 73112
oz...@gr...
Voice: (405) 917-9894
http://www.grub.org
Fax: (405) 848-5477
--------------------------------------------------------------

-----Original Message-----
From: rda...@gi... [mailto:rda...@gi...] On Behalf Of Rodrigo Damazio
Sent: Monday, September 25, 2000 9:01 PM
To: Igor Stojanovski
Subject: Re: FW: Storing and scheduling URLs

Igor Stojanovski wrote:

> Here is what I was thinking about for scheduling. It is not too
> complicated to implement.
>
> Say these are the fields of the URL table:
>
> url_id, next_in_list, next_update, other fields... (they are
> unimportant in our case)
>
> ...where next_in_list is the url_id of the next URL in the list of URLs
> to be crawled in the same time interval. next_update is a calculated
> time interval which tells how much time must elapse before the URL is
> scheduled again, and which is recalculated by halving/doubling (or by a
> factor of 1.5) based on how often the page changed.
>
> For example, let's say the first day grub Clients run we receive four
> new URLs (never crawled before), and we stick them in the URL table:
>
> URL table:
> url_id   next_in_list   next_update   in_crawl
> U1       U2             8             0
> U2       U3             8             0
> U3       U4             8             0
> U4       (null)         8             0
>
> next_update tells us that once each of these URLs is crawled, if its
> contents changed (which is sure to happen for never-crawled pages), the
> interval will be halved to 4 for all of them.
>
> Then in addition we need a table to represent a timeline. Let's say
> that the smallest time interval is a week, and the week_id is computed
> as the number of weeks since the beginning of the year 2000:
>
> Timeline table:
> time_id   front_url_id
> 0         U1
>
> (Say that the current week is week 30.) When week 30 comes, a queue of
> URLs is assembled via the linked lists of URLs that the Timeline table
> points to, using front_url_id. Here we have time_id = 0, where zero is
> a special value meaning new URLs never crawled. It is now week 30, but
> since we don't have an entry for time_id = 30, we go to time_id = 0 to
> get URLs never crawled before and build a crawling queue. It points to
> U1, and using next_in_list we can derive the queue:
>
> QUEUE: U1, U2, U3, U4.
>
> We also set the in_crawl flag to 1. This may be useful for cleaning up
> URLs that were never reported back by the Clients.

Hmmm, so far sounds good. I just don't think we need to use the
next_in_list thing - we can just do a query like "give me all URLs with
time_id = 0", something like that...

> When the Clients report back the results, in_crawl is set back to zero,
> and next_update is halved:

Note: it's only halved if the content has changed!!

[ozra] That's what I am saying, too.

> URL table:
> url_id   next_in_list   next_update   in_crawl
> U1       U2             4             0
> U2       U3             4             0
> U3       U4             4             0
> U4       (null)         4             0
>
> We create an entry for time_id = 34 (current week plus next_update --
> 30 + 4), and now we have:
>
> Timeline table:
> time_id   front_url_id
> 0         (null)   -- we assume we never got any new URLs
> 34        U1
>
> ...34 is the week these URLs should be recrawled.
>
> Now week 34 comes along, and we derive a queue again, which would look
> the same (assuming that we never got any new URLs in the meantime):
>
> QUEUE: U1, U2, U3, U4.
>
> Say URLs U1 and U3 have been updated, and U2 and U4 have not.
>
> Then we would double/halve the next_update fields appropriately, and
> now there would be two linked lists (U1->U3 and U2->U4). Every URL in
> the table belongs to a linked list.
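A minimal sketch of this reschedule step (plain Python dictionaries
standing in for the two tables; the names are illustrative, not an actual
schema): halve next_update when the page changed, double it when it did
not, and link the URL into the list for the week it is next due. The
quoted tables just below show the state this produces:

# Illustrative sketch of the reschedule step described above -- plain
# Python dictionaries stand in for the URL and Timeline tables.
urls = {   # url_id -> row
    "U1": {"next_in_list": None, "next_update": 4, "in_crawl": 1},
    "U2": {"next_in_list": None, "next_update": 4, "in_crawl": 1},
    "U3": {"next_in_list": None, "next_update": 4, "in_crawl": 1},
    "U4": {"next_in_list": None, "next_update": 4, "in_crawl": 1},
}
timeline = {}   # time_id (week number) -> front_url_id

def report_result(url_id, changed, current_week):
    """Handle one crawl report: adjust the interval and relink the URL."""
    row = urls[url_id]
    if changed:
        # recrawl sooner (the floor of 1 is an added guard, not in the mail)
        row["next_update"] = max(1, row["next_update"] // 2)
    else:
        row["next_update"] *= 2                        # recrawl later
    row["in_crawl"] = 0
    due_week = current_week + row["next_update"]
    row["next_in_list"] = timeline.get(due_week)       # prepend to that week's list
    timeline[due_week] = url_id

# Week 34: U1 and U3 changed, U2 and U4 did not.
for url_id, changed in [("U1", True), ("U2", False), ("U3", True), ("U4", False)]:
    report_result(url_id, changed, current_week=34)

print(timeline)   # {36: 'U3', 42: 'U4'}
# Same two lists as in the quoted tables below (U1/U3 due in week 36, U2/U4
# in week 42); only the order inside each week's list differs, which is
# harmless here.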
> URL table:
> url_id   next_in_list   next_update   in_crawl
> U1       U3             2             0
> U2       U4             8             0
> U3       (null)         2             0
> U4       (null)         8             0
>
> Timeline table:
> time_id   front_url_id
> 0         (null)
> 36        U1
> 42        U2
>
> When week 36 comes, a queue containing U1 and U3 will be crawled, and
> so on. I think you get my point.
>
> When we add a new URL, it is prepended to (inserted at the beginning
> of) the list under time_id = 0.
>
> Deleting URLs from the URL table should not be very difficult. We might
> add an extra flag to indicate DELETED, and any time a QUEUE of URLs to
> be crawled is created, another queue of URLs to be deleted can be built
> at the same time by looking at the deleted flag. This means that URLs
> will not be removed immediately, because they are part of the linked
> list. Of course, we could devise a doubly linked list, but this is
> absolutely unnecessary.
>
> At the beginning we will be overwhelmed with new URLs to be crawled,
> and if we don't think smart about it, we will either get no updates on
> previously crawled URLs and only crawl new ones, or the other way
> around. That is why all new URLs are put under time_id = 0, so that our
> algorithm may combine crawling new URLs and old URLs. It would be great
> if (for example) 60% of the crawling were spent on crawling (and
> finishing) old URLs, and 40% on newly found ones. In this way our
> database will grow and will stay up-to-date. But if crawling old URLs
> takes around 100% or more, then we will have to sacrifice the
> "up-to-dateness" for crawling new URLs.

This is good for the beginning, BUT, say, if we get to have a
one-million-URL queue for a certain week and we only get a small part of
that crawled, what happens?? It's a cumulative effect... we have to think
of a way to build a queue that doesn't expect a "crawling schedule" to be
met by the clients... like my random idea was...

[ozra] Well, the queue I am proposing is not schedule-aware; it's just
plain and simple. In any case, you are right, and I think the cumulative
effect will occur inevitably. I think I have a solution to this problem.

First, let's not forget what our goal is -- the most up-to-date, and
later the most comprehensive, search engine on the net (the second part
comes when we get enough Clients). The up-to-date part will be respected
from the very beginning.

For what I am proposing, let's keep in mind three things. First,
everything I said about the algorithm in my previous email is unchanged
(except for how the URLs are scheduled for crawling). Second, there will
be no list that belongs to a time_id in the past, even if it means that
when we are backed up we move the whole remaining list to the following
day. And third, the Time table in the real, workable system has a
"resolution" of one day instead of one week (I used a week as an example
only).

We will know (i.e. estimate) the capability of our Clients prior to or at
the beginning of each day. For this we will use some kind of prediction
function such as a moving average (I don't know which is right, I am not
a mathematician). We store that value in a URLS_PER_DAY variable at the
beginning of each day.
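A tiny sketch of what such a prediction could look like (a plain moving
average; the window size and the sample numbers are made up purely for
illustration):

from collections import deque

# Purely illustrative: estimate URLS_PER_DAY as a simple moving average of
# how many URLs the Clients actually completed over the last few days.
HISTORY_DAYS = 7
completed_per_day = deque(maxlen=HISTORY_DAYS)   # oldest day drops off automatically

def record_day(urls_completed):
    """Call once per day with the real count reported back by the Clients."""
    completed_per_day.append(urls_completed)

def urls_per_day():
    """Moving-average prediction of the next day's crawl capacity."""
    if not completed_per_day:
        return 0          # no history yet -- schedule conservatively
    return sum(completed_per_day) // len(completed_per_day)

# Example: three days of history.
for n in (40000, 55000, 52000):
    record_day(n)
print(urls_per_day())     # 49000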
Next, we take the average size of the per-day lists of URLs -- the total
number of URLs that do not belong to the new-URLs list (the one with
time_id = 0), divided by the number of per-day entries in the Time table:

                  Total_number_of_URLs_in_our_database
                      - Number_of_URLs_for_time_id_equals_zero
  AVG_LIST_SIZE = ---------------------------------------------
                      Number_of_entries_in_Time_table - 1

...and we store the value in AVG_LIST_SIZE.

So we have the two variables -- URLS_PER_DAY and AVG_LIST_SIZE. The goal
here is to keep these two values as equal as possible, because that way
our database will always be as up-to-date as our algorithm permits, and
it will grow only when the number of Clients increases (more correctly,
when the total ability of the Clients to crawl increases, which should be
proportional). We will know that has happened when the value of
URLS_PER_DAY increases. When it does, we will peek into our new (never
crawled) URLs and send them to the Clients to close the gap, i.e.,
increase AVG_LIST_SIZE by scheduling more URLs into the queue.

Now, here comes a problem -- what if URLS_PER_DAY decreases? Well,
perhaps we could randomly pick URLs to delete from our database (along
with the indexed data relating to them), or something else -- shrinking
our database seems kind of silly to me. Give me some other ideas. In
order to keep this from happening often, there must be some gap allowed
between AVG_LIST_SIZE and URLS_PER_DAY: AVG_LIST_SIZE should always be
kept a certain percentage lower, instead of making the two equal.

Also, no prediction will be exact. If a day's queue has not emptied
completely, the remainder is inserted at the beginning of the list for
the next day. If it empties early, we take some URLs from the following
day and crawl them earlier. The bad cumulative effect should not take
place here, because watching the AVG_LIST_SIZE / URLS_PER_DAY ratio lets
us notice and take care of any extra URLs that would cause it.
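A minimal sketch of that balancing rule, under the assumption that
URLS_PER_DAY comes from the daily prediction above and that the safety
margin is a tunable percentage (all names and numbers are illustrative):

# Purely illustrative sketch of the balancing rule. SAFETY_MARGIN is the
# "certain percentage lower" that AVG_LIST_SIZE is kept below URLS_PER_DAY.
SAFETY_MARGIN = 0.85

def avg_list_size(total_urls, never_crawled_urls, time_table_entries):
    """Average size of the per-day lists, excluding the time_id = 0 bucket."""
    scheduled_days = time_table_entries - 1        # minus the time_id = 0 entry
    if scheduled_days <= 0:
        return 0.0
    return (total_urls - never_crawled_urls) / scheduled_days

def new_urls_to_admit(urls_per_day, total_urls, never_crawled_urls,
                      time_table_entries):
    """How many never-crawled URLs can be fed to the Clients today."""
    current = avg_list_size(total_urls, never_crawled_urls, time_table_entries)
    spare = SAFETY_MARGIN * urls_per_day - current
    return max(0, int(spare))      # if negative, we are over capacity: admit none

# Example: capacity 50,000 URLs/day, 2.5 million URLs total, 400,000 of them
# never crawled, 61 Time-table entries (60 scheduled days + the time_id = 0 row).
print(new_urls_to_admit(50000, 2500000, 400000, 61))   # 7500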
> To cope with the problem, if it is week 50 and we are backed up to week
> 35, our algorithm must be configurable so that we may decrease the
> amount of new URLs being crawled.

Hmmm, it's still cumulative though... we would get to a point where the
site might just "stop" because of the crawling bottleneck... it's a
geometric progression...

[ozra] Said above.

> Plus, if we are so overwhelmed by newly arriving URLs, we may schedule
> them further in the future and not make them due immediately, or even
> reschedule sublists of URLs, which is a very efficient and simple
> operation.

Hmmm, we really need to run a simulation on all this; if we use a badly
designed algorithm we can screw up the whole thing...

[ozra] That's a good idea.

> I am just not sure about one thing -- there will be a lot of simple SQL
> statements needed to traverse the linked lists. I don't know how much
> of a burden that is. We will probably need hundreds of thousands to a
> million queries to build the URL queue (once grub takes off, of
> course).

That's not too much... it's very possible, it just depends on having good
enough hardware to run it fast... like good Fibre Channel storage with
800MHz memory and a few Xeon processors...

> Besides this, another serious problem with this system may be -- what
> if we lose part of the URL database (due to hardware failure, for
> example)? Then, chances are, most of the linked lists will be screwed
> up. Well, that is not so bad a problem after all (for our system,
> losing the URL list by itself is extremely terrible anyway).

Why linked lists?? And we have to have a good backup system, we can't
risk losing our database... I suggest optical storage systems... those
can easily get up to one terabyte...

> Give me your thoughts.

I like your ideas, we just have to think more about the URL
overcrowding... think of it like this - on average, each newly crawled
URL will have 5 to 10 new links, and that keeps going as a G.P. again...
while our users increase in an A.P. (hey, that sounds like the Malthusian
Theory for computing LOL)...

> Cheers,
>
> ozra.

Max