From: Igor S. <oz...@gr...> - 2000-09-27 01:27:55
--------------------------------------------------------------
Igor Stojanovski
Grub.Org Inc.
Chief Technical Officer
5100 N. Brookline #830
Oklahoma City, OK 73112
oz...@gr...
Voice: (405) 917-9894
http://www.grub.org
Fax: (405) 848-5477
--------------------------------------------------------------

-----Original Message-----
From: rda...@gi... [mailto:rda...@gi...] On Behalf Of Rodrigo Damazio
Sent: Monday, September 25, 2000 9:01 PM
To: Igor Stojanovski
Subject: Re: FW: Storing and scheduling URLs

Igor Stojanovski wrote:

> Here is what I was thinking about for scheduling. It is not too
> complicated to implement.
>
> Say these are the fields of the URL table:
>
> url_id, next_in_list, next_update, other fields... (they are
> unimportant in our case)
>
> ...where next_in_list is the url_id of the next URL in the list of URLs
> to be crawled in the same time interval. next_update is a calculated
> time interval which tells how much time must elapse before the URL is
> scheduled again, and which is recalculated by halving/doubling (or by a
> factor of 1.5) based on how often the page changed.
>
> For example, let's say the first day grub Clients run we receive four
> new URLs (never crawled before), and we stick them in the URL table:
>
> URL table:
> url_id   next_in_list   next_update   in_crawl
> U1       U2             8             0
> U2       U3             8             0
> U3       U4             8             0
> U4       (null)         8             0
>
> next_update tells us that once each of these URLs is crawled, if its
> contents changed (which is sure to happen for never-crawled pages), the
> interval will be halved to 4 for all of them.
>
> Then in addition we need a table to represent a timeline. Let's say
> that the smallest time interval is a week, and the week_id is computed
> as the number of weeks since the beginning of the year 2000:
>
> Timeline table:
> time_id   front_url_id
> 0         U1
>
> (Say that the current week is week 30.) When week 30 comes, a queue of
> URLs is assembled via the linked lists of URLs that the Timeline table
> points to, using front_url_id. Here we have time_id = 0, where zero is
> a special value meaning new URLs never crawled. It is now week 30, but
> since we don't have an entry for time_id = 30, we go to time_id = 0 to
> get URLs never crawled before and build a crawling queue. It points to
> U1, and using next_in_list we can derive the queue:
>
> QUEUE: U1, U2, U3, U4.
>
> We also set the in_crawl flag to 1. This may be useful for cleaning up
> URLs that were never reported back by the Clients.

Hmmm, so far sounds good. I just don't think we need to use the
next_in_list thing - we can just do a query like "give me all URLs with
time_id = 0", something like that...

> When the Clients report back the results, in_crawl is set back to zero,
> and next_update is halved:

Note: it's only halved if the content has changed!!

[ozra] That's what I am saying, too.

> URL table:
> url_id   next_in_list   next_update   in_crawl
> U1       U2             4             0
> U2       U3             4             0
> U3       U4             4             0
> U4       (null)         4             0
>
> We create an entry for time_id = 34 (current week plus next_update --
> 30 + 4), and now we have:
>
> Timeline table:
> time_id   front_url_id
> 0         (null)   -- we assume we never got any new URLs
> 34        U1
>
> ...34 is the week these URLs should be recrawled.
>
> Now week 34 comes along, and we derive a queue again, which would look
> the same (assuming that we never got any new URLs in the meantime):
>
> QUEUE: U1, U2, U3, U4.
>
> Say URLs U1 and U3 have been updated, and U2 and U4 have not.
>
> Then we would double/halve the next_update fields appropriately, and
> now there would be two linked lists (U1->U3 and U2->U4). Every URL in
> the table belongs to a linked list.
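A minimal sketch of this reschedule step (plain Python dictionaries
standing in for the two tables; the names are illustrative, not an actual
schema): halve next_update when the page changed, double it when it did
not, and link the URL into the list for the week it is next due. The
quoted tables just below show the state this produces:

# Illustrative sketch of the reschedule step described above -- plain
# Python dictionaries stand in for the URL and Timeline tables.
urls = {   # url_id -> row
    "U1": {"next_in_list": None, "next_update": 4, "in_crawl": 1},
    "U2": {"next_in_list": None, "next_update": 4, "in_crawl": 1},
    "U3": {"next_in_list": None, "next_update": 4, "in_crawl": 1},
    "U4": {"next_in_list": None, "next_update": 4, "in_crawl": 1},
}
timeline = {}   # time_id (week number) -> front_url_id

def report_result(url_id, changed, current_week):
    """Handle one crawl report: adjust the interval and relink the URL."""
    row = urls[url_id]
    if changed:
        # recrawl sooner (the floor of 1 is an added guard, not in the mail)
        row["next_update"] = max(1, row["next_update"] // 2)
    else:
        row["next_update"] *= 2                        # recrawl later
    row["in_crawl"] = 0
    due_week = current_week + row["next_update"]
    row["next_in_list"] = timeline.get(due_week)       # prepend to that week's list
    timeline[due_week] = url_id

# Week 34: U1 and U3 changed, U2 and U4 did not.
for url_id, changed in [("U1", True), ("U2", False), ("U3", True), ("U4", False)]:
    report_result(url_id, changed, current_week=34)

print(timeline)   # {36: 'U3', 42: 'U4'}
# Same two lists as in the quoted tables below (U1/U3 due in week 36, U2/U4
# in week 42); only the order inside each week's list differs, which is
# harmless here.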
> URL table:
> url_id   next_in_list   next_update   in_crawl
> U1       U3             2             0
> U2       U4             8             0
> U3       (null)         2             0
> U4       (null)         8             0
>
> Timeline table:
> time_id   front_url_id
> 0         (null)
> 36        U1
> 42        U2
>
> When week 36 comes, a queue containing U1 and U3 will be crawled, and
> so on. I think you get my point.
>
> When we add a new URL, it is prepended to (inserted at the beginning
> of) the list under time_id = 0.
>
> Deleting URLs from the URL table should not be very difficult. We might
> add an extra flag to indicate DELETED, and any time a QUEUE of URLs to
> be crawled is created, another queue of URLs to be deleted can be built
> at the same time by looking at the deleted flag. This means that URLs
> will not be removed immediately, because they are part of the linked
> list. Of course, we could devise a doubly linked list, but this is
> absolutely unnecessary.
>
> At the beginning we will be overwhelmed with new URLs to be crawled,
> and if we don't think smart about it, we will either get no updates on
> previously crawled URLs and only crawl new ones, or the other way
> around. That is why all new URLs are put under time_id = 0, so that our
> algorithm may combine crawling new URLs and old URLs. It would be great
> if (for example) 60% of the crawling were spent on crawling (and
> finishing) old URLs, and 40% on newly found ones. In this way our
> database will grow and will stay up-to-date. But if crawling old URLs
> takes around 100% or more, then we will have to sacrifice the
> "up-to-dateness" for crawling new URLs.

This is good for the beginning, BUT, say, if we get to have a
one-million-URL queue for a certain week and we only get a small part of
that crawled, what happens?? It's a cumulative effect... we have to think
of a way to build a queue that doesn't expect a "crawling schedule" to be
met by the clients... like my random idea was...

[ozra] Well, the queue I am proposing is not schedule-aware; it's just
plain and simple. In any case, you are right, and I think the cumulative
effect will occur inevitably. I think I have a solution to this problem.

First, let's not forget what our goal is -- the most up-to-date, and
later the most comprehensive, search engine on the net (the second part
comes when we get enough Clients). The up-to-date part will be respected
from the very beginning.

For what I am proposing, let's keep in mind three things. First,
everything I said about the algorithm in my previous email is unchanged
(except for how the URLs are scheduled for crawling). Second, there will
be no list that belongs to a time_id in the past, even if it means that
when we are backed up we move the whole remaining list to the following
day. And third, the Time table in the real, workable system has a
"resolution" of one day instead of one week (I used a week as an example
only).

We will know (i.e. estimate) the capability of our Clients prior to or at
the beginning of each day. For this we will use some kind of prediction
function such as a moving average (I don't know which is right, I am not
a mathematician). We store that value in a URLS_PER_DAY variable at the
beginning of each day.
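A tiny sketch of what such a prediction could look like (a plain moving
average; the window size and the sample numbers are made up purely for
illustration):

from collections import deque

# Purely illustrative: estimate URLS_PER_DAY as a simple moving average of
# how many URLs the Clients actually completed over the last few days.
HISTORY_DAYS = 7
completed_per_day = deque(maxlen=HISTORY_DAYS)   # oldest day drops off automatically

def record_day(urls_completed):
    """Call once per day with the real count reported back by the Clients."""
    completed_per_day.append(urls_completed)

def urls_per_day():
    """Moving-average prediction of the next day's crawl capacity."""
    if not completed_per_day:
        return 0          # no history yet -- schedule conservatively
    return sum(completed_per_day) // len(completed_per_day)

# Example: three days of history.
for n in (40000, 55000, 52000):
    record_day(n)
print(urls_per_day())     # 49000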
Next, we take the average size of the per-day lists of URLs -- the total
number of URLs that do not belong to the new-URLs list (the one with
time_id = 0), divided by the number of per-day entries in the Time table:

                  Total_number_of_URLs_in_our_database
                      - Number_of_URLs_for_time_id_equals_zero
  AVG_LIST_SIZE = ---------------------------------------------
                      Number_of_entries_in_Time_table - 1

...and we store the value in AVG_LIST_SIZE.

So we have the two variables -- URLS_PER_DAY and AVG_LIST_SIZE. The goal
here is to keep these two values as equal as possible, because that way
our database will always be as up-to-date as our algorithm permits, and
it will grow only when the number of Clients increases (more correctly,
when the total ability of the Clients to crawl increases, which should be
proportional). We will know that has happened when the value of
URLS_PER_DAY increases. When it does, we will peek into our new (never
crawled) URLs and send them to the Clients to close the gap, i.e.,
increase AVG_LIST_SIZE by scheduling more URLs into the queue.

Now, here comes a problem -- what if URLS_PER_DAY decreases? Well,
perhaps we could randomly pick URLs to delete from our database (along
with the indexed data relating to them), or something else -- shrinking
our database seems kind of silly to me. Give me some other ideas. In
order to keep this from happening often, there must be some gap allowed
between AVG_LIST_SIZE and URLS_PER_DAY: AVG_LIST_SIZE should always be
kept a certain percentage lower, instead of making the two equal.

Also, no prediction will be exact. If a day's queue has not emptied
completely, the remainder is inserted at the beginning of the list for
the next day. If it empties early, we take some URLs from the following
day and crawl them earlier. The bad cumulative effect should not take
place here, because watching the AVG_LIST_SIZE / URLS_PER_DAY ratio lets
us notice and take care of any extra URLs that would cause it.
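A minimal sketch of that balancing rule, under the assumption that
URLS_PER_DAY comes from the daily prediction above and that the safety
margin is a tunable percentage (all names and numbers are illustrative):

# Purely illustrative sketch of the balancing rule. SAFETY_MARGIN is the
# "certain percentage lower" that AVG_LIST_SIZE is kept below URLS_PER_DAY.
SAFETY_MARGIN = 0.85

def avg_list_size(total_urls, never_crawled_urls, time_table_entries):
    """Average size of the per-day lists, excluding the time_id = 0 bucket."""
    scheduled_days = time_table_entries - 1        # minus the time_id = 0 entry
    if scheduled_days <= 0:
        return 0.0
    return (total_urls - never_crawled_urls) / scheduled_days

def new_urls_to_admit(urls_per_day, total_urls, never_crawled_urls,
                      time_table_entries):
    """How many never-crawled URLs can be fed to the Clients today."""
    current = avg_list_size(total_urls, never_crawled_urls, time_table_entries)
    spare = SAFETY_MARGIN * urls_per_day - current
    return max(0, int(spare))      # if negative, we are over capacity: admit none

# Example: capacity 50,000 URLs/day, 2.5 million URLs total, 400,000 of them
# never crawled, 61 Time-table entries (60 scheduled days + the time_id = 0 row).
print(new_urls_to_admit(50000, 2500000, 400000, 61))   # 7500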
> To cope with the problem, if it is week 50 and we are backed up to week
> 35, our algorithm must be configurable so that we may decrease the
> amount of new URLs being crawled.

Hmmm, it's still cumulative though... we would get to a point where the
site might just "stop" because of the crawling bottleneck... it's a
geometric progression...

[ozra] Said above.

> Plus, if we are so overwhelmed by newly arriving URLs, we may schedule
> them further in the future and not make them due immediately, or even
> reschedule sublists of URLs, which is a very efficient and simple
> operation.

Hmmm, we really need to run a simulation on all this; if we use a badly
designed algorithm we can screw up the whole thing...

[ozra] That's a good idea.

> I am just not sure about one thing -- there will be a lot of simple SQL
> statements needed to traverse the linked lists. I don't know how much
> of a burden that is. We will probably need hundreds of thousands to a
> million queries to build the URL queue (once grub takes off, of
> course).

That's not too much... it's very possible, it just depends on having good
enough hardware to run it fast... like good Fibre Channel storage with
800MHz memory and a few Xeon processors...

> Besides this, another serious problem with this system may be -- what
> if we lose part of the URL database (due to hardware failure, for
> example)? Then, chances are, most of the linked lists will be screwed
> up. Well, that is not so bad a problem after all (for our system,
> losing the URL list by itself is extremely terrible anyway).

Why linked lists?? And we have to have a good backup system, we can't
risk losing our database... I suggest optical storage systems... those
can easily get up to one terabyte...

> Give me your thoughts.

I like your ideas, we just have to think more about the URL
overcrowding... think of it like this - on average, each newly crawled
URL will have 5 to 10 new links, and that keeps going as a G.P. again...
while our users increase in an A.P. (hey, that sounds like the Malthusian
Theory for computing LOL)...

> Cheers,
>
> ozra.

Max