From: Igor S. <oz...@gr...> - 2000-09-25 19:11:16
Here is what I was thinking about scheduling. It is not too complicated to implement. Say these are the fields of the URL table: url_id, next_in_list, next_update, other fields... (they are unimportant in our case) ...where next_in_list is the url_id of the next URL in the list of URLs to be crawled in the same time interval. next_update is a calculated time interval which tells how much time must elapse before the URL is scheduled again, and which is recalculated by halving/doubling (or by a factor of 1.5) based on how often the page changed.

For example, let's say the first day grub Clients run we receive four new URLs (never crawled before), and we stick them in the URL table:

URL table:
url_id  next_in_list  next_update  in_crawl
U1      U2            8            0
U2      U3            8            0
U3      U4            8            0
U4      (null)        8            0

Once one of these URLs is crawled, if its contents changed (which is sure to happen for never-crawled pages), next_update will be halved to 4 for all of them.

Then in addition we need a table to represent a time line. Let's say that the smallest time interval is a week, and the time_id is computed as the number of weeks since the beginning of year 2000:

Timeline table:
time_id  front_url_id
0        U1

(Say that this week is the 30th.) When week 30 comes, a queue of URLs is assembled via the linked lists of URLs that the timeline table points to, using front_url_id. Here we have only time_id = 0, where zero is a special value meaning new URLs never crawled. Now it is week 30, but since we don't have an entry for time_id = 30, we go to time_id = 0 to get URLs never crawled before and build a crawling queue. It points to U1, and using next_in_list we can derive the queue:

QUEUE: U1, U2, U3, U4.

We also set the in_crawl flag to 1. This may be useful for cleaning up URLs that were never reported back from Clients. When Clients report back the results, in_crawl is set back to zero, and next_update is halved:

URL table:
url_id  next_in_list  next_update  in_crawl
U1      U2            4            0
U2      U3            4            0
U3      U4            4            0
U4      (null)        4            0

We create an entry for time_id = 34 (current week plus next_update -- 30 + 4) and now we have:

Timeline table:
time_id  front_url_id
0        (null)   -- we assume we never got any new URLs
34       U1

...34 is the week these URLs should be recrawled. Now week 34 comes along, and we derive a queue again, which would look the same (assuming that we never got any new URLs in the meantime):

QUEUE: U1, U2, U3, U4.

Say URLs U1 and U3 have been updated, and U2 and U4 have not. Then we would double/halve the next_update fields appropriately, and now there would be two linked lists (U1->U3 and U2->U4). Every URL in the table belongs to a linked list.

URL table:
url_id  next_in_list  next_update  in_crawl
U1      U3            2            0
U2      U4            8            0
U3      (null)        2            0
U4      (null)        8            0

Timeline table:
time_id  front_url_id
0        (null)
36       U1
42       U2

When week 36 comes, a queue containing U1 and U3 will be crawled, and so on. I think you get my point.

When we add a new URL, it is prepended to (inserted at the beginning of) the list under time_id = 0. Deleting URLs from the URL list should not be very difficult. We might add an extra flag to indicate DELETED, and any time a QUEUE of URLs to be crawled is created, another queue of URLs to be deleted can be built at the same time by looking at the deleted flag. This means that URLs will not be removed immediately, because they are part of the linked list. Of course, we could use a doubly linked list, but that is absolutely unnecessary.
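To make the mechanics concrete, here is a minimal sketch (Python with SQLite purely as a stand-in; the table and column names mirror the example above, but none of this is the real grub schema) of the two tables and the queue-assembly walk:

import sqlite3

# Placeholder schema -- SQLite and all identifiers here are only for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# URL table: each row is a node in a singly linked list via next_in_list.
cur.execute("""
    CREATE TABLE url_table (
        url_id       TEXT PRIMARY KEY,
        next_in_list TEXT,      -- url_id of the next URL due in the same interval
        next_update  INTEGER,   -- weeks to wait before the next recrawl
        in_crawl     INTEGER DEFAULT 0
    )""")

# Timeline table: one row per week, pointing at the front of that week's list.
# time_id = 0 is the special "never crawled" list.
cur.execute("""
    CREATE TABLE timeline (
        time_id      INTEGER PRIMARY KEY,   -- weeks since the beginning of 2000
        front_url_id TEXT
    )""")

# The four never-crawled URLs from the example, linked U1 -> U2 -> U3 -> U4.
cur.executemany("INSERT INTO url_table VALUES (?, ?, ?, 0)",
                [("U1", "U2", 8), ("U2", "U3", 8), ("U3", "U4", 8), ("U4", None, 8)])
cur.execute("INSERT INTO timeline VALUES (0, 'U1')")

def build_queue(week):
    """Walk the linked list for the given week (falling back to the
    never-crawled list at time_id = 0) and mark every URL as in_crawl."""
    row = cur.execute("SELECT front_url_id FROM timeline WHERE time_id = ?",
                      (week,)).fetchone()
    if row is None or row[0] is None:
        row = cur.execute(
            "SELECT front_url_id FROM timeline WHERE time_id = 0").fetchone()
    url_id = row[0] if row else None
    queue = []
    while url_id is not None:
        queue.append(url_id)
        cur.execute("UPDATE url_table SET in_crawl = 1 WHERE url_id = ?", (url_id,))
        url_id = cur.execute("SELECT next_in_list FROM url_table WHERE url_id = ?",
                             (url_id,)).fetchone()[0]
    return queue

print(build_queue(30))   # ['U1', 'U2', 'U3', 'U4']

Assembling the queue this way costs one small SELECT per URL in the week's list, which is exactly the flood of simple queries I worry about further down.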
At the beginning we will be overwhelmed with new URLs to be crawled, and if we don't think smart about it, we will either never update previously crawled URLs and only crawl new ones, or the other way around. That is why all new URLs are put under time_id = 0, so that our algorithm may combine crawling new URLs and old URLs. It would be great if (for example) 60% of the crawling were spent on crawling (and finishing) old URLs and 40% on newly found ones. In this way our database will both grow and stay up-to-date. But if crawling old URLs takes around 100% of our capacity or more, then we will have to sacrifice "up-to-dateness" in order to crawl new URLs. To cope with that problem -- say it is week 50 and we are backed up to week 35 -- our algorithm must be configurable so that we may decrease the number of new URLs being crawled. Plus, if we are really overwhelmed by incoming new URLs, we may schedule them in the future instead of making them due immediately, or even reschedule whole sublists of URLs, which is a very efficient and simple operation.

I am just not sure about one thing -- there will be a lot of simple SQL statements needed to traverse the linked lists, and I don't know how much of a burden that is. We will probably need hundreds of thousands to a million queries to build the URL queue (once grub takes off, of course). Besides this, another serious problem with this system may be: what if we lose part of the URL database (due to hardware failure, for example)? Then, chances are, most of the linked lists will be screwed up. Well, that's not so bad a problem after all (for our system, losing the URL list by itself is extremely terrible anyway).

Give me your thoughts.

Cheers,

ozra.

--------------------------------------------------------------
Igor Stojanovski                  Grub.Org Inc.
Chief Technical Officer           5100 N. Brookline #830
                                  Oklahoma City, OK 73112
oz...@gr...                       Voice: (405) 917-9894
http://www.grub.org               Fax: (405) 848-5477
--------------------------------------------------------------

-----Original Message-----
From: rda...@ca... [mailto:rda...@ca...] On Behalf Of Rodrigo Damazio
Sent: Friday, September 22, 2000 4:27 PM
To: Igor Stojanovski
Subject: Re: Storing and scheduling URLs

Igor Stojanovski wrote:

> To Rodrigo:
>
> We need to add a table (or tables) to our database that will store the URLs and some statistics along with them. We need a mechanism which will use those statistics for dispatching/scheduling URLs to the Clients to crawl. We have already talked about this issue.
>
> When Clients connect to the Server, they make a request. The response is a list of URLs for crawling. This module should offer an interface that will return such a list and record the action appropriately in the DB.
>
> I would like you to work on this module.
>
> You should devise an algorithm that will figure out how to schedule the URLs. Use some of the ideas we have already presented via our emails. I pasted an excerpt from an older email at the bottom of the msg.
>
> You must take into account several things in your model, though:
> 1) At the beginning, we will not have many Clients at our disposal, and our database will be overwhelmed with new URLs to be crawled. You must design this algorithm so that even when we have millions of URLs that were never crawled, our Clients will get back to those old pages, so that our database will stay as up-to-date as possible. Remember, our goal is to have the most up-to-date search engine on the net.
> 2) In the future, we will provide means to measure each Client's crawling performance (pages crawled/day), so that we can assign an appropriate number of URLs to each one of them. Don't worry about this one for now.
> 3) Also, we must think about security. We may need to introduce a certain amount of redundancy in order to check whether we get good data from our Clients. For example, we may have 10% redundancy in crawling. If the data from two Clients does not match, a third Client may be assigned to crawl the page in question, to figure out which Client "cheated". Of course, the page may have changed in the short time between the two Clients' crawls, and we may wrongfully conclude that a Client is rogue. Anyway, I say, don't worry about security for now. Let's leave this for a later stage.
> 4) The URL scheduling algorithm must be highly configurable and modular enough so that we may add new capabilities to it easily.
> 5) Many other things I haven't accounted for. Like, for example, taking into account the proximity of Clients to sites when dispatching the URLs...
>
> From an old message:
>
> About dispatching/scheduling URLs to Clients:
>
> [ozra] Dispatching (a term I borrowed from Robert) is a mechanism for scheduling URLs to Clients for crawling. Here is my suggestion on how to schedule the URLs. Every page that is crawled for the first time by our system is automatically scheduled to be crawled again in (say) two weeks. If in two weeks a Client crawls the page and finds that the page has changed, the next crawling time will be set to one week, or half the previous time; if the next week the page changed again, the time will be halved again to 3 days, and so on. If, on the other hand, a page didn't change, we might perhaps double the next scheduled time from two weeks to a month, etc.
>
> [Rodrigo] Hmmm, sounds good to me... just change doubling and halving to multiplying and dividing by 1.5, I guess that's a more proper value... also, we have to consider the situation where a client starts crawling a HUGE site (Geocities, for instance)... of course no one client will crawl all of it, so we have to make it schedule the parts it doesn't... and develop a good scheme so that no two clients will be crawling the same thing, and no pages will be left uncrawled...
>
> [ozra] Let's not forget that for each URL to be crawled, the Client needs to get "permission" from the Server. No exceptions.
>
> ---end msg---
>
> Give me your thoughts on this.
>
> Also, which one of your email addresses should I use now?

Use this one only... the old one will bounce messages...

About the algorithm, I agree, and there's one thing I think we should add - make updates (or perhaps even rating) of the most visited pages more frequent... so if a page is only visited once a year by someone (through our search, of course), it won't be updated every day or anything... To measure how often a page is visited, just add a redirect script instead of putting direct links to the pages...
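(A tiny sketch of that interval rule, assuming the 1.5 factor from the excerpt above plus an optional popularity weight fed by the redirect-script counts -- the function name, the threshold and the clamp are all made up for illustration, nothing we've agreed on:)

# Purely illustrative: recompute the recrawl interval when a Client reports back.
# FACTOR, the visit threshold and the clamp are assumptions, not agreed values.
FACTOR = 1.5
MIN_WEEKS, MAX_WEEKS = 1, 52

def next_interval(current_weeks, page_changed, weekly_visits=0):
    """Divide by 1.5 if the page changed, multiply by 1.5 if it did not,
    then pull popular pages (many redirect-script hits) toward shorter intervals."""
    weeks = current_weeks / FACTOR if page_changed else current_weeks * FACTOR
    if weekly_visits > 100:          # arbitrary threshold for a "popular" page
        weeks /= 2
    return max(MIN_WEEKS, min(MAX_WEEKS, round(weeks)))

# A changed, popular page scheduled at 8 weeks comes back in about 3 weeks:
print(next_interval(8, page_changed=True, weekly_visits=250))   # 3

The threshold and the clamp are only there to keep the example bounded; in practice they'd be configuration knobs.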
Anyway, you're asking for a high-complexity algorithm... I'll try to do it... Anyway, we have to organize our ideas on it better... We could always start with a random-picking algorithm, something like this: "take the list of all new URLs to be crawled, add it to the list of websites not recently updated, RANDOMLY mix it all, and start sending it to the clients"... Also, we gotta use cyclic queue reading, in a way that an entry is only removed when a client actually returns the crawled page instead of right when it's sent to a client, yet it won't be sent again to another client for a while (until the queue end has been reached, which means all URLs have been sent to the clients, and then it repeats)... BTW, what will we do if we have an empty queue?? Start updating everything again?? It's not likely to happen in the future, but it'll probably happen in the beginning...

Oh, actually, one little addition to the process above - the URLs to be updated will be interpolated with the new ones, so do a sort ONLY on the old ones so that the most recently updated will be the last in the sequence, that is, rearrange the URLs without changing the positions they got from the random mixing...

Max
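(For what it's worth, here is a rough sketch of the random-mix / cyclic-queue idea above, with made-up URL names and a purely in-memory queue -- an illustration, not a design:)

import random
from collections import deque

# Illustrative sketch of the random-mix / cyclic-queue idea; URL names are invented.
new_urls   = ["N1", "N2", "N3"]      # never crawled
stale_urls = ["U2", "U4"]            # old URLs due for an update
mixed = new_urls + stale_urls
random.shuffle(mixed)                # random mix of new and old

queue = deque(mixed)
outstanding = set()                  # handed to a client, not yet reported back

def next_for_client():
    """Hand out the front URL, then rotate it to the back instead of dropping it,
    so it gets retried later if the client never reports back."""
    url = queue[0]
    queue.rotate(-1)
    outstanding.add(url)
    return url

def report_back(url):
    """A client returned the crawled page: only now does the URL leave the queue."""
    outstanding.discard(url)
    if url in queue:
        queue.remove(url)

u = next_for_client()
report_back(u)
print(len(queue))   # one URL fewer than we started with

The interpolation trick (sorting only the old URLs so the most recently updated come last, without disturbing the random positions of the rest) is left out here to keep the sketch short.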