From: J. K. <ja...@ka...> - 2004-04-24 04:37:13
Hello. Well, I couldn't help but notice that it's been quiet here lately. I haven't seen anything posted here for a while, and I haven't seen anything committed to the repository either. I'd just like to know what's been happening, and to be kept up to date. I personally haven't been as active with this lately, as I was on a short vacation last week and then had another nasty cold. But I'm still looking for ways to help, even though I haven't been able to code (my PC needs a RAM upgrade before I can run and test the code the way I want). Eric, I got your reply to my comments on the Sprawler Map, and I plan on responding to that in a later reply (which I expect to send in the next few days).

So, while I wonder what the four of you are doing, let me just dump some more information from notes I have scribbled: some more calculations and thoughts on them. It's just a disorganized list of scratch-pad calculations which I (or we) can make more formal later.

Just as a quick reminder, here are the variables I'm using:

M: number of master machines
N: number of indexer machines
p: pages indexed per indexer (per cycle)
s: (average) size of indexed data per web page

I should also note that we have a value I've been calling max(M/N), which is the maximum number of indexers we allow per master. Maybe I should represent this value with a single letter, as I do with the others, but I'll leave it as it is for now.

The values I've been using for these variables are as follows:

max(M/N) = 100
s = 100 kB = 0.1 MB
p = 1000 pages

I'm not saying these are the numbers we will go with, or that they are the best; they're just what I've been working with so far, based on an initial estimated guess of what might be reasonable. And remember, it's not so much the numbers that are important, it's the letters. I've come up with equations such as the one for the average amount of data sent to a master in an indexing cycle:

max(M/N) * p * s = 100 * 1000 * 0.1 MB = 10 GB

These equations are what reveal the framework we're working within, and we can then plug in numerical values to come up with optimal results. So in this situation, in the worst case where all the data is received at once, 10 GB is sent to each master, which will need to copy it all over to the NFS disks before the indexers bring in any more data. I figure I already covered the bandwidth issue, so I'm bringing up disk space now, although the more I look at this, the more it appears that bandwidth will be the bottleneck. So in this case each master should have 10 GB available for receiving data from the indexers, and that's before taking into account what I'm going to bring up next: expansion and redundancy. Our engine will expand to take in more pages, and we'll need redundancy in the design in case masters go down.

First, let's talk about expansion. Hardware expansion is relatively slow, I think; it's easier to increase max(M/N) than it is to get more hardware and money. We can fairly easily adjust p and max(M/N). s is not something that will change as much; it will gradually increase as we look for more ways of indexing. What we need is a redundancy factor, where we decide how much extra load each machine should be able to handle. Suppose we set that factor to 2; in other words, each machine can handle twice its worst-case load. This is something we'll need to maintain, so that when one machine goes down, another can pick up its share.
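In case it helps, here's the per-master math above typed into a little Python scratch script, so the numbers are easy to tweak. The variable names are just my own labels for the letters, and the redundancy factor of 2 is only the example value I mentioned, nothing decided:

# Rough per-master capacity sketch using the working numbers above.
# All names here are just my own labels for the variables in this email.

MB = 1.0
GB = 1000 * MB

max_indexers_per_master = 100     # max(M/N)
pages_per_indexer = 1000          # p
indexed_size_per_page = 0.1 * MB  # s = 100 kB

# Worst case: every indexer assigned to a master reports in during one cycle.
data_per_master = max_indexers_per_master * pages_per_indexer * indexed_size_per_page
print("data per master per cycle: %.1f GB" % (data_per_master / GB))   # 10.0 GB

# Redundancy factor 2: each master keeps enough headroom to absorb the
# worst-case load of a failed peer on top of its own.
redundancy_factor = 2
print("disk headroom per master:  %.1f GB" % (redundancy_factor * data_per_master / GB))  # 20.0 GB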
But designing large hardware systems isn't something I'm very familiar with, so please correct me if I'm dead wrong. And perhaps we need an expansion factor as well. Should machines be able to handle double their worst-case load, to make sure they can handle growth in the size of the index? As we grow, we'll need to be able to handle it. Bandwidth per machine may be what's most static; that's my concern at this time. Or we may need more masters to compensate for a lack of bandwidth. Can we get ourselves some T3s?

Now for some more calculations.

Estimation of Client Indexing Time
----------------------------------

A client (or indexer) does the following four things over and over again:

1. Downloads URLs (and perhaps information corresponding to them)
2. Downloads the HTML from each URL
3. Indexes the data for each URL's HTML
4. Sends the indexed data back to the master

So, based on my original numbers, here's my calculation of how much time a client spends on each of these, to get an idea of what the indexing cycle length is. (I've also put these numbers into a little script further down.)

1) Downloading 1000 URLs and their corresponding data would likely be only about 100 kB (assuming 100 bytes of data per URL; we could also include what links to the page, or anything else that would help the indexer know more about the URL, more about that idea later). Even then this might only grow into a 1 MB download, which would only take a significant amount of time over dialup. I'm assuming we won't have many dialup indexers, so I won't even factor this phase in.

2) Assume 25 kB of HTML per web page (let w = 25 kB).
   Data downloaded = w*p = 0.025 MB * 1000 = 25 MB
   Assuming 1.5 MB/s of bandwidth, this phase takes 25 MB / 1.5 MB/s = 16.7 s.

3) Assume the indexer spends 10 s per page (let i = 10 s/page).
   Time to complete this phase = p*i = 1000 pages * 10 s/page = 10 000 s = 166.7 min = 2 h 46 min 40 s

4) Worst-case time to send the indexed data back to the master: I already estimated this to be 1 h 51 min.

Total indexing time (worst case) = roughly 4 hours 38 minutes

And much of that time will be spent with a connection open, sending indexed data back or getting HTML from web pages. One of the reasons I decided we should have each indexer index more pages per cycle was so that it would have more analysis to do and spend less time with connections open. Perhaps I'm a little too concerned with making this more SETI@home-like, even though it can't be, by nature, given the way it constantly goes out and grabs data, both from the web and from the masters.
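Here's that script: just the phase numbers above, so they're easy to play with. The variable names are mine, and for phase 4 I've assumed the worst case is the full max(M/N)*p*s = 10 GB going through a master's 1.5 MB/s link at once, which is where my 1 h 51 min figure came from:

# Scratch-pad estimate of one worst-case indexing cycle.
# Every name and number here is just my working assumption, nothing final.

pages_per_cycle = 1000            # p
html_per_page_mb = 0.025          # w = 25 kB
indexed_size_per_page_mb = 0.1    # s = 100 kB
bandwidth_mb_per_s = 1.5          # assumed bandwidth (MB/s)
index_time_per_page_s = 10        # i
max_indexers_per_master = 100     # max(M/N)

url_download_s = 0  # phase 1: a ~1 MB download at most, so I'm ignoring it
html_download_s = pages_per_cycle * html_per_page_mb / bandwidth_mb_per_s   # phase 2
indexing_s = pages_per_cycle * index_time_per_page_s                        # phase 3
# Phase 4, worst case: all of a master's indexers send at once, so the full
# max(M/N)*p*s = 10 GB has to squeeze through the master's link.
upload_s = max_indexers_per_master * pages_per_cycle * indexed_size_per_page_mb / bandwidth_mb_per_s

total_s = url_download_s + html_download_s + indexing_s + upload_s
print("phase 2: %7.1f s" % html_download_s)       # ~16.7 s
print("phase 3: %7.1f min" % (indexing_s / 60))   # ~166.7 min
print("phase 4: %7.1f min" % (upload_s / 60))     # ~111.1 min
print("total:   %7.2f h" % (total_s / 3600))      # ~4.6 h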
Now it's time to consider two big questions:

1) How many pages will we index at a given time? Call this value P.
2) How much time do we want to spend (re)indexing the web? Let's call this goal for a time frame T.

T, for now, is just something we set ourselves. But what it may all come down to is: which of the above questions is more important? For now, let's just say that our goal is to reindex our target number of web pages once a week. Then:

T = 168 hours = 10 080 min = 604 800 s

The worst-case cycle length, rounded up, is about 5 hours (based on my numbers above), so 33 indexing cycles could be done in this time. But the number of cycles may not be what's important here.

Number of pages indexed = 33*p*N = 33 * 1000 * 10 = 330 000 pages (taking N = 10 indexers as an example)

But what if our goal is to index 1 000 000 (10^6) pages? The most obvious (and perhaps worst) answer: do another two weeks of indexing. What we'd prefer: having more indexers and correspondingly more masters. We can just triple the M value. The max(M/N) ratio remains the same, though, so we also triple the number of indexers. So if we have 100 masters, triple that to 300, and then we need 3000 indexers. And they won't just arrive; much will need to be done on our own end to control project growth.

And what about adjusting p? Well, the number of cycles and the pages indexed per cycle are inversely proportional, and the time it all takes gets thrown off as well. So this just emphasizes the importance of completing indexing cycles very quickly, while keeping the clients in mind: we don't want them consuming excessive bandwidth, taking up excessive HDD space, and so on. So we may need plenty of tweaks for speeding things up, lots of good hardware, and just the right amount of responsibility-sharing between clients and masters. More on this to follow.

Thanks,
J.K.
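P.S. In case anyone wants to play with the weekly-throughput numbers too, here's the same scratch math as a tiny script. The names are my own, and N = 10 indexers is just the example value from above:

# How many pages do we reindex in a week, and how much do we have to
# scale up to hit a bigger target? Purely the scratch numbers from above.

reindex_window_h = 168     # T = one week
cycle_length_h = 5         # rounded-up worst-case cycle length
pages_per_indexer = 1000   # p
indexers = 10              # N, example value

cycles = reindex_window_h // cycle_length_h              # 33
pages_per_week = cycles * pages_per_indexer * indexers   # 330 000
print("%d cycles -> %d pages per week" % (cycles, pages_per_week))

target_pages = 10**6
scale = float(target_pages) / pages_per_week
# Scale N (and M along with it, to keep max(M/N) fixed) by this factor:
print("need about %.1fx the indexers and masters for %d pages/week" % (scale, target_pages))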