|
From: J. K. <ja...@ka...> - 2004-03-31 06:04:01
|
Well, that map is just what I figured we needed. It does look quite good,
and after going over it, there are quite a few things I'd like to say,
mostly concerning the physical aspects of it.
First, about the logical aspect, I have nothing against what's currently
there. I can't think of anything wrong with it really, maybe a few subtle
changes could be done about in each of the modules there (ie. controller
vs. processor, maybe controller could do some of what procesor does, but
need more detail.) And since I need more details, I'll just focus mostly
on the physical design.
So here's what I have to say.
Physical Design Issues
----------------------
Ratio of Indexer Machines to Master Machines: It appears to be N:1, in that
each master in the layer has N indexers assigned to it. Is this assignment
static or dynamic? And how large or small should N be? Perhaps it should all
be dynamically assigned, as we would want the load to be distributed as
much as possible. And a maximum value of N needs to be set up for each
master, assuming hardware differences. (Or are all indexers created
equal?) How many connections should each one handle?
Ratios In General: For each level, it says there are N machines. As an
individual who is something of a mathematics geek, this seems to imply that
all levels have equal numbers of machines. It may seem that I'm nit-picking
here, but this could lead to confusion, especially when looking over the
last section.
Adaptability: This concerns the overall structure for all levels. How well
can it adapt to changes? More specifically, can it adapt well to machines
being added/upgraded on the fly? Would configuration files need to be
changed? This last point ties in with the last section somewhat.
Possible Redundancy: Does each master know what other masters are doing? Do
they need to? How do URLs and data get partitioned among master machines? Do
they all have access to a centralized data store of URLs? The physical
design does not seem to indicate where this data on URLs and phases each
URL is in is stored.
Client Idle Time: If there's a situation where the indexer cannot connect to
the master, what can it do? If servers go down, the queue could then become
excessively long. And what if there's nothing left for the client to index
before the next reindexing cycle (dare I suggest this so soon?) Maybe the
client can just reindex the URLs again.
General Bad Stuff: Eric mentioned "nastygrams" could theoretically be sent
over and perhaps the data sent over could be spoofed. And it certainly is
a nasty world out there, with people who may test to see how robust our
servers really are. We may need to consistently work on methods to keep
from being flooded with DDoS attacks and other related floods, and so we
may need keep checking for what is legitimate data and what isn't. But
with indexers, perhaps we can only start out with people we trust.
Separation of Responsibility Between WWW Server and Compute Machines: Who
should do the parsing of user input? Maybe a bit of both would do. All
user input could be encoded into the URL string (as Google does) or maybe
it could all be put into an easily-parsable form on the web side. And
should CGI be used? PHP is free and fast, and doesn't use CGI (right?) .
The data on what's searched for canbe sent through it, and the compute
machine can take the data in perhaps aneasily-parsable form. But this may
be more software-related, and we need to determine the ratio of WWW
machines and Compute machines to find the right balance. I think that
maybe there should be more WWW Servers, as taking in queries may not be
all they do. Remember that we may have personalization features and other
similar things there.
Reducing Disk Access: Much memory would be needed to keep what is most
likely to be requested in memory. And what we need is a good algorithm for
determining which data is most likely to be requested? Would it really be
what's most recent? Wouldn't it be what requested most often? Some
combination of both? And do these are have access to the same data on
this? Just more issues withdistributed computing.
Priorities: We need to determine how to allocate resources. With the money
that we get, certain percentages can be allocated to certain places. But
how should this be done? The Computer and Master layers will need plenty of
hardware, but which may be more important? Is indexing more important than
searching may be what that last question comes down to. And this applies
for all layers. Which need more and/or better hardware?
--------------------------------------------------------------------------
Alright, now in that last section, I mostly raised questions. Now I'm
going to see what I can come up with for answers.
First, answers to questions on the interaction of masters with indexer
clients, where I asked how many indexers each master could handle at a
time. Here, I model the system and come up with a little notation in order
to help us quantify everything and come up with some numbers that will be
useful in the overalldesign of the system.
Let's define some variables first.
M: the number of masters
N: the number of indexers
p: number of pages that each client indexes in each indexing cycle
s: size of file of indexed data for a URL
So what we need to do is somehow come up with a maximum value for what M/N
should be. If the M:N ratio is too high, that'll lead to masters being
bogged down with requests for URLs and to have indexed data stored. And we
don't want that. Now thorughout this, I'm assuming the worst cases for
each indexing cycle. And an indexing cycle is, as you probably know, the
whole cycle of the master finding URLs in the "to be indexed" state, then
having clients request these URLs, then index data in them, and send the
indexed data back.
So let max(M/N) be this maximum value.
Whenever new indexers are added and
registered, the M/N ratio increases, and perhaps the server can be updated
of these changes and somehow it'll need to know how to assign it to a
master server at a time that isn't handling as many indexers at a time.
This assignment would be done dynamically, though perhaps I'm stating the
obvious here. The system just needs to know how many clients each master
is handling.
But then p, the number of URLs that an indexer requests, could also be
made more dynamic, rather than just having a value in a configuration file
for it. The value p could be related to the length of each reindexing
interval, and I'll cover the importance of that next.
But here are some quick little equations to put values into:
Number of pages being indexed per cycle = p * N
Size of data handled by master per indexing cycle = max(M/N) * p * s
In the worst case, the master handles all of this data at once.
We want to maximize p*n. (To index as many pages as possible per cycle.)
But max(M/N)*p*s should be capped. But how? We want to limit the amount of
time clients spend sending data, and masters spend taking it, right? We need
to take bandwith per master into consideration.
Let's say masters get all the data at once: which is the worst case, and
would ideally be avoided. (client badwidth and processing speed can vary
quite a bit, and this could actually be good news for us, causing us to
avoid this, but I digrress.)
Quick experiment:
Say a master has 1.5 MBps of bandwith. (as do clients.)
Let s=100 kB (we've been using this as a worst-case maximum figure)
Let M/N = 100
Let p = 1000 pages/cycle
Then data sent at given time = 100 * 1000 pages/cycle* 0.1 MB/page
= 10 000 MB/cycle
Worst case time per cycle = 10 000 MB / 1.5 MB/s
= 6 667 s
= 111.11 min
= 1 h 51 min 7 s
Now two hours to upload the data does not look good, but this is absolute
worst case, and it does show the inportance of capping p and max(M/N)
values. We can just keep playing with these values, and it's what we're
working with, unless there are a few things here I'm wrong about.
Another issue: Given p, how long would each indexer take to index p pages?
In other words, to go through its mini-cycle? What it works out to is is
follows:
average time for indexer to
complete its part of cycle = p*avg(pagesize) + avg(indexing time per page)
---------------
avg(bandwidth)
We will need to find out these averages above to get total cycle length,
which should be related to the reindexer interval. Perhaps it shoud be
dynamic based on statistics we compile as we index pages? Add that last
equation to worst-case time per cycle, and time master takes to find what
needs to be indexed, and there you have cycle length.
One last thing I should mention is that these are just things that came from
my scratch pad. I've been quite interested in how such a large system would
work, and we'll need to come up with some numbers here. Now, if I'm wrong
about anything here, now would be a good time to tell me. And if so, you can
correct me and this data can be made more formal (ie. I haven't assigned
variables to worst case indexing times, and maybe I should.)
Thanks,
J.K.
> I've written up a "map", or floorplan of some of the conceptual layout
> of the project - both physical and logical. I've put it here:
>
> http://www.sprawler.com/Sprawler-map.pdf
>
> Please feel free to comment, ask questions, etc. Specially on the
> naming of things.
>
> Eric
>
>
> --
> ------------------------------------------------------------------ Eric
> Anderson Sr. Systems Administrator Centaur Technology
> Today is the tomorrow you worried about yesterday.
> ------------------------------------------------------------------
>
>
> -------------------------------------------------------
> This SF.Net email is sponsored by: IBM Linux Tutorials
> Free Linux tutorial presented by Daniel Robbins, President and CEO of
> GenToo technologies. Learn everything from fundamentals to system
> administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
> _______________________________________________
> Sprawler-devel mailing list
> Spr...@li...
> https://lists.sourceforge.net/lists/listinfo/sprawler-devel
|