Re: [Sprawler-devel] Sprawler map

SourceForge Headquarters 1320 Columbia Street Suite 310 San Diego, CA 92101 +1 (858) 422-6466

J. Kasprzak wrote:

>Well, that map is just what I figured we needed. It does look quite good,
>and after going over it, there are quite a few things I'd like to say,
>mostly concerning the physical aspects of it.
>
>First, about the logical aspect, I have nothing against what's currently
>there. I can't think of anything wrong with it really, maybe a few subtle
>changes could be done about in each of the modules there (ie. controller
>vs. processor, maybe controller could do some of what procesor does, but
>need more detail.) And since I need more details, I'll just focus mostly
>on the physical design.
>
>So here's what I have to say.
>
>Physical Design Issues
>----------------------
>
>Ratio of Indexer Machines to Master Machines: It appears to be N:1, in that
>each master in the layer has N indexers assigned to it. Is this assignment
>static or dynamic? And how large or small should N be? Perhaps it should all
>be dynamically assigned, as we would want the load to be distributed as
>much as possible. And a maximum value of N needs to be set up for each
>master, assuming hardware differences. (Or are all indexers created
>equal?) How many connections should each one handle?
>
>Ratios In General: For each level, it says there are N machines. As an
>individual who is something of a mathematics geek, this seems to imply that
>all levels have equal numbers of machines. It may seem that I'm nit-picking
>here, but this could lead to confusion, especially when looking over the
>last section.
>  
>
I'm glad you mentioned that.  Unfortunately, I wasn't too clear on that 
part.  Here's what I was thinking of doing.  The harvester's all talk to 
*apparently* the same controller - but it isn't.  If we have 5 
controllers, we have one DNS name (cntlr.sprawler.com) that uses DNS 
round robin to distribute load between as many (or as little) actual 
controller machines as we need.  We can add machines, and within 
minutes, the load will the distributed to those new hosts.  We can 
remove them, etc.  If we lose a machine (hardware failure, etc), we 
remove it from the DNS round robin listing and within about a minute it 
disappears.  The harvester will be smart enough to know if it gets a 
reset connection (host unavailable, broken, etc) it waits x seconds and 
retries (most probably getting a new machine, which should work).  This 
also gives us the flexibility to use any load balancing/distributing 
application we want, with no notice or change to the clients.  There can 
be any number of any of the machines on the map.  I made them all the 
same out of symmetry, but there's no real reason there needs to be any 
specific number of each.  We'll grow the different classes of machines 
as needed (and requires barely any changes to add/remove additional 
machines).

>Adaptability: This concerns the overall structure for all levels. How well
>can it adapt to changes? More specifically, can it adapt well to machines
>being added/upgraded on the fly? Would configuration files need to be
>changed? This last point ties in with the last section somewhat.
>  
>
I think I answered this above, but if I didn't answer to your 
satisfaction, please say so and I can go into more detail.

>Possible Redundancy: Does each master know what other masters are doing? Do
>they need to? How do URLs and data get partitioned among master machines? Do
>they all have access to a centralized data store of URLs? The physical
>design does not seem to indicate where this data on URLs and phases each
>URL is in is stored.
>  
>
The controllers (masters) have no idea what any other controller is 
doing, or even if there are any other controllers.  They don't need to 
know.  They all access a central repository of URLs, and pull (somewhat 
randomly) a list of URLs to hand out to the harvesters.  The actual data 
is stored on NFS shared disks that the controllers can access, and also 
the processors and maintainers can also access.  This is why we can pop 
in more controllers at any given time, or remove them.  It's all at our 
whim.

>Client Idle Time: If there's a situation where the indexer cannot connect to
>the master, what can it do? If servers go down, the queue could then become
>excessively long. And what if there's nothing left for the client to index
>before the next reindexing cycle (dare I suggest this so soon?) Maybe the
>client can just reindex the URLs again.
>  
>
If the harvester cannot connect to a controller, it sleeps x seconds and 
tries again (hopefully getting another controller and moving on).  If it 
continues to fail, it will try forever.  The queue should not get huge 
on us - only if we are gathering data from the harvesters and not 
handing out URLs, or vice versa - we are handing out URLs, but not 
gathering the data back - which shouldn't happen, and even if it does, 
it should not be a real problem except for a little heavy traffic when 
things come back alive.  There will never be a time that we have nothing 
to index - the maintainer will see to that.  It's job is to make sure 
that there are always URLs to be indexed by selecting URLs that have 
already been indexed to be re-indexed, and re-inject them into the 
indexing pool.

>General Bad Stuff: Eric mentioned "nastygrams" could theoretically be sent
>over and perhaps the data sent over could be spoofed. And it certainly is
>a nasty world out there, with people who may test to see how robust our
>servers really are. We may need to consistently work on methods to keep
>from being flooded with DDoS attacks and other related floods, and so we
>may need keep checking for what is legitimate data and what isn't. But
>with indexers, perhaps we can only start out with people we trust.
>  
>
Definitely!  In fact, we'll probably only go to a dstributed indexing 
system once we are well on our way to a working engine, so we should 
have some publicity, and hopefully a few more good developers on the team.

>Separation of Responsibility Between WWW Server and Compute Machines: Who
>should do the parsing of user input? Maybe a bit of both would do. All
>user input could be encoded into the URL string (as Google does) or maybe
>it could all be put into an easily-parsable form on the web side. And
>should CGI be used? PHP is free and fast, and doesn't use CGI (right?) .
>The data on what's searched for canbe sent through it, and the compute
>machine can take the data in perhaps aneasily-parsable form. But this may
>be more software-related, and we need to determine the ratio of WWW
>machines and Compute machines to find the right balance. I think that
>maybe there should be more WWW Servers, as taking in queries may not be
>all they do. Remember that we may have personalization features and other
>similar things there.
>  
>
I was  thinking something similar.  What I had envisioned was the WWW 
servers would do all HTML related creation, user interfacing, etc.  It 
would take a query from a user, order it in a certain fashion, and 
request data from the compute machines (via TCP most likely).  The WWW 
servers would send a search request for the different "tokens" that were 
entered (the WWW server parses this, and determines which compute 
machines to contact, what type of query, etc - more on this later on 
down the road) to the selected compute machine, and the compute machine 
responds back with the data to be shown to the user.  Most of the time, 
the WWW server will need to contact several compute machines (and these 
are scalable also in a similar fashion as above), and take the data from 
each and compile it into a list.  The WWW server will be almost 
completely doing web requests and HTML (or PHP) tricks, but some minimal 
computations could be done.

>Reducing Disk Access: Much memory would be needed to keep what is most
>likely to be requested in memory. And what we need is a good algorithm for
>determining which data is most likely to be requested? Would it really be
>what's most recent? Wouldn't it be what requested most often? Some
>combination of both? And do these are have access to the same data on
>this? Just more issues withdistributed computing.
>  
>
I think what we'll end up with is something that keeps in memory the 
most recent and most popular terms.  Each compute machine will have it's 
own cache (which of course will be different than the others).  I also 
think we'll have compute machines for different sections of index data.  
We can break up the access anyway we want - it really doesn't matter.  
We can have "rulesets" that the  compute machines use to determine their 
"scope" of search, and the WWW servers will be told which compute 
machines have which scopes.  This gives us the ability to spread the 
memory caches across scopes, machines, and the index as a whole, still 
using cheap hardware.

>Priorities: We need to determine how to allocate resources. With the money
>that we get, certain percentages can be allocated to certain places. But
>how should this be done? The Computer and Master layers will need plenty of
>hardware, but which may be more important? Is indexing more important than
>searching may be what that last question comes down to. And this applies
>for all layers. Which need more and/or better hardware?
>  
>
This is going to be interesting.  I think we'll have to attack this one 
day at a time.  Right now, we need disk space, and a few machines to use 
to get an initial setup going.  Once we start indexing full-time, we'll 
need LOTS of disk space, and several machines for disk servers.  As we 
get a larger index, we'll want to add compute and WWW machines, so we 
can support the increasing number of searches that we'll be getting.  It 
will hit a break point, where suddenly many people know about us, and 
use us, yet we're still small and growing, so we'll be at a critical 
time.  Hopefully I can score some good hardware deals with a few vendors 
before that happens.

>--------------------------------------------------------------------------
>
>Alright, now in that last section, I mostly raised questions. Now I'm
>going to see what I can come up with for answers.
>
>First, answers to questions on the interaction of masters with indexer
>clients, where I asked how many indexers each master could handle at a
>time. Here, I model the system and come up with a little notation in order
>to help us quantify everything and come up with some numbers that will be
>useful in the overalldesign of the system.
>
>Let's define some variables first.
>
>M: the number of masters
>N: the number of indexers
>p: number of pages that each client indexes in each indexing cycle
>s: size of file of indexed data for a URL
>
>So what we need to do is somehow come up with a maximum value for what M/N
>should be. If the M:N ratio is too high, that'll lead to masters being
>bogged down with requests for URLs and to have indexed data stored. And we
>don't want that. Now thorughout this, I'm assuming the worst cases for
>each indexing cycle. And an indexing cycle is, as you probably know, the
>whole cycle of the master finding URLs in the "to be indexed" state, then
>having clients request these URLs, then index data in them, and send the
>indexed data back.
>
>So let max(M/N) be this maximum value.
>
>Whenever new indexers are added and
>registered, the M/N ratio increases, and perhaps the server can be updated
>of these changes and somehow it'll need to know how to assign it to a
>master server at a time that isn't handling as many indexers at a time.
>This assignment would be done dynamically, though perhaps I'm stating the
>obvious here. The system just needs to know how many clients each master
>is handling.
>
>But then p, the number of URLs that an indexer requests, could also be
>made more dynamic, rather than just having a value in a configuration file
>for it. The value p could be related to the length of each reindexing
>interval, and I'll cover the importance of that next.
>
>But here are some quick little equations to put values into:
>
>Number of pages being indexed per cycle = p * N
>Size of data handled by master per indexing cycle = max(M/N) * p * s
>
>In the worst case, the master handles all of this data at once.
>
>We want to maximize p*n. (To index as many pages as possible per cycle.)
>
>But max(M/N)*p*s should be capped. But how? We want to limit the amount of
>time clients spend sending data, and masters spend taking it, right? We need
>to take bandwith per master into consideration.
>
>Let's say masters get all the data at once: which is the worst case, and
>would ideally be avoided. (client badwidth and processing speed can vary
>quite a bit, and this could actually be good news for us, causing us to
>avoid this, but I digrress.)
>
>Quick experiment:
>
>Say a master has 1.5 MBps of bandwith. (as do clients.)
>
>Let s=100 kB (we've been using this as a worst-case maximum figure)
>Let M/N = 100
>Let p = 1000 pages/cycle
>
>Then data sent at given time = 100 * 1000 pages/cycle* 0.1 MB/page
>                             = 10 000 MB/cycle
>
>Worst case time per cycle = 10 000 MB  /  1.5 MB/s
>                          = 6 667 s
>                          = 111.11 min
>                          = 1 h 51 min 7 s
>
>
>Now two hours to upload the data does not look good, but this is absolute
>worst case, and it does show the inportance of capping p and max(M/N)
>values. We can just keep playing with these values, and it's what we're
>working with, unless there are a few things here I'm wrong about.
>
>Another issue: Given p, how long would each indexer take to index p pages?
>In other words, to go through its mini-cycle? What it works out to is is
>follows:
>
>average time for indexer to
>complete its part of cycle = p*avg(pagesize)   +  avg(indexing time per page)
>		             ---------------
>                              avg(bandwidth)
>
>
>We will need to find out these averages above to get total cycle length,
>which should be related to the reindexer interval. Perhaps it shoud be
>dynamic based on statistics we compile as we index pages? Add that last
>equation to worst-case time per cycle, and time master takes to find what
>needs to be indexed, and there you have cycle length.
>
>One last thing I should mention is that these are just things that came from
>my scratch pad. I've been quite interested in how such a large system would
>work, and we'll need to come up with some numbers here. Now, if I'm wrong
>about anything here, now would be a good time to tell me. And if so, you can
>correct me and this data can be made more formal (ie. I haven't assigned
>variables to worst case indexing times, and maybe I should.)
>  
>
This is awesome.  I love it.  This is exactly what I hope to see from 
someone working on this project.  Ok - I think your numbers are 
*extreme* worst case scenario, but good to see nonetheless.  There are a 
few things I'd say about them: first, the clients will naturally stagger 
themselves out a bit, so it won't take all clients 2 hrs to send their 
data.  Plus, you are figuring 100 clients, with 1000 pages per cycle - 
which is a little high I think.  It will probably be more like 50 pages 
per cycle (we don't want to slam our servers when they upload, we don't 
want one client to be responsible for too many URLs, and we don't want 
to use up too much space on the client's disk).  

I think you should incorporate some of these calculations in the code.  
Add some simple routines to time and average the page sizes, download 
times per page, per cycle, upload times, etc.  It would be good for us 
to know, and nice for debugging. 

Eric

-- 
------------------------------------------------------------------
Eric Anderson     Sr. Systems Administrator    Centaur Technology
Today is the tomorrow you worried about yesterday.
------------------------------------------------------------------