From: J. K. <ja...@ka...> - 2004-04-24 04:37:13
Hello. Well, I couldn't help but notice that it's been quiet here lately. I haven't seen anything posted here for a while, and I haven't seen anything committed to the repository either. I'd just like to know what's been happening, and to be kept up to date. I personally haven't been as active with this lately, as I was on a short vacation last week and then had another nasty cold. But I'm still looking for ways to help, even though I haven't been able to code (my PC needs a RAM upgrade before I can run and test the code the way I want). Eric, I got your reply to my comments on the Sprawler Map, and I plan on responding to that in a later reply (which I expect to send in the next few days).

So, while I wonder what the four of you are doing, let me just dump some more information from notes I have scribbled: some more calculations and thoughts on them. It's just a disorganized list of scratch-pad calculations which I (or we) can make more formal later.

Just as a quick reminder, here are the variables I'm using:

M: number of master machines
N: number of indexer machines
p: pages indexed per indexer (per cycle)
s: (average) size of indexed data per web page

I should also note that we have a value I've been calling max(M/N), which is the maximum number of indexers we allow per master. Maybe I should represent this value with a single letter, as I do with the others, but I'll leave it as it is for now.

The values I've been using for these variables are as follows:

max(M/N) = 100
s = 100 kB = 0.1 MB
p = 1000 pages

I'm not saying these are the numbers we will go with, or that they are the best; they're just what I've been working with so far, based on an initial estimated guess of what might be reasonable. And remember, it's not so much the numbers that are important, it's the letters. I've come up with equations such as the one for the average amount of data sent to a master in an indexing cycle:

max(M/N) * p * s = 100 * 1000 * 0.1 MB = 10 GB

These equations are what reveal the framework we're working within, and we can then plug in numerical values to come up with optimal results. So in this situation, in the worst case where all the data is received at once, 10 GB is sent to each master, which will need to copy it all over to the NFS disks before the indexers bring in any more data. I figure I already covered the bandwidth issue, so I'm bringing up disk space now, although the more I look at this, the more it appears that bandwidth will be the bottleneck. So in this case each master should have 10 GB available for receiving data from the indexers, and that's before taking into account what I'm going to bring up next: expansion and redundancy. Our engine will expand to take in more pages, and we'll need redundancy in the design in case masters go down.

First, let's talk about expansion. Hardware expansion is relatively slow, I think; it's easier to increase max(M/N) than it is to get more hardware and money. We can fairly easily adjust p and max(M/N). s is not something that will change as much; it will gradually increase as we look for more ways of indexing. What we need is a redundancy factor, where we decide how much extra load each machine should be able to handle. Suppose we set that factor to 2; in other words, each machine can handle twice its worst-case load. This is something we'll need to maintain, so that when one machine goes down, another can pick up its share.
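In case it helps, here's the per-master math above typed into a little Python scratch script, so the numbers are easy to tweak. The variable names are just my own labels for the letters, and the redundancy factor of 2 is only the example value I mentioned, nothing decided:

# Rough per-master capacity sketch using the working numbers above.
# All names here are just my own labels for the variables in this email.

MB = 1.0
GB = 1000 * MB

max_indexers_per_master = 100     # max(M/N)
pages_per_indexer = 1000          # p
indexed_size_per_page = 0.1 * MB  # s = 100 kB

# Worst case: every indexer assigned to a master reports in during one cycle.
data_per_master = max_indexers_per_master * pages_per_indexer * indexed_size_per_page
print("data per master per cycle: %.1f GB" % (data_per_master / GB))   # 10.0 GB

# Redundancy factor 2: each master keeps enough headroom to absorb the
# worst-case load of a failed peer on top of its own.
redundancy_factor = 2
print("disk headroom per master:  %.1f GB" % (redundancy_factor * data_per_master / GB))  # 20.0 GB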
But designing large hardware systems isn't something I'm very familiar with, so please correct me if I'm dead wrong. And perhaps we need an expansion factor as well. Should machines be able to handle double their worst-case load, to make sure they can handle growth in the size of the index? As we grow, we'll need to be able to handle it. Bandwidth per machine may be what's most static; that's my concern at this time. Or we may need more masters to compensate for a lack of bandwidth. Can we get ourselves some T3s?

Now for some more calculations.

Estimation of Client Indexing Time
----------------------------------

A client (or indexer) does the following four things over and over again:

1. Downloads URLs (and perhaps information corresponding to them)
2. Downloads the HTML from each URL
3. Indexes the data for each URL's HTML
4. Sends the indexed data back to the master

So, based on my original numbers, here's my calculation of how much time a client spends on each of these, to get an idea of what the indexing cycle length is. (I've also put these numbers into a little script further down.)

1) Downloading 1000 URLs and their corresponding data would likely be only about 100 kB (assuming 100 bytes of data per URL; we could also include what links to the page, or anything else that would help the indexer know more about the URL, more about that idea later). Even then this might only grow into a 1 MB download, which would only take a significant amount of time over dialup. I'm assuming we won't have many dialup indexers, so I won't even factor this phase in.

2) Assume 25 kB of HTML per web page (let w = 25 kB).
   Data downloaded = w*p = 0.025 MB * 1000 = 25 MB
   Assuming 1.5 MB/s of bandwidth, this phase takes 25 MB / 1.5 MB/s = 16.7 s.

3) Assume the indexer spends 10 s per page (let i = 10 s/page).
   Time to complete this phase = p*i = 1000 pages * 10 s/page = 10 000 s = 166.7 min = 2 h 46 min 40 s

4) Worst-case time to send the indexed data back to the master: I already estimated this to be 1 h 51 min.

Total indexing time (worst case) = roughly 4 hours 38 minutes

And much of that time will be spent with a connection open, sending indexed data back or getting HTML from web pages. One of the reasons I decided we should have each indexer index more pages per cycle was so that it would have more analysis to do and spend less time with connections open. Perhaps I'm a little too concerned with making this more SETI@home-like, even though it can't be, by nature, given the way it constantly goes out and grabs data, both from the web and from the masters.
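Here's that script: just the phase numbers above, so they're easy to play with. The variable names are mine, and for phase 4 I've assumed the worst case is the full max(M/N)*p*s = 10 GB going through a master's 1.5 MB/s link at once, which is where my 1 h 51 min figure came from:

# Scratch-pad estimate of one worst-case indexing cycle.
# Every name and number here is just my working assumption, nothing final.

pages_per_cycle = 1000            # p
html_per_page_mb = 0.025          # w = 25 kB
indexed_size_per_page_mb = 0.1    # s = 100 kB
bandwidth_mb_per_s = 1.5          # assumed bandwidth (MB/s)
index_time_per_page_s = 10        # i
max_indexers_per_master = 100     # max(M/N)

url_download_s = 0  # phase 1: a ~1 MB download at most, so I'm ignoring it
html_download_s = pages_per_cycle * html_per_page_mb / bandwidth_mb_per_s   # phase 2
indexing_s = pages_per_cycle * index_time_per_page_s                        # phase 3
# Phase 4, worst case: all of a master's indexers send at once, so the full
# max(M/N)*p*s = 10 GB has to squeeze through the master's link.
upload_s = max_indexers_per_master * pages_per_cycle * indexed_size_per_page_mb / bandwidth_mb_per_s

total_s = url_download_s + html_download_s + indexing_s + upload_s
print("phase 2: %7.1f s" % html_download_s)       # ~16.7 s
print("phase 3: %7.1f min" % (indexing_s / 60))   # ~166.7 min
print("phase 4: %7.1f min" % (upload_s / 60))     # ~111.1 min
print("total:   %7.2f h" % (total_s / 3600))      # ~4.6 h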
Now it's time to consider two big questions:

1) How many pages will we index at a given time? Call this value P.
2) How much time do we want to spend (re)indexing the web? Let's call this goal for a time frame T.

T, for now, is just something we set ourselves. But what it may all come down to is: which of the above questions is more important? For now, let's just say that our goal is to reindex our target number of web pages once a week. Then:

T = 168 hours = 10 080 min = 604 800 s

The worst-case cycle length, rounded up, is about 5 hours (based on my numbers above), so 33 indexing cycles could be done in this time. But the number of cycles may not be what's important here.

Number of pages indexed = 33*p*N = 33 * 1000 * 10 = 330 000 pages (taking N = 10 indexers as an example)

But what if our goal is to index 1 000 000 (10^6) pages? The most obvious (and perhaps worst) answer: do another two weeks of indexing. What we'd prefer: having more indexers and correspondingly more masters. We can just triple the M value. The max(M/N) ratio remains the same, though, so we also triple the number of indexers. So if we have 100 masters, triple that to 300, and then we need 3000 indexers. And they won't just arrive; much will need to be done on our own end to control project growth.

And what about adjusting p? Well, the number of cycles and the pages indexed per cycle are inversely proportional, and the time it all takes gets thrown off as well. So this just emphasizes the importance of completing indexing cycles very quickly, while keeping the clients in mind: we don't want them consuming excessive bandwidth, taking up excessive HDD space, and so on. So we may need plenty of tweaks for speeding things up, lots of good hardware, and just the right amount of responsibility-sharing between clients and masters. More on this to follow.

Thanks,
J.K.
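P.S. In case anyone wants to play with the weekly-throughput numbers too, here's the same scratch math as a tiny script. The names are my own, and N = 10 indexers is just the example value from above:

# How many pages do we reindex in a week, and how much do we have to
# scale up to hit a bigger target? Purely the scratch numbers from above.

reindex_window_h = 168     # T = one week
cycle_length_h = 5         # rounded-up worst-case cycle length
pages_per_indexer = 1000   # p
indexers = 10              # N, example value

cycles = reindex_window_h // cycle_length_h              # 33
pages_per_week = cycles * pages_per_indexer * indexers   # 330 000
print("%d cycles -> %d pages per week" % (cycles, pages_per_week))

target_pages = 10**6
scale = float(target_pages) / pages_per_week
# Scale N (and M along with it, to keep max(M/N) fixed) by this factor:
print("need about %.1fx the indexers and masters for %d pages/week" % (scale, target_pages))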