From: <mni...@mo...> - 2004-08-18 23:32:31

>>>>> "Eric" == Eric Anderson <and...@ce...> writes:

> I don't see any error? Can you try re-sending?

When I have something to commit I will.

>> FYI I'm getting the following error on commits to cvs.

Checking in lib/Sprawler.pm;
/cvsroot/sprawler/sprawler/lib/Sprawler.pm,v  <--  Sprawler.pm
new revision: 1.11; previous revision: 1.10
done

mojo
From: Eric A. <and...@ce...> - 2004-08-18 14:56:28

I don't see any error? Can you try re-sending?

Mojo B. Nichols wrote:
> Eric,
>
> FYI
> I'm getting the following error on commits to cvs.
>
> Mojo
>
> --
> An American is a man with two arms and four wheels.
>     -- A Chinese child

--
------------------------------------------------------------------
Eric Anderson     Sr. Systems Administrator     Centaur Technology
Talk sense to a fool and he calls you foolish.
------------------------------------------------------------------
From: <mni...@mo...> - 2004-08-18 03:56:17

Eric,

FYI
I'm getting the following error on commits to cvs.

Mojo

--
An American is a man with two arms and four wheels.
    -- A Chinese child
From: <mni...@mo...> - 2004-06-23 04:26:27

>>>>> "Eric" == Eric Anderson <and...@ce...> writes:

> Mojo B. Nichols wrote:
>>> Actually I think it may be my perl and just the sockets... can
>>> somebody else try this on linux? I said client because that
>>> fails, but upon closer inspection it doesn't seem that simple.
>>
>> Whew, no, not my sockets. Basically the problem was twofold: one, my
>> client database didn't have my client in there. If I add it
>> blindly to the database that takes it past that point. Then my url
>> seed db was either empty or broken or something. Removing that
>> index allowed it to start working (reseeded it etc). I'm going to
>> shuffle this off to the side and see if I can figure out where it
>> went wrong. Perhaps the seeded url db is in cvs? I'll check it out.
>
> Glad to hear you got it working! That makes me feel better
> anyway.. :)
>
>> I'm curious about this client db and its intended use.
>
> Basically, to keep clients from being able to check in/upload data
> for *ANY* arbitrary url they desire. That way, an evil indexer
> can't fake an index for its own website, with all kinds of fake or
> misleading data in it, causing our index to be invalid. They
> request URLs to index, then they must check in those URLs. You
> can't check in URLs you have not checked out..
>
> Maybe it's time we have someone write up some documentation on all
> this? How to use each piece, with example and syntax, etc.. What
> do you think?

Last I checked, documentation was pretty much there, although this being a
new method, it may need to be added. It sounds as if we need to add them to
the client db upon sending a set of urls. I have to think about this a
little more.

I thought we could use client redundancy and checksums to ensure index
integrity. As we receive a batch we put it in a queue; as soon as its
redundant client (or clients) return with indexes and those check out, the
master accepts them. The theory here is that a rogue client would then have
to occupy a large percentage of client machines to ever get skewed results
in.

As for the client check-in alone, it seems like it could still be
manipulated. If they obtain a set of urls, what is to prevent them from
under-reporting those urls, or other such mischievousness? Anyway, at the
end of the day it probably doesn't hurt to ensure that urls sent to a
client come back from that client, so I'm not really arguing against it. In
fact it would be a necessary step in preventing a rogue client from just
sending the required number of skewed indexes to try to fool the master in
the redundancy scheme.

We can dub the redundancy scheme RAIC, Redundant Array of Independent
Clients :-) or some other such nonsense.

mojo

--
When the Apple IIc was introduced, the informative copy led off with a
couple of asterisked sentences:

    It weighs less than 8 pounds.*
    And costs less than $1,300.**

In tiny type were these "fuller explanations":

    *  Don't asterisks make you suspicious as all get out?  Well, all
       this means is that the IIc alone weights 7.5 pounds.  The power
       pack, monitor, an extra disk drive, a printer and several bricks
       will make the IIc weigh more.  Our lawyers were concerned that
       you might not be able to figure this out for yourself.

    ** The FTC is concerned about price fixing.  You can pay more if
       you really want to.  Or less.

    -- Forbes
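[Editor's note: the redundancy scheme Mojo sketches above is easy to prototype. The following is only an illustration of the idea, not Sprawler code; the assignment id, the two-client quorum, and the assumption that two honest clients produce byte-identical index data (so their MD5 digests match) are all assumptions made for this sketch.]

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    # Pending batches, keyed by the URL-set id the master handed out.
    my %pending;

    # Stand-in for handing a verified batch to the queue processor.
    sub accept_batch {
        my ($assignment_id, $index_data) = @_;
        print "accepted batch $assignment_id (" . length($index_data) . " bytes)\n";
    }

    # Called whenever a client returns its index for a given assignment.
    sub receive_batch {
        my ($assignment_id, $client_id, $index_data) = @_;
        push @{ $pending{$assignment_id} },
             { client => $client_id, digest => md5_hex($index_data), data => $index_data };
        _maybe_accept($assignment_id);
    }

    # Accept only once two independent clients report the same checksum,
    # so a single rogue client cannot push skewed data into the index.
    sub _maybe_accept {
        my ($assignment_id) = @_;
        my @got = @{ $pending{$assignment_id} };
        return if @got < 2;
        my %by_digest;
        push @{ $by_digest{ $_->{digest} } }, $_ for @got;
        for my $group (values %by_digest) {
            if (@$group >= 2) {
                accept_batch($assignment_id, $group->[0]{data});
                delete $pending{$assignment_id};
                return;
            }
        }
    }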
From: Eric A. <and...@ce...> - 2004-06-23 03:39:29

Mojo B. Nichols wrote:
>> Actually I think it may be my perl and just the sockets... can
>> somebody else try this on linux? I said client because that fails,
>> but upon closer inspection it doesn't seem that simple.
>
> Whew, no, not my sockets. Basically the problem was twofold: one, my
> client database didn't have my client in there. If I add it blindly
> to the database that takes it past that point. Then my url seed db
> was either empty or broken or something. Removing that index allowed
> it to start working (reseeded it etc). I'm going to shuffle this off
> to the side and see if I can figure out where it went wrong. Perhaps
> the seeded url db is in cvs? I'll check it out.

Glad to hear you got it working! That makes me feel better anyway.. :)

> I'm curious about this client db and its intended use.

Basically, to keep clients from being able to check in/upload data for
*ANY* arbitrary url they desire. That way, an evil indexer can't fake an
index for its own website, with all kinds of fake or misleading data in it,
causing our index to be invalid. They request URLs to index, then they must
check in those URLs. You can't check in URLs you have not checked out..

Maybe it's time we have someone write up some documentation on all this?
How to use each piece, with example and syntax, etc.. What do you think?

Eric

--
------------------------------------------------------------------
Eric Anderson     Sr. Systems Administrator     Centaur Technology
Talk sense to a fool and he calls you foolish.
------------------------------------------------------------------
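[Editor's note: Eric's rule ("you can't check in URLs you have not checked out") boils down to a small ledger on the controller side. A minimal sketch, assuming a DB_File-backed hash and made-up function names; the real client db format is not shown anywhere in this thread.]

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # Tie a simple checkout ledger: key = "client_id url", value = checkout time.
    # DB_File and this key layout are assumptions for the sketch.
    tie my %checked_out, 'DB_File', 'checkout.db', O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "cannot open checkout.db: $!";

    # Record which URLs a client has been handed.
    sub checkout_urls {
        my ($client_id, @urls) = @_;
        $checked_out{"$client_id $_"} = time for @urls;
        return @urls;
    }

    # A client may only check in URLs it actually checked out.
    sub may_check_in {
        my ($client_id, $url) = @_;
        return exists $checked_out{"$client_id $url"};
    }

    # Example: reject an upload for a URL this client never requested.
    checkout_urls('client-1', 'http://www.sprawler.com/');
    print may_check_in('client-1', 'http://www.sprawler.com/') ? "ok\n" : "reject\n";
    print may_check_in('client-1', 'http://evil.example/')     ? "ok\n" : "reject\n";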
From: <mni...@mo...> - 2004-06-23 02:34:03

>>>>> "Mojo" == Mojo B Nichols <mni...@mo...> writes:
>>>>> "Eric" == Eric Anderson <and...@ce...> writes:

>> Mojo B. Nichols wrote:
>>> the client check seems somewhat broken.
>>
>> Uhh, how about some additional details?  Eric

> Actually I think it may be my perl and just the sockets... can
> somebody else try this on linux? I said client because that fails,
> but upon closer inspection it doesn't seem that simple.

Whew, no, not my sockets. Basically the problem was twofold: one, my client
database didn't have my client in there. If I add it blindly to the
database that takes it past that point. Then my url seed db was either
empty or broken or something. Removing that index allowed it to start
working (reseeded it etc). I'm going to shuffle this off to the side and
see if I can figure out where it went wrong. Perhaps the seeded url db is
in cvs? I'll check it out.

I'm curious about this client db and its intended use.

Thanks,

--
HP had a unique policy of allowing its engineers to take parts from stock
as long as they built something. "They figured that with every design, they
were getting a better engineer. It's a policy I urge all companies to
adopt."
    -- Apple co-founder Steve Wozniak, "Will Wozniak's class give Apple to
       teacher?"  EE Times, June 6, 1988, pg 45
From: <mni...@mo...> - 2004-06-22 21:24:35

>>>>> "Eric" == Eric Anderson <and...@ce...> writes:

> Mojo B. Nichols wrote:
>> the client check seems somewhat broken.
>
> Uhh, how about some additional details?  Eric

Actually I think it may be my perl and just the sockets... can somebody
else try this on linux? I said client because that fails, but upon closer
inspection it doesn't seem that simple.

Thanks,

Mojo

--
Life is both difficult and time consuming.
From: Eric A. <and...@ce...> - 2004-06-22 13:06:40

Mojo B. Nichols wrote:
> the client check seems somewhat broken.

Uhh, how about some additional details?

Eric

--
------------------------------------------------------------------
Eric Anderson     Sr. Systems Administrator     Centaur Technology
Talk sense to a fool and he calls you foolish.
------------------------------------------------------------------
From: <mni...@mo...> - 2004-06-19 14:27:19

the client check seems somewhat broken.

--
Nobody ever died from oven crude poisoning.
From: <mni...@mo...> - 2004-06-10 10:53:42

Hi all,

When I attempt to run sprawler locally this is what I get:

./master.pl
No urls in db to index!  Seeding url index ...
78 urls added to seed list.
print() on closed filehandle LOG at lib/Sprawler.pm line 92.
print() on closed filehandle LOG at lib/Sprawler.pm line 92.
print() on closed filehandle LOG at lib/Sprawler.pm line 92.
print() on closed filehandle LOG at lib/Sprawler.pm line 92, <PARENT> line 1.
print() on closed filehandle LOG at lib/Sprawler.pm line 92, <CHILD> line 1.
Use of uninitialized value in string eq at ./master.pl line 125.
Client tes...@sp...-1031080407379 attempted to steal from us!
print() on closed filehandle LOG at lib/Sprawler.pm line 92.
print() on closed filehandle LOG at lib/Sprawler.pm line 92.

and

./indexer.pl -s localhost
Prototype mismatch: sub Sprawler::Client::get ($) vs none at lib/Sprawler/Client.pm line 127.
Index path: ./indexes/tes...@sp...-1031080407379/
Indexable content types: text/html text/plain
Requesting urls

Before I dig into what's causing this, does anybody else see this behavior
when trying this simple test? I think our biggest challenge will be keeping
our programs relatively platform agnostic. Eric, I understand this works
for you in FreeBSD?

Thanks,

mojo

--
World Domination, One CPU Cycle At A Time

Forget about searching for alien signals or prime numbers. The real
distributed computing application is "Domination@World", a program to
advocate Linux and Apache to every website in the world that uses Windows
and IIS.

The goal of the project is to probe every IP number to determine what kind
of platform each Net-connected machine is running. "That's a tall order...
we need lots of computers running our Domination@World clients to help
probe every nook and cranny of the Net," explained Mr. Zell Litt, the
project head.

After the probing is complete, the second phase calls for the data to be
cross-referenced with the InterNIC whois database. "This way we'll have the
names, addresses, and phone numbers for every Windows-using system
administrator on the planet," Zell gloated. "That's when the fun begins."

The "fun" part involves LART (Linux Advocacy & Re-education Training), a
plan for extreme advocacy. As part of LART, each Linux User Group will
receive a list of the Windows-using weenies in their region. The LUG will
then be able to employ various advocacy techniques, ranging from a
soft-sell approach (sending the target a free Linux CD in the mail) all the
way to "LARTcon 5" (cracking into their system and forcibly installing
Linux).
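[Editor's note: the repeated "print() on closed filehandle LOG" warnings usually mean the code prints to LOG without checking that the open succeeded, or after a forked child has lost the handle. A defensive pattern, sketched here from scratch rather than taken from the real Sprawler.pm:]

    use strict;
    use warnings;

    my $log_fh;   # lexical handle instead of the bareword LOG

    sub open_log {
        my ($path) = @_;
        open($log_fh, '>>', $path)
            or do { warn "could not open log '$path': $!"; $log_fh = undef; };
    }

    sub log_msg {
        my ($msg) = @_;
        # Guard every write: fall back to STDERR if the log never opened
        # (or was closed under us), instead of warning on a dead handle.
        if (defined $log_fh and fileno $log_fh) {
            print {$log_fh} scalar(localtime), " $msg\n";
        } else {
            print STDERR "LOG unavailable: $msg\n";
        }
    }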
From: <mni...@mo...> - 2004-06-10 10:48:10

This is a test. If it had been a real mail you would have wanted to read it.

mojo

--
People are very flexible and learn to adjust to strange surroundings --
they can become accustomed to read Lisp and Fortran programs, for example.
    -- Leon Sterling and Ehud Shapiro, Art of Prolog, MIT Press
From: <ben...@id...> - 2004-05-22 12:22:50

Dear Open Source developer

I am doing a research project on "Fun and Software Development" in which I
kindly invite you to participate. You will find the online survey under
http://fasd.ethz.ch/qsf/. The questionnaire consists of 53 questions and
you will need about 15 minutes to complete it.

With the FASD project (Fun and Software Development) we want to define the
motivational significance of fun when software developers decide to engage
in Open Source projects. What is special about our research project is that
a similar survey is planned with software developers in commercial firms.
This procedure allows the immediate comparison between the involved
individuals and the conditions of production of these two development
models. Thus we hope to obtain substantial new insights into the phenomenon
of Open Source Development.

With many thanks for your participation,
Benno Luthiger

PS: The results of the survey will be published under
http://www.isu.unizh.ch/fuehrung/blprojects/FASD/. We have set up the
mailing list fa...@we... for this study. Please see
http://fasd.ethz.ch/qsf/mailinglist_en.html for registration to this
mailing list.

_______________________________________________________________________
Benno Luthiger
Swiss Federal Institute of Technology Zurich
8092 Zurich
Mail: benno.luthiger(at)id.ethz.ch
_______________________________________________________________________
From: <mni...@mo...> - 2004-05-04 02:30:55

>>>>> "J" == J Kasprzak <ja...@ka...> writes:

> Hello. Well, I couldn't help but notice that it's been quiet here
> lately. I haven't seen anything posted here for a while and haven't
> seen anything committed to the repository lately. I'd just like to
> know what's been happening, and be kept up to date, etc. I
> personally have not been as active with this lately, as I was on a
> short vacation last week and had another nasty cold. But I'm still
> looking for ways to help, even though I haven't been able to code (my

I'm still here, but have been busy with guests literally from the beginning
of April to now. The last scheduled guest left today, so I will begin
coding again soon. Actually I did a bug fix, but some other things weren't
working so I haven't committed. I plan on getting back into full swing
here, soon.

regards,

mojo

--
Having children is like having a bowling alley installed in your brain.
    -- Martin Mull
From: J. K. <ja...@ka...> - 2004-05-03 04:24:11

Hello. Good to hear from you. I was wondering what was happening; any time
there isn't anything going on (no talk, no new code, no website updates,
etc.) I get concerned.

In fact, I was wondering about what I should be doing. If the other members
here don't seem to be working at it, then I wondered if I should keep
working. This is the reason I did not send a reply to the message on the
Sprawler Map within a few days as I said I would. If we were not going to
go over important matters and compare notes, etc. then I figured I might be
wasting my time. About all I wanted was to know that the rest of you were
active here, because I certainly can't work on this alone.

But I must say that I am glad that our leader, Eric, appears to have the
determination needed for this project to be a success. Without his support,
I cannot see this project taking off. This is his vision, and as the person
who has the vision, the importance of that position cannot be
underestimated.

Anyway, what about the rest of you? It'd be good to hear from all of you as
well.

Thanks,
J.K.

> Ok - first, it HAS been quiet around here. I've pretty much been 150%
> busy at work, so I've been disconnected from the project. I'll be back
> in action very soon, and I guarantee I'll make up for the lost time, so
> now's a good time to start studying the code, play with it, etc,
> because it's going to be a coding frenzy when I start up again soon. :)
> I'll reread these notes, and comment on them separately..
>
> Anyone who's kicking back and waiting for some action - prepare for
> Sprawler's most important time..
>
> Eric
From: Eric A. <and...@ce...> - 2004-04-30 13:35:15

J. Kasprzak wrote:
> Hello.
>
> Well, I couldn't help but notice that it's been quiet here lately. I
> haven't seen anything posted here for a while and haven't seen anything
> committed to the repository lately. I'd just like to know what's been
> happening, and be kept up to date, etc. I personally have not been as
> active with this lately, as I was on a short vacation last week and had
> another nasty cold.

[..snip good stuff..]

Ok - first, it HAS been quiet around here. I've pretty much been 150% busy
at work, so I've been disconnected from the project. I'll be back in action
very soon, and I guarantee I'll make up for the lost time, so now's a good
time to start studying the code, play with it, etc, because it's going to
be a coding frenzy when I start up again soon. :)

I'll reread these notes, and comment on them separately..

Anyone who's kicking back and waiting for some action - prepare for
Sprawler's most important time..

Eric

--
------------------------------------------------------------------
Eric Anderson     Sr. Systems Administrator     Centaur Technology
Today is the tomorrow you worried about yesterday.
------------------------------------------------------------------
From: J. K. <ja...@ka...> - 2004-04-24 04:37:13
Hello. Well, I couldn't help but notice that it's been quiet here lately. I haven't seen anything posted here for a while and haven't seen anything committed to the repository lately. I'd just like top know what's been happening, and be kept up to date, etc. I personally have not been as active with this lately, as I was on a short vacation last week and had another nasty cold. But I'm still loking for ways to help, even though I haven't been able to code (my PC needs a RAM upgrade for me to be running the code and testing it the way I want.) Eric, I got your reply to my comments on the Sprawler Map, and I plan on commenting on that later on in a later reply (which I expect to come in the next few days.) So as I wonder what the four of you are doing, let me just dump some more information from notes that I have scribbled. Here you are: Here are some more calculations and thoughts on them. Just as a quick reminder, here are the variables I'm using: M: number of master machines N: number of indexer machines p: pages indexed per indexer s: (average) size of indexed data per web page I should also note that we have a value, which I've been calling max(M/N), which is the maximum number of masters we allow for indexers. Maybe I should represent this value with a single letter, as I do with the other values. But I'll leave this as it is for now. It's just a disorganized list of scratch pad calculations which I or we can make more formal later. The values I've been using for these variables are as follows: max(M/N) = 100 s = 100 kB = 0.1 MB p = 1000 pages I'm not saying that these are the numbers we will go with, or that these numbers are the best. It's just what I've been working with for now based on an initial estimated guess of what might be best. And remember, it's not so much the numbers that are important, it's the letters. I've come up with equations such as the one for average amout of data sent to master in an indexing cycle (which is max(M/N)*p*s = 10 GB) and these are what reveal the whole framework within which we are working, and put numerical values in to come up with optimal results. So here in this situation, in the worst case, where all data is received at once, 10 GB of data is sent to each master, and so it'll need to have that all copied over to the NFS disks before the indexers bring in any more data. I figure I already covered the bandwidth issue, so I'm bringing up disk space now. Although the more I look at this, the more it appears that bandwidth will be a bottleneck. So in this case 10 GB of data should be available for receiving data from the indexers, but that's not taking into account what I'm going to bring up next. And those are the issues of expansion and redundancy. Our engine will expand to take in more pages, and we'll need to have redundancy in the design in case masters go down. First, let's talk about expansion. Hardware expansion is relatively stagnant, I think. It's easier to increase max(M/n) than it is to get more hardware and money. We can fairly easily adjust p, and max(M/N). s is not something that'll change as much. As we look for more ways of indexing it'll gradually increase. What we need is a redundancy factor, where we decide how much more each machine should handle. Suppose we set that factor to 2, in other words, each machine can handle twice as much as it can in its worst case. This is something we'll need to maintain. As when one machine goes down, another can pick up and handle it. 
But designing large hardware systems isn't something I'm as familiar with, so please correct me if I'm dead wrong. And perhaps we need and expansion factor as well. Should machines be able to handle double their worst-case load in order to make sure it can handle expansion in the size of the index? As we grow, we'll need to be able to handle it. Bandwidth per machine may be what's most static, that's my concern at this time. Or we may need more masters to compensate for lack of bandwidth. Can we get ourselves some T3s? Now for some more calculations. Estimation of Client Indexing Time ---------------------------------- A client (or indexer) to does the following four things over and over again: 1. Downloads URLs (and perhaps information corresponding to them) 2. Downloads the HTML that from each URL 3. Indexes the data for each URL's HTML 4. Sends this indexed data back to the master So based on my original numbers, here's my calculation of how much time a client spends indexing, so we can get an idea of what indexing cycle length is. 1) Downloading 1000 URLs and their corresponding data on them would just likely be 100 kB (assuming 100 bytes of data per URL, but we could include what links to the page or anything else that's have the indexer know more about the URL, more about this idea later.) So this could become a 1 MB download, which would only be significant in length if you're using dialup. I'm assuming we won't have may dialup indexers, and so I won't even factor this in. 2) Assume 25 kb of HTML per web page. (Let w =25 kB) Then data downloaded = w*p = 0.025 MB * 1000 = 25 MB Assuming 1.5 MB/s of bandwidth, this phase takes: 25 MB ----- 1.5 MB/s = 216.7 s = 3 min 37 s 3) Assume: indexer time = 10 s / page = i (LEt i= 10s / page) Time to complete this phase = p*i = 1000 pages * 10 s/page = 10 000 s = 166.6 s = 2 h 46 min 4) Worst case time to send it back : I already estimated to be 1 h 51 min in worst case. Total indexing time = 4 hours 40 minutes And much of that time will be spend with a connection open, sending indexed data back, or getting HTML from web pages. One of the reasons I decided that we have each indexer index more pages per cycle would be so that it'd have more analysis to do, and would have less time with connections open. Perhaps I'm a little too concerned with making this more SETI@home-like, even though it can't be, by nature, with the way that it just goes out and grabs data all the time, from the web and from masters. Now time to consider two big questions: 1) How many pages will we index at a given time? Call this value P. 2) How much time do we want to spend (re)indexing the web? Let's call this goal for a time frame T. T, for now, is just something we are setting. But what it may all come down to is, which of the above questions is more important? For now, let's just say that our goal is to reindex our goal in terms of number of web pages once a week. Then T = 168 hours = 10 080 min = 604 800 s The worst-case cycle length was calculated to be 5 hours (based on my numbers above.) Therefore, 33 indexing cycles could be done in this time. But number of cycles may not be what's important here. Number of pages indexed = 33*p*M = 33*1000*10 = 330 000 pages But what if our goal is to index 1 000 000 (10^6) pages? Most obvious (and pherps worst) answer: Do another two weeks of indexing. What we'd prefer: having more indexers and corresponding masters. We can just triple the M value. 
The max(M/N) ratio remains the same though, so we also triple the number of indexers. So if we have 100 masters, triple that to 300, then we need 3000 indexers. And they won't just arrive. Much will need to be done from our own end to control project growth. And what about adjusting p? Well, number of cycles and pages indexed per cycle are inversely proprtional. The time that it all takes gets thrown off as well. And so, this just emphasizes the importance of completing indexing cycles very quickly, but we need to keep clients in mind, as we don't want them to consume excessive bandwidth, or take up excessive HDD space, etc. So we may need plenty of tweaks for speeding things up, lots of good hardware, and to find just the right amount of responsibility-sharing between clients and indexers. More on this to follow. Thanks, J.K. |
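[Editor's note: these scratch-pad figures are easy to keep honest with a tiny script that recomputes them from the assumed constants. The values below are simply the ones used in this post (100 indexers per master, 1000 pages per cycle, 100 kB of index data and 25 kB of HTML per page, 10 s of indexing time, 1.5 MB/s of bandwidth); they are not agreed project parameters. Changing a constant and re-running it is quicker than redoing the arithmetic by hand.]

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Scratch-pad constants from the discussion above (assumptions, not decisions).
    my $max_m_over_n = 100;     # indexers per master, worst case
    my $p            = 1000;    # pages indexed per indexer per cycle
    my $s            = 0.1;     # MB of indexed data per page (100 kB)
    my $w            = 0.025;   # MB of HTML per page (25 kB)
    my $i            = 10;      # seconds of indexing time per page
    my $bw           = 1.5;     # MB/s of bandwidth (master and client)

    # Worst case: every indexer uploads to its master at once.
    my $data_per_master_mb = $max_m_over_n * $p * $s;
    my $upload_secs        = $data_per_master_mb / $bw;

    # One client's cycle: fetch HTML, index it, then upload during the
    # worst-case window where it shares the master's bandwidth with everyone.
    my $fetch_secs  = ($w * $p) / $bw;
    my $index_secs  = $i * $p;
    my $client_secs = $fetch_secs + $index_secs + $upload_secs;

    printf "data per master per cycle : %.0f MB\n",  $data_per_master_mb;
    printf "worst-case upload time    : %.1f min\n", $upload_secs / 60;
    printf "client cycle length       : %.1f h\n",   $client_secs / 3600;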
From: Eric A. <and...@ce...> - 2004-03-31 16:25:01
J. Kasprzak wrote: >Well, that map is just what I figured we needed. It does look quite good, >and after going over it, there are quite a few things I'd like to say, >mostly concerning the physical aspects of it. > >First, about the logical aspect, I have nothing against what's currently >there. I can't think of anything wrong with it really, maybe a few subtle >changes could be done about in each of the modules there (ie. controller >vs. processor, maybe controller could do some of what procesor does, but >need more detail.) And since I need more details, I'll just focus mostly >on the physical design. > >So here's what I have to say. > >Physical Design Issues >---------------------- > >Ratio of Indexer Machines to Master Machines: It appears to be N:1, in that >each master in the layer has N indexers assigned to it. Is this assignment >static or dynamic? And how large or small should N be? Perhaps it should all >be dynamically assigned, as we would want the load to be distributed as >much as possible. And a maximum value of N needs to be set up for each >master, assuming hardware differences. (Or are all indexers created >equal?) How many connections should each one handle? > >Ratios In General: For each level, it says there are N machines. As an >individual who is something of a mathematics geek, this seems to imply that >all levels have equal numbers of machines. It may seem that I'm nit-picking >here, but this could lead to confusion, especially when looking over the >last section. > > I'm glad you mentioned that. Unfortunately, I wasn't too clear on that part. Here's what I was thinking of doing. The harvester's all talk to *apparently* the same controller - but it isn't. If we have 5 controllers, we have one DNS name (cntlr.sprawler.com) that uses DNS round robin to distribute load between as many (or as little) actual controller machines as we need. We can add machines, and within minutes, the load will the distributed to those new hosts. We can remove them, etc. If we lose a machine (hardware failure, etc), we remove it from the DNS round robin listing and within about a minute it disappears. The harvester will be smart enough to know if it gets a reset connection (host unavailable, broken, etc) it waits x seconds and retries (most probably getting a new machine, which should work). This also gives us the flexibility to use any load balancing/distributing application we want, with no notice or change to the clients. There can be any number of any of the machines on the map. I made them all the same out of symmetry, but there's no real reason there needs to be any specific number of each. We'll grow the different classes of machines as needed (and requires barely any changes to add/remove additional machines). >Adaptability: This concerns the overall structure for all levels. How well >can it adapt to changes? More specifically, can it adapt well to machines >being added/upgraded on the fly? Would configuration files need to be >changed? This last point ties in with the last section somewhat. > > I think I answered this above, but if I didn't answer to your satisfaction, please say so and I can go into more detail. >Possible Redundancy: Does each master know what other masters are doing? Do >they need to? How do URLs and data get partitioned among master machines? Do >they all have access to a centralized data store of URLs? The physical >design does not seem to indicate where this data on URLs and phases each >URL is in is stored. 
> > The controllers (masters) have no idea what any other controller is doing, or even if there are any other controllers. They don't need to know. They all access a central repository of URLs, and pull (somewhat randomly) a list of URLs to hand out to the harvesters. The actual data is stored on NFS shared disks that the controllers can access, and also the processors and maintainers can also access. This is why we can pop in more controllers at any given time, or remove them. It's all at our whim. >Client Idle Time: If there's a situation where the indexer cannot connect to >the master, what can it do? If servers go down, the queue could then become >excessively long. And what if there's nothing left for the client to index >before the next reindexing cycle (dare I suggest this so soon?) Maybe the >client can just reindex the URLs again. > > If the harvester cannot connect to a controller, it sleeps x seconds and tries again (hopefully getting another controller and moving on). If it continues to fail, it will try forever. The queue should not get huge on us - only if we are gathering data from the harvesters and not handing out URLs, or vice versa - we are handing out URLs, but not gathering the data back - which shouldn't happen, and even if it does, it should not be a real problem except for a little heavy traffic when things come back alive. There will never be a time that we have nothing to index - the maintainer will see to that. It's job is to make sure that there are always URLs to be indexed by selecting URLs that have already been indexed to be re-indexed, and re-inject them into the indexing pool. >General Bad Stuff: Eric mentioned "nastygrams" could theoretically be sent >over and perhaps the data sent over could be spoofed. And it certainly is >a nasty world out there, with people who may test to see how robust our >servers really are. We may need to consistently work on methods to keep >from being flooded with DDoS attacks and other related floods, and so we >may need keep checking for what is legitimate data and what isn't. But >with indexers, perhaps we can only start out with people we trust. > > Definitely! In fact, we'll probably only go to a dstributed indexing system once we are well on our way to a working engine, so we should have some publicity, and hopefully a few more good developers on the team. >Separation of Responsibility Between WWW Server and Compute Machines: Who >should do the parsing of user input? Maybe a bit of both would do. All >user input could be encoded into the URL string (as Google does) or maybe >it could all be put into an easily-parsable form on the web side. And >should CGI be used? PHP is free and fast, and doesn't use CGI (right?) . >The data on what's searched for canbe sent through it, and the compute >machine can take the data in perhaps aneasily-parsable form. But this may >be more software-related, and we need to determine the ratio of WWW >machines and Compute machines to find the right balance. I think that >maybe there should be more WWW Servers, as taking in queries may not be >all they do. Remember that we may have personalization features and other >similar things there. > > I was thinking something similar. What I had envisioned was the WWW servers would do all HTML related creation, user interfacing, etc. It would take a query from a user, order it in a certain fashion, and request data from the compute machines (via TCP most likely). 
The WWW servers would send a search request for the different "tokens" that were entered (the WWW server parses this, and determines which compute machines to contact, what type of query, etc - more on this later on down the road) to the selected compute machine, and the compute machine responds back with the data to be shown to the user. Most of the time, the WWW server will need to contact several compute machines (and these are scalable also in a similar fashion as above), and take the data from each and compile it into a list. The WWW server will be almost completely doing web requests and HTML (or PHP) tricks, but some minimal computations could be done. >Reducing Disk Access: Much memory would be needed to keep what is most >likely to be requested in memory. And what we need is a good algorithm for >determining which data is most likely to be requested? Would it really be >what's most recent? Wouldn't it be what requested most often? Some >combination of both? And do these are have access to the same data on >this? Just more issues withdistributed computing. > > I think what we'll end up with is something that keeps in memory the most recent and most popular terms. Each compute machine will have it's own cache (which of course will be different than the others). I also think we'll have compute machines for different sections of index data. We can break up the access anyway we want - it really doesn't matter. We can have "rulesets" that the compute machines use to determine their "scope" of search, and the WWW servers will be told which compute machines have which scopes. This gives us the ability to spread the memory caches across scopes, machines, and the index as a whole, still using cheap hardware. >Priorities: We need to determine how to allocate resources. With the money >that we get, certain percentages can be allocated to certain places. But >how should this be done? The Computer and Master layers will need plenty of >hardware, but which may be more important? Is indexing more important than >searching may be what that last question comes down to. And this applies >for all layers. Which need more and/or better hardware? > > This is going to be interesting. I think we'll have to attack this one day at a time. Right now, we need disk space, and a few machines to use to get an initial setup going. Once we start indexing full-time, we'll need LOTS of disk space, and several machines for disk servers. As we get a larger index, we'll want to add compute and WWW machines, so we can support the increasing number of searches that we'll be getting. It will hit a break point, where suddenly many people know about us, and use us, yet we're still small and growing, so we'll be at a critical time. Hopefully I can score some good hardware deals with a few vendors before that happens. >-------------------------------------------------------------------------- > >Alright, now in that last section, I mostly raised questions. Now I'm >going to see what I can come up with for answers. > >First, answers to questions on the interaction of masters with indexer >clients, where I asked how many indexers each master could handle at a >time. Here, I model the system and come up with a little notation in order >to help us quantify everything and come up with some numbers that will be >useful in the overalldesign of the system. > >Let's define some variables first. 
> >M: the number of masters >N: the number of indexers >p: number of pages that each client indexes in each indexing cycle >s: size of file of indexed data for a URL > >So what we need to do is somehow come up with a maximum value for what M/N >should be. If the M:N ratio is too high, that'll lead to masters being >bogged down with requests for URLs and to have indexed data stored. And we >don't want that. Now thorughout this, I'm assuming the worst cases for >each indexing cycle. And an indexing cycle is, as you probably know, the >whole cycle of the master finding URLs in the "to be indexed" state, then >having clients request these URLs, then index data in them, and send the >indexed data back. > >So let max(M/N) be this maximum value. > >Whenever new indexers are added and >registered, the M/N ratio increases, and perhaps the server can be updated >of these changes and somehow it'll need to know how to assign it to a >master server at a time that isn't handling as many indexers at a time. >This assignment would be done dynamically, though perhaps I'm stating the >obvious here. The system just needs to know how many clients each master >is handling. > >But then p, the number of URLs that an indexer requests, could also be >made more dynamic, rather than just having a value in a configuration file >for it. The value p could be related to the length of each reindexing >interval, and I'll cover the importance of that next. > >But here are some quick little equations to put values into: > >Number of pages being indexed per cycle = p * N >Size of data handled by master per indexing cycle = max(M/N) * p * s > >In the worst case, the master handles all of this data at once. > >We want to maximize p*n. (To index as many pages as possible per cycle.) > >But max(M/N)*p*s should be capped. But how? We want to limit the amount of >time clients spend sending data, and masters spend taking it, right? We need >to take bandwith per master into consideration. > >Let's say masters get all the data at once: which is the worst case, and >would ideally be avoided. (client badwidth and processing speed can vary >quite a bit, and this could actually be good news for us, causing us to >avoid this, but I digrress.) > >Quick experiment: > >Say a master has 1.5 MBps of bandwith. (as do clients.) > >Let s=100 kB (we've been using this as a worst-case maximum figure) >Let M/N = 100 >Let p = 1000 pages/cycle > >Then data sent at given time = 100 * 1000 pages/cycle* 0.1 MB/page > = 10 000 MB/cycle > >Worst case time per cycle = 10 000 MB / 1.5 MB/s > = 6 667 s > = 111.11 min > = 1 h 51 min 7 s > > >Now two hours to upload the data does not look good, but this is absolute >worst case, and it does show the inportance of capping p and max(M/N) >values. We can just keep playing with these values, and it's what we're >working with, unless there are a few things here I'm wrong about. > >Another issue: Given p, how long would each indexer take to index p pages? >In other words, to go through its mini-cycle? What it works out to is is >follows: > >average time for indexer to >complete its part of cycle = p*avg(pagesize) + avg(indexing time per page) > --------------- > avg(bandwidth) > > >We will need to find out these averages above to get total cycle length, >which should be related to the reindexer interval. Perhaps it shoud be >dynamic based on statistics we compile as we index pages? 
Add that last >equation to worst-case time per cycle, and time master takes to find what >needs to be indexed, and there you have cycle length. > >One last thing I should mention is that these are just things that came from >my scratch pad. I've been quite interested in how such a large system would >work, and we'll need to come up with some numbers here. Now, if I'm wrong >about anything here, now would be a good time to tell me. And if so, you can >correct me and this data can be made more formal (ie. I haven't assigned >variables to worst case indexing times, and maybe I should.) > > This is awesome. I love it. This is exactly what I hope to see from someone working on this project. Ok - I think your numbers are *extreme* worst case scenario, but good to see nonetheless. There are a few things I'd say about them: first, the clients will naturally stagger themselves out a bit, so it won't take all clients 2 hrs to send their data. Plus, you are figuring 100 clients, with 1000 pages per cycle - which is a little high I think. It will probably be more like 50 pages per cycle (we don't want to slam our servers when they upload, we don't want one client to be responsible for too many URLs, and we don't want to use up too much space on the client's disk). I think you should incorporate some of these calculations in the code. Add some simple routines to time and average the page sizes, download times per page, per cycle, upload times, etc. It would be good for us to know, and nice for debugging. Eric -- ------------------------------------------------------------------ Eric Anderson Sr. Systems Administrator Centaur Technology Today is the tomorrow you worried about yesterday. ------------------------------------------------------------------ |
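[Editor's note: Eric's request to "time and average the page sizes, download times per page, per cycle, upload times" could live in one small helper shared by the harvester and controller. A possible shape, with hypothetical names; there is no Sprawler::Stats module anywhere in this thread, so this is a sketch, not existing code.]

    package Sprawler::Stats;   # hypothetical helper module
    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    my %samples;   # metric name => list of observed values

    sub start_timer { return [gettimeofday] }

    sub stop_timer {
        my ($metric, $t0) = @_;
        record($metric, tv_interval($t0));
    }

    sub record {
        my ($metric, $value) = @_;
        push @{ $samples{$metric} }, $value;
    }

    sub average {
        my ($metric) = @_;
        my @v = @{ $samples{$metric} || [] };
        return 0 unless @v;
        my $sum = 0;
        $sum += $_ for @v;
        return $sum / @v;
    }

    sub report {
        printf "%-20s avg %.3f over %d samples\n",
               $_, average($_), scalar @{ $samples{$_} }
            for sort keys %samples;
    }

    1;

The harvester would then wrap each fetch in start_timer/stop_timer('download_per_page', $t0), call record('page_size_kb', ...) after parsing, and dump report() at the end of a cycle.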
From: <mni...@mo...> - 2004-03-31 13:24:53

My additions below. I'm thinking for testing we should schedule a time
frame and try to get everyone on the irc channel, to discuss what we see. I
guess from 9:00pm EST to 12:00pm EST is good for me.

Here is the updated to-do list (as promised). Please feel free to pick
something you like off the list, send an email to this list (-devel), and
let us know what you're working on.

Harvester: (indexer.pl)
(my additions, although I've been a little out of the loop lately)
-----------------------
 o indexer doesn't completely store data in the local db files. for
   instance, the urls are stored, but not the text linking to those urls
   (the linked text should be stored with each url) (open)
 o there are a lot of index types missing (header text, small text, strong
   text, etc, etc) (open)
 o fix pick_lanquage method (Eric)
 o test and select an html parser (HTML::Parser, XML::Parser, TokeParser,
   Pull Parser) based on efficiency (Ilya)
 o methods for determining font clashes (open)
 o renaming of all classes, methods to reflect current naming convention.
   (mojo)

Controller: (master.pl)
-----------------------
 o Controller needs to check for "nastigrams" - characters and such that
   could cause the Controller to execute commands on behalf of the user it
   is running as (see the sketch after this list). (open)(mojo)
 o Methods for toggling states, re-indexing, etc (open).
 o Patch needed to make Controller only allow checkout of a max number of
   urls per Harvester, so we need to check how many they currently have
   checked out, and get the difference. (open)

Queue Processor: (queue.pl)
---------------------------
 o add a queue processing agent that goes through the db files sent by the
   harvester to the controller, parses the data out, and puts it in the
   index tree. (open)

Queue Maintainer: (maintainer.pl)
---------------------------------
 o Agent that runs independently of other programs, goes through the state
   db's, finds urls that need reindexing, and re-injects them into the
   queue by changing their state (and moving them to the corresponding
   state db). (open)

General
-------
 o TESTERS! We need your bandwidth! This is an easy way to get involved!
   (EVERYONE)
 o Should we try a certain time and have everyone join #sprawler?
 o Design good user interface for web front end (open)
 o general error checking and code robustness. (open)
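[Editor's note: for the "nastigrams" item under Controller, one conservative approach is to whitelist what a harvester may send instead of trying to enumerate dangerous characters. The field names and patterns below are illustrative guesses, not the actual Sprawler protocol.]

    use strict;
    use warnings;

    # Accept only a conservative character set in client-supplied fields,
    # so nothing a harvester sends can reach a shell or eval unescaped.
    my %pattern = (
        client_id => qr{^[\w.@-]{1,64}$},
        url       => qr{^https?://[\w.-]+(?::\d+)?(?:/[\w\-./?&=%~+]*)?$}i,
    );

    sub sanitize_field {
        my ($field, $value) = @_;
        my $re = $pattern{$field} or return undef;   # unknown field: reject
        return $value =~ $re ? $value : undef;
    }

    # Example: drop the whole request if any field fails.
    my %request = (client_id => 'test-client-1',
                   url       => 'http://www.sprawler.com/index.html');
    while (my ($field, $value) = each %request) {
        defined sanitize_field($field, $value)
            or die "rejecting request: bad $field\n";
    }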
From: <mni...@mo...> - 2004-03-31 13:14:52

>>>>> "Eric" == Eric Anderson <and...@ce...> writes:

> Processor: (processor.pl)
> -------------------------

I feel processor in general is too vague of a term. Don't get me wrong, I
don't care that much what it's named, but how about just queue.pl?

> Maintainer: (maintainer.pl)
> ----------------------------
>  o Agent that runs independently of other programs, goes through the
>    state db's and finds urls that need reindexing, and re-injects them
>    into the queue by changing their state (and moving them to the
>    corresponding state db). (open)

Jury's still out on this one.
From: J. K. <ja...@ka...> - 2004-03-31 06:04:01
Well, that map is just what I figured we needed. It does look quite good, and after going over it, there are quite a few things I'd like to say, mostly concerning the physical aspects of it. First, about the logical aspect, I have nothing against what's currently there. I can't think of anything wrong with it really, maybe a few subtle changes could be done about in each of the modules there (ie. controller vs. processor, maybe controller could do some of what procesor does, but need more detail.) And since I need more details, I'll just focus mostly on the physical design. So here's what I have to say. Physical Design Issues ---------------------- Ratio of Indexer Machines to Master Machines: It appears to be N:1, in that each master in the layer has N indexers assigned to it. Is this assignment static or dynamic? And how large or small should N be? Perhaps it should all be dynamically assigned, as we would want the load to be distributed as much as possible. And a maximum value of N needs to be set up for each master, assuming hardware differences. (Or are all indexers created equal?) How many connections should each one handle? Ratios In General: For each level, it says there are N machines. As an individual who is something of a mathematics geek, this seems to imply that all levels have equal numbers of machines. It may seem that I'm nit-picking here, but this could lead to confusion, especially when looking over the last section. Adaptability: This concerns the overall structure for all levels. How well can it adapt to changes? More specifically, can it adapt well to machines being added/upgraded on the fly? Would configuration files need to be changed? This last point ties in with the last section somewhat. Possible Redundancy: Does each master know what other masters are doing? Do they need to? How do URLs and data get partitioned among master machines? Do they all have access to a centralized data store of URLs? The physical design does not seem to indicate where this data on URLs and phases each URL is in is stored. Client Idle Time: If there's a situation where the indexer cannot connect to the master, what can it do? If servers go down, the queue could then become excessively long. And what if there's nothing left for the client to index before the next reindexing cycle (dare I suggest this so soon?) Maybe the client can just reindex the URLs again. General Bad Stuff: Eric mentioned "nastygrams" could theoretically be sent over and perhaps the data sent over could be spoofed. And it certainly is a nasty world out there, with people who may test to see how robust our servers really are. We may need to consistently work on methods to keep from being flooded with DDoS attacks and other related floods, and so we may need keep checking for what is legitimate data and what isn't. But with indexers, perhaps we can only start out with people we trust. Separation of Responsibility Between WWW Server and Compute Machines: Who should do the parsing of user input? Maybe a bit of both would do. All user input could be encoded into the URL string (as Google does) or maybe it could all be put into an easily-parsable form on the web side. And should CGI be used? PHP is free and fast, and doesn't use CGI (right?) . The data on what's searched for canbe sent through it, and the compute machine can take the data in perhaps aneasily-parsable form. But this may be more software-related, and we need to determine the ratio of WWW machines and Compute machines to find the right balance. 
I think that maybe there should be more WWW Servers, as taking in queries may not be all they do. Remember that we may have personalization features and other similar things there. Reducing Disk Access: Much memory would be needed to keep what is most likely to be requested in memory. And what we need is a good algorithm for determining which data is most likely to be requested? Would it really be what's most recent? Wouldn't it be what requested most often? Some combination of both? And do these are have access to the same data on this? Just more issues withdistributed computing. Priorities: We need to determine how to allocate resources. With the money that we get, certain percentages can be allocated to certain places. But how should this be done? The Computer and Master layers will need plenty of hardware, but which may be more important? Is indexing more important than searching may be what that last question comes down to. And this applies for all layers. Which need more and/or better hardware? -------------------------------------------------------------------------- Alright, now in that last section, I mostly raised questions. Now I'm going to see what I can come up with for answers. First, answers to questions on the interaction of masters with indexer clients, where I asked how many indexers each master could handle at a time. Here, I model the system and come up with a little notation in order to help us quantify everything and come up with some numbers that will be useful in the overalldesign of the system. Let's define some variables first. M: the number of masters N: the number of indexers p: number of pages that each client indexes in each indexing cycle s: size of file of indexed data for a URL So what we need to do is somehow come up with a maximum value for what M/N should be. If the M:N ratio is too high, that'll lead to masters being bogged down with requests for URLs and to have indexed data stored. And we don't want that. Now thorughout this, I'm assuming the worst cases for each indexing cycle. And an indexing cycle is, as you probably know, the whole cycle of the master finding URLs in the "to be indexed" state, then having clients request these URLs, then index data in them, and send the indexed data back. So let max(M/N) be this maximum value. Whenever new indexers are added and registered, the M/N ratio increases, and perhaps the server can be updated of these changes and somehow it'll need to know how to assign it to a master server at a time that isn't handling as many indexers at a time. This assignment would be done dynamically, though perhaps I'm stating the obvious here. The system just needs to know how many clients each master is handling. But then p, the number of URLs that an indexer requests, could also be made more dynamic, rather than just having a value in a configuration file for it. The value p could be related to the length of each reindexing interval, and I'll cover the importance of that next. But here are some quick little equations to put values into: Number of pages being indexed per cycle = p * N Size of data handled by master per indexing cycle = max(M/N) * p * s In the worst case, the master handles all of this data at once. We want to maximize p*n. (To index as many pages as possible per cycle.) But max(M/N)*p*s should be capped. But how? We want to limit the amount of time clients spend sending data, and masters spend taking it, right? We need to take bandwith per master into consideration. 
Let's say masters get all the data at once, which is the worst case and
would ideally be avoided. (Client bandwidth and processing speed can vary
quite a bit, and this could actually be good news for us, causing us to
avoid this, but I digress.)

Quick experiment:

Say a master has 1.5 MB/s of bandwidth (as do clients).

Let s = 100 kB (we've been using this as a worst-case maximum figure)
Let M/N = 100
Let p = 1000 pages/cycle

Then data sent at a given time = 100 * 1000 pages/cycle * 0.1 MB/page
                               = 10 000 MB/cycle

Worst case time per cycle = 10 000 MB / 1.5 MB/s
                          = 6 667 s
                          = 111.11 min
                          = 1 h 51 min 7 s

Now two hours to upload the data does not look good, but this is the
absolute worst case, and it does show the importance of capping p and
max(M/N) values. We can just keep playing with these values, and it's what
we're working with, unless there are a few things here I'm wrong about.

Another issue: given p, how long would each indexer take to index p pages?
In other words, to go through its mini-cycle? What it works out to is as
follows:

average time for indexer to      p*avg(pagesize)
complete its part of cycle   =   ---------------  +  avg(indexing time per page)
                                 avg(bandwidth)

We will need to find out these averages above to get total cycle length,
which should be related to the reindexer interval. Perhaps it should be
dynamic, based on statistics we compile as we index pages? Add that last
equation to the worst-case time per cycle, plus the time the master takes
to find what needs to be indexed, and there you have cycle length.

One last thing I should mention is that these are just things that came
from my scratch pad. I've been quite interested in how such a large system
would work, and we'll need to come up with some numbers here. Now, if I'm
wrong about anything here, now would be a good time to tell me. And if so,
you can correct me and this data can be made more formal (i.e. I haven't
assigned variables to worst case indexing times, and maybe I should).

Thanks,
J.K.

> I've written up a "map", or floorplan of some of the conceptual layout
> of the project - both physical and logical. I've put it here:
>
> http://www.sprawler.com/Sprawler-map.pdf
>
> Please feel free to comment, ask questions, etc. Specially on the
> naming of things.
>
> Eric
From: Eric A. <and...@ce...> - 2004-03-31 04:29:14

Here is the updated to-do list (as promised). Please feel free to pick
something you like off the list, send an email to this list (-devel), and
let us know what you're working on.

Harvester: (indexer.pl)
-----------------------
 o indexer doesn't completely store data in the local db files. for
   instance, the urls are stored, but not the text linking to those urls
   (the linked text should be stored with each url) (open)
 o there are a lot of index types missing (header text, small text, strong
   text, etc, etc) (open)
 o fix pick_lanquage method (Eric)
 o test and select an html parser (HTML::Parser, XML::Parser, TokeParser,
   Pull Parser) based on efficiency (Ilya)
 o methods for determining font clashes (open)

Controller: (master.pl)
-----------------------
 o Controller needs to check for "nastigrams" - characters and such that
   could cause the Controller to execute commands on behalf of the user it
   is running as. (open)
 o Methods for toggling states, re-indexing, etc (open).
 o Patch needed to make Controller only allow checkout of a max number of
   urls per Harvester, so we need to check how many they currently have
   checked out, and get the difference (see the sketch after this list).
   (open)

Processor: (processor.pl)
-------------------------
 o add a queue processing agent that goes through the db files sent by the
   harvester to the controller, parses the data out, and puts it in the
   index tree. (open)

Maintainer: (maintainer.pl)
---------------------------
 o Agent that runs independently of other programs, goes through the state
   db's, finds urls that need reindexing, and re-injects them into the
   queue by changing their state (and moving them to the corresponding
   state db). (open)

General
-------
 o TESTERS! We need your bandwidth! This is an easy way to get involved!
   (EVERYONE)
 o Design good user interface for web front end (open)
 o general error checking and code robustness. (open)

--
------------------------------------------------------------------
Eric Anderson     Sr. Systems Administrator     Centaur Technology
Today is the tomorrow you worried about yesterday.
------------------------------------------------------------------
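[Editor's note: the max-checkout patch on this list mostly amounts to bookkeeping: track how many URLs each harvester already holds and hand out only the difference. A sketch under assumed names; the 50-URL cap echoes the per-cycle figure Eric mentions elsewhere in the thread, but is not a settled value.]

    use strict;
    use warnings;

    my $MAX_URLS_PER_HARVESTER = 50;   # per-cycle cap discussed on the list

    # How many URLs each harvester currently has checked out (assumed
    # in-memory view; the real Controller would read this from the state db).
    my %checked_out_count;

    # Hand out at most the difference between the cap and what is already out.
    sub urls_to_hand_out {
        my ($client_id, $requested) = @_;
        my $already = $checked_out_count{$client_id} || 0;
        my $room    = $MAX_URLS_PER_HARVESTER - $already;
        $room = 0 if $room < 0;
        my $grant = $requested < $room ? $requested : $room;
        $checked_out_count{$client_id} += $grant;
        return $grant;
    }

    # When the harvester checks URLs back in, free up its quota.
    sub check_in {
        my ($client_id, $count) = @_;
        $checked_out_count{$client_id} -= $count;
        $checked_out_count{$client_id} = 0 if $checked_out_count{$client_id} < 0;
    }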
From: Eric A. <and...@ce...> - 2004-03-31 04:28:21
|
As noted in the sprawler-map.pdf I pointed to a couple of weeks back, I proposed changing the names of the various parts of the Sprawler software. Since I have heard no objections, I plan on doing that this weekend unless I hear cries and complaints by then.

New to-do list coming up..

Eric

--
------------------------------------------------------------------
Eric Anderson          Sr. Systems Administrator
Centaur Technology
Today is the tomorrow you worried about yesterday.
------------------------------------------------------------------
|
From: Eric A. <and...@ce...> - 2004-03-31 04:26:44
|
Just to keep everyone in the loop, I've recently built another Sprawler development/test box - this machine has the following specs:

Processor:  P4 1.8GHz (512K Cache)
Memory:     768MB RAM (DDR 333)
Disk:       ~350GB usable disk space
Connection: T1

It has room for growth, so as we need more disk space for testing/building indexes, we'll add drives. Once this machine is maxed out, we should be ready to roll into a full layout. If anyone has spare hard disks lying around, feel free to donate them.

Eric

--
------------------------------------------------------------------
Eric Anderson          Sr. Systems Administrator
Centaur Technology
Today is the tomorrow you worried about yesterday.
------------------------------------------------------------------
|
From: J. K. <ja...@ka...> - 2004-03-26 06:40:40
|
Hello there.

> I think we must to add next fields of HTTP header:
>
> Expires:
> it may be helpful for master - not index document earle then Expires
> date.
>
> Last-Modified:
> it might be a good for indexer:
> if (date_now>Expires: and Last-Modified:<Expires:)
> {
> not index this document
> }
>
> or
>
> if (date_of_index<=Last-Modified:)
> {
> not index this document
> }

I supported the whole idea of using this information in the HTTP headers. But after chatting with Eric on the #sprawler channel on irc.freenode.net the other day, he informed me that the "Last-Modified" date isn't always accurate. That doesn't mean we can't use it, it just means we can't rely on it.

But there is some other information we can get, and that would be the size of the document, which you may notice is what Google stores for pages it has cached (for some reason, non-cached pages don't have this information in Google's results.) Now, I do understand the pages may change before we can reindex them, but do they tend to change so much that the size data would be that far off? Having that stored on a results page can give users an idea of the size of a page, which can be very useful if they want to know how much content is there. It's also good to know when you have a dialup connection. :)

Here's a tweak for you: I see that we have a line in Client.pm that says:

  my @docheader=LWP::Simple::head($document);

Also added was:

  $self->{CONTENT_TYPE}=undef;

Well, we can add this line after where we get the header:

  $self->{SIZE_IN_BYTES}=$docheader[1];

Of course, the SIZE_IN_BYTES value would need to be declared first, but you get the idea. If I'm not mistaken, other values returned from the header are the modification time and expiration date (in that order in the array returned by the function.) We can take that info as well, but what we do with it is another story.

Anyway, just thought I'd throw that in. I've mentioned that any data we can get from headers may be valuable, and size can also be good for gathering statistics, giving us a better idea of average web page size for our purposes. It'll be good to know, so we'll know how much hardware we'll need. And then there's the issue of caching web pages, where page size, of course, is definitely a factor.

On a completely unrelated note, I plan on commenting on the Sprawler map within the next few days (this time I'll make sure of it, Eric.) So much to do, so little time.

Thanks,
J.K.
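[For reference, here is a small standalone Perl sketch of the kind of header capture the message above describes. In list context, LWP::Simple::head() returns ($content_type, $document_length, $modified_time, $expires, $server). The hash keys other than SIZE_IN_BYTES and CONTENT_TYPE, the example url, and the reindex policy are illustrative assumptions, not the actual Client.pm code.]

#!/usr/bin/perl
# Standalone sketch of pulling size/date information from a HEAD request,
# in the spirit of the Client.pm tweak above.
use strict;
use warnings;
use LWP::Simple ();

my $document = 'http://www.example.com/';   # placeholder url

# In list context head() returns:
#   ($content_type, $document_length, $modified_time, $expires, $server)
my @docheader = LWP::Simple::head($document);

my %page = (
    CONTENT_TYPE  => $docheader[0],
    SIZE_IN_BYTES => $docheader[1],   # may be undef if the server omits Content-Length
    LAST_MODIFIED => $docheader[2],   # epoch seconds; not always trustworthy
    EXPIRES       => $docheader[3],
);

# Example policy sketch: skip reindexing if the page claims not to have
# changed since we last indexed it (a hypothetical timestamp here).
my $date_of_index = time() - 7 * 24 * 3600;
if (defined $page{LAST_MODIFIED} && $page{LAST_MODIFIED} <= $date_of_index) {
    print "skip: not modified since last index\n";
} else {
    print "reindex: modified or unknown\n";
}
|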
From: J. K. <ja...@ka...> - 2004-03-26 06:22:41
|
Hello.

> I think it looks great, and is very well done. I have only read through
> the beginning in detail, and skimmed the rest, but it looks awesome!
> I'm curious about what everyone thinks we should do - have a separate
> document like this for the documentation, or insert it inline with the
> code? I like inline, but it can make it harder to skim through code
> without all the docs.

Well, it's good to hear that you (and Mojo) like what I've come up with. Mojo said that this will facilitate changes and new additions, and that's just what I was hoping to do with this. It might be best to leave it as it is in order to help us understand it, see the big picture, etc. And having it inline would be good as well. You can just copy and paste it all in, and whenever changes are made, if you'd like to have it externally, all it'll take is a quick script to update the external file. I think if we just mark the commented data with extra sharp symbols (which can also be done by a short script), then this other script for updating the external documentation can be run automatically. In fact, I'm attaching a text file with two separate Perl scripts for doing those two things.

>> Did I go into too much or too little detail? There are some
>> inconsistencies in the format, and what is in it (i.e. I didn't include
>> the who-calls-what, but that may be a little too detailed for internal
>> documentation; perhaps it's more appropriate for more external
>> documentation, which I may come up with next.) Also, it might not be
>> that easy to understand, in particular, where there are optional
>> parameters.
>
> Perfect amount of detail I think.

Perhaps it is, although maybe I should include more of the who-calls-who for each method wherever possible, in order to give a better idea of not just what the methods do, but how they interact. Also, maybe there are some ways I can change the format to make it look more readable. I'd like to hear any suggestions you may have.

>> You can tell me what it is that you'd like done here, as I do plan on
>> expanding on it, cleaning it up, etc. I say we should also keep the big
>> picture in mind, and while Eric's Sprawler map was highly informative,
>> we need to bridge the gap between that and the code. This is to help with
>> that. You could think of it as "Sprawler Code for Dummies" or whichever
>> you prefer. Anyway, I think I'll go look through that TODO list for
>> another task to work on, even though I may work a little more on this.
>> You can do what you want with it, though. And maybe this is a file that
>> can be kept separate, for a lookup of all classes and the methods in
>> them.
>
> Maybe we should put all these notes/docs on the website? The SF website?

Hmm, I was actually thinking of having it in the repository. It is something that'd be updated quite often, and since it gives somewhat low-level details on the code, maybe it should live with the code in the repository. I was thinking it'll be easier to just commit this rather than update any of our websites.

>> On a remotely-related note, here's something to take a look at:
>> http://sourceforge.net/projects/dotproject/
>> I've taken a look at it and am thinking that is something we could use.
>> And I must say I like their home page. It's just another reason we can
>> think of setting up our own blog. And with this project, we can keep
>> our internal documentation there, rather than have to go through
>> mailing list archives. Just a thought.
>
> I've looked at that before. Very interesting - we could use it, but we
> don't even use the sourceforge stuff that much as it is. Maybe we should
> start? There are task managers, etc.
>
> In fact - what do you think about putting all this documentation,
> todo's, etc, in the task manager and documentation manager? Would you
> like to maintain this?

Silly me, I hadn't thought very much about using what's there. We definitely need to have the Task Manager that's currently there updated (it hasn't been in months), and perhaps the whole project description as well, as I don't think we're the first to have open source search technology. We could also use the Doc Manager, although that isn't something I've seen used very often. Maybe I can look into that, and then I can tell you if that's something I'd like to work on (or if we should use what's there at all.) But those are some good ideas.

Thanks,
J.K.

> Eric
>
> --
> ------------------------------------------------------------------
> Eric Anderson          Sr. Systems Administrator
> Centaur Technology
> Today is the tomorrow you worried about yesterday.
> ------------------------------------------------------------------
>
>
> -------------------------------------------------------
> This SF.Net email is sponsored by: IBM Linux Tutorials
> Free Linux tutorial presented by Daniel Robbins, President and CEO of
> GenToo technologies. Learn everything from fundamentals to system
> administration. http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
> _______________________________________________
> Sprawler-devel mailing list
> Spr...@li...
> https://lists.sourceforge.net/lists/listinfo/sprawler-devel
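[The two attached scripts mentioned above aren't in the archive. As a rough idea of what such an extraction pass could look like, here is a hypothetical sketch that pulls doubled-sharp (##) comment blocks out of the .pm files into an external documentation file; the ## convention, file paths, and output name are assumptions for illustration, not the actual attachment.]

#!/usr/bin/perl
# Hypothetical sketch of an "extract inline docs" pass, in the spirit of
# the attachment described above.  Assumes documentation comments in the
# modules are marked with a doubled sharp (##).
use strict;
use warnings;

my $outfile = 'DOCUMENTATION.txt';        # assumed output file
open my $out, '>', $outfile or die "can't write $outfile: $!";

for my $module (glob 'lib/*.pm') {        # e.g. lib/Sprawler.pm
    open my $in, '<', $module or die "can't read $module: $!";
    print {$out} "==== $module ====\n";
    while (my $line = <$in>) {
        # Keep only the doc comments, stripping the leading '## '.
        if ($line =~ /^\s*##\s?(.*)$/) {
            print {$out} "$1\n";
        }
    }
    print {$out} "\n";
    close $in;
}
close $out;
print "wrote $outfile\n";
|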