The webcrawler prototype has been completed. It is an incredibly rudimentary webcrawler and will often break: it can't handle IP addresses, tries to view all sorts of linked files, and in some cases enters an odd infinite recursion.
As the addresses we will be passing to it will be homogeneous, and all the linked pages will contain the same content, I don't foresee any problems with using this as the webcrawl engine for our project. Once the first version of the project is done, *then* we can go back and worry about making the webcrawler perfect, if it even matters (for something other than self-pride, of course).
I've written and submitted (both to SVN and to our documentation) a technical document on the webcrawl.py script. The doc contains a detailed overview of the webcrawler prototype, with commentary on each variable, object, method and the algorithms therein.
This should be useful for debugging and modifying the webcrawl. I believe that we can use a large portion of this prototype (once finished) for our Facebot spider.
I implemented the meat of the recursiveVisit() function. I say "the meat" because it's still buggy as all hell.
What needs to be done now: the regular expressions need to be honed down (after a while they start to grab random HTML around the link URLs), and we need to figure out just what recursiveVisit() should return.
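To make both problems concrete, here's a rough sketch of what a tightened link regex and a recursiveVisit-style function might look like. All of the names here are illustrative, not the actual webcrawl.py code; the regex anchors on the quoted href value so it stops grabbing stray HTML around the URL, and returning the set of visited URLs is just one plausible answer to the "what should it return" question (the seen-set also doubles as a guard against the infinite recursion).

```python
import re

# Anchor on the quoted href attribute value so only the URL is captured,
# not the surrounding HTML. (Illustrative, not the actual prototype regex.)
LINK_RE = re.compile(r'<a\s+[^>]*href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def extract_links(html):
    """Return the href targets found in an HTML string."""
    return LINK_RE.findall(html)

def recursive_visit(url, fetch, seen=None):
    """Visit url, then recurse into each not-yet-seen link.

    Returns the set of visited URLs; checking the seen-set before
    recursing is what prevents two pages that link to each other from
    sending the crawl into infinite recursion.
    """
    if seen is None:
        seen = set()
    if url in seen:
        return seen
    seen.add(url)
    for link in extract_links(fetch(url)):
        recursive_visit(link, fetch, seen)
    return seen
```

With a fake `fetch` backed by a dict of pages, two pages linking to each other terminate cleanly instead of recursing forever.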
...and then there comes the matter of the actual *project*...
I accidentally changed html_parser.py on the SVN system. This is due to my complete lack of knowledge of how to use this thing (which explains why the update log for the file is talking about the unfinished webcrawler prototype). However, these changes are largely superficial and may or may not be worth reverting (Justin, decide for yourself, I guess).
I have added a "website generator" script to the repository. This script will be used to generate dummy "websites" on my local HD, with which we can test and debug webcrawlers.
My AIM transcripts are perfectly suited to this. I have managed to generate a shallow (two levels deep) website, basically an index page with many links (my AIM transcripts in this case).
How it works (from the source comments):
This script is executed from the topmost directory of some file tree, where each of the child directories contains files whose links are desired on the index page. The script visits each directory in alphabetical order. Upon reaching a leaf directory (a child directory with no children of its own), or once all children of the current node have been visited, it generates a link on the index page to each *.html, *.htm, and *.txt file in the directory.