I'm trying to spider a domain to retrieve all internal links for that site so I can create a sitemap based on it. Can this be done with Snoopy?
Yes, although you have to walk the tree yourself. Have you ever written a program to traverse a tree (actually a directed graph in this case)? You have to keep track of the path you are taking, either with a subroutine that calls itself, so that its stack holds your path, or by keeping the path yourself in an array. You keep track of the URLs you've visited and also the ones you've seen (a link to) but haven't visited yet (to get their links). At the end, the latter list is empty and the former is full of your URLs.
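Here is a rough sketch of that bookkeeping, written in Python only to keep it short; the same structure should carry over to PHP. The names crawl, get_links and FAKE_SITE are made up for illustration, and get_links stands in for whatever fetches a page and returns its href values (that is where Snoopy would be called). It assumes "internal" simply means "same host as the start URL".

```python
from urllib.parse import urljoin, urlparse

def crawl(start_url, get_links):
    """Return every internal URL reachable from start_url.

    get_links is a hypothetical callable: given a URL, it returns the
    href values found on that page.
    """
    host = urlparse(start_url).netloc
    visited = set()          # URLs already fetched
    to_visit = [start_url]   # URLs seen but not yet fetched

    while to_visit:
        url = to_visit.pop()
        if url in visited:
            continue         # the same URL can be queued more than once
        visited.add(url)
        for href in get_links(url):
            # Resolve relative links and drop #fragments.
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == host and link not in visited:
                to_visit.append(link)
    return visited


# A made-up three-page site, used here instead of real HTTP requests.
FAKE_SITE = {
    "http://example.com/": ["/a", "/b", "http://other.example/x"],
    "http://example.com/a": ["/", "/b#top"],
    "http://example.com/b": ["a"],
}

sitemap = crawl("http://example.com/", lambda u: FAKE_SITE.get(u, []))
print(sorted(sitemap))
```

When the loop ends, to_visit is empty and visited holds the URLs for the sitemap. The link to other.example is seen but never followed because its host differs.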
c h v o l a t a o l d o t c o m
"Dedicated to all those people who think that computers can't figure out that NAME AT ISP DOT COM is an email address." - CSV
Thanks for the info. I haven't written one myself, but I'm looking into it.
What's the best/fastest way to do this for larger sites? Using arrays & regular expressions or would using a database help any?
Would this even be possible to do on a single server if the site had thousands of pages?
Thanks for your help!
>Thanks for the info. I haven't written one myself, but I'm looking into it.
>What's the best/fastest way to do this for larger sites? Using arrays & regular expressions or would using a database help any?
If you mean keeping the information in an SQL table while gathering the links, then a local in-memory array is far faster.
I don't know of anything special you can do to make it faster. The code itself is short but complex (unconventional), and nearly all of the time is spent in calls to Snoopy, so there is probably no need to try to gather the Snoopy calls into one place. The program will be constantly waiting for Snoopy anyway.
There may be some system tricks to speed up Snoopy; I don't know. I also believe that Snoopy has an entry point to gather up the links, but that is probably so generalized that you can do better by coding that yourself (I always do).
Some people (especially AI people) talk about there being two ways to traverse a tree: depth-first and breadth-first. I don't know if that applies, as I always do it the same way.
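For what it's worth, the two orders differ only in which pending page is taken next: the newest one (a stack, depth-first) or the oldest one (a queue, breadth-first). A small sketch in Python, with a made-up link graph and a hypothetical visit_order helper, shows the difference:

```python
from collections import deque

def visit_order(graph, start, breadth_first):
    """Return pages in the order they would be fetched."""
    visited = set()
    pending = deque([start])
    order = []
    while pending:
        # Oldest pending page first = breadth-first,
        # newest pending page first = depth-first.
        page = pending.popleft() if breadth_first else pending.pop()
        if page in visited:
            continue
        visited.add(page)
        order.append(page)
        for link in graph.get(page, []):
            if link not in visited:
                pending.append(link)
    return order


# Made-up site structure: page name -> pages it links to.
GRAPH = {
    "home": ["about", "products"],
    "about": ["team"],
    "products": ["widget"],
}

bfs = visit_order(GRAPH, "home", breadth_first=True)
dfs = visit_order(GRAPH, "home", breadth_first=False)
print(bfs)
print(dfs)
```

Both calls end up with the same set of pages, just in a different order, so for building a sitemap either one should do.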
At the lower level, of course you should do things like use the PHP strpos function to find each link, rather than going through every character of HTML and checking.
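The idea, sketched in Python with str.find playing the role of strpos: jump straight from one match to the next instead of inspecting each character. The function name extract_hrefs is made up, and the sketch assumes links are written as lowercase href="..." with double quotes; real pages vary more than that.

```python
def extract_hrefs(html):
    """Collect double-quoted href values by jumping between matches."""
    links = []
    marker = 'href="'
    pos = 0
    while True:
        start = html.find(marker, pos)   # next occurrence, or -1 if none
        if start == -1:
            break
        start += len(marker)             # first character of the URL
        end = html.find('"', start)      # closing quote
        if end == -1:
            break                        # unterminated attribute, stop
        links.append(html[start:end])
        pos = end + 1
    return links


page = '<a href="/a">A</a> <a class="x" href="b.html">B</a>'
print(extract_hrefs(page))
```

Single-quoted or unquoted attributes, uppercase HREF, and commented-out markup would all need extra handling.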
>Would this even be possible to do on a single server if the site had thousands of pages?
I think Snoopy typically takes about a second to fetch a page, so do the math.
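To spell the math out, here is a tiny estimate in Python. It takes the "about a second per page" figure above at face value and assumes pages are fetched strictly one after another; the page counts are arbitrary examples and crawl_minutes is just an illustrative helper.

```python
SECONDS_PER_PAGE = 1.0   # the rough figure quoted above, not a measurement

def crawl_minutes(pages, seconds_per_page=SECONDS_PER_PAGE):
    """Estimated wall-clock minutes for a strictly sequential crawl."""
    return pages * seconds_per_page / 60

for pages in (1_000, 5_000, 10_000):
    print(f"{pages} pages: about {crawl_minutes(pages):.0f} minutes")
```

So a site with a few thousand pages comes out at an hour or two on one server, which is slow but workable.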
Did I answer all of your questions?
>Thanks for your help!
>Did I answer all of your questions?
Yes, you did. Many thanks for your time and knowledge..