From: Darren S. E. <ds...@mo...> - 2003-03-12 08:36:02
Hello,

One of my clients is a medium-sized web site that uses ht://Dig for its
search engine. The engine indexes local HTML files on the site, as well as
certain remote web sites. Local and remote documents are stored in a single
database.

What I want to be able to do is have a once-a-week run that checks all
local and remote pages, and a once-a-day run that *only* checks local
pages.

A few things I've tried which have failed:

- The -m option is not good enough, because it instructs htdig to check
  *only* the URLs specified. If I give it the main site URL, it checks just
  the front page; the URL is not treated as a pattern.

- Setting start_url and/or limit_urls_to to contain only the local site
  URLs doesn't work. Unfortunately, remote documents already in the
  database still get checked.

- Setting local_urls_only for the htdig run doesn't work either: it causes
  htmerge to remove the remote documents from the database, whether or not
  local_urls_only is set when htmerge runs.

- Running htdump (before the htdig and htmerge steps), grepping the output
  for local URLs, and feeding the result to the -m option *almost* works,
  but it won't catch new files.

One solution which will probably work is to run find in the local document
root, convert the filenames to URLs, and use the -m option with the
results.
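Something along these lines, perhaps (an untested sketch; the docroot, URL
prefix, and config path below are placeholders, not the site's real
values):

    #!/bin/sh
    # Daily "local pages only" update.  Build the URL list fresh from the
    # filesystem, so new files get picked up (unlike the htdump approach
    # above), then restrict htdig to exactly that list.
    DOCROOT=/var/www/html             # placeholder
    URLPREFIX=http://www.example.com  # placeholder
    CONF=/etc/htdig/htdig.conf        # placeholder
    URLFILE=/tmp/local-urls.$$

    # Turn each local HTML file's path into the corresponding URL.
    find "$DOCROOT" -type f \( -name '*.html' -o -name '*.htm' \) \
        | sed "s|^$DOCROOT|$URLPREFIX|" > "$URLFILE"

    # -m restricts htdig to the URLs listed in the file (check the htdig
    # man page for your version's exact semantics).
    htdig -c "$CONF" -m "$URLFILE"
    htmerge -c "$CONF"

    rm -f "$URLFILE"

The weekly cron job would then just run htdig and htmerge without -m, so
the remote pages get refreshed as well.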
Of course, this wouldn't work if I wanted to check SOME remote sites every
day.

Any other suggestions?

Thanks,

-- 
Darren Stuart Embry
http://www.webonastick.com/