I think I got it, although it could be one of two things...
Looking at the rundig -vvv output, I noticed that it didn't find
robots.txt in each of the sites. I copied a robots.txt into each site.
I also moved the first URL to the end of the list.
I deleted the DB files and re-ran rundig. Now when I search I can find
things in all sites.
Thanks for all your help.
From: Johnson, S
Sent: Tuesday, November 11, 2003 9:42 AM
To: Jim Cole
Subject: RE: [htdig] Searching multiple sites
The indexing is being performed on the same server that hosts the
sites... The sites all together are about 6 GB in size.
I was using vi for the editor... I went back into the config per your
suggestion and verified that I have spaces between all the site names.
Your suggestion on limit_urls_to may have merit...
I have the main site: mysite.k12.mn.us, then I have DNS shortcuts to the
schools: school1.mysite.k12.mn.us, school2.mysite.k12.mn.us, etc... The
directory structure is within the main site's URL. I'm thinking that if
I remove the main page/main site URL from the search that it may work?
Still doesn't explain the DB size for the main page...
I'll work with these suggestions and post my results.
From: Jim Cole [mailto:lists@...
Sent: Friday, November 07, 2003 12:20 AM
To: Johnson, S
Subject: Re: [htdig] Searching multiple sites
On Thursday, November 6, 2003, at 09:24 AM, Johnson, S wrote:
> I performed the htdig which took a very long time (which I
> expected). After which I verified the DB files and they're around 15
> MB in size. So I'm thinking that it did gather all the search info for
> the sites.
Any idea how big the sites are? Are you using a slow connection to
index the sites? I wouldn't expect it to take a "very long time" to
index sites that resulted in a 15 MB database unless you are working
with very limited bandwidth or in some other way throttling htdig.
> Now when I go to search I type in a term that should bring up hundreds
> of terms but only get 1 hit. This hit happens to be on the first
> I search which is only one page in length.
You might want to double check your configuration file and verify that
there are no problems with the start_url attribute. If for example your
editor automatically inserted some line breaks, it might be that htdig
is only seeing the first URL.
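
To make the line-break concern concrete, here is a minimal Python sketch.
It uses a simplified model of a config reader, not htdig's actual parser:
an attribute value is assumed to end at the first newline unless the line
ends with a backslash, which is how I understand continuation lines to
work in the ht://Dig configuration file. The function name read_attribute
and the http:// forms of the site names are made up for illustration.

```python
def read_attribute(config_text, name):
    """Return the whitespace-separated values of one attribute.

    Simplified model (an assumption, not htdig's parser): a value ends
    at the first newline unless the line ends with a backslash, which
    continues the value on the next line.
    """
    lines = config_text.splitlines()
    prefix = name + ":"
    for i, line in enumerate(lines):
        if not line.startswith(prefix):
            continue
        value = line[len(prefix):].rstrip()
        # Follow backslash continuations onto the following lines.
        while value.endswith("\\"):
            value = value[:-1]
            i += 1
            if i < len(lines):
                value = (value + " " + lines[i]).rstrip()
        return value.split()
    return []


# Hypothetical config text: every line but the last ends with a backslash.
GOOD = (
    "start_url: http://mysite.k12.mn.us/ \\\n"
    "    http://school1.mysite.k12.mn.us/ \\\n"
    "    http://school2.mysite.k12.mn.us/\n"
)

# Same URLs, but an editor wrapped the line without adding backslashes.
BROKEN = (
    "start_url: http://mysite.k12.mn.us/\n"
    "    http://school1.mysite.k12.mn.us/\n"
    "    http://school2.mysite.k12.mn.us/\n"
)

print(len(read_attribute(GOOD, "start_url")), "URLs seen in GOOD")
print(len(read_attribute(BROKEN, "start_url")), "URL seen in BROKEN")
```

Under this model the wrapped version yields only the first URL, which
would match the symptom of getting hits from a single site.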
Did you modify your limit_urls_to attribute? Or are you still using the
default? If incorrectly modified, this attribute could be excluding
some of the pages that you are trying to index. If on the other hand
you are using the default, what do the URLs look like in your start_url
attribute? The default limit_urls_to assumes that you are not
explicitly providing the name of the start page. For example, it
assumes something like http://server.tld/path/ rather than
http://server.tld/path/index.html. If you are using start URLs of the
latter form with the default limit_urls_to, htdig will exclude
everything except for the initial page.
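
The effect of the default limit can be sketched in a few lines of Python.
The sketch rests on two assumptions that simplify what the documentation
describes: the default limit_urls_to is the value of start_url, and a URL
passes if it contains one of the limit strings as a substring. The
function name within_limits and the sample page names are invented for
illustration; this is not htdig's code.

```python
def within_limits(url, limits):
    """True if the URL contains at least one of the limit strings.

    Assumption: limits are matched as plain substrings of the URL.
    """
    return any(limit in url for limit in limits)


# Hypothetical pages that the crawler might discover.
candidate_urls = [
    "http://server.tld/path/index.html",
    "http://server.tld/path/about.html",
    "http://server.tld/path/news/today.html",
]

# Start URL without a file name: every page under /path/ matches.
dir_limits = ["http://server.tld/path/"]

# Start URL that names the start page: only that page matches.
page_limits = ["http://server.tld/path/index.html"]

for limits in (dir_limits, page_limits):
    kept = [u for u in candidate_urls if within_limits(u, limits)]
    print(limits[0], "->", len(kept), "of", len(candidate_urls), "kept")
```

With the directory form all three sample pages pass; with the index.html
form only the start page itself does.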
If the above items don't appear to be related to the problem you are
encountering, you might want to take a look at
http://www.htdig.org/FAQ.html#q5.25 and the other FAQs it references.
These address a number of issues related to documents being missed
> I re-read the FAQ and it talks about using restrict to search multiple
> sites. I looked at the description for this metatag and it didn't
> really explain what I needed to do to use this. I then looked at the
> search.html file that I got from htdig to test. I noticed a restrict
> line in there so I typed in the url I wanted to restrict the search on
> (http://mysite.school.k12.mn.us) and reloaded the form in my
> browser. It now doesn't find anything...
I would avoid messing with restrict until you resolve the more
fundamental problem. It sounds like you are using it correctly, but at
this point it is likely as not to just complicate the process of
tracking down the missing pages.
> Does anyone have any suggestions on what I can do to fix this?
If none of the above helps, rerun the indexing process with -vvv and
carefully examine the output. There are two things in particular to
look for. First try to determine that htdig is actually seeing all of
the URLs that you want indexed. Second look for messages stating that
URLs are being rejected. Some rejected URLs are usually to be expected,
but if any of the URLs that you are trying to grab show up as rejected,
then the accompanying reason might point you to the source of the
> How do I search on everything in the database?
Try using an asterisk (*) as the search term.