From: <mic...@bt...> - 2005-04-27 16:25:36
If you want all files indexed within that tree, then you could use some
sort of script to dump out a recursive directory listing to a file, then
use that file as the source for your start_url. If you only want a
subset, that technique might not be practical.

Mike

> -----Original Message-----
> From: htd...@li...
> [mailto:htd...@li...] On Behalf
> Of Andreas Siegert
> Sent: 27 April 2005 17:13
> To: htd...@li...
> Subject: [htdig] indexing a directory tree in the file system
>
> Hi,
>
> is there a simple incantation of htdig that would allow me to index a
> file tree (not via the web server)?
> I have a hodgepodge of files, all text, some with and some without
> extensions, that I would like to have full-text search capabilities on.
> But I cannot figure out how to get them all indexed. Some do, some
> don't. (I am using a file:// URL as the starting point.)
> I am using 3.2.0b6.
> It looks like htdig is geared mainly towards web site indexing and I am
> trying to bend it too much...
>
> thx
> afx
> --
> atsec information security GmbH    Phone: +49-89-44249830
> Steinstrasse 70                    Fax:   +49-89-44249831
> D-81667 Muenchen, Germany          Web:   atsec.com
> May the Source be with you!
>
> _______________________________________________
> ht://Dig general mailing list: <htd...@li...>
> ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
> List information (subscribe/unsubscribe, etc.)
> https://lists.sourceforge.net/lists/listinfo/htdig-general
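Mike's suggestion can be sketched as a small shell pipeline on a
Unix-like system. This is a minimal sketch, not anything from the
thread itself; the demo tree and the `/tmp/start.url` output path are
purely illustrative:

```shell
# Build a small demo tree, then dump a recursive listing of it as
# file:// URLs, one per line, suitable as a start_url source.
# (All paths here are illustrative.)
DOCROOT=$(mktemp -d)
mkdir -p "$DOCROOT/sub"
echo "some text"      > "$DOCROOT/readme"        # note: no extension
echo "some more text" > "$DOCROOT/sub/notes.txt"

find "$DOCROOT" -type f | sed 's|^|file://|' > /tmp/start.url
cat /tmp/start.url
```

The ht://Dig configuration syntax allows including a file's contents as
an attribute value with backquotes (e.g. ``start_url: `/tmp/start.url` ``);
check the attribute documentation shipped with your version, as that
detail is from memory.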
From: <mic...@bt...> - 2005-04-28 08:44:29
I recall doing something very similar to this as a college project, as
a DOS batch file, so I know that this is also possible if you are not
running a *NIX system.

Mike

> -----Original Message-----
> From: Douglas Kline [mailto:kl...@he...]
> Sent: 27 April 2005 19:11
> To: Brockington,MJ,Mike,IQ D
> Cc: htd...@li...
> Subject: Re: [htdig] indexing a directory tree in the file system
>
> [...]
>
> If you're not running Unix, there might be some parallel operations
> you could use.
>
> Douglas
From: Andreas S. <af...@at...> - 2005-04-27 17:02:42
Quoting mic...@bt... (mic...@bt...) on Wed, Apr 27, 2005 at 05:24:51PM +0100:

> If you want all files indexed within that tree, then you could use some
> sort of script to dump out a recursive directory listing to a file, then
> use that file as the source for your start_url.

Just a find, or do I need to HTML-ify that?

What I really do not understand is that it only works partly, and the
failures are not linked to the depth of the tree... I suspect this is
more a MIME type problem. Is there a way to set a default MIME type
(text)?

thx
afx
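On the MIME-type question: when digging file:// URLs there is no web
server to supply a Content-Type, so ht://Dig falls back to mapping file
extensions to content types, which would explain extension-less files
being skipped. The fragment below is only a sketch from memory of the
3.2 documentation; the attribute names and paths should be verified
against the attrs.html shipped with your version.

```
# Illustrative htdig.conf fragment (verify attribute names against
# your installed version's documentation).
database_dir:   /var/lib/htdig

# Feed the dig from a generated list of file:// URLs; backquotes
# include the named file's contents as the attribute value.
start_url:      `/tmp/start.url`

# Extension-to-content-type mapping used for local file access:
mime_types:     /etc/mime.types

# Make sure the files you care about are not excluded by extension:
bad_extensions: .gif .jpg .png .tar .gz .zip
```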
From: Douglas K. <kl...@he...> - 2005-04-27 18:11:34
> If you want all files indexed within that tree, then you could use some
> sort of script to dump out a recursive directory listing to a file, then
> use that file as the source for your start_url.
>
> [...]

If this is under Unix, you could use the "find" command to write out
all the files in a tree. If you want to select some of them, you could
use options to the "find" command, which are more plentiful with the
GNU version, or you could pipe the output to a command like grep or sed
to select some files. The argument to htdig should be a URL or list of
URLs, so somewhere a URL has to be used to address the files. Once you
have a URL you can use, the various files can be addressed by pathnames
starting from the URL. You could convert the list of files into a list
of URLs with substitutions, which could be executed in the same pipe
from the find command.

If you're not running Unix, there might be some parallel operations you
could use.

Douglas

========
Douglas Kline
kl...@he...
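The pipeline Douglas describes — find to enumerate the tree, grep to
select a subset, and a sed substitution to turn each path into a URL,
all in one pipe — can be sketched as follows. The demo tree and the
`.bak` filter pattern are invented for illustration:

```shell
# Demo tree with one file to keep and one to filter out.
DOCROOT=$(mktemp -d)
echo "keep me" > "$DOCROOT/keep.txt"
echo "skip me" > "$DOCROOT/skip.bak"

# Enumerate, select, and rewrite paths into file:// URLs in one pipe.
find "$DOCROOT" -type f \
  | grep -v '\.bak$' \
  | sed 's|^|file://|' > /tmp/urls.txt
cat /tmp/urls.txt
```

The resulting list can then serve as the URL source for the dig, as in
Mike's earlier suggestion.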
From: Andreas S. <af...@at...> - 2005-04-29 06:21:01
Quoting Douglas Kline (kl...@he...) on Wed, Apr 27, 2005 at 02:11:25PM -0400:

> If this is under Unix, you could use the "find" command to write out
> all the files in a tree. [...] You could convert the list of files
> into a list of URLs with substitutions, which could be executed in the
> same pipe from the find command.

Even when I feed it a list of all files (find > awk > html), htdig
would still produce an index that missed lots of keywords. I have no
idea why. A string like "msgrcv" would show up in some files but not
others.

Meanwhile, searching for alternatives, I turned to id-utils, which does
not generate a web-searchable index, but indexes all files by just
being pointed at the root of the tree, and it can be made to include
any text file with a simple config change.

thx
afx