You can subscribe to this list here.

Messages per month:

| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2001 | | | | | | | | | | 47 | 74 | 66 |
| 2002 | 95 | 102 | 83 | 64 | 55 | 39 | 23 | 77 | 88 | 84 | 66 | 46 |
| 2003 | 56 | 129 | 37 | 63 | 59 | 104 | 48 | 37 | 49 | 157 | 119 | 54 |
| 2004 | 51 | 66 | 39 | 113 | 34 | 136 | 67 | 20 | 7 | 10 | 14 | 3 |
| 2005 | 40 | 21 | 26 | 13 | 6 | 4 | 23 | 3 | 1 | 13 | 1 | 6 |
| 2006 | 2 | 4 | 4 | 1 | 11 | 1 | 4 | 4 | | 4 | | 1 |
| 2007 | 2 | 8 | 1 | 1 | 1 | | 2 | | 1 | | | |
| 2008 | 1 | | 1 | 2 | | | 1 | | 1 | | | |
| 2009 | | | 2 | | 1 | | | | | | | |
| 2010 | | | | | | | | | | | | 1 |
| 2011 | | | 1 | | 1 | | | | | 1 | | |
| 2012 | | | | | | | 1 | | | | | |
| 2013 | | | | 1 | | | | | | | | |
| 2016 | 1 | | | | | | | | | | | |
| 2017 | | | | | | | | | | | 1 | |
From: Geoff H. <ghu...@us...> - 2002-06-30 17:08:52
|
STATUS of ht://Dig branch 3-2-x
RELEASES:
3.2.0b4: In progress
3.2.0b3: Released: 22 Feb 2001.
3.2.0b2: Released: 11 Apr 2000.
3.2.0b1: Released: 4 Feb 2000.
SHOWSTOPPERS:
KNOWN BUGS:
* Odd behavior: $(MODIFIED) and scores do not work with wordlist_compress
  set, but they work fine without wordlist_compress.
  (The date is definitely stored correctly, even with compression on,
  so this must be some sort of weird htsearch bug.)
* Not all htsearch input parameters are handled properly: PR#648. Use a
  consistent mapping of input -> config -> template for all inputs where
  it makes sense to do so (everything but "config" and "words"?).
* If exact isn't specified in the search_algorithms, $(WORDS) is not set
correctly: PR#650. (The documentation for 3.2.0b1 is updated, but can
we fix this?)
* META descriptions are somehow added to the database as FLAG_TITLE,
not FLAG_DESCRIPTION. (PR#859)
PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
* MySQL patches to 3.1.x to be forward-ported and cleaned up.
(Should really only attempt to use SQL for doc_db and related, not word_db)
NEEDED FEATURES:
* Field-restricted searching.
* Return all URLs.
* Handle noindex_start & noindex_end as string lists.
* Handle local_urls through file:// handler, for mime.types support.
* Handle directory redirects in RetrieveLocal.
* Merge with mifluz
TESTING:
* httools programs:
(htload a test file, check a few characteristics, htdump and compare)
* Turn on URL parser test as part of test suite.
* htsearch phrase support tests
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen,
argument handling for parser/converter, allowing binary output from an
external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.
DOCUMENTATION:
* List of supported platforms/compilers is ancient.
* Add thorough documentation on htsearch restrict/exclude behavior
(including '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes
to template variables. (Relates to PR#648.) Also make sure these config
attributes are all documented in defaults.cc, even if they're only set by
input parameters and never in the config file.
* Split attrs.html into categories for faster loading.
* require.html is not updated to list new features and disk space
requirements of 3.2.x (e.g. phrase searching, regex matching,
external parsers and transport methods, database compression.)
* TODO.html has not been updated for current TODO list and completions.
OTHER ISSUES:
* Can htsearch actually search while an index is being created?
(Does Loic's new database code make this work?)
* The code needs a security audit, esp. htsearch
* URL.cc tries to parse malformed URLs (which causes further problems)
(It should probably just set everything to empty) This relates to
PR#348.
|
|
From: Geoff H. <ghu...@ws...> - 2002-06-28 17:25:44
|
On 27 Jun 2002, Scott Gifford wrote:
> Saw this on BugTraq just now.
> http://www.anyhost.com/cgi-bin/htsearch.cgi?words=%22%3E%3Cscript%3Ealert%28document.cookie%29%3B%3C%2Fscript%3E

I don't think this is actually a problem with properly-formatted templates under 3.1.6 or 3.2.0b3-3.2.0b4. The question is the $(WORDS) variable substitution--where this query would "hijack" it and put in the <script> attack, e.g.

  <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
  <html><head><title>Search results for '$&(WORDS)'</title></head>
  <body bgcolor="#eef7ff">

But the current header.html files have $&(WORDS), which is HTML-escaped, so the <script> would be rendered in the file as &lt;script&gt;, which shouldn't cause any problems. Unfortunately Gilles is on vacation, but I don't see any vulnerability in the default-installed 3.1.6.

Now, it's probably worth pointing out to users that they should upgrade to a recent version and make sure their templates use $&(VAR) where necessary to avoid XSS attacks. But I don't see this as a hole in htsearch.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
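To make the contrast concrete: in a header template, the unescaped and escaped substitutions would look like the following. This is a hypothetical fragment, not the shipped header.html; only the two variable forms themselves come from the message above.

  <!-- raw substitution: a crafted "words" parameter is inserted verbatim,
       so it can inject markup into the results page -->
  <title>Search results for '$(WORDS)'</title>

  <!-- HTML-escaped substitution: an injected <script> tag comes out as
       harmless text instead of being executed -->
  <title>Search results for '$&(WORDS)'</title>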
|
From: Geoff H. <ghu...@ws...> - 2002-06-28 17:00:44
|
> I am wondering whether it is possible to do a sort of reverse lookup
> into the database, to look up all words in the found documents (from an
> original query).
....
> 3. Extract all words from any doc found in 2. from the search database

Not from htsearch. It's really not how the database is organized. You have a document DB and a database of:

  word1 -> DocID ...
  word2 -> DocID ...

So there's no collection of:

  DocID -> word1 word2...

However, given the URL of a document, you could probably whip up a Perl script to generate a list of all unique non-HTML words in the document after fetching it.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
|
From: Geoff H. <ghu...@ws...> - 2002-06-28 16:58:51
|
On Fri, 28 Jun 2002, Rian Kruger wrote:
> precompiled binaries. What other information do you need? Do you think
> it's at all possible that because the gcc was precompiled to be in
> /usr/local, that make might be using an older version of gcc elsewhere
> in the path? How can I test this.

Take a look at what the compile line says. It will either have no path to the gcc/g++ executable or you'll see the exact one that's used. If there's no path, try "which g++" to see what part of your path is chosen.

It'd also help if you could give us about 10-15+ lines of the compilation here. It seems like there should be more messages from the compiler before it died.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
|
From: Charles H. <cha...@ho...> - 2002-06-28 13:58:10
|
Hello All,

I am wondering whether it is possible to do a sort of reverse lookup into the database, to look up all words in the found documents (from an original query). The process would be:

1. Do a normal htdig query.
2. This would find a number of docs containing the words searched for.
3. Extract all words from any doc found in 2. from the search database.

As ever, sorry if this is an obvious question!

Thanks in Advance
Charles
|
From: Rian K. <ri...@th...> - 2002-06-28 10:01:59
|
Geoff Hutchison wrote:
> On Thu, 27 Jun 2002, Rian Kruger wrote:
>
>>I'm trying to build htdig on a sparc sun solaris2.5. Please find the
>
> ...
>
>>-c -O -I. -I./../include -D_REENTRANT ../os/os_dir.c
>>Current working directory /export/home/httpd/htdigtmp/htdig-3.1.6/db/dist
>
> Sorry, but this isn't very helpful. You have provided the htdig version
> indirectly, but you haven't said anything about the compiler you're using
> (and version), or other aspects of your build environment. The code is
> known to compile with any of the newer versions of gcc/g++ for SPARC
> (e.g. gcc 2.95.2 or newer).
>
> A little more information would be quite helpful here.

Sorry about the lack of information. I'm using gcc-2.95.3; I installed precompiled binaries. What other information do you need? Do you think it's at all possible that, because the gcc was precompiled to be in /usr/local, make might be using an older version of gcc elsewhere in the path? How can I test this?

Thanks
|
From: Geoff H. <ghu...@ws...> - 2002-06-27 15:03:13
|
On Thu, 27 Jun 2002, Rian Kruger wrote:
> I'm trying to build htdig on a sparc sun solaris2.5. Please find the
...
> -c -O -I. -I./../include -D_REENTRANT ../os/os_dir.c
> Current working directory /export/home/httpd/htdigtmp/htdig-3.1.6/db/dist

Sorry, but this isn't very helpful. You have provided the htdig version indirectly, but you haven't said anything about the compiler you're using (and version), or other aspects of your build environment. The code is known to compile with any of the newer versions of gcc/g++ for SPARC (e.g. gcc 2.95.2 or newer).

A little more information would be quite helpful here.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
|
From: Rian K. <ri...@th...> - 2002-06-27 07:54:09
|
Hi

I'm trying to build htdig on a sparc sun solaris2.5. Please find the error I get below. Is this the correct list?

Interestingly, I tried to compile gnu make on the machine and I also got a 'dereferencing pointer to incomplete type' when trying to compile the 'dir.c' file.

Here is the error:

  -c -O -I. -I./../include -D_REENTRANT ../os/os_dir.c
  ../os/os_dir.c: In function `__os_dirlist':
  ../os/os_dir.c:70: dereferencing pointer to incomplete type
  *** Error code 1
  make: Fatal error: Command failed for target `os_dir.o'
  Current working directory /export/home/httpd/htdigtmp/htdig-3.1.6/db/dist
  *** Error code 1
  make: Fatal error: Command failed for target `all'

Please any help will be much appreciated, I've already invested 3 nights trying to get this to work.

Thanks
Rian
|
From: Sankaranarayanan M.G. <san...@ce...> - 2002-06-27 04:57:01
|
Dear All,

Greetings. I am currently using the htdig-3.2.0b4-20020505 version.

In my /var/www/html/docstosearch/ directory, I have many sub-folders (for example sf1, sf2, sf3, etc.) wherein I have placed my files which are to be indexed. Also, from the rundig script I have removed the -i option used for htdig, so that the database is not created from scratch every time. In the htdig.conf file, for the start_url I have given http://servername/docstosearch/

With this configuration, I am able to index all the files in the docstosearch directory and its subdirectories. The problem started when I add or remove files or directories from the docstosearch directory or any of its sub-directories.

1) It always takes the same time for the rundig script... even when there are no changes in the docstosearch directory or any of its sub-directories.

2) Suppose I change only one file in one of the sub-directories; how do I re-index that directory alone? I don't want to waste time trying to index the other directories.

I tried changing the htdig.conf file to point to that directory in the start_url and executed the runscript. Most of the times it throws some errors from WordDB saying PANIC or CDB__ or WordCompress line number 827...

Can someone kindly help me out of this?

Cheers,
Sankaranarayanan M.G.
|
From: Lachlan A. <lh...@ee...> - 2002-06-24 01:54:32
|
On Fri, Jun 14, 2002 at 06:03:09PM -0500, Gilles Detillieux wrote:
> I'd recommend two changes:
> 1) Grab the most recent 3.2.0b4 snapshot
> 2) The HtFile::Request() and Document::RetrieveLocal() methods both
> have some hardcoded extensions, which should probably be kept in the
> new HtFile::Ext2Mime() method. HtFile::Request() currently falls back
> on these when it can't open mime.types.
Greetings,
Below is the patch against 3.2.0b4-20020616. This includes the
hardcoded types, and bad_local_extensions to allow .php etc. not
to be parsed locally. If bad_local_extensions is explicitly set
empty, do you think it would be good to allow *all* files to be parsed
locally (even those with no extensions)? Of course, ones for which no
MIME type is known would have to be treated as text/plain, but that would
be good for a site that has a lot of text files with no extensions.
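For illustration, a hypothetical htdig.conf fragment using the new attribute
might look like the following. The attribute name and its .php/.shtml default
come from the patch below; the local_urls mapping and the extra .cgi entry are
only assumptions about a typical setup.

  # Read most pages straight from the local filesystem, but never parse
  # server-generated pages locally.
  local_urls:           http://www.example.com/=/var/www/html/
  bad_local_extensions: .php .shtml .cgi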
Also, would there be any demand to index compressed files? If someone
has a lot of .ps.gz files, for example, it could be useful to include
them in the index.
Finally, I think someone has been editing the files with a tab size other
than 8... Is there a policy on that?
Cheers,
Lachlan
*** htdig/Document.cc Sun Jan 13 19:13:13 2002
--- htdig/Document.cc.lha Mon Jun 24 01:06:48 2002
***************
*** 72,78 ****
FileConnect = 0;
NNTPConnect = 0;
externalConnect = 0;
! HtConfiguration* config= HtConfiguration::config();
// We probably need to move assignment of max_doc_size, according
// to a server or url configuration value. The same is valid for
--- 72,78 ----
FileConnect = 0;
NNTPConnect = 0;
externalConnect = 0;
! HtConfiguration* config= HtConfiguration::config();
// We probably need to move assignment of max_doc_size, according
// to a server or url configuration value. The same is valid for
***************
*** 549,555 ****
Transport::DocStatus
Document::RetrieveLocal(HtDateTime date, StringList *filenames)
{
! HtConfiguration* config= HtConfiguration::config();
struct stat stat_buf;
String *filename;
--- 549,555 ----
Transport::DocStatus
Document::RetrieveLocal(HtDateTime date, StringList *filenames)
{
! HtConfiguration* config= HtConfiguration::config();
struct stat stat_buf;
String *filename;
***************
*** 558,564 ****
// Loop through list of potential filenames until the list is exhausted
// or a suitable file is found to exist as a regular file.
while ((filename = (String *)filenames->Get_Next()) &&
! ((stat((char*)*filename, &stat_buf) == -1) || !S_ISREG(stat_buf.st_mode)))
if (debug > 1)
cout << " tried local file " << *filename << endl;
--- 558,564 ----
// Loop through list of potential filenames until the list is exhausted
// or a suitable file is found to exist as a regular file.
while ((filename = (String *)filenames->Get_Next()) &&
! ((stat((char*)*filename, &stat_buf) == -1) || !S_ISREG(stat_buf.st_mode)))
if (debug > 1)
cout << " tried local file " << *filename << endl;
***************
*** 572,593 ****
if (modtime <= date)
return Transport::Document_not_changed;
- // Process only HTML files (this could be changed if we read
- // the server's mime.types file).
- // (...and handle a select few other types for now... this should
- // eventually be handled by the "file://..." handler, which uses
- // mime.types to determine the file type.) -- FIXME!!
char *ext = strrchr((char*)*filename, '.');
if (ext == NULL)
return Transport::Document_not_local;
! if ((mystrcasecmp(ext, ".html") == 0) || (mystrcasecmp(ext, ".htm") == 0))
! contentType = "text/html";
! else if ((mystrcasecmp(ext, ".txt") == 0) || (mystrcasecmp(ext, ".asc") == 0))
! contentType = "text/plain";
! else if ((mystrcasecmp(ext, ".pdf") == 0))
! contentType = "application/pdf";
! else if ((mystrcasecmp(ext, ".ps") == 0) || (mystrcasecmp(ext, ".eps") == 0))
! contentType = "application/postscript";
else
return Transport::Document_not_local;
--- 572,585 ----
if (modtime <= date)
return Transport::Document_not_changed;
char *ext = strrchr((char*)*filename, '.');
+ if (ext && strchr(ext,'/')) // Ignore a dot if it's not in the
+ ext = NULL; // final component of the path.
if (ext == NULL)
return Transport::Document_not_local;
! const String *type = HtFile::Ext2Mime (ext + 1);
! if (type != NULL)
! contentType = *type;
else
return Transport::Document_not_local;
*** htnet/HtFile.h Mon Jun 24 01:02:42 2002
--- htnet/HtFile.h.lha Mon Jun 24 01:02:51 2002
***************
*** 64,69 ****
--- 64,73 ----
// manages a Transport request (method inherited from Transport class)
virtual DocStatus Request ();
+ // Determine Mime type of file
+ // (Does it belong here??)
+ static const String *Ext2Mime (const char *);
+
///////
// Interface for resource retrieving
///////
*** htnet/HtFile.cc Sun Dec 23 19:13:14 2001
--- htnet/HtFile.cc.lha Mon Jun 24 00:48:34 2002
***************
*** 76,96 ****
}
! ///////
! // Manages the requesting process
! ///////
!
! HtFile::DocStatus HtFile::Request()
{
- HtConfiguration* config= HtConfiguration::config();
static Dictionary *mime_map = 0;
if (!mime_map)
{
mime_map = new Dictionary();
ifstream in(config->Find("mime_types").get());
if (in)
{
String line;
while (in >> line)
{
--- 76,110 ----
}
! // Return mime type indicated by extension ext (which is assumed not
! // to contain the '.'), or NULL if ext is not a know mime type, or
! // is listed in bad_local_extensions.
! const String *HtFile::Ext2Mime (const char *ext)
{
static Dictionary *mime_map = 0;
if (!mime_map)
{
+ HtConfiguration* config= HtConfiguration::config();
mime_map = new Dictionary();
+ if (!mime_map)
+ return NULL;
+
+ if (debug > 2)
+ cout << "MIME types: " << config->Find("mime_types").get() << endl;
ifstream in(config->Find("mime_types").get());
if (in)
{
+ // Set up temporary dictionary of extensions not to parse locally
+ Dictionary bad_local_exts;
+ StringList split_exts(config->Find("bad_local_extensions"), "\t .");
+ for (int i = 0; i < split_exts.Count(); i++)
+ {
+ if (debug > 3)
+ cout << "Bad local extension: " << split_exts[i] << endl;
+ bad_local_exts.Add(split_exts[i], 0);
+ }
+
String line;
while (in >> line)
{
***************
*** 99,114 ****
if ((cmt = line.indexOf('#')) >= 0)
line = line.sub(0, cmt);
StringList split_line(line, "\t ");
! // Let's cache mime type to lesser the number of
! // operator [] callings
String mime_type = split_line[0];
// Fill map with values.
for (int i = 1; i < split_line.Count(); i++)
! mime_map->Add(split_line[i], new String(mime_type));
}
}
}
// Reset the response
_response.Reset();
--- 113,161 ----
if ((cmt = line.indexOf('#')) >= 0)
line = line.sub(0, cmt);
StringList split_line(line, "\t ");
! // cache mime type to lessen the number of operator [] callings
String mime_type = split_line[0];
// Fill map with values.
for (int i = 1; i < split_line.Count(); i++)
! {
! const char *ext = split_line [i];
! if (bad_local_exts.Exists(ext))
! {
! if (debug > 3)
! cout << "Bad local extension: " << ext << endl;
! continue;
! }
!
! if (debug > 3)
! cout << "MIME: " << ext << "\t-> " << mime_type << endl;
! mime_map->Add(ext, new String(mime_type));
! }
}
}
+ else
+ {
+ if (debug > 2)
+ cout << "MIME types file not found. Using default types.\n";
+ mime_map->Add(String("html"), new String("text/html"));
+ mime_map->Add(String("htm"), new String("text/html"));
+ mime_map->Add(String("txt"), new String("text/plain"));
+ mime_map->Add(String("asc"), new String("text/plain"));
+ mime_map->Add(String("pdf"), new String("application/pdf"));
+ mime_map->Add(String("ps"), new String("application/postscript"));
+ mime_map->Add(String("eps"), new String("application/postscript"));
+ }
}
+ // return MIME type, or NULL if not found
+ return (String *)mime_map->Find(ext);
+ }
+
+ ///////
+ // Manages the requesting process
+ ///////
+
+ HtFile::DocStatus HtFile::Request()
+ {
// Reset the response
_response.Reset();
***************
*** 166,191 ****
return Transport::Document_not_changed;
char *ext = strrchr(_url.path(), '.');
if (ext == NULL)
return Transport::Document_not_local;
! if (mime_map && mime_map->Count())
! {
! String *mime_type = (String *)mime_map->Find(ext + 1);
! if (mime_type)
! _response._content_type = *mime_type;
! else
! return Transport::Document_not_local;
! }
else
! {
! if ((mystrcasecmp(ext, ".html") == 0) || (mystrcasecmp(ext, ".htm") == 0))
! _response._content_type = "text/html";
! else if (mystrcasecmp(ext, ".txt") == 0)
! _response._content_type = "text/plain";
! else
! return Transport::Document_not_local;
! }
_response._modification_time = new HtDateTime(stat_buf.st_mtime);
--- 213,228 ----
return Transport::Document_not_changed;
char *ext = strrchr(_url.path(), '.');
+ if (ext && strchr(ext,'/')) // Ignore a dot if it's not in the
+ ext = NULL; // final component of the path.
if (ext == NULL)
return Transport::Document_not_local;
! const String *mime_type = Ext2Mime(ext + 1);
! if (mime_type)
! _response._content_type = *mime_type;
else
! return Transport::Document_not_local;
_response._modification_time = new HtDateTime(stat_buf.st_mtime);
*** htcommon/defaults.cc Sun Jun 23 23:55:41 2002
--- htcommon/defaults.cc.lha Mon Jun 24 01:01:09 2002
***************
*** 145,151 ****
documents as text while they are some binary format. \
If the list is empty, then all extensions are acceptable, \
provided they pass other criteria for acceptance or rejection. \
! See also <a href=\"#valid_extensions\">valid_extensions</a>. \
" }, \
{ "bad_querystr", "", \
"pattern list", "htdig", "URL", "3.1.0", "Indexing:Where", "bad_querystr: forum=private section=topsecret&passwd=required", " \
--- 145,165 ----
documents as text while they are some binary format. \
If the list is empty, then all extensions are acceptable, \
provided they pass other criteria for acceptance or rejection. \
! See also <a href=\"#valid_extensions\">valid_extensions</a> and \
! <a href=\"#bad_local_extensions\">bad_local_extensions</a>. \
! " }, \
! { "bad_local_extensions", ".php .shtml", \
! "string list", "htdig", "URL", "all", "Indexing:Where", "bad_local_extensions: .php .foo .bar", " \
! This is a list of extensions on URLs which are \
! considered active, that is, the content delivered by the web \
! server is not simply the text of the file, but is generated \
! on-the-fly. This list is used mainly to allow URLs on the local \
! machine to be read using the local filesystem, rather than \
! through HTTP. \
! If the list is empty, then all extensions are acceptable, \
! provided they pass other criteria for acceptance or rejection. \
! See also <a href=\"#valid_extensions\">valid_extensions</a> and \
! <a href=\"#bad_extensions\">bad_extensions</a>. \
" }, \
{ "bad_querystr", "", \
"pattern list", "htdig", "URL", "3.1.0", "Indexing:Where", "bad_querystr: forum=private section=topsecret&passwd=required", " \
--
Lachlan Andrew lh...@ee... Phone: +613 8344-3816 Fax: +613 8344-6678
Department of Electrical and Electronic Engineering CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA 00116K
|
|
From: Geoff H. <ghu...@us...> - 2002-06-23 07:13:56
|
STATUS of ht://Dig branch 3-2-x
RELEASES:
3.2.0b4: In progress
3.2.0b3: Released: 22 Feb 2001.
3.2.0b2: Released: 11 Apr 2000.
3.2.0b1: Released: 4 Feb 2000.
SHOWSTOPPERS:
KNOWN BUGS:
* Odd behavior: $(MODIFIED) and scores do not work with wordlist_compress
  set, but they work fine without wordlist_compress.
  (The date is definitely stored correctly, even with compression on,
  so this must be some sort of weird htsearch bug.)
* Not all htsearch input parameters are handled properly: PR#648. Use a
  consistent mapping of input -> config -> template for all inputs where
  it makes sense to do so (everything but "config" and "words"?).
* If exact isn't specified in the search_algorithms, $(WORDS) is not set
correctly: PR#650. (The documentation for 3.2.0b1 is updated, but can
we fix this?)
* META descriptions are somehow added to the database as FLAG_TITLE,
not FLAG_DESCRIPTION. (PR#859)
PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
* MySQL patches to 3.1.x to be forward-ported and cleaned up.
(Should really only attempt to use SQL for doc_db and related, not word_db)
NEEDED FEATURES:
* Field-restricted searching.
* Return all URLs.
* Handle noindex_start & noindex_end as string lists.
* Handle local_urls through file:// handler, for mime.types support.
* Handle directory redirects in RetrieveLocal.
* Merge with mifluz
TESTING:
* httools programs:
(htload a test file, check a few characteristics, htdump and compare)
* Turn on URL parser test as part of test suite.
* htsearch phrase support tests
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen,
argument handling for parser/converter, allowing binary output from an
external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.
DOCUMENTATION:
* List of supported platforms/compilers is ancient.
* Add thorough documentation on htsearch restrict/exclude behavior
(including '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes
to template variables. (Relates to PR#648.) Also make sure these config
attributes are all documented in defaults.cc, even if they're only set by
input parameters and never in the config file.
* Split attrs.html into categories for faster loading.
* require.html is not updated to list new features and disk space
requirements of 3.2.x (e.g. phrase searching, regex matching,
external parsers and transport methods, database compression.)
* TODO.html has not been updated for current TODO list and completions.
OTHER ISSUES:
* Can htsearch actually search while an index is being created?
(Does Loic's new database code make this work?)
* The code needs a security audit, esp. htsearch
* URL.cc tries to parse malformed URLs (which causes further problems)
(It should probably just set everything to empty) This relates to
PR#348.
|
|
From: Gilles D. <gr...@sc...> - 2002-06-18 14:49:42
|
According to Geoff Hutchison:
> It seems like we could save some time here and just return if the word
> is short and do this even sooner in the method, e.g.
> if (trackWords)
> {
> if (strlen(word) < minimumWordLength)
> return;
> String w = word;
> ...
>
> This saves updating the wordRef, the word component code, etc. Anything
> obviously wrong with this?
For the sake of tidiness, I'd write it as:
if (trackWords && strlen(word) >= minimumWordLength)
{
String w = word;
...
but yes, this seems quite reasonable to me. I simply missed that
opportunity for optimisation when I added the compound word handling code.
--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
|
|
From: Geoff H. <ghu...@ws...> - 2002-06-18 04:13:10
|
I've been working on the mifluz merge, which required some changes to
Retriever::got_word and I saw this towards the top.
if (w.length() >= minimumWordLength) {
wordRef.Word(w);
words.Replace(...);
}
// Check for compound words...
It seems like we could save some time here and just return if the word
is short and do this even sooner in the method, e.g.
if (trackWords)
{
if (strlen(word) < minimumWordLength)
return;
String w = word;
...
This saves updating the wordRef, the word component code, etc. Anything
obviously wrong with this?
-Geoff
|
|
From: Geoff H. <ghu...@us...> - 2002-06-16 07:14:00
|
STATUS of ht://Dig branch 3-2-x
RELEASES:
3.2.0b4: In progress
3.2.0b3: Released: 22 Feb 2001.
3.2.0b2: Released: 11 Apr 2000.
3.2.0b1: Released: 4 Feb 2000.
SHOWSTOPPERS:
KNOWN BUGS:
* Odd behavior: $(MODIFIED) and scores do not work with wordlist_compress
  set, but they work fine without wordlist_compress.
  (The date is definitely stored correctly, even with compression on,
  so this must be some sort of weird htsearch bug.)
* Not all htsearch input parameters are handled properly: PR#648. Use a
  consistent mapping of input -> config -> template for all inputs where
  it makes sense to do so (everything but "config" and "words"?).
* If exact isn't specified in the search_algorithms, $(WORDS) is not set
correctly: PR#650. (The documentation for 3.2.0b1 is updated, but can
we fix this?)
* META descriptions are somehow added to the database as FLAG_TITLE,
not FLAG_DESCRIPTION. (PR#859)
PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
* MySQL patches to 3.1.x to be forward-ported and cleaned up.
(Should really only attempt to use SQL for doc_db and related, not word_db)
NEEDED FEATURES:
* Field-restricted searching.
* Return all URLs.
* Handle noindex_start & noindex_end as string lists.
* Handle local_urls through file:// handler, for mime.types support.
* Handle directory redirects in RetrieveLocal.
* Merge with mifluz
TESTING:
* httools programs:
(htload a test file, check a few characteristics, htdump and compare)
* Turn on URL parser test as part of test suite.
* htsearch phrase support tests
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen,
argument handling for parser/converter, allowing binary output from an
external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.
DOCUMENTATION:
* List of supported platforms/compilers is ancient.
* Add thorough documentation on htsearch restrict/exclude behavior
(including '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes
to template variables. (Relates to PR#648.) Also make sure these config
attributes are all documented in defaults.cc, even if they're only set by
input parameters and never in the config file.
* Split attrs.html into categories for faster loading.
* require.html is not updated to list new features and disk space
requirements of 3.2.x (e.g. phrase searching, regex matching,
external parsers and transport methods, database compression.)
* TODO.html has not been updated for current TODO list and completions.
OTHER ISSUES:
* Can htsearch actually search while an index is being created?
(Does Loic's new database code make this work?)
* The code needs a security audit, esp. htsearch
* URL.cc tries to parse malformed URLs (which causes further problems)
(It should probably just set everything to empty) This relates to
PR#348.
|
|
From: Gilles D. <gr...@sc...> - 2002-06-14 23:03:23
|
According to Lachlan Andrew:
> On Fri, May 10, 2002 at 06:19:11PM -0400, Geoff Hutchison wrote:
> > On Fri, 10 May 2002, Lachlan Andrew wrote:
> > > KDE help used to use ht://Dig to provide a search capability.
> > > They changed the format of their files from HTML to docbook (XML).
> > > For some reason, ht://Dig refuses to call the parser that one of
> > > the KDE developers wrote. The response was that it was not a bug,
> > > but a calculated feature, because ht://Dig didn't know that no server
> > > parsing was necessary.
> >
> > For 3.2, the best approach is to either:
> > a) Index using file:// URLs, which should use the appropriate mime.types
> > file: <http://www.htdig.org/dev/htdig-3.2/attrs.html#mime_types>
> > b) Code the RetrieveLocal method to produce temporary file:// URLs that
> > are retrieved using the htnet/HtFile methods. (which again should use the
> > appropriate mime.types file)
>
> I don't understand why there needs to be a temporary file:// URL.
> I've attached a patch (against the latest beta, b3) in which
> RetrieveLocal explicitly calls the method from HtFile which checks
> the MIME type.
>
> Please let me know if this patch is unsuitable, and if so how I can
> fix it. If it is OK, I'll go ahead and implement bad_local_ext etc.

I think your approach is a simple and elegant solution to this problem. I'd recommend two changes:

1) Grab the most recent 3.2.0b4 snapshot from http://www.htdig.org/files/snapshots/ and adapt your code to that version. Some of the mime.types handling code in HtFile::Request() has changed subtly between 3.2.0b3 and now, so it would be better to work with the current code base. Also, your patch to htsearch/Display.cc isn't needed with 3.2.0b4.

2) The HtFile::Request() and Document::RetrieveLocal() methods both have some hardcoded extensions, which should probably be kept in the new HtFile::Ext2Mime() method. HtFile::Request() currently falls back on these when it can't open mime.types.

Thanks!

--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
|
From: Gilles D. <gr...@sc...> - 2002-06-14 22:37:47
|
According to Jim Cole:
> Budd, Sinclair's bits of Tue, 11 Jun 2002 translated to:
> >I am using Dig 3.2.0b4-20020505 .
> >I do not seem to be able to order the results by the use of
> >search_results_order configuration attribute.
> >such as search_results_order: www.cts *
> >Is there anything I can do to rectify this.
>
> It appears that a bit of the code associated with that
> functionality broke at some point. I am attaching a patch that
> seems to fix the problem. The patch is against the
> htdig-3.2.0b4-20020609 snapshot. It is just my best guess at a
> reasonable fix, so use at your own risk and all that ;)
>
> The patch addresses two problems that I found. The first was that
> in Display::buildMatchList(), the call to matches.Add() was
> always passing an empty string for the URL. The second was that
> the value returned from MatchArea::Match() was always opposite of
> what it should have been.

Your best guess seems right to me. Thanks for tracking this down. I've committed it to CVS. The error in MatchArea::Match() also seemed to be in ScoreAdjustItem::Match() in htsearch/HtURLSeedScore.cc, so I fixed it too. I think all of these problems arose from the port of Hans-Peter's patch for 3.1.4 over to 3.2.0b2.

--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
|
From: Gilles D. <gr...@sc...> - 2002-06-13 20:39:29
|
According to Gabriele Bartolini:
> I came up with a doubt (I don't know but I never paid attention to this
> before). If I am not *wrong*, when requesting a URL through HTTP, we should
> encode the string (URL encoding), by using the encodeURL functions of the
> URL class (URLTrans.cc file).
>
> Essentially, two fields of the HTTP request IMHO need to be encoded
> prior to requesting the resource.
>
> First case in the request line, for instance GET encoded_URL HTTP/1.0.
>
> The second one, when given, the referer.
>
> Any suggestion? Do you think we can just ignore this and keep on sending
> the plain URL?

Well, htdig never does actually try to decode URLs, does it? Apart from when it tries to match local_urls, I believe htdig always keeps URLs in a hex-encoded form. If it does get URLs that aren't properly encoded, it's because they weren't properly encoded in the source HTML documents that it indexes. If you were to add an extra encoding step, I think the danger would be that you'd end up doubly encoding URLs that were already properly encoded in the documents in which they were found.

Maybe someone can correct me if I'm falsely assuming what the code is doing, or what HTML documents are supposed to contain in their hrefs, but I think htdig is behaving correctly.

--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
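A small self-contained sketch of that double-encoding hazard (illustrative code only; the simplified encoder below is a stand-in, not the encodeURL routine in URLTrans.cc):

  #include <cctype>
  #include <cstdio>
  #include <string>

  // Simplified percent-encoder: escape anything outside a small "safe" set.
  static std::string percentEncode(const std::string &in)
  {
      static const char hex[] = "0123456789ABCDEF";
      std::string out;
      for (std::string::size_type i = 0; i < in.size(); ++i)
      {
          unsigned char c = in[i];
          if (isalnum(c) || c == '-' || c == '.' || c == '_' || c == '~' || c == '/')
              out += c;
          else
          {
              out += '%';
              out += hex[c >> 4];
              out += hex[c & 0x0F];
          }
      }
      return out;
  }

  int main()
  {
      // Already correctly encoded in the source HTML document:
      std::string href = "/docs/my%20file.html";
      // An unconditional second encoding pass re-escapes the '%' itself:
      std::printf("%s\n", percentEncode(href).c_str());  // prints /docs/my%2520file.html
      return 0;
  }

The server would then be asked for a path containing a literal "%20" sequence rather than a space, which is why leaving already-encoded URLs alone is the safer default.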
|
From: Lachlan A. <lh...@ee...> - 2002-06-13 02:10:38
|
On Fri, May 10, 2002 at 06:19:11PM -0400, Geoff Hutchison wrote:
> On Fri, 10 May 2002, Lachlan Andrew wrote:
> > KDE help used to use ht://Dig to provide a search capability.
> > They changed the format of their files from HTML to docbook (XML).
> > For some reason, ht://Dig refuses to call the parser that one of
> > the KDE developers wrote. The response was that it was not a bug,
> > but a calculated feature, because ht://Dig didn't know that no server
> > parsing was necessary.
>
> For 3.2, the best approach is to either:
> a) Index using file:// URLs, which should use the appropriate mime.types
> file: <http://www.htdig.org/dev/htdig-3.2/attrs.html#mime_types>
> b) Code the RetrieveLocal method to produce temporary file:// URLs that
> are retrieved using the htnet/HtFile methods. (which again should use the
> appropriate mime.types file)

I don't understand why there needs to be a temporary file:// URL.
I've attached a patch (against the latest beta, b3) in which
RetrieveLocal explicitly calls the method from HtFile which checks
the MIME type.

Please let me know if this patch is unsuitable, and if so how I can
fix it. If it is OK, I'll go ahead and implement bad_local_ext etc.

Regards,
Lachlan

*** htdig/Document.cc	Wed Jun 12 22:48:25 2002
--- htdig/Document.cc.lha	Wed Jun 12 22:46:02 2002
***************
*** 494,507 ****
      char *ext = strrchr((char*)*filename, '.');
      if (ext == NULL)
        return Transport::Document_not_local;
!     if ((mystrcasecmp(ext, ".html") == 0) || (mystrcasecmp(ext, ".htm") == 0))
!       contentType = "text/html";
!     else if ((mystrcasecmp(ext, ".txt") == 0) || (mystrcasecmp(ext, ".asc") == 0))
!       contentType = "text/plain";
!     else if ((mystrcasecmp(ext, ".pdf") == 0))
!       contentType = "application/pdf";
!     else if ((mystrcasecmp(ext, ".ps") == 0) || (mystrcasecmp(ext, ".eps") == 0))
!       contentType = "application/postscript";
      else
        return Transport::Document_not_local;
--- 494,502 ----
      char *ext = strrchr((char*)*filename, '.');
      if (ext == NULL)
        return Transport::Document_not_local;
!     const String *type = HtFile::Ext2Mime (ext + 1);
!     if (type != NULL)
!       contentType = *type;
      else
        return Transport::Document_not_local;
*** htnet/HtFile.cc	Wed Jun 12 22:48:50 2002
--- htnet/HtFile.cc.lha	Wed Jun 12 22:46:15 2002
***************
*** 77,92 ****
  }
! ///////
! //   Manages the requesting process
! ///////
!
! HtFile::DocStatus HtFile::Request()
  {
     static Dictionary *mime_map = 0;
     if (!mime_map)
     {
        ifstream in(config["mime_types"].get());
        if (in)
        {
--- 77,92 ----
  }
! // Return mime type indicated by extension ext (which is assumed not
! // to contain the '.'), or NULL if ext is not a know mime type.
! const String *HtFile::Ext2Mime (const char *ext)
  {
     static Dictionary *mime_map = 0;
     if (!mime_map)
     {
+       if (debug > 2)
+         cout << "MIME types: " << config ["mime_types"].get() << endl;
        ifstream in(config["mime_types"].get());
        if (in)
        {
***************
*** 104,114 ****
--- 104,138 ----
              String mime_type = split_line[0];
              // Fill map with values.
              for (int i = 1; i < split_line.Count(); i++)
+             {
+               if (debug > 3)
+                 cout << "MIME: " << split_line[i]
+                      << "\t-> " << mime_type << endl;
                mime_map->Add(split_line[i], new String(mime_type));
+             }
           }
        }
     }
+    if (debug > 4)
+      cout << "Checking extension: " << ext << endl;
+    if (mime_map)	// is this 'if' needed?
+    {
+       const String *mime_type = (String *)mime_map->Find(ext);
+       if (mime_type)
+         return mime_type;
+       else
+         return NULL;
+    }
+    else
+       return NULL;
+ }
+
+ ///////
+ //   Manages the requesting process
+ ///////
+ HtFile::DocStatus HtFile::Request()
+ {
     // Reset the response
     _response.Reset();
***************
*** 169,184 ****
     if (ext == NULL)
       return Transport::Document_not_local;
!    if (mime_map)
!    {
!      String *mime_type = (String *)mime_map->Find(ext + 1);
!      if (mime_type)
!        _response._content_type = *mime_type;
!      else
!        return Transport::Document_not_local;
!    }
     else
     {
       if ((mystrcasecmp(ext, ".html") == 0) || (mystrcasecmp(ext, ".htm") == 0))
         _response._content_type = "text/html";
       else if (mystrcasecmp(ext, ".txt") == 0)
--- 193,205 ----
     if (ext == NULL)
       return Transport::Document_not_local;
!    const String *mime_type = Ext2Mime (ext + 1);
!    if (mime_type)
!      // if (bad_local_ext (ext)) return Transport::Document_not_local; else
!      _response._content_type = *mime_type;
     else
     {
       if (debug > 2) cout << "Extension " << ext+1 << " not found\n";
       if ((mystrcasecmp(ext, ".html") == 0) || (mystrcasecmp(ext, ".htm") == 0))
         _response._content_type = "text/html";
       else if (mystrcasecmp(ext, ".txt") == 0)
*** htnet/HtFile.h	Wed Jun 12 22:48:57 2002
--- htnet/HtFile.h.lha	Wed Jun 12 22:46:19 2002
***************
*** 63,68 ****
--- 63,72 ----
     // manages a Transport request (method inherited from Transport class)
     virtual DocStatus Request ();
+
+    // Determine Mime type of file
+    // (Does it belong here??)
+    static const String *Ext2Mime (const char *);
  ///////
  // Interface for resource retrieving
*** htsearch/Display.cc	Wed Jun 12 22:49:28 2002
--- htsearch/Display.cc.lha	Wed Jun 12 22:46:32 2002
***************
*** 35,40 ****
--- 35,41 ----
  #include <ctype.h>
  #include <syslog.h>
  #include <locale.h>
+ #include <float.h>	// for DBL_MAX on Mandrake 8.2, gcc 2.96
  #include <math.h>
  #if !defined(DBL_MAX) && defined(MAXFLOAT)
*** installdir/mime.types	Wed Jun 12 22:49:45 2002
--- installdir/mime.types.lha	Wed Jun 12 22:46:44 2002
***************
*** 264,269 ****
--- 264,270 ----
  text/vnd.latex-z
  text/x-setext			etx
  text/xml			xml
+ text/docbook			docbook
  video/mpeg			mpeg mpg mpe
  video/quicktime			qt mov
  video/vnd.motorola.video
--
Lachlan Andrew  lh...@ee...  Phone: +613 8344-3816  Fax: +613 8344-6678
Department of Electrical and Electronic Engineering  CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA          00116K
|
From: Budd, S. <s....@ic...> - 2002-06-12 14:24:11
|
The patch which was enclosed produced the desired results. So good!

-----Original Message-----
From: Jim Cole [mailto:gre...@yg...]
Sent: Wednesday, June 12, 2002 12:50 PM
To: Budd, Sinclair
Cc: htd...@li...
Subject: Re: [htdig-dev] 3.2.0B4 and search_results_order

Budd, Sinclair's bits of Tue, 11 Jun 2002 translated to:
>I am using Dig 3.2.0b4-20020505 .
>I do not seem to be able to order the results by the use of
>search_results_order configuration attribute.
>such as search_results_order: www.cts *
>Is there anything I can do to rectify this.

It appears that a bit of the code associated with that functionality broke at some point. I am attaching a patch that seems to fix the problem. The patch is against the htdig-3.2.0b4-20020609 snapshot. It is just my best guess at a reasonable fix, so use at your own risk and all that ;)

The patch addresses two problems that I found. The first was that in Display::buildMatchList(), the call to matches.Add() was always passing an empty string for the URL. The second was that the value returned from MatchArea::Match() was always opposite of what it should have been.

Jim
|
From: Gabriele B. <g.b...@co...> - 2002-06-12 14:11:11
|
Ciao guys,
I came up with a doubt (I don't know but I never paid attention to this
before). If I am not *wrong*, when requesting a URL through HTTP, we should
encode the string (URL encoding), by using the encodeURL functions of the
URL class (URLTrans.cc file).
Essentially, two fields of the HTTP request IMHO need to be encoded
prior to requesting the resource.
First case in the request line, for instance GET encoded_URL HTTP/1.0.
The second one, when given, the referer.
Any suggestion? Do you think we can just ignore this and keep on sending
the plain URL?
Ciao
-Gabriele
--
Gabriele Bartolini - Computer Programmer
U.O. Rete Civica - Comune di Prato - Prato - Italia - Europa
g.b...@co... | http://www.po-net.prato.it/
The nice thing about Windows is - It does not just crash,
it displays a dialog box and lets you press 'OK' first.
|
|
From: Jim C. <gre...@yg...> - 2002-06-12 11:51:04
|
Budd, Sinclair's bits of Tue, 11 Jun 2002 translated to:
>I am using Dig 3.2.0b4-20020505 .
>I do not seem to be able to order the results by the use of
>search_results_order configuration attribute.
>such as search_results_order: www.cts *
>Is there anything I can do to rectify this.

It appears that a bit of the code associated with that functionality broke at some point. I am attaching a patch that seems to fix the problem. The patch is against the htdig-3.2.0b4-20020609 snapshot. It is just my best guess at a reasonable fix, so use at your own risk and all that ;)

The patch addresses two problems that I found. The first was that in Display::buildMatchList(), the call to matches.Add() was always passing an empty string for the URL. The second was that the value returned from MatchArea::Match() was always opposite of what it should have been.

Jim
|
From: Jim C. <gre...@yg...> - 2002-06-12 04:11:53
|
Hi - I encountered the following problem while working with htdig-3.2.0b4-20020609 (though it is not specific to this version) under OS X (10.1.4) with the April Developer Tools configured for GCC3 (3.1). I understand that the compiler system is still considered beta under OS X, but from what I have read this problem is due to a GCC3 feature.

The problem is that ht://Dig cannot be configured to use zlib. When configure tries to compile conftest.c for the zlib.h check, the compile line includes a -I/usr/include. This results in the following warning

  cpp0: warning: changing search order for system directory "/usr/include"
  cpp0: warning: as it has already been specified as a non-system directory

which in turn causes the check to fail. The cause of the warning, as I understand it, is that use of -I for a system include directory now both changes the implied ordering of the search path and causes that directory to be considered a non-system include directory. This is discussed in the GCC documentation for -I at http://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Directory-Options.html#Directory%20Options

A few other things I noticed... There are a lot of warnings about the use of deprecated headers. If there is a way to slip in a -Wno-deprecated for GCC3 builds, I think these warnings will go away. A consequence of the deprecated header warnings is that the check for fstream.h fails.

A few other warnings that I caught are listed below. I realize this is a beta, and I am not trying to nitpick :) I just figured I would pass along everything else I noticed since it is all sitting in front of me right now. In the end, I was able to build, install, rundig, and perform a few searches.

Jim

-------------
mp_cmpr.c: In function `CDB___memp_cmpr_read':
mp_cmpr.c:200: warning: implicit declaration of function `memcpy'
mp_cmpr.c: In function `CDB___memp_cmpr_page':
mp_cmpr.c:480: warning: implicit declaration of function `memset'
mp_cmpr.c: In function `CDB___memp_cmpr_open':
mp_cmpr.c:730: warning: implicit declaration of function `strlen'
-------------
WordDBPage.h: In member function `void WordDBPage::insert_btikey(WordDBKey&, BINTERNAL&, int)':
WordDBPage.h:267: warning: int format, long unsigned int arg (arg 2)
WordDBPage.h: In member function `void WordDBPage::compress_key(Compressor&, int)':
WordDBPage.h:323: warning: int format, long unsigned int arg (arg 3)
-------------
HtConfiguration.cc: In member function `void HtConfiguration::Add(const char*, const char*, Configuration*)':
HtConfiguration.cc:39: warning: suggest parentheses around assignment used as truth value
HtConfiguration.cc: In member function `const String HtConfiguration::Find(URL*, const char*) const':
HtConfiguration.cc:129: warning: suggest parentheses around assignment used as truth value
-------------
conf_lexer.cxx: In function `int yylex()':
conf_lexer.cxx:704: warning: label `find_rule' defined but not used
/usr/include/gcc/darwin/3.1/g++-v3/bits/stl_threads.h: At top level:
conf_lexer.cxx:1790: warning: `void* yy_flex_realloc(void*, unsigned int)' defined but not used
|
From: Geoff H. <ghu...@ws...> - 2002-06-11 23:22:16
|
On Tue, 11 Jun 2002, Gilles Detillieux wrote:
> Rather than traversing the whole database looking for a URL, though,
> it may be more expedient to look up the ID for a URL in the index.

Right, though of course htpurge is written this way since it's primarily written to "clean up" after indexing. But yes, for removing arbitrary URLs, this would be faster. Maybe it'd be useful to write a separate routine for htpurge.

>
> As a note I will try to submit a patch and set of makefiles to do
> a native WIN32 port of HtDig & libhtdig this week.

This would be great. I may have another announcement in the "big patch" department later this week as well. The mifluz merge is almost finished.

-Geoff
|
From: Gilles D. <gr...@sc...> - 2002-06-11 19:51:53
|
According to Neal Richter:
> Is there a method somewhere to remove a document from an index?
>
> I thought I found one but can't locate it again..

htpurge is the tool that does this. In recent 3.2.0b4 snapshots anyway, there is a -u option to htpurge to feed it a single URL argument. You can also use "htpurge -" and feed it a list of URLs via stdin.

> This will hopefully be a method added to the libhtdig API and is
> useful in the context of large indexes, where rebuilding a huge index to
> remove 1 document is inefficient.

htpurge has two main functions, purgeDocs and purgeWords. The former is called first, given an optional Dictionary object of URLs to be purged. It traverses db.docdb looking for records to be purged for a number of reasons (not indexed, obsolete entry, not found, no excerpt, not retrieved) as well as those explicitly requested for purging, if any. It then produces a list of DocIDs, which purgeWords then uses to get rid of any words in the word database that are associated with these document IDs.

It should be reasonably easy to extract from htpurge.cc the code you need for the API function for this operation. purgeDocs is pretty easy to follow, and the actual work of deleting records is pretty straightforward. Rather than traversing the whole database looking for a URL, though, it may be more expedient to look up the ID for a URL in the index. purgeWords is harder to follow, because it uses the rather cryptic mifluz database "walking" API, but you can likely use that code as-is.

> As a note I will try to submit a patch and set of makefiles to do
> a native WIN32 port of HtDig & libhtdig this week.
>
> The way this will work is that using a WIN32 system with cygwin
> and MSVC installed, a separate set of makefiles is used to build a fully
> native WIN32 set of binaries (no cygwin dll needed).
>
> Cygwin is needed since these makefiles use GNU make instead of windows
> make, but MSVC's command line compiler is used.
>
> A ./configure is not necessary, but a setup script will
> replace/modify the existing db/db_config.h & include/htconfig.h
>
> Any feedback on how you would like this to work better, let me
> know!
>
> There are free native windows compilers out there.. Borland &
> Watcom have them for download.. Watcom's is Open Source as well..
>
> http://www.openwatcom.org

Thanks for all your work on that!

--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada)
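For reference, the two invocations described above would look roughly like this (a sketch only; check the usage output of htpurge in your snapshot for the exact option spelling, and the URL and file name here are just placeholders):

  htpurge -u http://www.example.com/obsolete/page.html   # purge a single URL
  htpurge - < urls-to-remove.txt                          # purge a list of URLs read from stdin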
|
From: Budd, S. <s....@ic...> - 2002-06-11 16:44:06
|
hello

I am using Dig 3.2.0b4-20020505. I do not seem to be able to order the results by the use of the search_results_order configuration attribute, such as:

  search_results_order: www.cts *

Is there anything I can do to rectify this?

Thanks
|