|
From: Budd, S. <s....@ic...> - 2002-12-04 15:07:50
|
Hello using Server: Apache/2.0.43 (Unix) mod_ssl/2.0.43 OpenSSL/0.9.6g with htdig-3.2.0b4-20021201. I am unable to index a SSL protected page. I notice earlier snapshots reported the problem but I cannot find any resolution. I also seem to remember that there was a problem at one time with the different versions of ssl.

The ssl protection is

<IfDefine SSL>
Listen 443
## SSL Global Context
##
## All SSL configuration in this context applies both to
## the main server and all SSL-enabled virtual hosts.
##
AddType application/x-x509-ca-cert .crt
AddType application/x-pkcs7-crl .crl
SSLPassPhraseDialog builtin
SSLSessionCache dbm:logs/ssl_scache
SSLSessionCacheTimeout 300
# Semaphore:
# Configure the path to the mutual exclusion semaphore the
# SSL engine uses internally for inter-process synchronization.
SSLMutex file:logs/ssl_mutex
# Pseudo Random Number Generator (PRNG):
SSLRandomSeed startup egd:/etc/entropy 512
##
## SSL Virtual Host Context
##
###<VirtualHost _default_:443>
<VirtualHost 155.198.62.62:443>
# General setup for the virtual host
DocumentRoot "/home/ppp/htdig-test-3.2.0b4.1110.play/web_base/sslified/"
ServerName bartoks.cc.ic.ac.uk:443
ServerAdmin s....@ic...
ErrorLog /home/ppp/htdig-test-3.2.0b4.1110.play/logs/ssl.error_log
TransferLog /home/ppp/htdig-test-3.2.0b4.1110.play/logs/ssl.access_log
# SSL Engine Switch:
# Enable/Disable SSL for this virtual host.
SSLEngine on
# SSL Cipher Suite:
# List the ciphers that the client is permitted to negotiate.
# See the mod_ssl documentation for a complete list.
SSLCipherSuite ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP:+eNULL
# Server Certificate:
# SSLCertificateFile /usr/local/apache2/conf/ssl.crt/server.crt
# Server Private Key:
# If the key is not combined with the certificate, use this
# directive to point at the key file. Keep in mind that if
# you've both a RSA and a DSA private key you can configure
# both in parallel (to also allow the use of DSA ciphers, etc.)
SSLCertificateKeyFile /usr/local/apache2/conf/ssl.key/server.key
# Client Authentication (Type):
# Client certificate verification type and depth. Types are
# none, optional, require and optional_no_ca. Depth is a
# number which specifies how deeply to verify the certificate
# issuer chain before deciding the certificate is not valid.
SSLVerifyClient none
#SSLVerifyDepth 10
SSLOptions +FakeBasicAuth
<Files ~ "\.(cgi|shtml|phtml|php3?)$">
SSLOptions +StdEnvVars
</Files>
<Directory "/usr/local/apache2/cgi-bin">
SSLOptions +StdEnvVars
</Directory>
<Directory "/home/ppp/htdig-test-3.2.0b4.1110.play/web_base/sslified/">
AuthName "Test"
AuthType basic
AuthUserFile /usr/local/apache2/etc/authorization/Passwd
require valid-user
</Directory>
SetEnvIf User-Agent ".*MSIE.*" \
nokeepalive ssl-unclean-shutdown \
downgrade-1.0 force-response-1.0
# Per-Server Logging:
# The home of a custom SSL log file. Use this when you want a
# compact non-error SSL logfile on a virtual host basis.
CustomLog logs/ssl_request_log \
"%t %h %{SSL_PROTOCOL}x %{SSL_CIPHER}x \"%r\" %b"
</VirtualHost>
</IfDefine>

the response htdig gets is:

5:7:1:https://bartok.cc.ic.ac.uk/sslified/sslprotectedpage.html: Making HTTP request on https://bartok.cc.ic.ac.uk/sslified/sslprotectedpage.html
Header line: HTTP/1.1 400 Bad Request
Header line: Date: Wed, 04 Dec 2002 14:36:44 GMT
Header line: Server: Apache/2.0.43 (Unix) mod_ssl/2.0.43 OpenSSL/0.9.6g
Header line: Content-Length: 546
Header line: Connection: close
Header line: Content-Type: text/html; charset=iso-8859-1
Request time: 0 secs
not found
pick: bartok.cc.ic.ac.uk, # servers = 2
|
|
From: Gabriele B. <g.b...@co...> - 2002-12-04 08:12:16
|
Ciao guys,

> Indexing the stems is a good suggestion. It would
> certainly give faster searching. If it replaced the
> unstemmed inverted file then it would also save on storage
> requirements, but it would mean we couldn't search on the
> unstemmed version (if that is of concern). Alternatively,
> indexing both the stemmed and unstemmed versions may be a
> bit extravagant...

IMHO, the ultimate goal for a search process is to get a set of document satisfying a semantic criteria, better a context criteria. If we can do more in this direction, that'd be great. Usually, for this purpose, the stem part of words should be enough, as it is more powerful in representing the meaning of a word.

Do you have any exact statistics showing the difference in storage of the two indexes? As Lachlan say, storing both could be a bit extravagant and, again, in my opinion could lead out of the tracks.

Also, the problem is less important in an english language; I don't know any other languages except italian and latin languages, but they are for sure more complex, having different affix rules and lots more of different tenses. You know better than me that this would lead us far away from user's first goal: the search of documents about 'something'.

As Geoff wisely suggest though, we could implement a different fuzzy algorithm for the 'Porter stemming' which builds a new index (a stemmed one). If you have some reference and suggestion, I'd be happy to offer coding it; that'd be a great chance for me to get into the 'word' module of the new ht://Dig system.

Geoff, Gilles, Neal and Lachlan, I expect some news from you about this! :-)

> I have also been wondering if it is possible to turn off
> word-level indexing, to give (much) smaller inverted files
> if people don't need phrase searching. Does anybody know?
> That would be a compelling reason to store word attributes
> in a pure bit-map format, rather than using the more
> compact formats we were discussing recently.

I guess customisation is our goal. In a retrieval phase, we'd want to store almost *anything* we can, then maybe with different fuzzy algorithms build alternative indexes (smaller or bigger, depending on users' settings).

So ... I vote for an additional algorithm as Geoff suggests! :-)

Ciao ciao
-Gabriele
--
Gabriele Bartolini - Web Programmer
Comune di Prato - Prato - Tuscany - Italy
g.b...@co... | http://www.comune.prato.it
> find bin/laden -name osama -exec rm {} ;
|
|
From: Adam N. <ad...@co...> - 2002-12-03 19:21:24
|
Errr, minor typo there. *THIS* is the patch:
[adamn@fbucks4 htsearch 11:27:49]$ cvs diff -c3 Display.cc
Index: Display.cc
===================================================================
RCS file: /cvsroot/htdig/htdig/htsearch/Display.cc,v
retrieving revision 1.105
diff -c -3 -r1.105 Display.cc
*** Display.cc 14 Jun 2002 22:31:39 -0000 1.105
--- Display.cc 3 Dec 2002 19:20:33 -0000
***************
*** 696,702 ****
QuotedStringList sep(config->Find("page_number_separator"), " \t\r\n");
if (nPages > config->Value("maximum_page_buttons", 10))
nPages = config->Value("maximum_page_buttons", 10);
! for (i = 1; i <= nPages; i++)
{
if (i == pageNumber)
{
--- 696,707 ----
QuotedStringList sep(config->Find("page_number_separator"), " \t\r\n");
if (nPages > config->Value("maximum_page_buttons", 10))
nPages = config->Value("maximum_page_buttons", 10);
! int pageSpan = config->Value("results_page_span", 10);
! int pageStart = pageNumber - pageSpan;
! int pageEnd = pageNumber + pageSpan;
! if ( pageStart < 1 ) pageStart = 1;
! if ( pageEnd > nPages ) pageEnd = nPages;
! for (i = pageStart; i <= pageEnd; i++)
{
if (i == pageNumber)
{
|
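The windowing logic in the patch above can be sketched as a standalone function. This is only an illustration, not code from ht://Dig: the names PageWindow and pageWindow are hypothetical, and the function takes the values the patch reads from pageNumber, nPages and the results_page_span attribute as plain arguments.

```cpp
#include <algorithm>

// Inclusive range of page numbers to render as page buttons.
struct PageWindow {
    int first;
    int last;
};

// Hypothetical helper (not part of ht://Dig): take pageSpan pages on
// either side of pageNumber and clamp the result to [1, nPages].
PageWindow pageWindow(int pageNumber, int nPages, int pageSpan)
{
    PageWindow w;
    w.first = std::max(1, pageNumber - pageSpan);
    w.last  = std::min(nPages, pageNumber + pageSpan);
    return w;
}
```

A caller would loop from first to last inclusive. One thing the sketch makes visible: because the patch caps nPages at maximum_page_buttons before the window is computed, a pageNumber well beyond that cap gives first > last and an empty loop. Whether that is intended is not stated in the thread.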
|
From: Adam N. <ad...@co...> - 2002-12-03 19:12:18
|
Actually, after some thought I cleaned up the code a bit and optimized some
silliness. Here's the final CVS diff versus the current CVS branch. The
same patch works in 3.1.6, except config is an object, not a pointer, so you
change the "->" to a ".".
[adamn@fbucks4 htsearch 11:15:57]$ cvs diff -c3 Display.cc
Index: Display.cc
===================================================================
RCS file: /cvsroot/htdig/htdig/htsearch/Display.cc,v
retrieving revision 1.105
diff -c -3 -r1.105 Display.cc
*** Display.cc 14 Jun 2002 22:31:39 -0000 1.105
--- Display.cc 3 Dec 2002 19:08:55 -0000
***************
*** 696,702 ****
QuotedStringList sep(config->Find("page_number_separator"), " \t\r\n");
if (nPages > config->Value("maximum_page_buttons", 10))
nPages = config->Value("maximum_page_buttons", 10);
! for (i = 1; i <= nPages; i++)
{
if (i == pageNumber)
{
--- 696,707 ----
QuotedStringList sep(config->Find("page_number_separator"), " \t\r\n");
if (nPages > config->Value("maximum_page_buttons", 10))
nPages = config->Value("maximum_page_buttons", 10);
! int pageSpan = config->Value("results_page_span", 10);
! int pageStart = pageNumber - pageSpan;
! int pageEnd = pageNumber + pageSpan;
! if ( pageStart < 1 ) pageStart = 1;
! if ( pageEnd > nPages ) pageEnd = 1;
! for (i = pageStart; i <= pageEnd; i++)
{
if (i == pageNumber)
{
----- Original Message -----
From: "Geoff Hutchison" <ghu...@ws...>
To: "Adam Ness" <ad...@co...>
Cc: <htd...@li...>
Sent: Monday, December 02, 2002 8:53 PM
Subject: Re: [htdig-dev] Google-style paging
On Monday, December 2, 2002, at 04:42 PM, Adam Ness wrote:
> We wanted the search engine for our site to do paging in the google
> style, so I've made the following changes to our htsearch/Display.cc
> that sets up another config variable (results_page_span) that allows
> you to display a set number of pages before and after the current
> page. You still have to come up with graphics, or just use the text
> links, but it's great for large search result sets
Could you provide a bit more information about your patch? What release
of ht://Dig was this patch written for? Could you provide the patch as a
"diff -c3p" so we have more context to go with the patched lines?
We appreciate the patch, and I'm sure there are plenty of other users
who'd be interested!
Thanks,
-Geoff
--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
|
|
From: Lachlan A. <lh...@ee...> - 2002-12-03 05:39:13
|
Greetings Geoff, On Tue, 3 Dec 2002 16:01, Geoff Hutchison wrote: > > I have also been wondering if it is possible to turn > > off word-level indexing, to give (much) smaller > > inverted files if people don't need phrase searching. > > Does anybody know? > > Not at the moment. > > But you lose a lot more than phrase searching. You lose > field-restricted searching. You lose scoring by proximity > (like Google). You lose the ability to score "on the > fly"--not to be discounted since many users wonder why > they change their scoring factors and the results don't > change. Thanks for raising those points. These are all enhancements that came with 3.2.0's database restructure, but I think that only phrase searching actually *needs* word-level inverted files. As I said, document-level indexing is a strong motivation for the word attributes to be pure bitmaps. The index could store the "OR" of each field set for any occurrence, so you could still say "If this word occurs in the title AND that word occurs in a heading". I agree that on-the-fly scoring is the way to go, but again I can't see why it couldn't be done based on the OR of the flags (although I could be missing something). Even (very coarse) proximity searching can be done fairly efficiently by, for example, dividing each document into eight regions and specifying (in one byte) which regions contain the word. I'm trying to avoid the "progress = bloat" phenomenon. Although I don't want to change htDig://'s course, my original interest in it was my aim of all Linux boxes having all their documentation searchable. That is one application which requires a minimal-overhead option, albeit with reduced performance. If I get enthusiastic, I'll look at writing a patch... Cheers, Lachlan -- Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678 Dept of Electrical and Electronic Engg CRICOS Provider Code University of Melbourne, Victoria, 3010 AUSTRALIA 00116K |
|
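The one-byte region idea in the message above might look like the following. This is a rough sketch under stated assumptions, not an ht://Dig data structure: regionMask and maybeNear are hypothetical names, word positions are taken to be 0-based offsets, and "near" is assumed to mean "same or adjacent eighth of the document".

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: one bit per eighth of the document.
// positions are 0-based word offsets; docLength is the total word count.
std::uint8_t regionMask(const std::vector<int>& positions, int docLength)
{
    std::uint8_t mask = 0;
    if (docLength <= 0)
        return mask;
    for (int pos : positions) {
        if (pos < 0 || pos >= docLength)
            continue;  // ignore offsets outside the document
        // Map the offset onto one of eight equal regions (0..7).
        int region = static_cast<int>((static_cast<long long>(pos) * 8) / docLength);
        mask = static_cast<std::uint8_t>(mask | (1u << region));
    }
    return mask;
}

// Very coarse proximity test: true if the two words occur in the same
// region or in neighbouring regions of the document.
bool maybeNear(std::uint8_t a, std::uint8_t b)
{
    unsigned widened = (a | (a << 1) | (a >> 1)) & 0xFFu;
    return (widened & b) != 0;
}
```

The per-field flags could be combined the same way, by OR-ing the flag bits of every occurrence of a word in a document. Note that a test like this can only say two words might be close; it cannot confirm a phrase.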
From: Geoff H. <ghu...@ws...> - 2002-12-03 05:36:24
|
On Friday, November 29, 2002, at 01:02 AM, Jim Cole wrote: > The results of my comparison are at > http://www.yggdrasill.net/htdig/cl.txt The file is essentially the > 3.1.6 ChangeLog from Feb 25 10:11:50 2000 forward, with all forward I wouldn't focus on the contrib/ or htdoc/ changes. Obviously these need to be made, but especially in the case of htdoc/ some changes are sync'ed at the very end. (FAQ.html comes to mind.) The changes to Makefile.in files don't really apply since everything is generated by automake now. I'll take a look at the various configure.in changes. -Geoff |
|
From: Geoff H. <ghu...@ws...> - 2002-12-03 05:06:03
|
On Monday, December 2, 2002, at 07:52 PM, Lachlan Andrew wrote:
> This should be a lesson to me about posting bug reports
> from memory... (One day I'll get my development machine on
> line :)

Yes, another good lesson is to post bug reports to SF.net so we can also track them. ;-)

> The backward-compatible fix is to fix htsearch (a five
> line patch)

Let's do this ASAP. We can back it out later, but at least the code will be temporarily consistent.

> but the more elegant solution would be to
> change htdig to treat the two consistently. My
> preference would be to specify the *true* location of all
> words, even counting short words. That way we could
> distinguish between a true phrase match and a "phrase
> match, but possibly with short words in between".

Yes, I agree that this is The Right Way(tm). It's definitely a bug that it's not currently specifying the *true* location of words.

BTW, do you have a SourceForge account? We should get you CVS access.

Cheers,
-Geoff
|
|
From: Geoff H. <ghu...@ws...> - 2002-12-03 05:01:35
|
> Indexing the stems is a good suggestion. It would > certainly give faster searching. If it replaced the > unstemmed inverted file then it would also save on storage > requirements, but it would mean we couldn't search on the > unstemmed version (if that is of concern). The general strategy used by ht://Dig is post-indexing fuzzy matching. Certainly a Porter stemming fuzzy algorithm would be quite useful. But I'd say if we intend on indexing stems, it should definitely be optional. I can think of several instances where I'd want to search on one particular word, and *not* stemmed variants. So I'd much rather see work into innovative fuzzy algorithms. Anyone want to add a real "spelling" fuzzy? What about a Porter endings fuzzy to replace/augment endings? > I have also been wondering if it is possible to turn off > word-level indexing, to give (much) smaller inverted files > if people don't need phrase searching. Does anybody know? Not at the moment. But you lose a lot more than phrase searching. You lose field-restricted searching. You lose scoring by proximity (like Google). You lose the ability to score "on the fly"--not to be discounted since many users wonder why they change their scoring factors and the results don't change. If you look at other search products, the basic strategy now is "index everything" and let the search frontend filter if needed. Yes, some even index words like the, and, not, etc. Just my $0.02, -Geoff |
|
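The post-indexing fuzzy strategy described above, with stemming kept optional at search time, could be sketched roughly as below. Everything here is hypothetical: toyStem is a deliberately crude suffix stripper and is NOT the Porter algorithm, and FuzzyTable, buildFuzzyTable and expand are illustrative names rather than htfuzzy's interface.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

using FuzzyTable = std::map<std::string, std::set<std::string>>;

// Toy suffix stripper for illustration only; this is NOT Porter stemming.
std::string toyStem(const std::string& word)
{
    static const char* const suffixes[] = {"ing", "ed", "es", "s"};
    for (const char* suffix : suffixes) {
        const std::string s(suffix);
        // Require at least three letters to remain after stripping.
        if (word.size() > s.size() + 2 &&
            word.compare(word.size() - s.size(), s.size(), s) == 0)
            return word.substr(0, word.size() - s.size());
    }
    return word;
}

// Built once after indexing: stem -> every indexed word sharing that stem.
FuzzyTable buildFuzzyTable(const std::vector<std::string>& indexedWords)
{
    FuzzyTable table;
    for (const auto& w : indexedWords)
        table[toyStem(w)].insert(w);
    return table;
}

// At search time: expand the query word into its variants, or keep it
// exact when the user does not want stemmed matches.
std::set<std::string> expand(const FuzzyTable& table,
                             const std::string& query, bool useFuzzy)
{
    if (!useFuzzy)
        return {query};
    const auto it = table.find(toyStem(query));
    if (it == table.end())
        return {query};
    std::set<std::string> result = it->second;
    result.insert(query);
    return result;
}
```

Because the main index still holds the unstemmed words, a search for one exact word stays possible simply by passing useFuzzy as false, which is the case the message above wants to preserve.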
From: Geoff H. <ghu...@ws...> - 2002-12-03 04:54:00
|
On Monday, December 2, 2002, at 04:42 PM, Adam Ness wrote:
> We wanted the search engine for our site to do paging in the google
> style, so I've made the following changes to our htsearch/Display.cc
> that sets up another config variable (results_page_span) that allows
> you to display a set number of pages before and after the current
> page. You still have to come up with graphics, or just use the text
> links, but it's great for large search result sets

Could you provide a bit more information about your patch? What release of ht://Dig was this patch written for? Could you provide the patch as a "diff -c3p" so we have more context to go with the patched lines?

We appreciate the patch, and I'm sure there are plenty of other users who'd be interested!

Thanks,
-Geoff

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
|
|
From: Lachlan A. <lh...@ee...> - 2002-12-03 02:04:06
|
Greetings Neal, I'll leave this to you. Indexing the stems is a good suggestion. It would certainly give faster searching. If it replaced the unstemmed inverted file then it would also save on storage requirements, but it would mean we couldn't search on the unstemmed version (if that is of concern). Alternatively, indexing both the stemmed and unstemmed versions may be a bit extravagant... I have also been wondering if it is possible to turn off word-level indexing, to give (much) smaller inverted files if people don't need phrase searching. Does anybody know? That would be a compelling reason to store word attributes in a pure bit-map format, rather than using the more compact formats we were discussing recently. Cheers, Lachlan On Tue, 3 Dec 2002 08:08, Neal Richter wrote: > This is on my list of things to work on.. > > An alternative is to have a separate word stemmer which > stores the words in the index in stemmed form. > > The Porter Stemming algorithm is good for this, and I > have code to do it. -- Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678 Dept of Electrical and Electronic Engg CRICOS Provider Code University of Melbourne, Victoria, 3010 AUSTRALIA 00116K |
|
From: Lachlan A. <lh...@ee...> - 2002-12-03 01:52:35
|
Greetings,

This should be a lesson to me about posting bug reports from memory... (One day I'll get my development machine on line :)

On Fri, 29 Nov 2002 10:09, Lachlan Andrew wrote:
> - Phrase searching doesn't work if the phrase contains a
> stop word. A search for "Something is happening" will
> fail, because "is" will be ignored. (I haven't yet
> checked if it matches a document containing "Something
> happening".)

It turns out that *short* words are handled correctly (ignored by htdig and htsearch), but "bad" words are not. They are not inserted into the inverted file, but they *are* counted by HTML.cc (and presumably by other parsers) when determining the location of subsequent words.

The backward-compatible fix is to fix htsearch (a five line patch) but the more elegant solution would be to change htdig to treat the two consistently. My preference would be to specify the *true* location of all words, even counting short words. That way we could distinguish between a true phrase match and a "phrase match, but possibly with short words in between". However, completely ignoring both short and "bad" words would be a step in the right direction.

> - I think I may have broken the code to stop "index.html"
> being stripped from "file:///" URLs. I'm not sure if the
> bug has been committed yet, but my next patch will fix
> it.

That was my bad memory of the testing I'd done. All is well on that count.

Cheers,
Lachlan

--
Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678
Dept of Electrical and Electronic Engg CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA 00116K
|
|
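The "true location" approach preferred in the message above can be illustrated with a small sketch in which indexer and searcher number words the same way: skipped words still advance the location counter, so the gaps between the remaining words are preserved. All names here (Postings, locate, indexDocument, phraseMatches) are hypothetical, and the whitespace tokenizer and case-sensitive comparison are simplifying assumptions, not how HTML.cc or htsearch actually parse text.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Postings = std::map<std::string, std::set<int>>;  // word -> true locations

// Split on whitespace and number every word, but keep only the words
// that are not in the skip list.
std::vector<std::pair<std::string, int>>
locate(const std::string& text, const std::set<std::string>& skipWords)
{
    std::vector<std::pair<std::string, int>> kept;
    std::istringstream in(text);
    std::string word;
    int location = 0;
    while (in >> word) {
        if (skipWords.count(word) == 0)
            kept.emplace_back(word, location);
        ++location;  // skipped words still advance the location counter
    }
    return kept;
}

Postings indexDocument(const std::string& text,
                       const std::set<std::string>& skipWords)
{
    Postings postings;
    for (const auto& entry : locate(text, skipWords))
        postings[entry.first].insert(entry.second);
    return postings;
}

// The phrase matches if some occurrence of its first kept word is
// followed by every other kept word at the same relative offset.
bool phraseMatches(const Postings& doc, const std::string& phrase,
                   const std::set<std::string>& skipWords)
{
    const auto terms = locate(phrase, skipWords);
    if (terms.empty())
        return false;
    const auto first = doc.find(terms[0].first);
    if (first == doc.end())
        return false;
    for (int start : first->second) {
        bool all = true;
        for (const auto& term : terms) {
            const auto it = doc.find(term.first);
            const int wanted = start + (term.second - terms[0].second);
            if (it == doc.end() || it->second.count(wanted) == 0) {
                all = false;
                break;
            }
        }
        if (all)
            return true;
    }
    return false;
}
```

With "is" in the skip list, a document containing "Something is happening" matches the phrase "Something is happening" but not "Something happening", because the one-word gap is part of the stored locations. Since the skipped word itself is never stored, any single word in that slot would match, which is the "possibly with short words in between" case.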
From: Adam N. <ad...@co...> - 2002-12-03 00:57:35
|
We wanted the search engine for our site to do paging in the google
style, so I've made the following changes to our htsearch/Display.cc
that sets up another config variable (results_page_span) that allows you
to display a set number of pages before and after the current page. You
still have to come up with graphics, or just use the text links, but
it's great for large search result sets
Adam Ness
diff -r1.105 Display.cc
698a699
> int pageSpan = config->Value("results_page_span", 10);
701,706c702,724
< if (i == pageNumber)
< {
< p = npnt[i - 1];
< if (!p)
< p = form("%d", i);
< *str << p;
---
> if ( i > ( pageNumber - pageSpan ) &&
> i < ( pageNumber + pageSpan ) ) {
> if (i == pageNumber)
> {
> p = npnt[i - 1];
> if (!p)
> p = form("%d", i);
> *str << p;
> }
> else
> {
> p = pnt[i - 1];
> if (!p)
> p = form("%d", i);
> *str << "<a href=\"";
> tmp = 0;
> createURL(tmp, i);
> *str << tmp << "\">" << p << "</a>";
> }
> if (i != nPages && sep.Count() > 0)
> *str << sep[(i-1)%sep.Count()];
> else if (i != nPages && sep.Count() <= 0)
> *str << " ";
708,722c726
< else
< {
< p = pnt[i - 1];
< if (!p)
< p = form("%d", i);
< *str << "<a href=\"";
< tmp = 0;
< createURL(tmp, i);
< *str << tmp << "\">" << p << "</a>";
< }
< if (i != nPages && sep.Count() > 0)
< *str << sep[(i-1)%sep.Count()];
< else if (i != nPages && sep.Count() <= 0)
< *str << " ";
< }
---
> }
|
|
From: Neal R. <ne...@ri...> - 2002-12-02 21:11:29
|
This is on my list of things to work on.. An alternative is to have a separate word stemmer which stores the words in the index in stemmed form. The Porter Stemming algorithm is good for this, and I have code to do it. Thanks. On Fri, 29 Nov 2002, Lachlan Andrew wrote: > On Fri, 29 Nov 2002 02:21, Gilles Detillieux wrote: > > if someone wants a dictionary > > for some other language and finds that such a dictionary > > is better supported or more complete/correct in aspell > > Ahh... That makes sense. Thanks. However I still don't > understand why we wouldn't want the English dictionary to > stem unrealised and realises together (which implicitly > allows the non-word "unrealises"). > > > A quick grep shows there are a lot of *hs words in there > > that htfuzzy can't make use of. The quick fix would be > > to grab one of the available flags (BCEFKLOQW) and use > > that for th->ths, but it might be more logical to keep > > the S flag for th->ths pluralizations, and use something > > like E for th->thes conjugations. > > Yes. I've suggested a new rule to the ispell maintainer > ([^cst]h -> s, [cst]h -> es) which fixes most problems > (*gh,*ph), while maintaining compatibility with ispell as > much as possible. You're right that we can add lots more > rules to improve stemming in lots of ways. > > Cheers, > Lachlan > > -- > Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678 > Dept of Electrical and Electronic Engg CRICOS Provider Code > University of Melbourne, Victoria, 3010 AUSTRALIA 00116K > > > ------------------------------------------------------- > This SF.net email is sponsored by: Get the new Palm Tungsten T > handheld. Power & Color in a compact size! > http://ads.sourceforge.net/cgi-bin/redirect.pl?palm0002en > _______________________________________________ > htdig-dev mailing list > htd...@li... > https://lists.sourceforge.net/lists/listinfo/htdig-dev > Neal Richter Knowledgebase Developer RightNow Technologies, Inc. 
Customer Service for Every Web Site Office: 406-522-1485 |
|
From: J. op d. B. <ht...@op...> - 2002-12-02 09:07:45
|
Hi all,

I've set up an experimental rsync server for the maindocs, files and patches site/repository. I just wanted to see if it was difficult, but it wasn't.

Transcription:

[jesse@juliette2 jesse]$ rsync www.opdenbrouw.nl::
Welcome to the ht://dig Rsync Archive!
maindocs The ht://dig web site (3.0 MB)
files The ht://dig files site (86 MB)
patches The ht://dig patch site (6.6 MB)
[jesse@juliette2 jesse]$

To copy the maindocs to directory website:

rsync -aP www.opdenbrouw.nl::maindocs website

Have fun!
|
|
From: Geoff H. <ghu...@us...> - 2002-12-01 08:13:56
|
STATUS of ht://Dig branch 3-2-x
RELEASES:
3.2.0b5: Next release, tentatively 1 Dec 2002.
3.2.0b4: "In progress" -- snapshots called "3.2.0b4" until prerelease.
3.2.0b3: Released: 22 Feb 2001.
3.2.0b2: Released: 11 Apr 2000.
3.2.0b1: Released: 4 Feb 2000.
(Please note that everything added here should have a tracker PR# so
we can be sure they're fixed. Geoff is currently trying to add PR#s for
what's currently here.)
SHOWSTOPPERS:
* Mifluz database errors are a severe problem (PR#428295)
-- Does Neal's new zlib patch solve this for now?
KNOWN BUGS:
* Odd behavior with $(MODIFIED) and scores not working with
wordlist_compress set but work fine without wordlist_compress.
(the date is definitely stored correctly, even with compression on
so this must be some sort of weird htsearch bug) PR#618737.
* Not all htsearch input parameters are handled properly: PR#405278. Use a
consistent mapping of input -> config -> template for all inputs where
it makes sense to do so (everything but "config" and "words"?).
* META descriptions are somehow added to the database as FLAG_TITLE,
not FLAG_DESCRIPTION. (PR#618738)
PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
* Mifluz merge.
NEEDED FEATURES:
* Field-restricted searching. (e.g. PR#460833)
* Return all URLs. (PR#618743)
* Handle noindex_start & noindex_end as string lists.
* Quim's new htsearch/qtest query parser framework.
* File/Database locking. PR#405764.
TESTING:
* httools programs:
(htload a test file, check a few characteristics, htdump and compare)
* Turn on URL parser test as part of test suite.
* htsearch phrase support tests
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen,
argument handling for parser/converter, allowing binary output from an
external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.
DOCUMENTATION:
* List of supported platforms/compilers is ancient. (PR#405279)
* Add thorough documentation on htsearch restrict/exclude behavior
(including '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes
to template variables. (Relates to PR#405278.)
Should we make sure these config attributes are all documented in
defaults.cc, even if they're only set by input parameters and never
in the config file?
* Split attrs.html into categories for faster loading.
* Turn defaults.cc into an XML file for generating documentation and
defaults.cc.
* require.html is not updated to list new features and disk space
requirements of 3.2.x (e.g. phrase searching, regex matching,
external parsers and transport methods, database compression.)
PRs# 405280 #405281.
* TODO.html has not been updated for current TODO list and
completions.
* Htfuzzy could use more documentation on what each fuzzy algorithm
does. PR#405714.
* Document the list of all installed files and default
locations. PR#405715.
OTHER ISSUES:
* Can htsearch actually search while an index is being created?
* The code needs a security audit, esp. htsearch. PR#405765.
|
|
From: Jim C. <gre...@yg...> - 2002-11-29 07:02:04
|
On Wednesday, October 30, 2002, at 01:56 PM, Gilles Detillieux wrote: > Comparing ChangeLog entries between 3.1.6 and the 3.2 cvs would be the > first step, and would find most of the missing stuff. Note that there The results of my comparison are at http://www.yggdrasill.net/htdig/cl.txt The file is essentially the 3.1.6 ChangeLog from Feb 25 10:11:50 2000 forward, with all forward ported items removed. I used the htdig-3.2.0b4-20021117 snapshot for the final comparison; for the most part, all comparisons were made against the related files, rather than the ChangeLog. There are a few of my own notes inserted along the way. I am 99.9% certain that a number of entries correspond to places where functionality has been intentionally removed or reimplemented in a different location/manner; however I figured it might be best to play dumb and let others with more knowledge make the final decision on those items. Jim |
|
From: Lachlan A. <lh...@ee...> - 2002-11-28 23:10:13
|
On Tue, 26 Nov 2002 17:08, Geoff Hutchison wrote: > I haven't heard from much of anyone, which isn't > necessarily what I want just before a release. So I think > we'll have to prod some people into confirming that the > current snapshots are working, bugs have been fixed, etc. OK, from my little bit of testing: - Phrase searching doesn't work if the phrase contains a stop word. A search for "Something is happening" will fail, because "is" will be ignored. (I haven't yet checked if it matches a document containing "Something happening".) I'll try to fix it, but any tips would be welcome. - I think I may have broken the code to stop "index.html" being stripped from "file:///" URLs. I'm not sure if the bug has been committed yet, but my next patch will fix it. (Also, FWIW, http://wso.williams.edu/search returns Unable to read document index file Did you run htmerge? Just keeping you on your toes :) Is there (or could readers of this list post) a list of sites running recent snapshots? Thanks, Lachlan -- Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678 Dept of Electrical and Electronic Engg CRICOS Provider Code University of Melbourne, Victoria, 3010 AUSTRALIA 00116K |
|
From: Lachlan A. <lh...@ee...> - 2002-11-28 22:37:34
|
On Fri, 29 Nov 2002 02:21, Gilles Detillieux wrote: > if someone wants a dictionary > for some other language and finds that such a dictionary > is better supported or more complete/correct in aspell Ahh... That makes sense. Thanks. However I still don't understand why we wouldn't want the English dictionary to stem unrealised and realises together (which implicitly allows the non-word "unrealises"). > A quick grep shows there are a lot of *hs words in there > that htfuzzy can't make use of. The quick fix would be > to grab one of the available flags (BCEFKLOQW) and use > that for th->ths, but it might be more logical to keep > the S flag for th->ths pluralizations, and use something > like E for th->thes conjugations. Yes. I've suggested a new rule to the ispell maintainer ([^cst]h -> s, [cst]h -> es) which fixes most problems (*gh,*ph), while maintaining compatibility with ispell as much as possible. You're right that we can add lots more rules to improve stemming in lots of ways. Cheers, Lachlan -- Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678 Dept of Electrical and Electronic Engg CRICOS Provider Code University of Melbourne, Victoria, 3010 AUSTRALIA 00116K |
|
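The rule suggested to the ispell maintainer in the message above ([^cst]h -> s, [cst]h -> es) is small enough to sketch directly. applyHRule is a hypothetical name, and this is not ispell's affix-file syntax or htfuzzy's implementation, only the rule as quoted applied to lowercase words.

```cpp
#include <string>

// Sketch of the rule as quoted in the thread:
//   [^cst]h -> append "s",  [cst]h -> append "es".
// Words that do not end in 'h', or that have no letter before the 'h',
// are returned unchanged because the rule says nothing about them.
std::string applyHRule(const std::string& word)
{
    const std::string::size_type n = word.size();
    if (n < 2 || word[n - 1] != 'h')
        return word;
    const char before = word[n - 2];
    const bool cst = (before == 'c' || before == 's' || before == 't');
    return word + (cst ? "es" : "s");
}
```

Under this rule "laugh" and "graph" gain a plain "s", while "church" and "wreath" gain "es". It still turns "birth" into "birthes", which is why the thread goes on to discuss a separate flag for th -> ths.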
From: Gilles D. <gr...@sc...> - 2002-11-28 15:21:42
|
According to Lachlan Andrew:
> Greetings,
>
> On Wed, 27 Nov 2002 15:53, Gilles Detillieux wrote:
> > According to Geoff Hutchison:
> > > > Is the policy to have all possible stemmings, even if
> > > > they are "non-words", like "unrealises"?
> > > No, and I'd expect that ispell doesn't want them
> > > either. Of course many people have moved away from
> > > ispell too...
> > Does that mean we'll end up having to add support for
> > aspell dictionaries to htfuzzy endings?
>
> Does it matter that the list originally came from ispell?
> Its role here is fundamentally different. For a spell
> checker, you only care what combinations of letters are
> valid words. For a stemmer, you only want to know which
> "words" are derived from a common stem. Unless the same
> actual file is used for spell checking, it is not clear why
> it matters what spell checker people use. Am I missing
> something?

It only matters in that it may have an impact on available dictionaries. We can tweak the English dictionary all we want, but if someone wants a dictionary for some other language and finds that such a dictionary is better supported or more complete/correct in aspell or some other spell checker than it is with ispell, then they may start asking for support in htfuzzy for these other dictionary formats.

> > I've made the changes to english.0 and synonyms, with one
> > minor addition (adding D & S flags to birth).
>
> Thanks for that. You might want to reconsider the '/S'
> flag; it produces 'birthes', not 'births' as you might
> expect. (The '*h -> es' rule suits words like 'wreath'.)
> That rule and its use are among the things I hope to clean
> up after 3.2.0b5 is out...

It's out of there. Thanks for the heads-up. I've reinserted "births" instead, for the sake of completeness, even though htfuzzy won't use it. A quick grep shows there are a lot of *hs words in there that htfuzzy can't make use of. The quick fix would be to grab one of the available flags (BCEFKLOQW) and use that for th->ths, but it might be more logical to keep the S flag for th->ths pluralizations, and use something like E for th->thes conjugations.

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada) |
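The two-flag idea (S for th->ths pluralizations, something like E for th->thes conjugations) could be pictured roughly as follows. This is only a hedged Python sketch of the suggestion, not how htfuzzy reads its affix file: the RULES table, the expand() helper and the sample entries "birth/S" and "wreath/SE" are hypothetical, and the rules are only meant to be applied to stems ending in "th".

```python
# Hypothetical flag table for stems ending in "th":
# S is kept as the plain plural, E is an assumed new flag for "-es" conjugations.
RULES = {
    "S": lambda stem: stem + "s",   # th -> ths
    "E": lambda stem: stem + "es",  # th -> thes
}


def expand(entry):
    """Expand a 'stem/FLAGS' style entry into the stem plus its derived forms.

    Flags that are not in RULES (for example D) are ignored in this sketch.
    """
    stem, _, flags = entry.partition("/")
    return [stem] + [RULES[flag](stem) for flag in flags if flag in RULES]


if __name__ == "__main__":
    print(expand("birth/S"))
    print(expand("wreath/SE"))
```

With separate flags, each dictionary entry states which endings it accepts, so a noun such as "birth" would not pick up "birthes" merely because "wreath" needs "wreathes".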
|
From: Lachlan A. <lh...@ee...> - 2002-11-28 02:05:09
|
Greetings,

On Wed, 27 Nov 2002 15:53, Gilles Detillieux wrote:
> According to Geoff Hutchison:
> > > Is the policy to have all possible stemmings, even if
> > > they are "non-words", like "unrealises"?
> > No, and I'd expect that ispell doesn't want them
> > either. Of course many people have moved away from
> > ispell too...
> Does that mean we'll end up having to add support for
> aspell dictionaries to htfuzzy endings?

Does it matter that the list originally came from ispell? Its role here is fundamentally different. For a spell checker, you only care what combinations of letters are valid words. For a stemmer, you only want to know which "words" are derived from a common stem. Unless the same actual file is used for spell checking, it is not clear why it matters what spell checker people use. Am I missing something?

> I've made the changes to english.0 and synonyms, with one
> minor addition (adding D & S flags to birth).

Thanks for that. You might want to reconsider the '/S' flag; it produces 'birthes', not 'births' as you might expect. (The '*h -> es' rule suits words like 'wreath'.) That rule and its use are among the things I hope to clean up after 3.2.0b5 is out...

> Let's get the doc changes in. Then, if we can
> get the xml stuff in for 3.2.0b5, great.

OK :)

-- 
Lachlan Andrew  Phone: +613 8344-3816  Fax: +613 8344-6678
Dept of Electrical and Electronic Engg  CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA  00116K |
|
From: Gabriele B. <g.b...@co...> - 2002-11-27 12:02:00
|
Ciao guys,

I noticed that in 3.1.x documentation 'maximum_word_length' is said to be used by htdig and htsearch. However, by using the accents algorithm it is used by htfuzzy too.

I think this must be updated in defaults.cc and Brian's defaults.xml (also in 3.1.x code, eventually, and the online website).

Ciao ciao
-Gabriele

-- 
Gabriele Bartolini - Web Programmer
Comune di Prato - Prato - Tuscany - Italy
g.b...@co... | http://www.comune.prato.it
> find bin/laden -name osama -exec rm {} ; |
|
From: Gilles D. <gr...@sc...> - 2002-11-27 04:53:53
|
According to Geoff Hutchison:
> > Is the policy to have all possible stemmings, even if they
> > are "non-words", like "unrealises"? If so, we can really
> > go to town on the affixes :)
>
> No, and I'd expect that ispell doesn't want them either. Of course many
> people have moved away from ispell too...

Does that mean we'll end up having to add support for aspell dictionaries to htfuzzy endings?

> > Is the release still scheduled for 1 December? I think
> > I've now got almost all of the 3.1.6 functionality in,
> > except for the wildcard.
>
> I haven't heard from much of anyone, which isn't necessarily what I want
> just before a release. So I think we'll have to prod some people into
> confirming that the current snapshots are working, bugs have been fixed,
> etc.
>
> My guess is that the release will have to move back a small amount, but
> that's not too surprising. Still, I'd like to get it out before I head
> to my parents for the holidays.

Sorry for my silence, but things got a bit hectic last week, so I never did get to committing patches as I hoped. I also haven't done much testing of recent snapshots, though I don't expect the tests I run on my small site contribute a whole lot to the overall testing effort anyway (how many times does it need to be tested on a small site running Red Hat?). I fully expected some schedule slippage, but 4 days to the tentative release date, I think we'll need to allow for a lot of slippage!

Anyway, for what it's worth, I've made the changes to english.0 and synonyms, with one minor addition (adding D & S flags to birth). Thanks, Lachlan! Once again, I'll try to get to some of the other changes later this week, but can't promise anything because things are still rather busy for me.

> > (I'm not too keen to port all of the documentation over until
> > defaults.xml becomes the official reference...)
>
> I think it seems easy enough to generate the defaults.xml from any
> changes. So why don't we get the documentation over first. (It seems to
> me that it'd be easier to patch defaults.cc, then re-generate
> defaults.xml?)

I agree. Let's get the doc changes in. Then, if we can get the xml stuff in for 3.2.0b5, great, but if not, it's not the end of the world.

-- 
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada) |
|
From: Budd, S. <s....@ic...> - 2002-11-26 10:34:43
|
Hello.

Is there any intention of incorporating the AdjustableLoggingPatch into Htdig-3.2.0b4(5)?
ftp://ftp.ccsf.org/htdig-patches/3.1.6/AdjustableLoggingPatch.tar.gz

Regards
Sinclair Budd
s....@ic... |
|
From: Geoff H. <ghu...@ws...> - 2002-11-26 06:09:08
|
> Is the policy to have all possible stemmings, even if they
> are "non-words", like "unrealises"? If so, we can really
> go to town on the affixes :)

No, and I'd expect that ispell doesn't want them either. Of course many people have moved away from ispell too...

> Is the release still scheduled for 1 December? I think
> I've now got almost all of the 3.1.6 functionality in,
> except for the wildcard.

I haven't heard from much of anyone, which isn't necessarily what I want just before a release. So I think we'll have to prod some people into confirming that the current snapshots are working, bugs have been fixed, etc.

My guess is that the release will have to move back a small amount, but that's not too surprising. Still, I'd like to get it out before I head to my parents for the holidays.

> (I'm not too keen to port all of the documentation over until
> defaults.xml becomes the official reference...)

I think it seems easy enough to generate the defaults.xml from any changes. So why don't we get the documentation over first. (It seems to me that it'd be easier to patch defaults.cc, then re-generate defaults.xml?)

-Geoff |
|
From: Geoff H. <ghu...@ws...> - 2002-11-26 06:03:55
|
On Monday, November 25, 2002, at 12:18 PM, Andrea Capiluppi wrote:
> i was analyzing htDig, but my problem is that i don't have the sequence of
> the different versions.

If you put together
http://www.htdig.org/RELEASE.html
http://www.htdig.org/dev/htdig-3.2/RELEASE.html
(in reverse chronological order)

3.2.0b5   "Real Soon Now"
3.1.6     1-Feb-2002
3.2.0b3   22-Feb-2001
3.2.0b2   11-Apr-2000
3.1.5     25-Feb-2000
3.2.0b1   14-Feb-2000
3.1.4     9-Dec-1999
3.1.3     22-Sep-1999
3.1.2     21-Apr-1999
3.1.1     17-Feb-1999
3.1.0     9-Feb-1999
3.1.0b4   22-Dec-1998
3.1.0b3   15-Dec-1998
3.1.0b2   1-Nov-1998
3.1.0b1   8-Sep-1998
...
(these files go back to 3.0, 17-Jul-1996, with mention of an undated 3.0b5.)

The 3.2.0bX releases were supposed to come at a point that 3.1.x releases were stable, but:
a) There has been a need to fix bugs and add minor features to the 3.1.x releases.
b) There hasn't been as much development effort on the 3.2 code as we'd like. (We're currently in the process of improving the situation and getting 3.2 "unstuck.")

> Last but not least, if you are interested in what is this study about, tell
> me, and i will explain in more details.

I guess we're at least a little curious.

Ciao!
-Geoff |