|
From: Geoff H. <ghu...@us...> - 2002-12-22 08:13:56
|
STATUS of ht://Dig branch 3-2-x
RELEASES:
3.2.0b5: Next release, tentatively 1 Dec 2002.
3.2.0b4: "In progress" -- snapshots called "3.2.0b4" until prerelease.
3.2.0b3: Released: 22 Feb 2001.
3.2.0b2: Released: 11 Apr 2000.
3.2.0b1: Released: 4 Feb 2000.
(Please note that everything added here should have a tracker PR# so
we can be sure they're fixed. Geoff is currently trying to add PR#s for
what's currently here.)
SHOWSTOPPERS:
* Mifluz database errors are a severe problem (PR#428295)
-- Does Neal's new zlib patch solve this for now?
KNOWN BUGS:
* Odd behavior: $(MODIFIED) and scores do not work with
wordlist_compress set, but work fine without it.
(The date is definitely stored correctly, even with compression on,
so this must be some sort of weird htsearch bug.) PR#618737.
* Not all htsearch input parameters are handled properly: PR#405278. Use a
consistent mapping of input -> config -> template for all inputs where
it makes sense to do so (everything but "config" and "words"?).
* META descriptions are somehow added to the database as FLAG_TITLE,
not FLAG_DESCRIPTION. (PR#618738)
PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
* Mifluz merge.
NEEDED FEATURES:
* Field-restricted searching. (e.g. PR#460833)
* Return all URLs. (PR#618743)
* Handle noindex_start & noindex_end as string lists.
* Quim's new htsearch/qtest query parser framework.
* File/Database locking. PR#405764.
TESTING:
* httools programs:
(htload a test file, check a few characteristics, htdump and compare)
* Turn on URL parser test as part of test suite.
* htsearch phrase support tests
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen,
argument handling for parser/converter, allowing binary output from an
external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.
DOCUMENTATION:
* List of supported platforms/compilers is ancient. (PR#405279)
* Add thorough documentation on htsearch restrict/exclude behavior
(including '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes
to template variables. (Relates to PR#405278.)
Should we make sure these config attributes are all documented in
defaults.cc, even if they're only set by input parameters and never
in the config file?
* Split attrs.html into categories for faster loading.
* Turn defaults.cc into an XML file for generating documentation and
defaults.cc.
* require.html is not updated to list new features and disk space
requirements of 3.2.x (e.g. phrase searching, regex matching,
external parsers and transport methods, database compression.)
PR#405280, PR#405281.
* TODO.html has not been updated for current TODO list and
completions.
* Htfuzzy could use more documentation on what each fuzzy algorithm
does. PR#405714.
* Document the list of all installed files and default
locations. PR#405715.
OTHER ISSUES:
* Can htsearch actually search while an index is being created?
* The code needs a security audit, esp. htsearch. PR#405765.
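On the TESTING item above about moving ExternalParser.cc from popen to fork/exec: the practical difference is that popen() routes the command line through a shell, so metacharacters in a document-derived argument get interpreted, while an exec-style call passes arguments to the child verbatim. A minimal sketch of that distinction, using /bin/echo as a stand-in for an external parser (assumes a POSIX system; not ht://Dig code):

```python
import subprocess

# A filename a shell would mangle: the ';' is a command separator.
doc_path = "weird; name.html"

# exec-style (fork/exec): the argument list reaches the child untouched.
result = subprocess.run(["/bin/echo", doc_path],
                        capture_output=True, text=True)
print(result.stdout.strip())  # weird; name.html

# shell-style (what popen() does): the same string is re-parsed by the
# shell, so only "weird" reaches echo and "name.html" is run as a command.
shell = subprocess.run("echo " + doc_path, shell=True,
                       capture_output=True, text=True)
print(shell.stdout.strip())  # weird
```

The same argument applies to the security-audit item: any place htsearch or htdig hands untrusted text to a shell is a candidate for the exec-style form.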
|
|
From: Lachlan A. <lh...@ee...> - 2002-12-20 00:57:45
|
On Fri, 20 Dec 2002 02:13, Geoff Hutchison wrote:
> > - Why was regex.h renamed gregex.h in 3.1.6? It seems to break the
> > configure script, so that it always reports HAVE_BROKEN_REGEX.
>
> Strange, it wasn't doing that for me, but perhaps that's because I was
> using gcc-3.x?

OK. Perhaps I've broken something in my copy... The problem I had was that configure tries to #include "HtRegex.c" in the top directory, and it couldn't find gregex.h (included from HtRegex.h). 3.2.0b4-cvs seems to happily read the system regex.h... I take it that the preferred behaviour is to use the included regex code, with a fall-back to the system code, rather than the other way around. Is that right?

> > class, but that will mean that the sysadmin can't use sym-links to do
> > things like make machines with different directory structures appear
> > to have the same structure. Is that likely to be a problem?
>
> Yes, I'd think that's a problem. Would it be easier to handle a symlink
> as a redirect?

Hmm... IIRC, redirects do essentially the same as the patch. The situation I was meaning to describe is:

Computer A: /usr/foo/help.html, with /usr/local -> /usr
Computer B: /usr/local/foo/help.html

Using *either* redirects or my patch, a dig on A starting from <file:///usr/local/> would cause the database entry <file:///usr/foo/help.html> rather than <file:///usr/local/foo/help.html>. (Otherwise, it would also have to include /usr/local/local/.) That is fine as long as the search *client* (i.e., the browser, not htsearch) is on computer A, but it points to a non-existent file on computer B. (Of course, the system could still be set up this way and the aliases would work for other applications, but to ht://Dig the filesystems would not look the same.) The semantics of a file:/// URL say that it only refers to the local computer, so using sym-links to try to make them the same on different computers is technically "not kosher" anyway, as I understand it.

> the newest versions of automake need to use the most current versions
> of autoconf. As well, you need to run "aclocal" usually.

Thanks.

> > - Geoff, do you still plan to give me CVS access? It would be great
> > if you could.
>
> Yes. I'm wading through a huge e-mail backlog. :-(

Thanks again. My account name is lha. (There is no rush. I was just hoping that you hadn't changed your mind when you saw what a newbie I am :)

Merry Christmas :)

Lachlan
--
Lachlan Andrew  Phone: +613 8344-3816  Fax: +613 8344-6678
Dept of Electrical and Electronic Engg, University of Melbourne,
Victoria, 3010 AUSTRALIA  CRICOS Provider Code 00116K |
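As a side note, the resolution behaviour Lachlan describes can be reproduced with os.path.realpath: the directory layout below recreates his "computer A" inside a temporary directory (illustration only, not ht://Dig code):

```python
import os
import tempfile

# Recreate "computer A": /usr/foo/help.html exists, and /usr/local is a
# symlink back to /usr, all rooted in a scratch directory.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "usr", "foo"))
with open(os.path.join(root, "usr", "foo", "help.html"), "w") as f:
    f.write("help\n")
os.symlink(os.path.join(root, "usr"), os.path.join(root, "usr", "local"))

# A dig starting from .../usr/local/ reaches help.html via the symlink...
via_link = os.path.join(root, "usr", "local", "foo", "help.html")

# ...but resolving the link stores the canonical path, without "local",
# which is exactly why the database entry differs from the crawled URL.
resolved = os.path.realpath(via_link)
print(resolved.endswith(os.path.join("usr", "foo", "help.html")))  # True
```

This is the behaviour a patch that resolves sym-links would bake into the database, and why the stored file:/// URL no longer matches the aliased path on a differently laid-out machine.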
|
From: Geoff H. <ghu...@ws...> - 2002-12-19 15:14:00
|
> - Why was regex.h renamed gregex.h in 3.1.6? It seems to break the
> configure script, so that it always reports HAVE_BROKEN_REGEX.

Strange, it wasn't doing that for me, but perhaps that's because I was using gcc-3.x? The change was made because certain systems have serious problems between the included regex and the system regex. We were told that there were segfaults with the included regex.h (prob. because it was receiving precedence in the INCLUDES) and the system libc's regex code. Their solution was to move regex.h to gregex.h. Since it doesn't matter much to *us*, that seemed fine.

> .../foo/bar/fred/ instead of .../foo/fred/. That could lead to loops
> (although I haven't tested that)

Loops are quite possible with symlinks (though what you describe is a bug). This is why the checksumming code is important.

> class, but that will mean that the sysadmin can't use sym-links to do
> things like make machines with different directory structures appear to
> have the same structure. Is that likely to be a problem?

Yes, I'd think that's a problem. Would it be easier to handle a symlink as a redirect?

> - The HtFTP class seems to be basically the same as the HtFile class,
> with name changes. (From the name, I would

The HtFTP class was started by a contributor and I haven't seen any further evidence that it's been completed. So no, it doesn't handle ftp:// URLs, nor should it be "hooked up" in Document.cc.

> - I've changed some Makefile.am files, but when I automake (version
> 1.5), it generates files using the macro $(OBJEXT) where .o is produced
> by the current Makefile.in, and OBJEXT isn't defined.

My understanding is that the newest versions of automake really need to use the most current versions of autoconf as well. I don't believe the configure.in has been updated for autoconf-2.5x. As well, you need to run "aclocal" usually to update that file with a variety of automake/autoconf macros.

> - Geoff, do you still plan to give me CVS access? It would be great if
> you could.

Yes, sorry. I'm wading through a huge e-mail backlog. :-( Send me an e-mail with your SourceForge account and I can turn it on.

-Geoff |
|
From: Lachlan A. <lh...@ee...> - 2002-12-18 23:48:45
|
Greetings Bernd,

Thanks! Could you please post the patch? Which version of ht://Dig are you running?

Thanks,
Lachlan

On Thu, 19 Dec 2002 07:10, you wrote:
> I had troubles with accented characters on a host running Solaris 9.
> I changed htlib/Configuration.cc to use setlocale(LC_CTYPE) as well.
> Bingo, htdig now works as expected.

--
Lachlan Andrew  Phone: +613 8344-3816  Fax: +613 8344-6678
Dept of Electrical and Electronic Engg, University of Melbourne,
Victoria, 3010 AUSTRALIA  CRICOS Provider Code 00116K |
|
From: Lachlan A. <lh...@ee...> - 2002-12-18 23:36:46
|
Greetings,

A few questions for those with more experience than me:

- Why was regex.h renamed gregex.h in 3.1.6? It seems to break the configure script, so that it always reports HAVE_BROKEN_REGEX. This gave me troubles under gcc 2.90, which was fixed by changing #ifdef HAVE_BROKEN_REGEX to #ifndef... and renaming all the functions in regex.[ch]. Does anyone know the "right" solution?

- HtFile.cc doesn't handle symbolic links to directories very well. For example, a sym-link fred/ -> ../fred/ in directory .../foo/bar/ leads to the path .../foo/bar/fred/ instead of .../foo/fred/. That could lead to loops (although I haven't tested that) and makes anchors with the path component '../' point to the wrong place. It also causes the documents to be indexed multiple times. I've written a patch to resolve sym-links using the URL class, but that will mean that the sysadmin can't use sym-links to do things like make machines with different directory structures appear to have the same structure. Is that likely to be a problem?

- The HtFTP class seems to be basically the same as the HtFile class, with name changes. (From the name, I would have thought that HtFTP would handle ftp:// requests, but it doesn't seem to...) I have made changes to HtFile (sym-links, determining MIME types from file content). Should I mirror these changes in HtFTP?

- I've changed some Makefile.am files, but when I run automake (version 1.5), it generates files using the macro $(OBJEXT) where .o is produced by the current Makefile.in, and OBJEXT isn't defined. How can I fix that? The configure script determines the object extension, but that doesn't seem to be used anywhere...

- Geoff, do you still plan to give me CVS access? It would be great if you could.

Thanks,
Lachlan
--
Lachlan Andrew  Phone: +613 8344-3816  Fax: +613 8344-6678
Dept of Electrical and Electronic Engg, University of Melbourne,
Victoria, 3010 AUSTRALIA  CRICOS Provider Code 00116K |
|
From: Geoff H. <ghu...@ws...> - 2002-12-18 18:04:37
|
> And it only responded with "Unable to read configuration file"...it did not > return back the .conf file location. ... > Can you please tell me where to fix this. Yes. You will need to update to htdig-3.1.6. <http://www.htdig.org/where.html> -- -Geoff Hutchison Williams Students Online http://wso.williams.edu/ |
|
From: Bill B. <Bi...@ko...> - 2002-12-18 16:31:27
|
Howdy, I cannot find a configuration to change the output when I query my server with the security problem.. I tried your server with: http://www.htdig.org/cgi-bin/htsearch?config=aaa And it only responded with "Unable to read configuration file"...it did not return back the .conf file location. Can you please tell me where to fix this. I thank you. Cheers, Bill Bill Bowerman Product Manager KOM Networks P: 613-599-7205 F: 613-599-7206 Bi...@KO... http://www.KOMNetworks.com |
|
From: Bernd L. <ber...@rz...> - 2002-12-18 16:09:54
|
Hi,

I had troubles with accented characters on a host running Solaris 9. FAQ item 5.8 recommends running testlocale.c. The system has only very basic locale support installed:

$ locale -a
POSIX
C
iso_8859_1

Running

$ ./testlocale iso_8859_1

indeed shows characters 192 through 255 as accented letters. However, htdig didn't index words with accented letters. Comparing the sources of testlocale.c and htlib/Configuration.cc, I found that testlocale.c is using setlocale(LC_CTYPE), but htlib/Configuration.cc is using setlocale(LC_ALL), and this call fails with the iso_8859_1 locale. I changed htlib/Configuration.cc to use setlocale(LC_CTYPE) as well. Bingo, htdig now works as expected.

--
Bernd 'Bing' Leibing
Computing Center, University of Ulm, Germany
Email: <ber...@rz...>  Tel. 0731-50-22516
Homepage (PGP-Key): http://www.uni-ulm.de/~leibing  O26/5215 |
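Bernd's diagnosis (setlocale(LC_ALL) failing on a system with only partial locale support, while setlocale(LC_CTYPE) — all that word classification needs — succeeds) suggests a try-broad-then-narrow fallback. A sketch of that idea using Python's locale module rather than the C API; the helper name is invented for illustration:

```python
import locale

def set_parsing_locale(name):
    """Try to set the full locale; fall back to LC_CTYPE only.

    Mirrors the spirit of Bernd's fix: on systems with only partial
    locale support, setting LC_ALL can fail even though the LC_CTYPE
    category (character classification for word parsing) is available.
    """
    for category in (locale.LC_ALL, locale.LC_CTYPE):
        try:
            return locale.setlocale(category, name)
        except locale.Error:
            continue
    return None

# The portable "C" locale is always available, so this succeeds at once.
print(set_parsing_locale("C"))  # C
```

Bernd's actual patch simply switched Configuration.cc to LC_CTYPE; a fallback like the above would keep LC_ALL behaviour on fully provisioned systems.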
|
From: Geoff H. <ghu...@us...> - 2002-12-15 08:14:24
|
STATUS of ht://Dig branch 3-2-x
RELEASES:
3.2.0b5: Next release, tentatively 1 Dec 2002.
3.2.0b4: "In progress" -- snapshots called "3.2.0b4" until prerelease.
3.2.0b3: Released: 22 Feb 2001.
3.2.0b2: Released: 11 Apr 2000.
3.2.0b1: Released: 4 Feb 2000.
(Please note that everything added here should have a tracker PR# so
we can be sure they're fixed. Geoff is currently trying to add PR#s for
what's currently here.)
SHOWSTOPPERS:
* Mifluz database errors are a severe problem (PR#428295)
-- Does Neal's new zlib patch solve this for now?
KNOWN BUGS:
* Odd behavior: $(MODIFIED) and scores do not work with
wordlist_compress set, but work fine without it.
(The date is definitely stored correctly, even with compression on,
so this must be some sort of weird htsearch bug.) PR#618737.
* Not all htsearch input parameters are handled properly: PR#405278. Use a
consistent mapping of input -> config -> template for all inputs where
it makes sense to do so (everything but "config" and "words"?).
* META descriptions are somehow added to the database as FLAG_TITLE,
not FLAG_DESCRIPTION. (PR#618738)
PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
* Mifluz merge.
NEEDED FEATURES:
* Field-restricted searching. (e.g. PR#460833)
* Return all URLs. (PR#618743)
* Handle noindex_start & noindex_end as string lists.
* Quim's new htsearch/qtest query parser framework.
* File/Database locking. PR#405764.
TESTING:
* httools programs:
(htload a test file, check a few characteristics, htdump and compare)
* Turn on URL parser test as part of test suite.
* htsearch phrase support tests
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen,
argument handling for parser/converter, allowing binary output from an
external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.
DOCUMENTATION:
* List of supported platforms/compilers is ancient. (PR#405279)
* Add thorough documentation on htsearch restrict/exclude behavior
(including '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes
to template variables. (Relates to PR#405278.)
Should we make sure these config attributes are all documented in
defaults.cc, even if they're only set by input parameters and never
in the config file?
* Split attrs.html into categories for faster loading.
* Turn defaults.cc into an XML file for generating documentation and
defaults.cc.
* require.html is not updated to list new features and disk space
requirements of 3.2.x (e.g. phrase searching, regex matching,
external parsers and transport methods, database compression.)
PR#405280, PR#405281.
* TODO.html has not been updated for current TODO list and
completions.
* Htfuzzy could use more documentation on what each fuzzy algorithm
does. PR#405714.
* Document the list of all installed files and default
locations. PR#405715.
OTHER ISSUES:
* Can htsearch actually search while an index is being created?
* The code needs a security audit, esp. htsearch. PR#405765.
|
|
From: Gilles D. <gr...@sc...> - 2002-12-13 17:10:05
|
Hi, folks. I just discovered that geocrawler.com stopped archiving htdig-general and htdig-dev as of Mon. Nov. 25, 2002. I think we need to switch the automatically generated links to the archives on http://www.htdig.org/mailarchive.html and http://www.htdig.org/dev/devmailarchives.html from geocrawler.com to the new sourceforge.net archives. If any of you can think of an intelligent way of indexing the new archives, please kindly make your suggestions here. The trick would be to do month by month updates (on a daily basis) as we did with geocrawler, and skip over any superfluous stuff we don't want. This is complicated by the fact that individual message links don't carry the forum ID in all views, so we'd have to limit indexing to a particular view of the 2 forums (Ultimate view seems like the best pick for now) and make sure we get all the individual message links from this view. Hopefully the style of URLs won't change on us too frequently. It seems that when Geoff and I looked into this in August, you couldn't get the forum ID in links to messages in any view, which would have made it very difficult to index all and only messages in our forums. Anyone feel up to solving this htdig configuration puzzle? -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/ Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada) |
|
From: Gilles D. <gr...@sc...> - 2002-12-13 16:49:35
|
Things have been exceptionally busy for me at work lately. I began a project in mid October that I hoped to finish in 2-3 weeks, and it's taken over 2 months. I'm seeing the light at the end of the tunnel finally, but I don't expect to be able to make any contributions to the ht://Dig effort before January. Sorry about that. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/ Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada) |
|
From: Bogdan P. <b.p...@at...> - 2002-12-13 08:50:10
|
Hello,

We have set up a mirror for the ht://Dig web site.

host: http://mirrors.atn.ro/htdig/maindocs/
maintainer: mi...@at...

Please tell us if we qualify for mirroring your site. Thank you.

--
Best regards,
Bogdan Parcalab
mailto:b.p...@at... |
|
From: Lachlan A. <lh...@ee...> - 2002-12-11 01:53:21
|
On Wed, 11 Dec 2002 11:34, you wrote:
> > three-value flag: index unstemmed words, index stems, index both?
>
> Sure. Geoff wants a default of 'unstemmed words' (the current method),
> which I agree with.

I agree that compatibility should be the default.

> > - The format you describe sounds like a "half-inverted" file --
> > listing locations *within* a document by word, but listing *document*
> > locations by document. Is that correct?
>
> In the proposed index only the word+document are the 'key', the
> remaining parts are in the 'value'. I'm not sure what you mean by
> 'document locations' here.. please clarify!

By "listing *document* locations by document", I essentially meant that the document ID was (part of) the key. Essentially my question was: Why is the document ID part of the key? Isn't searching more efficient if you don't need to scan through all documents for a word which only occurs in 1% of them (which a good query term would)? Is there some operation which needs words to be looked up by document? (This was probably all discussed when the new format was chosen. Feel free to point me to an archive instead of answering every question.)

I don't quite understand what you said about morphological analysis, but I'll do some reading before asking too many questions :)

Thanks for your explanations.

Lachlan
--
Lachlan Andrew  Phone: +613 8344-3816  Fax: +613 8344-6678
Dept of Electrical and Electronic Engg, University of Melbourne,
Victoria, 3010 AUSTRALIA  CRICOS Provider Code 00116K |
|
From: Lachlan A. <lh...@ee...> - 2002-12-11 00:16:30
|
Greetings, No, but it is attached. Cheers, Lachlan On Wed, 11 Dec 2002 02:30, Budd, Sinclair wrote: > Did this patch make it into a snapshot? > -----Original Message----- > From: Geoff Hutchison [mailto:ghu...@ws...] > On Monday, December 2, 2002, at 07:52 PM, Lachlan Andrew wrote: > > The backward-compatible fix is to fix htsearch (a > > five line patch) -- Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678 Dept of Electrical and Electronic Engg CRICOS Provider Code University of Melbourne, Victoria, 3010 AUSTRALIA 00116K |
|
From: Budd, S. <s....@ic...> - 2002-12-10 15:32:12
|
Did this patch make it into a snapshot?

-----Original Message-----
From: Geoff Hutchison [mailto:ghu...@ws...]
Sent: Tuesday, December 03, 2002 5:06 AM
To: lh...@ee...
Cc: htd...@li...
Subject: Re: [htdig-dev] Re: testing

On Monday, December 2, 2002, at 07:52 PM, Lachlan Andrew wrote:
> This should be a lesson to me about posting bug reports from memory...
> (One day I'll get my development machine on line :)

Yes, another good lesson is to post bug reports to SF.net so we can also track them. ;-)

> The backward-compatible fix is to fix htsearch (a five line patch)

Let's do this ASAP. We can back it out later, but at least the code will be temporarily consistent.

> but the more elegant solution would be to change htdig to treat the two
> consistently. My preference would be to specify the *true* location of
> all words, even counting short words. That way we could distinguish
> between a true phrase match and a "phrase match, but possibly with
> short words in between".

Yes, I agree that this is The Right Way(tm). It's definitely a bug that it's not currently specifying the *true* location of words. BTW, do you have a SourceForge account? We should get you CVS access.

Cheers,
-Geoff

_______________________________________________
htdig-dev mailing list
htd...@li...
https://lists.sourceforge.net/lists/listinfo/htdig-dev |
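Recording the *true* position of every word, short words included, is what makes the two kinds of phrase hit distinguishable: adjacent positions mean an exact phrase, small gaps mean "phrase, but possibly with short words in between". A minimal sketch of that classification, with hypothetical position lists (not ht://Dig's implementation):

```python
def phrase_match(positions, max_gap=0):
    """Classify a candidate phrase hit from per-word position lists.

    positions: one list of document positions per query word, in phrase
    order. With max_gap=0 only strictly adjacent positions count (an
    exact phrase); max_gap=1 tolerates one skipped position, e.g. an
    unindexed short word between two phrase words.
    """
    def walk(idx, prev):
        if idx == len(positions):
            return True
        return any(0 <= p - prev - 1 <= max_gap and walk(idx + 1, p)
                   for p in positions[idx])
    return any(walk(1, p) for p in positions[0])

# Three words at positions 5, 6, 7 -> an exact phrase match.
print(phrase_match([[5], [6], [7]]))        # True
# A skipped short word leaves a one-position gap.
print(phrase_match([[5], [7]]))             # False
print(phrase_match([[5], [7]], max_gap=1))  # True
```

Without true positions, the gap left by a short word is invisible, and both cases above collapse into the same (ambiguous) match.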
|
From: Neal R. <ne...@ri...> - 2002-12-10 00:18:22
|
> - Given the flag to disable stemming, what is the
> disadvantage of simply making it a three-value flag:
> index unstemmed words, index stems, index both?
Sure. Geoff wants a default of 'unstemmed words' (the current
method), which I agree with.
> - The format you describe sounds like a "half-inverted"
> file -- listing locations *within* a document by word, but
> listing *document* locations by document. Is that
> correct?
In the proposed index only the word+document are the 'key', the
remaining parts are in the 'value'. I'm not sure what you mean by
'document locations' here.. please clarify!
> - You said that the approach currently taken by fuzzy
> endings is uncharted waters. I assume you are talking
> about the approach of simply creating a disjunction of
> the derived words. What is hard to "get right" about
> that? In terms of the documents returned, it sounds the
> same as what you have proposed. In terms of
> implementation, it sounds like what 'fuzzy endings' does
> now, except for fixing the stemming.
Our 'fuzzy endings' algorithm is in the class of "morphological
analysis" algorithms. These algorithms are frequently studied, and there
are many good packages. Most of them cost big money and are very complex,
very language specific and took many years of research.. and are not open
source.
Morphological analysis is pretty cutting edge in NLP, and still mostly
unsolved.
What is hard to 'get right' is the general idea of generating
correct variants of a given word. The stemming algorithms are quite
complex with many rules for generating stems. My gut feeling is
that the number of rules in the 'fuzzy endings' algorithm needs
to be on par with or exceed the same-language stemmer.
1. Stemming is a known quantity with known performance
2. We have 10+ languages available NOW
Morphological analysis is a promising approach as it's possible to
generate better endings and avoid the situation you detail below. It
would take a lot of work to make this algorithm as good as or exceed the
generalization ability of the stemmers.
I would actually encourage whoever has worked on the
algorithm to consider writing an academic paper for publication on it.
There would need to be a comparative study done on it vs stemming.. but if
you can show that it outperforms stemming in IR for precision/recall
great!
SEE BELOW FOR SOME REFERENCES!
The AI researcher in me wants to explore the fuzzy-endings algorithm.. the
conservative software engineer side wants to go with proven IR &
NLP techniques first.
> - With stemming in general, what is done about negating
> affixes? If I searched for 'mercy', I wouldn't want
> results about 'merciless' (although I would want results
> about 'merciful').
That is part of the downside of stemming. The hope is that the
combination of unstemmed + stemmed words in the index and combined in the
score would get the correct result most of the time.
Thanks!
REFERENCES:
David A. Hull
Stemming Algorithms A Case Study for Detailed Evaluation (1996)
http://citeseer.nj.nec.com/hull96stemming.html
David A. Hull, Gregory Grefenstette
A Detailed Analysis of English Stemming Algorithms (1996)
http://citeseer.nj.nec.com/hull96detailed.html
Wessel Kraaij
Viewing Stemming as Recall Enhancement (1996)
http://citeseer.nj.nec.com/kraaij96viewing.html
Wessel Kraaij, Rene Pohlmann
Using Linguistic Knowledge in Information Retrieval (1996)
http://citeseer.nj.nec.com/kraaij96using.html
There is also this frequently cited paper which I can't find on the web.
Harman, Donna (1991). How Effective is Suffixing? Journal of the American
Society for Information Science, 42(1), 7-15.
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
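Neal's proposal — duplicate index rows carrying a stem flag, blended into the score by corpus size — can be sketched as follows. The toy suffix-stripper stands in for a real stemmer (e.g. Porter's), and the corpus-size breakpoints are invented for illustration; only the 80/20, 60/40, 30/70 weights come from the proposal above:

```python
def toy_stem(word):
    # A deliberately crude stand-in for a real stemming algorithm.
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def index_document(index, doc_id, words):
    # One (word, stem_flag) row per occurrence; stems are added as
    # duplicate rows with stem_flag=1, as in the proposed format.
    for pos, word in enumerate(words):
        index.setdefault((word, 0), {}).setdefault(doc_id, []).append(pos)
        index.setdefault((toy_stem(word), 1), {}).setdefault(doc_id, []).append(pos)

def blended_score(index, word, doc_id, num_docs):
    # Weights follow the LARGE/MEDIUM/SMALL scheme; the breakpoints
    # (100000 and 1000 documents) are assumptions.
    if num_docs > 100000:
        w_exact, w_stem = 0.8, 0.2
    elif num_docs > 1000:
        w_exact, w_stem = 0.6, 0.4
    else:
        w_exact, w_stem = 0.3, 0.7
    exact = len(index.get((word, 0), {}).get(doc_id, []))
    stem = len(index.get((toy_stem(word), 1), {}).get(doc_id, []))
    return w_exact * exact + w_stem * stem

index = {}
index_document(index, 20, ["traveling", "travel", "travels"])
# All three occurrences collapse onto the stemmed "travel" row.
print(sorted(index[("travel", 1)][20]))  # [0, 1, 2]
# Small corpus: 0.3 * 1 exact hit + 0.7 * 3 stemmed hits, about 2.4.
print(blended_score(index, "travel", 20, num_docs=10))
```

This shows the cost Neal mentions: every occurrence is stored twice (literal row plus flagged stem row), which is why fixing index efficiency comes first.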
|
|
From: Gabriele B. <g.b...@co...> - 2002-12-09 08:12:38
|
Ciao guys,

Again, sorry if I will certainly make mistakes. I love getting to know more about this area, which is pretty new for me too, so please be patient. :-)

On Mon, 2002-12-09 at 02:14, Lachlan Andrew wrote:
> - The format you describe sounds like a "half-inverted" file -- listing
> locations *within* a document by word, but listing *document* locations
> by document. Is that correct?

I think that was a flat representation of the index file, just an example. Am I right, Neal? In a simple scenario, we'll have (please consider it is a very, very rough draft!):

- a word index (word id, stemmed/unstemmed flag, maybe language?)
- a document index (document id, info regarding the document, pretty much as now: title, modification date, etc.)
- an inverted index (word id, document id, locations)

Words
-----
ID  Word       S/U  Lang
--  ----       ---  ----
 1  traveling  0    en
 3  casa       0    it
12  travel     0    en
23  travels    0    en
45  pasta      0    it
60  travel     1    en
...

Documents
---------
ID  URL                    Other info
--  ---                    ----------
 1  http://www.pippo.it/   .....
 2  http://www.htdig.org/  ...

Index
-----
ID W  ID D  Locations and related info (position and markup)
----  ----  ------------------------------------------------
1     2     1 Value_location 3 Value_location

Value_location is the value given to the location of the word. Am I right? Of course it's just an example ... :-) Any comments about the language?

> - With stemming in general, what is done about negating affixes? If I
> searched for 'mercy', I wouldn't want results about 'merciless'
> (although I would want results about 'merciful').

Good point, are there any plans to include negative words too?

Ciao ciao,
-Gabriele
--
Gabriele Bartolini - Web Programmer
Comune di Prato - Prato - Tuscany - Italy
g.b...@co... | http://www.comune.prato.it
> find bin/laden -name osama -exec rm {} ; |
|
From: Gabriele B. <g.b...@co...> - 2002-12-09 07:50:03
|
On Fri, 2002-12-06 at 21:09, Neal Richter wrote:
> According to the literature, if you go with a stemmed index exclusively,
> the index efficiency goes up by ABOUT 20-30%. This estimate is very data
> and language dependent.
>
> I research and implement this kind of stuff at work... I'd be happy to
> post links to a couple research papers if people are interested.

Yes, please do, Neal. I am really interested, and what you said so far is interesting as well! :-)

> Here's a proposal for 'intelligent stemming' in HtDig:
>
> 1. Fix index efficiency.

Yep.

> 2. Add a configuration switch to disable stemming ;-)

Good.

> 3. Implement the stemming algorithm to ADD additional rows to the index
> with stemmed versions of the words (with a row flag to signify this).

Perfect.

> 4. During result ranking we rank the results with an algorithm like
> this:
>
> If num documents is LARGE
>   unstemmed rows are 80%, stemmed rows are 20% of the 'score'
>
> If num documents is MEDIUM
>   unstemmed rows are 60%, stemmed rows are 40% of the 'score'
>
> If num documents is SMALL
>   unstemmed rows are 30%, stemmed rows are 70% of the 'score'

I like it, even though I think that giving users the chance to set those values somehow, by choosing a more general or specific index, wouldn't be bad in my opinion.

> I also don't support doing anything about stemming until we fix the
> index (which I'm working on). It will negatively impact the size too
> much for large indexes.

I agree ... baby steps. :-) Thanks for your message. Please can you point us to some references or resources to read? I'd love that!

Ciao and thanks,
-Gabriele
--
Gabriele Bartolini - Web Programmer
Comune di Prato - Prato - Tuscany - Italy
g.b...@co... | http://www.comune.prato.it
> find bin/laden -name osama -exec rm {} ; |
|
From: Lachlan A. <lh...@ee...> - 2002-12-09 01:15:27
|
Greetings Neal, Your suggestion sounds good, especially steps 1 and 2... I have some beginner's questions: - Given the flag to disable stemming, what is the dissadvantage of simply making it a three-value flag: index unstemmed words, index stems, index both? - The format you describe sounds like a "half-inverted" file -- listing locations *within* a document by word, but listing *document* locations by document. Is that correct? - You said that the approach currently taken by fuzzy endings is uncharted waters. I assume you are talking about the approach of simply creating a disjunction of the derived words. What is hard to "get right" about that? In terms of the documents returned, it sounds the same as what you have proposed. In terms of implementation, it sounds like what 'fuzzy endings' does now, except for fixing the stemming. - With stemming in general, what is done about negating affixes? If I searched for 'mercy', I wouldn't want results about 'merciless' (although I would want results about 'merciful'). Thanks! Lachlan On Sat, 7 Dec 2002 07:09, Neal Richter wrote: > I agree with Geoff in that we don't want to go with > stemming exclusively.. > Here's a proposal for 'intelligent stemming' in HtDig: > > 1. Fix index efficiency. > 2. Add a configuration switch to disable stemming ;-) > 3. Implement the stemming algorithm to ADD additional > rows to the index with stemmed versions of the words > (with a row flag to signify this). > This system does add duplicate rows in a sense to the > index. > > traveling -> travel > travel -> travel > travels -> travel > traveler -> travel > traveled -> travel > > Document Word StemFlag Locations > > 20 traveling 0 24 36 110 > 20 travel 0 52 98 220 > 20 travels 0 10 75 340 > 20 traveler 0 13 180 > 20 traveled 0 200 > 20 travel 1 10 13 24 36 52 75 > 98 110 180 200 220 340 > > FEEDBACK PLEASE!! 
--
Lachlan Andrew                          Phone: +613 8344-3816  Fax: +613 8344-6678
Dept of Electrical and Electronic Engg  CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA              00116K
|
From: Geoff H. <ghu...@us...> - 2002-12-08 08:13:46
|
STATUS of ht://Dig branch 3-2-x
RELEASES:
3.2.0b5: Next release, tentatively 1 Dec 2002.
3.2.0b4: "In progress" -- snapshots called "3.2.0b4" until prerelease.
3.2.0b3: Released: 22 Feb 2001.
3.2.0b2: Released: 11 Apr 2000.
3.2.0b1: Released: 4 Feb 2000.
(Please note that everything added here should have a tracker PR# so
we can be sure they're fixed. Geoff is currently trying to add PR#s for
what's currently here.)
SHOWSTOPPERS:
* Mifluz database errors are a severe problem (PR#428295)
-- Does Neal's new zlib patch solve this for now?
KNOWN BUGS:
* Odd behavior with $(MODIFIED) and scores not working with
wordlist_compress set but work fine without wordlist_compress.
(the date is definitely stored correctly, even with compression on
so this must be some sort of weird htsearch bug) PR#618737.
* Not all htsearch input parameters are handled properly: PR#405278. Use a
consistent mapping of input -> config -> template for all inputs where
it makes sense to do so (everything but "config" and "words"?).
* META descriptions are somehow added to the database as FLAG_TITLE,
not FLAG_DESCRIPTION. (PR#618738)
PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
* Mifluz merge.
NEEDED FEATURES:
* Field-restricted searching. (e.g. PR#460833)
* Return all URLs. (PR#618743)
* Handle noindex_start & noindex_end as string lists.
* Quim's new htsearch/qtest query parser framework.
* File/Database locking. PR#405764.
TESTING:
* httools programs:
(htload a test file, check a few characteristics, htdump and compare)
* Turn on URL parser test as part of test suite.
* htsearch phrase support tests
* Tests for new config file parser
* Duplicate document detection while indexing
* Major revisions to ExternalParser.cc, including fork/exec instead of popen,
argument handling for parser/converter, allowing binary output from an
external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.
DOCUMENTATION:
* List of supported platforms/compilers is ancient. (PR#405279)
* Add thorough documentation on htsearch restrict/exclude behavior
(including '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes
to template variables. (Relates to PR#405278.)
Should we make sure these config attributes are all documented in
defaults.cc, even if they're only set by input parameters and never
in the config file?
* Split attrs.html into categories for faster loading.
* Turn defaults.cc into an XML file for generating documentation and
defaults.cc.
* require.html is not updated to list new features and disk space
requirements of 3.2.x (e.g. phrase searching, regex matching,
external parsers and transport methods, database compression.)
PRs# 405280 #405281.
* TODO.html has not been updated for current TODO list and
completions.
* Htfuzzy could use more documentation on what each fuzzy algorithm
does. PR#405714.
* Document the list of all installed files and default
locations. PR#405715.
OTHER ISSUES:
* Can htsearch actually search while an index is being created?
* The code needs a security audit, esp. htsearch. PR#405765.
|
|
From: Neal R. <ne...@ri...> - 2002-12-06 20:12:32
|
> Usually, for this purpose, the stem part of words should be enough, as
> it is more powerful in representing the meaning of a word.

Yes, but there is a tradeoff... Using stemming negatively impacts
precision while improving generalization. In the information retrieval
research community there is disagreement about the utility of stemming.
Here's the basic feeling:

If the document index is LARGE, don't use stemming, because it is
important to be precise, so that you are not 'drinking from a firehose'
by getting back too many results.

If the document index is SMALL, use stemming, because the increased word
generalization helps avoid queries with no results.

Most large internet search engines don't use stemming for this reason...
you would get back MANY more results with less precision.

> Do you have any exact statistics showing the difference in storage of
> the two indexes? As Lachlan says, storing both could be a bit
> extravagant and, again, in my opinion could lead out of the tracks.
> Also, the

Currently our index stores one row per word-document-location. As
discussed before, this is inefficient and we've got a plan to change it.
Moving to stemming without changing this would result in NO efficiency
improvement in WordDB. At the point we change it to be

  [word-document]/[loc1,loc2,...locn]

we'll have significant WordDB efficiency savings. Even after this change
we will still have a row+ PER UNIQUE WORD. Stemming would provide an
efficiency improvement by having one row PER STEM.

According to the literature, if you go with a stemmed index exclusively,
the index efficiency goes up by ABOUT 20-30%. This estimate is very data
and language dependent.

I research and implement this kind of stuff at work... I'd be happy to
post links to a couple research papers if people are interested.
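[Editorial note: the move from one row per word-document-location to one
row per word-document, described above, can be sketched like this. It is
a toy model only; the real WordDB is a Berkeley DB file, not a Python
dict.]

```python
def compact_index(rows):
    """Collapse (word, doc, location) rows into {(word, doc): [locations]}.

    Models the proposed [word-document]/[loc1,loc2,...locn] layout:
    many single-location rows become one row with a sorted location list.
    """
    index = {}
    for word, doc, loc in rows:
        index.setdefault((word, doc), []).append(loc)
    for locs in index.values():
        locs.sort()
    return index
```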
> problem is less important in an english language; I don't know any
> other languages except italian and latin languages, but they are for
> sure more complex, having different affix rules and lots more of
> different tenses.

Here's the link to Martin Porter's (BSD Licensed) stemmers:

  http://snowball.tartarus.org/

There are 10 languages supported. Some languages are very difficult to
stem, such as Finnish... but in general stemming is beneficial for word
generalization.

-------

I agree with Geoff in that we don't want to go with stemming
exclusively.. Here's a proposal for 'intelligent stemming' in HtDig:

1. Fix index efficiency.
2. Add a configuration switch to disable stemming ;-)
3. Implement the stemming algorithm to ADD additional rows to the index
   with stemmed versions of the words (with a row flag to signify this).
4. During result ranking we rank the results with an algorithm like
   this:

   If num documents is LARGE
     unstemmed rows are 80%, stemmed rows are 20% of the 'score'

   If num documents is MEDIUM
     unstemmed rows are 60%, stemmed rows are 40% of the 'score'

   If num documents is SMALL
     unstemmed rows are 30%, stemmed rows are 70% of the 'score'

These percentages are gut-feeling ball-park numbers based on my
experience and research on the topic. The meaning of Large/Medium/Small
needs to be decided. It also leans toward preferring unstemmed words
because of their higher 'precision'.

This system does add duplicate rows in a sense to the index. Here's an
example:

  traveling -> travel
  travel    -> travel
  travels   -> travel
  traveler  -> travel
  traveled  -> travel

  Document  Word       StemFlag  Locations
  20        traveling  0         24 36 110
  20        travel     0         52 98 220
  20        travels    0         10 75 340
  20        traveler   0         13 180
  20        traveled   0         200
  20        travel     1         10 13 24 36 52 75 98 110 180 200 220 340

The last row is a kind of duplicate, and this impacts efficiency
negatively, but does get us some increased word generalization.

-------

The other idea thrown around is to improve the 'fuzzy endings'
algorithm.
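[Editorial note: step 3's extra stem rows, and the merged StemFlag=1 row
in the table above, could be generated as follows. The stem mapping is
hard-coded for the example; a real implementation would call a Porter
stemmer.]

```python
def add_stem_rows(rows, stem_map):
    """Append one flagged row per (doc, stem), merging the locations of
    all surface forms sharing that stem.

    rows: list of (doc, word, stem_flag, locations) with stem_flag 0.
    stem_map: word -> stem (stand-in for a real stemmer).
    """
    by_stem = {}
    for doc, word, flag, locs in rows:
        stem = stem_map.get(word, word)
        by_stem.setdefault((doc, stem), []).extend(locs)
    stem_rows = [(doc, stem, 1, sorted(locs))
                 for (doc, stem), locs in sorted(by_stem.items())]
    return rows + stem_rows
```

Running it on the 'travel' rows from the table reproduces the merged
final row.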
I agree that this needs doing, and Porter's stemmers will give us many
ideas on how to do this. Note however that this is an unproven and less
studied technique in NLP & IR circles, so we would be blazing some new
ground... which tends to be a slow process to get correct.

If we do both we can leave it up to users to 'tune' HtDig to their
liking. Flexibility is good.

I also don't support doing anything about stemming until we fix the
index (which I'm working on). It will negatively impact the size too
much for large indexes.

FEEDBACK PLEASE!!

Thanks!

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
From: Paul D. <pa...@po...> - 2002-12-06 16:44:13
|
* Paul Downs (pa...@po...) wrote :
Hi,
> close(7) = 0
> wait4(11263, 0xbffff228, 0, NULL) = -1 ECHILD (No child processes)
> munmap(0x40014000, 4096) = 0
> fstat(1, {st_mode=S_IFREG|0644, st_size=1055675, ...}) = 0
Tracked it down a bit. The fault is in the pclose line. Nothing untoward
seems to be happening; I haven't checked wait4 out yet. The odd thing is
that everything seems to work fine. In the interest of getting searching
back online, I commented out the error check on the pclose line, and it
is happy, so that will have to do for now.
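[Editorial note: the ECHILD from wait4 above usually means the child was
already reaped (for example when SIGCHLD is set to SIG_IGN), so pclose
reports failure even though the pipe itself worked. A hedged sketch of
tolerating that case, in Python rather than ht://Dig's C++:]

```python
import subprocess

def run_pipe(cmd):
    """Run cmd through a pipe and read its output, treating a missing
    child at wait() time (ECHILD) as benign rather than a hard error --
    roughly what commenting out the pclose error check achieves."""
    proc = subprocess.Popen(cmd, shell=True,
                            stdout=subprocess.PIPE, text=True)
    output = proc.stdout.read()
    try:
        status = proc.wait()      # analogous to the wait4() inside pclose()
    except ChildProcessError:     # ECHILD: child already reaped elsewhere
        status = 0
    return output, status
```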
Paul
|
|
From: Philippe R. <Phi...@ep...> - 2002-12-06 15:31:56
|
Hello,

I installed a htdig mirror: http://lbdpc15.epfl.ch/mirror/htdig/

But I already configured this in httpd.conf:

<VirtualHost ch.htdig.org>
  ServerAdmin Phi...@ep...
  DocumentRoot /server/htdocs/mirror/htdig/
  ServerName ch.htdig.org
  ErrorLog logs/htdig/error_log
  TransferLog logs/htdig/access_log
  DirectoryIndex index.php manual.php index.html index.htm index.phtml
</VirtualHost>

<VirtualHost www.ch.htdig.org>
  ServerAdmin Phi...@ep...
  DocumentRoot /server/htdocs/mirror/htdig/
  ServerName ch.htdig.org
  ErrorLog logs/htdig/error_log
  TransferLog logs/htdig/access_log
  DirectoryIndex index.php manual.php index.html index.htm index.phtml
</VirtualHost>

It's also open by ftp: ftp://lbdpc15.epfl.ch/mirror/htdig/ (and that way
you have access to files and htdig_patches. For those two I'm running an
update script at the same time as cvs update.)

I'm running multi-dig on this server; I can configure it for the htdig
mirror if you want.

Regards,
Ph.R.
_____________________________________________________________________
Philippe Rochat (Yahoo-id:philipperochat)
Professional: Database Laboratory, Swiss Federal Institute of Technology (EPFL)
EPFL-IC-LBD            tel:++41 21 693 52 53
CH-1015 LAUSANNE       fax:++41 21 693 51 95
Private: Grammont, 9   1007 LAUSANNE
tel:++41 21 617 03 05  mailto:Phi...@ep...
GSM:++41 76 384 52 53
|
From: Paul D. <pa...@po...> - 2002-12-06 13:13:51
|
Hi,
I have an issue whereby htmerge is analysing a fairly hefty set of data.
It is not being run from cron, and there are masses (GBs) of /tmp space.
The machine does have 256MB of RAM, though. The output is here:
htmerge: Discarding yahoocouk in doc #1635
htmerge: Discarding yahoocouk in doc #1635
htmerge: 22000:ymca
htmerge: Word sort failed
and the strace:
pwrite(6, "\0\0\0\0\0\0\0\0\256\37\0\0\244\37\0\0\274\37\0\0(\0D\1"...,
1024, 8304640) = 1024
read(7, "", 4096) = 0
close(7) = 0
wait4(11263, 0xbffff228, 0, NULL) = -1 ECHILD (No child processes)
munmap(0x40014000, 4096) = 0
fstat(1, {st_mode=S_IFREG|0644, st_size=1055675, ...}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
= 0x40014000
write(5, "7\tl:614\tw:386\nzones\ti:938\tl:660\t"..., 308) = 308
write(1, "htmerge: Word sort failed\n\n", 27htmerge: Word sort failed
) = 27
munmap(0x40015000, 4096) = 0
munmap(0x40014000, 4096) = 0
_exit(1) = ?
Can anyone enlighten me? The box is an old-ish SuSE install with a
source-compiled latest stable htdig, and:
Linux eco92 2.2.18 #1 Fri Jan 26 18:16:52 GMT 2001 i686 unknown
Should I try a 2.4.x kernel?
Paul
|
|
From: Lachlan A. <lh...@ee...> - 2002-12-05 08:01:42
|
On Wed, 4 Dec 2002 19:12, Gabriele Bartolini wrote:
> IMHO, the ultimate goal for a search process is to get a
> set of document satisfying a semantic criteria, better a
> context criteria.
The ultimate is the singular value decomposition approach
that someone (Geoff?) was suggesting using for a "similar
documents" search. I'd really like to see this in HtDig
eventually. It moves away from indexing on words at all,
and instead indexes on an abstract notion of "how often are
the search words used in similar contexts to the words of
the target document?"
> ... italian and latin languages, but they
> are for sure more complex, having different affix rules
> and lots more of different tenses.
Good point. Am I also correct in believing that some
languages like German have a lot of changes to the stems
themselves ("schwimmen, schwamm, geschwommen", "trinken,
trank, getrunken")? Is there an approach that can handle
that much generality?
> As Geoff suggested, we could implement a
> different fuzzy algorithm for the 'Porter stemming' which
> builds a new index (a stemmed one).
Yes, it would be good to have a fuzzy stemming algorithm
which doesn't simply return a query with (variant1 OR
variant2 OR ...), but actually searches a stemmed index.
It would be more efficient if there are lots of different
forms.
> > word-level indexing, to give (much) smaller inverted
> > files if people don't need phrase searching.
>
> I guess customisation is our goal. In a retrieval phase,
> we'd want to store almost *anything* we can, then maybe
> with different fuzzy algorithms build alternative indexes
> (smaller or bigger, depending on users' settings).
Yes, a document-level inverted file could be generated from
the word-level one after the whole dig. I don't know much
about htdig's fuzzy mechanism yet; is it possible to delete
the main inverted file and just rely on a "fuzzy" one? If
so, the only other disadvantages would be speed and the
amount of temporary space required (RAM and disk space).
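[Editorial note: generating a document-level inverted file from a
word-level one, as suggested above, amounts to dropping the per-document
location lists. A toy sketch of that post-dig conversion, not htdig's
actual database format:]

```python
def to_document_level(word_level):
    """Collapse a word-level inverted file {word: {doc: [locations]}}
    into a document-level one {word: sorted list of doc ids}.

    Loses phrase-search ability (no locations) in exchange for a much
    smaller index, as discussed in the thread.
    """
    return {word: sorted(postings)        # keep the docs, drop locations
            for word, postings in word_level.items()}
```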
Cheers,
Lachlan
--
Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678
Dept of Electrical and Electronic Engg CRICOS Provider Code
University of Melbourne, Victoria, 3010 AUSTRALIA 00116K
|