From: Lachlan A. <lh...@ee...> - 2002-11-03 22:55:02
|
Greetings Jim, Can I suggest that you post the list of changes to be made and tell everyone the order you plan to make them? That way others of us will be able to work in parallel without duplicating effort. I've found that the ChangeLog entries are sufficiently different that it is easier to work with ChangeLog for 3.1.6, htdig-3.1.5-3.1.6.diff and the 3.2 source. I've been going through and deleting the changes that have already been made (or obviated by other changes), and a very preliminary list is at <http://www.ee.mu.oz.au/staff/lha/htdig-3.1.6-3.2-ports> If you have a more comprehensive list, I'd love to see it. Cheers, Lachlan > From: Gilles Detillieux <gr...@sc...> > Subject: Re: 3.2.0b4 release > To: gre...@yg... (Jim Cole) > Date: Wed, 30 Oct 2002 14:56:27 -0600 (CST) > > Comparing ChangeLog entries between 3.1.6 and the 3.2 cvs > would be the first step, and would find most of the > missing stuff. Note that there may be differences in > wording between the two, especially if someone other than > me make the entry in the 3.2 ChangeLog. If you can't > find anything close in the 3.2 ChangeLog to a given 3.1.6 > entry, then comparing the specific changes in the source > against the 3.2 source would be the next step. I'd be > glad to answer any questions you have at that stage, > including punting a few ChangeLog entries my way for my > verification. > > Potentially all entries since 3.1.5 was released would > need to be checked. That may seem like a lot, given that > 3.1.5 was released almost 2 full years before 3.1.6, but > the CVS tree for 3.1.x was dormant for a long time after > 3.1.5. -- Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678 Dept of Electrical and Electronic Engg CRICOS Provider Code University of Melbourne, Victoria, 3010 AUSTRALIA 00116K |
From: Lachlan A. <lh...@ee...> - 2002-11-07 00:14:16
|
Thanks for your annotations, Gilles. There were lots of things not yet on the TODO list -- that preliminary list was just to say "these are the files I've checked through"... The list at <http://www.ee.mu.oz.au/staff/lha/htdig-3.1.6-3.2-ports> should now be roughly complete (including the one in your next post) but no promises. I bags all the htdig/* changes. (Or "Dibs me the...", depending on which primary school you went to :) Cheers, Lachlan On Thu, 7 Nov 2002 07:26, you wrote: > Here are my annotations to your TODO list... > [snip] > > Here are additional items, which > didn't make it above... -- Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678 Dept of Electrical and Electronic Engg CRICOS Provider Code University of Melbourne, Victoria, 3010 AUSTRALIA 00116K |
From: Gilles D. <gr...@sc...> - 2002-11-05 23:45:33
|
According to Lachlan Andrew:
> Can I suggest that you post the list of changes to be made
> and tell everyone the order you plan to make them? That
> way others of us will be able to work in parallel without
> duplicating effort.
>
> I've found that the ChangeLog entries are sufficiently
> different that it is easier to work with ChangeLog for
> 3.1.6, htdig-3.1.5-3.1.6.diff and the 3.2 source. I've
> been going through and deleting the changes that have
> already been made (or obviated by other changes), and a
> very preliminary list is at
> <http://www.ee.mu.oz.au/staff/lha/htdig-3.1.6-3.2-ports>
> If you have a more comprehensive list, I'd love to see it.

Good work! Thanks for your efforts. Here are my annotations to your TODO list...

TODO:

htcommon/HTML.cc:
metadatetags, descriptionMatch -- latter is for description_meta_tag_names attribute
strlen(skip_start) taken out of loop -- no longer applicable, as skip_start now String type
ignore_alt_text
remove 'which = -1; //What does it do?' on line 948...

htlib/Dictionary.cc:
add 'e->next = NULL' after line 251 -- I also wonder if a more careful audit of the Dictionary class isn't needed. It seems some changes were made to it in 3.2, which in retrospect may be unnecessary after this fix is implemented, and may lead to memory leaks. (e.g. Release no longer releases anything.) Apart from the addition of "const" keywords in 3.2, these two versions of the class should be the same, but which one is correct?

htsearch/Display.cc:
displayHTTPheaders -- for search_results_contenttype attribute
Remove "ANCHOR"
save instance of URL for star_patterns and template_patterns -- I think this bit is unique to 3.1.6 URL handling HtURLRewriter -- part of search_rewrite_rules handling
max_excerpts attribute
anchor_target attribute
relative dates -- i.e. make startyear et al. handle relative date ranges in Display.cc

htsearch/parser.cc:
boolean_syntax_errors and boolean_keywords internationalisation
multimatch_factor (listed below as 'multimatch_method') -- oops! that typo made it into RELEASE.html too. -- note: still VERY buggy in 3.1.6
- check changes to prefix_match_character, now in QueryLexer.cc -- related to "list-all" feature below

htsearch/htsearch.cc:
boolean_keywords internationalisation

Here are additional items, which were mentioned in the list I sent to Jessica in August and Jim in October, but didn't make it above...

htcommon/defaults.cc:
> - add startyear et al. to defaults.cc
-- need full descriptions of these

installdir/english.0:
> - fuzzy endings patch and updated english.0 file
-- fixes to htfuzzy/EndingsDB.cc, htfuzzy/Endings.cc already done, but 2 rounds of changes to english.0 missing

contrib/conv_doc.pl, contrib/doc2html/*:
> - get updated external parser scripts into contrib directory
> (fix eof handling bug in .pl scripts)
-- if a newer doc2html release is available, we should use it

htsearch/htsearch.cc, htsearch/parser.cc:
> - list-all feature in htsearch for a query of * or prefix_match_character
-- this will be tricky for 3.2, because of the changes in the databases

htdig/Retriever.cc:
> - ignore_dead_servers attribute

... plus the two I said I'd handle...

htcommon/HtSGMLCodec.cc:
> - translate_latin1 attribute, with hooks into SGMLCodec class

htdig/htdig.cc:
> - better handling of htdig -m option

It should also go without saying that any new attribute should also be added and described in defaults.cc in 3.2. I hope to do a more thorough ChangeLog audit before 3.2.0b5 is released, but this TODO list should give us a lot to chew on in the meantime.

--
Gilles R. Detillieux E-mail: <gr...@sc...>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada) |
From: Gilles D. <gr...@sc...> - 2002-11-06 15:59:45
|
I found another change that's needed still in 3.2... * htsearch/htsearch.cc (main): Fixed to only show file names in error messages when REQUEST_METHOD not set and -v option given, for security. This is the "filenameok" stuff in 3.1.6's htsearch.cc -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/ Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada) |
From: Jim C. <gre...@yg...> - 2002-11-10 23:20:18
|
On Sunday, November 3, 2002, at 03:54 PM, Lachlan Andrew wrote: > Greetings Jim, > > Can I suggest that you post the list of changes to be made > and tell everyone the order you plan to make them? That > way others of us will be able to work in parallel without > duplicating effort. Hi - While I have no problem with helping in whatever way I can, I do not have any plans for making changes. The initial task offered up was limited to building a list of 3.1.x changes that might need to be carried over to 3.2.x. It sounds as if you already had a jump on this and have since completed the list. Working at my current snail pace, I have accomplished very little in this regard. Do you consider your list to be complete? If so, I would just as well move on to something else, rather than reinvent this particular wheel. Jim |
From: Lachlan A. <lh...@ee...> - 2002-11-11 00:40:10
|
Greetings Jim, No, the list at <http://www.ee.mu.oz.au/staff/lha/htdig-3.1.6-3.2-ports> is by no means complete. I only looked at changes from 3.1.5 to 3.1.6, but I think that even some changes from 3.1.4 to 3.1.5 may not be in 3.2. More importantly, I focused on the .cc and .h files, since they are what I was planning to change first. That leaves lots of changes to documentation and configuration files. Finally, there are some changes that I flagged with "====" which I haven't checked thoroughly. Typically there had been changes to the 3.2 stream which had changed the context around the patch, or there were related changes elsewhere in the file, which made it unclear what should be done. Thanks a lot for your help! I hope I haven't taken all the "fun" bits and left you with the hard parts :) Cheers, Lachlan On Mon, 11 Nov 2002 10:20, Jim Cole wrote: > The task offered up was limited to building a list of > 3.1.x changes that might need to be carried over to > 3.2.x. It sounds as if you already had a jump on this and > have since completed the list. Do you consider your list > to be complete? -- Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678 Dept of Electrical and Electronic Engg CRICOS Provider Code University of Melbourne, Victoria, 3010 AUSTRALIA 00116K |
From: Geoff H. <ghu...@ws...> - 2002-11-11 04:58:23
|
On Sunday, November 10, 2002, at 06:40 PM, Lachlan Andrew wrote: > they are what I was planning to change first. That leaves > lots of changes to documentation and configuration files. The documentation changes are, of course, a bit tricky. After all, you can't directly compare attrs.html since it's "expanded" in 3.2. Anyone tackling these should also ignore RELEASE.html, TODO.html, uses.html and isp.html. (In short, it's not a direct "diff.") > Finally, there are some changes that I flagged with "====" > which I haven't checked thoroughly. Typically there had A few notes: * does rewrite at start of got_href suffice? No. Remember that you might want to rewrite the redirect. Also, remember that got_redirect and got_href handle things differently. A redirect doesn't count in hopcounts, etc. Also redirect needs to modify pointers to the old URL. * "check changes to prefix_match_character, now in QueryLexer.cc" Nope. QueryLexer (and parts of Quim's parser framework) will need to be changed. But currently, it's unused. So you'll want to check elsewhere for prefix_match_character. * A more careful audit of Dictionary.cc is probably useful. As for the ChangeLog, the "3.1 and prior" ChangeLog is in ChangeLog.0 *==== Done by Loic 20000229? No, that's my change and ChangeLog entry. *========== Do these files still exist? Where has the functionality gone? These are in httools/htpurge.cc. *========== Replaced by NOSTREAM ?? Yes, IIRC. Are those all of the "===" tags? -Geoff |
From: Lachlan A. <lh...@ee...> - 2002-11-11 05:29:07
|
Thanks for your answers, Geoff :) I think my notes were too cryptic (they were originally notes to me...). The "Done by Loic 20000229?" was to remind me to see if the bug you fixed was also addressed by the patch Loic made to the 3.2 tree, with the entry: Tue Feb 29 11:31:53 2000 Loic Dachary <lo...@ce...> * htnet/Connection.cc (Connect): Added SIGALRM signal handler, Connect() always allow EINTR to occur. Thanks again, Lachlan On Mon, 11 Nov 2002 15:57, Geoff Hutchison wrote: > On Sunday, November 10, 2002, at 06:40 PM, Lachlan Andrew wrote: > > Finally, there are some changes that I flagged with > > "====" which I haven't checked thoroughly. > > A few notes: > *==== Done by Loic 20000229? > No, that's my change and ChangeLog entry. -- Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678 Dept of Electrical and Electronic Engg CRICOS Provider Code University of Melbourne, Victoria, 3010 AUSTRALIA 00116K |
From: Lachlan A. <lh...@ee...> - 2002-11-19 00:56:58
|
Greetings Gilles, Most of the forward ports that I feel confident to do are now at <http://www.ee.mu.oz.au/staff/lha/patch.forward-ports> with the list of changes at <http://www.ee.mu.oz.au/staff/lha/htdig-3.1.6-3.2-ports>. Could you please apply them? The patches are relative to the status at 2002-11-07. Both defaults.cc and defaults.xml (in Brian White's format) are up-to-date with respect to the new attributes. The multimatch_factor has been cleaned up a bit compared with 3.1.6. It now counts how many "OR" terms a document matches. (For Boolean queries, the result of an "AND" is taken to match the average of the two arguments.) Currently, it simply multiplies by multimatch_factor if there is more than one match, but the originally documented functionality can be implemented by changing the line 822 parser.cc (in Parser::parse) from dm->score *= multimatch_factor; to dm->score *= pow (multimatch_factor, orMatches - 1.0); if you really want that exponential growth. On Mon, 11 Nov 2002 15:57, Geoff Hutchison wrote: > A few notes: > * does rewrite at start of got_href suffice? > No. Remember that you might want to rewrite the redirect. My question was more about the placement within got_href. 3.2.0 rewrites *before* normalisation, while 3.1.6 rewrites *after*. These are not equivalent, and we can't apply the rules twice (or else "htm->html" would become "htm->htmll"). Which one do we break compatibility with? -- Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678 Dept of Electrical and Electronic Engg CRICOS Provider Code University of Melbourne, Victoria, 3010 AUSTRALIA 00116K |
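[Editorial sketch] The two multimatch_factor behaviours described in the message above can be compared side by side. This is only a rough illustration, not the actual parser.cc code: the function name applyMultimatch and the exponential switch are hypothetical, and orMatches is assumed to be a floating-point count because the message says an "AND" takes the average of its arguments.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper, not the real ht://Dig code: scale a document's
// score according to how many "OR" terms it matched.
//   exponential == false : flat bonus once there is more than one match
//   exponential == true  : factor^(orMatches - 1), the originally
//                          documented behaviour
double applyMultimatch(double score, double orMatches,
                       double multimatchFactor, bool exponential)
{
    assert(orMatches >= 0.0);
    if (orMatches <= 1.0)
        return score;                       // single match: no bonus
    if (exponential)
        return score * std::pow(multimatchFactor, orMatches - 1.0);
    return score * multimatchFactor;        // same bonus for 2, 3, ... matches
}
```

With a factor of 2, a document matching three terms is doubled by the flat variant and quadrupled by the exponential one, which is the "exponential growth" the message refers to.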
From: Gilles D. <gr...@sc...> - 2002-11-19 21:43:04
|
According to Lachlan Andrew: > Greetings Gilles, > > Most of the forward ports that I feel confident to do are > now at > <http://www.ee.mu.oz.au/staff/lha/patch.forward-ports> > with the list of changes at > <http://www.ee.mu.oz.au/staff/lha/htdig-3.1.6-3.2-ports>. > > Could you please apply them? The patches are relative to > the status at 2002-11-07. Thanks, Lachlan. Unfortunately I won't have time to tackle this in the next day or two, but I'll try to get them in before week's end, so that they make it into the next snapshot. > Both defaults.cc and defaults.xml (in Brian White's > format) are up-to-date with respect to the new attributes. > > The multimatch_factor has been cleaned up a bit compared > with 3.1.6. It now counts how many "OR" terms a document > matches. (For Boolean queries, the result of an "AND" is > taken to match the average of the two arguments.) > Currently, it simply multiplies by multimatch_factor if > there is more than one match, but the originally documented > functionality can be implemented by changing the line 822 > parser.cc (in Parser::parse) from > dm->score *= multimatch_factor; > to > dm->score *= pow (multimatch_factor, orMatches - 1.0); > if you really want that exponential growth. Sounds good. If it's simple enough to do so, I may backport this to 3.1.7 when I finally get around to working on that. > On Mon, 11 Nov 2002 15:57, Geoff Hutchison wrote: > > A few notes: > > * does rewrite at start of got_href suffice? > > No. Remember that you might want to rewrite the redirect. > > My question was more about the placement within got_href. > 3.2.0 rewrites *before* normalisation, while 3.1.6 rewrites > *after*. These are not equivalent, and we can't apply the > rules twice (or else "htm->html" would become "htm->htmll"). > Which one do we break compatibility with? In retrospect, I think it makes more sense to do the rewrites first off, before anything else, as 3.2 does. I'll likely change 3.1.7 to do likewise. 
But 3.2 does need a rewrite call in got_redirect as well. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/ Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada) |
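[Editorial sketch] The reason the rewrite rules cannot simply be applied both before and after normalisation is that a rule need not be idempotent. The snippet below assumes a deliberately simplified rule (replace the first occurrence of one substring with another); the real HtURLRewriter works on configured patterns, and rewriteOnce is a hypothetical name used only for illustration.

```cpp
#include <string>

// Hypothetical stand-in for a single rewrite rule: replace the first
// occurrence of `from` in `url` with `to`. The URL is returned unchanged
// when `from` does not occur.
std::string rewriteOnce(const std::string &url,
                        const std::string &from, const std::string &to)
{
    std::string result = url;
    const std::string::size_type pos = result.find(from);
    if (pos != std::string::npos)
        result.replace(pos, from.size(), to);
    return result;
}
```

Applying the rule "htm" -> "html" once turns ".../index.htm" into ".../index.html"; applying it a second time produces ".../index.htmll", which is the breakage mentioned in the thread. Hence the rewrite has to happen at exactly one point in got_href (and once in got_redirect).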
From: Lachlan A. <lh...@ee...> - 2002-11-26 01:07:39
|
Greetings Geoff, Could you please apply the patch at <http://www.ee.mu.oz.au/staff/lha/patch.english.0> and copy installdir/synonyms directly over from 3.1.6? These files will need a lot of work at some stage. There are a lot of misspellings in english.0 which should be in synonyms instead, and a lot of words missing many endings. Is the policy to have all possible stemmings, even if they are "non-words", like "unrealises"? If so, we can really go to town on the affixes :) I plan to go through and clean them up, after the 3.2.0b5 release. Is the release still scheduled for 1 December? I think I've now got almost all of the 3.1.6 functionality in, except for the wildcard. (I'm not too keen to port all of the documentation over until defaults.xml becomes the official reference...) Thanks, Lachlan On Wed, 20 Nov 2002 08:42: > According to Lachlan Andrew: > > <http://www.ee.mu.oz.au/staff/lha/patch.forward-ports> > > Could you please apply them? The patches are relative > > to the status at 2002-11-07. > > I won't have time in the next day or two, but I'll try to > get them in before week's end, so that they make it into > the next snapshot. -- Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678 Dept of Electrical and Electronic Engg CRICOS Provider Code University of Melbourne, Victoria, 3010 AUSTRALIA 00116K |
From: Geoff H. <ghu...@ws...> - 2002-11-26 06:09:08
|
> Is the policy to have all possible stemmings, even if they > are "non-words", like "unrealises"? If so, we can really > go to town on the affixes :) No, and I'd expect that ispell doesn't want them either. Of course many people have moved away from ispell too... > Is the release still scheduled for 1 December? I think > I've now got almost all of the 3.1.6 functionaliy in, > except for the wildcard. I haven't heard from much of anyone, which isn't necessarily what I want just before a release. So I think we'll have to prod some people into confirming that the current snapshots are working, bugs have been fixed, etc. My guess is that the release will have to move back a small amount, but that's not too surprising. Still, I'd like to get it out before I head to my parents for the holidays. > (I'm not too keen to port all of the documentation over until > defaults.xml becomes the official reference...) I think it seems easy enough to generate the defaults.xml from any changes. So why don't we get the documentation over first. (It seems to me that it'd be easier to patch defaults.cc, then re-generate defaults.xml?) -Geoff |
From: Gilles D. <gr...@sc...> - 2002-11-27 04:53:53
|
According to Geoff Hutchison: > > Is the policy to have all possible stemmings, even if they > > are "non-words", like "unrealises"? If so, we can really > > go to town on the affixes :) > > No, and I'd expect that ispell doesn't want them either. Of course many > people have moved away from ispell too... Does that mean we'll end up having to add support for aspell dictionaries to htfuzzy endings? > > Is the release still scheduled for 1 December? I think > > I've now got almost all of the 3.1.6 functionaliy in, > > except for the wildcard. > > I haven't heard from much of anyone, which isn't necessarily what I want > just before a release. So I think we'll have to prod some people into > confirming that the current snapshots are working, bugs have been fixed, > etc. > > My guess is that the release will have to move back a small amount, but > that's not too surprising. Still, I'd like to get it out before I head > to my parents for the holidays. Sorry for my silence, but things got a bit hectic last week, so I never did get to committing patches as I hoped. I also haven't done much testing of recent snapshots, though I don't expect the tests I run on my small site contribute a whole lot to the overall testing effort anyway (how many times does it need to be tested on a small site running Red Hat?). I fully expected some schedule slippage, but 4 days to the tentative release date, I think we'll need to allow for a lot of slippage! Anyway, for what it's worth, I've made the changes to english.0 and synonyms, with one minor addition (adding D & S flags to birth). Thanks, Lachlan! Once again, I'll try to get to some of the other changes later this week, but can't promise anything because things are still rather busy for me. > > (I'm not too keen to port all of the documentation over until > > defaults.xml becomes the official reference...) > > I think it seems easy enough to generate the defaults.xml from any > changes. 
So why don't we get the documentation over first. (It seems to > me that it'd be easier to patch defaults.cc, then re-generate > defaults.xml?) I agree. Let's get the doc changes in. Then, if we can get the xml stuff in for 3.2.0b5, great, but if not, it's not the end of the world. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/ Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada) |
From: Lachlan A. <lh...@ee...> - 2002-11-28 02:05:09
|
Greetings, On Wed, 27 Nov 2002 15:53, Gilles Detillieux wrote: > According to Geoff Hutchison: > > > Is the policy to have all possible stemmings, even if > > > they are "non-words", like "unrealises"? > > No, and I'd expect that ispell doesn't want them > > either. Of course many people have moved away from > > ispell too... > Does that mean we'll end up having to add support for > aspell dictionaries to htfuzzy endings? Does it matter that the list originally came from ispell? Its role here is fundamentally different. For a spell checker, you only care what combinations of letters are valid words. For a stemmer, you only want to know which "words" are derived from a common stem. Unless the same actual file is used for spell checking, it is not clear why it matters what spell checker people use. Am I missing something? > I've made the changes to english.0 and synonyms, with one > minor addition (adding D & S flags to birth). Thanks for that. You might want to reconsider the '/S' flag; it produces 'birthes', not 'births' as you might expect. (The '*h -> es' rule suits words like 'wreath'.) That rule and its use are among the things I hope to clean up after 3.2.0b5 is out... > Let's get the doc changes in. Then, if we can > get the xml stuff in for 3.2.0b5, great. OK :) -- Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678 Dept of Electrical and Electronic Engg CRICOS Provider Code University of Melbourne, Victoria, 3010 AUSTRALIA 00116K |
From: Gilles D. <gr...@sc...> - 2002-11-28 15:21:42
|
According to Lachlan Andrew: > Greetings, > > On Wed, 27 Nov 2002 15:53, Gilles Detillieux wrote: > > According to Geoff Hutchison: > > > > Is the policy to have all possible stemmings, even if > > > > they are "non-words", like "unrealises"? > > > No, and I'd expect that ispell doesn't want them > > > either. Of course many people have moved away from > > > ispell too... > > Does that mean we'll end up having to add support for > > aspell dictionaries to htfuzzy endings? > > Does it matter that the list originally came from ispell? > Its role here is fundamentally different. For a spell > checker, you only care what combinations of letters are > valid words. For a stemmer, you only want to know which > "words" are derived from a common stem. Unless the same > actual file is used for spell checking, it is not clear why > it matters what spell checker people use. Am I missing > something? It only matters in that it may have an impact on available dictionaries. We can tweak the english dictionary all we want, but if someone wants a dictionary for some other language and finds that such a dictionary is better supported or more complete/correct in aspell or some other spell checker, than it is with ispell, then they may start asking for support in htfuzzy for these other dictionary formats. > > I've made the changes to english.0 and synonyms, with one > > minor addition (adding D & S flags to birth). > > Thanks for that. You might want to reconsider the '/S' > flag; it produces 'birthes', not 'births' as you might > expect. (The '*h -> es' rule suits words like 'wreath'.) > That rule and its use are among the things I hope to clean > up after 3.2.0b5 is out... It's out of there. Thanks for the heads-up. I've reinserted "births" instead, for the sake of completeness, even though htfuzzy won't use it. A quick grep shows there are a lot of *hs words in there that htfuzzy can't make use of. 
The quick fix would be to grab one of the available flags (BCEFKLOQW) and use that for th->ths, but it might be more logical to keep the S flag for th->ths pluralizations, and use something like E for th->thes conjugations. -- Gilles R. Detillieux E-mail: <gr...@sc...> Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/ Dept. Physiology, U. of Manitoba Winnipeg, MB R3E 3J7 (Canada) |
From: Lachlan A. <lh...@ee...> - 2002-11-28 22:37:34
|
On Fri, 29 Nov 2002 02:21, Gilles Detillieux wrote: > if someone wants a dictionary > for some other language and finds that such a dictionary > is better supported or more complete/correct in aspell Ahh... That makes sense. Thanks. However I still don't understand why we wouldn't want the English dictionary to stem unrealised and realises together (which implicitly allows the non-word "unrealises"). > A quick grep shows there are a lot of *hs words in there > that htfuzzy can't make use of. The quick fix would be > to grab one of the available flags (BCEFKLOQW) and use > that for th->ths, but it might be more logical to keep > the S flag for th->ths pluralizations, and use something > like E for th->thes conjugations. Yes. I've suggested a new rule to the ispell maintainer ([^cst]h -> s, [cst]h -> es) which fixes most problems (*gh,*ph), while maintaining compatibility with ispell as much as possible. You're right that we can add lots more rules to improve stemming in lots of ways. Cheers, Lachlan -- Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678 Dept of Electrical and Electronic Engg CRICOS Provider Code University of Melbourne, Victoria, 3010 AUSTRALIA 00116K |
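[Editorial sketch] The rule suggested to the ispell maintainer in the message above ([^cst]h -> s, [cst]h -> es) can be written out as a small function. This is an illustration only: pluraliseH is a hypothetical name, the real rule would live in the affix file rather than in C++ code, and returning the word unchanged when it does not end in 'h' is an assumption made here for simplicity.

```cpp
#include <string>

// Sketch of the proposed rule for words ending in 'h':
//   [^cst]h -> append "s"   (laugh -> laughs, graph -> graphs)
//   [cst]h  -> append "es"  (church -> churches, wish -> wishes)
// Words in "th" still receive "es" under this rule, so "birth" would
// still come out as "birthes" and needs separate handling, as the
// thread notes.
std::string pluraliseH(const std::string &word)
{
    const std::string::size_type n = word.size();
    if (n < 2 || word[n - 1] != 'h')
        return word;    // rule does not apply (assumption: leave unchanged)
    const char before = word[n - 2];
    if (before == 'c' || before == 's' || before == 't')
        return word + "es";
    return word + "s";
}
```

This shows why the rule fixes the *gh and *ph cases while leaving the th->ths versus th->thes question open.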
From: Neal R. <ne...@ri...> - 2002-12-02 21:11:29
|
This is on my list of things to work on.. An alternative is to have a separate word stemmer which stores the words in the index in stemmed form. The Porter Stemming algorithm is good for this, and I have code to do it. Thanks. On Fri, 29 Nov 2002, Lachlan Andrew wrote: > On Fri, 29 Nov 2002 02:21, Gilles Detillieux wrote: > > if someone wants a dictionary > > for some other language and finds that such a dictionary > > is better supported or more complete/correct in aspell > > Ahh... That makes sense. Thanks. However I still don't > understand why we wouldn't want the English dictionary to > stem unrealised and realises together (which implicitly > allows the non-word "unrealises"). > > > A quick grep shows there are a lot of *hs words in there > > that htfuzzy can't make use of. The quick fix would be > > to grab one of the available flags (BCEFKLOQW) and use > > that for th->ths, but it might be more logical to keep > > the S flag for th->ths pluralizations, and use something > > like E for th->thes conjugations. > > Yes. I've suggested a new rule to the ispell maintainer > ([^cst]h -> s, [cst]h -> es) which fixes most problems > (*gh,*ph), while maintaining compatibility with ispell as > much as possible. You're right that we can add lots more > rules to improve stemming in lots of ways. > > Cheers, > Lachlan > > -- > Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678 > Dept of Electrical and Electronic Engg CRICOS Provider Code > University of Melbourne, Victoria, 3010 AUSTRALIA 00116K Neal Richter Knowledgebase Developer RightNow Technologies, Inc.
Customer Service for Every Web Site Office: 406-522-1485 |
From: Lachlan A. <lh...@ee...> - 2002-12-03 02:04:06
|
Greetings Neal, I'll leave this to you. Indexing the stems is a good suggestion. It would certainly give faster searching. If it replaced the unstemmed inverted file then it would also save on storage requirements, but it would mean we couldn't search on the unstemmed version (if that is of concern). Alternatively, indexing both the stemmed and unstemmed versions may be a bit extravagant... I have also been wondering if it is possible to turn off word-level indexing, to give (much) smaller inverted files if people don't need phrase searching. Does anybody know? That would be a compelling reason to store word attributes in a pure bit-map format, rather than using the more compact formats we were discussing recently. Cheers, Lachlan On Tue, 3 Dec 2002 08:08, Neal Richter wrote: > This is on my list of things to work on.. > > An alternative is to have a separate word stemmer which > stores the words in the index in stemmed form. > > The Porter Stemming algorithm is good for this, and I > have code to do it. -- Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678 Dept of Electrical and Electronic Engg CRICOS Provider Code University of Melbourne, Victoria, 3010 AUSTRALIA 00116K |
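[Editorial sketch] The trade-off described above (a stemmed index is smaller, but can no longer answer a search for one exact word form) can be seen in a toy model. Everything here is hypothetical: toyStem is a crude suffix stripper and is NOT the Porter algorithm, and the Index type is a plain in-memory map rather than ht://Dig's word database.

```cpp
#include <map>
#include <set>
#include <string>

// Toy suffix stripper standing in for a real stemmer.
// This is NOT the Porter algorithm, just enough to illustrate the idea.
std::string toyStem(const std::string &w)
{
    static const char *const suffixes[] = {"ing", "es", "ed", "s"};
    for (const char *suf : suffixes) {
        const std::string s(suf);
        if (w.size() > s.size() + 2 &&
            w.compare(w.size() - s.size(), s.size(), s) == 0)
            return w.substr(0, w.size() - s.size());
    }
    return w;
}

using Index = std::map<std::string, std::set<int>>;   // word -> document ids

// Add one word occurrence; with stemWords == true only the stem is stored.
void addWord(Index &index, const std::string &word, int docId, bool stemWords)
{
    index[stemWords ? toyStem(word) : word].insert(docId);
}

// Look a query word up the same way the index was built.
std::set<int> lookup(const Index &index, const std::string &word, bool stemWords)
{
    const Index::const_iterator it =
        index.find(stemWords ? toyStem(word) : word);
    if (it == index.end())
        return std::set<int>();
    return it->second;
}
```

Indexing "search" in document 1 and "searches" in document 2 gives two keys in the unstemmed index but one in the stemmed index; a lookup of "searches" then returns only document 2 in the first case and both documents in the second.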
From: Geoff H. <ghu...@ws...> - 2002-12-03 05:01:35
|
> Indexing the stems is a good suggestion. It would > certainly give faster searching. If it replaced the > unstemmed inverted file then it would also save on storage > requirements, but it would mean we couldn't search on the > unstemmed version (if that is of concern). The general strategy used by ht://Dig is post-indexing fuzzy matching. Certainly a Porter stemming fuzzy algorithm would be quite useful. But I'd say if we intend on indexing stems, it should definitely be optional. I can think of several instances where I'd want to search on one particular word, and *not* stemmed variants. So I'd much rather see work into innovative fuzzy algorithms. Anyone want to add a real "spelling" fuzzy? What about a Porter endings fuzzy to replace/augment endings? > I have also been wondering if it is possible to turn off > word-level indexing, to give (much) smaller inverted files > if people don't need phrase searching. Does anybody know? Not at the moment. But you lose a lot more than phrase searching. You lose field-restricted searching. You lose scoring by proximity (like Google). You lose the ability to score "on the fly"--not to be discounted since many users wonder why they change their scoring factors and the results don't change. If you look at other search products, the basic strategy now is "index everything" and let the search frontend filter if needed. Yes, some even index words like the, and, not, etc. Just my $0.02, -Geoff |
From: Lachlan A. <lh...@ee...> - 2002-12-03 05:39:13
|
Greetings Geoff, On Tue, 3 Dec 2002 16:01, Geoff Hutchison wrote: > > I have also been wondering if it is possible to turn > > off word-level indexing, to give (much) smaller > > inverted files if people don't need phrase searching. > > Does anybody know? > > Not at the moment. > > But you lose a lot more than phrase searching. You lose > field-restricted searching. You lose scoring by proximity > (like Google). You lose the ability to score "on the > fly"--not to be discounted since many users wonder why > they change their scoring factors and the results don't > change. Thanks for raising those points. These are all enhancements that came with 3.2.0's database restructure, but I think that only phrase searching actually *needs* word-level inverted files. As I said, document-level indexing is a strong motivation for the word attributes to be pure bitmaps. The index could store the "OR" of each field set for any occurrence, so you could still say "If this word occurs in the title AND that word occurs in a heading". I agree that on-the-fly scoring is the way to go, but again I can't see why it couldn't be done based on the OR of the flags (although I could be missing something). Even (very coarse) proximity searching can be done fairly efficiently by, for example, dividing each document into eight regions and specifying (in one byte) which regions contain the word. I'm trying to avoid the "progress = bloat" phenomenon. Although I don't want to change ht://Dig's course, my original interest in it was my aim of all Linux boxes having all their documentation searchable. That is one application which requires a minimal-overhead option, albeit with reduced performance. If I get enthusiastic, I'll look at writing a patch... Cheers, Lachlan -- Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678 Dept of Electrical and Electronic Engg CRICOS Provider Code University of Melbourne, Victoria, 3010 AUSTRALIA 00116K |
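[Editorial sketch] The document-level scheme proposed above (one posting per word and document, holding the OR of the field flags plus a one-byte map of eight document regions) might look roughly like this. The flag values, the DocPosting structure and the function names are all hypothetical; the real ht://Dig word flags and database layout may well differ.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical field flags; the real ht://Dig flag values may differ.
enum : std::uint8_t {
    FLAG_TEXT    = 1 << 0,
    FLAG_TITLE   = 1 << 1,
    FLAG_HEADING = 1 << 2
};

// One posting per (word, document) instead of one per occurrence.
struct DocPosting {
    std::uint8_t flags   = 0;   // OR of the field flags of every occurrence
    std::uint8_t regions = 0;   // bit i set if the word occurs in region i of 8
};

// Record one occurrence at word position `pos` (0-based) in a document
// that is `docLen` words long.
void addOccurrence(DocPosting &p, std::uint8_t fieldFlag,
                   std::size_t pos, std::size_t docLen)
{
    p.flags |= fieldFlag;
    std::size_t region = (docLen == 0) ? 0 : (pos * 8) / docLen;
    if (region > 7)
        region = 7;             // clamp positions at or past the end
    p.regions |= static_cast<std::uint8_t>(1u << region);
}

// Very coarse proximity test: do the two words share a region, or occur
// in adjacent regions?
bool roughlyNear(const DocPosting &a, const DocPosting &b)
{
    const unsigned spread = a.regions | (a.regions << 1) | (a.regions >> 1);
    return (spread & b.regions) != 0;
}
```

A query such as "this word in the title AND that word in a heading" then only needs the flags byte of each posting, and scoring factors can still be applied at search time, at the cost of losing exact positions and therefore phrase searching.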
From: Gabriele B. <g.b...@co...> - 2002-12-04 08:12:16
|
Ciao guys,

> Indexing the stems is a good suggestion. It would
> certainly give faster searching. If it replaced the
> unstemmed inverted file then it would also save on storage
> requirements, but it would mean we couldn't search on the
> unstemmed version (if that is of concern). Alternatively,
> indexing both the stemmed and unstemmed versions may be a
> bit extravagant...

IMHO, the ultimate goal for a search process is to get a set of documents satisfying a semantic criterion, or better, a context criterion. If we can do more in this direction, that'd be great.

Usually, for this purpose, the stem part of words should be enough, as it is more powerful in representing the meaning of a word.

Do you have any exact statistics showing the difference in storage of the two indexes? As Lachlan says, storing both could be a bit extravagant and, again, in my opinion could lead us off track. Also, the problem is less important in English; I don't know any other languages except Italian and the Latin languages, but they are for sure more complex, having different affix rules and many more tenses. You know better than me that this would lead us far away from the user's first goal: the search for documents about 'something'.

As Geoff wisely suggests though, we could implement a different fuzzy algorithm for the 'Porter stemming' which builds a new index (a stemmed one). If you have some references and suggestions, I'd be happy to offer to code it; that'd be a great chance for me to get into the 'word' module of the new ht://Dig system.

Geoff, Gilles, Neal and Lachlan, I expect some news from you about this! :-)

> I have also been wondering if it is possible to turn off
> word-level indexing, to give (much) smaller inverted files
> if people don't need phrase searching. Does anybody know?
> That would be a compelling reason to store word attributes
> in a pure bit-map format, rather than using the more
> compact formats we were discussing recently.

I guess customisation is our goal. In a retrieval phase, we'd want to store almost *anything* we can, then maybe with different fuzzy algorithms build alternative indexes (smaller or bigger, depending on users' settings).

So ... I vote for an additional algorithm as Geoff suggests! :-)

Ciao ciao
-Gabriele

-- Gabriele Bartolini - Web Programmer Comune di Prato - Prato - Tuscany - Italy g.b...@co... | http://www.comune.prato.it > find bin/laden -name osama -exec rm {} ;
|
From: Lachlan A. <lh...@ee...> - 2002-12-05 08:01:42
|
On Wed, 4 Dec 2002 19:12, Gabriele Bartolini wrote:
> IMHO, the ultimate goal for a search process is to get a
> set of documents satisfying a semantic criterion, or better,
> a context criterion.

The ultimate is the singular value decomposition approach that someone (Geoff?) was suggesting using for a "similar documents" search. I'd really like to see this in HtDig eventually. It moves away from indexing on words at all, and instead indexes on an abstract notion of "how often are the search words used in similar contexts to the words of the target document?"

> ... Italian and the Latin languages, but they
> are for sure more complex, having different affix rules
> and many more tenses.

Good point. Am I also correct in believing that some languages like German have a lot of changes to the stems themselves ("schwimmen, schwamm, geschwommen", "trinken, trank, getrunken")? Is there an approach that can handle that much generality?

> As Geoff suggested, we could implement a
> different fuzzy algorithm for the 'Porter stemming' which
> builds a new index (a stemmed one).

Yes, it would be good to have a fuzzy stemming algorithm which doesn't simply return a query with (variant1 OR variant2 OR ...), but actually searches a stemmed index. It would be more efficient if there are lots of different forms.

> > word-level indexing, to give (much) smaller inverted
> > files if people don't need phrase searching.
>
> I guess customisation is our goal. In a retrieval phase,
> we'd want to store almost *anything* we can, then maybe
> with different fuzzy algorithms build alternative indexes
> (smaller or bigger, depending on users' settings).

Yes, a document-level inverted file could be generated from the word-level one after the whole dig. I don't know much about htdig's fuzzy mechanism yet; is it possible to delete the main inverted file and just rely on a "fuzzy" one? If so, the only other disadvantages would be speed and the amount of temporary space required (RAM and disk space).

Cheers,
Lachlan

-- Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678 Dept of Electrical and Electronic Engg CRICOS Provider Code University of Melbourne, Victoria, 3010 AUSTRALIA 00116K
|
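As a rough illustration of the last suggestion in the message above, a document-level inverted file can be derived from word-level rows in a single pass once the dig has finished. The row layout `(word, doc_id, location)` and the function name are assumptions for this sketch; they are not the actual WordDB format.

```python
from collections import defaultdict


def to_document_level(word_level_rows):
    """Collapse (word, doc_id, location) rows into word -> sorted doc IDs.

    Locations are dropped, which is where the space saving comes from and
    also why phrase searching is no longer possible on the result.
    """
    postings = defaultdict(set)
    for word, doc_id, _location in word_level_rows:
        postings[word].add(doc_id)
    return {word: sorted(docs) for word, docs in postings.items()}
```

The intermediate sets are held in memory here, which matches the concern about temporary space: a real implementation over a large dig would more likely stream sorted rows from disk.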
From: Neal R. <ne...@ri...> - 2002-12-06 20:12:32
|
> Usually, for this purpose, the stem part of words should be enough, as
> it is more powerful in representing the meaning of a word.

Yes, but there is a tradeoff... Using stemming negatively impacts precision while improving generalization.

In the information retrieval research community there is disagreement about the utility of stemming. Here's the basic feeling:

If the document index is LARGE, don't use stemming, because it is important to be precise so that you are not 'drinking from a firehose' by getting back too many results.

If the document index is SMALL, use stemming, because the increased word generalization helps avoid queries with no results.

Most large internet search engines don't use stemming for this reason... you would get back MANY more results with less precision.

> Do you have any exact statistics showing the difference in storage of
> the two indexes? As Lachlan says, storing both could be a bit extravagant
> and, again, in my opinion could lead us off track. Also, the

Currently our index stores one row per word-document-location. As discussed before, this is inefficient and we've got a plan to change it. Moving to stemming without changing this would result in NO efficiency improvement in WordDB.

At the point we change it to be [word-document]/[loc1,loc2,...locn] we'll have a significant WordDB efficiency savings. Even after this change we will still have a row+ PER UNIQUE WORD. Stemming would provide an efficiency improvement by having one row PER STEM.

According to the literature, if you go with a stemmed index exclusively, the index efficiency goes up by ABOUT 20-30%. This estimate is very data and language dependent.

I research and implement this kind of stuff at work... I'd be happy to post links to a couple of research papers if people are interested.

> problem is less important in English; I don't know any other
> languages except Italian and the Latin languages, but they are for sure more
> complex, having different affix rules and many more tenses.

Here's the link to Martin Porter's (BSD Licensed) stemmers: http://snowball.tartarus.org/

There are 10 languages supported. Some languages are very difficult to stem, such as Finnish... but in general stemming is beneficial for word generalization.

-------

I agree with Geoff in that we don't want to go with stemming exclusively..

Here's a proposal for 'intelligent stemming' in HtDig:

1. Fix index efficiency.
2. Add a configuration switch to disable stemming ;-)
3. Implement the stemming algorithm to ADD additional rows to the index with stemmed versions of the words (with a row flag to signify this).
4. During result ranking we rank the results with an algorithm like this:

   If num documents is LARGE
      unstemmed rows are 80%, stemmed rows are 20% of the 'score'
   If num documents is MEDIUM
      unstemmed rows are 60%, stemmed rows are 40% of the 'score'
   If num documents is SMALL
      unstemmed rows are 30%, stemmed rows are 70% of the 'score'

These percentages are gut-feeling ball-park numbers based on my experience and research on the topic. The meaning of Large/Medium/Small needs to be decided. It also leans toward preferring unstemmed words because of their higher 'precision'.

This system does add duplicate rows in a sense to the index. Here's an example:

traveling -> travel
travel    -> travel
travels   -> travel
traveler  -> travel
traveled  -> travel

Document  Word       StemFlag  Locations

20        traveling  0         24 36 110
20        travel     0         52 98 220
20        travels    0         10 75 340
20        traveler   0         13 180
20        traveled   0         200
20        travel     1         10 13 24 36 52 75 98 110 180 200 220 340

The last row is a kind of duplicate, and this impacts efficiency negatively, but does get us some increased word generalization.

-------

The other idea thrown around is to improve the 'fuzzy endings' algorithm. I agree that this needs doing, and Porter's stemmers will give us many ideas on how to do this. Note however that this is an unproven and less studied technique in NLP & IR circles, so we would be blazing some new ground... which tends to be a slow process to get correct.

If we do both we can leave it up to users to 'tune' HtDig to their liking. Flexibility is good.

I also don't support doing anything about stemming until we fix the index (which I'm working on). It will negatively impact the size too much for large indexes.

FEEDBACK PLEASE!!

Thanks!

Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
Office: 406-522-1485
|
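Steps 3 and 4 of the proposal above can be sketched as follows. This is a toy illustration under stated assumptions: `TOY_STEMS` is just the 'travel' example from the message standing in for a real stemmer, the row layout is simplified, and the `small`/`large` thresholds are invented placeholders, since the message explicitly leaves the meaning of Large/Medium/Small undecided.

```python
# Toy stem table copied from the example in the message; a real stemmer
# would replace this dictionary.
TOY_STEMS = {
    "traveling": "travel",
    "travel": "travel",
    "travels": "travel",
    "traveler": "travel",
    "traveled": "travel",
}


def add_stem_rows(rows):
    """rows maps (doc_id, word) -> list of locations.

    Returns (doc_id, word, stem_flag, locations) tuples: every original row
    with stem_flag 0, plus one merged row per (doc_id, stem) with stem_flag 1.
    """
    out = [(doc, word, 0, sorted(locs)) for (doc, word), locs in rows.items()]
    merged = {}
    for (doc, word), locs in rows.items():
        stem = TOY_STEMS.get(word, word)
        merged.setdefault((doc, stem), set()).update(locs)
    out.extend((doc, stem, 1, sorted(locs)) for (doc, stem), locs in merged.items())
    return out


def stem_weights(num_documents, small=10_000, large=1_000_000):
    """Return (unstemmed_weight, stemmed_weight); thresholds are hypothetical."""
    if num_documents >= large:
        return 0.8, 0.2
    if num_documents >= small:
        return 0.6, 0.4
    return 0.3, 0.7


def blended_score(unstemmed_score, stemmed_score, num_documents):
    """Mix the two partial scores using the ball-park percentages."""
    w_unstemmed, w_stemmed = stem_weights(num_documents)
    return w_unstemmed * unstemmed_score + w_stemmed * stemmed_score
```

Feeding the five unstemmed rows for document 20 through `add_stem_rows` yields the same merged stem row as the table in the message, which is the "kind of duplicate" row whose cost is being weighed.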
From: Lachlan A. <lh...@ee...> - 2002-12-09 01:15:27
|
Greetings Neal,

Your suggestion sounds good, especially steps 1 and 2... I have some beginner's questions:

- Given the flag to disable stemming, what is the disadvantage of simply making it a three-value flag: index unstemmed words, index stems, index both?

- The format you describe sounds like a "half-inverted" file -- listing locations *within* a document by word, but listing *document* locations by document. Is that correct?

- You said that the approach currently taken by fuzzy endings is uncharted waters. I assume you are talking about the approach of simply creating a disjunction of the derived words. What is hard to "get right" about that? In terms of the documents returned, it sounds the same as what you have proposed. In terms of implementation, it sounds like what 'fuzzy endings' does now, except for fixing the stemming.

- With stemming in general, what is done about negating affixes? If I searched for 'mercy', I wouldn't want results about 'merciless' (although I would want results about 'merciful').

Thanks!
Lachlan

On Sat, 7 Dec 2002 07:09, Neal Richter wrote:
> I agree with Geoff in that we don't want to go with
> stemming exclusively..
> Here's a proposal for 'intelligent stemming' in HtDig:
>
> 1. Fix index efficiency.
> 2. Add a configuration switch to disable stemming ;-)
> 3. Implement the stemming algorithm to ADD additional
> rows to the index with stemmed versions of the words
> (with a row flag to signify this).
> This system does add duplicate rows in a sense to the
> index.
>
> traveling -> travel
> travel    -> travel
> travels   -> travel
> traveler  -> travel
> traveled  -> travel
>
> Document  Word       StemFlag  Locations
>
> 20        traveling  0         24 36 110
> 20        travel     0         52 98 220
> 20        travels    0         10 75 340
> 20        traveler   0         13 180
> 20        traveled   0         200
> 20        travel     1         10 13 24 36 52 75
>                                98 110 180 200 220 340
>
> FEEDBACK PLEASE!!

-- Lachlan Andrew Phone: +613 8344-3816 Fax: +613 8344-6678 Dept of Electrical and Electronic Engg CRICOS Provider Code University of Melbourne, Victoria, 3010 AUSTRALIA 00116K |
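The third question in the message above, whether a disjunction of derived words returns the same documents as a lookup in a stemmed index, can be checked on a toy example. Everything here is hypothetical: the stem table is the 'travel' example from the thread, the index contents are invented, and neither function reflects how ht://Dig's 'endings' fuzzy is actually implemented.

```python
# Toy stem table from the thread's example, standing in for a real stemmer.
TOY_STEMS = {
    "traveling": "travel",
    "travel": "travel",
    "travels": "travel",
    "traveler": "travel",
    "traveled": "travel",
}

# Invented unstemmed index: word -> set of document IDs.
WORD_INDEX = {
    "travel": {20, 31},
    "traveling": {20},
    "traveler": {42},
    "pasta": {7},
}


def stem(word):
    return TOY_STEMS.get(word, word)


def search_by_expansion(query):
    """Query-time approach: OR together every indexed word sharing the stem."""
    target = stem(query)
    docs = set()
    for word, postings in WORD_INDEX.items():
        if stem(word) == target:
            docs |= postings
    return docs


def build_stem_index(word_index):
    """Index-time approach: merge postings under each stem once."""
    stem_index = {}
    for word, postings in word_index.items():
        stem_index.setdefault(stem(word), set()).update(postings)
    return stem_index


STEM_INDEX = build_stem_index(WORD_INDEX)


def search_stem_index(query):
    """Single lookup in the prebuilt stemmed index."""
    return STEM_INDEX.get(stem(query), set())
```

With the same stemmer on both sides the two searches return identical document sets; the difference is where the work happens (one lookup versus one per variant) and how much index space is used. The sketch does nothing about negating affixes such as 'merciless', which would need handling in the stemmer itself.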
From: Gabriele B. <g.b...@co...> - 2002-12-09 08:12:38
|
Ciao guys,

again, sorry in advance, as I will certainly make mistakes. I would love to get to know more about this area, which is pretty new for me too. So please be patient. :-)

On Mon, 2002-12-09 at 02:14, Lachlan Andrew wrote:
> - The format you describe sounds like a "half-inverted"
> file -- listing locations *within* a document by word, but
> listing *document* locations by document. Is that
> correct?

I think that was a flat representation of the index file, just an example. Am I right, Neal?

In a simple scenario, we'll have (please consider that this is a very rough draft!):

- a word index (word id, stemmed/unstemmed flag, maybe language?)
- a document index (document id, info regarding the document, pretty much as now: title, modification date, etc.)
- an inverted index (word id, document id, locations)

Words
-----
ID  Word       S/U  Lang
--  ----       ---  ----
1   traveling  0    en
3   casa       0    it
12  travel     0    en
23  travels    0    en
45  pasta      0    it
60  travel     1    en
...

Documents
---------
ID  URL                    Other info
--  ---                    ----------
1   http://www.pippo.it/   .....
2   http://www.htdig.org/  ...

Index
-----
ID W  ID D  Locations and related info (position and markup)
----  ----  ------------------------------------------------
1     2     1 Value_location  3 Value_location

Value_location is the value given to the location of the word.

Am I right? Of course it's just an example ... :-) Any comments about the language?

> - With stemming in general, what is done about negating
> affixes? If I searched for 'mercy', I wouldn't want
> results about 'merciless' (although I would want results
> about 'merciful').

Good point, are there any plans to include negative words too?

Ciao ciao
-Gabriele

-- Gabriele Bartolini - Web Programmer Comune di Prato - Prato - Tuscany - Italy g.b...@co... | http://www.comune.prato.it > find bin/laden -name osama -exec rm {} ;
|
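The three-table draft in the message above could be modelled along these lines. It is a sketch only: the type names, field names, and the `urls_for` helper are invented for illustration, and the per-location value ("Value_location") is omitted, with each location reduced to a bare position.

```python
from typing import NamedTuple


class Word(NamedTuple):
    word_id: int
    text: str
    stemmed: bool  # the S/U column: True for a stem row
    lang: str


class Document(NamedTuple):
    doc_id: int
    url: str


# Sample rows taken from the draft tables.
WORDS = [
    Word(1, "traveling", False, "en"),
    Word(3, "casa", False, "it"),
    Word(12, "travel", False, "en"),
    Word(23, "travels", False, "en"),
    Word(45, "pasta", False, "it"),
    Word(60, "travel", True, "en"),
]

DOCUMENTS = {
    1: Document(1, "http://www.pippo.it/"),
    2: Document(2, "http://www.htdig.org/"),
}

# Inverted index: (word_id, doc_id) -> list of positions.
INVERTED = {(1, 2): [1, 3]}


def urls_for(text, stemmed=False, lang="en"):
    """Return the URLs of documents containing the given word entry."""
    ids = {
        w.word_id
        for w in WORDS
        if w.text == text and w.stemmed == stemmed and w.lang == lang
    }
    return sorted(
        DOCUMENTS[doc_id].url for (word_id, doc_id) in INVERTED if word_id in ids
    )
```

Keeping the stemmed flag and language on the word entry rather than on each index row means 'travel' as a word (ID 12) and 'travel' as a stem (ID 60) are simply two different word IDs, so the inverted index itself needs no extra column.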