From: Geoff H. <ghu...@ws...> - 2002-01-15 19:05:09
|
I've been in the middle of a few projects and haven't turned towards ht://Dig much. But I took a look at the checklist I had for 3.1.6 and I wasn't sure what needed to be done besides:

* Release notes
* Maindocs merges
* Regex/RX issues

And of course the usual sort of pre-release testing and beating... Is this all that's left?

-Geoff |
From: Geoff H. <ghu...@us...> - 2002-01-13 08:13:47
|
STATUS of ht://Dig branch 3-2-x

RELEASES:
3.2.0b4: In progress
3.2.0b3: Released: 22 Feb 2001.
3.2.0b2: Released: 11 Apr 2000.
3.2.0b1: Released: 4 Feb 2000.

SHOWSTOPPERS:

KNOWN BUGS:
* Odd behavior with $(MODIFIED) and scores not working with wordlist_compress set, but working fine without wordlist_compress. (The date is definitely stored correctly, even with compression on, so this must be some sort of weird htsearch bug.)
* Not all htsearch input parameters are handled properly: PR#648. Use a consistent mapping of input -> config -> template for all inputs where it makes sense to do so (everything but "config" and "words"?).
* If exact isn't specified in search_algorithms, $(WORDS) is not set correctly: PR#650. (The documentation for 3.2.0b1 is updated, but can we fix this?)
* META descriptions are somehow added to the database as FLAG_TITLE, not FLAG_DESCRIPTION. (PR#859)

PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out b/c htword API changed.)
* MySQL patches to 3.1.x to be forward-ported and cleaned up. (Should really only attempt to use SQL for doc_db and related, not word_db.)

NEEDED FEATURES:
* Field-restricted searching.
* Return all URLs.
* Handle noindex_start & noindex_end as string lists.
* Handle local_urls through file:// handler, for mime.types support.
* Handle directory redirects in RetrieveLocal.
* Merge with mifluz.

TESTING:
* httools programs: (htload a test file, check a few characteristics, htdump and compare)
* Turn on URL parser test as part of test suite.
* htsearch phrase support tests.
* Tests for new config file parser.
* Duplicate document detection while indexing.
* Major revisions to ExternalParser.cc, including fork/exec instead of popen, argument handling for parser/converter, allowing binary output from an external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.

DOCUMENTATION:
* List of supported platforms/compilers is ancient.
* Add thorough documentation on htsearch restrict/exclude behavior (including '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes to template variables. (Relates to PR#648.) Also make sure these config attributes are all documented in defaults.cc, even if they're only set by input parameters and never in the config file.
* Split attrs.html into categories for faster loading.
* require.html is not updated to list new features and disk space requirements of 3.2.x (e.g. phrase searching, regex matching, external parsers and transport methods, database compression).
* TODO.html has not been updated for the current TODO list and completions.

OTHER ISSUES:
* Can htsearch actually search while an index is being created? (Does Loic's new database code make this work?)
* The code needs a security audit, esp. htsearch.
* URL.cc tries to parse malformed URLs (which causes further problems). It should probably just set everything to empty. This relates to PR#348. |
From: Geoff H. <ghu...@ws...> - 2002-01-13 04:37:10
|
At 4:38 PM -0600 1/12/02, Gilles Detillieux wrote:
>SSI tag intended to generate an HTTP header MUST be at the very start
>of the document, before any HTML that will be output.
>
>As Geoff suggested, some verbose output from htdig may be helpful.

Yeah, even if it comes at the very start of the document, I'm a bit worried about how the result will look. If it "looks like another header" then there's no problem. (You can do some of this in some e-mail programs by putting X-Your-Header-Here: at the top of a message.) But if there's already the break after the headers and *then* the SSI header, that could be a bit harder to recognize. So it would be useful either to get some verbose output from htdig, or to use curl or wget and ask for the complete server response, including headers.

-Geoff |
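Geoff's worry, spotting whether an SSI-generated header lands inside the real headers or after the blank line that ends them, comes down to finding the first blank line in the raw response (which `curl -i` or `wget -S` will show you). A minimal sketch in Python, using a made-up canned response rather than real htdig output:

```python
# Sketch: locate the end of the HTTP headers in a raw server response.
# The response below is a fabricated example for illustration only.
def split_response(raw: bytes):
    """Split a raw HTTP response into (header lines, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    return head.decode("latin-1").splitlines(), body

raw = (b"HTTP/1.1 200 OK\r\n"
       b"Date: Tue, 15 Jan 2002 19:00:00 GMT\r\n"
       b"Content-Type: text/html\r\n"
       b"\r\n"
       b"Last-Modified: Sat, 12 Jan 2002 04:37:10 GMT\r\n"  # SSI output: too late
       b"<html>...</html>")

headers, body = split_response(raw)
in_headers = any(h.lower().startswith("last-modified:") for h in headers)
print(in_headers)  # False: the SSI-generated header ended up in the body
```

If the SSI tag fires after the server has already emitted the blank line, a client like htdig will treat its output as document body, which is exactly the failure mode Geoff describes.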
From: Gilles D. <gr...@sc...> - 2002-01-12 22:39:02
|
According to Greg Lepore:
> If the last modified date is included in the document via:
> <!--#echo var="LAST_MODIFIED"-->
> should HTDIG be using this as the last modified date for a re-index? It
> appears so according to:
> http://www.htdig.org/mail/2000/10/0038.html
> but does not appear to be the case in 3.2 (1/6 build).

I believe in that thread, Torsten was suggesting 3 different SSI techniques to get the Last-Modified header out. If this one doesn't work, maybe try the other two, or try them in combination. I don't know if any of these are specific to particular versions of Apache or other web servers, so you should probably look at the documentation for the server you're using. Also, I believe we had concluded in that thread that any SSI tag intended to generate an HTTP header MUST be at the very start of the document, before any HTML that will be output. As Geoff suggested, some verbose output from htdig may be helpful.

> The sysadmin wants to avoid XBitHack as a workaround and can't figure out
> any other way to add Last-Modified to the http headers. I know this
> question was discussed in detail last year but I was unable to find a
> solution. SSI is the reason the Last-Modified is not being included in the
> http headers and that cannot be changed.
> System is RedHat 7.2 on Intel, tons o' ram and ghz. Thanks,

All the XBitHack does is enable SSI on .html files that have the execute bit turned on, treating them as .shtml files. If you need to use SSI tags to get the Last-Modified header output, presumably it's because the files are already server-parsed, so I don't see what the XBitHack would get you. If the files aren't server-parsed, you should be getting the header based on the "mtime" of the file. If the files are server-parsed, but not SSI, then it must be something else like ASP, JSP, PHP, or some other pre-processor, so you'd need to use whatever trick that pre-processor allows for getting the header out.

--
Gilles R. Detillieux    E-mail: <gr...@sc...>
Spinal Cord Research Centre    WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba    Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)    Fax: (204)789-3930 |
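For reference, the arrangement the thread describes (header-generating SSI directives before any HTML output) would look something like the sketch below. The `#config`/`#echo` directives are standard Apache mod_include syntax, but whether this actually yields an HTTP header depends on the server setup, so treat it as an illustration of the placement rule rather than a known-good recipe:

```
<!--#config timefmt="%a, %d %b %Y %H:%M:%S GMT" -->
<!--#echo var="LAST_MODIFIED" -->
<html>
<head>...
```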
From: Gabriele B. <an...@us...> - 2002-01-12 07:23:46
|
Ciao guys, as you probably noticed, I have added the code regarding the management of the 'accept-language' header of the HTTP protocol. Gilles and Geoff in particular, please review the description of the attribute I wrote. I tried the attribute and it works (on a per-server-block basis).

Ciao,
-Gabriele

--
Gabriele Bartolini - Web Programmer
Current Location: Prato, Tuscany, Italy
an...@us... | http://www.prato.linux.it/~gbartolini | ICQ#129221447
> find bin/laden -name osama -exec rm {} \; |
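For readers scanning the archive, a sketch of what using the new attribute might look like in htdig.conf. The exact spelling is my assumption from ht://Dig's naming conventions, so verify it against the attribute description Gabriele added:

```
# Hypothetical usage sketch - check attrs.html for the real name/syntax.
# Ask servers for Italian content first, then any English:
accept_language: it en
```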
From: Geoff H. <ghu...@ws...> - 2002-01-12 03:26:03
|
At 1:15 PM -0500 1/11/02, Greg Lepore wrote:
>should HTDIG be using this as the last modified date for a re-index?
>It appears so according to:
>http://www.htdig.org/mail/2000/10/0038.html
>but does not appear to be the case in 3.2 (1/6 build).

If you can show me the entire output (including headers and all blank lines), I might be able to help. The catch is figuring out when the headers (and/or the SSI headers) are finished and the document is supposed to start.

>The sysadmin wants to avoid XBitHack as a workaround and can't
>figure out any other way to add Last-Modified to the http headers.

I don't think this would help--Apache won't send out a Last-Modified header for any server-parsed code, IIRC.

-Geoff |
From: Greg L. <gr...@md...> - 2002-01-11 18:16:37
|
If the last modified date is included in the document via:

<!--#echo var="LAST_MODIFIED"-->

should HTDIG be using this as the last modified date for a re-index? It appears so according to http://www.htdig.org/mail/2000/10/0038.html but does not appear to be the case in 3.2 (1/6 build). The sysadmin wants to avoid XBitHack as a workaround and can't figure out any other way to add Last-Modified to the http headers. I know this question was discussed in detail last year but I was unable to find a solution. SSI is the reason the Last-Modified is not being included in the http headers and that cannot be changed. System is RedHat 7.2 on Intel, tons o' ram and ghz. Thanks,

Gregory Lepore
Webmaster, State of Maryland
Supervisor, Archives of Maryland Online
410-260-6425 |
From: Gilles D. <gr...@sc...> - 2002-01-10 16:47:54
|
According to Jamie Anstice:
> Seeing as people are talking about this, I thought I'd relate my hack at
> displaying & searching on specific metadata fields. We had a requirement
> to restrict queries to pages containing a specific meta-tag element or
> elements, and also to display meta-tag content in the output.
>
> First I added a meta-data List to DocumentRef, with methods to return the
> list and add strings to the list (DocumentRef is what's stored in the
> db.docdb for each document). Then I added a got_metatag method to
> Retriever to store meta-tag content in the DocumentRef. Then in the
> HTML parser I modified do_tag to call Retriever::got_metatag when it sees
> an appropriate metatag (I made a config attribute meta_tag_store which can
> take a StringList of metatag names to store, or 'all', and only matching
> meta-tag NAMEs are stored - I only store the meta-tag content attribute,
> indexed by the name attribute. This might be a bit non-robust, but it's
> good enough for 2 hrs of hacking). This gives us the ability to store
> arbitrary meta-data.
>
> Then I modified htsearch/Display::displayMatch to read another config
> attribute meta_tag_display, which is a list of tag names to make into
> variables suitable for inclusion in output templates (again 'all' is an
> option too). For each meta-tag stored in the retrieved DocumentRef, if it
> matches a name in the list of tags to display, it prefixes the name with
> mt_ and puts the new name and the content in the vars structure to make it
> available for the page writer.
>
> This means that we can surface meta-data, but we still can't search on it.
> I decided that introducing new terms into the main index was the easiest
> way - it's not flash, but it does us for now. As part of the HTML parser
> do_tag, I read another config attribute meta_tag_index which has a list of
> all the tag names which will be indexed. When a matching tag comes up, I
> make up a keyword mt_<tag-name>_<tag-content> and add that to the index
> for the current document (I use the existing word breaking code to break
> up multi-word tag contents, so a tag <meta name="Platform" content="Linux,
> Solaris, Irix"> would turn into three words mt_platform_linux,
> mt_platform_solaris, mt_platform_irix - they're all forced to lower case).
> Then I just use the keywords= CGI parameter to htsearch to include the
> keywords I want to restrict - we've got a php advanced search page with a
> bunch of list selects on it, and a redirect page which flattens the
> multiple options into a single keywords entry (sometime I'd like to modify
> the keywords parameter handling to allow it to take boolean queries, but
> that can wait for a bit). I needed to play with the punctuation
> characters to allow _ in words, and Bob's your uncle.
>
> I hope this is of use or interest to someone - I've implemented this on
> our 3.2.x based tree (and I won't post a patch because our tree has
> diverged too far - soon I'll have to make it based on the snapshots
> again), but something similar should work on the 3.1.x too.

This all sounds quite interesting. Most of it is somewhat similar to what I'd envisioned implementing. The main difference is that rather than adding prefixes to the words indexed in meta tags, I envisioned indexing the words as-is, but with new flag values to distinguish them in the word index, and having control over their scoring factors.

In any case, I don't know how old your 3.2.x code was that you used as a starting point, but there have been a LOT of bug fixes since 3.2.0b3, and there will undoubtedly be more, so it would probably be wise to resync your code to the current CVS; then it should be easier to provide the patches for the changes you describe. If we get these into CVS, that would be one less difference for you to maintain in your code. I don't see us putting these changes into 3.1.6, but they'd be a very valuable addition to 3.2!

--
Gilles R. Detillieux    E-mail: <gr...@sc...>
Spinal Cord Research Centre    WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba    Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)    Fax: (204)789-3930 |
From: Quim S. <qs...@gt...> - 2002-01-10 10:33:58
|
> -----Original Message-----
> From: htd...@li...
> [mailto:htd...@li...] On behalf of Gilles
> Detillieux
> Sent: Wednesday, 09 January 2002 21:21
> To: Neal Richter
[snip]
> > This method is pretty common in the IR research
> > community. Although it is worth mentioning that there is some controversy
> > over whether stemming increases or decreases accuracy in general.
> > Probably very dataset and desired results dependent...

True. Stemming can increase recall, but can also reduce precision in searches, especially boolean searches, as is our case with htdig. Vector-space search algorithms can sometimes benefit more from stemming -- but see below. BTW 'accuracy' is a concept under debate and research. Several different measures exist today, for different applications of IR. None of them - the measures, I mean :) - seems to satisfy everybody. Accuracy is always subjective, some say.

> All the more reason to stick with the existing htfuzzy
> framework, so that
> stemming can be turned on/off at search time. The actual
> stemming technique
> can be applied either way - whether it's done at indexing time (within
> htdig) or just after indexing by htfuzzy. The framework used
> by algorithms
> like soundex, metaphone and accents could be applied equally
> well to the
> stemming algorithms, so I'd recommend studying these
> algorithms to figure
> out how to add the new stemming code to the mix.

I'd like to remark that stemming - just as other 'fuzzy' techniques - is by its nature language dependent. Thus, to apply it correctly either at index time or at search time, it is necessary to identify the content language of each document. You see the implications. Otherwise, it's more a source of noise than a help, in the case of multilingual document collections.

Regards,
-- Quim |
From: Geoff H. <ghu...@ws...> - 2002-01-10 04:29:16
|
At 2:52 PM -0600 1/9/02, Gilles Detillieux wrote:
> > Just to be clear.. do you mean the release is moved back or that
> > the feature is postponed until the next release? I'm guessing the latter.
>
>Yes, I believe Geoff meant the latter

Yes, I did--but as Gilles mentions, we're not particularly rigid about enforcing a deadline or even a real code freeze if there are obvious reasons otherwise. Certainly you have to draw the line somewhere, but important fixes/changes/features are more likely to get accepted by a developer vote than minor things (that could be added to the next release).

>you use the latest 3.2.0b4 snapshot as your starting point, or better
>yet the htdig-3-2-x branch of the htdig CVS tree on SourceForge.

Actually, I'm about to expire the mainline tomorrow--this won't alter the htdig-3-2-x branch, but it will at least offer other options. It's obviously up to a vote whether we want to switch development CVS trees back to the mainline and revisit the htdig-3-2-x branch as we get to 3.2.0 final candidates.

> > I'd be happy to assist in syncing with mifluz.

This would certainly help in meeting your calendar goal. I'd be glad to bring you up to speed on what needs to happen for that. Fortunately some of the hooks are left from my previous attempt (i.e. there shouldn't be much need for autoconf/autoheader changes). The first thing is to take a look at the mifluz example code to see how the word indexing and searching is currently performed. I'd then take a look at the htcommon/HtWord* classes and see what needs to be changed there. This is a large extent of the ht://Dig side of the mifluz API. Most of the htword/ and db/ directories are just the mifluz code itself.

At 8:20 PM -0700 1/9/02, Neal Richter wrote:
>I'm not the main internationalization developer. We don't have
>access to the source code of the Basis Tech tools, we use it as an API for
>transcoding, etc.

Ah. As you're well aware, if we don't develop it ourselves, it must be GPL compatible. I'd be interested to know if there are any Asian-aware "spelling checkers" or Asian-aware word processors that parse words under the GPL or the like.

-Geoff |
From: Geoff H. <ghu...@ws...> - 2002-01-10 04:29:16
|
At 8:23 PM -0700 1/9/02, Neal Richter wrote:
> What is the default contents of the 'blurb' in long format?
>
> More clearly... what is returned if a user searches on a keyword
>in the META data, but the search-word does not appear in the actual
>document?

It depends. See:
<http://www.htdig.org/attrs.html#no_excerpt_text>
<http://www.htdig.org/attrs.html#no_excerpt_show_top>

I think most of us use no_excerpt_show_top, but there's no reason you couldn't just set the text to nothing, or a simple blank line, or something.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/ |
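By way of illustration, the two attributes Geoff links could appear in htdig.conf roughly as follows (the attribute names are the ones linked above; the values are my own example):

```
# When none of the search words appear in the stored excerpt,
# show the top of the document instead:
no_excerpt_show_top: true
# ...or substitute fixed text (set it empty if you prefer nothing):
no_excerpt_text: (No excerpt available for this document.)
```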
From: Neal R. <ne...@ri...> - 2002-01-10 03:25:56
|
Here's a softball: what is the default content of the 'blurb' in long format? More clearly... what is returned if a user searches on a keyword in the META data, but the search-word does not appear in the actual document? Thanks again!

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site |
From: Neal R. <ne...@ri...> - 2002-01-10 03:22:23
|
On Wed, 9 Jan 2002, Gilles Detillieux wrote:
> Parts of it are fully functional now, but some parts are still in early
> development (e.g. Cookie support). Whether it's solid for you depends
> on what parts you use and what parts you don't.

I will be mostly treating htdig as an index/query API. I don't yet know just how much of that will be htdig calls vs. direct mifluz calls..

> > I'd be happy to assist in syncing with mifluz.
>
> Great, that's probably the most urgent change to 3.2 that's needed to
> improve its reliability. I wouldn't say that it's impossible to get
> 3.2 reasonably solid by mid-February, but it is an awfully tight
> deadline!

I'll look and see what files need to be updated. The mifluz site looks like it hasn't been updated since July.. the cvs tree is very quiet.

> I'd caution that any htdig development should be done in "clean-room"
> conditions, i.e. not by anyone who's signed the NDA and seen
> the proprietary code, to avoid any licensing pitfalls. (See
> http://www.linuxjournal.com/article.php?sid=5496 for a cautionary tale.)

Absolutely. I'm not the main internationalization developer. We don't have access to the source code of the Basis Tech tools; we use it as an API for transcoding, etc.

Thanks
--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site |
From: Jamie A. <Jam...@sl...> - 2002-01-09 22:01:37
|
Seeing as people are talking about this, I thought I'd relate my hack at displaying & searching on specific metadata fields. We had a requirement to restrict queries to pages containing a specific meta-tag element or elements, and also to display meta-tag content in the output.

First I added a meta-data List to DocumentRef, with methods to return the list and add strings to the list (DocumentRef is what's stored in the db.docdb for each document). Then I added a got_metatag method to Retriever to store meta-tag content in the DocumentRef. Then in the HTML parser I modified do_tag to call Retriever::got_metatag when it sees an appropriate metatag (I made a config attribute meta_tag_store which can take a StringList of metatag names to store, or 'all', and only matching meta-tag NAMEs are stored - I only store the meta-tag content attribute, indexed by the name attribute. This might be a bit non-robust, but it's good enough for 2 hrs of hacking). This gives us the ability to store arbitrary meta-data.

Then I modified htsearch/Display::displayMatch to read another config attribute meta_tag_display, which is a list of tag names to make into variables suitable for inclusion in output templates (again 'all' is an option too). For each meta-tag stored in the retrieved DocumentRef, if it matches a name in the list of tags to display, it prefixes the name with mt_ and puts the new name and the content in the vars structure to make it available for the page writer.

This means that we can surface meta-data, but we still can't search on it. I decided that introducing new terms into the main index was the easiest way - it's not flash, but it does us for now. As part of the HTML parser do_tag, I read another config attribute meta_tag_index which has a list of all the tag names which will be indexed. When a matching tag comes up, I make up a keyword mt_<tag-name>_<tag-content> and add that to the index for the current document (I use the existing word breaking code to break up multi-word tag contents, so a tag <meta name="Platform" content="Linux, Solaris, Irix"> would turn into three words mt_platform_linux, mt_platform_solaris, mt_platform_irix - they're all forced to lower case). Then I just use the keywords= CGI parameter to htsearch to include the keywords I want to restrict - we've got a php advanced search page with a bunch of list selects on it, and a redirect page which flattens the multiple options into a single keywords entry (sometime I'd like to modify the keywords parameter handling to allow it to take boolean queries, but that can wait for a bit). I needed to play with the punctuation characters to allow _ in words, and Bob's your uncle.

I hope this is of use or interest to someone - I've implemented this on our 3.2.x based tree (and I won't post a patch because our tree has diverged too far - soon I'll have to make it based on the snapshots again), but something similar should work on the 3.1.x too.

Jamie Anstice
Search Scientist, S.L.I. Systems, Inc
jam...@sl...
ph: 64 961 3262 mobile: 64 21 264 9347 |
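Jamie's keyword scheme is easy to sketch independently of the htdig code. The helper below is a hypothetical re-creation (the function name and the simple regex word-breaker are mine, standing in for htdig's own word-breaking code), showing how his Platform example turns into mt_ keywords:

```python
import re

def meta_keywords(name, content):
    """Turn a <meta name=... content=...> pair into mt_<name>_<word>
    keywords, one per word of the content, all forced to lower case
    (a toy stand-in for htdig's word-breaking code)."""
    words = re.findall(r"[A-Za-z0-9]+", content)
    return ["mt_%s_%s" % (name.lower(), w.lower()) for w in words]

# <meta name="Platform" content="Linux, Solaris, Irix">
print(meta_keywords("Platform", "Linux, Solaris, Irix"))
# → ['mt_platform_linux', 'mt_platform_solaris', 'mt_platform_irix']
```

These strings are then ordinary index terms, which is why the keywords= CGI parameter can restrict searches to them without any changes to the word database itself.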
From: Gilles D. <gr...@sc...> - 2002-01-09 21:13:46
|
According to Neal Richter:
> Is the thinking that these changes (queriable index fields) would
> take place in the mifluz indexing code? Is mifluz the core index code in
> 3.2?

Yes and no. mifluz is the core WORD indexing code in 3.2, but the additional fields would have to go into db.docdb, which is separate. The words from these fields would also go into the mifluz word database, but I don't think that would require any changes at all to mifluz, as long as it's open-ended as far as the flag values that are tied to words in the index. The changes would be more at the Retriever and document parser level, as well as the query parser, but not in the guts of the word database code. The extra fields in db.docdb would only be needed if you need the ability to extract those fields in search results, the way you can now with meta descriptions. That wouldn't strictly be needed if all you want to do is limit searches to specific fields.

> > The release will happen when it's ready. The two main "chunks" that
> > absolutely, positively need to be finished are a sync with the current
> > mifluz code and the new htsearch framework. Beyond that, many things are
> > up in the air--if people come along to do things like Unicode and they can
> > be tested thoroughly, great. If not, it moves back.
>
> Just to be clear.. do you mean the release is moved back or that
> the feature is postponed until the next release? I'm guessing the latter.

Yes, I believe Geoff meant the latter, although we'd certainly consider holding off a release if some much-requested change is right on the way (this is usually done by a developers' vote during a pre-release code freeze, which we're not at yet in 3.2 development).

> I'm trying to get a good idea of what feature goals the htdig
> developers have for 3.2 [other than what was just given].
>
> Not to be pushy, but how complete does everyone consider 3.2 to be
> currently?

Well, how complete it is depends on what features we want in 3.2, as well as how solid it is. 3.2.0b4 is getting to be reasonably solid, but there are some major changes coming in before it's released, so I'd expect more debugging time still before it's ready even for beta release. Having said that, there are sites using 3.2 in a production environment already.

> My main immediate task is to choose to use either 3.1.5 or
> 3.2Beta as a starting point. I would obviously prefer to use 3.2Beta and
> submit bug-fixes, feature improvements, & memory leak/corruption fixes
> than do that effort on a previous version.. as well as test 3.2Beta on
> very large data-sets..

For what you're proposing, 3.2 is the only viable option. 3.1.x is strictly in maintenance mode, so we're not making sweeping new changes to it. Anything to do with different search fields or word categories wouldn't fit into the word database scheme used in 3.1.x. I'd recommend you use the latest 3.2.0b4 snapshot as your starting point, or better yet the htdig-3-2-x branch of the htdig CVS tree on SourceForge.

> A ball-park ETA is fine.. I'm looking at an early spring
> release for the next version of the software that will use this as an
> archiving tool. [See www.rightnow.com for more info on our software]
> Note that this (our software) release does not require the archive to be
> UTF-8 ready... just moving in that direction.
>
> With QA/Testing taking a big chunk of that time, I want to be
> confident that the state of the 3.2Beta code as of mid-February is good
> enough to send to QA... and of course get all the fixes from our QA back
> to htdig ASAP ;-) !
>
> I'm gathering implicitly that the changes envisioned before the
> 3.2 release are improvements.. i.e. the Beta code is fully functional now.
> Correct?

Parts of it are fully functional now, but some parts are still in early development (e.g. Cookie support). Whether it's solid for you depends on what parts you use and what parts you don't.

> I'd be happy to assist in syncing with mifluz.

Great, that's probably the most urgent change to 3.2 that's needed to improve its reliability. I wouldn't say that it's impossible to get 3.2 reasonably solid by mid-February, but it is an awfully tight deadline!

> The word-breaking problem for Asian languages like Japanese is a
> hard one. There are no 'spaces' in most Japanese text, so a language-
> specific word-breaking algorithm is needed before any sort of searchable
> index can be built.

Yes, both this problem and the 8-bit limitation must be dealt with before htdig can deal with most Asian languages.

> We've contracted with a large internationalization-software
> company to assist us in our conversion to UTF-8. We accomplish word
> breaking for Japanese documents with software that they supply under
> contract via NDA. I'll ask our in-house Japanese developer about free
> tools for word breaking.

I'd caution that any htdig development should be done in "clean-room" conditions, i.e. not by anyone who's signed the NDA and seen the proprietary code, to avoid any licensing pitfalls. (See http://www.linuxjournal.com/article.php?sid=5496 for a cautionary tale.)

--
Gilles R. Detillieux    E-mail: <gr...@sc...>
Spinal Cord Research Centre    WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba    Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)    Fax: (204)789-3930 |
From: Gilles D. <gr...@sc...> - 2002-01-09 20:21:29
|
According to Neal Richter:
> > But keep in mind that the way the current Endings algorithm works is
> > slightly different than most "stemming" approaches. All words are indexed
> > as-is and then at search time, the fuzzy algorithm can add additional
> > "fuzzy query words" to the user query (at usually lower weight).
>
> Ah.. I was thinking of feeding the indexer pre-stemmed text as
> well as the original text. User queries would by default be stemmed
> before querying, unless "" or the + are used.
>
> This method is pretty common in the IR research
> community. Although it is worth mentioning that there is some controversy
> over whether stemming increases or decreases accuracy in general.
> Probably very dataset and desired results dependent...

All the more reason to stick with the existing htfuzzy framework, so that stemming can be turned on/off at search time. The actual stemming technique can be applied either way - whether it's done at indexing time (within htdig) or just after indexing by htfuzzy. The framework used by algorithms like soundex, metaphone and accents could be applied equally well to the stemming algorithms, so I'd recommend studying these algorithms to figure out how to add the new stemming code to the mix.

--
Gilles R. Detillieux    E-mail: <gr...@sc...>
Spinal Cord Research Centre    WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba    Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)    Fax: (204)789-3930 |
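The search-time on/off switch Gilles refers to is ht://Dig's search_algorithms attribute, which lists fuzzy algorithms with per-algorithm weights. A sketch of enabling a fuzzy pass alongside exact matching (the 0.1 weight is an arbitrary example value, not a recommendation):

```
# Exact matches at full weight; 'endings' fuzzy matches at a tenth:
search_algorithms: exact:1 endings:0.1
```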
From: Neal R. <ne...@ri...> - 2002-01-09 07:41:32
|
> But keep in mind that the way the current Endings algorithm works is
> slightly different than most "stemming" approaches. All words are indexed
> as-is and then at search time, the fuzzy algorithm can add additional
> "fuzzy query words" to the user query (at usually lower weight).

Ah.. I was thinking of feeding the indexer pre-stemmed text as well as the original text. User queries would by default be stemmed before querying, unless "" or the + are used. This method is pretty common in the IR research community. Although it is worth mentioning that there is some controversy over whether stemming increases or decreases accuracy in general. Probably very dataset and desired results dependent...

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site |
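Neal's index-time scheme (index both the raw and stemmed forms, and stem the query the same way) can be sketched with a deliberately tiny suffix-stripper standing in for a real stemmer such as Porter's. All names here are mine, for illustration only:

```python
def toy_stem(word):
    """A deliberately crude stemmer for illustration only -
    a real system would use something like the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Index each word as-is plus its stemmed form (deduplicated)."""
    terms = set()
    for w in text.lower().split():
        terms.add(w)
        terms.add(toy_stem(w))
    return terms

docs = {1: "indexing searches", 2: "indexed search"}
index = {doc_id: index_terms(text) for doc_id, text in docs.items()}

# Stem the query the same way, so "searching" matches both documents.
query = toy_stem("searching")
print(sorted(d for d, terms in index.items() if query in terms))
```

Because the raw forms are indexed too, an unstemmed (quoted or +) query still matches exactly, which is the behavior Neal describes; the trade-off is a larger index than the htfuzzy search-time approach.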
From: Neal R. <ne...@ri...> - 2002-01-09 07:32:10
On Tue, 8 Jan 2002, Geoff Hutchison wrote:
> I'm keeping this on-list so it's archived. I'm by far not the only active
> member in the group. (esp. judging from # of CVS commits...)

OK. Just trying to keep the dumb questions to the developer list to a
minimum. ;-)  Some of my questions are an attempt to get a feel for how
this developer group works.

> So if this is interesting to you, think about it, work out a proposal of
> how you think it might best be implemented and write the list--we'll give
> feedback and go from there.

OK. I need to do some code reading but I'll take a whack at a proposal in
the next week or so... Is the thinking that these changes (queryable index
fields) would take place in the mifluz indexing code? Is mifluz the core
index code in 3.2?

> The release will happen when it's ready. The two main "chunks" that
> absolutely, positively need to be finished are a sync with the current
> mifluz code and the new htsearch framework. Beyond that, many things are
> up in the air--if people come along to do things like Unicode and they can
> be tested thoroughly, great. If not, it moves back.

Just to be clear.. do you mean the release is moved back or that the
feature is postponed until the next release? I'm guessing the latter.

I'm trying to get a good idea of what feature goals the htdig developers
have for 3.2 [other than what was just given]. Not to be pushy, but how
complete does everyone consider 3.2 to be currently?

My main immediate task is to choose to use either 3.1.5 or 3.2Beta as a
starting point. I would obviously prefer to use 3.2Beta and submit
bug-fixes, feature improvements, & memory leak/corruption fixes than do
that effort on a previous version.. as well as test 3.2Beta on very large
data-sets..

A ball-park ETA is fine.. I'm looking at an early spring release for the
next version of the software that will use this as an archiving tool.
[See www.rightnow.com for more info on our software] Note that this (our
software) release does not require the archive to be UTF-8 ready... just
moving in that direction. With QA/Testing taking a big chunk of that time,
I want to be confident that the state of the 3.2Beta code as of
mid-February is good enough to send to QA... and of course get all the
fixes from our QA back to htdig ASAP ;-) !

I'm gathering implicitly that the changes envisioned before the 3.2
release are improvements.. ie the Beta code is fully functional now.
Correct? I'd be happy to assist in syncing with mifluz.

> At the moment, there is support only for 8-bit character
> chunks. Gilles told you the current state: if you have an 8-bit locale
> that supports the character set you want, you're probably going to be
> OK. Significant parts of the code rely on the 8-bit char assumption, which
> means that there will be some rewriting work beyond just the backend.

Ah! That's the magic phrase ("8-bit char assumption") I was looking for.
My fault, I should have just asked it more clearly the first time! Thanks
for the clarification.

> My guess is that the biggest challenges for moving to Unicode are going to
> be removing the 8-bit char assumption and parsing "words." Updating the
> String class to be Unicode-aware (perhaps via UTF-8) is probably going to
> help a lot towards the first problem.

The word-breaking problem for Asian languages like Japanese is a hard one.
There are no 'spaces' in most Japanese text, so a language-specific
word-breaking algorithm is needed before any sort of searchable index can
be built.

We've contracted with a large internationalization-software company to
assist us in our conversion to UTF-8. We accomplish word breaking for
Japanese documents with software that they supply under contract via NDA.
I'll ask our in-house Japanese developer about free tools for word
breaking.

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
From: Geoff H. <ghu...@ws...> - 2002-01-09 04:51:54
I'm keeping this on-list so it's archived. I'm by far not the only active
member in the group. (esp. judging from # of CVS commits...)

On Tue, 8 Jan 2002, Neal Richter wrote:
> > implemented regardless of when I have time to finish wiring in Quim's new
> > htsearch framework.
>
> Need any help? I'd be glad to throw in with you.. this would be a
> useful feature for us by allowing some databasish type functionality.

If this is what you want, I'd suggest working on the indexing side.
Someone needs to propose how users could configure specific metadata to be
wired into the unused word index "flags." There are 8 of 32 bits currently
defined for things like Titles, Headings, Keywords, etc. (See
htcommon/HtWordReference.h) So if a user is defining custom data-types,
then they may want to "alias" these slots, or use some of the remaining
unused bits. The parsers will need to be changed to treat these
user-defined word flags appropriately.

And this is all assuming that there's a global allocation of the flags.
This should probably be more fine-grained, (e.g. some documents have DTDs
which suggest certain meta-data or XML tags...) but then something needs
to be stored in the document DB.

So if this is interesting to you, think about it, work out a proposal of
how you think it might best be implemented and write the list--we'll give
feedback and go from there. I'll definitely let everyone know when the
htsearch code is ready. In particular, some things may break and I'm sure
there will be bugs.

> > While the key documentation (attrs.html, cf_byname.html, etc.) is updated
> > as appropriate patches are made, files such as main.html, RELEASE.html,
> > TODO.html, etc. are only updated just before a release.
>
> I was mostly curious about what things are going to be included in
> the final 3.2. I can't seem to find anything that clearly defines what
> will be included in 3.2 or when the release might happen.

The release will happen when it's ready. The two main "chunks" that
absolutely, positively need to be finished are a sync with the current
mifluz code and the new htsearch framework. Beyond that, many things are
up in the air--if people come along to do things like Unicode and they can
be tested thoroughly, great. If not, it moves back.

> Do you have anything to add to Gilles' answer to Question 3 (Unicode)? I
> think he may have missed the point. At some level the engine must support
> multi-byte matching or the engine wouldn't work.
>
> One commercial company's search engine claims to be 'binary' at the core,
> so Unicode is basically a higher-level API code problem... not a problem
> with the base functionality of the index.

At the moment, there is support only for 8-bit character chunks. Gilles
told you the current state: if you have an 8-bit locale that supports the
character set you want, you're probably going to be OK. Significant parts
of the code rely on the 8-bit char assumption, which means that there will
be some rewriting work beyond just the backend.

> Anything you can add to enlighten me on the basics of how the DB &
> Matching algorithms work at a high level would be useful. Ultimately, I
> think I'm going to have to do lots of code-reading with our Unicode GURU
> to figure this out.

The word database is currently keyed by the word itself--though this will
change when the new mifluz code is imported. (This assigns word ids,
AFAICT.) In either case, I don't think it matters much whether it's 8-bit
characters or not--the database treats it as a binary key anyway.

My guess is that the biggest challenges for moving to Unicode are going to
be removing the 8-bit char assumption and parsing "words." Updating the
String class to be Unicode-aware (perhaps via UTF-8) is probably going to
help a lot towards the first problem.

-Geoff
From: Geoff H. <ghu...@ws...> - 2002-01-09 04:30:38
On Tue, 8 Jan 2002, Neal Richter wrote:
> Would anyone be interested in having the Porter-type Stemming
> Algorithms integrated as a second stemming algorithm?

Sure. There's nothing saying the current crop of fuzzy algorithms is
necessarily the final word on things. :-)

But keep in mind that the way the current Endings algorithm works is
slightly different than most "stemming" approaches. All words are indexed
as-is and then at search time, the fuzzy algorithm can add additional
"fuzzy query words" to the user query (at usually lower weight).

> Is everyone happy with the current stemming system?

There are benefits and drawbacks. One benefit is that ispell stemming
dictionaries are available for most languages. One drawback is that it's
only as good as the ispell dictionary used--and many are not exactly
designed for what ht://Dig is doing with them. (Several people recently
worked on a revised version for htfuzzy purposes.)

-Geoff
From: Neal R. <ne...@ri...> - 2002-01-09 00:08:25
Hello,

Would anyone be interested in having the Porter-type Stemming Algorithms
integrated as a second stemming algorithm?

The Porter algorithm is a well-known (in the NLP community) multi-language
stemmer that implements an approximate 'morphological' stemming for each
language. The original algorithm was English-only, but the framework was
adapted.

Here is Martin Porter's algorithm page:
http://www.tartarus.org/~martin/PorterStemmer/

Martin Porter is also running this sourceforge site for his newer stemming
work: http://snowball.sourceforge.net/

A complete set of Porter-type stemmers is available within the omseek
project here:
http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/omseek/om/languages/

Is everyone happy with the current stemming system?

Thanks.

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site
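[Editor's note: to give a flavor of the rule-based suffix rewriting being proposed, here is a heavily simplified sketch of Porter's step 1a for English. The real algorithm has several ordered steps plus a "measure" condition on the stem; this fragment is illustrative only.]

```cpp
#include <string>

static bool endsWith(const std::string& w, const std::string& suf) {
    return w.size() >= suf.size() &&
           w.compare(w.size() - suf.size(), suf.size(), suf) == 0;
}

// Porter step 1a: SSES -> SS, IES -> I, SS -> SS, S -> (dropped).
// Rules are tried in order; the first matching suffix wins.
std::string porterStep1a(const std::string& w) {
    if (endsWith(w, "sses")) return w.substr(0, w.size() - 2);
    if (endsWith(w, "ies"))  return w.substr(0, w.size() - 2);
    if (endsWith(w, "ss"))   return w;
    if (endsWith(w, "s"))    return w.substr(0, w.size() - 1);
    return w;
}
```

Each language in the Snowball framework supplies its own table of such ordered rewrite rules.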
From: Geoff H. <ghu...@ws...> - 2002-01-08 22:58:25
On Tue, 8 Jan 2002, Neal Richter wrote:
> The 3.2 Beta3 release TODO page marks Question 5 (Field-based
> searching) as 'out-standing'. Who is working on that?

At the moment, no one is. Then again, this task can't be completed until
the new htsearch parser is hooked in place. But that's on the search
side--you're asking about user-defined fields, which also need to be
implemented on the index side. And of course the indexing side can be
implemented regardless of when I have time to finish wiring in Quim's new
htsearch framework.

> I also notice the 3.2 Beta4 snapshot from Dec 30 does not have an updated
> TODO.html file (dated Feb 15).

While the key documentation (attrs.html, cf_byname.html, etc.) is updated
as appropriate patches are made, files such as main.html, RELEASE.html,
TODO.html, etc. are only updated just before a release.

> Is there a browsable cvsweb of the code somewhere?

Of course. See the SourceForge project page:
<http://sourceforge.net/projects/htdig/>

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
From: Neal R. <ne...@ri...> - 2002-01-08 22:44:39
Anyone have an answer for Question 1?

The 3.2 Beta3 release TODO page marks Question 5 (Field-based searching)
as 'out-standing'. Who is working on that?

I also notice the 3.2 Beta4 snapshot from Dec 30 does not have an updated
TODO.html file (dated Feb 15).

Is there a browsable cvsweb of the code somewhere?

Thanks

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site

---------- Forwarded message ----------
Date: Fri, 4 Jan 2002 12:23:00 -0700 (MST)
From: Neal Richter <ne...@ri...>
To: htd...@li...
Subject: [htdig-dev] Incremental Index Efficiency, Unicode, & 2GIG limit

Hello again, I've got a couple more questions..

1. Is there any need to rebuild the index from scratch periodically? Some
commercial search engines use incremental indexing and recommend that when
the incremental portion of the index gets to be a given size (say 20%) the
entire index is rebuilt.

2. Is it possible to turn stemming off for particular languages during
run time? We have our own stemming tools.. (Porter Algorithm)

3. (Unicode) Is the index (the core of the index code) capable of doing
multibyte searching? For example, if a fully escaped version of a Japanese
or other multibyte document was indexed.. and then searched with a
properly escaped query.. would valid matches occur? (Exclude any UI or
upper-level code in your thinking here.)

4. (2 Gig Limit) Some of the archives will be at a million+ documents in
size with an average length exceeding 2K. Other than using XFS or JFS, the
solution in this case is to use multiple index files?

5. Is there a way to add a 'field' to the index? Ie.. multiple documents
share a source-id & a query is given to return the documents with that
source-id. This could be accomplished implicitly by modifying the
source-id to be some special alpha-numeric character (DJ23KJD823).. but
this has a small probability of giving false-positive search results.

Thanks for your help!

--
Neal Richter
Knowledgebase Developer
RightNow Technologies, Inc.
Customer Service for Every Web Site

_______________________________________________
htdig-dev mailing list
htd...@li...
https://lists.sourceforge.net/lists/listinfo/htdig-dev
From: Gilles D. <gr...@sc...> - 2002-01-08 22:21:25
According to Neal Richter:
> 1. Is there any need to rebuild the index from scratch
> periodically? Some commercial search engines use incremental indexing and
> recommend that when the incremental portion of the index gets to be a
> given size (say 20%) the entire index is rebuilt.

htdig doesn't keep the incremental portion of the index in a separate
database, so there aren't any hard and fast rules about needing to rebuild
from scratch. However, it has been our experience that some things seem to
degenerate over time, but not for anything we've been able to pin down and
fix, so we do still recommend that you occasionally rebuild from scratch.
Some users do this only once a month or thereabouts. Others only rebuild
when odd problems develop. We don't have any stats on how long anyone has
run with daily updates to a database with no rebuilds.

> 2. Is it possible to turn stemming off for particular languages
> during run time? We have our own stemming tools.. (Porter Algorithm)

I'm not sure what you mean by stemming. If you mean the "endings" fuzzy
algorithm, then it can be turned off by changing the search_algorithm
attribute.

> 3. (Unicode) Is the index (the core of the index code) capable of
> doing multibyte searching? For example if a fully escaped version of a
> Japanese or other multibyte document was indexed.. and then searched with
> a properly escaped query.. would valid matches occur? (exclude any UI or
> upper level code in your thinking here.)

No, currently only 8-bit character sets supported by a "locale" on your
system are supported by htdig.

> 4. (2 Gig Limit) Some of the archives will be at a million+
> documents in size with an average length exceeding 2K. Other than using
> XFS or JFS, the solution in this case is to use multiple index files?

Correct. The 3.1 series can only search one index file at a time, but you
can set it up to manually select which config file you want to use, and
the config files can select their own database. The 3.2 beta series has
preliminary support for "collections", where results for a search of
multiple index files are combined.

> 5. Is there a way to add a 'field' to the index? Ie.. multiple
> documents share a source-id & a query is given to return the documents
> with that source-id. This could be accomplished implicitly by modifying the
> source-id to be some special alpha-numeric character (DJ23KJD823).. but
> this has a small probability of giving false-positive search results.

At the moment, there isn't an easy way to add fields, although it's on the
wish list for 3.2. Depending on how you encode your source IDs, though,
you can reduce the probability of a false positive to almost 0. The
encoded source IDs could be placed in meta keywords tags in your
documents, and htdig would pick them up and index them as any other
keywords, with a scoring factor controlled by keywords_factor.

--
Gilles R. Detillieux              E-mail: <gr...@sc...>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
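[Editor's note: the meta-keywords workaround Gilles describes might look like the fragment below in a generated document. The "srcid-" prefix and ID format are invented for the example; the point is that a long, prefixed token is very unlikely to collide with real query words.]

```html
<!-- Embed an encoded source ID so htdig indexes it like any other
     keyword (weighted by keywords_factor). A search for the literal
     token srcid-DJ23KJD823 then returns documents from that source. -->
<meta name="keywords" content="srcid-DJ23KJD823">
```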
From: Geoff H. <ghu...@ws...> - 2002-01-05 05:58:04
At 9:31 AM -0500 1/4/02, Bradley M. Handy wrote:
> The mirror at www.jack-of-all-trades.net will no longer be available. This
> is effective with the sending of this message.

Sorry to hear that, we've taken it off the mirrors list (as of an hour or
so after your message). Thanks for your support!

-Geoff

P.S. Out of curiosity, do you have some idea of the number of hits you
were getting to the mirror?