From: Geoff H. <ghu...@us...> - 2001-10-07 07:13:44
|
STATUS of ht://Dig branch 3-2-x

RELEASES:
3.2.0b4: In progress
3.2.0b3: Released: 22 Feb 2001
3.2.0b2: Released: 11 Apr 2000
3.2.0b1: Released: 4 Feb 2000

SHOWSTOPPERS:

KNOWN BUGS:
* Odd behavior with $(MODIFIED) and scores: both misbehave with wordlist_compress set, but work fine without wordlist_compress. (The date is definitely stored correctly, even with compression on, so this must be some sort of weird htsearch bug.)
* Not all htsearch input parameters are handled properly: PR#648. Use a consistent mapping of input -> config -> template for all inputs where it makes sense to do so (everything but "config" and "words"?).
* If exact isn't specified in search_algorithms, $(WORDS) is not set correctly: PR#650. (The documentation for 3.2.0b1 is updated, but can we fix this?)
* META descriptions are somehow added to the database as FLAG_TITLE, not FLAG_DESCRIPTION. (PR#859)

PENDING PATCHES (available but need work):
* Additional support for Win32.
* Memory improvements to htmerge. (Backed out because the htword API changed.)
* MySQL patches to 3.1.x to be forward-ported and cleaned up. (Should really only attempt to use SQL for doc_db and related databases, not word_db.)

NEEDED FEATURES:
* Field-restricted searching.
* Return all URLs.
* Handle noindex_start & noindex_end as string lists.
* Handle local_urls through a file:// handler, for mime.types support.
* Handle directory redirects in RetrieveLocal.
* Merge with mifluz.

TESTING:
* httools programs: htload a test file, check a few characteristics, htdump and compare.
* Turn on the URL parser test as part of the test suite.
* htsearch phrase support tests.
* Tests for the new config file parser.
* Duplicate document detection while indexing.
* Major revisions to ExternalParser.cc, including fork/exec instead of popen, argument handling for parser/converter, and allowing binary output from an external converter.
* ExternalTransport needs testing of changes similar to ExternalParser.

DOCUMENTATION:
* The list of supported platforms/compilers is ancient.
* Add thorough documentation on htsearch restrict/exclude behavior (including '|' and regex).
* Document all of htsearch's mappings of input parameters to config attributes to template variables. (Relates to PR#648.) Also make sure these config attributes are all documented in defaults.cc, even if they're only set by input parameters and never in the config file.
* Split attrs.html into categories for faster loading.
* require.html has not been updated to list the new features and disk-space requirements of 3.2.x (e.g. phrase searching, regex matching, external parsers and transport methods, database compression).
* TODO.html has not been updated for the current TODO list and completions.

OTHER ISSUES:
* Can htsearch actually search while an index is being created? (Does Loic's new database code make this work?)
* The code needs a security audit, especially htsearch.
* URL.cc tries to parse malformed URLs, which causes further problems. (It should probably just set everything to empty.) This relates to PR#348. |
From: Tod T. <tt...@ch...> - 2001-10-15 11:55:16
|
Has anybody heard of this in relation to search technology and would like to provide some input on the topic? Thanks - Tod |
From: Tod T. <tt...@ch...> - 2001-10-17 13:14:38
|
At a helpful list member's prompting, it might help if I present the reason for my original post. My original query was intended to see if anybody was looking into implementing this in htdig.

We've just gone through a pretty extensive vendor selection process to find the 'best' search engine to couple with an internal knowledge management initiative. A common marketing thread for a number of the vendors was the use of a 'bayesian algorithm'. I have to admit that those products tended to outperform the others, particularly when it came to language-specific searches. It was almost as if the search could understand the concept behind the request, not just the words. That's when I began to wonder whether anybody in the htdig development community had looked into implementing 'bayesian' searching, or whether htdig could already do 'it'; hence my vague post.

My theory was that most new market trends (worth paying attention to) are usually already, or quickly will be, reflected in the open source development community, and I suspected this might be the case with htdig too. Given the individual response that I got, there is at least some interest, so by elaborating a little on my original post maybe I can learn more. Comments?

Thanks - Tod |
From: Quim S. <qs...@gt...> - 2001-10-17 14:29:42
|
> -----Original Message-----
> From: htd...@li... [mailto:htd...@li...] On behalf of Tod Thomas
> Sent: Wednesday, 17 October 2001 15:23
> To: htd...@li...
> Subject: [htdig-dev] Bayesian Algorithm Part II

[snip]

> A common marketing thread for a number of the vendors was the
> utilization of the 'bayesian algorithm'.

Some vendors, like Autonomy, do use this kind of technique in their search engines. 'Bayesian', or probabilistic, models are computed from the indexed documents. The longer the query, the better the results. As with vector-space similarity measures, this is particularly useful when searching for 'documents similar to this one' or when classifying documents. Results tend to be poorer with short queries.

> I have to admit that those products tended to outperform
> the others, particularly when it came to language specific
> searches. It was almost as if the search could understand the
> concept behind the request, not just the words.

In fact, what is evaluated is the 'blend' of the words, i.e. the more or less roughly estimated probability of finding a given set of words in each document.

> That's when I began to wonder if anybody in the htdig
> development community had looked into implementing 'bayesian'
> searching, or if htdig could do 'it', hence my vague post.

The htdig databases are positively not prepared to deal with such techniques, IMHO. They are not intended to be. The power of htdig is based on 'classical' boolean queries.

> My theory was that most new market trends (worth paying
> attention to) are usually already, or quickly will be, reflected
> in the open source development community.

Entering deep water -- opinion unreliable from this point :)
My perception is eventually the inverse: open source has traditionally been bound to research and innovation. It is now being used by companies as an innovation channel, so that market trends emerge later from there...

> I suspected this might be the case with htdig too.
> Given the individual response that I got there is at least some
> interest so by elaborating a little on my original post maybe I
> can learn more. Comments?

Closely related to the probabilistic approach is Xapian, a.k.a. Omseek, a.k.a. Omsee, a.k.a. Open Muscat, an open-source project intended as a probabilistic search-engine framework. Initially financed by Brightstation, it was left to its own devices some time ago and now lives on SourceForge. More in the research field, there's libbow/rainbow by Andrew McCallum et al. from CMU, including bayesian classifiers, vector-space algorithms, and other nice artifacts. Here at gtd, we're experimenting internally with some new vector-space based search and classification algorithms. What we have does look quite promising, but AFAIK it's not to be open-sourced -- for now.

-- Quim |
From: Tod T. <tt...@ch...> - 2001-10-18 12:26:36
|
Quim Sanmarti wrote:
> > My theory was that most new market trends (worth paying
> > attention to) are usually already, or quickly will be, reflected
> > in the open source development community.
>
> Entering deep water -- opinion unreliable from this point :)
> My perception is eventually the inverse. Open source has traditionally
> been bound to research and innovation. It's now being used by companies
> as an innovation channel, so that market trends emerge later from there...

Sorry, I wasn't clear on this. I agree completely.

Tod |