[Fulltextsearch-devel] RE: Fulltextsearch-devel digest, Vol 1 #9 - 1 msg
From: Vladimir B. <b_...@te...> - 2002-04-20 16:52:43
I have finally found the time to finish coding a beta version of the Lucene searching algorithm for the FullTextSearch module. The patch file and test scripts are attached in the text_search.rar archive. The archive should contain these files:

  FullTextSearch.patch -- patch file (made with WinPatchMaker-1.0)
  index_search.conf    -- configuration file for my index searcher test script
  index_search_init.pl -- run this first to initialize index tables etc.
  index_search_test.pl -- the main test file (contains a few test cases and allows you to easily add your own)

Since this is only a 'beta' release of the algorithm implementation, apply the patch against a copy of the FullTextSearch module.

In this release, scoring has been implemented for the phrase backend only. Also, for now I assume numerical document ids only (rather than strings, as would be the case with the String backend?).

Adding scoring to the other backends shouldn't be a hard task, since all major scoring routines are located in the main FullTextSearch.pm module. There is actually only a single subroutine that has to be invoked from the other backend modules in order to enable scoring for them.

I hope you'll find the inline documentation useful. At this stage it is crucial that we get comments/suggestions/brilliant ideas flowing in. ;-)

PS: I'll also post the archive file on our project home page, in case you are not able to receive it via email or prefer downloading it from the site.

Cheers,
Vlad.

-----Original Message-----
From: ful...@li... [mailto:ful...@li...] On Behalf Of ful...@li...
Sent: Tuesday, April 09, 2002 12:07 PM
To: ful...@li...
Subject: Fulltextsearch-devel digest, Vol 1 #9 - 1 msg

Send Fulltextsearch-devel mailing list submissions to ful...@li...

To subscribe or unsubscribe via the World Wide Web, visit https://lists.sourceforge.net/lists/listinfo/fulltextsearch-devel or, via email, send a message with subject or body 'help' to ful...@li...

You can reach the person managing the list at ful...@li...
When replying, please edit your Subject line so it is more specific than "Re: Contents of Fulltextsearch-devel digest..."

Today's Topics:

  1. RE: Fulltextsearch-devel digest, Vol 1 #8 - 2 msgs (Vladimir Bogdanov)

--__--__--

Message: 1
From: "Vladimir Bogdanov" <b_...@te...>
To: <ful...@li...>
Cc: <b_...@te...>
Date: Mon, 8 Apr 2002 20:36:33 -0700
Subject: [Fulltextsearch-devel] RE: Fulltextsearch-devel digest, Vol 1 #8 - 2 msgs

Dealing with multiple backends, I'm wondering whether the notion of a 'term' changes with each backend type. For example, with the 'phrase' backend a term may be a full phrase, 'foo bar doc', as well as a single word, 'foobar'.

Also, a search query might have a mix of single terms and 'complex' terms (phrases):

  "foo bar doc" +foobar

in which case our terms are:

  term 1: "foo bar doc"
  term 2: "foobar"

I guess this is right?

cheers,
Vladimir Bogdanov.

-----Original Message-----
From: ful...@li... [mailto:ful...@li...] On Behalf Of ful...@li...
Sent: Monday, April 08, 2002 12:23 PM
To: ful...@li...
Subject: Fulltextsearch-devel digest, Vol 1 #8 - 2 msgs

Today's Topics:

  1. Re: Applying Lucene's scoring algorithm to FullTextSearch... (T.J. Mather)

--__--__--

Message: 1
Date: Mon, 8 Apr 2002 01:56:38 -0400 (EDT)
From: "T.J. Mather" <tjm...@tj...>
To: Vladimir Bogdanov <b_...@te...>
cc: <Ful...@li...>
Subject: Re: [Fulltextsearch-devel] Applying Lucene's scoring algorithm to FullTextSearch...

On Sat, 6 Apr 2002, Vladimir Bogdanov wrote:

> Doug Cutting (inventor of Lucene) has summarized his algorithm
> as follows:
>
>   score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t)
>
> where:
>   score_d   : score for document d
>   sum_t     : sum for all terms t
>   tf_q      : the square root of the frequency of t in the query
>   tf_d      : the square root of the frequency of t in d
>   numDocs   : number of documents in index
>   docFreq_t : number of documents containing t
>   idf_t     : log(numDocs/(docFreq_t+1)) + 1.0
>   norm_q    : sqrt(sum_t((tf_q*idf_t)^2))
>   norm_d_t  : square root of number of tokens in d in the same field as t
>
> Here's how I think this formula could be applied in our own
> scoring algorithm for FullTextSearch:
>
> ---- Example: --------
> Search query = "foo foo bar foo bar file"
> Document:
> ----
> [To] index files, use [the] frontend file. [Here] [the] content [of the] document
> [is] clearly [the] content [of the] file specified [by the] filename.
> ----
>
> Calculating variables:
>   sum_t : sum for all terms t
>   ???
>   is this equal to the total number of times a
>   term was found?
>
>   Then let's put it at 123

This is not a number, but a mathematical operation. Basically it means:

  calculate (tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t) for each term t
  sum up all the numbers calculated above, one per term t

I hope that clears things up.

Cheers,
TJ

--__--__--

_______________________________________________
Fulltextsearch-devel mailing list
Ful...@li...
https://lists.sourceforge.net/lists/listinfo/fulltextsearch-devel

End of Fulltextsearch-devel Digest
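
TJ's two steps (compute the product for each term, then add the products up) can be sketched in a few lines. This is a minimal Python sketch, not the Perl code from the patch: the function names idf and score, the single-field document, and the corpus statistics (num_docs, doc_freq) are assumptions made up for illustration, and idf_t is read as log(numDocs / (docFreq_t + 1)) + 1.0.

```python
import math
from collections import Counter


def idf(num_docs, doc_freq_t):
    # Assumption: idf_t = log(numDocs / (docFreq_t + 1)) + 1.0
    return math.log(num_docs / (doc_freq_t + 1)) + 1.0


def score(query_terms, doc_terms, num_docs, doc_freq):
    """Score one document against a query using the quoted formula.

    query_terms, doc_terms: lists of already-tokenized terms (one field only).
    doc_freq: hypothetical mapping term -> number of documents containing it.
    """
    q_counts = Counter(query_terms)
    d_counts = Counter(doc_terms)
    if not q_counts or not doc_terms:
        return 0.0

    # Query-side weight per distinct term: tf_q * idf_t
    q_weight = {
        t: math.sqrt(n) * idf(num_docs, doc_freq.get(t, 0))
        for t, n in q_counts.items()
    }
    # norm_q = sqrt(sum_t((tf_q * idf_t)^2))
    norm_q = math.sqrt(sum(w * w for w in q_weight.values()))
    # norm_d_t = sqrt(number of tokens in d), same for every t with one field
    norm_d = math.sqrt(len(doc_terms))

    total = 0.0
    for t, w in q_weight.items():
        # Step 1: the per-term product
        tf_d = math.sqrt(d_counts[t])  # 0 if t does not occur in d
        term_score = (w / norm_q) * (tf_d * idf(num_docs, doc_freq.get(t, 0)) / norm_d)
        # Step 2: sum over all terms t
        total += term_score
    return total


# The query and document from the quoted example (bracketed stop words dropped).
query = "foo foo bar foo bar file".split()
document = ("index files use frontend file content document "
            "clearly content file specified filename").split()
# Made-up corpus statistics, purely for illustration.
print(score(query, document, num_docs=10, doc_freq={"foo": 1, "bar": 1, "file": 3}))
```

With this query and document only "file" contributes, since "foo" and "bar" do not occur in the document and their tf_d is 0. So sum_t is never a count such as 123; it simply ranges over the distinct query terms.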