DBIx::FullTextSearch
beta
Brought to you by:
tjmather
fulltextsearchdevel
From: <a4a15206@te...>  2002-03-30 20:57:22

On Jakarta's project home page, I have found an FAQ question related to Lucene's scoring algorithm:
http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.search&toc=faq#q31

Doug Cutting (the inventor of Lucene) has summarized his algorithm as follows:

score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t )

where:
score_d   : score for document d
sum_t     : sum for all terms t
tf_q      : the square root of the frequency of t in the query
tf_d      : the square root of the frequency of t in d
numDocs   : number of documents in the index
docFreq_t : number of documents containing t
idf_t     : log(numDocs/docFreq_t+1) + 1.0
norm_q    : sqrt(sum_t((tf_q*idf_t)^2))
norm_d_t  : square root of the number of tokens in d in the same field as t

Here's how I think this formula could be applied in our own scoring algorithm for FullTextSearch:

---- Example: ----
Search query = "foo foo bar foo bar file"
Document:
----
[To] index files, use [the] front-end file. [Here] [the] content [of the] document
[is] clearly [the] content [of the] file specified [by the] filename.
----

Calculating variables:

sum_t : sum for all terms t
    ??? Is this equal to the total number of times a term was found?
    Then let's put it at 123.

tf_q : the square root of the frequency of t in the query
    (how often term (keyword) t appears in the user-specified search query)
    tf_q for keyword 'foo', which appears 3 times in a query with a total of
    5 keywords, will be: sqrt(3/5), or ~0.775

tf_d : the square root of the frequency of t in d
    (how often keyword t appears in the given document)
    Keyword 'file' appears 2 times in the document containing 13 words.
    Therefore tf_d = sqrt(2/13) ~ 0.392

numDocs : number of documents in the index
    Let's have it at 100.

docFreq_t : number of documents containing t
    Let's put this number at 12.

idf_t : log(numDocs/docFreq_t+1) + 1.0
    idf_t = log(100/12+1)+1.0 ~ 1.263

norm_q : sqrt(sum_t((tf_q*idf_t)^2))
    sqrt(123*(0.775*1.263)^2) = 0.979

norm_d_t : square root of the number of tokens in d in the same field as t
    This doesn't really apply in our case, since FullTextSearch (as it is
    now) doesn't support field-based search/indexing.

score_d : score for document d
    score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t )
    Do we have to take norm_d_t out of the equation, or just set it to 1?
    So we get:
    score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t ) =
            = 1.263*(0.775 * 1.263/0.979 * 0.392 * 1.263) = 0.625

If this looks right, would you think it's safe to proceed with the implementation? Of course, there are a few other finer details involved, such as changes to the way we store indexed data, if any.

To me it seems like any backend should be able to support scoring, since the only information that we require from the database (directly or indirectly, a.k.a. derived) is this:

1. total number of indexed documents (numDocs)
2. number of times a keyword is present in the indexed data (sum_t?)
3. number of times a keyword is present in a given indexed document (required to derive tf_d)
4. number of documents containing a keyword (docFreq_t)

For point 2, it might be possible to simply add an extra 'count' field to the _words table, so that the table looks like this: word | id | count. The 'count' field would then be adjusted as new data is added to the index or old data is removed. For point 3, a similar 'count' field might be added to the _data table. So for a 'phrase' backend, the table may have these fields: word_id | doc_id | idx | count.

Any thoughts/comments?

Cheers,
Vladimir Bogdanov.
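As a backend-neutral sketch, the formula from the FAQ can be prototyped in a few lines of Python. The function and parameter names here are mine, not part of FullTextSearch; I assume the natural logarithm, the parenthesization log(numDocs/(docFreq_t+1)) + 1.0, and tf as the square root of the raw term count (rather than the count-over-length ratio used in the worked numbers above). With no per-field indexing, a single sqrt(token count) stands in for norm_d_t:

```python
import math
from collections import Counter

def score(query_terms, doc_terms, num_docs, doc_freq):
    """Score one document against a query using the FAQ's formula.

    query_terms, doc_terms: lists of tokens.
    num_docs:  total number of indexed documents (numDocs).
    doc_freq:  dict mapping each query term -> number of documents
               containing it (docFreq_t).
    """
    tf_q = Counter(query_terms)   # raw term frequencies in the query
    tf_d = Counter(doc_terms)     # raw term frequencies in the document

    def idf(t):
        # idf_t : log(numDocs/(docFreq_t+1)) + 1.0, natural log assumed
        return math.log(num_docs / (doc_freq[t] + 1)) + 1.0

    # norm_q : sqrt(sum_t((tf_q*idf_t)^2)), summed over the query's terms
    norm_q = math.sqrt(sum((math.sqrt(tf_q[t]) * idf(t)) ** 2 for t in tf_q))

    # norm_d_t : no field-based indexing, so use sqrt(token count of d)
    norm_d = math.sqrt(len(doc_terms))

    # sum_t(...) : evaluate the inner product per term, then add up;
    # terms absent from the document contribute nothing
    return sum(
        math.sqrt(tf_q[t]) * idf(t) / norm_q
        * math.sqrt(tf_d[t]) * idf(t) / norm_d
        for t in tf_q if tf_d[t] > 0
    )
```

Note that the four inputs are exactly the quantities listed in points 1-4: numDocs, the per-document counts behind tf_d, and docFreq_t, which is why any backend that can supply those counts could support scoring.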
From: T.J. Mather <tjmather@tj...>  2002-04-08 05:13:48

On Sat, 6 Apr 2002, Vladimir Bogdanov wrote:

> Doug Cutting (the inventor of Lucene) has summarized his algorithm
> as follows:
>
> score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t )
>
> where:
> score_d   : score for document d
> sum_t     : sum for all terms t
> tf_q      : the square root of the frequency of t in the query
> tf_d      : the square root of the frequency of t in d
> numDocs   : number of documents in the index
> docFreq_t : number of documents containing t
> idf_t     : log(numDocs/docFreq_t+1) + 1.0
> norm_q    : sqrt(sum_t((tf_q*idf_t)^2))
> norm_d_t  : square root of the number of tokens in d in the same field as t
>
> Here's how I think this formula could be applied in our own
> scoring algorithm for FullTextSearch:
>
> ---- Example: ----
> Search query = "foo foo bar foo bar file"
> Document:
> ----
> [To] index files, use [the] front-end file. [Here] [the] content [of the]
> document
> [is] clearly [the] content [of the] file specified [by the] filename.
> ----
>
> Calculating variables:
> sum_t : sum for all terms t
>     ??? Is this equal to the total number of times a term was found?
>
>     Then let's put it at 123.

This is not a number, but a mathematical operation. Basically it means: calculate (tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t) for each term t, then sum up all the numbers calculated for each term t. I hope that clears things up.

Cheers,
TJ
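TJ's clarification, that sum_t is an operation rather than a quantity, can be made concrete with a toy computation. The per-term values below are invented purely for illustration and do not come from the thread's example:

```python
import math

# Invented (tf_q, tf_d, idf) values for the terms shared by a query
# and a document; in practice these come from the index counts.
per_term = {
    #        tf_q   tf_d   idf
    "foo":  (1.73, 1.00, 1.9),
    "file": (1.00, 1.41, 2.1),
}

norm_q = math.sqrt(sum((tf_q * idf) ** 2
                       for tf_q, _, idf in per_term.values()))
norm_d = math.sqrt(13)  # sqrt of the document's token count

# sum_t(...) means: evaluate the bracketed expression once per term t ...
contributions = [tf_q * idf / norm_q * tf_d * idf / norm_d
                 for tf_q, tf_d, idf in per_term.values()]

# ... then add the per-term results together to get score_d.
score_d = sum(contributions)
```

So there is one contribution per matching term, and score_d is simply their total; sum_t never appears as a standalone number like 123.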