fulltextsearchdevel — For Developers of FullTextSearch

From: <a4a15206@te...>  2002-03-30 20:57:22

On Jakarta's project home page, I found an FAQ entry about Lucene's scoring algorithm:

http://lucene.sourceforge.net/cgibin/faq/faqmanager.cgi?file=chapter.search&toc=faq#q31

Doug Cutting (the inventor of Lucene) has summarized his algorithm as follows:

    score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t)

where:

    score_d   : score for document d
    sum_t     : sum for all terms t
    tf_q      : the square root of the frequency of t in the query
    tf_d      : the square root of the frequency of t in d
    numDocs   : number of documents in index
    docFreq_t : number of documents containing t
    idf_t     : log(numDocs/docFreq_t+1) + 1.0
    norm_q    : sqrt(sum_t((tf_q*idf_t)^2))
    norm_d_t  : square root of number of tokens in d in the same field as t

Here's how I think this formula could be applied in our own scoring algorithm for FullTextSearch:

Example:

Search query = "foo foo bar foo bar file"

Document:

    [To] index files, use [the] frontend file. [Here] [the] content [of the] document
    [is] clearly [the] content [of the] file specified [by the] filename.

Calculating variables:

sum_t : sum for all terms t
    ???
    Is this equal to the total number of times a term was found?
    Then let's put it at 123.

tf_q : the square root of the frequency of t in the query
    (how often term (keyword) t appears in the user-specified search query)
    tf_q for keyword 'foo', which appears 3 times in a query with a total of 5 keywords, will be:
    sqrt(3/5) or ~0.775

tf_d : the square root of the frequency of t in d
    (how often keyword t appears in the given document)
    Keyword 'file' appears 2 times in the document containing 13 words.
    Therefore tf_d = sqrt(2/13) ~ 0.392

numDocs : number of documents in index
    Let's have it at 100.

docFreq_t : number of documents containing t
    Let's put this number at 12.

idf_t : log(numDocs/docFreq_t+1) + 1.0
    idf_t = log(100/12+1)+1.0 ~ 1.263

norm_q : sqrt(sum_t((tf_q*idf_t)^2))
    sqrt(123*(0.775*1.263)^2) = 0.979

norm_d_t : square root of number of tokens in d in the same field as t
    This doesn't really apply in our case, since FullTextSearch (as it is
    now) doesn't support field-based search/indexing.

score_d : score for document d
    score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t / norm_d_t)
    Do we have to take norm_d_t out of the equation,
    or just set it to 1?
    So we get:
    score_d = sum_t( tf_q * idf_t / norm_q * tf_d * idf_t) =
            = 1.263*(0.775 * 1.263/0.979 * 0.392 * 1.263) = 0.625

If this looks right, do you think it's safe to proceed with the implementation? Of course, there are a few other finer details involved, such as changes (if any) to the way we store indexed data.

To me it seems that any backend should be able to support scoring, since the only information we require from the database (directly or derived) is this:

1. total number of indexed documents (numDocs)
2. number of times a keyword is present in indexed data (sum_t?)
3. number of times a keyword is present in a given indexed document (required to derive tf_d)
4. number of documents containing a keyword (docFreq_t)

For point 2, it might be possible to simply add an extra 'count' field to the _words table so that the table looks like this:

    word | id | count

The 'count' field would then be adjusted as new data is added to the index or old data is removed.

For point 3, a similar 'count' field might be added to the _data table. So for a 'phrase' backend, the table may have these fields:

    word_id | doc_id | idx | count

Any thoughts/comments?

Cheers,
Vladimir Bogdanov.
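As a rough illustration of the formula quoted above, here is a minimal Python sketch. It is not FullTextSearch or Lucene code, and it rests on several assumptions: sum_t is read as a summation over the distinct query terms (not a count), frequencies are raw occurrence counts, log is the natural logarithm, idf_t is grouped as log(numDocs/(docFreq_t+1)) + 1.0, and the whole document is treated as a single field for norm_d_t. The function names and the doc_freq mapping are hypothetical.

```python
import math
from collections import Counter


def tf(count):
    """Term-frequency factor: square root of a raw occurrence count (assumption)."""
    return math.sqrt(count)


def idf(num_docs, doc_freq):
    """idf_t, assuming natural log and the grouping numDocs / (docFreq_t + 1)."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0


def score(query_terms, doc_terms, num_docs, doc_freq):
    """Score one document against a query.

    query_terms, doc_terms: lists of already tokenized, stop-word-free keywords.
    doc_freq: hypothetical mapping keyword -> number of documents containing it.
    """
    q_counts = Counter(query_terms)
    d_counts = Counter(doc_terms)

    # norm_q = sqrt(sum_t((tf_q * idf_t)^2)), summed over distinct query terms
    norm_q = math.sqrt(
        sum((tf(c) * idf(num_docs, doc_freq[t])) ** 2 for t, c in q_counts.items())
    )
    # norm_d_t: with no field support, the whole document acts as one field
    norm_d = math.sqrt(len(doc_terms))

    total = 0.0
    for term, q_count in q_counts.items():
        if term not in d_counts:
            continue  # terms absent from the document contribute nothing
        w = idf(num_docs, doc_freq[term])
        total += (tf(q_count) * w / norm_q) * (tf(d_counts[term]) * w / norm_d)
    return total


query = "foo foo bar foo bar file".split()
document = ("index files use frontend file content document "
            "clearly content file specified filename").split()
doc_freq = {"foo": 12, "bar": 12, "file": 12}  # made-up values for illustration
print(score(query, document, 100, doc_freq))
```

Because the result depends on the log base and on whether raw or relative frequencies are used, the printed value need not match the hand-calculated figures in the message.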
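The four quantities listed in the message (numDocs, per-keyword totals, per-document keyword counts, docFreq_t) can be kept up to date incrementally as documents are added and removed. The sketch below shows that bookkeeping with in-memory dictionaries standing in for the _words and _data tables; the class and attribute names are hypothetical, and a real backend would presumably update 'count' columns in the database instead.

```python
from collections import Counter


class IndexCounts:
    """In-memory stand-in for the 'count' fields proposed for _words and _data."""

    def __init__(self):
        self.doc_terms = {}          # doc_id -> Counter(word -> count in that doc), point 3
        self.word_count = Counter()  # word -> total occurrences in the index, point 2
        self.doc_freq = Counter()    # word -> number of documents containing it, point 4

    @property
    def num_docs(self):
        """Total number of indexed documents (point 1)."""
        return len(self.doc_terms)

    def add_document(self, doc_id, words):
        # Re-indexing an existing document: drop its old counts first.
        if doc_id in self.doc_terms:
            self.remove_document(doc_id)
        counts = Counter(words)
        self.doc_terms[doc_id] = counts
        for word, n in counts.items():
            self.word_count[word] += n
            self.doc_freq[word] += 1

    def remove_document(self, doc_id):
        counts = self.doc_terms.pop(doc_id)
        for word, n in counts.items():
            self.word_count[word] -= n
            self.doc_freq[word] -= 1
            if self.word_count[word] == 0:
                # No document contains the word any more; drop both entries.
                del self.word_count[word]
                del self.doc_freq[word]


idx = IndexCounts()
idx.add_document(1, ["file", "index", "file"])
idx.add_document(2, ["file"])
idx.remove_document(1)
```

Keeping the counts alongside the index trades a little extra work at indexing time for not having to rescan the data at query time.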