From: Michael S. <sta...@us...> - 2005-11-19 01:22:08
Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv25216/xdocs

Modified Files:
  faq.fml
Log Message:
* xdocs/faq.fml
  More on nutch scoring.

Index: faq.fml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/faq.fml,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** faq.fml 1 Nov 2005 19:17:09 -0000 1.15
--- faq.fml 19 Nov 2005 01:21:58 -0000 1.16
***************
*** 264,270 ****
  <faq id="scoring">
  <question>Tell me more about how scoring is done in
! nutch/nutchwax.</question>
  <answer>
! <p>By default, at query time, the following fields are boosted as follows:
  <pre>query.url.boost, 4.0f
  query.anchor.boost, 2.0f
--- 264,305 ----
  <faq id="scoring">
  <question>Tell me more about how scoring is done in
! nutch/nutchwax (Or 'explain' the <code>explain</code> page)?</question>
  <answer>
! <p>Nutch is built on Lucene. To understand Nutch scoring, study
! how Lucene does it. The formula Lucene uses for scoring can be found
! at the head of the Lucene Similarity class in the
! <a href="http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html">Lucene Similarity Javadoc</a>.
! Roughly, the score for a particular document in a set of query results,
! <code>score(q,d)</code>, is the sum of the score for each term of a
! query (<code>t in q</code>). A term's score in a document is itself the
! sum of the term run against each field that comprises a document (e.g.
! <code>title</code> is one field, <code>url</code> is another. A 'document'
! is a set of 'fields'). Per field, the term's score is the product of
! the following factors: its <code>tf</code> (term
! frequency in the document), a score factor <code>idf</code>, usually a factor
! made up of frequency of term relative to number of docs in index, an
! index-time boost,
! a normalization of count of terms found relative to size of document
! (<code>lengthNorm</code>), a similar normalization done for the term in
! the query itself (<code>queryNorm</code>), and finally, a factor that
! weights how many of the total number of query terms a
! particular document contains. Study the Lucene Javadoc to get more
! detail on each of the equation components and how they affect
! overall score.</p>
! <p>The nutch <code>explain.jsp</code> page can be interpreted with the
! Lucene scoring equation in mind. First, notice how we move right as
! we move from score total, to score per term, to score per field (nothing
! is shown if a term was not found in a particular field).
! Next, studying the scoring of a particular field, it comprises a
! query component and then a field component (score is the product of
! these two components). The query component includes
! query time -- as opposed to index time -- boost, an idf (that is the same
! for the query and field components), and then a queryNorm. Similar for
! the field component (fieldNorm is an aggregation of certain of the
! Lucene equation components).</p>
!
! <p>The easiest way to influence scoring is to change query time boost
! (will require edit of nutch-site.xml and redeploy of the nutchwax.war
! file). Query-time boost by default looks like this:
  <pre>query.url.boost, 4.0f
  query.anchor.boost, 2.0f
***************
*** 273,278 ****
  query.phrase.boost, 1.0f</pre></p>
  <p>From the list above, you can see that terms found in a document URL get
! the highest boost with anchor text next, etc.
! You can change the above boosts by editing your nutch-site.xml</p>
  <p>Anchor text makes a large contribution to a document ranking score.
  You can see the anchor text for a page by browsing to the 'explain' then
--- 308,312 ----
  query.phrase.boost, 1.0f</pre></p>
  <p>From the list above, you can see that terms found in a document URL get
! the highest boost with anchor text next, etc.</p>
  <p>Anchor text makes a large contribution to a document ranking score.
  You can see the anchor text for a page by browsing to the 'explain' then