Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv25216/xdocs
Modified Files:
faq.fml
Log Message:
* xdocs/faq.fml
More on nutch scoring.
Index: faq.fml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/faq.fml,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** faq.fml 1 Nov 2005 19:17:09 -0000 1.15
--- faq.fml 19 Nov 2005 01:21:58 -0000 1.16
***************
*** 264,270 ****
<faq id="scoring">
<question>Tell me more about how scoring is done in
! nutch/nutchwax.</question>
<answer>
! <p>By default, at query time, the following fields are boosted as follows:
<pre>query.url.boost, 4.0f
query.anchor.boost, 2.0f
--- 264,305 ----
<faq id="scoring">
<question>Tell me more about how scoring is done in
! nutch/nutchwax (Or 'explain' the <code>explain</code> page)?</question>
<answer>
! <p>Nutch is built on Lucene. To understand Nutch scoring, study
! how Lucene does it. The formula Lucene uses for scoring can be found
! at the head of the Lucene Similarity class in the
! <a href="http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html">Lucene Similarity Javadoc</a>.
! Roughly, the score for a particular document in a set of query results,
! <code>score(q,d)</code>, is the sum of the score for each term of a
! query (<code>t in q</code>). A term's score in a document is itself the
! sum of the term's score in each field that makes up a document (e.g.
! <code>title</code> is one field, <code>url</code> is another. A 'document'
! is a set of 'fields'). Per field, the term's score is the product of
! the following factors: its <code>tf</code> (term
! frequency in the document), an <code>idf</code> factor, usually
! based on how many documents in the index contain the term, an
! index-time boost,
! a normalization for the number of terms in the field
! (<code>lengthNorm</code>), a similar normalization applied to the
! query itself (<code>queryNorm</code>), and finally, a factor that
! rewards a document for how many of the query's terms it
! contains (<code>coord</code>). Study the Lucene javadoc to get more
! detail on each of the equation components and how they affect
! overall score.</p>
! <p>The nutch <code>explain.jsp</code> page can be interpreted with the
! Lucene scoring equation in mind. First, notice how the output indents
! to the right as we move from score total, to score per term, to score
! per field (nothing is shown if a term was not found in a particular field).
! Next, look at the scoring for a particular field: it comprises a
! query component and then a field component (the score is the product of
! these two components). The query component includes the
! query-time -- as opposed to index-time -- boost, an idf (the same
! for the query and field components), and a queryNorm. The field
! component is similar (fieldNorm is an aggregation of several of the
! Lucene equation components).</p>
!
! <p>The easiest way to influence scoring is to change the query-time boosts
! (this requires editing nutch-site.xml and redeploying the nutchwax.war
! file). The default query-time boosts look like this:
<pre>query.url.boost, 4.0f
query.anchor.boost, 2.0f
***************
*** 273,278 ****
query.phrase.boost, 1.0f</pre></p>
<p>From the list above, you can see that terms found in a document URL get
! the highest boost with anchor text next, etc.
! You can change the above boosts by editing your nutch-site.xml</p>
<p>Anchor text makes a large contribution to a document ranking score.
You can see the anchor text for a page by browsing to the 'explain' then
--- 308,312 ----
query.phrase.boost, 1.0f</pre></p>
<p>From the list above, you can see that terms found in a document URL get
! the highest boost with anchor text next, etc.</p>
<p>Anchor text makes a large contribution to a document ranking score.
You can see the anchor text for a page by browsing to the 'explain' then
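The per-field arithmetic described in the new FAQ text (a query component multiplied by a field component) can be sketched in a few lines of Java. This is only an illustration, not Lucene or Nutch code: the class name ExplainSketch, its method names, and the sample numbers are all hypothetical. The tf and idf formulas are the ones Lucene's DefaultSimilarity uses by default; a Nutch deployment may plug in a different Similarity, so treat the numbers as indicative only.

```java
/** Hypothetical sketch of one term/field entry on the explain page; not Lucene or Nutch code. */
final class ExplainSketch {

    /** Term-frequency factor: square root of the raw count (Lucene DefaultSimilarity default). */
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    /** Inverse document frequency: ln(numDocs / (docFreq + 1)) + 1 (DefaultSimilarity default). */
    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    /** Query component: query-time boost * idf * queryNorm. */
    static double queryWeight(double boost, double idf, double queryNorm) {
        return boost * idf * queryNorm;
    }

    /** Field component: tf * idf * fieldNorm. */
    static double fieldWeight(int freq, double idf, double fieldNorm) {
        return tf(freq) * idf * fieldNorm;
    }

    /** Score of one term in one field: product of the query and field components. */
    static double fieldScore(double boost, int freq, double idf,
                             double fieldNorm, double queryNorm) {
        return queryWeight(boost, idf, queryNorm) * fieldWeight(freq, idf, fieldNorm);
    }

    public static void main(String[] args) {
        // Made-up inputs: the term occurs in 9 of 10 documents, so idf = ln(10/10) + 1 = 1.0.
        double idf = idf(9, 10);
        // url field with the default query.url.boost of 4.0, 4 occurrences,
        // an assumed fieldNorm of 0.5 and an assumed queryNorm of 1.0.
        double score = fieldScore(4.0, 4, idf, 0.5, 1.0);
        System.out.println(score); // prints 4.0
    }
}
```

Note that idf enters both components, so it is effectively squared in the per-field score. A document's total is the sum of such per-field scores over all matching terms and fields, scaled by the coord factor.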