[Archive-access-cvs] archive-access/projects/nutch/xdocs faq.fml,1.15,1.16

Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv25216/xdocs

Modified Files:
	faq.fml 
Log Message:

* xdocs/faq.fml 
    More on nutch scoring.
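
    The scoring description added in this revision can be sketched in a few
    lines of Java. This is only an illustration under stated assumptions, not
    Nutch or Lucene code: the class name ScoreSketch, the method names, and
    the sample numbers are hypothetical, and the factor shapes (square-root
    tf, log-based idf, 1/sqrt length norm) are modeled on Lucene's default
    Similarity as best recalled. Check the Similarity Javadoc for the
    authoritative definitions.

```java
public class ScoreSketch {
    // Assumed factor shapes, modeled on Lucene's default Similarity.
    // They are illustrative only; the Similarity Javadoc is authoritative.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    static double lengthNorm(int numTermsInField) {
        return 1.0 / Math.sqrt(numTermsInField);
    }

    // Fraction of the query's terms that the document contains.
    static double coord(int matchedTerms, int queryTerms) {
        return (double) matchedTerms / queryTerms;
    }

    // Score of one term against one field: the product of the factors
    // listed in the FAQ text (tf, idf, boost, lengthNorm, queryNorm).
    static double fieldScore(int freq, int docFreq, int numDocs,
                             double boost, int numTermsInField, double queryNorm) {
        if (freq == 0) {
            return 0.0; // term absent from this field, so no contribution
        }
        return tf(freq) * idf(docFreq, numDocs) * boost
                * lengthNorm(numTermsInField) * queryNorm;
    }

    public static void main(String[] args) {
        // Hypothetical two-term query; only one term matches the document.
        double url = fieldScore(1, 9, 1000, 1.0, 4, 1.0);
        double title = fieldScore(2, 49, 1000, 1.0, 8, 1.0);
        double termScore = url + title;            // sum over fields
        double score = coord(1, 2) * termScore;    // sum over terms, then coord
        System.out.println("url=" + url + " title=" + title + " score=" + score);
    }
}
```

    The nesting mirrors the layout of the explain page: a total, then a score
    per term, then a score per field.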


Index: faq.fml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/faq.fml,v
retrieving revision 1.15
retrieving revision 1.16
diff -C2 -d -r1.15 -r1.16
*** faq.fml	1 Nov 2005 19:17:09 -0000	1.15
--- faq.fml	19 Nov 2005 01:21:58 -0000	1.16
***************
*** 264,270 ****
      <faq id="scoring">
          <question>Tell me more about how scoring is done in
!         nutch/nutchwax.</question>
          <answer>
!         <p>By default, at query time, the following fields are boosted as follows:
          <pre>query.url.boost, 4.0f
  query.anchor.boost, 2.0f
--- 264,305 ----
      <faq id="scoring">
          <question>Tell me more about how scoring is done in
!         nutch/nutchwax (Or 'explain' the <code>explain</code> page)?</question>
          <answer>
!         <p>Nutch is built on Lucene.  To understand Nutch scoring, study
!         how Lucene does it.  The formula Lucene uses for scoring can be found
!         at the head of the Lucene Similarity class in the
!         <a href="http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html">Lucene Similarity Javadoc</a>. 
!         Roughly, the score for a particular document in a set of query results,
!         <code>score(q,d)</code>, is the sum of the score for each term of a
!         query (<code>t in q</code>).  A term's score in a document is itself the
!         sum of its scores against each field that makes up the document (e.g. 
!         <code>title</code> is one field, <code>url</code> is another. A 'document'
!         is a set of 'fields').  Per field, the term's score is the product of
!         the following factors: its <code>tf</code> (term
!         frequency in the document); <code>idf</code>, usually a factor
!         made up of the frequency of the term relative to the number of docs
!         in the index; an index-time boost;
!         a normalization of the count of terms found relative to the size of the
!         document (<code>lengthNorm</code>); a similar normalization for the term
!         in the query itself (<code>queryNorm</code>); and finally, a factor that
!         weights how many of the query's terms a
!         particular document contains.  Study the Lucene Javadoc to get more
!         detail on each of the equation components and how they affect
!         overall score.</p>
!         <p>The nutch <code>explain.jsp</code> page can be interpreted with the
!         Lucene scoring equation in mind.  First, notice how we move right as
!         we move from score total, to score per term, to score per field (nothing
!         is shown if a term was not found in a particular field).
!         Next, look at the scoring of a particular field: it comprises a
!         query component and a field component (the score is the product of
!         these two components).  The query component includes a
!         query-time -- as opposed to index-time -- boost, an idf (the same
!         for the query and field components), and a queryNorm.  The field
!         component is similar (fieldNorm is an aggregation of certain of the
!         Lucene equation components).</p>
! 
!         <p>The easiest way to influence scoring is to change the query-time boost
!         (this requires editing nutch-site.xml and redeploying the nutchwax.war
!         file).  By default, the query-time boosts look like this:
          <pre>query.url.boost, 4.0f
  query.anchor.boost, 2.0f
***************
*** 273,278 ****
  query.phrase.boost, 1.0f</pre></p>
  <p>From the list above, you can see that terms found in a document URL get
! the highest boost with anchor text next, etc.
! You can change the above boosts by editing your nutch-site.xml</p>
  <p>Anchor text makes a large contribution to a document ranking score.
  You can see the anchor text for a page by browsing to the 'explain' then
--- 308,312 ----
  query.phrase.boost, 1.0f</pre></p>
  <p>From the list above, you can see that terms found in a document URL get
! the highest boost with anchor text next, etc.</p>
  <p>Anchor text makes a large contribution to a document ranking score.
  You can see the anchor text for a page by browsing to the 'explain' then