Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv15920/xdocs
Modified Files:
faq.fml
Log Message:
* xdocs/faq.fml
Point to nutch FAQ.
Index: faq.fml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/projects/nutch/xdocs/faq.fml,v
retrieving revision 1.16
retrieving revision 1.17
diff -C2 -d -r1.16 -r1.17
*** faq.fml 19 Nov 2005 01:21:58 -0000 1.16
--- faq.fml 21 Nov 2005 21:29:54 -0000 1.17
***************
*** 266,316 ****
nutch/nutchwax (Or 'explain' the <code>explain</code> page)?</question>
<answer>
! <p>Nutch is built on Lucene. To understand Nutch scoring, study
! how Lucene does it. The formula Lucene uses scoring can be found
! at the head of the Lucene Similarity class in the
! <a href="http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html">Lucene Similarity Javadoc</a>.
! Rougly, the score for a particular document in a set of query results,
! <code>score(q,d)</code>, is the sum of the score for each term of a
! query (<code>t in q</code>). A terms' score in a document is itself the
! sum of the term run against each field that comprises a document (e.g.
! <code>title</code> is one field, <code>url</code> is another. A 'document'
! is a set of 'fields'). Per field, the terms' score is the product of
! the following factors: Its <code>td</code> (term
! freqency in the document), a score factor <code>idf</code> usually a factor
! made up of frequency of term relative to amount of docs in index, an
! index-time boost,
! a normalization of count of terms found relative to size of document
! (<code>lengthNorm</code>), a similar normalization is done for the term in
! the query itself (<code>queryNorm</code>), and finally, a factor that
! has a weight for how many instances of the total amount of terms a
! particular document contains. Study the lucene javadoc to get more
! detail on each of the equation components and how they effect
! overall score.</p>
! <p>The nutch <code>explain.jsp</code> page can be interpreted with the
! Lucene scoring equation in mind. First, notice how we move right as
! we move from score total, to score per term, to score per field (Nothing
! is shown if a term was not found in a particular field).
! Next, studying a particular field scoring, it comprises a
! query component and then a field component (Score is product of
! these two components). The query component includes
! query time -- as opposed to index time -- boost, an idf (that is same
! for the query and field components), and then a queryNorm. Similar for
! the field component (fieldNorm is an aggregation of certain of the
! Lucene equation components).</p>
!
! <p>The easiest way to influence scoring is to change query time boost
! (will require edit of nutch-site.xml and redeploy of the nutchwax.war
! file). Query-time boost by default looks like this:
! <pre>query.url.boost, 4.0f
! query.anchor.boost, 2.0f
! query.title.boost, 1.5f
! query.host.boost, 2.0f
! query.phrase.boost, 1.0f</pre></p>
! <p>From the list above, you can see that terms found in a document URL get
! the highest boost with anchor text next, etc.</p>
! <p>Anchor text makes a large contribution to a document ranking score.
! You can see the anchor text for a page by browsing to the 'explain' then
! editing the URL to put in place 'anchors.jsp' instead of 'explain.jsp'.
! </p>
</answer>
</faq>
--- 266,272 ----
nutch/nutchwax (Or 'explain' the <code>explain</code> page)?</question>
<answer>
! <p>See <i>How is scoring done in Nutch? (Or, explain the
! "explain" page?)</i> and <i>How can I influence Nutch scoring?</i> over on
! the <a href="http://wiki.apache.org/nutch/FAQ">Nutch FAQ</a> page.</p>
</answer>
</faq>