Indri Document Scoring

David Fisher

Indri uses the language modeling approach to information retrieval. Language modeling assigns a probability value to each document, meaning that every score is a value between 0 and 1. For numerical accuracy, Indri returns the log of the actual probability value. log(0) equals negative infinity, and log(1) equals zero, so Indri document scores are always negative (or, at the very most, zero).
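As a rough illustration of why log values are preferred, the sketch below assumes a score built by multiplying many small per-term probabilities, as in the usual query likelihood model. The probability value and the number of terms are made up for the example and are not taken from Indri.

```python
import math

# Hypothetical per-term probabilities: 80 terms, each with probability 1e-5.
probs = [1e-5] * 80

# Multiplying the raw probabilities underflows double precision to 0.0.
product = 1.0
for p in probs:
    product *= p

# Summing the logs of the same probabilities stays well within range.
log_sum = sum(math.log(p) for p in probs)

print(product)   # the raw product has underflowed
print(log_sum)   # a finite negative number
```

The raw product can no longer distinguish between documents once it underflows, while the log-space sum still can.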

Without diving into a lot of math, it is best to assume that these values are not comparable across queries. In particular, as you add words to a query, the average document score tends to drop, even though the system generally gets better at finding good documents.

By default, Indri uses a query likelihood function with Dirichlet prior smoothing to weight terms. The formulation is given by:

c(w;D) = count of word in the document

c(w;C) = count of word in the collection

|D| = number of words in the document

|C| = number of words in the collection

numerator = c(w;D) + mu * c(w;C) / |C|

denominator = |D| + mu

score = log( numerator / denominator )
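The formulation above can be sketched in a few lines of Python. This is only an illustration of the formula for a single word, not Indri's actual code; the function name, parameter names, and example counts are assumptions chosen for readability.

```python
import math

def dirichlet_score(count_in_doc, count_in_collection,
                    doc_length, collection_length, mu=2500.0):
    """Log of the Dirichlet-smoothed probability of one word in one document.

    Assumes the word occurs at least once in the collection; otherwise the
    numerator can be zero and math.log raises ValueError.
    """
    # c(w;D) + mu * c(w;C) / |C|
    numerator = count_in_doc + mu * count_in_collection / collection_length
    # |D| + mu
    denominator = doc_length + mu
    return math.log(numerator / denominator)

# Made-up example: the word occurs 3 times in a 500-word document and
# 1,000 times in a 1,000,000-word collection.
score = dirichlet_score(3, 1000, 500, 1_000_000)
print(score)  # a negative value, roughly -6.30
```

Here the numerator is 3 + 2500 * 0.001 = 5.5 and the denominator is 500 + 2500 = 3000, so the score is log(5.5 / 3000).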

By default, mu is equal to 2500. For very small documents (only a few words long), mu dominates the denominator |D| + mu, which means that the score differences between such documents will be very small.
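To get a feel for the effect of mu on short documents, the following sketch compares two made-up documents of 10 and 20 words that each contain the word once. The counts, the collection size, and the alternative value mu = 10 are arbitrary choices for illustration, and the helper function is a restatement of the formula above rather than Indri's implementation.

```python
import math

def dirichlet_score(count_in_doc, count_in_collection,
                    doc_length, collection_length, mu):
    # log( (c(w;D) + mu * c(w;C) / |C|) / (|D| + mu) )
    numerator = count_in_doc + mu * count_in_collection / collection_length
    return math.log(numerator / (doc_length + mu))

# Hypothetical collection statistics for one word.
count_in_collection = 1000
collection_length = 1_000_000

def score_gap(mu):
    """Score difference between a 10-word and a 20-word document,
    each containing the word exactly once."""
    short_doc = dirichlet_score(1, count_in_collection, 10, collection_length, mu)
    longer_doc = dirichlet_score(1, count_in_collection, 20, collection_length, mu)
    return short_doc - longer_doc

gap_default = score_gap(2500.0)  # the default mu
gap_small_mu = score_gap(10.0)   # an arbitrary, much smaller mu

print(gap_default)   # a very small gap
print(gap_small_mu)  # a much larger gap
```

With mu = 2500 the gap is log(2520 / 2510), which is under 0.01, whereas with mu = 10 it is log(30 / 20), which is about 0.41.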

More information can be found on the [Indri Retrieval Model] page.


Related

Wiki: Home
Wiki: Indri Retrieval Model
Wiki: Scored Query Evaluation
Wiki: Technical Details
