## calculating uwN in indri

Retrieval
Anonymous
2011-12-01
2012-09-27

• Anonymous
2011-12-01

Hi,

I have a question regarding the calculation of uwN in indri.

I created a small index composed of 3 documents:

1 (docno) - "A penny penny for your thoughts"
2 - "A penny saved is a penny earned"
3 - "your hat costs a pretty penny"

I used the Indri API to index those documents, adding every term in the
documents (no stopping and no stemming). The total number of terms in the
index is 19, the documents' internal ids are 0, 1 and 2, and the document
lengths are 6, 7 and 6.

I then ran the following commands and got the following results:

IndriRunQuery -query="#uw8( a penny )" -index=indexlocation
-rule=method:jm,collectionLambda=0.4
-1.36399 1 0 7
-1.69168 2 0 6
-1.69168 0 0 6

dumpindex indexLocation e "#uw8( a penny )"

# uw8( a penny ) 19 3

1 1 0 2
2 1 0 2
2 1 1 5
2 1 4 6
3 1 3 6

dumpindex indexLocation x "#uw8( a penny )"

# uw8(a penny ):4

If I understand correctly, the second result means that the expression
appears in the index 5 times: once each in documents 1 and 3, and 3 times in
document 2. The third result means that the expression appears 4 times.

There are two things that do not make sense to me:

1. The final score calculation. Taking document 1 (docid, the 7-term document) as an example, I do the following calculation:

log(0.6(3/7)+0.4(5/19)). This gives -1.0150, which is not the result
obtained. The result in the output is obtained if I assume that document 1 has
only 2 occurrences of the expression and that the total number of occurrences
is 4: log(0.6(2/7)+0.4(4/19))=-1.36399. How does this correspond with the
results of the second command?

2. I don't understand how the occurrences of the unordered window are counted. I read the following definition: $tf_{\#uwN(q_i \ldots q_j),D}$ is the number of times the terms $q_i, \ldots, q_j$ appear ordered or unordered within a window of N terms. What matches the list I obtained is taking a window of 8 tokens, starting at term 0 and moving one term to the right each time, and counting the number of times the window contains both terms a and penny in any order. Doing that, I get the results listed by the second command. But this brings me back to the first question: how was the score calculated?
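
The two calculations in item 1 can be checked with a short sketch. It assumes the Jelinek-Mercer form written out in the post, with natural log and `collectionLambda=0.4` weighting the collection probability; the function name `jm_log_score` and its parameter names are made up for illustration and are not part of the Indri API.

```python
import math

def jm_log_score(tf, doc_len, ctf, collection_len, collection_lambda=0.4):
    """Natural log of the Jelinek-Mercer smoothed probability.

    tf / doc_len is the in-document estimate, ctf / collection_len the
    collection estimate; collection_lambda weights the collection part.
    """
    doc_prob = tf / doc_len
    coll_prob = ctf / collection_len
    return math.log((1 - collection_lambda) * doc_prob
                    + collection_lambda * coll_prob)

# Counts read off the "dumpindex e" list: 3 in the 7-term document, 5 overall.
from_listing = jm_log_score(tf=3, doc_len=7, ctf=5, collection_len=19)

# Counts that reproduce the IndriRunQuery output: 2 in that document, 4 overall.
from_scoring = jm_log_score(tf=2, doc_len=7, ctf=4, collection_len=19)

# The two 6-term documents, with 1 occurrence each and 4 overall.
six_term_doc = jm_log_score(tf=1, doc_len=6, ctf=4, collection_len=19)

print(round(from_listing, 4))   # -1.015
print(round(from_scoring, 5))   # -1.36399
print(round(six_term_doc, 5))   # -1.69168
```

With a collection count of 4, all three reported scores come out, which suggests the scorer sees 4 occurrences rather than the 5 listed extents.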

I'd appreciate your help.
Thanks

• David Fisher
2011-12-01

The counts for the scoring are without double dipping, so there are only 4
instances of the window in the collection. Each window expression consumes the
terms used to constitute the window.
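
One way to picture "no double dipping" is sketched below. It is an assumption that reproduces the count of 4, not a transcription of Indri's source: the (begin, end) extents are taken from the `dumpindex e` output in the question, scanned left to right, and an extent is skipped when it starts before the previously counted extent ends. The function name `count_without_overlap` is hypothetical.

```python
def count_without_overlap(extents):
    """Count (begin, end) extents left to right, skipping any extent
    that starts before the previously counted extent has ended."""
    count = 0
    last_end = 0
    for begin, end in sorted(extents):
        if begin >= last_end:
            count += 1
            last_end = end
    return count

# Extents copied from the "dumpindex e" listing, keyed by document.
listed = {
    1: [(0, 2)],
    2: [(0, 2), (1, 5), (4, 6)],
    3: [(3, 6)],
}

per_doc = {doc: count_without_overlap(ext) for doc, ext in listed.items()}
total = sum(per_doc.values())

print(per_doc)  # {1: 1, 2: 2, 3: 1}
print(total)    # 4
```

In document 2 the extent (1, 5) reuses the penny at position 1 that (0, 2) already consumed, so it is dropped, leaving 2 occurrences there and 4 in the collection.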

The expression list command (dumpindex e) includes the 5th window produced
when the penny from window one combines with the a from window three because
it enumerates all possible windows, without regard to overlapping. It is
unrelated to evaluating the query for scoring.
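
The listing can be reproduced with a small sketch. It rests on an assumption about the enumeration: for every position holding a query term, report the shortest span starting there that covers all query terms, provided it fits in N positions. The function name `enumerate_windows` is hypothetical, and the tokens are the three documents from the question, lowercased.

```python
def enumerate_windows(tokens, terms, width):
    """For each position holding a query term, emit the shortest
    (begin, end) extent starting there that covers every term,
    if it spans at most `width` positions. Overlaps are allowed."""
    terms = set(terms)
    extents = []
    for begin, token in enumerate(tokens):
        if token not in terms:
            continue
        missing = set(terms)
        for pos in range(begin, min(begin + width, len(tokens))):
            missing.discard(tokens[pos])
            if not missing:
                extents.append((begin, pos + 1))
                break
    return extents

docs = {
    1: "a penny penny for your thoughts".split(),
    2: "a penny saved is a penny earned".split(),
    3: "your hat costs a pretty penny".split(),
}

windows = {doc: enumerate_windows(toks, ["a", "penny"], 8)
           for doc, toks in docs.items()}

print(windows[1])  # [(0, 2)]
print(windows[2])  # [(0, 2), (1, 5), (4, 6)]
print(windows[3])  # [(3, 6)]
```

The extent (1, 5) in document 2 is the fifth window described above: the penny at position 1 paired with the a at position 4.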

• Le Zhao
2011-12-21

Ah, that's new. I always thought scoring and "dumpindex e" had the same
counting implementation.