Anonymous
2011-12-01
Hi,
I have a question regarding the calculation of uwN in indri.
I created a small index composed of 3 documents:
1 (docno) - "A penny penny for your thoughts"
2 - "A penny saved is a penny earned"
3 - "your hat costs a pretty penny"
I used the Indri API to index those documents, so all of the terms in the
documents were added (no stopping and no stemming). The total number of terms
in the index is 19, the documents' internal ids are 0, 1 and 2, and the
document lengths are 6, 7 and 6.
Now, I ran the following commands and got the following results:
IndriRunQuery -query="#uw8( a penny )" -index=indexlocation -rule=method:jm,collectionLambda=0.4
-1.36399 1 0 7
-1.69168 2 0 6
-1.69168 0 0 6
dumpindex indexLocation e "#uw8( a penny )"
1 1 0 2
2 1 0 2
2 1 1 5
2 1 4 6
3 1 3 6
dumpindex indexLocation x "#uw8( a penny )"
If I understand correctly, the second result means that the expression
appears in the index 5 times: once each in documents 1 and 3, and 3 times in
document 2. The third result means that the expression appears 4 times.
There are 2 things which do not make sense to me:
Going by the second command's counts, I would expect the score of the document
with internal id 1 (docno 2, length 7) to be log(0.6*(3/7) + 0.4*(5/19)). This
gives -1.0150, which is not the result obtained. The result in the output is
obtained only if I assume that this document has 2 occurrences of the
expression and that the total number of occurrences is 4:
log(0.6*(2/7) + 0.4*(4/19)) = -1.36399. How does that correspond with the
second command's results?
I'd appreciate your help.
Thanks
David Fisher
2011-12-01
The counts for the scoring are without double dipping, so there are only 4
instances of the window in the collection. Each window expression consumes the
terms used to constitute the window.
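
As a quick check of these counts against the scores in the question, here is a
minimal Python sketch of the Jelinek-Mercer formula as the question writes it,
log((1 - lambda)*tf/|D| + lambda*ctf/|C|). The function and variable names are
made up for illustration and are not part of Indri; the counts are the
non-overlapping ones described above (2 in docno 2, 1 each in docnos 1 and 3,
4 in the collection).

```python
import math

def jm_log_score(tf, doc_len, ctf, coll_len, collection_lambda=0.4):
    """Log of the Jelinek-Mercer smoothed probability, as written in the question."""
    doc_part = (1.0 - collection_lambda) * tf / doc_len
    coll_part = collection_lambda * ctf / coll_len
    return math.log(doc_part + coll_part)

COLLECTION_LENGTH = 19   # total terms in the index (from the question)
COLLECTION_COUNT = 4     # non-overlapping #uw8(a penny) windows in the collection

# docno -> (non-overlapping window count, document length)
docs = {1: (1, 6), 2: (2, 7), 3: (1, 6)}

scores = {
    docno: jm_log_score(tf, doc_len, COLLECTION_COUNT, COLLECTION_LENGTH)
    for docno, (tf, doc_len) in docs.items()
}

for docno, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"docno {docno}: {score:.5f}")
```

Under these assumptions the sketch gives about -1.36399 for docno 2 and about
-1.69168 for docnos 1 and 3, which matches the IndriRunQuery output quoted in
the question.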
The expression list command (dumpindex e) includes a 5th window, produced in
document 2 when the penny from window one combines with the a from window
three, because it enumerates all possible windows without regard to
overlapping. It is unrelated to evaluating the query for scoring.
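
The difference between the two counts can be illustrated with a small Python
sketch. This is not Indri's source code: `enumerate_windows` and
`drop_overlaps` are hypothetical helpers, and the enumeration rule (for each
occurrence of a query term, the smallest window starting there that contains
every term and spans at most N positions) is an assumption chosen because it
reproduces the extents listed by dumpindex e in this thread.

```python
def enumerate_windows(tokens, terms, width):
    """List (begin, end) extents, end exclusive, without regard to overlap."""
    terms = set(terms)
    positions = [i for i, tok in enumerate(tokens) if tok in terms]
    windows = []
    for start in positions:
        seen = set()
        for pos in positions:
            if pos < start:
                continue
            if pos - start + 1 > width:
                break  # window would span more than `width` positions
            seen.add(tokens[pos])
            if seen == terms:
                windows.append((start, pos + 1))
                break  # keep only the smallest window starting here
    return windows

def drop_overlaps(windows):
    """Keep windows left to right, skipping any that reuse consumed positions."""
    kept, last_end = [], 0
    for begin, end in windows:
        if begin >= last_end:
            kept.append((begin, end))
            last_end = end
    return kept

docs = {
    1: "A penny penny for your thoughts",
    2: "A penny saved is a penny earned",
    3: "your hat costs a pretty penny",
}

all_windows = {
    docno: enumerate_windows(text.lower().split(), ["a", "penny"], 8)
    for docno, text in docs.items()
}
scored_windows = {docno: drop_overlaps(w) for docno, w in all_windows.items()}

print(sum(len(w) for w in all_windows.values()))     # enumerated windows
print(sum(len(w) for w in scored_windows.values()))  # non-overlapping windows
```

With the three documents from the question, the sketch enumerates 5 windows
in total, with extents (0, 2), (1, 5) and (4, 6) in document 2, and keeps 4
after overlapping windows are dropped, since (1, 5) reuses the penny already
consumed by (0, 2).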
Le Zhao
2011-12-21
Ah, that's new. I always thought scoring and "dumpindex e" used the same
counting implementation.