Calculating uwN in Indri


  • Anonymous


    I have a question regarding the calculation of uwN in indri.

    I created a small index composed of 3 documents:

    1 (docno) - "A penny penny for your thoughts"
    2 - "A penny saved is a penny earned"
    3 - "your hat costs a pretty penny"

    I used the Indri API to index these documents, adding every term in the
    documents (no stopping or stemming). The total number of terms in the
    index is 19, the documents' internal IDs are 0, 1 and 2, and the document
    lengths are 6, 7 and 6.

    I then ran the following commands and got the following results:

    IndriRunQuery -query="#uw8( a penny )" -index=indexlocation
    -1.36399 1 0 7
    -1.69168 2 0 6
    -1.69168 0 0 6

    dumpindex indexLocation e "#uw8( a penny )"

    uw8( a penny ) 19 3

    1 1 0 2
    2 1 0 2
    2 1 1 5
    2 1 4 6
    3 1 3 6

    dumpindex indexLocation x "#uw8( a penny )"

    uw8(a penny ):4

    If I understand correctly, the second result means that the expression
    appears in the index 5 times: once each in documents 1 and 3, and 3 times in
    document 2. The third result means that the expression appears 4 times.

    Two things do not make sense to me:

    1. The final score calculation. Taking document 1 (docid) as an example, I calculate:

    log(0.6(3/7)+0.4(5/19)). This gives -1.0150, which is not the result
    obtained. The result in the output is obtained only if I assume that document 1 has
    2 occurrences of the expression and that the total number of occurrences
    is 4: log(0.6(2/7)+0.4(4/19))=-1.36399. How does this correspond to the
    results of the second command?

    2. I don't understand how the occurrences of the unordered window are counted. I read the following definition: tf_{#uwN(qi ... qj), D} is the number of times the terms qi, ..., qj appear, ordered or unordered, within a window of N terms. What matches the list obtained is taking a window of 8 tokens, starting at term 0 and moving one term to the right each time, and counting the number of times the window contains both terms "a" and "penny" in any order. Doing that, I get the list from the second command. But this brings me back to the first question: how was the score calculated?

    I'd appreciate your help.
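
    The arithmetic in the question can be checked with a short script. This is
    only a sketch: the 0.6/0.4 weights are taken from the poster's own
    calculation, the natural log is assumed because it reproduces the reported
    figures, and the function name uw_score is made up for illustration (it is
    not part of Indri).

```python
import math

def uw_score(tf, doc_len, ctf, coll_len, lam=0.4):
    """Log of a linear mix of document and collection frequencies.

    lam is the weight on the collection model; 0.4 is the value used in
    the question, not a documented Indri default.
    """
    return math.log((1 - lam) * tf / doc_len + lam * ctf / coll_len)

# Counts that reproduce the IndriRunQuery output (4 windows in the collection).
print(f"{uw_score(2, 7, 4, 19):.5f}")  # -1.36399 (docid 1, length 7)
print(f"{uw_score(1, 6, 4, 19):.5f}")  # -1.69168 (docids 0 and 2, length 6)

# Counts read off the "dumpindex e" list (3 in the document, 5 overall).
print(f"{uw_score(3, 7, 5, 19):.5f}")  # about -1.0150, not what IndriRunQuery reports
```

    Only the first set of counts matches all three reported scores, which is
    what prompts the question.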

  • David Fisher

    The counts for the scoring are without double dipping, so there are only 4
    instances of the window in the collection. Each window expression consumes the
    terms used to constitute the window.

    The expression list command (dumpindex e) includes the 5th window produced
    when the penny from window one combines with the a from window three because
    it enumerates all possible windows, without regard to overlapping. It is
    unrelated to evaluating the query for scoring.
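
    The difference between the two counts can be illustrated with a small
    sketch. It is not Indri's code: enumerate_windows and consume_windows are
    hypothetical helpers, and "consuming" is simplified to mean that an
    accepted window blocks any later window overlapping its extent. Under
    those assumptions it reproduces the 5 extents listed by "dumpindex e" and
    the 4 windows used for scoring.

```python
def enumerate_windows(tokens, terms, width):
    """Shortest window [begin, end) starting at each query-term position
    that contains every query term and spans at most `width` tokens."""
    terms = set(terms)
    windows = []
    for begin, tok in enumerate(tokens):
        if tok not in terms:
            continue
        missing = set(terms)
        for pos in range(begin, min(begin + width, len(tokens))):
            missing.discard(tokens[pos])
            if not missing:
                windows.append((begin, pos + 1))
                break
    return windows

def consume_windows(windows):
    """Greedy left-to-right pass: skip any window that overlaps one already kept."""
    kept, last_end = [], 0
    for begin, end in windows:
        if begin >= last_end:
            kept.append((begin, end))
            last_end = end
    return kept

docs = [
    "a penny penny for your thoughts",
    "a penny saved is a penny earned",
    "your hat costs a pretty penny",
]
all_windows = [enumerate_windows(d.split(), ["a", "penny"], 8) for d in docs]
scored = [consume_windows(w) for w in all_windows]

print(all_windows)                  # [[(0, 2)], [(0, 2), (1, 5), (4, 6)], [(3, 6)]]
print(sum(map(len, all_windows)))   # 5
print(sum(map(len, scored)))        # 4
```

    In the second document the window (1, 5) reuses the "penny" already taken
    by (0, 2), so the consuming pass drops it and that document contributes 2
    windows instead of 3.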

  • Le Zhao

    Ah, that's new. I always thought scoring and "dumpindex e" used the same
    counting implementation.