I am new to text mining, and of course new to Indri. I saw that when I run a
query on a set of indexed documents, a negative score is associated with each
doc. Can you tell me in a line or two how the documents are scored? What does
the score mean?
Is it based on how many times a term is present in the doc? Does the score
change with the size of the document? Are there other parameters involved?
Here is my scenario. I have a few hundred blogs, each saved in a separate
document. I want to find out from the blogs which author is a Java expert.
So, first I indexed the documents. Then I ran the query on them. In the
query.xml file I entered query words like "java", "j2ee", etc., all of them
Java-related. Now, I want to score the docs which contain some or all of
these words, so I used #or to combine the query words.
Now, I want to interpret the scores. Does the highest-scored doc contain the
most search words? Will the score depend on the size of the file as well?
Are there other parameters considered in the scoring that I should be aware
of?
I read somewhere that Indri uses Okapi scoring. I looked it up but could not
follow the literature, so I would appreciate it if you could explain the
scoring to me in a few lines.
Thanks a lot for your time,
The score is a log probability, which is why it is negative.
See the discussion on this topic in the old Lemur archive: http://www.lemurpro
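To make "log probability" a bit more concrete, here is a rough Python sketch. It is not Indri's code. It assumes Dirichlet-smoothed language-model scoring with mu = 2500 (which, as far as I know, is Indri's default) and the usual definitions of #combine (average of the log beliefs) and #or (one minus the product of the complements). The function names and the three toy documents are made up for illustration.

```python
import math
from collections import Counter


def term_beliefs(query_terms, doc_tokens, collection_tokens, mu=2500.0):
    """Dirichlet-smoothed P(term | doc) for each query term.

    Assumption: mu=2500 mirrors what is believed to be Indri's default.
    Terms that never occur in the collection are skipped (a choice made
    for this sketch, not something taken from Indri).
    """
    doc_tf = Counter(doc_tokens)
    coll_tf = Counter(collection_tokens)
    beliefs = []
    for term in query_terms:
        p_coll = coll_tf[term] / len(collection_tokens)
        if p_coll == 0.0:
            continue
        # Term frequency raises the belief; document length lowers it.
        beliefs.append((doc_tf[term] + mu * p_coll) / (len(doc_tokens) + mu))
    return beliefs


def combine_score(beliefs):
    """#combine-style score: the mean of the log beliefs."""
    if not beliefs:
        return float("-inf")
    return sum(math.log(b) for b in beliefs) / len(beliefs)


def or_score(beliefs):
    """#or-style score: log(1 - product of (1 - belief))."""
    if not beliefs:
        return float("-inf")
    return math.log(1.0 - math.prod(1.0 - b for b in beliefs))


# Hypothetical toy data: two docs with the same term counts but different
# lengths, and one doc with no query terms at all.
short_doc = ["java", "j2ee", "java", "code"]
long_doc = ["java", "java"] + ["filler"] * 200
other_doc = ["cooking", "recipes", "pasta"]
collection = short_doc + long_doc + other_doc
query = ["java", "j2ee"]

scores = {
    name: combine_score(term_beliefs(query, doc, collection))
    for name, doc in [("short", short_doc), ("long", long_doc), ("other", other_doc)]
}
```

Under these assumptions every score is the log of a number below 1, so it is always negative, and only the ranking matters, not the absolute value. The score goes up with term frequency and down with document length, and the smoothing parameter mu controls how strongly the collection statistics pull on each document.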
Thanks a lot for the prompt reply! The archive helped!