Please review https://sourceforge.net/p/lemur/wiki/Indri%20Retrieval%20Model/ where the formulae for both are given. The value of cf passed in to the JelinekMercerTermScoreFunction is the computed estimate of P(t|C), which is the collection term frequency divided by |C|.
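To make the role of that cf value concrete, here is a minimal Python sketch of textbook Jelinek-Mercer smoothing. It is not the Indri source: the function names, the single `lam` parameter (taken here as the weight on the collection model) and its value of 0.4 are assumptions for illustration only, and the wiki page above remains the reference for the formula Indri actually uses.

```python
import math


def collection_probability(collection_term_frequency, collection_length):
    """Estimate P(t|C) as collection term frequency divided by |C|."""
    return collection_term_frequency / collection_length


def jelinek_mercer_score(tf, doc_length, p_t_c, lam=0.4):
    """Log of the smoothed term probability.

    lam is an illustrative weight on the collection model P(t|C);
    (1 - lam) is the weight on the document model tf / |D|.
    """
    p_t_d = tf / doc_length if doc_length > 0 else 0.0
    return math.log((1.0 - lam) * p_t_d + lam * p_t_c)


# The caller passes the already computed probability, not the raw count.
p_t_c = collection_probability(50, 1_000_000)
score = jelinek_mercer_score(tf=2, doc_length=100, p_t_c=p_t_c)
```

Passing the raw collection count instead of P(t|C) would push the mixture above 1 and give a positive log score, which is an easy symptom to check for.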
I expect that the problem relates more to your conversion of the WARC files, as you have multiple queries experiencing the issue. You can modify the dumpindex code to iterate over your internal document ids to identify the ones that have an empty string for their docno element. You can then use dumpindex to retrieve the ParsedDocument for each of those ids and see which ones have the problem.
The opening and closing DOC tags must each appear on a separate line, by themselves, e.g.:
<DOC>
....
</DOC>
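If you want to check your converted files for this before indexing, a small script along these lines may help. It is only a sketch: the function name is hypothetical, it assumes the tags are written in upper case exactly as `<DOC>` and `</DOC>`, and it assumes that leading or trailing whitespace around a tag is harmless, which you should verify against your version of indri.

```python
def find_inline_doc_tags(lines):
    """Return (line_number, text) pairs for lines where <DOC> or </DOC>
    shares the line with other content.

    Assumption: whitespace around a lone tag is ignored.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        has_tag = "<DOC>" in stripped or "</DOC>" in stripped
        if has_tag and stripped not in ("<DOC>", "</DOC>"):
            problems.append((number, stripped))
    return problems


sample = [
    "<DOC>",
    "<DOCNO>d1</DOCNO>",
    "some text",
    "</DOC>",
    "<DOC> <DOCNO>d2</DOCNO>",
    "more text </DOC>",
]
for number, text in find_inline_doc_tags(sample):
    print(f"line {number}: {text}")
```

To run it over a real file, pass the open file object in place of `sample`, since iterating over a file yields its lines.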
Since you processed the WARC files into your own document set and have changed to a subset of the collection, I am unable to replicate your issue. The exact query you show is number 102, but your output from the original unedited post indicated query number 157.
I can't say for certain where your problem lies. Please provide me with the indexing parameters that you used (include a copy of the index manifest files), and the exact form of the offending query (#157) from your run so that I can try to replicate the issue.
The missing document id is indicative of a duplicate document entry or some string hashing bug. What operating system did you build this on? What version of indri? Did you compile the code yourself? If so, what are your configuration options? I believe that there is a document in the TREC-B set from Wikipedia which contains a TRECWEB example document embedded in it, which could be incorrectly parsed by some versions of indri. Retrieving that embedded document could also cause the document id to be...
Not really, no. It would require modification of the code to change that behavior.
#combinep scores all p entries for any document containing one of the query terms.