2009-08-01 13:04:28 UTC
When I index files with IndriBuildIndex, the resulting index stores the document numbers (the DOCNO values) as if they were part of the text.
For instance, a single file contains:
<DOC>
<DOCNO>D1</DOCNO>
<TEXT>
test
</TEXT>
</DOC>
The build parameter file:
<parameters>
<index>/home/usr/tmpIndex</index>
<corpus>
<path>/home/usr/dox</path>
<class>trectext</class>
</corpus>
<stemmer><name>porter</name></stemmer>
</parameters>
The manifest of the resulting index (note that total-terms is 2, although the text of the document contains only one term):
<parameters>
<code-build-date>Jun 16 2009</code-build-date>
<corpus>
<document-base>1</document-base>
<frequent-terms>0</frequent-terms>
<maximum-document>2</maximum-document>
<total-documents>1</total-documents>
<total-terms>2</total-terms>
<unique-terms>2</unique-terms>
</corpus>
<fields>
<field>
<byte-offset>0</byte-offset>
<isNumeric>false</isNumeric>
<isOrdinal>true</isOrdinal>
<isParental>true</isParental>
<name>document</name>
<total-documents>1</total-documents>
<total-terms>2</total-terms>
</field>
</fields>
<indri-distribution>Indri development release 2.9</indri-distribution>
<type>DiskIndex</type>
</parameters>
If I run the query "d1", it matches the document, as if "d1" were part of the text.
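To make the mismatch concrete, here is a minimal Python sketch (it is not part of Indri or Lemur) that counts the terms inside the TEXT element of the sample file and compares the result with the total-terms value from the manifest. The helper names count_text_terms and manifest_total_terms are made up for illustration, the tokenization is a plain whitespace split rather than Indri's own tokenizer, and the manifest is trimmed to the corpus counts shown above.

```python
import re
import xml.etree.ElementTree as ET

# The sample document from this post.
SAMPLE_DOC = """<DOC>
<DOCNO>D1</DOCNO>
<TEXT>
test
</TEXT>
</DOC>"""

# The manifest from this post, trimmed to the corpus counts.
MANIFEST = """<parameters>
<corpus>
<total-documents>1</total-documents>
<total-terms>2</total-terms>
<unique-terms>2</unique-terms>
</corpus>
</parameters>"""


def count_text_terms(trec_doc):
    """Count whitespace-separated tokens found inside <TEXT>...</TEXT> only."""
    bodies = re.findall(r"<TEXT>(.*?)</TEXT>", trec_doc, flags=re.DOTALL)
    return sum(len(body.split()) for body in bodies)


def manifest_total_terms(manifest_xml):
    """Read the corpus-level total-terms value from a manifest string."""
    root = ET.fromstring(manifest_xml)
    return int(root.findtext("corpus/total-terms"))


expected = count_text_terms(SAMPLE_DOC)
reported = manifest_total_terms(MANIFEST)
print("terms in TEXT:", expected, "| total-terms in manifest:", reported)
```

For the sample above, the TEXT element holds one term while the manifest reports two, which is consistent with the DOCNO value being indexed as a term.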
I know that with Lemur's BuildIndex (using the key index type, for example) the index does not treat the doc-numbers as part of the text.
Is there any way to make IndriBuildIndex ignore the DOCNO, or at least not store it as if it were part of the text? As it is now, it corrupts the index statistics.
Thank you very much