term count after document deletion

Anonymous
2011-07-13
2012-09-27
  • Anonymous - 2011-07-13

    hi there

    The term count and unique term count do not change after I delete documents
    from a Clueweb09 index.
    The index was built with IndriBuildIndex.

    This was the index's initial state.
    $ ../../Lemur/lemur-4.12_lib/bin/dumpindex /home1/neo/user/prozect/IndexedData/clueweb09b s
    Repository statistics:
    documents: 50220423
    unique terms: 87261859
    total terms: 40416831010
    fields:

    I deleted 424723 documents using the Java API
    (IndexEnvironment.deleteDocument(..)).
    After that, dumpindex reported the following:
    Repository statistics:
    documents: 49795699
    unique terms: 87261859
    total terms: 40416831010
    fields:

    I would expect the term counts to decrease, since the terms of the deleted
    documents should no longer exist in the index.

    How can I get the correct term counts?

     
  • Anonymous - 2011-07-13

    Correction: there is a typo in my post above. I deleted 424724 documents.
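
The corrected figure can be checked against the two dumpindex outputs quoted in the first post. The snippet below is only a small arithmetic sketch; the class and method names are made up for illustration and are not part of Lemur or Indri.

```java
public class DeletedCount {
    // Number of documents removed, given the "documents:" value
    // reported by dumpindex before and after the deletion.
    static long deleted(long before, long after) {
        return before - after;
    }

    public static void main(String[] args) {
        long before = 50220423L; // documents before deletion (from the thread)
        long after = 49795699L;  // documents after deletion (from the thread)
        System.out.println("deleted documents: " + deleted(before, after));
    }
}
```

The difference matches the corrected count of 424724, not the 424723 given in the first post.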

     
  • David Fisher - 2011-07-13

    Document deletion only removes the document from retrieval consideration.
    Collection statistics are not updated. To get updated statistics, you will
    need to build a new index with the documents that you want deleted filtered
    out prior to indexing.
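
The "filter before indexing" step described in the reply above might look roughly like the following. This is a toy sketch under stated assumptions, not Indri code: the class `PreIndexFilter`, its methods, the in-memory map of document IDs to text, and the whitespace tokenizer are all hypothetical. A real ClueWeb09 collection is stored as WARC files, and the filtered output would still have to be written back out and passed to IndriBuildIndex.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class PreIndexFilter {
    // Returns a copy of docs (document ID -> text) without the excluded IDs.
    static Map<String, String> dropExcluded(Map<String, String> docs, Set<String> excluded) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            if (!excluded.contains(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    // Returns {total terms, unique terms} using a naive whitespace tokenizer
    // (an assumption for illustration; Indri's own tokenization differs).
    static long[] termStats(Map<String, String> docs) {
        long total = 0;
        Set<String> vocab = new HashSet<>();
        for (String text : docs.values()) {
            for (String tok : text.trim().split("\\s+")) {
                if (tok.isEmpty()) continue; // skip empty documents
                total++;
                vocab.add(tok);
            }
        }
        return new long[] { total, vocab.size() };
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("d1", "apple banana apple");
        docs.put("d2", "banana cherry");
        docs.put("d3", "durian");

        Set<String> excluded = new HashSet<>();
        excluded.add("d3"); // the document we want gone from the statistics

        long[] before = termStats(docs);
        long[] after = termStats(dropExcluded(docs, excluded));
        System.out.println("before: total=" + before[0] + " unique=" + before[1]);
        System.out.println("after:  total=" + after[0] + " unique=" + after[1]);
    }
}
```

Because the statistics are computed only over the documents that survive the filter, both counts drop, which is the behaviour the original poster was expecting from deletion.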

     
