#75 Add deleteDocument method to QueryEnvironment

Next_Release
closed
nobody
None
1
2013-12-12
2013-09-17
Dan Jamrog
No

It would be helpful if the QueryEnvironment class had a deleteDocument(const std::vector< lemur::api::DOCID_T > &documentIDs) method. I'm dealing with a number of sharded indexes. I can load them all into a QE and do queries to find the documentID(s) of docs I need to delete, but I don't think that helps me because I'm getting back the 'cooked' documentIDs which can't be used in an IndexEnvironment::deleteDocument(int documentID) call. Instead, I have to load each shard into a QE, find the documentID(s) for docs I need to delete, then init an IndexEnvironment with the shard and call IndexEnvironment::deleteDocument(int documentID). (If I'm missing something that would make this easier, please let me know.)
A really nice enhancement would be something like Lucene's IndexWriter.deleteDocuments(Query query) which would let me search and delete in one method call. Thanks.

Discussion

  • David Fisher

    David Fisher - 2013-11-20

    The goal of the QueryEnvironment class is to provide read-only access to the underlying repository.

    You can accomplish this by modifying your code to use
    QueryEnvironment::addIndex(IndexEnvironment) and keeping the IndexEnvironments around between query calls. See http://lemur.sourceforge.net/indri/classindri_1_1api_1_1QueryEnvironment.html#a00e5012eafbbdff0eee582166d0e35a4

    You can uncook the document ids as long as you know the order the IndexEnvironments were added to the QueryEnvironment (see the comment in QueryEnvironment.cpp):

    // by reassigning document IDs with the following function:
    //      serverCount = _servers.size();
    //      cookedDocID = rawDocID * serverCount + docServer;
    // So, for document 6 from server 3 (out of 7 servers), the cooked docID would be:
    //      (6 * 7) + 3 = 45.
    

    and see the documentLength API implementation for an example of uncooking:

      int serverCount = (int)_servers.size();
      DOCID_T id = documentID/serverCount;
      int serverID = documentID % serverCount;
      length = _servers[serverID]->documentLength( id );
    

    Performing a similar bit of work will enable using the appropriate IndexEnvironment to delete each document.
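
    As a rough illustration of that "similar bit of work", here is a minimal standalone sketch of the cook/uncook arithmetic quoted above, plus a helper that groups cooked IDs by shard. It does not link against Indri: DocId is a stand-in for lemur::api::DOCID_T (assumed here to be a signed integer type), and RawLocation, cook, uncook and groupByShard are hypothetical names, not part of the Indri API. The server index is assumed to be the position in which each IndexEnvironment was added to the QueryEnvironment.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Stand-in for lemur::api::DOCID_T (assumption: a signed integer type).
using DocId = std::int64_t;

// Hypothetical helper type: where a cooked ID actually lives.
struct RawLocation {
    int serverIndex;  // position of the shard in the order it was added
    DocId rawDocId;   // document ID local to that shard
};

// Forward mapping from the QueryEnvironment.cpp comment:
//   cookedDocID = rawDocID * serverCount + docServer
DocId cook(DocId rawDocId, int serverIndex, int serverCount) {
    assert(serverCount > 0 && serverIndex >= 0 && serverIndex < serverCount);
    return rawDocId * serverCount + serverIndex;
}

// Inverse mapping, mirroring the documentLength snippet:
//   id = documentID / serverCount; serverID = documentID % serverCount
RawLocation uncook(DocId cookedDocId, int serverCount) {
    assert(serverCount > 0 && cookedDocId >= 0);
    return RawLocation{static_cast<int>(cookedDocId % serverCount),
                       cookedDocId / serverCount};
}

// Group cooked IDs by shard so that each shard's IndexEnvironment
// can be handed only the raw IDs that belong to it.
std::map<int, std::vector<DocId>> groupByShard(const std::vector<DocId>& cookedIds,
                                               int serverCount) {
    std::map<int, std::vector<DocId>> perShard;
    for (DocId cooked : cookedIds) {
        RawLocation loc = uncook(cooked, serverCount);
        perShard[loc.serverIndex].push_back(loc.rawDocId);
    }
    return perShard;
}
```

    With the map from groupByShard in hand, one would presumably loop over each shard index and call deleteDocument on the IndexEnvironment that was added at that position, once per raw ID.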

    When you are all done with your deletions, close each IndexEnvironment to commit the changes.

     
  • David Pane

    David Pane - 2013-12-12
    • status: open --> closed
     
