
Getting word occurrences from a document.

Retrieval
Purusharth
2017-10-05
2017-10-05
  • Purusharth

    Purusharth - 2017-10-05

    I need to get the occurrences of certain query words in each document of a small set of documents. I cannot use expressionList, because it returns every document and position in the collection that contains the term, which is very slow.

    Is there any built-in functionality for this, some way of limiting expressionList to a small set of documents? Basically, just as runQuery lets us specify the number of documents to consider for a query, can we do the same for expressionList?

    Can the DocumentVector class be used for this? Using DocumentVector, I am able to get the list of words in the document, but I am not sure how documentvector.positions is returned. How should I interpret this array of ints?

     

    Last edit: Purusharth 2017-10-05
  • David Fisher

    David Fisher - 2017-10-05

    stems[positions[i]] is the term at position i+1 in the document (the array index starts at 0, the position count starts at 1).

    Using that, you can iterate over positions and count the occurrences of the terms you are interested in within a given document.
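A minimal sketch of that counting loop, assuming the document vector has already been copied into two plain arrays: `stems` (the unique terms) and `positions` (one index into `stems` per token). The function name `countQueryTerms` and the use of `std::vector` are illustrative assumptions, not part of the library's API.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper, not a library function: counts how often each query
// term occurs in one document, given a document-vector-style pair of arrays.
//   stems     - the unique terms of the document
//   positions - one entry per token; positions[i] is an index into stems
std::map<std::string, int> countQueryTerms(
    const std::vector<std::string>& stems,
    const std::vector<int>& positions,
    const std::vector<std::string>& queryTerms) {
  // Start every query term at zero so absent terms are still reported.
  std::map<std::string, int> counts;
  for (const std::string& q : queryTerms) counts[q] = 0;

  for (std::size_t i = 0; i < positions.size(); ++i) {
    int idx = positions[i];
    // Skip indexes that do not point into stems.
    if (idx < 0 || static_cast<std::size_t>(idx) >= stems.size()) continue;
    // Look up the term at this token position; count it if it was queried.
    auto it = counts.find(stems[idx]);
    if (it != counts.end()) ++it->second;
  }
  return counts;
}
```

Because the loop touches only one document's vector, its cost depends on the length of that document rather than on the size of the collection.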

    Alternatively, if you are actually writing code, look at the dumpindex.cpp source for print_term_counts(). This function fetches the inverted list for a term, iterating over its entries. You could certainly modify that to pick out the specific document ids that you are interested in.
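The filtering idea can be sketched as follows. This is not the code from dumpindex.cpp: `Posting` and `countInDocuments` are hypothetical stand-ins for an inverted-list entry and for the modified loop, chosen only to show the shape of the change.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-in for one inverted-list entry: a document id plus the
// positions at which the term occurs in that document.
struct Posting {
  int document;
  std::vector<int> positions;
};

// Walks a term's inverted list and keeps only the entries whose document id
// is in `wanted`. Returns a map from document id to occurrence count.
std::map<int, std::size_t> countInDocuments(
    const std::vector<Posting>& invertedList,
    const std::set<int>& wanted) {
  std::map<int, std::size_t> counts;
  for (const Posting& entry : invertedList) {
    // Ignore every document that is not in the requested set.
    if (wanted.count(entry.document) != 0) {
      counts[entry.document] += entry.positions.size();
    }
  }
  return counts;
}
```

Note that this sketch still scans the whole list for the term. It avoids collecting results for unwanted documents, but any real speed-up would depend on whether the underlying iterator can skip ahead to a given document id.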

     
