Galago Advanced Retrieval Configuration

Mostafa Keikha Stephen Harding

Galago lets users control almost all aspects of the system by setting the corresponding parameters. Non-developer users can choose among existing functionality to change the behavior of the system. Developers can further modify or implement their own classes and plug them into the system. In this document, we describe the most important classes that control the retrieval process. We explain how to select those classes through parameters and how to implement new classes and integrate them into the system.

Processing Models

Processing models define the overall behavior of the retrieval process. Every processing model should extend the org.lemurproject.galago.core.retrieval.processing.ProcessingModel interface, which defines how to process a query through the execute function. Several models are already implemented in Galago and can be selected using the processingModel parameter. The value of this parameter indicates the class that will be used for retrieval. The following example selects a passage retrieval model for executing the query:

~~~~~~~~~~~~~~~~~~~
{
  "casefold" : true,
  "requested" : 10,
  "processingModel" : "org.lemurproject.galago.core.retrieval.processing.RankedPassageModel",
  "passageQuery" : true,
  "passageSize" : 50,
  "passageShift" : 25,
  "queries" : [
    {
      "number" : " 301",
      "text" : " international organized crime"
    }
  ]
}
~~~~~~~~~~~~~~~~~~~~~~~

The most important existing processing models are the following:

  • RankedDocumentModel : Performs straightforward document-at-a-time (DAAT) processing.
  • RankedPassageModel : Performs passage-level retrieval scoring.
  • MaxScoreDocumentModel : Assumes the use of delta functions for scoring, then prunes using MaxScore, which reduces processing time.
  • TwoPassDocumentPassageModel : Performs two-stage retrieval, using document-level retrieval as the first stage and passage-level retrieval as the second stage.
  • WorkingSetDocumentModel : Performs document retrieval over a given set of documents (the working set).
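
To make the passageSize and passageShift values from the example above concrete, the following Python sketch enumerates the passages a sliding window would cover in a document of a given length. It is an illustration only: it assumes that both parameters are measured in terms, that passages start at offset 0, and that the last passage is clipped to the document length. The function name passage_windows is hypothetical, and Galago's RankedPassageModel may treat the final partial window differently.

```python
def passage_windows(doc_length, passage_size, passage_shift):
    """Return (begin, end) term offsets of sliding passages; end is exclusive.

    Assumption: windows start at 0, advance by passage_shift, and the
    final window is clipped to doc_length.
    """
    if passage_size <= 0 or passage_shift <= 0:
        raise ValueError("passage_size and passage_shift must be positive")
    windows = []
    begin = 0
    while begin < doc_length:
        end = min(begin + passage_size, doc_length)
        windows.append((begin, end))
        if end == doc_length:
            break  # the document is fully covered
        begin += passage_shift
    return windows


# A 120-term document with passageSize=50 and passageShift=25
print(passage_windows(120, 50, 25))
# [(0, 50), (25, 75), (50, 100), (75, 120)]
```

With a shift of half the passage size, consecutive passages overlap by 25 terms, so a match near a passage boundary still falls well inside a neighboring passage.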

If the implemented processing models do not provide the functionality you need, extend the ProcessingModel interface and implement the execute function. Integrating the new model is then as simple as setting the processingModel parameter to the name of the class.

Processing models can also be set at the query level. In the following example, we use passage retrieval for the first query and document retrieval for the second query. This enables developers to implement different retrieval models and use them selectively based on the query properties.

    {
      "casefold" : true,
      "requested" : 10,
      "queries" : [
        {
          "number" : " 301",
          "text" : " international organized crime",
          "processingModel" : "org.lemurproject.galago.core.retrieval.processing.RankedPassageModel",
          "passageSize" : 50,
          "passageShift" : 25,
          "passageQuery" : true
        },
        {
          "number" : " 302",
          "text" : " poliomyelitis and post polio",
          "processingModel" : "org.lemurproject.galago.core.retrieval.processing.RankedDocumentModel"
        }
      ]
    }
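
When the choice of model depends on query properties, a query file like the one above can be generated rather than written by hand. The Python sketch below is one way to do that. The helper names build_query and build_config are hypothetical, and the decision of which queries use passage retrieval is left to the caller. Only parameter names that appear in the examples above are emitted.

```python
import json

PASSAGE_MODEL = "org.lemurproject.galago.core.retrieval.processing.RankedPassageModel"
DOCUMENT_MODEL = "org.lemurproject.galago.core.retrieval.processing.RankedDocumentModel"


def build_query(number, text, use_passages=False, passage_size=50, passage_shift=25):
    """Build one query entry with its own processingModel setting."""
    query = {"number": number, "text": text}
    if use_passages:
        # Passage retrieval needs the passage parameters alongside the model.
        query.update({
            "processingModel": PASSAGE_MODEL,
            "passageQuery": True,
            "passageSize": passage_size,
            "passageShift": passage_shift,
        })
    else:
        query["processingModel"] = DOCUMENT_MODEL
    return query


def build_config(queries, requested=10, casefold=True):
    """Wrap the per-query entries in the top-level configuration."""
    return {"casefold": casefold, "requested": requested, "queries": queries}


config = build_config([
    build_query("301", "international organized crime", use_passages=True),
    build_query("302", "poliomyelitis and post polio"),
])
print(json.dumps(config, indent=2))
```

Redirecting the printed JSON to a file gives a query configuration that can be passed to batch-search in the same way as a hand-written one.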

Controlling Query Stemming

By default, Galago produces both stemmed and non-stemmed index parts. Thus, for every term-related index part, there will be at least two versions. The primary postings part will exist as "postings" and "postings.krovetz", and the same holds for any field parts that were produced. If more than one stemmer was used, for example both the Porter and Krovetz stemmers, there will be a corresponding part for each stemmer.

By default, queries are stemmed using the Krovetz stemmer, and only the corresponding Krovetz-stemmed index parts are used in retrieval. To control whether a query is run against stemmed or unstemmed index data, use the "defaultTextPart" parameter on the command line or in the query configuration file.

To apply a query against unstemmed data from the index, set defaultTextPart to "postings". To apply the query against the Krovetz or Porter stemmed index parts, set defaultTextPart to "postings.krovetz" or "postings.porter". Examine the transformed query output (in verbose mode) to confirm that those parts were used to access the query terms.

  # Query non-stemmed index
  galago batch-search --verbose=true --casefold=true --requested=10 \
      --defaultTextPart=postings --index=/myindexes/ap89.idx \
      /myqueries/my-batch-queries.json

  # Query Krovetz-stemmed index
  galago batch-search --verbose=true --casefold=true --requested=10 \
      --defaultTextPart=postings.krovetz --index=/myindexes/ap89.idx \
      /myqueries/my-batch-queries.json

  # Query Porter-stemmed index
  galago batch-search --verbose=true --casefold=true --requested=10 \
      --defaultTextPart=postings.porter --index=/myindexes/ap89.idx \
      /myqueries/my-batch-queries.json

Note: The defaultTextPart parameter cannot be used as a query-specific parameter. It applies to the entire query set.
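
When comparing stemmers over the same query set, it can be convenient to build the three command lines above from one template. The Python sketch below does only that. The helper name batch_search_command is hypothetical, it uses only the flags shown above, and it assumes the index was built with the corresponding stemmed parts. It prints the commands rather than running them.

```python
import shlex


def batch_search_command(index, query_file, stemmer="krovetz", requested=10):
    """Build the batch-search argument list for the chosen index part.

    stemmer=None selects the unstemmed "postings" part; any other value
    is appended as "postings.<stemmer>".
    """
    text_part = "postings" if stemmer is None else f"postings.{stemmer}"
    return [
        "galago", "batch-search",
        "--verbose=true", "--casefold=true",
        f"--requested={requested}",
        f"--defaultTextPart={text_part}",
        f"--index={index}",
        query_file,
    ]


for stemmer in (None, "krovetz", "porter"):
    cmd = batch_search_command(
        "/myindexes/ap89.idx", "/myqueries/my-batch-queries.json", stemmer
    )
    print(shlex.join(cmd))
```

Each returned list can be passed to subprocess.run if the commands should be executed directly.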

