Galago Query Language

David Fisher sjh

Introduction

This page describes the syntax and some of the operators in the galago query language. It also describes each of the parameters and default values that each operator provides. Galago is an extensible query language, you can add user-defined operators without modifying the source code.

The galago query language is inspired by, and extends the Indri Query Language. While some Indri queries will run in Galago as is, there are some important differences, and some operators that will never exist. A general aim of the Galago project is to retain the functionality of the Indri query language, even if the exact syntax is not preserved.

The galago query language can be described as a query tree. Each nodes in the tree is an Operator. An Operator must have a name. It may have a set of parameters that defines some aspect of the operator's function. It may also have one or more child nodes. A good way of considering each node in the tree is as a function to be computed for each document in the collection. The function (operator) has two types of parameters: static parameters that do not depend on the document currently being scored; and values that do depend on the document.

One important feature of the Galago Query Language is the presence of traversals. A traversal transforms a query tree from one form into another. Traversals are used internally to annotate or transform query nodes with additional information that can be extracted from the retrieval object. Traversals are also used to make the query more efficient to process. An important consequence is that complex retrieval models can be generated within a traversal, for example:

#sdm( term term )

Is transformed through a traversal to a computable representation of the sequential dependence model:

#combine:0=0.85:1=0.1:2=0.05( #combine(term term) #combine(#od:1(term term) #combine(#uw:8(term term )))

See [Galago Traversals] for more information on these operations.

Syntax

Nodes are the basic unit of a query. They are composed to form a query tree:

 node := '#' <operator> [ ':' [ <key> '='] <value> ]* '(' [ <node> ]* ')'
 operator := [ a-z A-Z 0-9 ] +
 key := <value>
 value := [^ '=' '(' ')' ] +

So, each node is a function named <operator>, the document-independent parameters are specified as a list of key=value pairs, the document dependent parameters are specified as a list of child nodes. If a key is not provided, galago will assume that `default' is the implied key. Note that the set of keys should be unique - duplicate keys will overwrite previous keys.

For example:

#combine(
    #dirichlet:collectionLength=252013235:mu=1500:nodeFrequency=47411(
        #lengths:document:part=lengths()
        #counts:white:part=postings.krovetz()
      )
    #dirichlet:collectionLength=252013235:mu=1500:nodeFrequency=105476(
        #lengths:document:part=lengths()
        #counts:house:part=postings.krovetz()
      )
  )

This example node can compute the query likelihood score for the query white house, for each document in the collection. The innermost nodes (#lengths, and #counts) both extract statistics from the index, specifically they read data from the files specified by the part parameter.

The #dirichlet nodes compute the smoothed probability that of the term in the current document. The computed document probability is smoothed using the collection probability (nodeFrequency / collectionLength). The smoothing parameter mu determines the average document length. To avoid underflow errors, this node return the log probability rather than the raw probability.

The outer #combine node performs a weighted sum of the scores produced by the inner nodes. In this example no weights are specified, so a uniform weighting is used. Given that the output from the inner nodes are log probabilities, the weighted sum is equivalent to a weighted geometric mean over the raw probabilities.

A [Query Processing Model] is the function that iterates over all documents in the collection, executing this query tree to score each document and returns the top-k documents. See [Query Processing Model] for more details.

Query Parsing

For ease of use, the query parser provides some syntactically equivalent strings:

First, plain text term is parsed using an implicit 'text' operator.

a --> #text : a ()

Second, a string of plain text terms are wrapped in an enclosing root operator.

a b c --> #root( #text:a() #text:b() #text:c() )

Third, occasionally it may be necessary to force a particular string or parameter to be a string rather than an integer. For example:

#operator:key:1234()

the value for the key is parsed to be an integer. You can force the parameter to be considered a string using a specific quotation method.

#operator:key:@/1234/()

This method can be handy if unusual query terms must be entered. Note that the @ indicates that the next character / delimits the start and end of the quoted string. Any character can replace the / character. In the following example a whole query tree is specified as a parameter to a particular operator. `*' is used to delimit the string.

#operator:key:@*#combine( a b )*()


Related

Wiki: Galago Traversals
Wiki: Galago
Wiki: Home
Wiki: Query Processing Model