Galago Quick Configuration

David Fisher Marc Cartright Michael Zarozinski Mostafa Keikha Stephen Harding

This section describes the ways Galago can be configured, including configuration across a DRMAA-enabled computing cluster and batch-level query-processing behavior. Parameters can be stored in files or supplied on the command line as arguments. Both formats are discussed below.

File-Based Parameters

These examples only work with version 3.1 or newer; earlier versions used XML parameters.

All file-based configuration uses JSON to structure information for indexing and retrieval. See http://www.json.org/

{ 
 "key" : <value>
}

Command-Line Parameters

The command-line parameters allow you to specify nested key/value pairs as if they were appearing in a JSON file. We first describe how to specify primitive values. If we begin with the following file-based example:

{
  "key1" : "string",
  "key2" : true,
  "key3" : 5,
  "key4" : 3.14159
}

These are all expressed in the same way on the command line: --<key>=<value>
Therefore, as a set of command-line parameters, the above information would be

--key1=string --key2=true --key3=5 --key4=3.14159

In each case, the type of the value is inferred: key1 holds a string, key2 a boolean, key3 an integer (technically a long), and key4 a double. If key3 is meant to hold a floating-point value, then it must be written explicitly as such: --key3=5.0.

Note that if the string contains spaces, then the whole value should be surrounded by double quotes:

--comment="This is an example comment."
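The inference rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not Galago's own parser: the function names infer_value and parse_flag are hypothetical, and Python's int() and float() accept a few spellings (such as "inf") that Galago may treat differently.

```python
def infer_value(raw):
    """Infer a typed value from the text after '=' (illustrative rules only)."""
    if raw in ("true", "false"):
        return raw == "true"          # boolean
    try:
        return int(raw)               # "5" -> integer (a long in Galago)
    except ValueError:
        pass
    try:
        return float(raw)             # "3.14159" or "5.0" -> double
    except ValueError:
        return raw                    # anything else stays a string


def parse_flag(arg):
    """Split '--key=value' into a (key, typed value) pair."""
    key, _, raw = arg[2:].partition("=")
    return key, infer_value(raw)


args = ["--key1=string", "--key2=true", "--key3=5", "--key4=3.14159", "--key5=5.0"]
params = dict(parse_flag(a) for a in args)
print(params)
```

Here key5 is an extra, made-up parameter that shows how writing 5.0 instead of 5 yields a floating-point value.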

Specifying Arrays and Maps on the Command Line

Container types require a little more thought, but they are not much more difficult. Suppose we have the following parameter file:

{
  "arrkey" : [ val1, val2, val3 ],
  "mapkey" : { "a" : 5,
               "b" : false
             }
}

On the command line, arrays are modified with the following: --<key>+<value>,
and maps are modified via --<k1>/<k2>=<value>. On the command line, these parameters are:

--arrkey+val1 --arrkey+val2 --arrkey+val3 --mapkey/a=5 --mapkey/b=false

Note that val1, val2, and val3 must all be of the same type.
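As a rough sketch of how the --<key>+<value> and --<k1>/<k2>=<value> forms map onto nested JSON, the following Python builds the equivalent structure from the command line shown above. The helper names convert and apply_flag are hypothetical, and checking for "=" before "+" is an assumption made for this example, not a documented Galago rule.

```python
import json


def convert(raw):
    """Infer bool, integer, float, or string from the raw text."""
    if raw in ("true", "false"):
        return raw == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw


def apply_flag(params, arg):
    """Apply one command-line parameter to the nested params dict."""
    body = arg[2:]  # drop the leading "--"
    if "=" in body:
        # --k1/k2=value sets a (possibly nested) map entry
        path, _, raw = body.partition("=")
        *parents, leaf = path.split("/")
        node = params
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = convert(raw)
    elif "+" in body:
        # --key+value appends to an array
        key, _, raw = body.partition("+")
        params.setdefault(key, []).append(convert(raw))
    else:
        raise ValueError("unrecognized parameter: " + arg)


params = {}
line = "--arrkey+val1 --arrkey+val2 --arrkey+val3 --mapkey/a=5 --mapkey/b=false"
for arg in line.split():
    apply_flag(params, arg)
print(json.dumps(params))
```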

Cluster-wide configuration: galago.conf

The galago.conf file allows you to specify parameters that are constant for a given system. The file should be written to the home directory, specifically: ~/.galago.conf

This file should be a parameter file. No particular settings are required; default values are used if the file does not exist.

Current parameters include "tmpdir" and "drmaa". The "tmpdir" parameter specifies a writable location for temporary data; the default is the value of the system variable $TMPDIR. It can define either a single temporary storage location or a list of them. A list is treated as a sequence of back-off locations: before temporary files are written, the locations are checked to ensure there is sufficient storage space.

{
 "tmpdir" : "/path/to/folder"
}

{
 "tmpdir" : ["/path/to/folder_1", "/path/to/folder_2"]
}
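The back-off behavior for a list of temporary locations can be pictured with the Python sketch below. It assumes the first location with enough free space wins; the function name pick_tmpdir and the required_bytes threshold are hypothetical, and Galago's actual space check may differ.

```python
import shutil
import tempfile


def pick_tmpdir(candidates, required_bytes):
    """Return the first candidate with at least required_bytes free, else None."""
    for path in candidates:
        try:
            free = shutil.disk_usage(path).free
        except OSError:
            continue  # missing or unreadable location: back off to the next one
        if free >= required_bytes:
            return path
    return None


# The first entry does not exist, so the search backs off to the system temp dir.
candidates = ["/no/such/dir/galago-example", tempfile.gettempdir()]
chosen = pick_tmpdir(candidates, 0)
print(chosen)
```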

The "drmaa" parameter controls the submission and running of jobs on remote processing nodes through a DRMAA interface (http://www.drmaa.org/):

{
 "drmaa" : 
  {
   "mem" : "####(g|m)" ,
   "nativeSpec" : "..." ,
   "nativeSpecEach" : "..." ,
   "nativeSpecCombined" : "..."
  }
}

The value for the "drmaa" key should be a mapping of parameters for the DRMAA interface. "mem" specifies the amount of memory to use at each node. "nativeSpec" is a string that lists parameters for the DRMAA interface. The "nativeSpecEach" and "nativeSpecCombined" parameters provide control over job submission for their respective types of stages.

The set of parameters stored in this file will likely be extended in the future.

Example config files

Non-cluster computer:

$ cat ~/.galago.conf
{
 "tmpdir" : "/tmp/galagoTmp"
}

DRMAA-supported cluster 1:

$ cat ~/.galago.conf
{
 "tmpdir" : ["/tmp/galagoTmp", "/mnt/nfs/shared/raid/storage" ],
 "drmaa" : 
  {
   "mem" : "3500m" ,
   "nativeSpec" : "-l mem_free=4G -l mem_token=4G"
  }
}

DRMAA-supported cluster 2:

$ cat ~/.galago.conf
{
 "tmpdir" : "/tmp/galagoTmp" ,
 "drmaa" : 
  {
   "mem" : "3500m" ,
   "nativeSpec" : "-l mem_free=4G -l mem_token=4G" ,
   "nativeSpecEach" : "-q all.q" ,
   "nativeSpecCombined" : "-q long.q -l long=TRUE"
  }
}
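To make the defaulting behavior concrete, here is a small Python sketch that reads a galago.conf-style file and falls back to $TMPDIR when the file or the "tmpdir" key is absent. The function load_galago_conf is hypothetical and only mirrors the rules described on this page; it is not how Galago itself loads the file.

```python
import json
import os
import tempfile


def load_galago_conf(path, environ):
    """Read the config if present; always return tmpdir as a list."""
    conf = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            conf = json.load(f)
    tmpdir = conf.get("tmpdir", environ.get("TMPDIR"))
    if tmpdir is None:
        tmpdirs = []
    elif isinstance(tmpdir, str):
        tmpdirs = [tmpdir]          # single location
    else:
        tmpdirs = list(tmpdir)      # list of back-off locations
    return {"tmpdir": tmpdirs, "drmaa": conf.get("drmaa", {})}


with tempfile.TemporaryDirectory() as d:
    conf_path = os.path.join(d, ".galago.conf")
    # No file yet: the TMPDIR value is used.
    fallback = load_galago_conf(conf_path, {"TMPDIR": "/tmp"})
    with open(conf_path, "w", encoding="utf-8") as f:
        json.dump({"tmpdir": ["/tmp/galagoTmp", "/mnt/nfs/shared/raid/storage"],
                   "drmaa": {"mem": "3500m"}}, f)
    loaded = load_galago_conf(conf_path, {"TMPDIR": "/tmp"})

print(fallback)
print(loaded)
```

In practice the path would be os.path.expanduser("~/.galago.conf") and the environment would be os.environ; a temporary directory is used here so the example does not touch a real configuration.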

Indexing Configuration Files

The typical command to have Galago build an index is galago build <command-line parameters|parameter file>+

Basic configuration files for indexing often specify the following:

{
 "inputPath" : "./inputdata",
 "indexPath" : "./myindex",
 "mode" : "drmaa",
 "port" : 8000,
 "galagoJobDir" : "/tmp/galago",
 "deleteJobDir" : false,
 "distrib" : 30
}

where each field means:

inputPath : String or array of strings. The path to the input file(s) for indexing. Any directory encountered will be recursively traversed to look for files to add to the input list.

indexPath : String. The path to the resulting index.

mode : One of drmaa, local, or threaded. Indicates how Galago should operate. In drmaa mode, Galago submits jobs to a DRMAA-compliant job scheduling queue, which it monitors for indexing progress. In local mode, Galago uses a single thread for the entire indexing process, so stages are completed serially. In threaded mode, Galago estimates the number of processors on the host machine and spawns an equivalent number of threads to execute indexing tasks. drmaa and threaded are preferred for faster execution; local is useful for debugging.

port : Integer. The port on which the Galago status page can be viewed in a browser during indexing. Given the parameters above, the indexing status can be seen at http://localhost:8000.

galagoJobDir : String. The location of intermediate stage files, job status, and output/error files. Must be a location large enough to hold the intermediate indexing data. In drmaa mode, this must be a location accessible by both stage executor and job executor processes (network-mounted storage is typically sufficient for this task).

deleteJobDir : Boolean. If true, the path specified by galagoJobDir will be deleted after the job executor determines all stages are complete. Otherwise, the job files will remain, and will need to be deleted manually. Leave false if debugging to examine intermediate output and error files.

distrib : Integer. Only used if mode is drmaa. This specifies the number of splits to use when distributing a stage among multiple machines in a cluster. During standard indexing, the major task that is distributed is the parsing stage.
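A parameter file like the one above can also be generated from a script. The following Python sketch writes one and runs two sanity checks taken from the field descriptions (the allowed mode values, and distrib applying only to drmaa mode). The helper check_build_params and the file name build.json are hypothetical; Galago performs its own validation.

```python
import json
import os
import tempfile

MODES = ("drmaa", "local", "threaded")


def check_build_params(params):
    """Return a list of problems found in the parameters (empty if none)."""
    problems = []
    if "mode" in params and params["mode"] not in MODES:
        problems.append("mode must be one of " + ", ".join(MODES))
    if "distrib" in params and params.get("mode") != "drmaa":
        problems.append("distrib is only used when mode is drmaa")
    return problems


build_params = {
    "inputPath": "./inputdata",
    "indexPath": "./myindex",
    "mode": "threaded",
    "port": 8000,
    "galagoJobDir": "/tmp/galago",
    "deleteJobDir": False,
}

problems = check_build_params(build_params)
params_path = os.path.join(tempfile.mkdtemp(), "build.json")
with open(params_path, "w", encoding="utf-8") as f:
    json.dump(build_params, f, indent=1)

# The command to run, and where the status page would be served.
command = ["galago", "build", params_path]
status_url = "http://localhost:{}".format(build_params["port"])
print(" ".join(command))
print(status_url)
```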

Parser-Specific Parameters

The parser class may be specified via parameters, as well as any parameters that are specific only to the parser class:

{
 "inputPath" : "./inputdata",
 "indexPath" : "./myindex",
 "mode" : "drmaa",
 "port" : 8000,
 "galagoJobDir" : "/tmp/galago",
 "deleteJobDir" : false,
 "distrib" : 30,
 "parser" : {
       "class" : "org.lemurproject.galago.core.parse.UniversalParser",
       "externalParsers" : [
            {
             "class" : "myparsers.IMDBDocumentParser",
             "filetype" : "imdb"
            }
                           ]
             },
 "filetype" : "warc"
}

In the example above, the class that will be used in the parsing step of indexing is the UniversalParser, which is the default parser (we state it explicitly here only for illustration). The UniversalParser acts as a routing mechanism to select instance-specific parsers for an input file type. For example, input splits that follow an <html>...</html> pattern, or end in html, will be inferred as html-type documents, and the FileParser class, which correctly parses SGML-style documents, will be used.

In the above example, we also add a new instance-specific parser, the IMDBDocumentParser, which must extend the DocumentStreamParser abstract class. We additionally associate this parser with documents of imdb type, which may be inferred by the UniversalParser (in the case of additional externally defined parsers, the only reliable inference available to the UniversalParser is to examine the file extensions of the input files).

Additionally, you can force the UniversalParser to treat all input documents as a particular type by defining the filetype property in the parameters to the parser. In the case above, we tell the UniversalParser that all input files are of type warc, causing the UniversalParser to always instantiate WARCParser instance-specific parsers.

Galago tries to detect the type of each document automatically from the file extension or from the content of the file (by analyzing the top 100 lines of each file). However, it is sometimes easier to define the file type explicitly via the filetype parameter. Accepted values include warc, trectext, and trecweb.

Tokenizer-Specific Parameters

The tokenization class may be specified via parameters, as well as any specific parameters that should be forwarded to the tokenization class:

{
 "inputPath" : "./inputdata",
 "indexPath" : "./myindex",
 "mode" : "drmaa",
 "port" : 8000,
 "galagoJobDir" : "/tmp/galago",
 "deleteJobDir" : false,
 "distrib" : 30,
 "tokenizer" : {
       "class" : "org.lemurproject.galago.core.parse.TagTokenizer",
       "fields" : ["a", "title", "p", "meta"]
               }
}

In the above example, we explicitly tell Galago to use the TagTokenizer as the tokenization class. This class is used by default, so stating it here is only for illustration. The fields member specifies the tags in the input documents that should be treated as separate fields for searching; in this case, the a, title, p, and meta fields. Because these fields are stored explicitly, during retrieval we can issue queries such as #combine( #field:a( directions ) ), which specifies the ranking function "Find me documents that have the word 'directions' inside the 'a' tag in the input". For web documents, that is typically the anchor field.

Any key/value pair can be specified and it will be passed to the tokenizer. Say you have compiled in your own tokenizer implementation, and want to filter documents ending with "com" or "gov" using a pattern specified by the parameter filterPattern. You may specify your parameters to this class as such:

"tokenizer" : {
       "class" : "my.institution.FilteringTokenizer",
       "filterPattern" : "(gov|com)$"
               }

Stemming-Specific Parameters

The stemmer parameter should be a list of strings indicating stemmer names. This parameter is only active when the stemmedPostings parameter is true (the default).

Stemmer names should be associated with classes in the stemmerClass parameter. Entries may be automatically generated from the stemmerClass parameter.

The default is:

{
  "stemmer": ["krovetz"],
  "stemmerClass" : {"krovetz" : "org.lemurproject.galago.core.parse.stem.KrovetzStemmer"}
}

Using Metadata for Filtering Operations

Sometimes you may want to filter documents during retrieval based on some comparable attribute of the document. This information must be stored at index time in order to make it accessible during retrieval. This is specified in the tokenizer-level parameters using the formats key:

"tokenizer" : {
       "class" : "org.lemurproject.galago.core.parse.TagTokenizer",
       "formats" : {
             "timestamp" : "long",
             "spamminess" : "double" 
                   }
}

Using these parameters, Galago will associate a single timestamp and a single spamminess value with every input document that has the fields. Unlike the fields specification described above, these values must conform to their specified type, as they are expected to be comparable. Given the parameters above, an input document is expected to look something like:

<doc>
<timestamp>1000432434</timestamp>
<spamminess>0.6</spamminess>
...
</doc>

During retrieval, these fields are then available for filtering via the #require and #reject operators. For example, to filter out all documents with a spamminess greater than 0.75:

#reject( #greater( spamminess 0.75 ) <query if passes filter> )
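The effect of typed metadata plus a #reject filter can be approximated outside Galago with the Python sketch below. It only illustrates the semantics: the document identifiers, the CASTS table, and the typed_metadata helper are made up for this example, and it assumes every document carries the spamminess field.

```python
# Declared formats, as in the tokenizer parameters above.
formats = {"timestamp": "long", "spamminess": "double"}
CASTS = {"long": int, "double": float}


def typed_metadata(raw_fields, formats):
    """Convert raw tag text to its declared type; ValueError if it does not conform."""
    return {name: CASTS[kind](raw_fields[name])
            for name, kind in formats.items() if name in raw_fields}


docs = {
    "doc-1": {"timestamp": "1000432434", "spamminess": "0.6"},
    "doc-2": {"timestamp": "1000432500", "spamminess": "0.9"},
}
typed = {doc_id: typed_metadata(raw, formats) for doc_id, raw in docs.items()}

# Same idea as #reject( #greater( spamminess 0.75 ) ... ): drop documents above 0.75.
kept = [doc_id for doc_id, meta in typed.items() if not meta["spamminess"] > 0.75]
print(kept)
```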

Retrieval Configuration Files

Retrieval can be done using either the search or the batch-search command.

The search command mainly requires the index and corpus parameters, and provides a web interface for searching the index interactively.

The batch-search command is useful for running a batch of queries against the index. A simple invocation of this command looks like this:

galago batch-search --index=/tmp/myindex --requested=200 /tmp/queries.json

where index points to an already built index, requested defines the maximum number of results per query (default=1000), and the last argument is the path to a query file containing the query set in JSON format. Each query in the query file has a text field, which contains the text of the query, and a number field, which uniquely identifies the query in the output. An example query file looks like the following:

{
 "casefold" : true,
 "queries" : [
   {
    "number" : "CACM-408",
    "text" : "my query"
   },
   {
    "number" : "WIKI-410",
    "text" : "#combine(another query)"
   }
 ]
}

Galago detects the type of stemmer from the provided index and applies the same stemming method to the queries. It is important not to feed in stemmed queries, because they would be stemmed twice, which would produce unexpected output.

By default, queries are not case-folded; case folding can be turned on by setting the casefold parameter to true. This parameter, like all other query-level parameters, can be defined once for all queries, as in the example above, or separately for each query by placing it inside that query's braces.
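Query files are easy to generate from a script. The Python sketch below builds one file with casefold set once for all queries and another with casefold set on a single query. The helper make_query_file is hypothetical, and the "queries" value is written as a JSON array of query objects, which is an assumption about the expected layout.

```python
import json


def make_query_file(queries, **shared):
    """Build the query-file structure; shared parameters apply to all queries."""
    doc = dict(shared)
    doc["queries"] = [dict(q) for q in queries]
    numbers = [q["number"] for q in doc["queries"]]
    if len(set(numbers)) != len(numbers):
        raise ValueError("query numbers must be unique")
    return doc


# casefold defined once for all queries
shared_doc = make_query_file(
    [
        {"number": "CACM-408", "text": "my query"},
        {"number": "WIKI-410", "text": "#combine(another query)"},
    ],
    casefold=True,
)

# casefold defined on one query only
per_query_doc = make_query_file(
    [
        {"number": "CACM-408", "text": "my query", "casefold": True},
        {"number": "WIKI-410", "text": "#combine(another query)"},
    ]
)

print(json.dumps(shared_doc, indent=1))
```

Writing either structure to a file with json.dump produces an input for the batch-search command shown earlier.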

Simple passage retrieval can be done by setting three parameters: passageQuery is a boolean parameter that must be true for passage retrieval, and passageSize and passageShift are integer parameters that set the size of each passage and the shift between consecutive passages (and therefore their overlap). A simple setting for passageSize and passageShift is 50 and 25, respectively.
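To see how passageSize and passageShift interact, the following Python sketch lists the token windows that a size of 50 and a shift of 25 would produce over a 120-token document. The function passage_windows is hypothetical, and truncating the final window at the end of the document is an assumption; Galago may handle the last passage differently.

```python
def passage_windows(doc_length, passage_size=50, passage_shift=25):
    """Return (start, end) token offsets of overlapping passages, end exclusive."""
    windows = []
    start = 0
    while start < doc_length:
        end = min(start + passage_size, doc_length)
        windows.append((start, end))
        if end == doc_length:
            break  # the document is fully covered
        start += passage_shift
    return windows


windows = passage_windows(120)
print(windows)  # [(0, 50), (25, 75), (50, 100), (75, 120)]
```

With a shift of half the passage size, each passage shares 25 tokens with the one before it.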

