Galago Operators

David Fisher sjh Lemur Project

This page describes the operators available to the Galago search engine.

List of implemented operators

Index file operators (uses part files directly)

  • #counts
  • #extents
  • #field
  • #indicator
  • #lengths
  • #names
  • #neighbors
  • #prior
  • #scores

Feature Factory operators (combines/transforms child operators)

  • #all
  • #any
  • #between
  • #bm25rf
  • #combine
  • #equals
  • #greater
  • #inside
  • #intersect
  • #less
  • #maxscore
  • #null
  • #ordered
  • #od
  • #reject
  • #require
  • #rm
  • #scale
  • #syn
  • #synonym
  • #threshold
  • #unordered
  • #uw

Meta Operators (replaced with operators above during a traversal)

  • #field
  • #fulldep
  • #prms
  • #root
  • #seqdep
  • #text
  • #window
  • #N
  • #odN
  • #uwN

#COMBINE

Return a normalized, weighted sum of the scores produced by each of the operator's children.
The weights are normalized by the sum of the child node weights.

Parameters

Weights for the children of the combine operation may be specified as indexed
parameters, where each index is the zero-based position of a child node. For example:

#combine:0=0.1:1=0.2:2=0.7 (international organized crime.h3)
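
The normalization can be checked by hand. The following is a minimal Python sketch of the arithmetic described above, not Galago's implementation: the function name combine_score and the child scores are hypothetical, the default weight of 1.0 per child is an assumption, and only the weights 0.1, 0.2 and 0.7 come from the example.

```python
def combine_score(child_scores, weights=None):
    """Normalized weighted sum of child scores, as described for #combine.

    Illustrative only: child_scores stands in for the scores produced by the
    child nodes; a weight of 1.0 per child is assumed when none are given.
    """
    if weights is None:
        weights = [1.0] * len(child_scores)
    if len(weights) != len(child_scores):
        raise ValueError("one weight per child is required")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    # Each weight is divided by the sum of all child weights.
    return sum((w / total) * s for w, s in zip(weights, child_scores))


# Hypothetical log-probability scores for three child nodes.
scores = [-1.0, -2.0, -3.0]
print(combine_score(scores, [0.1, 0.2, 0.7]))
print(combine_score(scores, [1, 2, 7]))  # same ratios, so the same result
```

Because the weights are normalized, only their ratios matter: 1:2:7 and 0.1:0.2:0.7 yield the same combined score.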

Example Query
  galago batch-search --verbose=true --requested=5 \
                      --index=/myindexes/robust04.idx \
                      --query="#combine:0=0.1:1=0.2:2=0.7 (international organized crime.h3)"

   Mar 09, 2016 11:54:37 AM org.lemurproject.galago.core.tools.apps.BatchSearch run
   INFO: RUNNING: unk-0 : #combine:0=0.1:1=0.2:2=0.7(international organized crime.h3)
   Mar 09, 2016 11:54:37 AM org.lemurproject.galago.core.tools.apps.BatchSearch run
   INFO: Transformed Query:
   #combine:0=0.1:1=0.2:2=0.7:w=1.0(
     #dirichlet:collectionLength=252359881:maximumCount=326:nodeFrequency=174191:w=0.1(
       #lengths:document:part=lengths()
       #counts:international:part=postings.krovetz()
     )
     #dirichlet:collectionLength=252359881:maximumCount=111:nodeFrequency=11401:w=0.2(
       #lengths:document:part=lengths()
       #counts:organized:part=postings.krovetz()
     )
     #dirichlet:collectionLength=252359881:maximumCount=1:nodeFrequency=1:w=0.7(
       #lengths:document:part=lengths()
       #counts:crime:part=field.krovetz.h3()
     )
   )

   unk-0 Q0 FBIS3-8153 1 -8.07971877 galago
   unk-0 Q0 LA121990-0141 2 -15.50911595 galago
   unk-0 Q0 LA102290-0116 3 -15.51177100 galago
   unk-0 Q0 FBIS4-54904 4 -15.51292184 galago
   unk-0 Q0 FBIS4-19535 5 -15.51592076 galago

#SDM() Sequential Dependence Model

A model that assumes dependencies between adjacent query terms. It is implemented by the core.retrieval.traversal.SequentialDependenceTraversal class via the #sdm or
#seqdep operators.

The traversal produces a combined query consisting of unigram, ordered distance and
unordered distance components built from the original query terms. Default component
weights are 0.8 for unigrams, 0.15 for ordered distance and 0.05 for unordered window.
The weight of each query component is divided equally among the term groupings in
that component. The traversal uses Dirichlet smoothing by default.

  #sdm( term1 term2 ... termk ) becomes

    #combine ( 0.8  #combine ( term1 term2 ... termk )
               0.15 #combine ( #od(term1 term2)
                               #od(term2 term3) ...
                               #od(termk-1 termk) )
               0.05 #combine ( #uw8(term1 term2) ...
                               #uw8(termk-1 termk) )
             )
Parameters
  • uniw Unigram query component weight (default 0.8)
  • odw Ordered window query component weight (default 0.15)
  • uww Unordered window query component weight (default 0.05)
  • windowLimit Window proximity limit (default 2)
  • fast Faster operators (true/false) (bigram/ubigram operators)
  • sdm.od.op Ordered window operator (default "ordered")
  • sdm.uw.op Unordered window operator (default "unordered")
  • sdm.od.width Window width (default 1)
  • sdm.uw.width Window width (default 4)
Example Configuration File
    {
      "verbose"   : true,
      "casefold"  : true,
      "requested" : 5,
      "index"     : "/myqueries/qrys_sdm.json",
      "queries" : [
        {
          "number" : "sdm",
          "text"   : "#sdm(weatherman new york)",
          "uniw"   : 0.65,
          "odw"    : 0.20,
          "uww"    : 0.15,
          "windowLimit" : 3
        }
      ]
    }

NOTE: With a windowLimit of 3 instead of the default 2, the transformed query
      contains three-word as well as two-word groupings for the distance operations.

      The unordered window size is also increased for the additional word
      groupings.

      The weight of each of the three query components is divided equally among
      its term groupings, so the grouping weights sum to the component weight
      specified in the configuration file.
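
The grouping and weight division described in the NOTE can be sketched in Python. This is an illustration, not the SequentialDependenceTraversal code: the function name sdm_nodes and its tuple layout are hypothetical, and the rule that the unordered width is sdm.uw.width times the number of terms in a grouping is an assumption inferred from the widths 8 and 12 in the transformed query of this section.

```python
def sdm_nodes(terms, uniw=0.8, odw=0.15, uww=0.05,
              window_limit=2, od_width=1, uw_width=4):
    """List (operator, terms, width, weight) tuples for an #sdm expansion."""
    # Contiguous groupings of 2..window_limit adjacent terms, shortest first.
    groups = [tuple(terms[i:i + n])
              for n in range(2, window_limit + 1)
              for i in range(len(terms) - n + 1)]
    # The unigram weight is shared equally by the individual terms.
    nodes = [("unigram", (t,), None, uniw / len(terms)) for t in terms]
    # Ordered windows use the same width for every grouping.
    nodes += [("ordered", g, od_width, odw / len(groups)) for g in groups]
    # Assumption: the unordered width grows as uw_width * terms in the grouping.
    nodes += [("unordered", g, uw_width * len(g), uww / len(groups))
              for g in groups]
    return nodes


for node in sdm_nodes(["weatherman", "new", "york"],
                      uniw=0.65, odw=0.20, uww=0.15, window_limit=3):
    print(node)
```

With the configuration above this yields nine nodes: three unigrams, three ordered windows and three unordered windows, each carrying one third of its component weight.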
Example Query
    galago batch-search /myqueries/qrys_sdm.json

    Mar 09, 2016 9:30:20 AM org.lemurproject.galago.core.tools.apps.BatchSearch run
    INFO: RUNNING: sdm : #sdm:uniw=0.65:odw=0.20:uww=0.15:windowlimit=3(weatherman new york)
    Mar 09, 2016 9:30:20 AM org.lemurproject.galago.core.tools.apps.BatchSearch run
    INFO: Transformed Query:
     #combine:0=0.21666666666666667:1=0.21666666666666667:2=0.21666666666666667
                       :3=0.06666666666666667:4=0.06666666666666667:5=0.06666666666666667
                       :6=0.049999999999999996:7=0.049999999999999996
                       :8=0.049999999999999996:w=1.0(
        #dirichlet:collectionLength=3801748:maximumCount=1
                         :nodeFrequency=12:w=0.21666666666666667(
           #lengths:document:part=lengths()
           #counts:weatherman:part=postings.krovetz()
        )
        #dirichlet:collectionLength=3801748:maximumCount=22
                          :nodeFrequency=9878:w=0.21666666666666667(
          #lengths:document:part=lengths()
          #counts:new:part=postings.krovetz()
        )
        #dirichlet:collectionLength=3801748:maximumCount=13
                         :nodeFrequency=2986:w=0.21666666666666667(
          #lengths:document:part=lengths()
          #counts:york:part=postings.krovetz()
        )
        #dirichlet:collectionLength=3801748:maximumCount=0
                         :nodeFrequency=0:w=0.06666666666666667(
          #lengths:document:part=lengths()
          #ordered:1(
            #extents:weatherman:part=postings.krovetz()
            #extents:new:part=postings.krovetz()
          )
        )
        #dirichlet:collectionLength=3801748:maximumCount=13
                         :nodeFrequency=2970:w=0.06666666666666667(
          #lengths:document:part=lengths()
          #ordered:1(
            #extents:new:part=postings.krovetz()
            #extents:york:part=postings.krovetz()
          )
        )
        #dirichlet:collectionLength=3801748:maximumCount=0
                         :nodeFrequency=0:w=0.06666666666666667(
          #lengths:document:part=lengths()
          #ordered:1(
            #extents:weatherman:part=postings.krovetz()
            #extents:new:part=postings.krovetz()
            #extents:york:part=postings.krovetz()
          )
        )
        #dirichlet:collectionLength=3801748:maximumCount=0
                         :nodeFrequency=0:w=0.049999999999999996(
          #lengths:document:part=lengths()
          #unordered:8(
            #extents:weatherman:part=postings.krovetz()
            #extents:new:part=postings.krovetz()
          )
        )
        #dirichlet:collectionLength=3801748:maximumCount=17
                         :nodeFrequency=3118:w=0.049999999999999996(
          #lengths:document:part=lengths()
          #unordered:8(
            #extents:new:part=postings.krovetz()
            #extents:york:part=postings.krovetz()
          )
        )
        #dirichlet:collectionLength=3801748:maximumCount=0
                         :nodeFrequency=0:w=0.049999999999999996(
          #lengths:document:part=lengths()
          #unordered:12(
            #extents:weatherman:part=postings.krovetz()
            #extents:new:part=postings.krovetz()
            #extents:york:part=postings.krovetz()
          )
        )
     )

     sdm Q0 AP890110-0137 1 -8.63710973 galago
     sdm Q0 AP890111-0014 2 -8.75586082 galago
     sdm Q0 AP890120-0172 3 -8.94746306 galago
     sdm Q0 AP890109-0244 4 -8.98105408 galago
     sdm Q0 AP890119-0213 5 -8.99652757 galago

#FDM() Full Dependence Model

Implemented by the core.retrieval.traversal.FullDependenceTraversal class via the #fdm or
#fulldep operators.

The model transforms original queries into the following form:

  #fdm ( term1 term2 term3 ) -->

  #combine ( 0.8   term1 term2 term3

             0.15  #od:1 ( term1 term2 )
                   #od:1 ( term1 term3 )
                   #od:1 ( term2 term3 )
                   #od:1 ( term1 term2 term3 )

             0.05  #uw:8 ( term1 term2 )
                   #uw:8 ( term1 term3 )
                   #uw:8 ( term2 term3 )
                   #uw:12 ( term1 term2 term3 )
  )

Note: The component weights are divided by the number of unigram, odN and uwN
operations performed, respectively. Unordered window distances may be augmented when
the number of query terms exceeds the windowLimit setting.
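
Unlike the sequential model, the full dependence model also groups non-adjacent terms. The Python sketch below illustrates this using order-preserving combinations of the query terms. It is not the FullDependenceTraversal code: the name fdm_nodes is hypothetical, and the unordered width of 4 per term is an assumption taken from the #uw:8 and #uw:12 widths shown in the transformation above.

```python
from itertools import combinations


def fdm_nodes(terms, uniw=0.8, odw=0.15, uww=0.05, window_limit=3, uw_per_term=4):
    """List (operator, terms, width, weight) tuples for an #fdm expansion."""
    # Every order-preserving subset of 2..window_limit terms, adjacent or not.
    groups = [g
              for n in range(2, window_limit + 1)
              for g in combinations(terms, n)]
    nodes = [("unigram", (t,), None, uniw / len(terms)) for t in terms]
    nodes += [("od", g, 1, odw / len(groups)) for g in groups]
    # Assumption: unordered width is uw_per_term * terms in the grouping.
    nodes += [("uw", g, uw_per_term * len(g), uww / len(groups)) for g in groups]
    return nodes


for op, group, width, weight in fdm_nodes(["term1", "term2", "term3"]):
    print(op, group, width, round(weight, 4))
```

For three terms this gives three unigrams plus four ordered and four unordered groupings, so the default component weights become 0.8 / 3, 0.15 / 4 and 0.05 / 4 per node.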

Parameters
  • uniw Unigram weight (default 0.8)
  • odw Ordered distance weight (default 0.15)
  • uww Unordered window weight (default 0.05)
  • windowLimit Maximum term groupings (default 3)
Example Configuration File
    {
      "verbose"   : true,
      "casefold"  : true,
      "requested" : 5,
      "index"     : "/myindexes/robust04.idx",
      "uniw"    : 0.75,
      "odw"     : 0.15,
      "uww"     : 0.10,
      "queries" : [
        {
          "number" : "fdm",
          "text"   : "#fdm (international organized crime)"
        }
      ]
    }
Example Query
     galago batch-search --verbose=true --requested=5 \
                         --index=/myindexes/robust04.idx \
                         --query="#fdm(international organized crime)"

     Mar 10, 2016 11:35:24 AM org.lemurproject.galago.core.tools.apps.BatchSearch run
     INFO: RUNNING: unk-0 : #fdm(international organized crime)
     Mar 10, 2016 11:35:24 AM org.lemurproject.galago.core.tools.apps.BatchSearch run
     INFO: Transformed Query:
     #combine:0=0.26666666666666666:1=0.26666666666666666:10=0.0125
                       :2=0.26666666666666666:3=0.0375:4=0.0375:5=0.0375:6=0.0375
                       :7=0.0125:8=0.0125:9=0.0125:w=1.0(
        #dirichlet:collectionLength=252359881:maximumCount=326
                            :nodeFrequency=174191:w=0.2666666666666667(
          #lengths:document:part=lengths()
          #counts:international:part=postings.krovetz()
        )
        #dirichlet:collectionLength=252359881:maximumCount=111
                            :nodeFrequency=11401:w=0.2666666666666667(
           #lengths:document:part=lengths()
           #counts:organized:part=postings.krovetz()
        )
        #dirichlet:collectionLength=252359881:maximumCount=68
                            :nodeFrequency=30997:w=0.2666666666666667(
          #lengths:document:part=lengths()
          #counts:crime:part=postings.krovetz()
        )
        #dirichlet:collectionLength=252359881:maximumCount=2
                            :nodeFrequency=22:w=0.037500000000000006(
          #lengths:document:part=lengths()
          #ordered:1(
            #extents:international:part=postings.krovetz()
            #extents:organized:part=postings.krovetz()
          )
        )
        #dirichlet:collectionLength=252359881:maximumCount=2
                            :nodeFrequency=100:w=0.037500000000000006(
         #lengths:document:part=lengths()
         #ordered:1(
           #extents:international:part=postings.krovetz()
           #extents:crime:part=postings.krovetz()
         )
       )
       #dirichlet:collectionLength=252359881:maximumCount=29
                           :nodeFrequency=1744:w=0.037500000000000006(
         #lengths:document:part=lengths()
         #ordered:1(
           #extents:organized:part=postings.krovetz()
           #extents:crime:part=postings.krovetz()
         )
       )
       #dirichlet:collectionLength=252359881:maximumCount=2
                           :nodeFrequency=18:w=0.037500000000000006(
         #lengths:document:part=lengths()
         #ordered:1(
           #extents:international:part=postings.krovetz()
           #extents:organized:part=postings.krovetz()
           #extents:crime:part=postings.krovetz()
         )
       )
       #dirichlet:collectionLength=252359881:maximumCount=3
                           :nodeFrequency=205:w=0.012500000000000004(
         #lengths:document:part=lengths()
         #unordered:8(
           #extents:international:part=postings.krovetz()
           #extents:organized:part=postings.krovetz()
         )
       )
       #dirichlet:collectionLength=252359881:maximumCount=4
                           :nodeFrequency=446:w=0.012500000000000004(
         #lengths:document:part=lengths()
         #unordered:8(
           #extents:international:part=postings.krovetz()
           #extents:crime:part=postings.krovetz()
         )
       )
       #dirichlet:collectionLength=252359881:maximumCount=31
                           :nodeFrequency=1875:w=0.012500000000000004(
         #lengths:document:part=lengths()
         #unordered:8(
           #extents:organized:part=postings.krovetz()
           #extents:crime:part=postings.krovetz()
         )
       )
       #dirichlet:collectionLength=252359881:maximumCount=3
                           :nodeFrequency=85:w=0.012500000000000004(
         #lengths:document:part=lengths()
         #unordered:12(
           #extents:international:part=postings.krovetz()
           #extents:organized:part=postings.krovetz()
           #extents:crime:part=postings.krovetz()
         )
       )
     )

     unk-0 Q0 FBIS3-26415 1 -6.26893740 galago
     unk-0 Q0 FBIS3-41247 2 -6.26893740 galago
     unk-0 Q0 FBIS3-27916 3 -6.65951146 galago
     unk-0 Q0 FBIS3-41108 4 -6.65951146 galago
     unk-0 Q0 FBIS4-41684 5 -6.68566150 galago

#WSDM() Weighted Sequential Dependence Model

Implemented by the core.retrieval.traversal.WeightedSequentialDependenceTraversal class via the #wsdm operator.

The Weighted Sequential Dependence Model is structurally similar to the Sequential Dependence Model; however, the weight of each node is a linear combination of node features. The features are divided into unigram and bigram classes. The operator accepts only term ("text") arguments (no child operations).

Furthermore, the original query terms can be evaluated in bigram and trigram groupings.

  #wsdm( term1 term2 ... termk ) becomes

    #combine ( 0.8  #combine ( term1 term2 ... termk )
               0.15 #combine ( #od(term1 term2)
                               #od(term2 term3) ...
                               #od(termk-1 termk)
                             )
               0.05 #combine ( #uw8(term1 term2)
                               #uw8(term2 term3) ...
                               #uw8(termk-1 termk)
                             )
             )
Parameters
  • verboseWSDM Verbose output of feature values (default false)
  • norm Normalize (default false)
  • wsdmFeatures List of feature definitions
    • 1-const constant (default 0.8)
    • 1-lntf log tf (default 0.0)
    • 1-lndf log df (default 0.0)
    • 2-const (default 0.1)
    • 2-lntf (default 0.0)
    • 2-lndf (default 0.0)
WSDM Feature definitions
    {
      name    : [ "1-const" | "1-lntf" | "1-lndf" | "2-const" | "2-lntf" | "2-lndf" ]
      type    : [ "const", "logtf|logngramtf", "logdf" ] (default "const")
      lambda  : <weight_value>                (defaults: const=1.0  lntf=0.0  lndf=0.0)
      group   : <retrieval_group_name>        (default missing or empty)
      part    : <retrieval_index_part_name>   (default missing or empty)
      unigram : true | false                  can be used on unigrams  (default true)
      bigram  : true | false                  can be used on bigrams   (default false)
      trigram : true | false                  can be used on trigrams  (default false)
    }

    NOTE: CONST type features always have a lambda (weight) of 1.0.

    NOTE: unigram/bigram/trigram values are mutually exclusive, i.e. if unigram is true,
          bi/tri grams must be false; if bigram is true, uni/tri grams must be false,
          etc.
Example Configuration File
    {
      "verbose"   : true,
      "casefold"  : true,
      "requested" : 5,
      "index"     : "/myindexes/robust04.idx",
      "norm"    : true,
      "verboseWSDM" : true,
      "wsdmFeatures" : [
        {
          "name"  : "1-const",
          "type"  : "const",
          "lambda" : 0.7,
          "part"  : "postings",
          "unigram" : true,
          "bigram" : false,
          "trigram" : false
        },
        {
          "name"  : "1-lntf",
          "type"  : "logtf",
          "lambda" : 0.3,
          "part"  : "postings.krovetz",
          "unigram" : true,
          "bigram" : false,
          "trigram" : false
        },
        {
          "name"  : "1-lndf",
          "type"  : "logdf",
          "lambda" : 0.2,
          "part"  : "extents",
          "unigram" : false,
          "bigram" : true,
          "trigram" : false
        },
        {
          "name"  : "2-const",
          "type"  : "const",
          "lambda" : 0.85,
          "part"  : "postings.krovetz",
          "unigram" : true,
          "bigram" : false,
          "trigram" : false
        },
        {
          "name"  : "2-lntf",
          "type"  : "logtf",
          "part"  : "field.krovetz.h3",
          "unigram" : true,
          "bigram" : false,
          "trigram" : false
        },
        {
          "name"  : "2-lndf",
          "type"  : "logdf",
          "lambda" : 0.25,
          "part"  : "field.krovetz.text",
          "unigram" : false,
          "bigram" : false,
          "trigram" : true
        }
      ],
      "queries" : [
        {
          "number" : "wsdm-1",
          "text" : "#wsdm(international organized crime)"
        }
      ]
    }
Example Query
  galago batch-search /myqueries/qrys_wsdm.json

  Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.tools.apps.BatchSearch run
  INFO: RUNNING: wsdm-1 : #wsdm(international organized crime)
  Mar 08, 2016 4:05:15 PM  org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal  computeWeight
  INFO: international -- feature:1-const:0.700000 * 1.00000 = 0.700000
  Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal  computeWeight
  INFO: international -- feature:1-lntf:0.300000 * 12.0679 = 3.62037
  Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal  computeWeight
  INFO: international -- feature:2-const:0.850000 * 1.00000 = 0.850000
  Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal  computeWeight
  INFO: international -- feature:2-lntf:1.00000 * 3.58352 = 3.58352
  Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal  computeWeight
  INFO: organized -- feature:1-const:0.700000 * 1.00000 = 0.700000
  Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal  computeWeight
  INFO: organized -- feature:1-lntf:0.300000 * 9.34146 = 2.80244
  Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal  computeWeight
  INFO: organized -- feature:2-const:0.850000 * 1.00000 = 0.850000
  Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal  computeWeight
  INFO: crime -- feature:1-const:0.700000 * 1.00000 = 0.700000
  Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal  computeWeight
  INFO: crime -- feature:1-lntf:0.300000 * 10.3416 = 3.10249
  Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal  computeWeight
  INFO: crime -- feature:2-const:0.850000 * 1.00000 = 0.850000
  Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal  computeWeight
  INFO: crime -- feature:2-lntf:1.00000 * 0.00000 = 0.00000
  Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal  computeWeight
  INFO: international, organized, crime -- feature:2-lndf:0.250000 * 2.83321 = 0.708303

  #combine:0=8.753891241649924:1=4.352436904950116:2=4.652493711377242
                    :3=0.0:4=0.0:5=0.0:6=0.0:7=0.708303336014054:8=0.708303336014054:norm=true(
     #text:international()
     #text:organized()
     #text:crime()

     #od:1(
       #extents:international()
       #extents:organized()
     )
     #uw:8(
       #extents:international()
       #extents:organized()
     )
     #od:1(
       #extents:organized()
       #extents:crime()
     )
     #uw:8(
       #extents:organized()
       #extents:crime()
     )
     #od:1(
       #extents:international()
       #extents:organized()
       #extents:crime()
     )
     #uw:12(
       #extents:international()
       #extents:organized()
       #extents:crime()
     )
   )

   Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.tools.apps.BatchSearch run
   INFO: Transformed Query:
   #combine:0=8.753891241649924:1=4.352436904950116:2=4.652493711377242
                   :3=0.0:4=0.0:5=0.0:6=0.0:7=0.708303336014054
                   :8=0.708303336014054:norm=true:w=1.0(
      #dirichlet:collectionLength=252359881:maximumCount=326
                            :nodeFrequency=174191:w=0.4565160683607139(
        #lengths:document:part=lengths()
        #counts:international:part=postings.krovetz()
      )
      #dirichlet:collectionLength=252359881:maximumCount=111
                            :nodeFrequency=11401:w=0.2269799028553388(
        #lengths:document:part=lengths()
        #counts:organized:part=postings.krovetz()
      )
      #dirichlet:collectionLength=252359881:maximumCount=68
                            :nodeFrequency=30997:w=0.24262788725149467(
        #lengths:document:part=lengths()
        #counts:crime:part=postings.krovetz()
      )
      #dirichlet:collectionLength=252359881:maximumCount=2
                            :nodeFrequency=22:w=0.0(
        #lengths:document:part=lengths()
        #od:1(
           #extents:international:part=postings.krovetz()
           #extents:organized:part=postings.krovetz()
        )
      )
      #dirichlet:collectionLength=252359881:maximumCount=3
                            :nodeFrequency=205:w=0.0(
         #lengths:document:part=lengths()
         #uw:8(
           #extents:international:part=postings.krovetz()
           #extents:organized:part=postings.krovetz()
         )
      )
      #dirichlet:collectionLength=252359881:maximumCount=29
                            :nodeFrequency=1744:w=0.0(
         #lengths:document:part=lengths()
         #od:1(
           #extents:organized:part=postings.krovetz()
           #extents:crime:part=postings.krovetz()
         )
      )
      #dirichlet:collectionLength=252359881:maximumCount=31
                            :nodeFrequency=1875:w=0.0(
         #lengths:document:part=lengths()
         #uw:8(
           #extents:organized:part=postings.krovetz()
           #extents:crime:part=postings.krovetz()
         )
      )
      #dirichlet:collectionLength=252359881:maximumCount=2
                            :nodeFrequency=18:w=0.03693807076622631(
        #lengths:document:part=lengths()
        #od:1(
          #extents:international:part=postings.krovetz()
          #extents:organized:part=postings.krovetz()
          #extents:crime:part=postings.krovetz()
        )
      )
      #dirichlet:collectionLength=252359881:maximumCount=3
                             :nodeFrequency=85:w=0.03693807076622631(
         #lengths:document:part=lengths()
         #uw:12(
           #extents:international:part=postings.krovetz()
           #extents:organized:part=postings.krovetz()
           #extents:crime:part=postings.krovetz()
         )
      )
    )

     wsdm-1 Q0 FBIS3-26415 1 -6.17248337 galago
     wsdm-1 Q0 FBIS3-41247 2 -6.17248337 galago
     wsdm-1 Q0 FBIS4-41991 3 -6.19015392 galago
     wsdm-1 Q0 FBIS4-38364 4 -6.37916313 galago
     wsdm-1 Q0 FBIS3-19646 5 -6.39789464 galago
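
The node weights in the output above can be reproduced from the verbose log lines. The Python sketch below sums lambda * feature value for each node and then normalizes by the total, which is how the raw weight 8.7539 becomes the w=0.4565 shown on the first #dirichlet node. The helper name node_weight is hypothetical, and the feature values are the rounded figures printed in the log, so results agree only to about four decimal places.

```python
def node_weight(features):
    """Linear combination of features: sum of lambda * feature value."""
    return sum(lam * value for lam, value in features)


# (lambda, feature value) pairs copied from the verbose log lines above.
international = [(0.7, 1.0), (0.3, 12.0679), (0.85, 1.0), (1.0, 3.58352)]
organized = [(0.7, 1.0), (0.3, 9.34146), (0.85, 1.0)]
crime = [(0.7, 1.0), (0.3, 10.3416), (0.85, 1.0), (1.0, 0.0)]
trigram = [(0.25, 2.83321)]

raw = [
    node_weight(international),
    node_weight(organized),
    node_weight(crime),
    0.0, 0.0, 0.0, 0.0,        # the four bigram nodes received no weight
    node_weight(trigram),      # ordered trigram window
    node_weight(trigram),      # unordered trigram window
]

# norm=true: each weight is divided by the sum of all node weights.
total = sum(raw)
normalized = [w / total for w in raw]

for r, n in zip(raw, normalized):
    print(f"{r:.5f} -> {n:.5f}")
```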

#RM() Relevance [Feedback] Model Operator

A relevance feedback model in which the #rm operator defaults to RelevanceModel3.

If the default RelevanceModel3 is used, the original query terms are augmented by the specified
number of feedback expansion terms at the specified weight. If RelevanceModel1 is used, the original
query terms are replaced by the expansion terms.

Parameters
  • relevanceModel org.lemurproject.galago.core.retrieval.prf.RelevanceModel3 by default
  • fbDocs Number of top ranked docs to use in deriving feedback terms (default 20)
  • fbTerm Number of top feedback terms to be added to the query (default 100)
    NOTE: singular "fbTerm" rather than "fbTerms".
  • fbOrigWeight The weight to give to the original query (default 0.25)
  • rmstopwords Use a specified stopword list (exclusion terms). Default is "rmstops"
    residing in galago/core/src/main/resources/stopwords/.
  • rmwhitelist List of inclusion terms. Look in resources/stopwords by default.

Makes use of the ExpansionModelFactory and RelevanceModelTraversal classes.
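
How fbOrigWeight splits the weight between original and expansion terms can be sketched in Python. This is an assumption-laden illustration of a RelevanceModel3-style interpolation, not Galago's code: the function name rm3_weights is hypothetical, and the expansion scores are made-up values. The split it produces is consistent with the example output in this section, where fbOrigWeight = 0.75 gives each of the two original terms 0.375 and the expansion terms share the remaining 0.25.

```python
def rm3_weights(query_terms, expansion_scores, fb_orig_weight=0.25):
    """Interpolate original query terms with feedback expansion terms.

    expansion_scores maps each expansion term to an unnormalized feedback
    score; the scores are rescaled to share (1 - fb_orig_weight).
    """
    weights = {}
    # Original terms share fb_orig_weight equally.
    for term in query_terms:
        weights[term] = weights.get(term, 0.0) + fb_orig_weight / len(query_terms)
    # Expansion terms share the rest in proportion to their scores.
    total = sum(expansion_scores.values())
    for term, score in expansion_scores.items():
        share = (1.0 - fb_orig_weight) * score / total
        weights[term] = weights.get(term, 0.0) + share
    return weights


# Hypothetical feedback scores for three expansion terms.
expansion = {"tass": 0.020, "fire": 0.013, "akopyan": 0.012}
for term, weight in rm3_weights(["six", "survivors"], expansion, 0.75).items():
    print(term, round(weight, 4))
```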

Example Configuration File
    {
      "verbose"   : true,
      "casefold"  : true,
      "requested" : 5,
      "index"     : "/myindexes/ap89_fields.idx",
      "relevanceModel" : "org.lemurproject.galago.core.retrieval.prf.RelevanceModel1",
      "fbDocs" : 10,
      "fbTerm" : 5,
      "fbOrigWeight" : 0.75,
      "passageQuery" : true,  [passage query requires size and shift parameters]
      "passageSize" : 10,
      "passageShift" : 20,
      "extentQuery" : true,
      "rmstopwords" : "rmstop",
      "rmwhitelist" : "/myqueries/whitelist.txt",   [be careful with this one!]
      "rmStemmer" : "org.lemurproject.galago.core.parse.stem.KrovetzStemmer",
      "queries" : [
        {
          "number" : "rm",
          "text" : "#rm(six survivors)"
        }
      ]
    }
Example Query
  galago batch-search /myqueries/qrys_rm.json

  Mar 09, 2016 10:45:06 AM org.lemurproject.galago.core.tools.apps.BatchSearch run
  INFO: RUNNING: rm : #rm(six survivors)
  Mar 09, 2016 10:45:06 AM org.lemurproject.galago.core.tools.apps.BatchSearch run
  INFO: Transformed Query:
  #combine:0=0.375:1=0.375:2=0.11103813303830858:3=0.07096813315615444
                    :4=0.06799373380553697:w=1.0(
     #dirichlet:collectionLength=3801748:maximumCount=8
                           :nodeFrequency=1735:w=0.375(
       #passagelengths(
         #lengths:document:part=lengths()
       )
       #passagefilter(
          #extents:six:part=postings.krovetz()
       )
     )
     #dirichlet:collectionLength=3801748:maximumCount=10
                           :nodeFrequency=230:w=0.375(
       #passagelengths(
         #lengths:document:part=lengths()
       )
       #passagefilter(
         #extents:survivors:part=postings.krovetz()
       )
     )
     #dirichlet:collectionLength=3801748:maximumCount=19
                           :nodeFrequency=352:w=0.11103813303830858(
       #passagelengths(
         #lengths:document:part=lengths()
       )
       #passagefilter(
         #extents:tass:part=postings.krovetz()
       )
     )
     #dirichlet:collectionLength=3801748:maximumCount=22
                           :nodeFrequency=2465:w=0.07096813315615444(
       #passagelengths(
         #lengths:document:part=lengths()
       )
       #passagefilter(
         #extents:fire:part=postings.krovetz()
       )
     )
     #dirichlet:collectionLength=3801748:maximumCount=15
                           :nodeFrequency=72:w=0.06799373380553697(
       #passagelengths(
         #lengths:document:part=lengths()
       )
       #passagefilter(
         #extents:akopyan:part=postings.krovetz()
       )
     )
   )

    rm Q0 AP890112-0108 1 -7.31099958 galago 220 230
    rm Q0 AP890110-0038 2 -7.55503569 galago 560 570
    rm Q0 AP890112-0108 3 -7.55503569 galago 0 10
    rm Q0 AP890103-0047 4 -7.65907614 galago 40 50
    rm Q0 AP890103-0144 5 -7.65907614 galago 0 10

#PRMS() Probabilistic Retrieval Model for Semi-structured Data [PRM-S]

This operator implements a pseudo relevance feedback operation expanding a
query with automatically generated "relevant" terms. It adds these terms to the
original query (RelevanceModel3) or replaces them altogether (RelevanceModel1).

The operator obtains statistics and length information for each specified field.
The original query is expanded into a combination of weighted sums for each query
term over each of the specified fields, using weights as specified for each field.

Given the query "meg ryan war" and the document fields cast, team and title, a #prms
operation should produce a query expansion such as the following:

      #combine( 
        #wsum:w1:w2:w3 ( 
          meg.cast
          meg.team
          meg.title
        )
        #wsum:w1:w2:w3 (
          ryan.cast
          ryan.team
          ryan.title
        )
        #wsum:w1:w2:w3 (
          war.cast
          war.team
          war.title
        )
      )

Implemented by the core.retrieval.traversal.PRMS2Traversal class using the
#prms or #prms2 operators.
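
The shape of the expansion can be reproduced with a few lines of Python. This sketch only builds a string that mirrors the structure shown above, using the indexed weight parameters and term.field notation that appear elsewhere on this page. The function name prms_expansion is hypothetical, the traversal performs this step itself, and the resulting string has not been checked against Galago's query parser.

```python
def prms_expansion(terms, field_weights):
    """Build a #combine of per-term #wsum nodes over the given fields.

    field_weights maps field name -> weight; insertion order of the dict
    determines the child index of each field.
    """
    fields = list(field_weights)
    # Indexed weights, e.g. "0=0.7:1=0.3", one per field.
    params = ":".join(f"{i}={field_weights[f]}" for i, f in enumerate(fields))
    parts = []
    for term in terms:
        children = " ".join(f"{term}.{field}" for field in fields)
        parts.append(f"#wsum:{params}( {children} )")
    return "#combine( " + " ".join(parts) + " )"


print(prms_expansion(["meg", "ryan", "war"],
                     {"cast": 0.5, "team": 0.2, "title": 0.3}))
```

The field weights 0.5, 0.2 and 0.3 are placeholders standing in for w1, w2 and w3.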

Parameters
  • fields The fields from which terms should be evaluated
  • weights The weights for the specified fields
Example Configuration File
    {
      "verbose"   : true,
      "casefold"  : true,
      "requested" : 5,
      "index"     : "/home/harding/work/idx/robust04.idx",
      "relevanceModel" : "org.lemurproject.galago.core.retrieval.prf.RelevanceModel1",
      "fields" : [ "h3", "text" ],
      "weights" : {
        "h3" : 0.7,
        "text" : 0.3
      },
      "queries" : [
        {
          "number" : "prms-jm-rm1",
          "text" : "#prms(international organized crime)",
          "scorer" : "jm"
        }
      ]
    }
Example Query
  galago batch-search /home/harding/work/queries/qrys_prms.json

  Mar 09, 2016 11:16:51 AM org.lemurproject.galago.core.tools.apps.BatchSearch run
  INFO: Transformed Query:
  #combine:norm=false:w=1.0(
    #wsum:0=0.7:1=0.3:w=1.0(
      #jm:collectionLength=29039:lengths=h3:maximumCount=2:nodeFrequency=36:w=0.7(
        #lengths:h3:part=lengths()
        #counts:international:part=field.krovetz.h3()
      )
      #jm:collectionLength=247217451:lengths=text:maximumCount=326
            :nodeFrequency=149205:w=0.3(
        #lengths:text:part=lengths()
        #counts:international:part=field.krovetz.text()
      )
    )
    #wsum:0=0.7:1=0.3:w=1.0(
      #jm:collectionLength=29039:lengths=h3:maximumCount=0:nodeFrequency=0:w=0.7(
        #lengths:h3:part=lengths()
        #counts:organized:part=field.krovetz.h3()
      )
       #jm:collectionLength=247217451:lengths=text:maximumCount=111
             :nodeFrequency=11375:w=0.3(
         #lengths:text:part=lengths()
         #counts:organized:part=field.krovetz.text()
       )
    )
    #wsum:0=0.7:1=0.3:w=1.0(
      #jm:collectionLength=29039:lengths=h3:maximumCount=1:nodeFrequency=1:w=0.7(
        #lengths:h3:part=lengths()
        #counts:crime:part=field.krovetz.h3()
      )
       #jm:collectionLength=247217451:lengths=text:maximumCount=68
             :nodeFrequency=30058:w=0.3(
         #lengths:text:part=lengths()
         #counts:crime:part=field.krovetz.text()
       )
    )
 )

 prms-jm Q0 FT931-1 1        NaN galago 
 prms-jm Q0 FT941-2 2        NaN galago
 prms-jm Q0 LA010189-0002 3        NaN galago
 prms-jm Q0 FBIS3-2 4        NaN galago
 prms-jm Q0 FBIS3-3 5        NaN galago

#PDFR() Proximity Divergence From Randomness Model

The Proximity Divergence from Randomness model assumes that all adjacent pairs of query
terms are dependent. By default, terms are scored with the PL2 scoring model and bigrams
with the BiL2 scoring model; parameters allow other scoring models to be used. Weights for
the term and bigram query components (PL2 and BiL2 scorers) may also be specified. A
document's score is the weighted sum of its term and bigram (bi-term) features.

Implemented by the core.retrieval.traversal.ProximityDFRTraversal class using the #pdfr operator.

  #pdfr ( term1 term2 term3 )  becomes

  #combine (
    w: #pl2:c=6.0  (stats for term1)
    w: #pl2:c=6.0  (stats for term2)
    w: #pl2:c=6.0  (stats for term3)

    w: #bil2:c=0.05 (
         #ordered:5 ( term1 term2 )
    )
    w: #bil2:c=0.05 (
         #ordered:5 ( term2 term3 )
    )
  )

Note: The component weights are divided by the number of pl2 and bil2
operations performed. The window distance is specified by the
windowSize parameter (default 5).

Parameters
  • pdfrSeq (default true)
  • termLambda (default 1.0)
  • c (default 6.0)
  • cp (default 0.05)
  • pdfrTerm (default pl2)
  • pdfrProx (default bil2)
  • windowSize (default 5)
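
As a rough sketch of the expansion described above, the following Python builds the term and adjacent-bigram components for a list of query terms. It is not the ProximityDFRTraversal code. The function name pdfr_components is hypothetical, and the weight split is an assumption: term components share termLambda and bigram components share 1 - termLambda. That split is consistent with the sample output on this page (termLambda of 1.0 gives three term weights of 1/3 and bigram weights of 0.0), but the source does not state it.

```python
def pdfr_components(terms, term_lambda=1.0, c=6.0, cp=0.05, window_size=5):
    """Return (weight, node) pairs for a #pdfr query over `terms`."""
    # Only adjacent pairs of query terms are treated as dependent.
    bigrams = list(zip(terms, terms[1:]))

    # ASSUMPTION: term mass is term_lambda, proximity mass is 1 - term_lambda,
    # each divided by the number of components of that kind.
    term_w = term_lambda / len(terms) if terms else 0.0
    prox_w = (1.0 - term_lambda) / len(bigrams) if bigrams else 0.0

    components = [(term_w, f"#pl2:c={c}({t})") for t in terms]
    components += [
        (prox_w, f"#bil2:c={cp}(#ordered:{window_size}({a} {b}))")
        for a, b in bigrams
    ]
    return components


if __name__ == "__main__":
    for weight, node in pdfr_components(["international", "organized", "crime"]):
        print(weight, node)
```
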
Example Configuration File
    {
      "verbose"   : true,
      "casefold"  : true,
      "requested" : 5,
      "index"     : "/myindexes/robust04.idx",
      "pdfrSeq"    : true,
      "termLambda" : 1.0,
      "c"          : 6.0,
      "cp"         : 0.05,
      "pdfrTerm"   : "pl2",
      "pdfrProx"   : "bil2",
      "windowSize" : 5,
      "queries" : [
        {
          "number" : "pdfr",
          "text"   : "#pdfr (international organized crime)"
        }
      ]
    }
Example Query
      galago batch-search /myqueries/qrys_pdfr.json

      Mar 10, 2016 12:38:43 PM org.lemurproject.galago.core.tools.apps.BatchSearch run
      INFO: RUNNING: fdm : #pdfr (international organized crime.h3)
      Mar 10, 2016 12:38:43 PM org.lemurproject.galago.core.tools.apps.BatchSearch run
      INFO: Transformed Query:
      #combine:0=0.3333333333333333:1=0.3333333333333333:2=0.3333333333333333:3=0.0:4=0.0(
        #pl2:c=6.0:collectionLength=252359881:documentCount=528155:maximumCount=326:nodeFrequency=174191(
          #lengths:document:part=lengths()
          #counts:international:part=postings.krovetz()
        )
        #pl2:c=6.0:collectionLength=252359881:documentCount=528155:maximumCount=111:nodeFrequency=11401(
          #lengths:document:part=lengths()
          #counts:organized:part=postings.krovetz()
        )
        #pl2:c=6.0:collectionLength=252359881:documentCount=528155:maximumCount=1:nodeFrequency=1(
          #lengths:document:part=lengths()
          #counts:crime:part=field.krovetz.h3()
        )
        #bil2:c=0.05:collectionLength=252359881:documentCount=528155(
          #lengths:document:part=lengths()
          #ordered:5(
            #extents:international:part=postings.krovetz()
            #extents:organized:part=postings.krovetz()
          )
        )
        #bil2:c=0.05:collectionLength=252359881:documentCount=528155(
          #lengths:document:part=lengths()
          #ordered:5(
            #extents:organized:part=postings.krovetz()
            #extents:crime:part=field.krovetz.h3()
          )
        )
      )

      pdfr Q0 FBIS3-8153 1 4.99343555 galago
      pdfr Q0 FBIS4-41991 2 3.79434205 galago
      pdfr Q0 LA121990-0141 3 3.65754089 galago
      pdfr Q0 FBIS4-54904 4 3.61571623 galago
      pdfr Q0 FBIS4-38364 5 3.57971170 galago

#DIRICHLET (Smoothing function -- Scorer)

Dirichlet smoothing function, in which the amount of smoothing depends on document length. This is the default smoothing function for all query operators.

Parameters
  • mu Smoothing parameter (default 1500)
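
For orientation, the textbook Dirichlet-smoothed query-likelihood score can be written as a short Python function. This is a sketch of the standard formula, not a copy of Galago's scorer. The function and argument names are illustrative, chosen to mirror the collectionLength and nodeFrequency statistics that appear in the expanded queries on this page.

```python
import math


def dirichlet_score(tf, doc_length, node_frequency, collection_length, mu=1500.0):
    """Log of the Dirichlet-smoothed probability of a term in a document.

    tf                -- occurrences of the term in the document
    doc_length        -- number of tokens in the document
    node_frequency    -- occurrences of the term in the whole collection
    collection_length -- number of tokens in the whole collection
    """
    # Background (collection) probability of the term.
    # Assumes node_frequency > 0; otherwise tf == 0 would give log(0).
    background = node_frequency / collection_length
    # Longer documents rely less on the background model, since mu is fixed.
    return math.log((tf + mu * background) / (doc_length + mu))


if __name__ == "__main__":
    # Statistics taken from the "survivors" example; tf and doc_length are made up.
    print(dirichlet_score(tf=2, doc_length=500, node_frequency=230,
                          collection_length=3801748, mu=1000))
```

Scores are logarithms of probabilities and are therefore negative, as in the sample output for this operator.
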
Example Query
    #dirichlet:mu=1200(international)

Expands to:

    #dirichlet:collectionLength=N:maximumCount=N:mu=1200:nodeFrequency=N:w=0.N (
      #lengths:document:part=lengths()
      #counts:theTerm:part=postings.krovetz()
    )

Parameters listed in the expanded query appear in alphabetical order.

Example Query
  galago batch-search --verbose=true --requested=5 --mu=1000 \
                      --index=/myindexes/ap89_fields.idx --query="#dirichlet(survivors)"

  Mar 09, 2016 1:56:44 PM org.lemurproject.galago.core.tools.apps.BatchSearch run
  INFO: RUNNING: unk-0 : #dirichlet(survivors)
  Mar 09, 2016 1:56:44 PM org.lemurproject.galago.core.tools.apps.BatchSearch run
  INFO: Transformed Query:
  #dirichlet:collectionLength=3801748:maximumCount=10:mu=1000:nodeFrequency=230(
    #lengths:document:part=lengths()
    #counts:survivors:part=postings.krovetz()
  )

  unk-0 Q0 AP890102-0135 1 -5.03674813 galago
  unk-0 Q0 AP890112-0108 2 -5.34195179 galago
  unk-0 Q0 AP890113-0159 3 -5.61024136 galago
  unk-0 Q0 AP890102-0044 4 -5.64420944 galago
  unk-0 Q0 AP890102-0137 5 -5.70465571 galago

#BM25 (Smoothing function -- Scorer)

The BM25 (Okapi) scoring function. Implemented in the org.lemurproject.galago.core.retrieval.iterator.scoring.BM25ScoringIterator class using the #bm25 operator.

Parameters
  • b Controls degree of length normalization (values 0..1; default 0.75)
  • K Normalization parameter for document length (default 1.2)
  • w Custom weight for term (default 1.0)
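
To show how b and K interact, here is a minimal Python sketch of the textbook Okapi BM25 term score. It is not taken from BM25ScoringIterator, and the function names are illustrative. The idf form ln(N / (df + 0.5)) is an assumption; it does reproduce the idf0 value 1.6395904192983146 printed in the BM25F sample output on this page for documentCount=528155 and nodeDocumentCount=102493. The w parameter is left out of the sketch.

```python
import math


def bm25_idf(document_count, node_document_count):
    # ASSUMPTION: idf = ln(N / (df + 0.5)); see the lead-in for why.
    return math.log(document_count / (node_document_count + 0.5))


def bm25_score(tf, doc_length, avg_doc_length,
               document_count, node_document_count, K=1.2, b=0.75):
    """Textbook BM25 score of one term in one document."""
    # b = 0 ignores document length; b = 1 normalizes fully by length.
    norm = 1.0 - b + b * doc_length / avg_doc_length
    # K controls how quickly repeated occurrences of a term saturate.
    saturated_tf = tf * (K + 1.0) / (tf + K * norm)
    return bm25_idf(document_count, node_document_count) * saturated_tf


if __name__ == "__main__":
    # Collection statistics from the robust04 example; tf and doc_length are made up.
    avg_len = 252359881 / 528155
    print(bm25_score(tf=3, doc_length=400, avg_doc_length=avg_len,
                     document_count=528155, node_document_count=102493,
                     K=0.7777, b=0.345))
```
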
Example Configuration File
    {
      "verbose"   : true,
      "casefold"  : true,
      "requested" : 5,
      "index"     : "/myindexes/robust04.idx",
      "K" : 0.7777,
      "b" : 0.345,
      "queries" : [
        {
          "number" : "bm25-combine",
          "text" : "#combine(#bm25(international) #bm25(organized) #bm25(crime))"
        }
      ]
    }
Example Query
    galago batch-search /myqueries/qrys_bm25.json

    Mar 09, 2016 4:11:55 PM org.lemurproject.galago.core.tools.apps.BatchSearch run
    INFO: RUNNING: bm25f-combine : #combine(#bm25(international) #bm25(organized)   #bm25(crime))
    Mar 09, 2016 4:11:55 PM org.lemurproject.galago.core.tools.apps.BatchSearch run
    INFO: Transformed Query:
    #combine:w=1.0(
      #bm25:b=0.345:collectionLength=252359881:documentCount=528155:maximumCount=326
                :nodeDocumentCount=102493:nodeFrequency=174191:w=0.3333333333333333(
        #lengths:document:part=lengths()
        #counts:international:part=postings.krovetz()
      )
      #bm25:b=0.345:collectionLength=252359881:documentCount=528155:maximumCount=111
                :nodeDocumentCount=8455:nodeFrequency=11401:w=0.3333333333333333(
        #lengths:document:part=lengths()
        #counts:organized:part=postings.krovetz()
      )
      #bm25:b=0.345:collectionLength=252359881:documentCount=528155:maximumCount=68
                :nodeDocumentCount=14954:nodeFrequency=30997:w=0.3333333333333333(
        #lengths:document:part=lengths()
        #counts:crime:part=postings.krovetz()
      )
    )

    bm25-combine Q0 FBIS4-41991 1 5.87150854 galago
    bm25-combine Q0 FBIS4-38364 2 5.65608722 galago
    bm25-combine Q0 FBIS4-55395 3 5.59812441 galago
    bm25-combine Q0 FBIS3-19646 4 5.54131058 galago
    bm25-combine Q0 FBIS3-21961 5 5.54131058 galago

#BM25F/#BM25FCOMB

The BM25F operator is a traversal that applies the BM25 smoothing/ranking algorithm on
a per-field basis. It is implemented as the #bm25fcomb operator.

Because it is a contributed component, the traversal class must be added explicitly to the
retrieval parameters. E.g., add

 --traversals+org.lemurproject.galago.contrib.retrieval.traversal.BM25FTraversal

to your retrieval parameters or the command line when invoking search or batch-search.

The traversal requires contrib.jar, so that jar must be on the classpath.

Alternatively, use the ./contrib/target/appassembler/bin/galago version of galago rather than
the usual ./core/target/appassembler/bin/galago: copy the contrib jar to the appassembler/lib
directory there and edit the galago script to include the contrib jar in the CLASSPATH.

Parameters
  • fields Define fields that are to be used by the operator
  • bm25f The parent key for bm25f parameters
      • K BM25 K value (defaults to 0.5)
      • weights Map of field based weights
      • smoothing Map of field based smoothing values (b values)
  • traversals Define the traversal class to be used and how used
      • name Name of class (org.lemurproject.galago.contrib.retrieval.traversal.BM25FTraversal)
      • order Use "before", "after" or "instead". "Instead" ignores all default traversals.
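
The roles of the weights, smoothing and K parameters can be illustrated with the standard BM25F formulation: per-field term frequencies are length-normalized, weighted and summed, and the sum is then saturated with K. The Python below is a sketch under that assumption. The nesting of #bm25fcomb, #combine and #bm25field in the transformed query resembles this structure, but the code is not taken from BM25FTraversal, and all names in it are illustrative.

```python
def bm25f_term_score(field_stats, weights, smoothing, idf, K=0.5):
    """Standard BM25F score of one query term in one document.

    field_stats -- {field: (tf, field_length, avg_field_length)}
    weights     -- {field: weight}, as in the "weights" map
    smoothing   -- {field: b}, as in the "smoothing" map
    idf         -- inverse document frequency of the term
    """
    combined_tf = 0.0
    for field, (tf, length, avg_length) in field_stats.items():
        b = smoothing[field]
        # Per-field length normalization, controlled by that field's b value.
        norm = 1.0 - b + b * length / avg_length
        combined_tf += weights[field] * tf / norm
    # Saturate the combined, weighted term frequency once, across all fields.
    return idf * combined_tf / (K + combined_tf)


if __name__ == "__main__":
    # Weights, smoothing and K from the example configuration; counts are made up.
    stats = {"h3": (1, 10, 10), "text": (2, 100, 50)}
    print(bm25f_term_score(stats,
                           weights={"h3": 0.455, "text": 0.201},
                           smoothing={"h3": 0.309, "text": 0.105},
                           idf=1.6395904192983146, K=0.7777))
```

A document's score would then be the sum of this value over the query terms.
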
Example Configuration File
    {
      "verbose"   : true,
      "casefold"  : true,
      "requested" : 5,
      "index"     : "/myindexes/robust04.idx",
      "traversals" : [
        {
          "name" : "org.lemurproject.galago.contrib.retrieval.traversal.BM25FTraversal",
          "order" : "before"
        }
      ],
      "fields" : [ "h3", "text" ],
      "bm25f" : {
        "K" : 0.7777,
        "b" : 0.345,
        "weights" : {
          "h3" : 0.455,
          "text" : 0.201
        },
        "smoothing" : {
          "h3" : 0.309,
          "text" : 0.105
        }
      },
      "queries" : [
        {
          "number" : "bm25f",
          "text" : "#bm25f(international organized crime)"
        }
      ]
    }
Example Query
    ./galago batch-search /myqueries/qrys_bm25f.json

    Mar 09, 2016 3:58:18 PM org.lemurproject.galago.core.tools.apps.BatchSearch run
    INFO: RUNNING: bm25f : #bm25f(international organized crime)
    Mar 09, 2016 3:58:18 PM org.lemurproject.galago.core.tools.apps.BatchSearch run
    INFO: Transformed Query:
    #bm25fcomb:K=0.7777:idf0=1.6395904192983146:idf1=4.134572684024236
                        :idf2=3.5643775433430793:norm=false(
      #combine:0=0.455:1=0.201:norm=false(
        #bm25field:K=0.7777:b=0.309:collectionLength=29039:documentCount=821
                            :idf=1.6395904192983146
                  :lengths=h3:maximumCount=2:nodeDocumentCount=31:pIdx=0:w=0.455(
          #lengths:h3:part=lengths()
          #counts:international:part=field.krovetz.h3()
        )
        #bm25field:K=0.7777:b=0.105:collectionLength=247217451:documentCount=524000
                            :idf=1.6395904192983146
                  :lengths=text:maximumCount=326:nodeDocumentCount=84025:pIdx=0:w=0.201(
          #lengths:text:part=lengths()
          #counts:international:part=field.krovetz.text()
        )
      )
      #combine:0=0.455:1=0.201:norm=false(
        #bm25field:K=0.7777:b=0.309:collectionLength=29039:documentCount=821
                            :idf=4.134572684024236
                  :lengths=h3:maximumCount=0:nodeDocumentCount=0:pIdx=1:w=0.455(
          #lengths:h3:part=lengths()
          #counts:organized:part=field.krovetz.h3()
        )
        #bm25field:K=0.7777:b=0.105:collectionLength=247217451:documentCount=524000
                            :idf=4.134572684024236
                  :lengths=text:maximumCount=111:nodeDocumentCount=8450:pIdx=1:w=0.201(
          #lengths:text:part=lengths()
          #counts:organized:part=field.krovetz.text()
        )
      )
      #combine:0=0.455:1=0.201:norm=false(
        #bm25field:K=0.7777:b=0.309:collectionLength=29039:documentCount=821
                            :idf=3.5643775433430793
                  :lengths=h3:maximumCount=1:nodeDocumentCount=1:pIdx=2:w=0.455(
          #lengths:h3:part=lengths()
          #counts:crime:part=field.krovetz.h3()
        )
        #bm25field:K=0.7777:b=0.105:collectionLength=247217451:documentCount=524000
                            :idf=3.5643775433430793
                  :lengths=text:maximumCount=68:nodeDocumentCount=14712:pIdx=2:w=0.201(
          #lengths:text:part=lengths()
          #counts:crime:part=field.krovetz.text()
        )
      )
    )

    bm25f Q0 FBIS4-41991 1 6.68788814 galago
    bm25f Q0 FBIS4-38364 2 6.63572238 galago
    bm25f Q0 FBIS4-7811 3 6.55618402 galago
    bm25f Q0 FBIS3-24143 4 6.39592308 galago
    bm25f Q0 FBIS4-55395 5 6.38987171 galago