This page describes the operators available to the Galago search engine.
The #combine operator returns a normalized, weighted sum of the scores produced by each of its children.
The weights are normalized by the sum of the child node weights.
A weight may be assigned to each child node of the combine operation. The weights are
given as index=value parameters, one per child, indexed by child position starting at 0. For example:
#combine:0=0.1:1=0.2:2=0.7 (international organized crime.h3)
galago batch-search --verbose=true --requested=5 \
    --index=/myindexes/robust04.idx \
    --query="#combine:0=0.1:1=0.2:2=0.7 (international organized crime.h3)"

Mar 09, 2016 11:54:37 AM org.lemurproject.galago.core.tools.apps.BatchSearch run
INFO: RUNNING: unk-0 : #combine:0=0.1:1=0.2:2=0.7(international organized crime.h3)
Mar 09, 2016 11:54:37 AM org.lemurproject.galago.core.tools.apps.BatchSearch run
INFO: Transformed Query:
#combine:0=0.1:1=0.2:2=0.7:w=1.0(
    #dirichlet:collectionLength=252359881:maximumCount=326:nodeFrequency=174191:w=0.1(
        #lengths:document:part=lengths()
        #counts:international:part=postings.krovetz() )
    #dirichlet:collectionLength=252359881:maximumCount=111:nodeFrequency=11401:w=0.2(
        #lengths:document:part=lengths()
        #counts:organized:part=postings.krovetz() )
    #dirichlet:collectionLength=252359881:maximumCount=1:nodeFrequency=1:w=0.7(
        #lengths:document:part=lengths()
        #counts:crime:part=field.krovetz.h3() ) )

unk-0 Q0 FBIS3-8153 1 -8.07971877 galago
unk-0 Q0 LA121990-0141 2 -15.50911595 galago
unk-0 Q0 LA102290-0116 3 -15.51177100 galago
unk-0 Q0 FBIS4-54904 4 -15.51292184 galago
unk-0 Q0 FBIS4-19535 5 -15.51592076 galago
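As a rough illustration of the normalization described above, the following Python sketch computes a weighted sum in which every weight is divided by the total of all weights. The function name `combine` and the child scores are hypothetical and not part of Galago; only the weights 0.1, 0.2 and 0.7 are taken from the example.

```python
def combine(scores, weights, norm=True):
    """Weighted sum of child scores; weights are divided by their sum when norm is True."""
    if len(scores) != len(weights):
        raise ValueError("one weight per child score is required")
    total = sum(weights) if norm else 1.0
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum((w / total) * s for w, s in zip(weights, scores))


# Hypothetical per-child scores for a single document.
child_scores = [-5.0, -7.0, -9.0]

# Weights from the example query.
print(combine(child_scores, [0.1, 0.2, 0.7]))

# Scaling every weight by the same factor leaves the result unchanged.
print(combine(child_scores, [1, 2, 7]))
```

Because of the normalization, only the ratios between the weights matter, not their absolute values.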
The sequential dependence model assumes dependencies between adjacent query terms. It is implemented by the core.retrieval.traversal.SequentialDependenceTraversal class via the #sdm or
#seqdep operators. The traversal produces a combined query consisting of unigram, ordered distance and
unordered distance components built from the original query terms. The default component
weights are 0.8 for unigrams, 0.15 for ordered distance and 0.05 for unordered window. The weight
for each part of a query component is divided by the number of query terms.
The traversal uses Dirichlet smoothing by default.
#sdm( term1 term2 ... termk )

becomes

#combine (
    0.8  #combine ( term1 term2 ... termk )
    0.15 #combine ( #od(term1 term2) #od(term2 term3) ... #od(termk-1 termk) )
    0.05 #combine ( #uw8(term1 term2) ... #uw8(termk-1 termk) )
)
{
  "verbose" : true,
  "casefold" : true,
  "requested" : 5,
  "index" : "/myqueries/qrys_sdm.json",
  "queries" : [
    {
      "number" : "sdm",
      "text" : "#sdm(weatherman new york)",
      "uniw" : 0.65,
      "odw" : 0.20,
      "uww" : 0.15,
      "windowLimit" : 3
    }
  ]
}
NOTE: With a windowLimit value of 3 instead of the default of 2, the transformed query
will contain three-word as well as two-word groupings for the distance operations. The unordered window size is also increased for the additional word groupings. The weight of each of the three query components is divided equally among its term groupings, so the parts sum to the weight specified for that component in the configuration file.
galago batch-search /myqueries/qrys_sdm.json

Mar 09, 2016 9:30:20 AM org.lemurproject.galago.core.tools.apps.BatchSearch run
INFO: RUNNING: sdm : #sdm:uniw=0.65:odw=0.20:uww=0.15:windowlimit=3(weatherman new york)
Mar 09, 2016 9:30:20 AM org.lemurproject.galago.core.tools.apps.BatchSearch run
INFO: Transformed Query:
#combine:0=0.21666666666666667:1=0.21666666666666667:2=0.21666666666666667
        :3=0.06666666666666667:4=0.06666666666666667:5=0.06666666666666667
        :6=0.049999999999999996:7=0.049999999999999996
        :8=0.049999999999999996:w=1.0(
    #dirichlet:collectionLength=3801748:maximumCount=1
            :nodeFrequency=12:w=0.21666666666666667(
        #lengths:document:part=lengths()
        #counts:weatherman:part=postings.krovetz() )
    #dirichlet:collectionLength=3801748:maximumCount=22
            :nodeFrequency=9878:w=0.21666666666666667(
        #lengths:document:part=lengths()
        #counts:new:part=postings.krovetz() )
    #dirichlet:collectionLength=3801748:maximumCount=13
            :nodeFrequency=2986:w=0.21666666666666667(
        #lengths:document:part=lengths()
        #counts:york:part=postings.krovetz() )
    #dirichlet:collectionLength=3801748:maximumCount=0
            :nodeFrequency=0:w=0.06666666666666667(
        #lengths:document:part=lengths()
        #ordered:1(
            #extents:weatherman:part=postings.krovetz()
            #extents:new:part=postings.krovetz() ) )
    #dirichlet:collectionLength=3801748:maximumCount=13
            :nodeFrequency=2970:w=0.06666666666666667(
        #lengths:document:part=lengths()
        #ordered:1(
            #extents:new:part=postings.krovetz()
            #extents:york:part=postings.krovetz() ) )
    #dirichlet:collectionLength=3801748:maximumCount=0
            :nodeFrequency=0:w=0.06666666666666667(
        #lengths:document:part=lengths()
        #ordered:1(
            #extents:weatherman:part=postings.krovetz()
            #extents:new:part=postings.krovetz()
            #extents:york:part=postings.krovetz() ) )
    #dirichlet:collectionLength=3801748:maximumCount=0
            :nodeFrequency=0:w=0.049999999999999996(
        #lengths:document:part=lengths()
        #unordered:8(
            #extents:weatherman:part=postings.krovetz()
            #extents:new:part=postings.krovetz() ) )
    #dirichlet:collectionLength=3801748:maximumCount=17
            :nodeFrequency=3118:w=0.049999999999999996(
        #lengths:document:part=lengths()
        #unordered:8(
            #extents:new:part=postings.krovetz()
            #extents:york:part=postings.krovetz() ) )
    #dirichlet:collectionLength=3801748:maximumCount=0
            :nodeFrequency=0:w=0.049999999999999996(
        #lengths:document:part=lengths()
        #unordered:12(
            #extents:weatherman:part=postings.krovetz()
            #extents:new:part=postings.krovetz()
            #extents:york:part=postings.krovetz() ) ) )

sdm Q0 AP890110-0137 1 -8.63710973 galago
sdm Q0 AP890111-0014 2 -8.75586082 galago
sdm Q0 AP890120-0172 3 -8.94746306 galago
sdm Q0 AP890109-0244 4 -8.98105408 galago
sdm Q0 AP890119-0213 5 -8.99652757 galago
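The node list and weights in the transformed query above can be approximated with a short Python sketch. It is not the SequentialDependenceTraversal code: the function name `sdm_nodes` is hypothetical, and two rules are assumptions read off the log, namely that each component weight is split evenly over its groupings and that the unordered window width is four times the number of terms in the grouping.

```python
def sdm_nodes(terms, uniw=0.8, odw=0.15, uww=0.05, window_limit=2):
    """Return (weight, node) pairs for a sequential-dependence style expansion."""
    # Adjacent groupings of 2..window_limit terms, shorter groupings first.
    groups = [terms[i:i + n]
              for n in range(2, window_limit + 1)
              for i in range(len(terms) - n + 1)]
    nodes = [(uniw / len(terms), t) for t in terms]
    if groups:
        # Assumption: each component weight is split evenly over its groupings.
        nodes += [(odw / len(groups), "#od:1(" + " ".join(g) + ")") for g in groups]
        # Assumption: window width is 4 per term (8 for pairs, 12 for triples).
        nodes += [(uww / len(groups), "#uw:%d(%s)" % (4 * len(g), " ".join(g)))
                  for g in groups]
    return nodes


# Parameters from the example configuration file.
for weight, node in sdm_nodes(["weatherman", "new", "york"],
                              uniw=0.65, odw=0.20, uww=0.15, window_limit=3):
    print(f"{weight:.4f}  {node}")
```

With these inputs the sketch yields nine nodes in the same order as the log: three unigrams, three ordered groupings and three unordered groupings.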
The full dependence model is implemented by the core.retrieval.traversal.FullDependenceTraversal class using the #fdm or
#fulldep operators.
The model transforms original queries into the following form:
#fdm ( term1 term2 term3 )

-->

#combine (
    0.8  term1 term2 term3
    0.15 #od:1 ( term1 term2 ) #od:1 ( term1 term3 ) #od:1 ( term2 term3 ) #od:1 ( term1 term2 term3 )
    0.05 #uw:8 ( term1 term2 ) #uw:8 ( term1 term3 ) #uw:8 ( term2 term3 ) #uw:12 ( term1 term2 term3 )
)
Note: The component weights are divided by the number of unigram, odN and uwN
operations performed. Unordered window distances may be increased when the
number of query terms exceeds the windowLimit setting.
{
  "verbose" : true,
  "casefold" : true,
  "requested" : 5,
  "index" : "/myindexes/robust04.idx",
  "uniw" : 0.75,
  "odw" : 0.15,
  "uww" : 0.10,
  "queries" : [
    {
      "number" : "fdm",
      "text" : "#fdm (international organized crime)"
    }
  ]
}
galago batch-search --verbose=true --requested=5 \
    --index=/myindexes/robust04.idx \
    --query="#fdm(international organized crime)"

Mar 10, 2016 11:35:24 AM org.lemurproject.galago.core.tools.apps.BatchSearch run
INFO: RUNNING: unk-0 : #fdm(international organized crime)
Mar 10, 2016 11:35:24 AM org.lemurproject.galago.core.tools.apps.BatchSearch run
INFO: Transformed Query:
#combine:0=0.26666666666666666:1=0.26666666666666666:10=0.0125
        :2=0.26666666666666666:3=0.0375:4=0.0375:5=0.0375:6=0.0375
        :7=0.0125:8=0.0125:9=0.0125:w=1.0(
    #dirichlet:collectionLength=252359881:maximumCount=326
            :nodeFrequency=174191:w=0.2666666666666667(
        #lengths:document:part=lengths()
        #counts:international:part=postings.krovetz() )
    #dirichlet:collectionLength=252359881:maximumCount=111
            :nodeFrequency=11401:w=0.2666666666666667(
        #lengths:document:part=lengths()
        #counts:organized:part=postings.krovetz() )
    #dirichlet:collectionLength=252359881:maximumCount=68
            :nodeFrequency=30997:w=0.2666666666666667(
        #lengths:document:part=lengths()
        #counts:crime:part=postings.krovetz() )
    #dirichlet:collectionLength=252359881:maximumCount=2
            :nodeFrequency=22:w=0.037500000000000006(
        #lengths:document:part=lengths()
        #ordered:1(
            #extents:international:part=postings.krovetz()
            #extents:organized:part=postings.krovetz() ) )
    #dirichlet:collectionLength=252359881:maximumCount=2
            :nodeFrequency=100:w=0.037500000000000006(
        #lengths:document:part=lengths()
        #ordered:1(
            #extents:international:part=postings.krovetz()
            #extents:crime:part=postings.krovetz() ) )
    #dirichlet:collectionLength=252359881:maximumCount=29
            :nodeFrequency=1744:w=0.037500000000000006(
        #lengths:document:part=lengths()
        #ordered:1(
            #extents:organized:part=postings.krovetz()
            #extents:crime:part=postings.krovetz() ) )
    #dirichlet:collectionLength=252359881:maximumCount=2
            :nodeFrequency=18:w=0.037500000000000006(
        #lengths:document:part=lengths()
        #ordered:1(
            #extents:international:part=postings.krovetz()
            #extents:organized:part=postings.krovetz()
            #extents:crime:part=postings.krovetz() ) )
    #dirichlet:collectionLength=252359881:maximumCount=3
            :nodeFrequency=205:w=0.012500000000000004(
        #lengths:document:part=lengths()
        #unordered:8(
            #extents:international:part=postings.krovetz()
            #extents:organized:part=postings.krovetz() ) )
    #dirichlet:collectionLength=252359881:maximumCount=4
            :nodeFrequency=446:w=0.012500000000000004(
        #lengths:document:part=lengths()
        #unordered:8(
            #extents:international:part=postings.krovetz()
            #extents:crime:part=postings.krovetz() ) )
    #dirichlet:collectionLength=252359881:maximumCount=31
            :nodeFrequency=1875:w=0.012500000000000004(
        #lengths:document:part=lengths()
        #unordered:8(
            #extents:organized:part=postings.krovetz()
            #extents:crime:part=postings.krovetz() ) )
    #dirichlet:collectionLength=252359881:maximumCount=3
            :nodeFrequency=85:w=0.012500000000000004(
        #lengths:document:part=lengths()
        #unordered:12(
            #extents:international:part=postings.krovetz()
            #extents:organized:part=postings.krovetz()
            #extents:crime:part=postings.krovetz() ) ) )

unk-0 Q0 FBIS3-26415 1 -6.26893740 galago
unk-0 Q0 FBIS3-41247 2 -6.26893740 galago
unk-0 Q0 FBIS3-27916 3 -6.65951146 galago
unk-0 Q0 FBIS3-41108 4 -6.65951146 galago
unk-0 Q0 FBIS4-41684 5 -6.68566150 galago
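The groupings and weights in the log above can be reproduced with a small Python sketch. It is only an approximation of the traversal: `fdm_nodes` is a hypothetical name, the default of 3 for `window_limit` is an assumption, and so are the even split of each component weight and the window width of four per term.

```python
from itertools import combinations


def fdm_nodes(terms, uniw=0.8, odw=0.15, uww=0.05, window_limit=3):
    """Return (weight, node) pairs for a full-dependence style expansion."""
    # Every in-order subset of 2..window_limit terms, not just adjacent ones.
    largest = min(window_limit, len(terms))
    groups = [list(c)
              for n in range(2, largest + 1)
              for c in combinations(terms, n)]
    nodes = [(uniw / len(terms), t) for t in terms]
    if groups:
        # Assumption: each component weight is split evenly over its groupings.
        nodes += [(odw / len(groups), "#od:1(" + " ".join(g) + ")") for g in groups]
        # Assumption: window width is 4 per term (8 for pairs, 12 for triples).
        nodes += [(uww / len(groups), "#uw:%d(%s)" % (4 * len(g), " ".join(g)))
                  for g in groups]
    return nodes


for weight, node in fdm_nodes(["international", "organized", "crime"]):
    print(f"{weight:.4f}  {node}")
```

For three terms this gives three unigrams, four ordered and four unordered groupings, which matches the eleven children of the transformed query.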
Implemented by the core.retrieval.traversal.WeightedSequentialDependenceTraversal class using the #wsdm operator.
The Weighted Sequential Dependency Model is structurally similar to the Sequential Dependency Model; however, node weights are a linear combination of node features. The operator accepts term ("text") arguments only (no child operations).
Furthermore, the original query terms can be evaluated in bigram and trigram groupings.
In particular, the weight for a "term" node is determined as a linear combination of features. The features are divided into unigram and bigram classes.
#wsdm( term1 term2 ... termk )

becomes

#combine (
    0.8  #combine ( term1 term2 ... termk )
    0.15 #combine ( #od(term1 term2) #od(term2 term3) ... #od(termk-1 termk) )
    0.05 #combine ( #uw8(term1 term2) #uw8(term2 term3) ... #uw8(termk-1 termk) )
)
{
  name    : [ "1-const" | "1-lntf" | "1-lndf" | "2-const" | "2-lntf" | "2-lndf" ]
  type    : [ "const", "logtf|logngramtf", "logdf" ]   (default "const")
  lambda  : <weight_value>                (defaults: const=1.0 lntf=0.0 lndf=0.0)
  group   : <retrieval_group_name>        (default missing or empty)
  part    : <retrieval_index_part_name>   (default missing or empty)
  unigram : true | false   can be used on unigrams (default true)
  bigram  : true | false   can be used on bigrams (default false)
  trigram : true | false   can be used on trigrams (default false)
}

NOTE: CONST type features always have a lambda (weight) of 1.0.
NOTE: unigram/bigram/trigram values are mutually exclusive, i.e. if unigram is true, bi/tri grams must be false; if bigram is true, uni/tri grams must be false, etc.
{
  "verbose" : true,
  "casefold" : true,
  "requested" : 5,
  "index" : "/myindexes/robust04.idx",
  "norm" : true,
  "verboseWSDM" : true,
  "wsdmFeatures" : [
    { "name" : "1-const", "type" : "const", "lambda" : 0.7, "part" : "postings",
      "unigram" : true, "bigram" : false, "trigram" : false },
    { "name" : "1-lntf", "type" : "logtf", "lambda" : 0.3, "part" : "postings.krovetz",
      "unigram" : true, "bigram" : false, "trigram" : false },
    { "name" : "1-lndf", "type" : "logdf", "lambda" : 0.2, "part" : "extents",
      "unigram" : false, "bigram" : true, "trigram" : false },
    { "name" : "2-const", "type" : "const", "lambda" : 0.85, "part" : "postings.krovetz",
      "unigram" : true, "bigram" : false, "trigram" : false },
    { "name" : "2-lntf", "type" : "logtf", "part" : "field.krovetz.h3",
      "unigram" : true, "bigram" : false, "trigram" : false },
    { "name" : "2-lndf", "type" : "logdf", "lambda" : 0.25, "part" : "field.krovetz.text",
      "unigram" : false, "bigram" : false, "trigram" : true }
  ],
  "queries" : [
    {
      "number" : "wsdm-1",
      "text" : "#wsdm(international organized crime)"
    }
  ]
}
galago batch-search /myqueries/qrys_wsdm.json

Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.tools.apps.BatchSearch run
INFO: RUNNING: wsdm-1 : #wsdm(international organized crime)
Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal computeWeight
INFO: international -- feature:1-const:0.700000 * 1.00000 = 0.700000
Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal computeWeight
INFO: international -- feature:1-lntf:0.300000 * 12.0679 = 3.62037
Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal computeWeight
INFO: international -- feature:2-const:0.850000 * 1.00000 = 0.850000
Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal computeWeight
INFO: international -- feature:2-lntf:1.00000 * 3.58352 = 3.58352
Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal computeWeight
INFO: organized -- feature:1-const:0.700000 * 1.00000 = 0.700000
Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal computeWeight
INFO: organized -- feature:1-lntf:0.300000 * 9.34146 = 2.80244
Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal computeWeight
INFO: organized -- feature:2-const:0.850000 * 1.00000 = 0.850000
Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal computeWeight
INFO: crime -- feature:1-const:0.700000 * 1.00000 = 0.700000
Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal computeWeight
INFO: crime -- feature:1-lntf:0.300000 * 10.3416 = 3.10249
Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal computeWeight
INFO: crime -- feature:2-const:0.850000 * 1.00000 = 0.850000
Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal computeWeight
INFO: crime -- feature:2-lntf:1.00000 * 0.00000 = 0.00000
Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.retrieval.traversal.WeightedSequentialDependenceTraversal computeWeight
INFO: international, organized, crime -- feature:2-lndf:0.250000 * 2.83321 = 0.708303

#combine:0=8.753891241649924:1=4.352436904950116:2=4.652493711377242
        :3=0.0:4=0.0:5=0.0:6=0.0:7=0.708303336014054:8=0.708303336014054:norm=true(
    #text:international()
    #text:organized()
    #text:crime()
    #od:1( #extents:international() #extents:organized() )
    #uw:8( #extents:international() #extents:organized() )
    #od:1( #extents:organized() #extents:crime() )
    #uw:8( #extents:organized() #extents:crime() )
    #od:1( #extents:international() #extents:organized() #extents:crime() )
    #uw:12( #extents:international() #extents:organized() #extents:crime() ) )

Mar 08, 2016 4:05:15 PM org.lemurproject.galago.core.tools.apps.BatchSearch run
INFO: Transformed Query:
#combine:0=8.753891241649924:1=4.352436904950116:2=4.652493711377242
        :3=0.0:4=0.0:5=0.0:6=0.0:7=0.708303336014054
        :8=0.708303336014054:norm=true:w=1.0(
    #dirichlet:collectionLength=252359881:maximumCount=326
            :nodeFrequency=174191:w=0.4565160683607139(
        #lengths:document:part=lengths()
        #counts:international:part=postings.krovetz() )
    #dirichlet:collectionLength=252359881:maximumCount=111
            :nodeFrequency=11401:w=0.2269799028553388(
        #lengths:document:part=lengths()
        #counts:organized:part=postings.krovetz() )
    #dirichlet:collectionLength=252359881:maximumCount=68
            :nodeFrequency=30997:w=0.24262788725149467(
        #lengths:document:part=lengths()
        #counts:crime:part=postings.krovetz() )
    #dirichlet:collectionLength=252359881:maximumCount=2
            :nodeFrequency=22:w=0.0(
        #lengths:document:part=lengths()
        #od:1(
            #extents:international:part=postings.krovetz()
            #extents:organized:part=postings.krovetz() ) )
    #dirichlet:collectionLength=252359881:maximumCount=3
            :nodeFrequency=205:w=0.0(
        #lengths:document:part=lengths()
        #uw:8(
            #extents:international:part=postings.krovetz()
            #extents:organized:part=postings.krovetz() ) )
    #dirichlet:collectionLength=252359881:maximumCount=29
            :nodeFrequency=1744:w=0.0(
        #lengths:document:part=lengths()
        #od:1(
            #extents:organized:part=postings.krovetz()
            #extents:crime:part=postings.krovetz() ) )
    #dirichlet:collectionLength=252359881:maximumCount=31
            :nodeFrequency=1875:w=0.0(
        #lengths:document:part=lengths()
        #uw:8(
            #extents:organized:part=postings.krovetz()
            #extents:crime:part=postings.krovetz() ) )
    #dirichlet:collectionLength=252359881:maximumCount=2
            :nodeFrequency=18:w=0.03693807076622631(
        #lengths:document:part=lengths()
        #od:1(
            #extents:international:part=postings.krovetz()
            #extents:organized:part=postings.krovetz()
            #extents:crime:part=postings.krovetz() ) )
    #dirichlet:collectionLength=252359881:maximumCount=3
            :nodeFrequency=85:w=0.03693807076622631(
        #lengths:document:part=lengths()
        #uw:12(
            #extents:international:part=postings.krovetz()
            #extents:organized:part=postings.krovetz()
            #extents:crime:part=postings.krovetz() ) ) )

wsdm-1 Q0 FBIS3-26415 1 -6.17248337 galago
wsdm-1 Q0 FBIS3-41247 2 -6.17248337 galago
wsdm-1 Q0 FBIS4-41991 3 -6.19015392 galago
wsdm-1 Q0 FBIS4-38364 4 -6.37916313 galago
wsdm-1 Q0 FBIS3-19646 5 -6.39789464 galago
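The arithmetic behind the node weights in the log above can be checked with a few lines of Python. This is a sketch of the linear combination only, not the WeightedSequentialDependenceTraversal code: `node_weight` is a hypothetical helper, and the (lambda, value) pairs are copied from the computeWeight log lines, so the results agree with the log only to the precision printed there.

```python
def node_weight(features):
    """Linear combination: sum of lambda * feature value over a node's features."""
    return sum(lam * value for lam, value in features)


# (lambda, feature value) pairs copied from the computeWeight log lines.
raw = {
    "international": node_weight([(0.7, 1.0), (0.3, 12.0679), (0.85, 1.0), (1.0, 3.58352)]),
    "organized": node_weight([(0.7, 1.0), (0.3, 9.34146), (0.85, 1.0)]),
    "crime": node_weight([(0.7, 1.0), (0.3, 10.3416), (0.85, 1.0), (1.0, 0.0)]),
    # The log reports one trigram weight that is used for both the #od and #uw nodes.
    "od(international organized crime)": node_weight([(0.25, 2.83321)]),
    "uw(international organized crime)": node_weight([(0.25, 2.83321)]),
}

# With "norm" : true, each weight is divided by the sum of all weights.
# The four bigram nodes have weight 0.0 in the log and do not change the sum.
total = sum(raw.values())
normalized = {node: w / total for node, w in raw.items()}

for node in raw:
    print(f"{node}: raw={raw[node]:.4f} normalized={normalized[node]:.4f}")
```

The normalized values correspond to the w= parameters of the #dirichlet nodes in the transformed query.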
A relevance feedback model invoked with the #rm operator, which defaults to RelevanceModel3.
With the default RelevanceModel3, the original query terms are augmented by the specified
number of feedback expansion terms at the specified weight. If RelevanceModel1 is used, the original
query terms are replaced by the expansion terms.
The operator makes use of the ExpansionModelFactory and RelevanceModelTraversal classes.
{
  "verbose" : true,
  "casefold" : true,
  "requested" : 5,
  "index" : "/myindexes/ap89_fields.idx",
  "relevanceModel" : "org.lemurproject.galago.core.retrieval.prf.RelevanceModel1",
  "fbDocs" : 10,
  "fbTerm" : 5,
  "fbOrigWeight" : 0.75,
  "passageQuery" : true,      [passage query requires size and shift parameters]
  "passageSize" : 10,
  "passageShift" : 20,
  "extentQuery" : true,
  "rmstopwords" : "rmstop",
  "rmwhitelist" : "/myqueries/whitelist.txt",      [be careful with this one!]
  "rmStemmer" : "org.lemurproject.galago.core.parse.stem.KrovetzStemmer",
  "queries" : [
    {
      "number" : "rm",
      "text" : "#rm(six survivors)"
    }
  ]
}
galago batch-search /myqueries/qrys_rm.json

Mar 09, 2016 10:45:06 AM org.lemurproject.galago.core.tools.apps.BatchSearch run
INFO: RUNNING: rm : #rm(six survivors)
Mar 09, 2016 10:45:06 AM org.lemurproject.galago.core.tools.apps.BatchSearch run
INFO: Transformed Query:
#combine:0=0.375:1=0.375:2=0.11103813303830858:3=0.07096813315615444
        :4=0.06799373380553697:w=1.0(
    #dirichlet:collectionLength=3801748:maximumCount=8
            :nodeFrequency=1735:w=0.375(
        #passagelengths( #lengths:document:part=lengths() )
        #passagefilter( #extents:six:part=postings.krovetz() ) )
    #dirichlet:collectionLength=3801748:maximumCount=10
            :nodeFrequency=230:w=0.375(
        #passagelengths( #lengths:document:part=lengths() )
        #passagefilter( #extents:survivors:part=postings.krovetz() ) )
    #dirichlet:collectionLength=3801748:maximumCount=19
            :nodeFrequency=352:w=0.11103813303830858(
        #passagelengths( #lengths:document:part=lengths() )
        #passagefilter( #extents:tass:part=postings.krovetz() ) )
    #dirichlet:collectionLength=3801748:maximumCount=22
            :nodeFrequency=2465:w=0.07096813315615444(
        #passagelengths( #lengths:document:part=lengths() )
        #passagefilter( #extents:fire:part=postings.krovetz() ) )
    #dirichlet:collectionLength=3801748:maximumCount=15
            :nodeFrequency=72:w=0.06799373380553697(
        #passagelengths( #lengths:document:part=lengths() )
        #passagefilter( #extents:akopyan:part=postings.krovetz() ) ) )

rm Q0 AP890112-0108 1 -7.31099958 galago 220 230
rm Q0 AP890110-0038 2 -7.55503569 galago 560 570
rm Q0 AP890112-0108 3 -7.55503569 galago 0 10
rm Q0 AP890103-0047 4 -7.65907614 galago 40 50
rm Q0 AP890103-0144 5 -7.65907614 galago 0 10
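The weights in the transformed query above are consistent with a simple interpolation: the original terms share fbOrigWeight equally, and the expansion terms share the remainder in proportion to their feedback scores. The Python sketch below illustrates that reading of the log. The function `rm3_weights` is hypothetical, and it is an assumption that Galago computes the weights this way.

```python
def rm3_weights(query_terms, expansion, fb_orig_weight=0.75):
    """Interpolate original query terms with weighted expansion terms.

    expansion maps each feedback term to a non-negative score; the scores
    are rescaled so that they sum to 1 - fb_orig_weight.
    """
    if not query_terms or not expansion:
        raise ValueError("need at least one query term and one expansion term")
    weights = {t: fb_orig_weight / len(query_terms) for t in query_terms}
    mass = sum(expansion.values())
    for term, score in expansion.items():
        # A term present in both sets accumulates weight from both.
        weights[term] = weights.get(term, 0.0) + (1.0 - fb_orig_weight) * score / mass
    return weights


# Expansion terms and relative scores taken from the log above.
feedback = {"tass": 0.11103813, "fire": 0.07096813, "akopyan": 0.06799373}
for term, weight in rm3_weights(["six", "survivors"], feedback).items():
    print(f"{term}: {weight:.4f}")
```

Setting `fb_orig_weight` to 0 in this sketch would leave only the expansion terms carrying weight, which is the behavior the text attributes to RelevanceModel1.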
This operator implements a pseudo-relevance feedback operation that expands a
query with automatically generated "relevant" terms. It adds these terms to the
original query (RelevanceModel3) or replaces the original terms altogether (RelevanceModel1).

The #prms operator obtains statistics and length information for each specified field.
The original query is expanded into a combination of weighted sums for each query
term over each of the specified fields, using the weights specified for each field.
Given the query
meg ryan war
and the document fields
cast team title
a #prms operation should produce a query expansion such as the following:
#combine(
    #wsum:w1:w2:w3 ( meg.cast meg.team meg.title )
    #wsum:w1:w2:w3 ( ryan.cast ryan.team ryan.title )
    #wsum:w1:w2:w3 ( war.cast war.team war.title )
)
Implemented by the core.retrieval.traversal.PRMS2Traversal class using the
#prms or #prms2 operators.
{
  "verbose" : true,
  "casefold" : true,
  "requested" : 5,
  "index" : "/home/harding/work/idx/robust04.idx",
  "relevanceModel" : "org.lemurproject.galago.core.retrieval.prf.RelevanceModel1",
  "fields" : [ "h3", "text" ],
  "weights" : { "h3" : 0.7, "text" : 0.3 },
  "queries" : [
    {
      "number" : "prms-jm-rm1",
      "text" : "#prms(international organized crime)",
      "scorer" : "jm"
    }
  ]
}
galago batch-search /home/harding/work/queries/qrys_prms.json

Mar 09, 2016 11:16:51 AM org.lemurproject.galago.core.tools.apps.BatchSearch run
INFO: Transformed Query:
#combine:norm=false:w=1.0(
    #wsum:0=0.7:1=0.3:w=1.0(
        #jm:collectionLength=29039:lengths=h3:maximumCount=2:nodeFrequency=36:w=0.7(
            #lengths:h3:part=lengths()
            #counts:international:part=field.krovetz.h3() )
        #jm:collectionLength=247217451:lengths=text:maximumCount=326
                :nodeFrequency=149205:w=0.3(
            #lengths:text:part=lengths()
            #counts:international:part=field.krovetz.text() ) )
    #wsum:0=0.7:1=0.3:w=1.0(
        #jm:collectionLength=29039:lengths=h3:maximumCount=0:nodeFrequency=0:w=0.7(
            #lengths:h3:part=lengths()
            #counts:organized:part=field.krovetz.h3() )
        #jm:collectionLength=247217451:lengths=text:maximumCount=111
                :nodeFrequency=11375:w=0.3(
            #lengths:text:part=lengths()
            #counts:organized:part=field.krovetz.text() ) )
    #wsum:0=0.7:1=0.3:w=1.0(
        #jm:collectionLength=29039:lengths=h3:maximumCount=1:nodeFrequency=1:w=0.7(
            #lengths:h3:part=lengths()
            #counts:crime:part=field.krovetz.h3() )
        #jm:collectionLength=247217451:lengths=text:maximumCount=68
                :nodeFrequency=30058:w=0.3(
            #lengths:text:part=lengths()
            #counts:crime:part=field.krovetz.text() ) ) )

prms-jm Q0 FT931-1 1 NaN galago
prms-jm Q0 FT941-2 2 NaN galago
prms-jm Q0 LA010189-0002 3 NaN galago
prms-jm Q0 FBIS3-2 4 NaN galago
prms-jm Q0 FBIS3-3 5 NaN galago
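The shape of the #prms expansion described above, one #wsum per query term with one child per field, can be sketched as a string-building function in Python. The name `prms_expand` is hypothetical and the output is a schematic query string in the notation used on this page, not the exact node tree that PRMS2Traversal builds.

```python
def prms_expand(terms, field_weights):
    """Build a #combine of per-term #wsum nodes over the given fields.

    field_weights maps a field name to its weight; dictionary order fixes
    the child positions used in the index=weight parameters.
    """
    fields = list(field_weights)
    params = ":".join(f"{i}={field_weights[f]}" for i, f in enumerate(fields))
    parts = []
    for term in terms:
        children = " ".join(f"{term}.{field}" for field in fields)
        parts.append(f"#wsum:{params}( {children} )")
    return "#combine( " + " ".join(parts) + " )"


# Fields and weights from the example configuration file.
print(prms_expand(["international", "organized", "crime"],
                  {"h3": 0.7, "text": 0.3}))
```

Each term contributes the same field weights here, which mirrors the repeated #wsum:0=0.7:1=0.3 nodes in the log.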
The Proximity Divergence from Randomness Model assumes that all adjacent pairs of query terms
are dependent. By default, terms are scored using the PL2 scoring model and bigrams using the BiL2 scoring
model. Parameters allow other scoring models to be used. Weights for the
term and bigram query components (PL2 and BiL2 scorers) may also be specified. Document
scores are the weighted sum of term and bigram (bi-term) features.
Implemented by the core.retrieval.traversal.ProximityDFRTraversal class using the #pdfr operator.
#pdfr ( term1 term2 term3 )

becomes

#combine (
    w: #pl2:c=6.0 (stats for term1)
    w: #pl2:c=6.0 (stats for term2)
    w: #pl2:c=6.0 (stats for term3)
    w: #bil2:c=0.05 ( #ordered:5 ( term1 term2 ) )
    w: #bil2:c=0.05 ( #ordered:5 ( term2 term3 ) )
)
Note: The component weights are divided by the number of pl2 and bil2
operations performed. The window distance for term pairs is specified by the
windowSize parameter, which defaults to 5.
{
  "verbose" : true,
  "casefold" : true,
  "requested" : 5,
  "index" : "/myindexes/robust04.idx",
  "pdfrSeq" : true,
  "termLambda" : 1.0,
  "c" : 6.0,
  "cp" : 0.05,
  "pdfrTerm" : "pl2",
  "pdfrProx" : "bil2",
  "windowSize" : 5,
  "queries" : [
    {
      "number" : "pdfr",
      "text" : "#pdfr (international organized crime)"
    }
  ]
}
galago batch-search /myqueries/qrys_pdfr.json

Mar 10, 2016 12:38:43 PM org.lemurproject.galago.core.tools.apps.BatchSearch run
INFO: RUNNING: fdm : #pdfr (international organized crime.h3)
Mar 10, 2016 12:38:43 PM org.lemurproject.galago.core.tools.apps.BatchSearch run
INFO: Transformed Query:
#combine:0=0.3333333333333333:1=0.3333333333333333:2=0.3333333333333333:3=0.0:4=0.0(
    #pl2:c=6.0:collectionLength=252359881:documentCount=528155:maximumCount=326:nodeFrequency=174191(
        #lengths:document:part=lengths()
        #counts:international:part=postings.krovetz() )
    #pl2:c=6.0:collectionLength=252359881:documentCount=528155:maximumCount=111:nodeFrequency=11401(
        #lengths:document:part=lengths()
        #counts:organized:part=postings.krovetz() )
    #pl2:c=6.0:collectionLength=252359881:documentCount=528155:maximumCount=1:nodeFrequency=1(
        #lengths:document:part=lengths()
        #counts:crime:part=field.krovetz.h3() )
    #bil2:c=0.05:collectionLength=252359881:documentCount=528155(
        #lengths:document:part=lengths()
        #ordered:5(
            #extents:international:part=postings.krovetz()
            #extents:organized:part=postings.krovetz() ) )
    #bil2:c=0.05:collectionLength=252359881:documentCount=528155(
        #lengths:document:part=lengths()
        #ordered:5(
            #extents:organized:part=postings.krovetz()
            #extents:crime:part=field.krovetz.h3() ) ) )

pdfr Q0 FBIS3-8153 1 4.99343555 galago
pdfr Q0 FBIS4-41991 2 3.79434205 galago
pdfr Q0 LA121990-0141 3 3.65754089 galago
pdfr Q0 FBIS4-54904 4 3.61571623 galago
pdfr Q0 FBIS4-38364 5 3.57971170 galago
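The weights 0.333 and 0.0 in the log above fit a split in which termLambda is shared by the term nodes and the remainder by the bigram nodes. The Python sketch below follows that reading. It is an assumption inferred from this one log, not the ProximityDFRTraversal code, and `pdfr_nodes` is a hypothetical name.

```python
def pdfr_nodes(terms, term_lambda=1.0, c=6.0, cp=0.05, window_size=5):
    """Return (weight, node) pairs for a proximity-DFR style expansion."""
    # Adjacent term pairs only (sequential dependence).
    pairs = list(zip(terms, terms[1:]))
    # Assumption: term nodes share term_lambda equally.
    nodes = [(term_lambda / len(terms), f"#pl2:c={c}({t})") for t in terms]
    if pairs:
        # Assumption: bigram nodes share the remaining weight equally.
        bigram_weight = (1.0 - term_lambda) / len(pairs)
        nodes += [(bigram_weight, f"#bil2:c={cp}(#ordered:{window_size}({a} {b}))")
                  for a, b in pairs]
    return nodes


for weight, node in pdfr_nodes(["international", "organized", "crime"]):
    print(f"{weight:.4f}  {node}")
```

With termLambda at 1.0, as in the example configuration, the bigram nodes receive zero weight and the ranking depends on the term nodes alone.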
Dirichlet smoothing function depending on document length. This is the default smoothing function for all query operators.
#dirichlet:mu=1200(international)
Expands to:
#dirichlet:collectionLength=N:maximumCount=N:mu=1200:nodeFrequency=N:w=0.N (
    #lengths:document:part=lengths()
    #counts:theTerm:part=postings.krovetz() )
Parameters listed in the expanded query appear in alphabetical order.
galago batch-search --verbose=true --requested=5 --mu=1000 \
    --index=/myindexes/ap89_fields.idx --query="#dirichlet(survivors)"

Mar 09, 2016 1:56:44 PM org.lemurproject.galago.core.tools.apps.BatchSearch run
INFO: RUNNING: unk-0 : #dirichlet(survivors)
Mar 09, 2016 1:56:44 PM org.lemurproject.galago.core.tools.apps.BatchSearch run
INFO: Transformed Query:
#dirichlet:collectionLength=3801748:maximumCount=10:mu=1000:nodeFrequency=230(
    #lengths:document:part=lengths()
    #counts:survivors:part=postings.krovetz() )

unk-0 Q0 AP890102-0135 1 -5.03674813 galago
unk-0 Q0 AP890112-0108 2 -5.34195179 galago
unk-0 Q0 AP890113-0159 3 -5.61024136 galago
unk-0 Q0 AP890102-0044 4 -5.64420944 galago
unk-0 Q0 AP890102-0137 5 -5.70465571 galago
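For reference, the textbook form of Dirichlet smoothing can be written in a few lines of Python. This is a minimal sketch of the standard formula, assuming that the nodeFrequency and collectionLength parameters shown above supply the background probability. The function name, the document length and the term count are hypothetical, and Galago's own scorer may handle edge cases differently.

```python
import math


def dirichlet_score(tf, doc_len, node_frequency, collection_length, mu):
    """Log of the Dirichlet-smoothed probability of a term in a document.

    Assumes node_frequency > 0, so the background probability is non-zero.
    """
    # Background (collection) probability of the term.
    background = node_frequency / collection_length
    return math.log((tf + mu * background) / (doc_len + mu))


# Collection statistics and mu from the log above; the document itself is hypothetical.
with_term = dirichlet_score(tf=2, doc_len=300, node_frequency=230,
                            collection_length=3801748, mu=1000)
without_term = dirichlet_score(tf=0, doc_len=300, node_frequency=230,
                               collection_length=3801748, mu=1000)
print(with_term, without_term)
```

A larger mu pulls the estimate toward the collection probability, so the term count in the document matters less.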
The BM25 (Okapi) scoring function. Implemented in the org.lemurproject.galago.core.retrieval.iterator.scoring.BM25ScoringIterator class using the #bm25 operator.
{
  "verbose" : true,
  "casefold" : true,
  "requested" : 5,
  "index" : "/myindexes/robust04.idx",
  "K" : 0.7777,
  "b" : 0.345,
  "queries" : [
    {
      "number" : "bm25-combine",
      "text" : "#combine(#bm25(international) #bm25(organized) #bm25(crime))"
    }
  ]
}
galago batch-search /myqueries/qrys_bm25.json

Mar 09, 2016 4:11:55 PM org.lemurproject.galago.core.tools.apps.BatchSearch run
INFO: RUNNING: bm25f-combine : #combine(#bm25(international) #bm25(organized) #bm25(crime))
Mar 09, 2016 4:11:55 PM org.lemurproject.galago.core.tools.apps.BatchSearch run
INFO: Transformed Query:
#combine:w=1.0(
    #bm25:b=0.345:collectionLength=252359881:documentCount=528155:maximumCount=326
            :nodeDocumentCount=102493:nodeFrequency=174191:w=0.3333333333333333(
        #lengths:document:part=lengths()
        #counts:international:part=postings.krovetz() )
    #bm25:b=0.345:collectionLength=252359881:documentCount=528155:maximumCount=111
            :nodeDocumentCount=8455:nodeFrequency=11401:w=0.3333333333333333(
        #lengths:document:part=lengths()
        #counts:organized:part=postings.krovetz() )
    #bm25:b=0.345:collectionLength=252359881:documentCount=528155:maximumCount=68
            :nodeDocumentCount=14954:nodeFrequency=30997:w=0.3333333333333333(
        #lengths:document:part=lengths()
        #counts:crime:part=postings.krovetz() ) )

bm25-combine Q0 FBIS4-41991 1 5.87150854 galago
bm25-combine Q0 FBIS4-38364 2 5.65608722 galago
bm25-combine Q0 FBIS4-55395 3 5.59812441 galago
bm25-combine Q0 FBIS3-19646 4 5.54131058 galago
bm25-combine Q0 FBIS3-21961 5 5.54131058 galago
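The statistics in the log above can be plugged into a short Python sketch of BM25. The idf form used here, ln(documentCount / (nodeDocumentCount + 0.5)), reproduces the idf values printed in the BM25F log further down this page from the counts printed above. The term-frequency part is the standard Okapi saturation formula and is an assumption about the scorer, as are the function names and the example document.

```python
import math


def bm25_idf(document_count, node_document_count):
    """Inverse document frequency as ln(N / (df + 0.5))."""
    return math.log(document_count / (node_document_count + 0.5))


def bm25_score(tf, doc_len, avg_doc_len, idf, K, b):
    """Standard BM25 term score with length normalization controlled by b."""
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    return idf * tf * (K + 1.0) / (tf + K * length_norm)


# Counts for "international" taken from the log above.
idf = bm25_idf(document_count=528155, node_document_count=102493)
print(idf)

# Average document length from collectionLength / documentCount;
# tf and doc_len describe a hypothetical document.
avg_len = 252359881 / 528155
print(bm25_score(tf=3, doc_len=400, avg_doc_len=avg_len, idf=idf, K=0.7777, b=0.345))
```

Raising b strengthens the penalty for long documents, while K controls how quickly repeated occurrences of a term stop adding to the score.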
The BM25F operator is a traversal that implements the BM25 smoothing/ranking algorithm on
a per-field basis. It is actually implemented as the #bm25fcomb operator. Since it is a contribution, one must explicitly add the traversal class to the retrieval parameters.
E.g., add
--traversals+org.lemurproject.galago.contrib.retrieval.traversal.BM25FTraversal
to your retrieval parameters or the command line when invoking search or batch-search.
Since contrib.jar is needed for this traversal, it should be on the classpath.
Alternatively, use the ./contrib/target/appassembler/bin/galago version of galago rather than
the usual ./core/target/appassembler/bin/galago. Just copy the contrib jar to the appassembler/lib
there and edit the galago script to include the contrib jar in the CLASSPATH.
smoothing    Map of field-based smoothing values (b values)
traversals   Defines the traversal class to be used and how it is used
{
  "verbose" : true,
  "casefold" : true,
  "requested" : 5,
  "index" : "/myindexes/robust04.idx",
  "traversals" : [
    {
      "name" : "org.lemurproject.galago.contrib.retrieval.traversal.BM25FTraversal",
      "order" : "before"
    }
  ],
  "fields" : [ "h3", "text" ],
  "bm25f" : {
    "K" : 0.7777,
    "b" : 0.345,
    "weights" : { "h3" : 0.455, "text" : 0.201 },
    "smoothing" : { "h3" : 0.309, "text" : 0.105 }
  },
  "queries" : [
    {
      "number" : "bm25f",
      "text" : "#bm25f(international organized crime)"
    }
  ]
}
./galago batch-search /myqueries/qrys_bm25f.json

Mar 09, 2016 3:58:18 PM org.lemurproject.galago.core.tools.apps.BatchSearch run
INFO: RUNNING: bm25f : #bm25f(international organized crime)
Mar 09, 2016 3:58:18 PM org.lemurproject.galago.core.tools.apps.BatchSearch run
INFO: Transformed Query:
#bm25fcomb:K=0.7777:idf0=1.6395904192983146:idf1=4.134572684024236
        :idf2=3.5643775433430793:norm=false(
    #combine:0=0.455:1=0.201:norm=false(
        #bm25field:K=0.7777:b=0.309:collectionLength=29039:documentCount=821
                :idf=1.6395904192983146
                :lengths=h3:maximumCount=2:nodeDocumentCount=31:pIdx=0:w=0.455(
            #lengths:h3:part=lengths()
            #counts:international:part=field.krovetz.h3() )
        #bm25field:K=0.7777:b=0.105:collectionLength=247217451:documentCount=524000
                :idf=1.6395904192983146
                :lengths=text:maximumCount=326:nodeDocumentCount=84025:pIdx=0:w=0.201(
            #lengths:text:part=lengths()
            #counts:international:part=field.krovetz.text() ) )
    #combine:0=0.455:1=0.201:norm=false(
        #bm25field:K=0.7777:b=0.309:collectionLength=29039:documentCount=821
                :idf=4.134572684024236
                :lengths=h3:maximumCount=0:nodeDocumentCount=0:pIdx=1:w=0.455(
            #lengths:h3:part=lengths()
            #counts:organized:part=field.krovetz.h3() )
        #bm25field:K=0.7777:b=0.105:collectionLength=247217451:documentCount=524000
                :idf=4.134572684024236
                :lengths=text:maximumCount=111:nodeDocumentCount=8450:pIdx=1:w=0.201(
            #lengths:text:part=lengths()
            #counts:organized:part=field.krovetz.text() ) )
    #combine:0=0.455:1=0.201:norm=false(
        #bm25field:K=0.7777:b=0.309:collectionLength=29039:documentCount=821
                :idf=3.5643775433430793
                :lengths=h3:maximumCount=1:nodeDocumentCount=1:pIdx=2:w=0.455(
            #lengths:h3:part=lengths()
            #counts:crime:part=field.krovetz.h3() )
        #bm25field:K=0.7777:b=0.105:collectionLength=247217451:documentCount=524000
                :idf=3.5643775433430793
                :lengths=text:maximumCount=68:nodeDocumentCount=14712:pIdx=2:w=0.201(
            #lengths:text:part=lengths()
            #counts:crime:part=field.krovetz.text() ) ) )

bm25f Q0 FBIS4-41991 1 6.68788814 galago
bm25f Q0 FBIS4-38364 2 6.63572238 galago
bm25f Q0 FBIS4-7811 3 6.55618402 galago
bm25f Q0 FBIS3-24143 4 6.39592308 galago
bm25f Q0 FBIS4-55395 5 6.38987171 galago
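The way per-field weights and smoothing values enter a BM25F score can be sketched in Python as follows. This is the commonly published BM25F formulation: length-normalized field counts are weighted and summed into a pseudo term frequency, which is then saturated with K and multiplied by the idf. Whether #bm25fcomb computes exactly this is an assumption, and the function name and the example document are hypothetical.

```python
def bm25f_score(field_tf, field_len, avg_field_len, weights, smoothing, idf, K):
    """BM25F score of one term, combining counts from several fields."""
    pseudo_tf = 0.0
    for field, tf in field_tf.items():
        b = smoothing[field]
        # Per-field length normalization controlled by the field's b value.
        length_norm = 1.0 - b + b * field_len[field] / avg_field_len[field]
        pseudo_tf += weights[field] * tf / length_norm
    # Saturate the combined count and scale by the term's idf.
    return idf * pseudo_tf / (K + pseudo_tf)


# Weights, smoothing values, K and idf from the example above;
# counts and lengths describe a hypothetical document.
score = bm25f_score(
    field_tf={"h3": 1, "text": 4},
    field_len={"h3": 12, "text": 500},
    avg_field_len={"h3": 29039 / 821, "text": 247217451 / 524000},
    weights={"h3": 0.455, "text": 0.201},
    smoothing={"h3": 0.309, "text": 0.105},
    idf=1.6395904192983146,
    K=0.7777,
)
print(score)
```

Combining the fields before saturation means that a term repeated across several fields is not rewarded as if each field were an independent document.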