More robust version of getting the doc ID for a WARC record.
What happens if you set <trecformat> to true? What version of Indri are you using? The "doc" tags should be all caps.
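To illustrate the point about capitalised tags, here is a minimal sketch that wraps one document in upper-case TREC-style tags. It assumes the collection uses the common TREC text layout with DOCNO and TEXT fields. The function name, document ID, and text are hypothetical and not taken from the original thread.

```python
def to_trectext(doc_id: str, text: str) -> str:
    """Wrap one document in upper-case TREC text tags (assumed layout)."""
    return (
        "<DOC>\n"
        f"<DOCNO>{doc_id}</DOCNO>\n"
        "<TEXT>\n"
        f"{text}\n"
        "</TEXT>\n"
        "</DOC>\n"
    )


# Hypothetical document, for illustration only.
record = to_trectext("doc-001", "hello world")
print(record)
```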
RankLib
You should be passing the "-output" path from harvestlinks in the <inlinks> parameter to build the index.
What parameters did you pass to harvestlinks?
This was caused because the initial run used the parameter --filetype=trecweb and specified a galagoJobDir. The run failed because the documents were not in trecweb format. Even with the filetype set to trectext, Galago used the initial filetype saved in the galagoJobDir. When the galagoJobDir folder was changed or removed, the new filetype was used and the index was built.
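A minimal sketch of the workaround described above: clear the leftover job directory before rebuilding so that no settings from the failed run are reused. The helper names and paths are hypothetical. The --inputPath and --indexPath flag names are assumptions about the galago build command line, and the command is only assembled here, not executed.

```python
import shutil
import tempfile
from pathlib import Path


def reset_job_dir(job_dir):
    """Delete a leftover job directory; return True if something was removed."""
    path = Path(job_dir)
    if path.exists():
        shutil.rmtree(path)
        return True
    return False


def build_command(input_path, index_path, filetype, job_dir):
    """Assemble a `galago build` argument list (flag names are assumptions)."""
    return [
        "galago", "build",
        f"--inputPath={input_path}",
        f"--indexPath={index_path}",
        f"--filetype={filetype}",
        f"--galagoJobDir={job_dir}",
    ]


# Stand-in for a job directory left behind by a failed trecweb run.
job_dir = tempfile.mkdtemp(prefix="galago-job-")
removed = reset_job_dir(job_dir)
command = build_command("docs/", "index/", "trectext", job_dir)
print(" ".join(command))
```

Pointing --galagoJobDir at a fresh folder should have the same effect as deleting the old one.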
Galago filetype not handled correctly
Also, could you please let us know the full URL you are expecting to be returned? I'm wondering if there is a character in there that Indri is interpreting as a token separator.
Added logic to handle empty documents.
Made the code to detect a trec text doc a bit more robust.
Galago Functions
Documented filetype param for build function.
Galago Basic Retrieval Configuration
Added support for indexing .z compressed files.
RankLib How to use
RankLib How to use
I think the easiest approach would be to manually split the data and use the train, test, and validate parameters. It may be possible to combine the ksv and tts parameters but I'm not sure. I don't think the solution provided in the reply you referenced will work.
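A minimal sketch of the manual split suggested above, assuming the data is in the LETOR-style format RankLib reads (label, then qid:N, then features). All rows for a query are kept in the same split. The function name, the split ratios, and the sample rows are hypothetical.

```python
import random
from collections import defaultdict


def split_by_query(lines, train=0.6, validate=0.2, seed=0):
    """Split LETOR-style lines into train/validate/test, grouped by query ID."""
    by_qid = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        qid = line.split()[1]  # second token is expected to look like "qid:12"
        by_qid[qid].append(line)

    qids = sorted(by_qid)
    random.Random(seed).shuffle(qids)  # fixed seed keeps the split reproducible

    n_train = int(len(qids) * train)
    n_val = int(len(qids) * validate)
    parts = {
        "train": qids[:n_train],
        "validate": qids[n_train:n_train + n_val],
        "test": qids[n_train + n_val:],
    }
    return {name: [row for q in ids for row in by_qid[q]]
            for name, ids in parts.items()}


# Made-up data: 10 queries with 2 documents each.
rows = [f"{d % 2} qid:{q} 1:0.{d + 1} 2:0.5" for q in range(1, 11) for d in range(2)]
splits = split_by_query(rows)
for name, part in splits.items():
    print(name, len(part))
```

Each list can then be written to its own file and passed to RankLib with -train, -validate, and -test.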
Added tag galago-3.16 for changeset 1de6bba319f2
test of team city (windows)
test of team city
This has been fixed via commit: https://sourceforge.net/p/lemur/code/2773/ To get this change, pull the latest code (2.13-SNAPSHOT)
RankLib: metricT parameter is not getting set correctly
fixed in commit: https://sourceforge.net/p/lemur/code/2773/
Fixed check of "metric2t" and "metric2T" to be case sensitive.
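A minimal sketch of why the comparison has to be case sensitive: the two flag names differ only in the case of the last letter, so a case-insensitive check cannot tell them apart. This is not RankLib's actual parsing code. The function name is hypothetical, and reading -metric2t as the training metric and -metric2T as the test metric follows the RankLib usage text as I understand it.

```python
def parse_metrics(args):
    """Read -metric2t and -metric2T from an argument list, case-sensitively."""
    metrics = {"train": None, "test": None}
    i = 0
    while i < len(args) - 1:
        if args[i] == "-metric2t":      # lower-case t: training metric (assumed)
            metrics["train"] = args[i + 1]
            i += 2
        elif args[i] == "-metric2T":    # upper-case T: test metric (assumed)
            metrics["test"] = args[i + 1]
            i += 2
        else:
            i += 1
    return metrics


parsed = parse_metrics(["-metric2t", "NDCG@10", "-metric2T", "ERR@10"])
print(parsed)

# A case-insensitive comparison would treat both flags as the same name.
flags_collide = "-metric2t".lower() == "-metric2T".lower()
print(flags_collide)
```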
RankLib: metricT parameter is not getting set correctly
Tests fail on Windows
fixed in v3.16
Volker, Please try running your experiment with this jar: https://sourceforge.net/projects/lemur/files/tmp/RankLib-2.13-SNAPSHOT.jar/download
What ranking algorithm are you using?
BM25 default values
2.13 snapshot
tag for release 2.12
2.12 release
3.17 SNAPSHOT
SourceForge lost about 6 months' worth of commits. This puts them back.
Galago 3.16 release
Galago 3.16 release
Galago 3.16 release
I think using passage retrieval will get you close to what you want.
Ranklib docs pointing to old version
This error occurs because commons-math3-3.5.jar can't be found. The easiest solution is to put that jar in the same folder as RankLib-2.11.jar or fix the classpath.
Ranklib docs pointing to old version
I couldn't get that old version to run either, but had no issues using the latest (2.11 or 2.11-SNAPSHOT)
Note that the parameter has an underscore: feature_stats
Please see the dump-term-stats function.
Galago Functions
What collection are you trying to index? Can you send one of the sgml files? We've seen this when there is a missing tag.
In the past we've experienced issues with newer versions of g++. We compiled with version 4.4.7 and that appears to work.
As stated in the documentation, XML documents must be indexed one at a time.
The page on the SourceForge wiki seems to have disappeared, but an old version can be found here: https://lemurproject.org/doxygen/lemur/html/IndriRunQuery.html "When running a baseline experiment, the queries may not contain any indri query language operators, they must contain only terms."
Galago Functions
I misspoke: those values are probabilities. They are very small because they are calculated across the entire corpus.
Removed dump-corpus options that were being ignored.
Galago Functions
Galago Functions
Galago Functions
Galago Functions
Iris, Only the relevant documents are returned. If there are 1,000, you'll get 1,000. If only 23 documents are relevant, it will only return those 23. You're getting 1,000 with the "#rm(#stopword(poaching))" query because, when it analyzes the query, the #rm operator performs query expansion, so you're getting MANY more terms and hence more relevant documents. You can see this if you run "galago search", then in the web interface click the "debug" link and you'll see all the query terms. Your two queries...
That page has not been written yet; the link was just a placeholder. I've removed the link to avoid future confusion.
Galago Basic Retrieval Configuration
Galago Basic Retrieval Configuration
Galago Basic Retrieval Configuration
Update dependencies with security issues
Zhihao, Unfortunately changes in g++ can cause issues like this. We compiled with version 4.4.7 and did not have any issues creating the index. Michael
Tests fail on Windows
removing test string
team city test
testing mirror 3
testing mirror 2
testing mirror
Not sure how this got reverted.
Removed call to script that installed drmaa
test travis ci
test travis ci
Galago Functions
Galago: --mode=threaded has deadlocks
removed threaded mode in Galago 3.16 - fork mode replaces it.
closing branch
Removed threaded build mode
Removed threaded build mode
remove-thread-mode
Added "fork" to the help message.
Allow Multiple Index Parts to be Specified for Dump-Term-Stats
implemented in Galago 3.10
Is Galago 3.13 available via maven?
This was due to license restrictions on drmaa. Galago version 3.16 removed drmaa, so at a minimum we can publish to the UMass maven repo. Don't think it's worth the effort to get it in the central maven repo.
Update wiki for SLURM
Dhruv, This is due to the small number of features that you have. The default Feature-sampling value is 0.3, and in the code the number of features is multiplied by that value then cast to an int which results in zero. To get around this, you can either increase the default value by passing in a higher value such as "-frate 0.4" or you can add one extra feature.
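The truncation described above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not RankLib's code. The function name and the feature counts of 3 and 4 are hypothetical and chosen only to show the effect.

```python
def sampled_feature_count(n_features, rate):
    """Multiply the feature count by the sampling rate and truncate to an int."""
    return int(n_features * rate)


# (number of features, feature-sampling rate): hypothetical values.
cases = [(3, 0.3), (3, 0.4), (4, 0.3)]
counts = [sampled_feature_count(n, rate) for n, rate in cases]
for (n, rate), count in zip(cases, counts):
    print(f"{n} features at rate {rate}: {count} sampled")
```

With three features, the default rate truncates to zero sampled features, while either raising the rate to 0.4 or adding a fourth feature yields one.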
Galago Functions
Galago: Quick Guide for New Users
Galago Temporary Files