Discussion moved from Wiki Advanced Retrieval Configuration page.
James K J - 2018-01-08
When I run Galago with the command below, I get the following error:
galago build indexParam.json
Couldn't delete buildManifest.json
Exception in thread "main" java.lang.RuntimeException: java.lang.NullPointerException: inStream parameter is null
at org.lemurproject.galago.core.tools.apps.BuildIndex.getIndexJob(BuildIndex.java:784)
at org.lemurproject.galago.core.tools.apps.BuildIndex.execute(BuildIndex.java:825)
at org.lemurproject.galago.core.tools.apps.BuildIndex.run(BuildIndex.java:859)
at org.lemurproject.galago.utility.tools.AppFunction.run(AppFunction.java:62)
at org.lemurproject.galago.core.tools.App.run(App.java:96)
at org.lemurproject.galago.core.tools.App.run(App.java:87)
at org.lemurproject.galago.core.tools.App.main(App.java:83)
Caused by: java.lang.NullPointerException: inStream parameter is null
at java.base/java.util.Objects.requireNonNull(Objects.java:246)
at java.base/java.util.Properties.load(Properties.java:365)
at org.lemurproject.galago.utility.VersionInfo.setGalagoVersionAndBuildDateTime(VersionInfo.java:47)
at org.lemurproject.galago.core.tools.apps.BuildIndex.getIndexJob(BuildIndex.java:652)
indexParam.json:
{
"fileType" : "trectext",
"inputPath" : "Data/trec.sample",
"indexPath" : "sample_index",
"stemmedPostings": true,
"nonStemmedPostings": true,
"stemmer": ["porter"],
"tokenizer" : {"fields" : ["docno", "headline", "text"]},
"corpus": true
}
Can someone please help me fix this issue?
Stephen Harding - 2018-01-08
Galago does not like the indexPath you specified.
Do you have permission to write/delete from the specified directory?
Are you able to manually delete the old index completely? If so, do so and try again. Your parameter file looks OK assuming the input and index paths are valid.
James K J - 2018-01-08
Q) Do you have permission to write/delete from the specified directory?
A) Yes
Q) Are you able to manually delete the old index completely?
A) So far I have not been able to generate an index at all. This is the first time I am running Galago.
Q) Galago does not like the indexPath you specified.
A) Can you please help me understand what this means?
Stephen Harding - 2018-01-08
Does this index directory get created (./sample_index) in your work directory?
If not, can you manually create a sample_index directory in your work location?
Try putting a full path for index and source paths just to be sure where data is coming from and where an index is going.
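The permission and path questions above can also be checked mechanically before running the build. The following is only a rough Python sketch, not part of Galago: the helper names check_build_paths and load_params are made up for illustration, and it assumes inputPath is a single path string, as in the parameter files in this thread.

```python
import json
import os


def check_build_paths(params):
    """Return a list of human-readable problems with inputPath / indexPath."""
    problems = []

    input_path = os.path.abspath(params["inputPath"])
    if not os.path.exists(input_path):
        problems.append(f"inputPath does not exist: {input_path}")
    elif not os.access(input_path, os.R_OK):
        problems.append(f"inputPath is not readable: {input_path}")

    index_path = os.path.abspath(params["indexPath"])
    # The index directory may not exist yet, so walk up to the nearest
    # existing ancestor and check that entries could be created there.
    probe = index_path
    while not os.path.exists(probe):
        parent = os.path.dirname(probe)
        if parent == probe:
            break
        probe = parent
    if not os.access(probe, os.W_OK | os.X_OK):
        problems.append(f"cannot write under: {probe}")

    return problems


def load_params(path):
    """Read a build parameter file such as indexParam.json."""
    with open(path) as f:
        return json.load(f)
```

Calling check_build_paths(load_params("indexParam.json")) returns an empty list when both paths look usable. Because relative paths are resolved against the current working directory, it also shows where "sample_index" would actually end up.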
James K J - 2018-01-08
Q) Does this index directory get created (./sample_index) in your work directory?
A) Nope
Q) If not, can you manually create a sample_index directory in your work location?
A) Yes, I created it, but when I ran the command the directory was deleted.
Q) Try putting a full path for index and source paths just to be sure where data is coming from and where an index is going.
A) Yes.
Below is my indexParam.json:
{
"fileType" : "trectext",
"inputPath" : "/home/james/TTDS/Project/Data/trec.sample",
"indexPath" : "/home/james/TTDS/Project/sample",
"stemmedPostings": true,
"nonStemmedPostings": true,
"stemmer": ["porter"],
"tokenizer" : {"fields" : ["docno", "headline", "text"]},
"corpus": true
}
milbat99 - 11 hours ago
Stephen,
I was able to get your above configuration working with version 3.14159 but not with 3.13-bin. Perhaps there is a bug? I downloaded just galago-3.13-bin.tar.gz from https://sourceforge.net/projects/lemur/files/lemur/galago-3.13/ . If I simply change my system path variable to point to 3.13-bin instead of 3.14159, with the exact configuration you have above (save for the different input and index paths), it no longer works and reports the same error as James's:
galago build test_build.json
Couldn't delete buildManifest.json
Exception in thread "main" java.lang.RuntimeException: java.lang.NullPointerException: inStream parameter is null
at org.lemurproject.galago.core.tools.apps.BuildIndex.getIndexJob(BuildIndex.java:784)
at org.lemurproject.galago.core.tools.apps.BuildIndex.execute(BuildIndex.java:825)
at org.lemurproject.galago.core.tools.apps.BuildIndex.run(BuildIndex.java:859)
at org.lemurproject.galago.utility.tools.AppFunction.run(AppFunction.java:62)
at org.lemurproject.galago.core.tools.App.run(App.java:96)
at org.lemurproject.galago.core.tools.App.run(App.java:87)
at org.lemurproject.galago.core.tools.App.main(App.java:83)
Caused by: java.lang.NullPointerException: inStream parameter is null
at java.base/java.util.Objects.requireNonNull(Objects.java:246)
at java.base/java.util.Properties.load(Properties.java:365)
at org.lemurproject.galago.utility.VersionInfo.setGalagoVersionAndBuildDateTime(VersionInfo.java:47)
at org.lemurproject.galago.core.tools.apps.BuildIndex.getIndexJob(BuildIndex.java:652)
... 6 more
Do you know why this may be the case?
Another, perhaps unrelated, question: does version 3.14159 support BM25? That is the only reason I am trying the newer release.
I appreciate your help.
Last edit: milbat99 11 hours ago
Stephen Harding - 4 hours ago
Version 3.14159 is quite an old version (despite the naming) and is different from newer versions.
What sort of system are you running on? Can you compile the latest galago using sources (you'll need Maven and Java 8)? It's not a long or complicated build.
Do any index part and buildManifest.json files appear in the defined index directory after you attempt to build the index? Can you manually remove any files that are present in the build directory?
I'd suggest you delete the new build index directory and subdirectories as well as the various tupleflow* temp files in the configured galago tmp directory (default /tmp) and try again.
I do notice that the stack trace points to a problem determining the Galago version information, which relies on Maven-generated version property values that could not be loaded, although no IOException seems to be thrown. The binary version would not have the version.properties file present, but I don't believe failing to read it would cause the problem. The binary version works for me locally without it.
However, if your build does generate a buildManifest.json file in the index directory, does it contain galagoVersion and galagoVersionBuildDateTime values in it?
If it does not, as an experiment, try creating a version.properties file and put it somewhere in your classpath, perhaps the bin/ or lib/ directory. The property file looks like this:
version=3.13
build.date=2018-01-30 18:54
Put whatever build date you want and see if it makes a difference.
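To see whether a version.properties file is already reachable from the jars on the classpath, the jar files can be scanned directly, since a jar is an ordinary zip archive. This is a hedged Python sketch: the helper name jars_containing is invented here, and it assumes the jars sit directly in one directory such as the distribution's lib/.

```python
import glob
import os
import zipfile


def jars_containing(lib_dir, entry_suffix="version.properties"):
    """Map each jar in lib_dir to the entries whose names end with entry_suffix."""
    hits = {}
    for jar in sorted(glob.glob(os.path.join(lib_dir, "*.jar"))):
        # A jar is a zip archive, so the standard zipfile module can list it.
        with zipfile.ZipFile(jar) as zf:
            matches = [n for n in zf.namelist() if n.endswith(entry_suffix)]
        if matches:
            hits[jar] = matches
    return hits
```

For example, jars_containing("/Users/myname/Desktop/galago-3.13-bin/lib") would list every jar in that directory that ships such a file. An empty result means none was found there.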
It appears version 3.14159 does support BM25, but I'm not sure of its use in such an old Galago version. Current Galago uses the #bm25 query operator or simply defines BM25 as a Scorer retrieval parameter.
The problem is not related to the version.properties file, as it resides in the core jar file in the binary lib/ directory.
Current thinking is that perhaps you simply ran out of tmp space when building the index.
So remove your index build directory (the one specified by the indexPath parameter) and also the tupleflow tmp directories for the build. These files will reside in the directories specified by tmpdir in your .galago.conf file if you have one, or typically /tmp if not otherwise specified.
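A small Python sketch for locating leftover tupleflow work directories and checking free space is shown below. It is not a Galago tool: the function names are hypothetical, the default of /tmp follows the description above, and it assumes the leftover entries have names starting with "tupleflow". It only lists entries and deliberately deletes nothing.

```python
import os
import shutil


def find_tupleflow_entries(tmpdir="/tmp"):
    """List entries directly under tmpdir whose names start with 'tupleflow'."""
    if not os.path.isdir(tmpdir):
        return []
    return sorted(
        os.path.join(tmpdir, name)
        for name in os.listdir(tmpdir)
        if name.startswith("tupleflow")
    )


def free_gib(path):
    """Free space, in GiB, on the filesystem that holds path."""
    return shutil.disk_usage(path).free / 2**30
```

Running find_tupleflow_entries() after a failed build shows which directories are candidates for manual removal, and free_gib("/tmp") gives a quick indication of whether the temp filesystem is close to full.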
Stephen,
I am using the binary distribution, for which I assumed you simply follow steps 1-3 as described in the wiki and do not have to build with Maven and Java 8 in IntelliJ. (I have both installed on my computer.)
I am running macOS Sierra version 10.12.6. When I attempt to build, no tmp file is created.
I attempted to add a version.properties file with the contents you listed above (under /Users/myname/Desktop/galago-3.13-bin). I believe that galago-3.13-bin itself is correctly configured, as I can see the added option after running galago help with version 3.13-bin as opposed to 3.14159.
Here is my test_build.json, located in /Users/myname:
{
"fileType" : "trectext",
"indexPath" : "/Users/myname/work/tmp/test.idx",
"inputPath" : "/Users/myname/work/data/test.trectext",
"stemmer" : ["porter"],
"nonStemmedPostings" : true,
"stemmedPostings" : true,
"tokenizer" : {
"fields" : ["docno", "text"]
},
"corpus" : true
}
My work directory is laid out as follows:
users
└── myname
    └── work
        └── data
            └── test.trectext
where test.trectext contains the exact info you have in your above example. There is no tmp directory created after the galago build test_build.json fails with "Couldn't delete buildManifest.json"
Does anything from this seem incorrect to you?
I think the problem is temp space filling up, which is causing one or more of the build job stages to fail.
Galago uses a lot of tmp space doing a series of map/reduce steps in generating its index parts. If there isn't enough space, the process fails.
You can define where the tmp space is to be located either by specifying it in a .galago.conf file in your home directory, or within the build configuration JSON file. Otherwise, it uses a default which is typically /tmp.
Since you had some parts of an index build underway, there MUST be tmp files somewhere. They are named tupleflowNNNN.. and will have subdirectories where various indexing job stage work was done.
Try the following. Again remove the build index directory. Remove the tupleflow tmp directory if you can find it. Then add the following parameters to your build config:
Make sure the tmp and job dir definitions have sufficient space.
Do the build.
If it fails again, go to your defined tmp directory, confirm there is a tupleflow work directory present, and then check to see how much space is available there.
I'm thinking the space will be full causing the build to fail.
You'll need to manually remove the job directory and tmp directory work files since you've specifically asked in the config file to not automatically delete them.
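The parameter list that followed "add the following parameters" did not survive in this copy of the thread. The reply further down shows the three keys that were added: galagoJobDir, tmpdir and deleteJobDir. As a minimal Python sketch, those keys could be merged into an existing build configuration like this; the helper name with_job_dirs is made up, and the directory values are placeholders to be replaced with locations that have enough space.

```python
import json


def with_job_dirs(params, job_dir, tmp_dir):
    """Return a copy of the build params with explicit job/tmp directories.

    deleteJobDir is set to False so the stage directories are kept after
    the build and can be inspected (they must then be removed by hand).
    """
    out = dict(params)
    out["galagoJobDir"] = job_dir
    out["tmpdir"] = tmp_dir
    out["deleteJobDir"] = False
    return out


def dump_params(params):
    """Serialize the params as indented JSON for a build config file."""
    return json.dumps(params, indent=2)
```

The output of dump_params(with_job_dirs(params, "/Users/myname/build", "/Users/myname/tmp")) can be written to the JSON file that is passed to galago build.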
Stephen,
Attached is a screenshot of the tupleflow directories present on my system; I am not sure which to delete. I added the extra configuration you mentioned to my test_build.json, so that it now looks like the following, where build and tmp are empty directories.
{
"fileType" : "trectext",
"galagoJobDir" : "/Users/myname/build",
"tmpdir" : "/Users/myname/tmp",
"deleteJobDir" : false,
"indexPath" : "/Users/myname/work/tmp/test.idx",
"inputPath" : "/Users/myname/work/data/test.trectext",
"stemmer" : ["porter"],
"nonStemmedPostings" : true,
"stemmedPostings" : true,
"tokenizer" : {
"fields" : ["docno", "text"]
},
"corpus" : true
}
There is nothing regarding tupleflow created in either the /Users/myname/build or /Users/myname/tmp directories.
Your screenshot is of the galago tupleflow module and isn't where any temporary files would be written.
Does your configured job directory (build) contain any sub-directories? There should be a bunch of them mapping to the different stages of the index build; names like parsePostings-corpusKeys, parsePostings-fieldLengthData and a bunch more.
If there is nothing in your configured galagoJobDir, and you are not deleting them automatically (deleteJobDir=false) then I'd have to return to thinking there is some sort of permissions issue that isn't allowing you to write to the disk at that location.
I have indexed the example document, using the example configuration on OS X 10.11.6 with galago-3.13-bin without problem.
Last edit: milbat99 2018-02-28
Hi Stephen,
I was able to solve the problem by simply creating an empty build_manifest.json within my indexPath directory.
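For anyone who wants to reproduce this workaround, a short Python sketch follows. The helper name create_empty_manifest is invented for illustration. The file name defaults to build_manifest.json as written in the post above; note that the error message earlier in the thread mentions buildManifest.json, so the name is left as a parameter rather than assumed.

```python
import os


def create_empty_manifest(index_path, name="build_manifest.json"):
    """Create index_path if needed and put an empty manifest file in it."""
    os.makedirs(index_path, exist_ok=True)
    manifest = os.path.join(index_path, name)
    # Mode "a" creates the file if it is missing and leaves any
    # existing content untouched.
    with open(manifest, "a"):
        pass
    return manifest
```

For the configuration in this thread the call would be create_empty_manifest("/Users/myname/work/tmp/test.idx"), run before galago build.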