Couldn't Delete Galago buildManifest.json

Galago
2018-02-27
2018-03-08
  • Lemur Project

    Lemur Project - 2018-02-27

    Discussion moved from Wiki Advanced Retrieval Configuration page.

    James K J - 2018-01-08

    When I run Galago with the command below:
    
        galago build indexParam.json
    
        I get the following error:
    
        Couldn't delete buildManifest.json
    
        Exception in thread "main" java.lang.RuntimeException: java.lang.NullPointerException: inStream parameter is null
        at org.lemurproject.galago.core.tools.apps.BuildIndex.getIndexJob(BuildIndex.java:784)
        at org.lemurproject.galago.core.tools.apps.BuildIndex.execute(BuildIndex.java:825)
        at org.lemurproject.galago.core.tools.apps.BuildIndex.run(BuildIndex.java:859)
        at org.lemurproject.galago.utility.tools.AppFunction.run(AppFunction.java:62)
        at org.lemurproject.galago.core.tools.App.run(App.java:96)
        at org.lemurproject.galago.core.tools.App.run(App.java:87)
        at org.lemurproject.galago.core.tools.App.main(App.java:83)
        Caused by: java.lang.NullPointerException: inStream parameter is null
        at java.base/java.util.Objects.requireNonNull(Objects.java:246)
        at java.base/java.util.Properties.load(Properties.java:365)
        at org.lemurproject.galago.utility.VersionInfo.setGalagoVersionAndBuildDateTime(VersionInfo.java:47)
        at org.lemurproject.galago.core.tools.apps.BuildIndex.getIndexJob(BuildIndex.java:652)
    
        indexParam.json:
        {
        "fileType" : "trectext",
        "inputPath" : "Data/trec.sample",
        "indexPath" : "sample_index",
        "stemmedPostings": true,
        "nonStemmedPostings": true,
        "stemmer": ["porter"],
        "tokenizer" : {"fields" : ["docno", "headline", "text"]},
        "corpus": true
        }
    
        Can someone help me fix this issue, please?
    

    Stephen Harding - 2018-01-08

        Galago does not like the indexPath you specified.
    
        Do you have permission to write/delete from the specified directory?
    
        Are you able to manually delete the old index completely? If so, do so and try again. Your parameter file looks OK assuming the input and index paths are valid.
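
A quick way to answer the permission question is sketched below. It is only a minimal check, assuming the build needs write access to the nearest existing parent of indexPath; the function name `check_index_path` is made up for illustration and is not part of Galago.

```python
import os
from pathlib import Path


def check_index_path(index_path: str) -> dict:
    """Report whether an index directory could be created or replaced at index_path."""
    target = Path(index_path).expanduser().resolve()
    # Walk up to the nearest ancestor that actually exists on disk.
    parent = target.parent
    while not parent.exists():
        parent = parent.parent
    return {
        "index_exists": target.exists(),
        "nearest_existing_parent": str(parent),
        # Creating or deleting entries in a directory needs write + execute on it.
        "parent_writable": os.access(parent, os.W_OK | os.X_OK),
    }
```

Calling `check_index_path("sample_index")` from the work directory shows where a relative indexPath resolves to and whether that location is writable by the current user.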
    

    James K J - 2018-01-08

        Q) Do you have permission to write/delete from the specified directory?
        A) Yes
    
        Q) Are you able to manually delete the old index completely?
        A) So far, I have not been able to generate the index. This is the first time I am running Galago.
    
        Q) Galago does not like the indexPath you specified.
        A) Can you help me understand this statement better, please?
    

    Stephen Harding - 2018-01-08

        Does this index directory get created (./sample_index) in your work directory?
    
        If not, can you manually create a sample_index directory in your work location?
    
        Try putting a full path for index and source paths just to be sure where data is coming from and where an index is going.
    

    James K J - 2018-01-08

        Q) Does this index directory get created (./sample_index) in your work directory?
        A) Nope
    
        Q) If not, can you manually create a sample_index directory in your work location?
        A) Yes, I created it, but when I ran the command the directory was deleted.
    
        Q) Try putting a full path for index and source paths just to be sure where data is coming from and where an index is going.
        A) yes
    
        Below is my indexParam.json
    
        {
        "fileType" : "trectext",
        "inputPath" : "/home/james/TTDS/Project/Data/trec.sample",
        "indexPath" : "/home/james/TTDS/Project/sample",
        "stemmedPostings": true,
        "nonStemmedPostings": true,
        "stemmer": ["porter"],
        "tokenizer" : {"fields" : ["docno", "headline", "text"]},
        "corpus": true
        }
    

    Stephen Harding - 2018-01-08

        Manually delete the sample directory and then build. The build should create the sample root index directory, and then fill it with different index parts.
    
        Here's my test:
    
        > cat test_build.json
        {
          "fileType" : "trectext",
          "indexPath" : "/work/tmp/test.idx",
          "inputPath" : "/work/data/test.trectext",
          "stemmer" : ["porter"],
          "nonStemmedPostings" : true,
          "stemmedPostings" : true,
          "tokenizer" : {
            "fields" : ["docno", "text"]
          },
          "corpus" : true
        }
    
        Data is just a simple, single document.
    
        <DOC>
        <DOCNO>Test-1</DOCNO>
        <TEXT>
        940218 ft 19 Feb 94.
        UK Company News.
    
        Goldsborough valued at pounds 74 5m in float goldsborough.
        </TEXT>
        </DOC>
    
        Build the index
    
        > galago build test_build.json
    
        Created executor: org.lemurproject.galago.tupleflow.execution.LocalCheckpointedStageExecutor@52de51b6
        Running without server!
        Use --server=true to enable web-based status page.
        Stage inputSplit completed with 0 errors.
        Jan 08, 2018 8:44:51 AM org.lemurproject.galago.core.parse.UniversalParser process
        INFO: Processing split: /work/data/test.trectext with: org.lemurproject.galago.core.parse.TrecTextParser
        Jan 08, 2018 8:44:51 AM org.lemurproject.galago.core.parse.UniversalParser process
        INFO: Processed 1 total in split: /work/data/test.trectext with class org.lemurproject.galago.core.parse.TrecTextParser
        Stage parsePostings completed with 0 errors.
        Stage writeExtentPostings-porter completed with 0 errors.
        Stage writeExtents completed with 0 errors.
        Stage writeNames completed with 0 errors.
        Stage writeExtentPostings completed with 0 errors.
        Stage writeFields completed with 0 errors.
        Stage writeCorpusKeys completed with 0 errors.
        Stage writeLengths completed with 0 errors.
        Stage writeNamesRev completed with 0 errors.
        Stage writePostings completed with 0 errors.
        Stage writePostings-porter completed with 0 errors.
        Done Indexing.
          - 0.00 Hours
          - 0.01 Minutes
          - 0.51 Seconds
        Documents Indexed: 1.
    
        > ls
    
        test_build.json
        test.idx/
    
        The test.idx directory will hold the parts of the index, including buildManifest.json. The parts for this particular build are shown in the listing below.
    
        > ls test.idx
    
        buildManifest.json
        corpus/
        extents
        field.porter.text
        field.text
        lengths
        names
        names.reverse
        postings
        postings.porter
    
        Note that the postings and the defined fields each have non-stemmed and Porter-stemmed parts.
    
        What does your data look like? Is it in trectext format?
    
        Still looks like some sort of write permission problem.
    

    milbat99 - 11 hours ago https://sourceforge.net/u/milbat99/

        Stephen,
    
        I was able to get your configuration above working with version 3.14159 but not with 3.13-bin. Perhaps there is a bug? I downloaded just galago-3.13-bin.tar.gz from https://sourceforge.net/projects/lemur/files/lemur/galago-3.13/ . If I simply change my system path variable to point to 3.13-bin instead of 3.14159, with exactly the configuration you have above (save for the different input and index paths), it no longer works and reports the same error as James's:
    
        galago build test_build.json
        Couldn't delete buildManifest.json
        Exception in thread "main" java.lang.RuntimeException: java.lang.NullPointerException: inStream parameter is null
        at org.lemurproject.galago.core.tools.apps.BuildIndex.getIndexJob(BuildIndex.java:784)
        at org.lemurproject.galago.core.tools.apps.BuildIndex.execute(BuildIndex.java:825)
        at org.lemurproject.galago.core.tools.apps.BuildIndex.run(BuildIndex.java:859)
        at org.lemurproject.galago.utility.tools.AppFunction.run(AppFunction.java:62)
        at org.lemurproject.galago.core.tools.App.run(App.java:96)
        at org.lemurproject.galago.core.tools.App.run(App.java:87)
        at org.lemurproject.galago.core.tools.App.main(App.java:83)
        Caused by: java.lang.NullPointerException: inStream parameter is null
        at java.base/java.util.Objects.requireNonNull(Objects.java:246)
        at java.base/java.util.Properties.load(Properties.java:365)
        at org.lemurproject.galago.utility.VersionInfo.setGalagoVersionAndBuildDateTime(VersionInfo.java:47)
        at org.lemurproject.galago.core.tools.apps.BuildIndex.getIndexJob(BuildIndex.java:652)
        ... 6 more
    
        Do you know why this may be the case?
        Another, perhaps unrelated, question: does version 3.14159 support BM25? That is the only reason I am attempting to try the newer release.
    
        I appreciate your help.
    
        Stephen Harding - 4 hours ago
    
        Version 3.14159 is quite an old version (despite the naming) and is different from newer versions.
    
        What sort of system are you running on? Can you compile the latest Galago from source (you'll need Maven and Java 8)? It's not a long or complicated build.
    
        Do any index part and buildManifest.json files appear in the defined index directory after you attempt to build the index? Can you manually remove any files that are present in the build directory?
    
        I'd suggest you delete the new build index directory and subdirectories as well as the various tupleflow* temp files in the configured galago tmp directory (default /tmp) and try again.
    
        I do notice the stack trace indicates a problem determining the Galago version information, which relies on Maven-generated version property values that can't be loaded, although no IOException seems to be generated. The binary version would not have the version.properties file present, but I don't believe failure to read it would cause the problem. The binary version works for me locally without it.
    
        However, if your build does generate a buildManifest.json file in the index directory, does it contain galagoVersion and galagoVersionBuildDateTime values in it?
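
To check this without opening the file by hand, something like the following could be used. It is a rough sketch that assumes buildManifest.json is a JSON object with galagoVersion and galagoVersionBuildDateTime as top-level keys; the helper name `manifest_version_info` is hypothetical.

```python
import json
from pathlib import Path

VERSION_KEYS = ("galagoVersion", "galagoVersionBuildDateTime")


def manifest_version_info(index_path: str) -> dict:
    """Return the version-related values from buildManifest.json, or None where absent."""
    manifest = Path(index_path) / "buildManifest.json"
    empty = {key: None for key in VERSION_KEYS}
    if not manifest.is_file():
        return empty
    try:
        data = json.loads(manifest.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        # An empty or truncated manifest is treated as "no version info".
        return empty
    if not isinstance(data, dict):
        return empty
    # Assumption: the two values are stored as top-level keys.
    return {key: data.get(key) for key in VERSION_KEYS}
```

A result with both values set to `None` means the manifest is missing, unreadable, or lacks the version fields.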
    
        If it does not, as an experiment, try creating a version.properties file and put it somewhere in your classpath, perhaps the bin/ or lib/ directory. The property file looks like this:
    
            version=3.13
            build.date=2018-01-30 18:54
    
        Put whatever build date you want and see if it makes a difference.
    
        It appears version 3.14159 does support BM25, but I'm not sure of its use in such an old Galago version. Current Galago uses the #bm25 query operator or simply defines BM25 as a Scorer retrieval parameter.
    

    The problem is not related to the version.properties file, as it resides in the core jar file in the binary lib/ directory.
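
If you want to confirm this on your own installation, a jar is just a zip archive, so its entries can be listed. The sketch below assumes the jars sit directly in a lib/ directory and that the entry is named version.properties; the function `jars_containing` is made up for illustration.

```python
import zipfile
from pathlib import Path


def jars_containing(lib_dir: str, entry_name: str = "version.properties") -> list:
    """List jar files in lib_dir that hold an entry whose file name is entry_name."""
    matches = []
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        with zipfile.ZipFile(jar) as archive:
            # Compare only the last path component, since the entry may sit in a subfolder.
            if any(name.rsplit("/", 1)[-1] == entry_name for name in archive.namelist()):
                matches.append(jar.name)
    return matches
```

An empty list would suggest the properties file is not packaged in any jar in that directory.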

    Current thinking is that perhaps you simply ran out of tmp space when building the index.

    So remove your index build directory (the one specified by the indexPath parameter) and also the tupleflow tmp directories for the build. These files will reside in the directories specified by tmpdir in your .galago.conf file if you have one, or typically /tmp if not otherwise specified.

     
  • milbat99

    milbat99 - 2018-02-27

    Stephen,

    I am using the binary, for which I assumed you simply follow steps 1-3 as defined in the wiki and do not have to build using Maven and Java 8 within IntelliJ. (I have both installed on my computer.)

    I am running macOS Sierra version 10.12.6. When I attempt to build, no tmp file is created.

    I attempted to add a version.properties file with the configuration you listed above (under /Users/myname/Desktop/galago-3.13-bin). I believe that galago-3.13-bin itself is correctly configured as I am able to view the added option after running galago help for version 3.13-bin as opposed to 3.14159.

    Here is my test_build.json located in /Users/myname
    {
    "fileType" : "trectext",
    "indexPath" : "/Users/myname/work/tmp/test.idx",
    "inputPath" : "/Users/myname/work/data/test.trectext",
    "stemmer" : ["porter"],
    "nonStemmedPostings" : true,
    "stemmedPostings" : true,
    "tokenizer" : {
    "fields" : ["docno", "text"]
    },
    "corpus" : true
    }

    My work directory is defined as follows:
    users
    └── myname
        └── work
            └── data
                └── test.trectext
    where test.trectext contains the exact info you have in your above example. There is no tmp directory created after the galago build test_build.json fails with "Couldn't delete buildManifest.json"

    Does anything from this seem incorrect to you?

     
  • Lemur Project

    Lemur Project - 2018-02-28

    I think the problem is temp space filling up, which is causing one or more of the build job stages to fail.

    Galago uses a lot of tmp space doing a series of map/reduce steps in generating its index parts. If there isn't enough space, the process fails.

    You can define where the tmp space is to be located either by specifying it in a .galago.conf file in your home directory, or within the build configuration JSON file. Otherwise, it uses a default which is typically /tmp.

    Since you had some parts of an index build underway, there MUST be tmp files somewhere. They are named tupleflowNNNN.. and will have subdirectories where various indexing job stage work was done.
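
To hunt for those directories, a small script along these lines may help. It is a sketch that only looks one level deep for names starting with "tupleflow"; the function `find_tupleflow_dirs` is hypothetical, and when no directory is given it falls back to Python's own temp directory, which may not be the one Galago uses.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def find_tupleflow_dirs(tmp_root: Optional[str] = None) -> List[str]:
    """Return entries directly under tmp_root whose names start with 'tupleflow'."""
    # Assumption: fall back to Python's temp dir when no tmp_root is given.
    root = Path(tmp_root) if tmp_root else Path(tempfile.gettempdir())
    return sorted(str(entry) for entry in root.glob("tupleflow*"))
```

Run it once against /tmp and once against any tmpdir named in your .galago.conf or build configuration.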

    Try the following. Again remove the build index directory. Remove the tupleflow tmp directory if you can find it. Then add the following parameters to your build config:

    ...
       "galagoJobDir" : "/some_tmp_dir/build",
       "tmpdir" : "/some_tmp_dir/tmp/",
       "deleteJobDir" : false,
    ...
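
If you would rather not edit the JSON by hand, the three parameters can be merged into an existing build config with a short script. This is only a sketch: the helper name `add_tmp_settings` is made up, and the directory values are whatever you pass in.

```python
import json
from pathlib import Path


def add_tmp_settings(config_path: str, job_dir: str, tmp_dir: str) -> dict:
    """Add galagoJobDir, tmpdir and deleteJobDir to a build config and save it."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config.update({
        "galagoJobDir": job_dir,
        "tmpdir": tmp_dir,
        # Keep the job directory after the build so it can be inspected.
        "deleteJobDir": False,
    })
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, `add_tmp_settings("test_build.json", "/some_tmp_dir/build", "/some_tmp_dir/tmp/")` rewrites the file in place with the extra keys.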
    

    Make sure the tmp and job dir definitions have sufficient space.

    Do the build.

    If it fails again, go to your defined tmp directory, confirm there is a tupleflow work directory present, and then check to see how much space is available there.
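
Checking the available space can be done with df, or with a few lines of Python as sketched here. The function name `free_space_mib` is just for illustration; pass it your configured tmp or job directory.

```python
import shutil


def free_space_mib(path: str) -> float:
    """Return the free space, in MiB, on the filesystem that holds path."""
    return shutil.disk_usage(path).free / (1024 * 1024)
```

A value close to zero for the tmp directory would support the theory that the build is running out of space.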

    I'm thinking the space will be full causing the build to fail.

    You'll need to manually remove the job directory and tmp directory work files since you've specifically asked in the config file to not automatically delete them.

     
  • milbat99

    milbat99 - 2018-02-28

    Stephen,

    Attached is a screenshot of the tupleflow directories present on my system; I am not sure which to delete. I added the extra configuration you mentioned to my test_build.json, so that it now looks like the following, where build and tmp are empty directories:
    {
    "fileType" : "trectext",
    "galagoJobDir" : "/Users/myname/build",
    "tmpdir" : "/Users/myname/tmp",
    "deleteJobDir" : false,
    "indexPath" : "/Users/myname/work/tmp/test.idx",
    "inputPath" : "/Users/myname/work/data/test.trectext",
    "stemmer" : ["porter"],
    "nonStemmedPostings" : true,
    "stemmedPostings" : true,
    "tokenizer" : {
    "fields" : ["docno", "text"]
    },
    "corpus" : true
    }

    There is nothing regarding tupleflow created in either the /Users/myname/build or /Users/myname/tmp directories.

     

  • Lemur Project

    Lemur Project - 2018-03-06

    Your screenshot is of the galago tupleflow module and isn't where any temporary files would be written.

    Does your configured job directory (build) contain any sub-directories? There should be a bunch of them mapping to the different stages of the index build; names like parsePostings-corpusKeys, parsePostings-fieldLengthData and a bunch more.

    If there is nothing in your configured galagoJobDir, and you are not deleting them automatically (deleteJobDir=false) then I'd have to return to thinking there is some sort of permissions issue that isn't allowing you to write to the disk at that location.

    I have indexed the example document, using the example configuration on OS X 10.11.6 with galago-3.13-bin without problem.

     
  • milbat99

    milbat99 - 2018-03-08

    Hi Stephen,

    I was able to solve the problem by simply creating an empty build_manifest.json within my indexPath directory.
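
For anyone who wants to script this workaround, a minimal sketch follows. The file name defaults to the one written in this post, while the error message itself mentions buildManifest.json, so the name is left as a parameter; `create_empty_manifest` is a made-up helper, not a Galago tool.

```python
from pathlib import Path


def create_empty_manifest(index_path: str, name: str = "build_manifest.json") -> Path:
    """Create the index directory if needed and put an empty manifest file in it."""
    index_dir = Path(index_path)
    index_dir.mkdir(parents=True, exist_ok=True)
    manifest = index_dir / name
    # touch() creates a zero-byte file and leaves an existing file untouched.
    manifest.touch(exist_ok=True)
    return manifest
```

Pass the same directory that indexPath points to in your build configuration, then rerun galago build.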

     
