#312 Galago filetype not handled correctly

v3.x
closed
None
2
2019-09-12
2019-09-09
Jeff Dalton
No

When running locally, DocumentStreamParser/UniversalParser defaults to trecweb. I encountered this issue when processing raw Robust documents, gzip-compressed (.gz) and without a file extension. The "filetype" parameter was specified as "trectext", but it was ignored.

The issue is that the parameters object passed to DocumentStreamParser.create(split, parameters) is an empty Parameters instance when it should be populated from the higher level. The parameters in UniversalParser are also empty (not copied from the root parameters). As a result, the filetype parameter cannot be used or accessed correctly.

I've verified that the parameter is set correctly, but it is being ignored.
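
To make the suspected failure mode concrete, here is a minimal sketch of how a filetype lookup falls back to the default when it is handed an empty parameter set. It is not Galago's code: the class FiletypeResolutionSketch, the method names, the use of a plain Map in place of Galago's Parameters, and the lookup order (explicit parameter, then file extension, then trecweb) are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; names and lookup order are assumptions, not Galago's API.
public class FiletypeResolutionSketch {

    // Default named in the report above.
    static final String DEFAULT_FILETYPE = "trecweb";

    // Pick a parser type: explicit "filetype" parameter first,
    // then the file extension, then the default.
    static String resolveFiletype(String fileName, Map<String, String> parameters) {
        String explicit = parameters.get("filetype");
        if (explicit != null && !explicit.isEmpty()) {
            return explicit;
        }
        String extension = extensionOf(fileName);
        return extension.isEmpty() ? DEFAULT_FILETYPE : extension;
    }

    // Extension of the base name, ignoring a trailing ".gz"; "" if there is none.
    static String extensionOf(String fileName) {
        String base = fileName.substring(fileName.lastIndexOf('/') + 1);
        if (base.endsWith(".gz")) {
            base = base.substring(0, base.length() - 3);
        }
        int dot = base.lastIndexOf('.');
        return dot < 0 ? "" : base.substring(dot + 1);
    }

    public static void main(String[] args) {
        Map<String, String> rootParameters = new HashMap<>();
        rootParameters.put("filetype", "trectext");

        // Parser handed a fresh, empty parameter set: the setting is lost.
        System.out.println(resolveFiletype("robust04/fr94.gz", new HashMap<>())); // trecweb

        // Parser handed the root parameters: the setting is honored.
        System.out.println(resolveFiletype("robust04/fr94.gz", rootParameters)); // trectext
    }
}
```

In this sketch the fix is simply to pass the root parameters down rather than constructing a new, empty set at the parser level.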

Discussion

  • Michael Zarozinski

    This was caused because the initial run used the parameter --filetype=trecweb and specified a galagoJobDir. That run failed because the documents were not in trecweb format. Even with the filetype then set to trectext, Galago used the initial filetype saved in the galagoJobDir. Once the galagoJobDir folder was changed or removed, the new filetype was used and the index was built.

     

    Last edit: Michael Zarozinski 2019-09-12
  • Michael Zarozinski

    • status: open --> closed
    • assigned_to: Michael Zarozinski
    • Group: v5.x --> v3.x
     