
#117 can't process paths with spaces

v4.x
closed
Indri (93)
7
2012-09-27
2009-12-02
Allard
No

My corpus has thousands of files and directories with spaces in their names. The resultFile generated by retEval could not be read by GenerateQueryModel because of these spaces.
Please emit a warning when a space is detected in the program that generates this file (retEval in this case).
The best solution would be to make it Windows compatible, so that paths and files with spaces in their names can be read.

In the meantime I used http://www3.telus.net/pfrank/index.html to batch-rename the paths so Lemur could handle them.

Discussion

  • David Fisher

    David Fisher - 2009-12-02

    I am unable to replicate a failure when providing a filename with a space as the feedbackDocuments parameter to GenerateQueryModel. This is on Vista 64.

    Please attach a parameter file and a result file that exercise the issue, so that I can replicate it.

     
  • Allard

    Allard - 2009-12-03

    Sorry, this was not clear: the problem is not in the parameter file. The problem is in processing the actual resultFile from retEval with GenerateQueryModel, and it occurs only when a line in that file contains a space.

    For example, a problematic line from my resultFile:
    Inspections Q0 D:\GMM OSDRU\Test Documentatie\Archief\STP OSDRU.doc 61 257.301 Exp

    My corpus has thousands of paths with spaces in them, so in the meantime I am replacing the spaces in the corpus with underscores. It would still be nice to get a warning when RetEval detects spaces in a pathname.
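To illustrate why such a line is a problem, here is a minimal sketch that assumes the result file uses the usual six-column TREC layout (query, Q0, docno, rank, score, run id) and that a reader splits each line on whitespace. The helper names `splitOnWhitespace` and `hasSixFields` are hypothetical and are not taken from the Lemur or Indri sources.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split a line on runs of whitespace, as a simple TREC-style reader might.
// (Hypothetical helper, not Lemur's actual parser.)
std::vector<std::string> splitOnWhitespace(const std::string& line) {
    std::istringstream in(line);
    std::vector<std::string> fields;
    std::string token;
    while (in >> token) {
        fields.push_back(token);
    }
    return fields;
}

// A well-formed line has exactly six fields:
// query, Q0, docno, rank, score, run id.
// A docno containing spaces produces extra fields, so a check like this
// could be used to emit the requested warning.
bool hasSixFields(const std::string& line) {
    return splitOnWhitespace(line).size() == 6;
}
```

Under these assumptions, the example line above splits into nine tokens instead of six, because the three spaces inside the path break the docno into four pieces.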

     
  • David Fisher

    David Fisher - 2009-12-04

    I understand now. The problem has nothing to do with Windows. When individual files are indexed by Indri, the filename is used as the docno element if the file does not contain one. In doing so, Indri does not enforce the requirement that the docno element contain no spaces.

     
  • David Fisher

    David Fisher - 2009-12-04

    The affected files are:
    parsing/include/indri/DocumentIterator.hpp
    parsing/src/MboxDocumentIterator.cpp
    parsing/src/PDFDocumentExtractor.cpp
    parsing/src/TextDocumentExtractor.cpp
    parsing/src/win/PowerPointDocumentExtractor.cpp
    parsing/src/win/WordDocumentExtractor.cpp

    Changed to replace spaces with '_' when generating the docno metadata value from the filename (or subject, in the case of Mbox).
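A minimal sketch of this kind of substitution, for illustration only: it is not the actual patch applied to the files listed above, and the function name `docnoFromFilename` is hypothetical.

```cpp
#include <algorithm>
#include <string>

// Derive a docno value from a filename by replacing every space with '_'.
// (Hypothetical helper; the real change lives in the Indri extractor and
// iterator sources named in the comment.)
std::string docnoFromFilename(std::string filename) {
    std::replace(filename.begin(), filename.end(), ' ', '_');
    return filename;
}
```

With this sketch, a path such as `D:\GMM OSDRU\Test Documentatie\Archief\STP OSDRU.doc` becomes `D:\GMM_OSDRU\Test_Documentatie\Archief\STP_OSDRU.doc`, which stays a single token when a result line is split on whitespace.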

     
  • David Fisher

    David Fisher - 2009-12-21

    Shipped in the 4.11 release.

     
