My corpus has thousands of files and directories with spaces in their names. Because of these spaces, the resultFile generated by retEval could not be read by GenerateQueryModel.
Please emit a warning when a space is detected in the program that generates this file (retEval in this case).
Ideally it would be Windows compatible and able to read paths and file names that contain spaces.
In the meantime I used http://www3.telus.net/pfrank/index.html to batch rename the paths so Lemur could handle them.
I am unable to replicate a failure when providing a filename with a space as the feedbackDocuments parameter to GenerateQueryModel. This is on Vista 64.
Please attach a parameter file and a result file that exercise the issue, so that I can replicate it.
Sorry, this was not clear: the problem is not in the parameter file. The problem is in processing the actual resultFile from retEval with GenerateQueryModel, and it occurs only when there is a space in the content of this file.
For example, here is a problem line from my resultFile:
Inspections Q0 D:\GMM OSDRU\Test Documentatie\Archief\STP OSDRU.doc 61 257.301 Exp
My corpus has thousands of paths with spaces in them, so in the meantime I am replacing the spaces in this corpus with underscores. But it would be nice to get a warning when retEval detects spaces in the path name.
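To illustrate why such a line cannot be read back, here is a minimal sketch. It assumes the reader splits each line on whitespace into the six TREC-style fields (query, Q0, docno, rank, score, run id). `splitOnWhitespace` is a hypothetical helper written for this illustration; it is not taken from the Lemur or Indri sources.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a result line on runs of whitespace,
// the way a simple TREC-style reader might. Not actual Lemur code.
std::vector<std::string> splitOnWhitespace(const std::string& line) {
    std::istringstream in(line);
    std::vector<std::string> fields;
    std::string field;
    while (in >> field) {  // operator>> skips whitespace between tokens
        fields.push_back(field);
    }
    return fields;
}
```

Applied to the line quoted above, this yields nine tokens instead of six, because the docno `D:\GMM OSDRU\Test Documentatie\Archief\STP OSDRU.doc` contains three spaces. A reader that expects the rank in the fourth position would then find `OSDRU\Test` there.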
I understand now. The problem has nothing to do with Windows. When individual files are indexed by Indri, the filename is used as the docno element if the file does not contain one. In doing so, Indri does not enforce that the docno element is free of spaces, so a path with spaces produces a docno with spaces.
The affected files are:
parsing/include/indri/DocumentIterator.hpp
parsing/src/MboxDocumentIterator.cpp
parsing/src/PDFDocumentExtractor.cpp
parsing/src/TextDocumentExtractor.cpp
parsing/src/win/PowerPointDocumentExtractor.cpp
parsing/src/win/WordDocumentExtractor.cpp
Changed to replace spaces with '_' when generating the docno metadata value from the filename (or subject, in the case of Mbox).
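For readers who want to apply the same idea to docnos they generate themselves, the replacement could look roughly like the sketch below. `sanitizeDocno` is a hypothetical name used only for illustration; this is not the code that was committed to the files listed above.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper, not the actual Indri change: return a copy of
// the docno with every space character replaced by an underscore.
std::string sanitizeDocno(std::string docno) {
    std::replace(docno.begin(), docno.end(), ' ', '_');
    return docno;
}
```

With this sketch, `D:\GMM OSDRU\Test Documentatie\Archief\STP OSDRU.doc` becomes `D:\GMM_OSDRU\Test_Documentatie\Archief\STP_OSDRU.doc`, which is a single whitespace-free token in a result line. It only handles the space character; tabs or other whitespace would need separate handling.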
Shipped in the 4.11 release.