#75 CQP can save a named query results file larger than 2 GiB but cannot read it

Milestone: TODO-3.5
Status: closed-fixed
Owner: nobody
Priority: 5
Updated: 2022-03-08
Created: 2022-03-07
Creator: Jyrki Niemi
Private: No

CQP can write very large query results to a named query results file, but it cannot read back a saved file whose size is between 2 and 4 GiB (or in certain other, larger ranges) on a system with a 32-bit int, such as 64-bit Linux. The same is true for the cwb-decode-nqrfile command-line utility. In practice, this happens once the result reaches roughly 280 million hits.

A sample (anonymized) CQP session illustrating the problem:

$ cqp -e
[no corpus]> set DataDirectory "/corpusdata/nqr";
[no corpus]> set AutoShow off;
[no corpus]> BIGCORPUS;
BIGCORPUS> info;
Size:    283986287
[]
BIGCORPUS> [];
283986287 matches. Use 'cat' to show.
BIGCORPUS> alltokens = Last;
BIGCORPUS> save alltokens;
BIGCORPUS> exit;
Done. Share and enjoy!
$ cqp -e
[no corpus]> set DataDirectory "/corpusdata/nqr";
[no corpus]> set AutoShow off;
[no corpus]> BIGCORPUS;
BIGCORPUS> Last = alltokens;
ERROR: File length of subcorpus is <= 0
CQP Error:
        Corpus ``alltokens'' is undefined
BIGCORPUS> exit;
Done. Share and enjoy!
$ ls -lh /corpusdata/nqr/BIGCORPUS:alltokens
-rw-rw-r-- 1 user user 2.2G 2022-03-07 11:17 /corpusdata/nqr/BIGCORPUS:alltokens
$ cwb-decode-nqrfile /corpusdata/nqr/BIGCORPUS:alltokens
ERROR: File length of subcorpus is <= 0: No such file or directory
$

As far as I can see, this bug or limitation is due to implicitly converting the 64-bit off_t file size in a struct stat to a signed 32-bit int, so that values between 2^31 and 2^32 − 1 (2 to 4 GiB) become negative, and the code treats a file size of 0 or less as an error. For cqp, this happens in function attach_subcorpus in cqp/corpmanag.c: function file_length in cl/fileutils.c returns an off_t, but the result is assigned to len, which is an int. For cwb-decode-nqrfile, the same happens in function file_length_from_handle in utils/cwb-decode-nqrfile.c, which returns the size as an int.

I have tested this with CWB 3.4.27 (on 64-bit Linux), but the relevant code doesn’t seem to have changed since then.

Such large results are probably rarely intended; they would typically arise from matching every token in a corpus of 280 million words or more. Nevertheless, I think CQP should be able to read a file it has itself written, or at least the limitation should be documented. The current error messages are also somewhat misleading and uninformative.

Discussion

  • Stephanie Evert - 2022-03-07

    Thanks for the report and thorough diagnosis. Should be fixed in r1708.

  • Stephanie Evert - 2022-03-07
    • status: open --> closed-fixed

  • Jyrki Niemi - 2022-03-08

    Thank you for fixing the issue so quickly! I tested r1708, and reading NQR files larger than 2 GiB worked fine.
