#47 Make cwb-encode handle non-POSIX (win32) linebreaks

TODO-3.5
pending
CWB engine (4)
5
2021-03-26
2012-11-08
No

Moving CWB input text files between Windows and *nix systems can result in CRLF (0x0d, 0x0a) linebreaks in the input. When this happens, the CR is encoded as part of the final p-attribute on each line. cwb-encode should be able to spot this and work round it (likewise, the Windows build should be able to cope with POSIX linebreaks; this may already work, but needs checking).

Suggestions for fixing it by Stefan:

- We could extend -B to remove all whitespace characters around tokens, not just blanks.

- We should probably change line #46 of cwb-encode.c to

#define FIELDSEPS  "\t\n\r"

These solutions need to be evaluated, and one or both implemented, for v3.5.
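
To illustrate the two suggestions, here is a minimal sketch rather than actual cwb-encode code. It assumes the separator set is handed to a strtok-style tokenizer; the tokenizer in cwb-encode.c may work differently (note that strtok collapses empty fields). The helper names split_fields and trim_whitespace are hypothetical.

```c
#include <ctype.h>
#include <string.h>

/* Separator set suggested in the ticket: TAB, LF and CR. */
#define FIELDSEPS "\t\n\r"

/* Hypothetical helper (not CWB code): split `line` in place on any
 * character in `seps` and store up to `max` field pointers in `fields`.
 * Returns the number of fields found. Uses strtok, so runs of
 * separators are collapsed and empty fields are skipped. */
static int split_fields(char *line, const char *seps, char **fields, int max)
{
    int n = 0;
    char *tok = strtok(line, seps);
    while (tok != NULL && n < max) {
        fields[n++] = tok;
        tok = strtok(NULL, seps);
    }
    return n;
}

/* Hypothetical helper for the extended -B idea: remove every leading and
 * trailing whitespace character (blank, TAB, CR, LF, ...) from `s` in
 * place, not just blanks. Returns a pointer to the trimmed string. */
static char *trim_whitespace(char *s)
{
    char *end;
    while (*s != '\0' && isspace((unsigned char)*s))
        s++;
    end = s + strlen(s);
    while (end > s && isspace((unsigned char)end[-1]))
        end--;
    *end = '\0';
    return s;
}
```

With the input line "the\tDT\r\n", splitting on "\t\n" leaves the CR attached to the last field ("DT\r"), whereas splitting on FIELDSEPS as defined above yields "DT". Trimming each token with trim_whitespace would remove the stray CR as well, regardless of the separator set.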

Discussion

  • Stephanie Evert - 2017-07-01

    Some old Mac software might also produce files with CR-only linebreaks, but these probably can't be fixed.

     
  • Stephanie Evert - 2017-07-01

    New suggestion: when reading lines in cwb-encode (as well as cwb-s-encode and cwb-align-encode), strip a trailing CR, and also strip a BOM at the start of the line (the latter only in utf8 mode).

    It would be nice to do this in a function cl_gets (which also cuts off after CL_MAX_LINE_LENGTH characters) so other file input becomes more robust, too. Would still require specification of charset or a flag that determines whether utf8 BOM may be removed at start of line.

     
  • Stephanie Evert - 2017-07-01

    For reference, the UTF-8 BOM is the byte sequence 0xEF 0xBB 0xBF.
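
    The line-reading behaviour proposed in the comments above could be sketched roughly as follows. This is an illustration under stated assumptions, not the CWB implementation: read_clean_line is a hypothetical name standing in for the proposed cl_gets, the buffer size is chosen by the caller (CWB would presumably use CL_MAX_LINE_LENGTH), and the utf8 argument plays the role of the flag that permits BOM removal.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the proposed cl_gets (not CWB code).
 * Reads one line from `fp` into `buf`, keeping at most size-1 bytes and
 * discarding the rest of an over-long line. The trailing LF or CRLF is
 * stripped. If `utf8` is non-zero, a UTF-8 BOM (0xEF 0xBB 0xBF) at the
 * start of the line is removed. Returns `buf`, or NULL at EOF or error.
 * Assumes `size` fits in an int, as required by fgets. */
static char *read_clean_line(char *buf, size_t size, FILE *fp, int utf8)
{
    size_t len;

    if (size < 2 || fgets(buf, (int)size, fp) == NULL)
        return NULL;

    len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[--len] = '\0';              /* strip LF */
    } else {
        /* No LF seen: the line was cut off or the file ends without a
         * linebreak. Skip whatever remains of this line. */
        int c;
        while ((c = fgetc(fp)) != EOF && c != '\n')
            ;
    }

    if (len > 0 && buf[len - 1] == '\r')
        buf[--len] = '\0';              /* strip CR left over from CRLF */

    if (utf8 && len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
        memmove(buf, buf + 3, len - 3 + 1);   /* drop BOM, keep the NUL */

    return buf;
}
```

    As noted in the discussion, CR-only linebreaks are not handled by this approach, since fgets only recognises LF as the end of a line.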

     
  • Andrew Hardie - 2017-07-03

    cwb-s-encode now handles lines with a \r (as of 3.4.12); the other changes in this FR are pending.

     
  • Stephanie Evert - 2021-03-26

    I think all these issues have been fixed now (in cwb-s-encode, cwb-encode, and more recently word lists read into CQP), with the possible exception of BOM markers. cwb-encode and word lists are tested by CWB/Perl; other features may need validation / final check before the FR is closed.

     
  • Stephanie Evert - 2021-03-26
    • status: open --> pending
     
