Moving CWB input text files between Win and *nix can result in CRLF (0x0d, 0x0a) linebreaks being input: if this happens, the CR is encoded as part of the final p-attribute on each line. cwb-encode should be able to spot this and work round it (likewise, in the Win build, be able to cope with POSIX line-breaks; this may already work, but needs checking).
Suggestions for fixing it by Stefan:
- We could extend -B to remove all whitespace characters around tokens, not just blanks.
- We should probably change line #46 of cwb-encode.c to
\#define FIELDSEPS "\t\n\r"
These solutions need evaluating and one or both implementing for v 3.5.
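To illustrate the second suggestion, here is a minimal sketch of how the separator set decides whether a CR ends up in the last field. It is not the cwb-encode tokenizer: the helper `last_field` is hypothetical, and it uses `strtok`, which collapses empty fields and so may differ from what cwb-encode actually does.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of CWB): split `line` on any character
 * in `seps` and copy the last non-empty field into `out`. */
static void last_field(const char *line, const char *seps,
                       char *out, size_t out_size)
{
  char buf[256];
  char *tok;
  const char *last = "";

  /* strtok modifies its argument, so work on a local copy */
  strncpy(buf, line, sizeof buf - 1);
  buf[sizeof buf - 1] = '\0';

  for (tok = strtok(buf, seps); tok != NULL; tok = strtok(NULL, seps))
    last = tok;

  strncpy(out, last, out_size - 1);
  out[out_size - 1] = '\0';
}
```

For the input line "the\tDT\r\n", splitting on "\t\n" leaves "DT\r" as the last field, whereas splitting on "\t\n\r" leaves "DT".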
Some old Mac software might also produce files with CR-only linebreaks, but these probably can't be fixed.
New suggestion: when reading lines in cwb-encode (as well as cwb-s-encode and cwb-align-encode), strip a trailing CR from each line and, in UTF-8 mode only, also strip a BOM at the start of the line.
It would be nice to do this in a function cl_gets (which would also cut lines off after CL_MAX_LINE_LENGTH characters), so that other file input becomes more robust too. This would still require either the charset to be specified or a flag that determines whether a UTF-8 BOM may be removed at the start of a line.
For reference, the UTF-8 BOM is the byte sequence 0xEF 0xBB 0xBF.
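A rough sketch of what such a line reader might look like, under stated assumptions: the name `gets_sketch`, its signature and the `strip_bom` flag are all hypothetical, the buffer size is passed in by the caller rather than taken from CL_MAX_LINE_LENGTH, and the whole line terminator (LF or CRLF) is removed. It is not the CWB implementation.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch of a cl_gets-style reader; NOT the CWB implementation.
 * Reads one line into buf (at most size-1 bytes), discards the remainder of
 * an over-long line, removes a trailing LF or CRLF, and, if strip_bom is
 * non-zero, removes a UTF-8 BOM (0xEF 0xBB 0xBF) at the start of the line.
 * Returns buf, or NULL at end of file or on a read error. */
static char *gets_sketch(char *buf, int size, FILE *fp, int strip_bom)
{
  size_t len;
  int c;

  if (size < 2 || fgets(buf, size, fp) == NULL)
    return NULL;
  len = strlen(buf);

  if (len > 0 && buf[len - 1] == '\n') {
    buf[--len] = '\0';               /* drop the LF */
  }
  else {
    /* no LF seen: the line was cut off (or the file ends without LF),
     * so skip whatever is left of this line */
    while ((c = getc(fp)) != EOF && c != '\n')
      ;
  }

  if (len > 0 && buf[len - 1] == '\r')
    buf[--len] = '\0';               /* drop the CR of a CRLF pair */

  if (strip_bom && len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
    memmove(buf, buf + 3, len - 3 + 1);  /* shift left, including the NUL */

  return buf;
}
```

A caller in UTF-8 mode would pass a non-zero `strip_bom`; callers working with other charsets would pass 0 so that the bytes 0xEF 0xBB 0xBF are left alone.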
cwb-s-encode now handles lines with a \r (as of 3.4.12); the other changes in this FR are pending.
I think all these issues have been fixed now (in cwb-s-encode, cwb-encode, and more recently word lists read into CQP), with the possible exception of BOM markers. cwb-encode and word lists are tested by CWB/Perl; other features may need validation / final check before the FR is closed.