IMS Open Corpus Workbench / Feature Requests

#47 Make cwb-encode handle non-POSIX (win32) linebreaks

Milestone: TODO-3.5

Status: pending

Owner: Andrew Hardie

Labels: CWB engine (4)

Priority: 5

Updated: 2021-03-26

Created: 2012-11-08

Creator: Andrew Hardie

Private: No

Moving CWB input text files between Windows and *nix systems can leave CRLF (0x0d, 0x0a) linebreaks in the input. When this happens, the CR is encoded as part of the final p-attribute on each line. cwb-encode should be able to spot this and work round it (likewise, the Windows build should be able to cope with POSIX linebreaks; this may already work, but needs checking).

Suggestions for fixing it by Stefan:

- We could extend -B to remove all whitespace characters around tokens, not just blanks.

- We should probably change line #46 of cwb-encode.c to

#define FIELDSEPS  "\t\n\r"
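
To illustrate what adding `\r` to the separator set would do, here is a minimal sketch of a field splitter driven by such a `FIELDSEPS` string. The helper `split_fields` is hypothetical and written only for this example; how cwb-encode actually tokenises its input lines may differ. Note one side effect of treating CR as a separator: a stray CR in the middle of a line would also end the line early.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Separator set as proposed in the ticket: TAB, LF and CR all end a field. */
#define FIELDSEPS "\t\n\r"

/* Hypothetical helper (not CWB code): split `line` in place into at most
 * `max_fields` fields and return how many were stored in `fields`.
 * Unlike strtok(), an empty field between two TABs is kept as "". */
static int
split_fields(char *line, char **fields, int max_fields)
{
  int n = 0;
  char *p = line;

  assert(line != NULL && fields != NULL);
  while (n < max_fields) {
    size_t len = strcspn(p, FIELDSEPS); /* length of the current field */
    char sep = p[len];                  /* separator that ended it (or NUL) */

    fields[n++] = p;
    p[len] = '\0';
    if (sep != '\t')                    /* CR, LF or end of string: line is done */
      break;
    p += len + 1;                       /* continue after the TAB */
  }
  return n;
}
```

With this separator set, the input line `"dog\tNN\tdog\r\n"` yields three fields and the last one is `dog` rather than `dog` followed by a CR.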

These solutions need evaluating and one or both implementing for v 3.5.
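
For the first suggestion (stripping all whitespace around tokens, not just blanks), a sketch of the trimming step could look like the following. The function `trim_whitespace` is a hypothetical name used for illustration and is not taken from the CWB source; it relies on `isspace()` in the default "C" locale, which covers space, TAB, LF, VT, FF and CR.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper (not CWB code): remove leading and trailing
 * whitespace from `s` in place and return `s`. */
static char *
trim_whitespace(char *s)
{
  char *start = s;
  size_t len;

  assert(s != NULL);
  /* cast to unsigned char so bytes >= 0x80 (e.g. UTF-8) are passed safely */
  while (*start && isspace((unsigned char) *start))
    start++;
  len = strlen(start);
  while (len > 0 && isspace((unsigned char) start[len - 1]))
    len--;
  memmove(s, start, len);   /* shift the trimmed text to the front */
  s[len] = '\0';
  return s;
}
```

Applied to each annotation, this would turn a final field such as `"NN\r"` into `"NN"`, which covers the CRLF case without changing the separator set.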

Discussion

Stephanie Evert - 2017-07-01

Some old Mac software might also produce files with CR-only linebreaks, but these probably can't be fixed.

Stephanie Evert - 2017-07-01

New suggestion: when reading lines in cwb-encode (as well as cwb-s-encode and cwb-align-encode), strip trailing CR as well as BOM at start of line (only if in utf8 mode).

It would be nice to do this in a function cl_gets (which also cuts off after CL_MAX_LINE_LENGTH characters) so other file input becomes more robust, too. Would still require specification of charset or a flag that determines whether utf8 BOM may be removed at start of line.
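
A rough sketch of what such a line reader might look like is given below. It is only an illustration under several assumptions: the name `cl_gets_sketch`, its signature, the `strip_bom` flag, and the value given to `CL_MAX_LINE_LENGTH` are all made up for this example and are not the CL library's actual API. The sketch also removes the LF itself and silently discards the rest of an overlong line, which is one possible design choice among several.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define CL_MAX_LINE_LENGTH 65536  /* assumed value, for illustration only */

/* Hypothetical sketch (not the real CL function): read one line into `buf`
 * (capacity `size`), drop the line terminator (LF or CRLF) and, if
 * `strip_bom` is non-zero, a UTF-8 BOM at the start of the line.
 * Returns `buf`, or NULL at end of file. */
static char *
cl_gets_sketch(char *buf, int size, FILE *fh, int strip_bom)
{
  size_t len;

  assert(buf != NULL && fh != NULL && size > 1);
  if (fgets(buf, size, fh) == NULL)
    return NULL;

  len = strlen(buf);
  if (len > 0 && buf[len - 1] == '\n') {
    buf[--len] = '\0';                 /* complete line: drop the LF */
  }
  else {
    int c;                             /* overlong line or missing final LF: */
    while ((c = fgetc(fh)) != EOF && c != '\n')
      ;                                /* skip whatever is left of the line */
  }
  if (len > 0 && buf[len - 1] == '\r')
    buf[--len] = '\0';                 /* drop the CR of a CRLF pair */

  /* the caller sets strip_bom only when the input is known to be UTF-8 */
  if (strip_bom && len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
    memmove(buf, buf + 3, len - 3 + 1); /* +1 moves the terminating NUL too */

  return buf;
}
```

A caller in utf8 mode would pass `strip_bom = 1`, while a caller reading Latin-1 input would pass `0`, since the bytes 0xEF 0xBB 0xBF are ordinary characters there.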

Stephanie Evert - 2017-07-01

For reference, the UTF-8 BOM is the byte sequence 0xEF 0xBB 0xBF.

Andrew Hardie - 2017-07-03

cwb-s-encode now handles lines with a \r (as of 3.4.12); the other changes in this FR are pending.

Stephanie Evert - 2021-03-26

I think all these issues have been fixed now (in cwb-s-encode, cwb-encode, and more recently word lists read into CQP), with the possible exception of BOM markers. cwb-encode and word lists are tested by CWB/Perl; other features may need validation / final check before the FR is closed.

Stephanie Evert - 2021-03-26

status: open --> pending