... and counts them wrong.
Worth trying to correct this in advance of major changes to the ****-output modules.
Nope, this _is_ the major overhaul of the CQP kwic formatting code that is so urgently needed. Currently, it also breaks some of the shell escapes for highlighting/colour, though I've tried hard to work around that.
Is there a bug tracker item for the ****-output overhaul? Should be set to high priority and merged with this one.
Nothing in the tracker, but we have this on the Unicode roadmap:
"* re-implement character context in kwic output (cat command), where the current implementation counts bytes instead of characters (and may thus break MBCs in addition to failing to align query matches)
* interactive pager (cat, count, etc.) should automatically be configured for UTF-8 or ISO-8859-X character set"
And later this, which is what I had in mind as the "major" overhaul:
"*Proper handling of fixed-character context in kwic output (cat) will require a major rewrite
- affects all kwic-formatting code in cqp/output.c, cqp/print-modes.c, ascii-print.c, html-print.c, latex-print.c, sgml-print.c, etc.
- this code is inefficient and seriously broken anyway (buffer overflow + segfault for large context sizes), so it should be re-implemented from scratch
- recommendation: drop HTML, Latex and SGML modes; just offer ASCII for interactive use and XML as a general-purpose format (which can easily be transformed to other formats using XSLT, Perl, etc.)"
My instinct was that we could perhaps fix the character-splitting before digging into things like the buffer overflow and getting rid of LaTeX, HTML, etc.!
Fixing this is not entirely straightforward. Kwic formatting works as follows:
The main problem is that the implementation counts bytes rather than characters.
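To illustrate the difference between the two counts, here is a minimal sketch. It assumes the corpus is encoded in UTF-8, and utf8_char_count() is a hypothetical helper, not a function from the CQP sources:

```c
#include <stddef.h>

/* Hypothetical helper (not part of CQP): count the characters in a
 * NUL-terminated UTF-8 string. Continuation bytes always have the bit
 * pattern 10xxxxxx, so every byte that does NOT match that pattern
 * starts a new character. Assumes the input is valid UTF-8. */
static size_t utf8_char_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the word "Straße" in UTF-8, strlen() reports 7 bytes while this helper reports 6 characters. A context width measured in bytes therefore comes out one column short for that token, and a cut that lands between the two bytes of "ß" produces a broken character.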
There seem to be two possible approaches to patch the code temporarily until we tackle a full re-implementation of kwic formatting:
1. Change the functions to count characters instead of bytes (which involves making all relevant functions aware of the current charset), taking special care with the final partial token.
2. Possibly collect more context than needed, then truncate to the required number of characters. For this to work, srev() has to reverse by character rather than by byte and should keep terminal escapes intact. The truncation would then count characters, not including terminal escapes.
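The truncation step of the second approach could look roughly like the sketch below. It makes two assumptions that are not confirmed anywhere in this thread: the text is valid UTF-8, and the terminal escapes are ANSI CSI sequences (ESC, then '[', then parameters, then a final byte in the range 0x40 to 0x7E). visible_prefix_bytes() is a hypothetical name, not an existing CQP function, and the sketch does not cover the character-wise srev().

```c
#include <stddef.h>

/* Hypothetical helper (not part of CQP): return the number of bytes of s
 * that cover at most max_chars visible characters. ANSI CSI escape
 * sequences are stepped over as a unit and never counted. */
static size_t visible_prefix_bytes(const char *s, size_t max_chars)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t i = 0, seen = 0;

    while (p[i] != '\0') {
        if (p[i] == 0x1B && p[i + 1] == '[') {
            /* Skip the whole escape sequence, including its final byte. */
            i += 2;
            while (p[i] != '\0' && !(p[i] >= 0x40 && p[i] <= 0x7E))
                i++;
            if (p[i] != '\0')
                i++;
            continue;
        }
        if (seen == max_chars)
            break;
        /* Consume one character: a lead byte plus its continuation bytes. */
        i++;
        while ((p[i] & 0xC0) == 0x80)
            i++;
        seen++;
    }
    return i;
}
```

Note that an escape sequence following the cut point (for example the one that switches highlighting off again) is dropped by this sketch, so the caller would have to emit a reset after the truncated string.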
In either case, it is probably safe to assume that character context is unreliable for non-ASCII print modes.
The breaking issue should now be fixed.
The counting issue is postponed until 3.9/4.0.