#37 cqp: concordance output breaks utf8 characters

Milestone: TODO-4.0
Status: open
Priority: 9
Updated: 2014-06-15
Created: 2010-08-16
Private: No

... and counts them wrong.

Worth trying to correct this in advance of major changes to the ****-output modules.

Discussion

  • Stefan Evert

    Stefan Evert - 2010-08-17

    Nope, this _is_ the major overhaul of the CQP kwic formatting code that is so urgently needed. Currently, it also breaks some of the shell escapes for highlighting/colour, though I've tried hard to work around that.

    Is there a bug tracker item for the ****-output overhaul? Should be set to high priority and merged with this one.

     
  • Andrew Hardie

    Andrew Hardie - 2010-08-17

    Nothing in the tracker, but we have this on the Unicode roadmap:

    "* re-implement character context in kwic output (cat command), where the current implementation counts bytes instead of characters (and may thus break MBCs in addition to failing to align query matches)
    ...
    * interactive pager (cat, count, etc.) should automatically be configured for UTF-8 or ISO-8859-X character set"

    And later this, which is what I had in mind as the "major" overhaul:

    "* Proper handling of fixed-character context in kwic output (cat) will require a major rewrite
    - affects all kwic-formatting code in cqp/output.c, cqp/print-modes.c, ascii-print.c, html-print.c, latex-print.c, sgml-print.c, etc.
    - this code is inefficient and seriously broken anyway (buffer overflow + segfault for large context sizes), so it should be re-implemented from scratch
    - recommendation: drop HTML, Latex and SGML modes; just offer ASCII for interactive use and XML as a general-purpose format (which can easily be transformed to other formats using XSLT, Perl, etc.)"

    My instinct was that we could perhaps fix the character-splitting before digging into things like the buffer overflow and getting rid of LaTeX, HTML, etc.!

     
  • Andrew Hardie

    Andrew Hardie - 2011-07-31
    • priority: 8 --> 9
    • milestone: --> TODO-3.5
     
  • Stefan Evert

    Stefan Evert - 2013-05-10

    Fixing this is not entirely straightforward. Kwic formatting works as follows:

    • format left context as reverse string in buffer, moving from right to left (starting at match-1) and reversing every token after it has been inserted
    • keep track of how many printing characters have been inserted (should ignore e.g. terminal highlighting, but this doesn't work accurately)
    • stop after the requested number of characters have been printed, inserting only the last <n> characters of the last partial token
    • reverse entire string, which also repairs byte sequences (UTF-8, terminal escapes) that were broken by reversing individual tokens
    • print buffer
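
    The steps above can be sketched roughly as follows. This is only an illustration in Python (the actual kwic code is C), and the function name `left_context_bytes`, its arguments and the sample tokens are hypothetical. It counts bytes, as the current implementation is said to do, so the cut can land inside a multi-byte character.

```python
def left_context_bytes(tokens, width):
    """Build a left context of `width` BYTES from `tokens` (UTF-8 byte strings).

    Hypothetical illustration of the steps listed above, not the CQP code.
    """
    buf = bytearray()                       # context is assembled back to front
    for tok in reversed(tokens):            # start at match-1, move left
        remaining = width - len(buf)
        if remaining <= 0:                  # requested size reached
            break
        # every token except the one next to the match is followed by a space
        piece = tok if not buf else tok + b" "
        piece = piece[-remaining:]          # partial token: keep only its last bytes
        buf += piece[::-1]                  # insert the token reversed, byte by byte
    buf.reverse()                           # reversing everything repairs intact sequences
    return bytes(buf)


tokens = [t.encode("utf-8") for t in ["Die", "schöne", "Müllerin"]]
full = left_context_bytes(tokens, 100)      # b"Die sch\xc3\xb6ne M\xc3\xbcllerin" (valid UTF-8)
cut = left_context_bytes(tokens, 13)        # b"\xb6ne M\xc3\xbcllerin" (starts mid-character)
```

    With a width of 13 the cut falls between the two bytes of "ö", so the output is 11 complete characters plus one stray byte instead of the 13 characters requested.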

    The main problem is that the implementation counts bytes rather than characters.
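
    To make the byte/character difference concrete, here is a small sketch that assumes valid UTF-8 input. The helper names `utf8_length` and `utf8_suffix` are made up for this example. Both rely on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```python
def utf8_length(data: bytes) -> int:
    """Count characters in valid UTF-8 data by skipping continuation bytes."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_suffix(data: bytes, n: int) -> bytes:
    """Return the last `n` characters of valid UTF-8 data without splitting one."""
    i = len(data)
    while i > 0 and n > 0:
        i -= 1
        if (data[i] & 0xC0) != 0x80:    # reached the first byte of a character
            n -= 1
    return data[i:]


word = "schöne".encode("utf-8")
n_bytes = len(word)                     # 7 bytes
n_chars = utf8_length(word)             # 6 characters
tail = utf8_suffix(word, 3)             # the bytes of "öne", not a split "ö"
```

    A charset-aware version would need the equivalent of `utf8_length` for every supported encoding; for ISO-8859-X the byte count and the character count are identical.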

    There seem to be two possible approaches to patch the code temporarily until we tackle a full re-implementation of kwic formatting:

    1. Change functions to count characters instead of bytes (which involves making all relevant functions aware of the current charset), taking special care with the final partial token.

    2. Possibly collect more context than needed, then truncate to the required number of characters. For this to work, srev() has to reverse by character rather than byte and should keep terminal escapes intact. The truncation would then count characters, not including terminal escapes.

    In either case, it is probably safe to assume that character context is unreliable for non-ASCII print modes.
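
    As a rough illustration of approach 2, the sketch below splits a UTF-8 byte string into units (one character or one terminal escape), reverses by unit, and truncates by counting visible characters only. It is written in Python for brevity. The names `split_units`, `srev_units` and `truncate_left` are hypothetical, and the escape format (ANSI sequences of the form ESC [ ... letter) is an assumption, since the thread does not say which escapes CQP emits.

```python
import re

_UNIT = re.compile(
    rb"\x1b\[[0-9;]*[A-Za-z]"       # assumed escape format: ANSI CSI sequence
    rb"|[\x00-\x7f]"                # single ASCII character
    rb"|[\xc0-\xff][\x80-\xbf]*"    # UTF-8 lead byte with its continuation bytes
    rb"|[\x80-\xbf]"                # stray continuation byte (invalid input)
)


def split_units(data: bytes) -> list:
    """Split into units; joining the units gives back `data` unchanged."""
    return _UNIT.findall(data)


def is_escape(unit: bytes) -> bool:
    return len(unit) > 1 and unit[:1] == b"\x1b"


def srev_units(data: bytes) -> bytes:
    """Reverse by character, keeping each escape sequence intact."""
    return b"".join(reversed(split_units(data)))


def truncate_left(data: bytes, width: int) -> bytes:
    """Keep the last `width` visible characters; escapes do not count."""
    kept, visible = [], 0
    for unit in reversed(split_units(data)):
        if not is_escape(unit):
            if visible == width:
                break
            visible += 1
        kept.append(unit)
    return b"".join(reversed(kept))


line = "Die \x1b[1mschöne\x1b[0m Müllerin".encode("utf-8")
context = truncate_left(line, 15)   # "schöne Müllerin" with both escapes kept
```

    One limitation of this sketch: if the cut falls inside a highlighted region, the opening escape to the left of the cut is dropped, so a real implementation would have to re-emit it.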

     
  • Andrew Hardie

    Andrew Hardie - 2014-06-15
    • Group: TODO-3.5 --> TODO-4.0
     
  • Andrew Hardie

    Andrew Hardie - 2014-06-15

    The breaking issue should now be fixed.

    The counting issue is postponed for 3.9/4.0.

     
