LXR Cross Referencer / Bugs / #254 Source w/VCS: annotation truncated on byte length instead of characters

#254 Source w/VCS: annotation truncated on byte length instead of characters

Milestone: v2.0

Status: closed-fixed

Owner: Andre-Littoz

Labels: None

Priority: 5

Updated: 2016-01-21

Created: 2014-07-01

Creator: Andre-Littoz

Private: No

When annotations (revision id and author name) are returned in UTF-8, as Git does, the width must be measured in characters, not in bytes.

In UTF-8, a single character could originally be encoded using up to 6 bytes. RFC 3629 later restricted the encoding to a maximum of 4 bytes, but this does not change the following problem.

The width allotted to annotations by script source is rather narrow, to leave maximum screen space for the source line. This means the annotations must be truncated to fit in their column. The present algorithm, reflecting the limited features of older VCSs like CVS, simply counts bytes and chops.

This is incorrect with UTF-8 because characters are potentially composed of several bytes and blindly chopping might occur in the middle of a byte sequence for a character, resulting in an invalid UTF-8 run.
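
The failure mode can be reproduced in a few lines. The sketch below is in Python purely for illustration (LXR itself is Perl); the function names and the sample author name are hypothetical, and it only contrasts byte-oriented chopping with character-oriented chopping.

```python
def chop_bytes(raw: bytes, width: int) -> bytes:
    """Byte-oriented truncation, similar in spirit to 'count bytes and chop'."""
    return raw[:width]


def chop_chars(raw: bytes, width: int, encoding: str = "utf-8") -> bytes:
    """Decode first, truncate on character boundaries, then re-encode."""
    return raw.decode(encoding)[:width].encode(encoding)


# Hypothetical author name; U+00E9 (e with acute) takes two bytes in UTF-8.
author = "Ren\u00e9e Dupont".encode("utf-8")

broken = chop_bytes(author, 4)  # cuts between the two bytes of U+00E9
intact = chop_chars(author, 4)  # keeps the accented character whole

try:
    broken.decode("utf-8")
    broken_is_valid = True
except UnicodeDecodeError:
    broken_is_valid = False  # the chopped run is no longer valid UTF-8
```

With a column width of 4, the byte-oriented version ends on a dangling lead byte, while the character-oriented version returns a valid four-character string.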

A replacement is needed for function length(), taking into account the value of parameter 'encoding'. Unicode-related pragmas are not deemed appropriate since they would create a dependency on Perl 5.12, and the output stream is not always Unicode.

Discussion

Andre-Littoz - 2014-10-26

An experimental fix is implemented in 2.0.3 for the author's name. When the encoding is UTF-8, a lexical scope is opened to switch to UTF-8 with "use utf8;" before calling length(). The revision id is not processed because it is usually a numeric string, possibly with ASCII punctuation (dot-separated numbers, SHA id, ...). Subversion may allow more arbitrary ids and cause UTF-8 sanity problems.

Other uses of length() seem to be safe since they only test for characters or strings made of ASCII characters, except in diff. diff truncates source lines so that they fit in the screen panes. This deserves specific processing, but it probably requires a redesign of sub htmljust, which splits the source line into "tokens" (HTML element, HTML entity reference and plain text). Plain text tokens should be measured and truncated in a UTF-8-compatible way.
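
A token-based truncation along these lines could look like the sketch below. It is written in Python for illustration only and is not the actual htmljust code; the regular expression, the function name truncate_html and the rule "tags are always kept so the markup stays balanced" are all assumptions. Text is handled as decoded characters, so a cut can never fall inside a multi-byte sequence.

```python
import re

# Hypothetical tokenizer: HTML element, entity reference, plain text,
# or a stray '<' / '&' that belongs to neither.
TOKEN = re.compile(r"<[^>]*>|&#?\w+;|[^<&]+|[<&]")


def truncate_html(line: str, width: int) -> str:
    """Keep at most `width` visible characters; tags never count."""
    out = []
    remaining = width
    for token in TOKEN.findall(line):
        if token.startswith("<") and token.endswith(">"):
            out.append(token)            # markup has zero display width
        elif token.startswith("&") and token.endswith(";"):
            if remaining > 0:            # an entity displays as one character
                out.append(token)
                remaining -= 1
        else:
            piece = token[:remaining]    # plain text, cut per character
            out.append(piece)
            remaining -= len(piece)
    return "".join(out)


sample = "<b>caf&eacute;</b> <i>na\u00efve</i>"
shortened = truncate_html(sample, 6)
```

Keeping every tag after the budget is exhausted is one possible design choice: it leaves empty elements behind but avoids unbalanced markup in the pane.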

Andre-Littoz - 2014-12-04

The experimental fix is not satisfactory because I misunderstood the semantics of the "use utf8" pragma. As implemented, sequences are truncated on a code point boundary, which does not take into account the effective width of code points. Some of them are combining diacritics, which do not advance the "screen cursor". Consequently, truncation is too severe and misaligns the current line, since more non-zero-width characters could have been kept.

A better implementation would use \X in a pattern to match a grapheme.
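
The difference between counting code points and counting displayed characters can be sketched as follows. This is Python for illustration, not the Perl \X approach itself: it approximates a grapheme as a base character followed by its combining marks, using unicodedata.combining, which is simpler than full grapheme cluster segmentation. The function name truncate_columns and the sample name are hypothetical.

```python
import unicodedata


def truncate_columns(text: str, width: int) -> str:
    """Keep at most `width` base characters plus their combining marks."""
    if width <= 0:
        return ""
    kept = []
    columns = 0
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A base character starts a new cluster and advances the cursor.
            if columns == width:
                break
            columns += 1
        # Combining marks attach to the previous base character: zero width.
        kept.append(ch)
    return "".join(kept)


# 'e' followed by U+0301 COMBINING ACUTE ACCENT occupies a single column.
name = "Rene\u0301e Dupont"

by_codepoint = name[:5]                  # 5 code points, only 4 columns wide
by_column = truncate_columns(name, 5)    # 6 code points, 5 columns wide
```

Truncating at five code points yields a string that is one column short, which is the misalignment described above; counting only base characters fills the column.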

Andre-Littoz - 2014-12-04

status: pending --> open

Andre-Littoz - 2015-12-31

New fix based on pack/unpack for the author name, but not forwarded to revisions (therefore, byte truncation -- instead of character truncation -- may cause misalignment in svn or Hg, where revision ids are arbitrary Unicode strings).
Patched in Git repo.

Andre-Littoz - 2016-01-21

status: open --> closed-fixed