Re: [Docutils-develop] rST and character width (was: [docutils:bugs] #305)

On Tue, Jan 10, 2017 at 3:59 AM, Guenter Milde <mi...@us...> wrote:
> On 2017-01-09, Edward d'Auvergne wrote:
>> On 5 January 2017 at 11:45, Guenter Milde <mi...@us...> wrote:
>
> ...
>
>>> * replace the current handling of combining characters with a version
>>>   that accounts for all zero-width characters.
>
>>> * clarify in the specs that "line length" or similar in definitions like
>
>>>     An underline/overline is a single repeated punctuation character that
>>>     begins in column 1 and forms a line extending at least as far as the
>>>     right edge of the title text.
>
>>>   are valid for monospace characters of unit width with some listed
>>>   exceptions.
>
>
>> I was wondering if you have heard about the wcwidth() and wcswidth()
>> implementations [1, 2]?
>
> Thank you for the pointer.
>
>> If this fast bisect algorithm is of interest,
>> the Python wcwidth package might need to be downgraded to the 10+ year
>> old 5.x Unicode standard used in Python 2.
>
> There are several issues when using the wcwidth module:
>
> +1 don't reinvent the wheel:
>    maintained implementation of a column-width determination function
>
> +1 stability: character tables are part of the module and do not depend
>    on the Python version.
>
>    The current implementation of wide-char correction depends on
>    unicodedata from the installed Python version.
>
> -2 external dependency
>
>    -1 updating this module may break rST documents
>
>
> In addition, the external module cannot resolve the ambiguity either:
>
> Example::
>
>   from wcwidth import wcswidth
>   text = u'wait ⌚ or ⌛'
>   print(text)
>   print('x' * len(text))       # one 'x' per code point
>   print('x' * wcswidth(text))  # one 'x' per column as counted by wcswidth
>
>
> For wcswidth, WATCH and HOURGLASS are 2 columns wide.
> In my text editor, WATCH and HOURGLASS are single-width characters (which
> also makes most sense to me).
> On some terminals, both characters are followed by a space to make them
> double width. In `geany`, the text panel uses single width and the terminal
> panel uses double width.
>
> The problem is generic:
>
>   No established formal standards exist at present on which Unicode
>   character shall occupy how many cell positions on character terminals.
>   -- Markus Kuhn http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
>
> IMO, Docutils should account for the display in "common" text editors using
> monospaced fonts. Speed is not a primary concern.
> Maybe using a local implementation is best.
>
>
> The documentation must make the remaining ambiguity clear and point to
> fail-safe ways of writing the source text:
>
>  * additional underline characters in section headings and simple tables
>
>  * avoid "critical" characters in grid tables (use substitutions if required).
>
>> Where is the width
>> algorithm implemented in docutils?
>
> docutils/docutils/statemachine.py:1450:  def pad_double_width(self, pad_char):
>
> Uses `unicodedata.east_asian_width`.
>
> @David:
>
> How about using a wcswidth()-like implementation rather than len() to
> determine text length for section headings and tables, replacing the
> padding with `double_width_pad_char`?

Sure, sounds fine to me.

> +1 works also for zero-width characters and combining characters
>    (solves https://sourceforge.net/p/docutils/bugs/128/)
>
> -1 API change

What exactly would the API change be?
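
For illustration, a local wcswidth()-like function along the lines discussed
above might look roughly like the sketch below. It is not taken from Docutils
or from the wcwidth package: the name `column_width` is hypothetical, Python 3
is assumed, and treating the categories Mn, Me and Cf as zero width is an
assumption about what "all zero-width characters" should cover. Like the
current code, it relies on the `unicodedata` tables of the installed Python,
so the ambiguity described above (e.g. WATCH and HOURGLASS) remains.

```python
import unicodedata


def column_width(text):
    """Estimate how many monospace columns `text` occupies.

    Hypothetical helper, not part of Docutils.
    """
    width = 0
    for char in text:
        # Combining marks and format characters take no column of their own
        # (assumption: categories Mn, Me and Cf are treated as zero width).
        if (unicodedata.combining(char)
                or unicodedata.category(char) in ('Mn', 'Me', 'Cf')):
            continue
        # East Asian Wide and Fullwidth characters take two columns.
        if unicodedata.east_asian_width(char) in ('W', 'F'):
            width += 2
        else:
            width += 1
    return width
```

A title underline check could then compare `column_width(title)` with the
length of the underline instead of using `len(title)`.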

David Goodger
<http://python.net/~goodger>