Re: [Docutils-develop] Recent commit activity

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Dear Adam, dear Docutils developers,

On 2022-11-06, Adam Turner wrote:

>> r9208 ... r9211:
...
> I saw these as fairly unobtrusive changes that didn't touch the core
> ``io`` classes -- thank you for noting the incompatibilities with input
> encoding, though I think this reveals some deficiencies in our test
> suite -- I run all my changes against the test suite on Python 3.7,
> 3.8, 3.9, 3.10, and 3.11 before committing.

Yes, the test suite is incomplete: it does not test all fields of the
"syntax - settings - components" matrix.

This is why non-breaking tests are no guarantee for backwards compatibility
and one reason why a review of all commits is so important.

I added some more tests for input encoding handling in included files in
r9220.

> -----

>>> For the "include" and "csv-table" directives, the changes definitely change
>>> the behaviour for the "input_encoding" default value!
...
>> Checked r9208 ... r9211 and reverted the incompatible changes in r9217.

> I should like to revert the reversion, though I shall look to see what
> needs to be done to address the custom input encoding logic.

Handling of the character encoding should be kept synchronous for the
main source and included files.

As long as the custom classes from `docutils.io` are used for the main
document, there is no much sense in replicating their behaviour for
included files.

If using the "with" statement is viewed as an important bonus, adding a
context manager to the custom io classes could be considered
(after the test suite refactoring is finished).

> -----

>>>> The naming of the `core.publish_string()` API function
...

>> IMV, the name `publish_string` uses the generic notation of a "string"
>> as "sequence of alphanumeric text or other characters, either as a
>> literal constant or as some kind of variable",
>> cf. https://en.wikipedia.org/wiki/String_(computer_science).

> I see your point with respect to "string", and it is an overloaded
> term. My primary argument is to do the "most obvious" (or "least
> surprising") thing given the domain context -- I firmly believe that in
> the modern Python 3 ecosystem, when advertising a method with "string"
> in the name we should return strings as defined by Python, which are
> sequences of Unicode code points.

I see your point. OTOH:

* `output-encoding`__ is a *general* setting defined as
  "The text encoding for output". 
  This raises the expectation that all Docutils output "has" the
  specified encoding and makes the "most obvious" return value a bit less
  obvious.

  __  https://docutils.sourceforge.io/docs/user/config.html#output-encoding 

* The behaviour of `publish_string()` has been stable for many years.
  Long-time users are familiar with it and expect it to remain stable.

>>> Regarding the "core.publish_string()" function, I see three possibilities:

>> Alternatives forward:

>>> 1. Just keep as is.

>>>    This is core API behaviour, so we need a very strong indication
>>>    that the advantages outweigh the hassle.

>>> 2. Add a new boolean argument: "encode". 
...

>>> 3. Deprecate¹ "publish_string()" in favour of new, separate
>>>    "publish_unicode_str()" and "publish_bytes()" functions.

>>>    ¹PendingDeprecationWarning now, DeprecationWarning in Docutils 1.0.

>>   4. Keep "publish_string()" as-is.
>>      New function "publish_str_instance()", say.

>>      The "output_encoding" setting will be used for encoding declarations
>>      and to determine required character replacements in the LaTeX writer
>>      but the return value is guaranteed to be a `str` instance.

>> My current favourites are 1 or 4.

> By (1) do you mean keep it as is as of [r9217], where returning bytes
> from "publish_string" is deprecated? 

> If so, I would vote this option.
> My rationale is as above, I think for the ordinary Python programmer,
> one would expect a "string" return type. Anecdotally, I found this
> behaviour confusing when I first used Docutils.

By 1), I mean keeping the behaviour in 0.18, without any changes (except
for a more clear documentation). This would be the simplest alternative
for experienced Docutils users.

If the documentation is clear about possible return values (and even more
after adding type hints) users should be able to live with the
unfortunate naming.

OTOH, I see a use-case for a convenience function returning a `str`
instance also in cases where an "intended encoding" of the output is
given in the "output-encoding" setting. This way, a program using this
function can export a HTML, XML or LaTeX with an encoding declaration
as `str` instance, post-process and finally encode it before handing it
to storage or a non-Python processor.

If we are going to change the core API functionality regarding the
convenience function(s) to publish the output as `str` or `bytes`
instance, then we should:

* do not start this in the middle of a major refactoring of the test suite
  (where it is hard to spot the changes in expected output from 
  "cosmetic" changes in the test code).

* do it in a "quasi static" manner: both, old and the new behaviour must
  be accessible over a sequence of two or more stable releases.

Variant 4) has the disadvantage that the "wart" of the
ambiguous/misleading name is not adressed.

Changing the behaviour of `publish_string()` breaks existing code and the
expectations of long-term users.

This brings us back to variant 3):

  Define 2 new functions with non-ambiguous names, e.g.,
  `publish_str_instance()` and `publish_bytes_instance()` that
  return the output as a `str` or `bytes` instance respectively.

  Document, that the "output-encoding" setting is used for 
  encoding declarations in HTML, XML, and LaTeX also with
  `publish_str_instance()`. It is up to the user to ensure the declared
  encoding and the actual encoding of output stored in a file or passed to
  a non-Python program match. 

  Support "output-encoding" value ``''`` (and for backwards compatibility
  'unicode') with `publish_str_instance()`: In this case there is no
  encoding declaration in XML, HTML, and LaTeX.

  Deprecate "publish_string()":
  PendingDeprecationWarning now, DeprecationWarning in Docutils 1.0.

...

>> I don't think we need a new custom "BytesOutput" class when in the long run
>> we will use standard io classes.

> By this, do you mean "BytesIO" et al? I think using this would be
> sensible, introducing the new class is (as I see it) a bridging
> mechanism, as dropping in "BytesIO" does not fit the expected API at
> the moment.

I see. The custom `docutils.io` classes are deeply embedded in the
overall structure of Docutils components and their API. 
Any move to change/replace them touches Docutils Design Specification
(PEP 258) and should be well justified and well-considered, i.e. nothing
for the near future.

This means that if we want to introduce an explicit `publish_bytes()`
convenience function, a corresponding `BytesOutput` class is appropriate.
Implementation details would need further discussion.
Again, I'd prefer not to do this in the middle of a major refactoring of
the test suite.

>>>>>> r9178
...
> I will look to add back the quicktest method, I think it might be
> valuable to keep the Python method too as it shows how to achieve the
> same result only using the Python API.

Maybe as a footnote or side-box, so it is less distracting from the main
course of the discussion.

>> r9207
>> There is an established format in the HISTORY file:
...

> Thank you for making me aware of this. What should be done for changes
> that are cross-cutting, e.g. the test changes where the same change is
> being made to many test files, though perhaps not in the same
> directory?

The test suite refactoring could be listed under the "general" changes or
under "test/*" (or, if you prefer, "test/**").

> To explain my original approach, I am more used to an approach which
> enumerates history by patch or changeset, so for example describing a
> set of changes and the rationale / effect, rather than enumerating by
> filename.

I did not invent the HISTORY file format, but I learned to accept and
value it as complimentary to the more detailled and chronological SVN log.

>> On 2022-11-03, Guenter Milde via Docutils-develop wrote:

>> New discovered bug with ``output_encoding == "unicode"``.

>> r9202 contains a "quick and dirty fix" for a hithero unrecognized problem
>> with the pseudo-encoding name "unicode":

>> This non-standard name is used in encoding declarations in XML and HTML and
>> in a package call in LaTeX.

>> The problem is, that we don't know which encoding an application calling
>> `publish_string()` will finally use for storing/transferring the output
>> outside the realms of Python.
>> Given this, an encoding declaration with hard-coded "utf-8" is worse than no
>> encoding declaration.

> OK, your fix seems reasonable. I would note though that LaTeX has used
> UTF-8 as the default encoding since 2018, so no "inputenc" and a
> declaration of UTF-8 encoding have exactly the same effect.

>> r9202
>> ... bug with ``output_encoding == "unicode"``.

>> This non-standard name is used in encoding declarations in XML and HTML and
>> in a package call in LaTeX.

>> The problem is, that we don't know which encoding an application calling
>> `publish_string()` will finally use for storing/transferring the output
>> outside the realms of Python.
>> Given this, an encoding declaration with hard-coded "utf-8" is worse than no
>> encoding declaration.

> OK, your fix seems reasonable. 

Committed in r9219, together with the planned dropping of the "inputenc"
call with UTF-8 encoded output.

> I would note though that LaTeX has used UTF-8 as the default encoding
> since 2018, so no "inputenc" and a declaration of UTF-8 encoding have
> exactly the same effect.

The difference is, that the case of encoding the output to something
other than utf-8 is easier to handle, e.g., with 
``setting_overrides[latex-preamble] = '\usepackage[latin1]{inputenc}\n'``.
https://docutils.sourceforge.io/docs/user/config.html#latex-preamble 

I will follow up with a fix dropping the encoding declaration for HTML
and XML output if "output_encoding" is "unicode".

> -----

>> In the unit tests, I suggest decoding the `bytes` returned by
>> `publish_string()` over the "unicode" pseudo-encoding, e.g.

>> diff --git a/docutils/test/DocutilsTestSupport.py b/docutils/test/DocutilsTestSupport.py

> This is to reflect where "publish_string" will end up (returning Python
> strings), but I don't see a large difference between the two approaches
> so happy to switch if you would prefer.

As discusses above, I don't think it is a good idea to change the
behaviour of `publish_string()` (but eventually replace it by 2 new
functions in the long run).

Comitted in r9218.

> Your documentation patch looks OK.

Commited as part of r9221

Günter