Re: [Docutils-develop] Recent commit activity

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Dear Günter,

I am replying to all three notes in the below, apologies for the rather long note that follows.

-----

>>> (In the long run, I'd like to replace the "homegrown" docutils.io classes
>>> with standard objects, but this is a separate topic.)

> r9208 ... r9211:
> "In the long run" is not next day! Please ask before such changes.

I saw these as fairly unobtrusive changes that didn't touch the core ``io`` classes -- thank you for noting the incompatibilities with input encoding, though I think this reveals some deficiencies in our test suite -- I run all my changes against the test suite on Python 3.7, 3.8, 3.9, 3.10, and 3.11 before committing.

-----

>> Replacing the "FileInput/StringInput/FileOutput" classes can currently
>> only be done in cases, where we are sure not to introduce backwards
>> incompatibilities.

>> For the "include" and "csv-table" directives, the changes definitely change
>> the behaviour for the "input_encoding" default value!

>> For the other changes, I am not sure.

> Checked r9208 ... r9211 and reverted the incompatible changes in r9217.

I should like to revert the reversion, though I shall look to see what needs to be done to address the custom input encoding logic.

-----

>>> The naming of the `core.publish_string()` API function pre-dates Python3
>>> and uses the term "string" as a superordinate for "Unicode strings"
>>> (`str`/`unicode`) and "encoded strings" (`bytes`/`str`).
>>> The documentation is explicit about the return value
>>> (defaulting to 'utf-8' encoded `bytes` instances).

>>> The function is part of the "first level API": intended for use by
>>> 3rd-party code. Changing the return value has the potential of many
>>> breaks. 
>>> Please revert the changes to core.py and io.py.

>> I have reverted the change (thank you) but introduced a deprecation for
>> returning non-(unicode) strings from "publish_string". I feel this is a
>> reasonable step towards having a more intuitive API.

> IMV, the name `publish_string` uses the generic notation of a "string"
> as "sequence of alphanumeric text or other characters, either as a
> literal constant or as some kind of variable",
> cf. https://en.wikipedia.org/wiki/String_(computer_science).

I see your point with respect to "string", and it is an overloaded term. My primary argument is to do the "most obvious" (or "least surprising") thing given the domain context -- I firmly believe that in the modern Python 3 ecosystem, when advertising a method with "string" in the name we should return strings as defined by Python, which are sequences of Unicode code points.

-----

>> Naming of "publish_bytes" and "BytesOutput" I am not that set on, we
>> can change to "BinaryXXX" if preferred.

> I don't think we need a new custom "BytesOutput" class when in the long run
> we will use standard io classes.

By this, do you mean "BytesIO" et al? I think using this would be sensible, introducing the new class is (as I see it) a bridging mechanism, as dropping in "BytesIO" does not fit the expected API at the moment.

-----

>> Regarding the "core.publish_string()" function, I see three possibilities:

> Alternatives forward:

>> 1. Just keep as is.

>>    This is core API behaviour, so we need a very strong indication
>>    that the advantages outweigh the hassle.

>> 2. Add a new boolean argument: "encode". 

>>    With ``publish_string(encode=False)``, the "output_encoding" setting
>>    would be ignored and the return value is a `bytes` object.

>>    With ``publish_string(encode=True)``, the behaviour is as currently,
>>    encoding with "output_encoding" or returning `bytes` or `str` object
>>    (the latter for ``output_encoding == "unicode"``).

>>    The default value could change from True to False in Docutils 2.0.

>> 3. Deprecate¹ "publish_string()" in favour of new, separate
>>    "publish_unicode_str()" and "publish_bytes()" functions.

>>    ¹PendingDeprecationWarning now, DeprecationWarning in Docutils 1.0.

>   4. Keep "publish_string()" as-is.
>      New function "publish_str_instance()", say.

>      The "output_encoding" setting will be used for encoding declarations
>      and to determine required character replacements in the LaTeX writer
>      but the return value is guaranteed to be a `str` instance.
>     
>     
> My current favourites are 1 or 4.

By (1) do you mean keep it as is as of [r9217], where returning bytes from "publish_string" is deprecated? If so, I would vote this option. My rationale is as above, I think for the ordinary Python programmer, one would expect a "string" return type. Anecdotally, I found this behaviour confusing when I first used Docutils.

-----

>>>>> r9178
>>> 8 lines of doctest code seems overly verbose in the context of an
>>> "overview of the Docutils architecture".

>> I tend to agree, "core.publish_string" would be nicer, but there is no
>> way in Docutils to disable applying the default transforms.

>>> It can still be useful for development: testing the reST parser (including
>>> the option to generate input/output pairs for parser test scripts).
>>> I'd move it to tools/dev and remove the EasyDialogs part.

>> Done in [r9201]

> So now the "hacking" guide can again refer to "quicktest.py".

> BTW: the fact that it "has not been substantively modified since 2006" is
>      no argument for the move (the `cat` shell command, say, has not been
>      substantially modified for ages, too but moving it to a developers only
>      path would be bad.)

>      Rather, "quicktest.py" has never been installed into the "PATH"
>      in setup.py and is a tool for developers rather than end-users.

I will look to add back the quicktest method, I think it might be valuable to keep the Python method too as it shows how to achieve the same result only using the Python API.

-----

> r9207

> There is an established format in the HISTORY file:

> General changes, followed by changes listed in alphabetical order of
> affected file (or directory for related changes in one directory)

> Name patch authors only for "external" patches (where the name differs
> from the committer in the SVN log).

Thank you for making me aware of this. What should be done for changes that are cross-cutting, e.g. the test changes where the same change is being made to many test files, though perhaps not in the same directory?

To explain my original approach, I am more used to an approach which enumerates history by patch or changeset, so for example describing a set of changes and the rationale / effect, rather than enumerating by filename.

-----

> On 2022-11-03, Guenter Milde via Docutils-develop wrote:

> New discovered bug with ``output_encoding == "unicode"``.

> r9202 contains a "quick and dirty fix" for a hithero unrecognized problem
> with the pseudo-encoding name "unicode":

> This non-standard name is used in encoding declarations in XML and HTML and
> in a package call in LaTeX.

> The problem is, that we don't know which encoding an application calling
> `publish_string()` will finally use for storing/transferring the output
> outside the realms of Python.
> Given this, an encoding declaration with hard-coded "utf-8" is worse than no
> encoding declaration.

OK, your fix seems reasonable. I would note though that LaTeX has used UTF-8 as the default encoding since 2018, so no "inputenc" and a declaration of UTF-8 encoding have exactly the same effect.

-----

> In the unit tests, I suggest decoding the `bytes` returned by
> `publish_string()` over the "unicode" pseudo-encoding, e.g.

> diff --git a/docutils/test/DocutilsTestSupport.py b/docutils/test/DocutilsTestSupport.py
> [snip]

This is to reflect where "publish_string" will end up (returning Python strings), but I don't see a large difference between the two approaches so happy to switch if you would prefer.

-----

> As an aside:

> The `docutils.io.StringInput` class can transparently handle input from
> `str` and `bytes` instances.
> The following patch makes this more explicit:

For similar reasons, I would prefer to make this more explicit and separate ``StringInput`` from a hypothetical ``BytesInput``. This can be discussed later though, when (if) we move away from all of the ``io`` classes.

Your documentation patch looks OK.

-----

> The only effect of setting ``input_encoding = 'unicode'`` is a warning if
> the input object is not a `str` instance. Given that 'unicode' is a
> non-standard encoding name that may be deprecated later, it is, IMV, not
> worth recommending this in the docstring to `publish_string`:

> diff --git a/docutils/docutils/core.py b/docutils/docutils/core.py
> [snip]

I agree with this.

-----

Thanks,
Adam