From: Adam T. <aat...@ou...> - 2022-11-06 00:35:07
|
Dear Günter, I am replying to all three notes in the below, apologies for the rather long note that follows. ----- >>> (In the long run, I'd like to replace the "homegrown" docutils.io classes >>> with standard objects, but this is a separate topic.) > r9208 ... r9211: > "In the long run" is not next day! Please ask before such changes. I saw these as fairly unobtrusive changes that didn't touch the core ``io`` classes -- thank you for noting the incompatibilities with input encoding, though I think this reveals some deficiencies in our test suite -- I run all my changes against the test suite on Python 3.7, 3.8, 3.9, 3.10, and 3.11 before committing. ----- >> Replacing the "FileInput/StringInput/FileOutput" classes can currently >> only be done in cases, where we are sure not to introduce backwards >> incompatibilities. >> For the "include" and "csv-table" directives, the changes definitely change >> the behaviour for the "input_encoding" default value! >> For the other changes, I am not sure. > Checked r9208 ... r9211 and reverted the incompatible changes in r9217. I should like to revert the reversion, though I shall look to see what needs to be done to address the custom input encoding logic. ----- >>> The naming of the `core.publish_string()` API function pre-dates Python3 >>> and uses the term "string" as a superordinate for "Unicode strings" >>> (`str`/`unicode`) and "encoded strings" (`bytes`/`str`). >>> The documentation is explicit about the return value >>> (defaulting to 'utf-8' encoded `bytes` instances). >>> The function is part of the "first level API": intended for use by >>> 3rd-party code. Changing the return value has the potential of many >>> breaks. >>> Please revert the changes to core.py and io.py. >> I have reverted the change (thank you) but introduced a deprecation for >> returning non-(unicode) strings from "publish_string". I feel this is a >> reasonable step towards having a more intuitive API. > IMV, the name `publish_string` uses the generic notation of a "string" > as "sequence of alphanumeric text or other characters, either as a > literal constant or as some kind of variable", > cf. https://en.wikipedia.org/wiki/String_(computer_science). I see your point with respect to "string", and it is an overloaded term. My primary argument is to do the "most obvious" (or "least surprising") thing given the domain context -- I firmly believe that in the modern Python 3 ecosystem, when advertising a method with "string" in the name we should return strings as defined by Python, which are sequences of Unicode code points. ----- >> Naming of "publish_bytes" and "BytesOutput" I am not that set on, we >> can change to "BinaryXXX" if preferred. > I don't think we need a new custom "BytesOutput" class when in the long run > we will use standard io classes. By this, do you mean "BytesIO" et al? I think using this would be sensible, introducing the new class is (as I see it) a bridging mechanism, as dropping in "BytesIO" does not fit the expected API at the moment. ----- >> Regarding the "core.publish_string()" function, I see three possibilities: > Alternatives forward: >> 1. Just keep as is. >> This is core API behaviour, so we need a very strong indication >> that the advantages outweigh the hassle. >> 2. Add a new boolean argument: "encode". >> With ``publish_string(encode=False)``, the "output_encoding" setting >> would be ignored and the return value is a `bytes` object. >> With ``publish_string(encode=True)``, the behaviour is as currently, >> encoding with "output_encoding" or returning `bytes` or `str` object >> (the latter for ``output_encoding == "unicode"``). >> The default value could change from True to False in Docutils 2.0. >> 3. Deprecate¹ "publish_string()" in favour of new, separate >> "publish_unicode_str()" and "publish_bytes()" functions. >> ¹PendingDeprecationWarning now, DeprecationWarning in Docutils 1.0. > 4. Keep "publish_string()" as-is. > New function "publish_str_instance()", say. > The "output_encoding" setting will be used for encoding declarations > and to determine required character replacements in the LaTeX writer > but the return value is guaranteed to be a `str` instance. > > > My current favourites are 1 or 4. By (1) do you mean keep it as is as of [r9217], where returning bytes from "publish_string" is deprecated? If so, I would vote this option. My rationale is as above, I think for the ordinary Python programmer, one would expect a "string" return type. Anecdotally, I found this behaviour confusing when I first used Docutils. ----- >>>>> r9178 >>> 8 lines of doctest code seems overly verbose in the context of an >>> "overview of the Docutils architecture". >> I tend to agree, "core.publish_string" would be nicer, but there is no >> way in Docutils to disable applying the default transforms. >>> It can still be useful for development: testing the reST parser (including >>> the option to generate input/output pairs for parser test scripts). >>> I'd move it to tools/dev and remove the EasyDialogs part. >> Done in [r9201] > So now the "hacking" guide can again refer to "quicktest.py". > BTW: the fact that it "has not been substantively modified since 2006" is > no argument for the move (the `cat` shell command, say, has not been > substantially modified for ages, too but moving it to a developers only > path would be bad.) > Rather, "quicktest.py" has never been installed into the "PATH" > in setup.py and is a tool for developers rather than end-users. I will look to add back the quicktest method, I think it might be valuable to keep the Python method too as it shows how to achieve the same result only using the Python API. ----- > r9207 > There is an established format in the HISTORY file: > General changes, followed by changes listed in alphabetical order of > affected file (or directory for related changes in one directory) > Name patch authors only for "external" patches (where the name differs > from the committer in the SVN log). Thank you for making me aware of this. What should be done for changes that are cross-cutting, e.g. the test changes where the same change is being made to many test files, though perhaps not in the same directory? To explain my original approach, I am more used to an approach which enumerates history by patch or changeset, so for example describing a set of changes and the rationale / effect, rather than enumerating by filename. ----- > On 2022-11-03, Guenter Milde via Docutils-develop wrote: > New discovered bug with ``output_encoding == "unicode"``. > r9202 contains a "quick and dirty fix" for a hithero unrecognized problem > with the pseudo-encoding name "unicode": > This non-standard name is used in encoding declarations in XML and HTML and > in a package call in LaTeX. > The problem is, that we don't know which encoding an application calling > `publish_string()` will finally use for storing/transferring the output > outside the realms of Python. > Given this, an encoding declaration with hard-coded "utf-8" is worse than no > encoding declaration. OK, your fix seems reasonable. I would note though that LaTeX has used UTF-8 as the default encoding since 2018, so no "inputenc" and a declaration of UTF-8 encoding have exactly the same effect. ----- > In the unit tests, I suggest decoding the `bytes` returned by > `publish_string()` over the "unicode" pseudo-encoding, e.g. > diff --git a/docutils/test/DocutilsTestSupport.py b/docutils/test/DocutilsTestSupport.py > [snip] This is to reflect where "publish_string" will end up (returning Python strings), but I don't see a large difference between the two approaches so happy to switch if you would prefer. ----- > As an aside: > The `docutils.io.StringInput` class can transparently handle input from > `str` and `bytes` instances. > The following patch makes this more explicit: For similar reasons, I would prefer to make this more explicit and separate ``StringInput`` from a hypothetical ``BytesInput``. This can be discussed later though, when (if) we move away from all of the ``io`` classes. Your documentation patch looks OK. ----- > The only effect of setting ``input_encoding = 'unicode'`` is a warning if > the input object is not a `str` instance. Given that 'unicode' is a > non-standard encoding name that may be deprecated later, it is, IMV, not > worth recommending this in the docstring to `publish_string`: > diff --git a/docutils/docutils/core.py b/docutils/docutils/core.py > [snip] I agree with this. ----- Thanks, Adam |