From: Guenter M. <mi...@us...> - 2022-10-31 23:35:24
|
Dear Docutils developers, dear Adam, after a long period of calm, I realize a flurry of commits over the last couple of days -- far more than I can reasonably review in time. It is nice to see development activity but also alarming to see it at this pace, as even a short looking over revealed some pitfalls. For other changes I would have preferred a discussion or at least some rationale in the commit messages (not just a one-liner telling what is done). Changes to the test suite should not only keep the tests running but also ensure intelligable feedback in case of test failures. Did you test how the changes affect error reports? r9135 tex2unichar.py is a generated file, changes will be lost with an update from the database. (I updated the generating script and prepared a fixup patch.) r9140, r9141: Import ``DocutilsTestSupport`` from ``test`` This commit breaks the ability to run individual test scripts from their respective directories. This is (was) an important help when working on the test scripts. r9145 Since when is _format_str() unused? Does this affect reporting of test failures? r9146 DocutilsTestSupport must be imported before docutils so that imports from docutils use the test-dirs parent directory (in case we want to test a not installed Docutils version) r9163 What happend to sys.stdout.flush() ? r9165 compare_output() no longer prints the corresponding input in case of assertion errors which makes finding problems in parser test suites harder. r9167 IMV, an encoded string ("bytes" object) is no "binary" output. I need some more time to understand the changes here. A StringOutput.encode() method that *decodes* bytes data seems odd: @@ -581,6 +600,11 @@ class StringOutput(Output): + def encode(self, data): + if isinstance(data, bytes): + return data.decode(self.encoding, self.error_handler) + return str(data) + r9168 Use ``.rstrip()`` instead This fails with multi-lined output if line endings in output and extended differ. r9178 In case we don't want to ship "quicktest.py" with the Docutils distribution, we can still keep it in the Docutils repo and simply change: -can be found in the ``tools/`` directory of the Docutils distribution) +can be found in the ``tools/`` directory of the Docutils repository) r9179, r9180 there should be a way to avoid this code repetition ... Günter |
From: Adam T. <aat...@ou...> - 2022-11-01 01:10:42
|
Dear Günter, Thank you for the note. I shall aim to add more rationale in commit messages in future. Thank you also for the comments, which I've briefly addressed below. The overall aim with the test changes is to be able to use the standard library's ``unittest`` runner, which will alleviate some of the points you raise. > r9135 > tex2unichar.py is a generated file, changes will be lost with an > update from the database. > (I updated the generating script and prepared a fixup patch.) I saw the note that it is generated, but I could not find the generating script in the repository (even in ``sandbox/``) -- please may you let me know where it is? > r9140, r9141: Import ``DocutilsTestSupport`` from ``test`` > > This commit breaks the ability to run individual test scripts from > their respective directories. This is (was) an important help when > working on the test scripts. The unittest runner resolves this, though we are not quite at that point. We could add an option to ``alltests.py`` to run only tests in a specific directory? > r9145 > Since when is _format_str() unused? > Does this affect reporting of test failures? r9143 removed the redefinitions of ``assertEqual``, ``assertNotEqual``, etc, which were the only uses of _format_str() This allows using ``TestCase``'s specialised equality methods by type. Differences in strings with newlines (the reported purpose of _format_str()) are handled well by TestCase > r9146 > DocutilsTestSupport must be imported before docutils > so that imports from docutils use the test-dirs parent directory > (in case we want to test a not installed Docutils version) If alltests.py is used to run tests, this is still the case. We could add an import of DTS to those modules again, though my aim eventually is to remove DTS. Is testing an uninstalled Docutils a usecase that is widely relied upon? > r9163 > What happend to sys.stdout.flush() ? ``.flush()`` is still called before running tests, I didn't see a reason to call it before the environment information print-outs, though if it is needed we should add it back in. > r9165 > compare_output() no longer prints the corresponding input in case of > assertion errors which makes finding problems in parser test suites > harder. My aim for parser test suites is to improve the test harnesses further, so this is not a permanent removal. > r9167 > IMV, an encoded string ("bytes" object) is no "binary" output. > > I need some more time to understand the changes here. > > A StringOutput.encode() method that *decodes* bytes data seems > odd: > > @@ -581,6 +600,11 @@ class StringOutput(Output): > > + def encode(self, data): > + if isinstance(data, bytes): > + return data.decode(self.encoding, self.error_handler) > + return str(data) > + Python 3 bytes objects are sequences of 8-bit bytes, so I would say that it is binary output, but perhaps we could improve here. I think the naming of ``.encode()`` is unfortunate in this context, but StringOutput should return strings (text content) and BytesOutput should return bytes (binary content). > r9168 Use ``.rstrip()`` instead > > This fails with multi-lined output if line endings in output and extended > differ. For tests, I believe we should ensure that we have as little 'magic' as possible -- what we say we expect should be bit-for-bit identical to what we generate. In this sense I fixed the newline errors to ensure correctness. > r9178 > In case we don't want to ship "quicktest.py" with the Docutils > distribution, we can still keep it in the Docutils repo and simply change: > -can be found in the ``tools/`` directory of the Docutils distribution) > +can be found in the ``tools/`` directory of the Docutils repository) I was actually going to suggest deleting "quicktest.py" entirely--finding the version can now be done with "docutils --version" and the last substantive change in functionality was 2006. "EasyDialogs" was removed in Python 2.7. > r9179, r9180 > there should be a way to avoid this code repetition In general for tests I am more open to repetition as it is useful to be explicit about what is being tested, though I intend to refactor the functional tests, which hopefully will help achieve this goal. Thanks, Adam |
From: Guenter M. <mi...@us...> - 2022-11-01 17:00:23
|
Dear Adam, thank you for the fast reply. On 2022-11-01, Adam Turner wrote: ... > The overall aim with the test changes is to be able to use the standard > library's ``unittest`` runner, which will alleviate some of the points > you raise. I understand and support the overall aim to use standard tools for the tests. I would have preferred a "feature branch" (as mandated by the "policies") for this change set, though. (I agree that with git this would have been easier and have to apologise for the delay of my respective reviews.) >> r9135 >> tex2unichar.py is a generated file, changes will be lost with an >> update from the database. > >> (I updated the generating script and prepared a fixup patch.) > I saw the note that it is generated, but I could not find the > generating script in the repository (even in ``sandbox/``) -- please > may you let me know where it is? It is at the place indicated in the script, http://milde.users.sourceforge.net/LUCR/Math/ in the "tools" subdir. The latest version is not published yet, though, and I plan moving the project to codeberg.org. Generally, if a file is created or maintained by an active member of the Docutils developer team, I'd prefer an "ask before change" (except for emergencies and small typos). >> r9140, r9141: Import ``DocutilsTestSupport`` from ``test`` >> This commit breaks the ability to run individual test scripts from >> their respective directories. This is (was) an important help when >> working on the test scripts. > The unittest runner resolves this, though we are not quite at that > point. We could add an option to ``alltests.py`` to run only tests in a > specific directory? When editing a test script, I regularly run it from my editor (it is just one click). This is a feature I don't want to miss, the actual implementation is less important (but an option to alltests.py would not help, it should work from the scrip's "shebang line"). Is there a way to call the "unittest runner" in the shebang line of the individual scripts? ... >> r9146 >> DocutilsTestSupport must be imported before docutils >> so that imports from docutils use the test-dirs parent directory >> (in case we want to test a not installed Docutils version) > If alltests.py is used to run tests, this is still the case. > We could add an import of DTS to those modules again, though my aim > eventually is to remove DTS. Is testing an uninstalled Docutils a > usecase that is widely relied upon? At least for me, it is a "natural" thing to expect (as I got used to it over the last decade). ... >> r9167 Introduces a backwards incompatible API change! (see below) >> IMV, an encoded string ("bytes" object) is no "binary" output. ... > Python 3 bytes objects are sequences of 8-bit bytes, so I would say > that it is binary output, but perhaps we could improve here. The Python 3.10 documentation for `bytes` calls it a "binary sequence type", so I may have to adapt. The naming of the `core.publish_string()` API function pre-dates Python3 and uses the term "string" as a superordinate for "Unicode strings" (`str`/`unicode`) and "encoded strings" (`bytes`/`str`). The documentation is explicit about the return value (defaulting to 'utf-8' encoded `bytes` instances). The function is part of the "first level API": intended for use by 3rd-party code. Changing the return value has the potential of many breaks. Please revert the changes to core.py and io.py. Unicode `str` output can be achieved passing ``settings_overwrites={'output_encoding': 'unicode'}`` to publish_string(). (In the long run, I'd like to replace the "homegrown" docutils.io classes with standard objects, but this is a separate topic.) >> r9168 Use ``.rstrip()`` instead >> This fails with multi-lined output if line endings in output and extended >> differ. > For tests, I believe we should ensure that we have as little 'magic' as > possible -- what we say we expect should be bit-for-bit identical to > what we generate. In this sense I fixed the newline errors to ensure > correctness. This depends on the context: over-specified tests contribute to noise due to false positives. Test results should not depend on the "native" line endings of the system running the tests. >> r9178 8 lines of doctest code seems overly verbose in the context of an "overview of the Docutils architecture". >> In case we don't want to ship "quicktest.py" with the Docutils >> distribution, we can still keep it in the Docutils repo and simply change: > >> -can be found in the ``tools/`` directory of the Docutils distribution) >> +can be found in the ``tools/`` directory of the Docutils repository) > I was actually going to suggest deleting "quicktest.py" > entirely--finding the version can now be done with "docutils --version" > and the last substantive change in functionality was 2006. > "EasyDialogs" was removed in Python 2.7. It can still be useful for development: testing the reST parser (including the option to generate input/output pairs for parser test scripts). I'd move it to tools/dev and remove the EasyDialogs part. Günter |
From: Adam T. <aat...@ou...> - 2022-11-03 00:10:55
|
Dear Günter, >> The overall aim with the test changes is to be able to use the standard >> library's ``unittest`` runner, which will alleviate some of the points >> you raise. > I understand and support the overall aim to use standard tools for the tests. > I would have preferred a "feature branch" (as mandated by the "policies") > for this change set, though. (I agree that with git this would have been > easier and have to apologise for the delay of my respective reviews.) I'll bear in mind for future projects, thank you. > Generally, if a file is created or maintained by an active member of the > Docutils developer team, I'd prefer an "ask before change" (except for > emergencies and small typos). Sounds good! >> r9140, r9141: Import ``DocutilsTestSupport`` from ``test`` > When editing a test script, I regularly run it from my editor (it is > just one click). This is a feature I don't want to miss, the actual > implementation is less important ... OK, I will look to ensure this is still possible. >>> r9167 > Introduces a backwards incompatible API change! (see below) Partially reverted in [r9202] (see below) >>> IMV, an encoded string ("bytes" object) is no "binary" output. >> Python 3 bytes objects are sequences of 8-bit bytes, so I would say >> that it is binary output, but perhaps we could improve here. > The Python 3.10 documentation for `bytes` calls it a "binary sequence > type", so I may have to adapt. > The naming of the `core.publish_string()` API function pre-dates Python3 > and uses the term "string" as a superordinate for "Unicode strings" > (`str`/`unicode`) and "encoded strings" (`bytes`/`str`). > The documentation is explicit about the return value > (defaulting to 'utf-8' encoded `bytes` instances). > The function is part of the "first level API": intended for use by > 3rd-party code. Changing the return value has the potential of many > breaks. > Please revert the changes to core.py and io.py. > Unicode `str` output can be achieved passing > ``settings_overwrites={'output_encoding': 'unicode'}`` to > publish_string(). > (In the long run, I'd like to replace the "homegrown" docutils.io classes > with standard objects, but this is a separate topic.) I have reverted the change (thank you) but introduced a deprecation for returning non-(unicode) strings from "publish_string". I feel this is a reasonable step towards having a more intuitive API. Naming of "publish_bytes" and "BytesOutput" I am not that set on, we can change to "BinaryXXX" if preferred. >>> r9168 Use ``.rstrip()`` instead > This depends on the context: over-specified tests contribute to noise due to > false positives. Test results should not depend on the "native" line endings > of the system running the tests. Python converts all line-endings to "\n" when opening a file in text-mode (or in string literals), so this should not be an issue. [1] [2] [1]: https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files [2]: https://peps.python.org/pep-0278/ >>> r9178 > 8 lines of doctest code seems overly verbose in the context of an > "overview of the Docutils architecture". I tend to agree, "core.publish_string" would be nicer, but there is no way in Docutils to disable applying the default transforms. > It can still be useful for development: testing the reST parser (including > the option to generate input/output pairs for parser test scripts). > I'd move it to tools/dev and remove the EasyDialogs part. Done in [r9201] Thanks, Adam |
From: Guenter M. <mi...@us...> - 2022-11-03 23:34:11
|
is still too fast... On 2022-11-02, Adam Turner wrote: >> Generally, if a file is created or maintained by an active member of the >> Docutils developer team, I'd prefer an "ask before change" (except for >> emergencies and small typos). > Sounds good! ... >> (In the long run, I'd like to replace the "homegrown" docutils.io classes >> with standard objects, but this is a separate topic.) r9208 ... r9211: "In the long run" is not next day! Please ask before such changes. Replacing the "FileInput/StringInput/FileOutput" classes can currently only be done in cases, where we are sure not to introduce backwards incompatibilities. For the "include" and "csv-table" directives, the changes definitely change the behaviour for the "input_encoding" default value! For the other changes, I am not sure. >>>> r9167 >> Introduces a backwards incompatible API change! (see below) > Partially reverted in [r9202] (see below) So we are half way. Please no new patches on top before this is sorted out. ... >> The naming of the `core.publish_string()` API function pre-dates Python3 >> and uses the term "string" as a superordinate for "Unicode strings" >> (`str`/`unicode`) and "encoded strings" (`bytes`/`str`). >> The documentation is explicit about the return value >> (defaulting to 'utf-8' encoded `bytes` instances). >> The function is part of the "first level API": intended for use by >> 3rd-party code. Changing the return value has the potential of many >> breaks. >> Please revert the changes to core.py and io.py. ... > I have reverted the change (thank you) but introduced a deprecation for > returning non-(unicode) strings from "publish_string". I feel this is a > reasonable step towards having a more intuitive API. IMV, the name `publish_string` uses the generic notation of a "string" as "sequence of alphanumeric text or other characters, either as a literal constant or as some kind of variable", cf. https://en.wikipedia.org/wiki/String_(computer_science). > Naming of "publish_bytes" and "BytesOutput" I am not that set on, we > can change to "BinaryXXX" if preferred. I don't think we need a new custom "BytesOutput" class when in the long run we will use standard io classes. Regarding the "core.publish_string()" function, I see three possibilities: 1. Just keep as is. This is core API behaviour, so we need a very strong indication that the advantages outweigh the hassle. 2. Add a new boolean argument: "encode". With ``publish_string(encode=False)``, the "output_encoding" setting would be ignored and the return value is a `bytes` object. With ``publish_string(encode=True)``, the behaviour is as currently, encoding with "output_encoding" or returning `bytes` or `str` object (the latter for ``output_encoding == "unicode"``). The default value could change from True to False in Docutils 2.0. 3. Deprecate¹ "publish_string()" in favour of new, separate "publish_unicode_str()" and "publish_bytes()" functions. ¹PendingDeprecationWarning now, DeprecationWarning in Docutils 1.0. >>>> r9178 >> 8 lines of doctest code seems overly verbose in the context of an >> "overview of the Docutils architecture". > I tend to agree, "core.publish_string" would be nicer, but there is no > way in Docutils to disable applying the default transforms. >> It can still be useful for development: testing the reST parser (including >> the option to generate input/output pairs for parser test scripts). >> I'd move it to tools/dev and remove the EasyDialogs part. > Done in [r9201] So now the "hacking" guide can again refer to "quicktest.py". BTW: the fact that it "has not been substantively modified since 2006" is no argument for the move (the `cat` shell command, say, has not been substantially modified for ages, too but moving it to a developers only path would be bad.) Rather, "quicktest.py" has never been installed into the "PATH" in setup.py and is a tool for developers rather than end-users. r9207 There is an established format in the HISTORY file: General changes, followed by changes listed in alphabetical order of affected file (or directory for related changes in one directory) Name patch authors only for "external" patches (where the name differs from the committer in the SVN log). This is what I found at first glance, still not convinced this is all. Günter |
From: Guenter M. <mi...@us...> - 2022-11-04 15:54:37
|
Dear Adam, a follow up with some more discoveries and thoughts... On 2022-11-03, Guenter Milde via Docutils-develop wrote: > On 2022-11-02, Adam Turner wrote: >>>>> r9167 >> Partially reverted in [r9202] ... > Regarding the "core.publish_string()" function ... Alternatives forward: > 1. Just keep as is. > This is core API behaviour, so we need a very strong indication > that the advantages outweigh the hassle. > 2. Add a new boolean argument: "encode". > With ``publish_string(encode=False)``, the "output_encoding" setting > would be ignored and the return value is a `bytes` object. > With ``publish_string(encode=True)``, the behaviour is as currently, > encoding with "output_encoding" or returning `bytes` or `str` object > (the latter for ``output_encoding == "unicode"``). > The default value could change from True to False in Docutils 2.0. > 3. Deprecate¹ "publish_string()" in favour of new, separate > "publish_unicode_str()" and "publish_bytes()" functions. > ¹PendingDeprecationWarning now, DeprecationWarning in Docutils 1.0. 4. Keep "publish_string()" as-is. New function "publish_str_instance()", say. The "output_encoding" setting will be used for encoding declarations and to determine required character replacements in the LaTeX writer but the return value is guaranteed to be a `str` instance. My current favourites are 1 or 4. New discovered bug with ``output_encoding == "unicode"``. r9202 contains a "quick and dirty fix" for a hithero unrecognized problem with the pseudo-encoding name "unicode": This non-standard name is used in encoding declarations in XML and HTML and in a package call in LaTeX. The problem is, that we don't know which encoding an application calling `publish_string()` will finally use for storing/transferring the output outside the realms of Python. Given this, an encoding declaration with hard-coded "utf-8" is worse than no encoding declaration. A fix for LaTeX would be diff --git a/docutils/docutils/writers/latex2e/__init__.py b/docutils/docutils/writers/latex2e/__init__.py index 45086ea09..a306fef9a 100644 --- a/docutils/docutils/writers/latex2e/__init__.py +++ b/docutils/docutils/writers/latex2e/__init__.py @@ -1304,7 +1304,8 @@ class LaTeXTranslator(nodes.NodeVisitor): # ~~~~~~~~~~~~~~~~ # Encodings: # Docutils' output-encoding => TeX input encoding - if self.latex_encoding != 'ascii': + if self.latex_encoding not in ('ascii', 'unicode'): + # TODO: also don't insert for 'utf8' (cf. RELEASE-NOTES) self.requirements['_inputenc'] = (r'\usepackage[%s]{inputenc}' % self.latex_encoding) # TeX font encoding @@ -1459,8 +1460,7 @@ # 'iso-8859-7': '' # greek # 'iso-8859-8': '' # hebrew # 'iso-8859-10': '' # latin6, more complete iso-8859-4 - 'unicode': 'utf8', # TEMPORARY, remove in Docutils 0.21 } encoding = docutils_encoding.lower() In the unit tests, I suggest decoding the `bytes` returned by `publish_string()` over the "unicode" pseudo-encoding, e.g. diff --git a/docutils/test/DocutilsTestSupport.py b/docutils/test/DocutilsTestSupport.py index 496e68dd7..2df660d85 100644 --- a/docutils/test/DocutilsTestSupport.py +++ b/docutils/test/DocutilsTestSupport.py @@ -493,7 +493,7 @@ class WriterPublishTestCase(CustomTestCase, docutils.SettingsSpec): settings_default_overrides = {'_disable_config': True, 'strict_visitor': True, - 'output_encoding': 'unicode'} + } writer_name = '' # set in subclasses or constructor def __init__(self, *args, writer_name='', **kwargs): @@ -509,7 +509,11 @@ class WriterPublishTestCase(CustomTestCase, docutils.SettingsSpec): writer_name=self.writer_name, settings_spec=self, settings_overrides=self.suite_settings) - self.assertEqual(str(output), str(self.expected)) + try: + output = output.decode() + except AttributeError: + pass + self.assertEqual(output, self.expected) class PublishTestSuite(CustomTestSuite): As an aside: The `docutils.io.StringInput` class can transparently handle input from `str` and `bytes` instances. The following patch makes this more explicit: diff --git a/docutils/docutils/io.py b/docutils/docutils/io.py index 6714ca22b..53e93886a 100644 --- a/docutils/docutils/io.py +++ b/docutils/docutils/io.py @@ -576,15 +576,15 @@ class BytesOutput(Output): class StringInput(Input): - - """ - Direct string input. - """ + """Input from a `str` or `bytes` instance.""" default_source_path = '<string>' def read(self): - """Decode and return the source string.""" + """Return the source as `str` instance. + + Decode, if required (see `Input.decode`). + """ return self.decode(self.source) The only effect of setting ``input_encoding = 'unicode'`` is a warning if the input object is not a `str` instance. Given that 'unicode' is a non-standard encoding name that may be deprecated later, it is, IMV, not worth recommending this in the docstring to `publish_string`: diff --git a/docutils/docutils/core.py b/docutils/docutils/core.py index 4d50e7fc7..798ffc5a1 100644 --- a/docutils/docutils/core.py +++ b/docutils/docutils/core.py @@ -437,10 +437,6 @@ def publish_string(source, source_path=None, destination_path=None, publish_string(..., settings_overrides={'output_encoding': 'unicode'}) - Similarly for Unicode string input (`source`):: - - publish_string(..., settings_overrides={'input_encoding': 'unicode'}) - Parameters: see `publish_programmatically`. """ warnings.warn('The return type of publish_string will change to ' hope to hear from you soon, Günter |
From: Guenter M. <mi...@us...> - 2022-11-04 18:35:19
|
On 2022-11-03, Guenter Milde via Docutils-develop wrote: > On 2022-11-02, Adam Turner wrote: > Replacing the "FileInput/StringInput/FileOutput" classes can currently > only be done in cases, where we are sure not to introduce backwards > incompatibilities. > For the "include" and "csv-table" directives, the changes definitely change > the behaviour for the "input_encoding" default value! > For the other changes, I am not sure. Checked r9208 ... r9211 and reverted the incompatible changes in r9217. Günter |
From: Adam T. <aat...@ou...> - 2022-11-06 00:35:07
|
Dear Günter, I am replying to all three notes in the below, apologies for the rather long note that follows. ----- >>> (In the long run, I'd like to replace the "homegrown" docutils.io classes >>> with standard objects, but this is a separate topic.) > r9208 ... r9211: > "In the long run" is not next day! Please ask before such changes. I saw these as fairly unobtrusive changes that didn't touch the core ``io`` classes -- thank you for noting the incompatibilities with input encoding, though I think this reveals some deficiencies in our test suite -- I run all my changes against the test suite on Python 3.7, 3.8, 3.9, 3.10, and 3.11 before committing. ----- >> Replacing the "FileInput/StringInput/FileOutput" classes can currently >> only be done in cases, where we are sure not to introduce backwards >> incompatibilities. >> For the "include" and "csv-table" directives, the changes definitely change >> the behaviour for the "input_encoding" default value! >> For the other changes, I am not sure. > Checked r9208 ... r9211 and reverted the incompatible changes in r9217. I should like to revert the reversion, though I shall look to see what needs to be done to address the custom input encoding logic. ----- >>> The naming of the `core.publish_string()` API function pre-dates Python3 >>> and uses the term "string" as a superordinate for "Unicode strings" >>> (`str`/`unicode`) and "encoded strings" (`bytes`/`str`). >>> The documentation is explicit about the return value >>> (defaulting to 'utf-8' encoded `bytes` instances). >>> The function is part of the "first level API": intended for use by >>> 3rd-party code. Changing the return value has the potential of many >>> breaks. >>> Please revert the changes to core.py and io.py. >> I have reverted the change (thank you) but introduced a deprecation for >> returning non-(unicode) strings from "publish_string". I feel this is a >> reasonable step towards having a more intuitive API. > IMV, the name `publish_string` uses the generic notation of a "string" > as "sequence of alphanumeric text or other characters, either as a > literal constant or as some kind of variable", > cf. https://en.wikipedia.org/wiki/String_(computer_science). I see your point with respect to "string", and it is an overloaded term. My primary argument is to do the "most obvious" (or "least surprising") thing given the domain context -- I firmly believe that in the modern Python 3 ecosystem, when advertising a method with "string" in the name we should return strings as defined by Python, which are sequences of Unicode code points. ----- >> Naming of "publish_bytes" and "BytesOutput" I am not that set on, we >> can change to "BinaryXXX" if preferred. > I don't think we need a new custom "BytesOutput" class when in the long run > we will use standard io classes. By this, do you mean "BytesIO" et al? I think using this would be sensible, introducing the new class is (as I see it) a bridging mechanism, as dropping in "BytesIO" does not fit the expected API at the moment. ----- >> Regarding the "core.publish_string()" function, I see three possibilities: > Alternatives forward: >> 1. Just keep as is. >> This is core API behaviour, so we need a very strong indication >> that the advantages outweigh the hassle. >> 2. Add a new boolean argument: "encode". >> With ``publish_string(encode=False)``, the "output_encoding" setting >> would be ignored and the return value is a `bytes` object. >> With ``publish_string(encode=True)``, the behaviour is as currently, >> encoding with "output_encoding" or returning `bytes` or `str` object >> (the latter for ``output_encoding == "unicode"``). >> The default value could change from True to False in Docutils 2.0. >> 3. Deprecate¹ "publish_string()" in favour of new, separate >> "publish_unicode_str()" and "publish_bytes()" functions. >> ¹PendingDeprecationWarning now, DeprecationWarning in Docutils 1.0. > 4. Keep "publish_string()" as-is. > New function "publish_str_instance()", say. > The "output_encoding" setting will be used for encoding declarations > and to determine required character replacements in the LaTeX writer > but the return value is guaranteed to be a `str` instance. > > > My current favourites are 1 or 4. By (1) do you mean keep it as is as of [r9217], where returning bytes from "publish_string" is deprecated? If so, I would vote this option. My rationale is as above, I think for the ordinary Python programmer, one would expect a "string" return type. Anecdotally, I found this behaviour confusing when I first used Docutils. ----- >>>>> r9178 >>> 8 lines of doctest code seems overly verbose in the context of an >>> "overview of the Docutils architecture". >> I tend to agree, "core.publish_string" would be nicer, but there is no >> way in Docutils to disable applying the default transforms. >>> It can still be useful for development: testing the reST parser (including >>> the option to generate input/output pairs for parser test scripts). >>> I'd move it to tools/dev and remove the EasyDialogs part. >> Done in [r9201] > So now the "hacking" guide can again refer to "quicktest.py". > BTW: the fact that it "has not been substantively modified since 2006" is > no argument for the move (the `cat` shell command, say, has not been > substantially modified for ages, too but moving it to a developers only > path would be bad.) > Rather, "quicktest.py" has never been installed into the "PATH" > in setup.py and is a tool for developers rather than end-users. I will look to add back the quicktest method, I think it might be valuable to keep the Python method too as it shows how to achieve the same result only using the Python API. ----- > r9207 > There is an established format in the HISTORY file: > General changes, followed by changes listed in alphabetical order of > affected file (or directory for related changes in one directory) > Name patch authors only for "external" patches (where the name differs > from the committer in the SVN log). Thank you for making me aware of this. What should be done for changes that are cross-cutting, e.g. the test changes where the same change is being made to many test files, though perhaps not in the same directory? To explain my original approach, I am more used to an approach which enumerates history by patch or changeset, so for example describing a set of changes and the rationale / effect, rather than enumerating by filename. ----- > On 2022-11-03, Guenter Milde via Docutils-develop wrote: > New discovered bug with ``output_encoding == "unicode"``. > r9202 contains a "quick and dirty fix" for a hithero unrecognized problem > with the pseudo-encoding name "unicode": > This non-standard name is used in encoding declarations in XML and HTML and > in a package call in LaTeX. > The problem is, that we don't know which encoding an application calling > `publish_string()` will finally use for storing/transferring the output > outside the realms of Python. > Given this, an encoding declaration with hard-coded "utf-8" is worse than no > encoding declaration. OK, your fix seems reasonable. I would note though that LaTeX has used UTF-8 as the default encoding since 2018, so no "inputenc" and a declaration of UTF-8 encoding have exactly the same effect. ----- > In the unit tests, I suggest decoding the `bytes` returned by > `publish_string()` over the "unicode" pseudo-encoding, e.g. > diff --git a/docutils/test/DocutilsTestSupport.py b/docutils/test/DocutilsTestSupport.py > [snip] This is to reflect where "publish_string" will end up (returning Python strings), but I don't see a large difference between the two approaches so happy to switch if you would prefer. ----- > As an aside: > The `docutils.io.StringInput` class can transparently handle input from > `str` and `bytes` instances. > The following patch makes this more explicit: For similar reasons, I would prefer to make this more explicit and separate ``StringInput`` from a hypothetical ``BytesInput``. This can be discussed later though, when (if) we move away from all of the ``io`` classes. Your documentation patch looks OK. ----- > The only effect of setting ``input_encoding = 'unicode'`` is a warning if > the input object is not a `str` instance. Given that 'unicode' is a > non-standard encoding name that may be deprecated later, it is, IMV, not > worth recommending this in the docstring to `publish_string`: > diff --git a/docutils/docutils/core.py b/docutils/docutils/core.py > [snip] I agree with this. ----- Thanks, Adam |
From: Guenter M. <mi...@us...> - 2022-11-07 17:43:52
|
Dear Adam, dear Docutils developers, On 2022-11-06, Adam Turner wrote: >> r9208 ... r9211: ... > I saw these as fairly unobtrusive changes that didn't touch the core > ``io`` classes -- thank you for noting the incompatibilities with input > encoding, though I think this reveals some deficiencies in our test > suite -- I run all my changes against the test suite on Python 3.7, > 3.8, 3.9, 3.10, and 3.11 before committing. Yes, the test suite is incomplete: it does not test all fields of the "syntax - settings - components" matrix. This is why non-breaking tests are no guarantee for backwards compatibility and one reason why a review of all commits is so important. I added some more tests for input encoding handling in included files in r9220. > ----- >>> For the "include" and "csv-table" directives, the changes definitely change >>> the behaviour for the "input_encoding" default value! ... >> Checked r9208 ... r9211 and reverted the incompatible changes in r9217. > I should like to revert the reversion, though I shall look to see what > needs to be done to address the custom input encoding logic. Handling of the character encoding should be kept synchronous for the main source and included files. As long as the custom classes from `docutils.io` are used for the main document, there is no much sense in replicating their behaviour for included files. If using the "with" statement is viewed as an important bonus, adding a context manager to the custom io classes could be considered (after the test suite refactoring is finished). > ----- >>>> The naming of the `core.publish_string()` API function ... >> IMV, the name `publish_string` uses the generic notation of a "string" >> as "sequence of alphanumeric text or other characters, either as a >> literal constant or as some kind of variable", >> cf. https://en.wikipedia.org/wiki/String_(computer_science). > I see your point with respect to "string", and it is an overloaded > term. My primary argument is to do the "most obvious" (or "least > surprising") thing given the domain context -- I firmly believe that in > the modern Python 3 ecosystem, when advertising a method with "string" > in the name we should return strings as defined by Python, which are > sequences of Unicode code points. I see your point. OTOH: * `output-encoding`__ is a *general* setting defined as "The text encoding for output". This raises the expectation that all Docutils output "has" the specified encoding and makes the "most obvious" return value a bit less obvious. __ https://docutils.sourceforge.io/docs/user/config.html#output-encoding * The behaviour of `publish_string()` has been stable for many years. Long-time users are familiar with it and expect it to remain stable. >>> Regarding the "core.publish_string()" function, I see three possibilities: >> Alternatives forward: >>> 1. Just keep as is. >>> This is core API behaviour, so we need a very strong indication >>> that the advantages outweigh the hassle. >>> 2. Add a new boolean argument: "encode". ... >>> 3. Deprecate¹ "publish_string()" in favour of new, separate >>> "publish_unicode_str()" and "publish_bytes()" functions. >>> ¹PendingDeprecationWarning now, DeprecationWarning in Docutils 1.0. >> 4. Keep "publish_string()" as-is. >> New function "publish_str_instance()", say. >> The "output_encoding" setting will be used for encoding declarations >> and to determine required character replacements in the LaTeX writer >> but the return value is guaranteed to be a `str` instance. >> My current favourites are 1 or 4. > By (1) do you mean keep it as is as of [r9217], where returning bytes > from "publish_string" is deprecated? > If so, I would vote this option. > My rationale is as above, I think for the ordinary Python programmer, > one would expect a "string" return type. Anecdotally, I found this > behaviour confusing when I first used Docutils. By 1), I mean keeping the behaviour in 0.18, without any changes (except for a more clear documentation). This would be the simplest alternative for experienced Docutils users. If the documentation is clear about possible return values (and even more after adding type hints) users should be able to live with the unfortunate naming. OTOH, I see a use-case for a convenience function returning a `str` instance also in cases where an "intended encoding" of the output is given in the "output-encoding" setting. This way, a program using this function can export a HTML, XML or LaTeX with an encoding declaration as `str` instance, post-process and finally encode it before handing it to storage or a non-Python processor. If we are going to change the core API functionality regarding the convenience function(s) to publish the output as `str` or `bytes` instance, then we should: * do not start this in the middle of a major refactoring of the test suite (where it is hard to spot the changes in expected output from "cosmetic" changes in the test code). * do it in a "quasi static" manner: both, old and the new behaviour must be accessible over a sequence of two or more stable releases. Variant 4) has the disadvantage that the "wart" of the ambiguous/misleading name is not adressed. Changing the behaviour of `publish_string()` breaks existing code and the expectations of long-term users. This brings us back to variant 3): Define 2 new functions with non-ambiguous names, e.g., `publish_str_instance()` and `publish_bytes_instance()` that return the output as a `str` or `bytes` instance respectively. Document, that the "output-encoding" setting is used for encoding declarations in HTML, XML, and LaTeX also with `publish_str_instance()`. It is up to the user to ensure the declared encoding and the actual encoding of output stored in a file or passed to a non-Python program match. Support "output-encoding" value ``''`` (and for backwards compatibility 'unicode') with `publish_str_instance()`: In this case there is no encoding declaration in XML, HTML, and LaTeX. Deprecate "publish_string()": PendingDeprecationWarning now, DeprecationWarning in Docutils 1.0. ... >> I don't think we need a new custom "BytesOutput" class when in the long run >> we will use standard io classes. > By this, do you mean "BytesIO" et al? I think using this would be > sensible, introducing the new class is (as I see it) a bridging > mechanism, as dropping in "BytesIO" does not fit the expected API at > the moment. I see. The custom `docutils.io` classes are deeply embedded in the overall structure of Docutils components and their API. Any move to change/replace them touches Docutils Design Specification (PEP 258) and should be well justified and well-considered, i.e. nothing for the near future. This means that if we want to introduce an explicit `publish_bytes()` convenience function, a corresponding `BytesOutput` class is appropriate. Implementation details would need further discussion. Again, I'd prefer not to do this in the middle of a major refactoring of the test suite. >>>>>> r9178 ... > I will look to add back the quicktest method, I think it might be > valuable to keep the Python method too as it shows how to achieve the > same result only using the Python API. Maybe as a footnote or side-box, so it is less distracting from the main course of the discussion. >> r9207 >> There is an established format in the HISTORY file: ... > Thank you for making me aware of this. What should be done for changes > that are cross-cutting, e.g. the test changes where the same change is > being made to many test files, though perhaps not in the same > directory? The test suite refactoring could be listed under the "general" changes or under "test/*" (or, if you prefer, "test/**"). > To explain my original approach, I am more used to an approach which > enumerates history by patch or changeset, so for example describing a > set of changes and the rationale / effect, rather than enumerating by > filename. I did not invent the HISTORY file format, but I learned to accept and value it as complimentary to the more detailled and chronological SVN log. >> On 2022-11-03, Guenter Milde via Docutils-develop wrote: >> New discovered bug with ``output_encoding == "unicode"``. >> r9202 contains a "quick and dirty fix" for a hithero unrecognized problem >> with the pseudo-encoding name "unicode": >> This non-standard name is used in encoding declarations in XML and HTML and >> in a package call in LaTeX. >> The problem is, that we don't know which encoding an application calling >> `publish_string()` will finally use for storing/transferring the output >> outside the realms of Python. >> Given this, an encoding declaration with hard-coded "utf-8" is worse than no >> encoding declaration. > OK, your fix seems reasonable. I would note though that LaTeX has used > UTF-8 as the default encoding since 2018, so no "inputenc" and a > declaration of UTF-8 encoding have exactly the same effect. >> r9202 >> ... bug with ``output_encoding == "unicode"``. >> This non-standard name is used in encoding declarations in XML and HTML and >> in a package call in LaTeX. >> The problem is, that we don't know which encoding an application calling >> `publish_string()` will finally use for storing/transferring the output >> outside the realms of Python. >> Given this, an encoding declaration with hard-coded "utf-8" is worse than no >> encoding declaration. > OK, your fix seems reasonable. Committed in r9219, together with the planned dropping of the "inputenc" call with UTF-8 encoded output. > I would note though that LaTeX has used UTF-8 as the default encoding > since 2018, so no "inputenc" and a declaration of UTF-8 encoding have > exactly the same effect. The difference is, that the case of encoding the output to something other than utf-8 is easier to handle, e.g., with ``setting_overrides[latex-preamble] = '\usepackage[latin1]{inputenc}\n'``. https://docutils.sourceforge.io/docs/user/config.html#latex-preamble I will follow up with a fix dropping the encoding declaration for HTML and XML output if "output_encoding" is "unicode". > ----- >> In the unit tests, I suggest decoding the `bytes` returned by >> `publish_string()` over the "unicode" pseudo-encoding, e.g. >> diff --git a/docutils/test/DocutilsTestSupport.py b/docutils/test/DocutilsTestSupport.py > This is to reflect where "publish_string" will end up (returning Python > strings), but I don't see a large difference between the two approaches > so happy to switch if you would prefer. As discusses above, I don't think it is a good idea to change the behaviour of `publish_string()` (but eventually replace it by 2 new functions in the long run). Comitted in r9218. > Your documentation patch looks OK. Commited as part of r9221 Günter |
From: Adam T. <aat...@ou...> - 2022-11-13 19:49:52
|
Dear Günter, >> ----- >>>>> The naming of the `core.publish_string()` API function > I see your point. OTOH: > * `output-encoding`__ is a *general* setting defined as > "The text encoding for output". > This raises the expectation that all Docutils output "has" the > specified encoding and makes the "most obvious" return value a bit less > obvious. > > __ https://docutils.sourceforge.io/docs/user/config.html#output-encoding > * The behaviour of `publish_string()` has been stable for many years. > Long-time users are familiar with it and expect it to remain stable. There are 1283 projects published on PyPI that depend on Docutils. I have gone through each of these projects, there are 35 (2.7%) that use the ``docutils.core.publish_string`` function. Of those: * 8 set ``'output_encoding': 'unicode'``, so would be unaffected by the eventual change to return strings. (traitsui, odoo-tools, CodeChat, benchmarkstt, pyfda, pyretis, anna, resplendent) * 14 use the "UTF-8" encoding and call ``.decode()`` straight after, so would have to refactor but want to use Python strings. (Orange-Canvas-Core, galaxy-util, vb2py, gluetool, madgui, pydoc-fork, ScopeSim, bluewhale-canvas-core, rstdoc, galaxy-lib, orange-canvas-core-ml, meditor, jarn.viewdoc, formiko) * 4 are agnostic to output type, as they pass the output of ``publish_string`` straight into ``BeautifulSoup()`` or ``xml.etree.ElementTree.fromstring``, both of which accept either bytes or str. (doc-warden, testimony, pyLanguagetool, turq) * 3 use custom writers or the document tree, and don't use the returned output (pydoctor, restview, fairy-slipper) * 2 are broken by calling ``str()`` on a bytes instance without an ``encoding`` argument. (prettyqt, cornice_sphinx) * 1 ignores output and just uses the call to check it doesn't raise any exceptions (rstcheck-core) * 3 expect ``bytes`` and could use the proposed ``publish_bytes()`` function (awscli, bugrest, quorachallenge) I am happy to work with the ~17 (14 + 3) that would be affected to help them to refactor, should we agree on a way forwards. Out of interest, none used an ``output_encoding`` setting other than "unicode" or "utf-8". I would be content to delay the switch-over of return type from ``publish_string`` to Docutils 1.0 or 2.0 should more time be needed, but I suppose I see the other scenarios as sub-optimal for the long-term in one way or another -- e.g. ``publish_str_instance`` is unweidly to use regularly, and using e.g. ``publish_str`` instead would be confusing when ``publish_string`` still exists. >>> Regarding the "core.publish_string()" function, I see three possibilities: >>> Alternatives forward: >>> 1. [Revert to Docutils 0.19 behaviour, with clearer documentation]. >>> 2. Add a new boolean argument: "encode". >>> 3. Deprecate "publish_string()" in favour of new, separate >>> "publish_unicode_str()" and "publish_bytes()" functions. >>> 4. [Revert] "publish_string()" [to Docutils 0.19 behaviour]. >>> New function "publish_str_instance()", say. > If the documentation is clear about possible return values (and even more after adding type hints) users should be able to live with the unfortunate naming. Unfortunatley as far as I am aware type hints are unable to code for a setting within a dictionary affecting the return type of the function. I agree the documentation should be made clearer. > OTOH, I see a use-case for a convenience function returning a `str` instance also in cases where an "intended encoding" of the output is given in the "output-encoding" setting. This way, a program using this function can export a HTML, XML or LaTeX with an encoding declaration as `str` instance, post-process and finally encode it before handing it to storage or a non-Python processor. Yes, this is my general view too -- I see ``publish_string`` as a function to be called from other Python programmes. > If we are going to change the core API functionality regarding the convenience function(s) to publish the output as `str` or `bytes` instance, then we should: > * do not start this in the middle of a major refactoring of the test suite > (where it is hard to spot the changes in expected output from > "cosmetic" changes in the test code). I agree with this, in retrospect it was a poor choice. > * do it in a "quasi static" manner: both, old and the new behaviour must > be accessible over a sequence of two or more stable releases. This of course makes sense. > This means that if we want to introduce an explicit `publish_bytes()` convenience function, a corresponding `BytesOutput` class is appropriate. OK, though it seems this is dependent on the outcome of the ``publish_string`` decision. ----- Thanks, Adam |
From: Guenter M. <mi...@us...> - 2022-11-10 15:17:30
|
Only add warnigns to core functions, if users can avoid this warnings following a recommended practice. For core API functions and classes: don't change the behaviour but replace with new function(s) showing new behaviour. This way users can switch to the new behaviour at some stage during the transition period. Add a deprecation warning once the new functions are in place. # Suggestion for the next step to solve the "string I/O" problem. # OK to commit? --- docutils/docutils/core.py | 2 -- docutils/docutils/io.py | 27 +++++++++++++-------------- 2 files changed, 13 insertions(+), 16 deletions(-) diff --git a/docutils/docutils/core.py b/docutils/docutils/core.py index a54ed7a90..edbcfc3df 100644 --- a/docutils/docutils/core.py +++ b/docutils/docutils/core.py @@ -443,8 +443,6 @@ def publish_string(source, source_path=None, destination_path=None, Parameters: see `publish_programmatically`. """ - warnings.warn('The return type of publish_string will change to ' - '"str" from Docutils 0.21.', FutureWarning, stacklevel=2) output, pub = publish_programmatically( source_class=io.StringInput, source=source, source_path=source_path, destination_class=io.StringOutput, diff --git a/docutils/docutils/io.py b/docutils/docutils/io.py index 53e93886a..12342025d 100644 --- a/docutils/docutils/io.py +++ b/docutils/docutils/io.py @@ -247,6 +247,11 @@ class Output(TransformSpec): raise NotImplementedError def encode(self, data): + """Encode `data` with `self.encoding` (unless it is already encoded). + + If `self.encoding` is set to the pseudo encoding name "unicode", + `data` must be a `str` instance and is returned unchanged. + """ if self.encoding and self.encoding.lower() == 'unicode': assert isinstance(data, str), ( 'the encoding given is "unicode" but the output is not ' @@ -581,7 +586,7 @@ class StringInput(Input): default_source_path = '<string>' def read(self): - """Return the source as `str` instance. + """Return the source as `str` object. Decode, if required (see `Input.decode`). """ @@ -589,26 +594,20 @@ class StringInput(Input): class StringOutput(Output): - - """ - Direct string output. - """ + """Output to a `bytes` or `str` instance.""" default_destination_path = '<string>' def write(self, data): - """Encode `data`, store it in `self.destination`, and return it.""" + """Encode `data`, store it in `self.destination`, and return it. + + If `self.encoding` is set to the pseudo encoding name "unicode", + `data` must be a `str` instance and is returned unchanged. + (cf. `Output.encode`) + """ self.destination = self.encode(data) return self.destination - def encode(self, data): - data = super().encode(data) - if not isinstance(data, str): - warnings.warn("StringOutput.encode()'s return type will change to " - f'``str`` from Docutils 0.21, got type {type(data)}', - FutureWarning, stacklevel=2) - return data - class NullInput(Input): -- 2.30.2 |
From: Guenter M. <mi...@us...> - 2022-11-16 22:50:50
|
Dear Adam, dear Docutils developers, On 2022-11-13, Adam Turner wrote: >> * `output-encoding`__ is a *general* setting defined as >> "The text encoding for output". This raises the expectation that all >> Docutils output "has" the specified encoding ... >> * The behaviour of `publish_string()` has been stable for many years. >> Long-time users are familiar with it and expect it to remain stable. > There are 1283 projects published on PyPI that depend on Docutils. I > have gone through each of these projects, there are 35 (2.7%) that use > the ``docutils.core.publish_string`` function. > Of those: > * 8 set ``'output_encoding': 'unicode'``, so would be unaffected by the > eventual change to return strings. (traitsui, odoo-tools, CodeChat, > benchmarkstt, pyfda, pyretis, anna, resplendent) > * 14 use the "UTF-8" encoding and call ``.decode()`` straight after, so > would have to refactor but want to use Python strings. > (Orange-Canvas-Core, galaxy-util, vb2py, gluetool, madgui, pydoc-fork, > ScopeSim, bluewhale-canvas-core, rstdoc, galaxy-lib, > orange-canvas-core-ml, meditor, jarn.viewdoc, formiko) > * 4 are agnostic to output type, as they pass the output of > ``publish_string`` straight into ``BeautifulSoup()`` or > ``xml.etree.ElementTree.fromstring``, both of which accept either bytes > or str. (doc-warden, testimony, pyLanguagetool, turq) Good news for Docutils: we are not alone. "xml.etree" also uses: "string" or "string constant" as a superordinate term for a sequence of characters, either encoded (`bytes`) or as Unicode code points (`str`). "unicode" as a pseudo-encoding name for "no encoding" (i.e. "return as `str` instance"). Maybe we can agree with the etree team on a compatible terminology and way forward. > * 3 use custom writers or the document tree, and don't use the returned > output (pydoctor, restview, fairy-slipper) > * 2 are broken by calling ``str()`` on a bytes instance without an > ``encoding`` argument. (prettyqt, cornice_sphinx) > * 1 ignores output and just uses the call to check it doesn't raise any > exceptions (rstcheck-core) > * 3 expect ``bytes`` and could use the proposed ``publish_bytes()`` > function (awscli, bugrest, quorachallenge) One more, `pyreport`_, is currently unmaintained and Python2 only... .. _pyreport: https://github.com/joblib/pyreport There may be more use cases in unpublished packages/modules/scripts or helper scripts in non-Python projects. > I am happy to work with the ~17 (14 + 3) that would be affected to help > them to refactor, should we agree on a way forwards. I am quite confident that we will find a consensus. I would still want to revert the FutureWarnings until there is a stable alternative in place. > Out of interest, none used an ``output_encoding`` setting other than > "unicode" or "utf-8". Did you also check the configuration files? Docutils also defaults to "output_encoding: utf-8" but allows users to change this to any valid encoding via settings_spec or settings_overrides or in a configuration file. (Just checked: publish_string() respects the "output_encoding" set in a docutils.conf file.) > I would be content to delay the switch-over of return type from > ``publish_string`` to Docutils 1.0 or 2.0 should more time be needed, > but I suppose I see the other scenarios as sub-optimal for the > long-term in one way or another -- e.g. ``publish_str_instance`` is > unweidly to use regularly, and using e.g. ``publish_str`` instead would > be confusing when ``publish_string`` still exists. OTOH, `publish_string()` vs. `publish_bytes()` mismatch: one uses a Python3 datatype name while the other an overloaded general term. >>>> Regarding the "core.publish_string()" function, I see three possibilities: >>>> Alternatives forward: >>>> 1. [Revert to Docutils 0.19 behaviour, with clearer documentation]. >>>> 2. Add a new boolean argument: "encode". >>>> 3. Deprecate "publish_string()" in favour of new, separate "publish_str()" and "publish_bytes()" functions. During the transition period, editors with name completion will show both `publish_str()` and `publish_string()` in the expansion list and coders will likely look up the docstring for the difference. >> If the documentation is clear about possible return values (and even >> more after adding type hints) users should be able to live with the >> unfortunate naming. > Unfortunatley as far as I am aware type hints are unable to code for a > setting within a dictionary affecting the return type of the function. > I agree the documentation should be made clearer. Type hints should support documenting the fact that a function accepts or returns any of a set of data types (e.g. "`int` or `str`", "`int` or `float`", or "`str` or `bytes`"). The details/conditions should be given in the docstring. >> OTOH, I see a use-case for a convenience function returning a `str` >> instance also in cases where an "intended encoding" of the output is >> given in the "output-encoding" setting. This way, a program using this >> function can export a HTML, XML or LaTeX with an encoding declaration >> as `str` instance, post-process and finally encode it before handing >> it to storage or a non-Python processor. > Yes, this is my general view too -- I see ``publish_string`` as a > function to be called from other Python programmes. >> If we are going to change the core API functionality regarding the >> convenience function(s) to publish the output as `str` or `bytes` >> instance, then we should: >> * do not start this in the middle of a major refactoring of the test suite >> (where it is hard to spot the changes in expected output from >> "cosmetic" changes in the test code). > I agree with this, in retrospect it was a poor choice. >> * do it in a "quasi static" manner: both, old and the new behaviour must >> be accessible over a sequence of two or more stable releases. > This of course makes sense. >> This means that if we want to introduce an explicit `publish_bytes()` >> convenience function, a corresponding `BytesOutput` class is >> appropriate. > OK, though it seems this is dependent on the outcome of the > ``publish_string`` decision. Maybe we can also directly support `str` as `destination_class` value in core.Publisher and publish_programmatically(). The whole problem looks like a candidate for one more enhancement proposal ;) Günter |