From: Guenter M. <mi...@us...> - 2022-11-16 22:50:50
Dear Adam, dear Docutils developers,

On 2022-11-13, Adam Turner wrote:

>> * `output-encoding`__ is a *general* setting defined as
>>   "The text encoding for output". This raises the expectation that all
>>   Docutils output "has" the specified encoding ...
>> * The behaviour of `publish_string()` has been stable for many years.
>>   Long-time users are familiar with it and expect it to remain stable.

> There are 1283 projects published on PyPI that depend on Docutils. I
> have gone through each of these projects; 35 (2.7%) use
> the ``docutils.core.publish_string`` function.

> Of those:

> * 8 set ``'output_encoding': 'unicode'``, so would be unaffected by the
>   eventual change to return strings. (traitsui, odoo-tools, CodeChat,
>   benchmarkstt, pyfda, pyretis, anna, resplendent)
> * 14 use the "UTF-8" encoding and call ``.decode()`` straight after, so
>   would have to refactor but want to use Python strings.
>   (Orange-Canvas-Core, galaxy-util, vb2py, gluetool, madgui, pydoc-fork,
>   ScopeSim, bluewhale-canvas-core, rstdoc, galaxy-lib,
>   orange-canvas-core-ml, meditor, jarn.viewdoc, formiko)
> * 4 are agnostic to output type, as they pass the output of
>   ``publish_string`` straight into ``BeautifulSoup()`` or
>   ``xml.etree.ElementTree.fromstring``, both of which accept either bytes
>   or str. (doc-warden, testimony, pyLanguagetool, turq)

Good news for Docutils: we are not alone. "xml.etree" also uses:

* "string" or "string constant" as a superordinate term for a sequence
  of characters, either encoded (`bytes`) or as Unicode code points
  (`str`).
* "unicode" as a pseudo-encoding name for "no encoding"
  (i.e. "return as `str` instance").

Maybe we can agree with the etree team on a compatible terminology and
way forward.

> * 3 use custom writers or the document tree, and don't use the returned
>   output (pydoctor, restview, fairy-slipper)
> * 2 are broken by calling ``str()`` on a bytes instance without an
>   ``encoding`` argument. (prettyqt, cornice_sphinx)
> * 1 ignores output and just uses the call to check it doesn't raise any
>   exceptions (rstcheck-core)
> * 3 expect ``bytes`` and could use the proposed ``publish_bytes()``
>   function (awscli, bugrest, quorachallenge)

One more, `pyreport`_, is currently unmaintained and Python2 only...

.. _pyreport: https://github.com/joblib/pyreport

There may be more use cases in unpublished packages/modules/scripts or
helper scripts in non-Python projects.

> I am happy to work with the ~17 (14 + 3) that would be affected to help
> them to refactor, should we agree on a way forwards.

I am quite confident that we will find a consensus.
I would still want to revert the FutureWarnings until there is a stable
alternative in place.

> Out of interest, none used an ``output_encoding`` setting other than
> "unicode" or "utf-8".

Did you also check the configuration files?
Docutils also defaults to "output_encoding: utf-8" but allows users to
change this to any valid encoding via settings_spec or
settings_overrides or in a configuration file.
(Just checked: publish_string() respects the "output_encoding" set in a
docutils.conf file.)

> I would be content to delay the switch-over of return type from
> ``publish_string`` to Docutils 1.0 or 2.0 should more time be needed,
> but I suppose I see the other scenarios as sub-optimal for the
> long-term in one way or another -- e.g. ``publish_str_instance`` is
> unwieldy to use regularly, and using e.g. ``publish_str`` instead would
> be confusing when ``publish_string`` still exists.

OTOH, `publish_string()` and `publish_bytes()` are a mismatch:
`publish_bytes()` uses a Python3 datatype name while `publish_string()`
uses an overloaded general term.

>>>> Regarding the "core.publish_string()" function, I see three possibilities:

>>>> Alternatives forward:

>>>> 1. [Revert to Docutils 0.19 behaviour, with clearer documentation].
>>>> 2. Add a new boolean argument: "encode".
>>>> 3. Deprecate "publish_string()" in favour of new, separate
>>>>    "publish_str()" and "publish_bytes()" functions.

During the transition period, editors with name completion will show
both `publish_str()` and `publish_string()` in the expansion list and
coders will likely look up the docstring for the difference.

>> If the documentation is clear about possible return values (and even
>> more after adding type hints) users should be able to live with the
>> unfortunate naming.

> Unfortunately, as far as I am aware, type hints are unable to code for a
> setting within a dictionary affecting the return type of the function.
> I agree the documentation should be made clearer.

Type hints should support documenting the fact that a function accepts
or returns any of a set of data types (e.g. "`int` or `str`", "`int` or
`float`", or "`str` or `bytes`"). The details/conditions should be given
in the docstring.

>> OTOH, I see a use-case for a convenience function returning a `str`
>> instance also in cases where an "intended encoding" of the output is
>> given in the "output-encoding" setting. This way, a program using this
>> function can export an HTML, XML or LaTeX document with an encoding
>> declaration as `str` instance, post-process and finally encode it
>> before handing it to storage or a non-Python processor.

> Yes, this is my general view too -- I see ``publish_string`` as a
> function to be called from other Python programmes.

>> If we are going to change the core API functionality regarding the
>> convenience function(s) to publish the output as `str` or `bytes`
>> instance, then we should:

>> * not start this in the middle of a major refactoring of the test suite
>>   (where it is hard to spot the changes in expected output from
>>   "cosmetic" changes in the test code).

> I agree with this, in retrospect it was a poor choice.

>> * do it in a "quasi static" manner: both the old and the new behaviour
>>   must be accessible over a sequence of two or more stable releases.

> This of course makes sense.

>> This means that if we want to introduce an explicit `publish_bytes()`
>> convenience function, a corresponding `BytesOutput` class is
>> appropriate.

> OK, though it seems this is dependent on the outcome of the
> ``publish_string`` decision.

Maybe we can also directly support `str` as `destination_class` value in
core.Publisher and publish_programmatically().

The whole problem looks like a candidate for one more enhancement
proposal ;)

Günter
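
As a rough illustration of alternative 3 above, the split into two
convenience functions could look like the sketch below. This is not
Docutils code: `publish_string_like()` is a hypothetical stand-in for
the current behaviour (the return type depends on the `output_encoding`
value), the "conversion" is a placeholder, and the signatures of
`publish_str()` and `publish_bytes()` are assumptions. It also shows the
"`str` or `bytes`" return annotation discussed above.

```python
from typing import Union


def publish_string_like(text: str,
                        output_encoding: str = "utf-8") -> Union[str, bytes]:
    """Hypothetical stand-in: the return type depends on a setting.

    Return `str` if `output_encoding` is the pseudo-encoding "unicode",
    otherwise `bytes` encoded with `output_encoding`.
    """
    output = f"<p>{text}</p>"  # placeholder for the real conversion
    if output_encoding.lower() == "unicode":
        return output
    return output.encode(output_encoding)


def publish_str(text: str) -> str:
    """Assumed convenience wrapper: always return a `str` instance."""
    result = publish_string_like(text, output_encoding="unicode")
    assert isinstance(result, str)
    return result


def publish_bytes(text: str, output_encoding: str = "utf-8") -> bytes:
    """Assumed convenience wrapper: always return a `bytes` instance."""
    if output_encoding.lower() == "unicode":
        raise ValueError('publish_bytes() needs a real encoding, not "unicode"')
    result = publish_string_like(text, output_encoding=output_encoding)
    assert isinstance(result, bytes)
    return result
```

With such a split, each wrapper has a fixed return type, while the old
function can keep its setting-dependent return value for the
transition period.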