From: Adam T. <aat...@ou...> - 2022-11-13 19:49:52
Dear Günter,

>> -----
>>>>> The naming of the `core.publish_string()` API function

> I see your point. OTOH:
> * `output-encoding`__ is a *general* setting defined as
>   "The text encoding for output".
>   This raises the expectation that all Docutils output "has" the
>   specified encoding and makes the "most obvious" return value a bit less
>   obvious.
>
>   __ https://docutils.sourceforge.io/docs/user/config.html#output-encoding
> * The behaviour of `publish_string()` has been stable for many years.
>   Long-time users are familiar with it and expect it to remain stable.

There are 1283 projects published on PyPI that depend on Docutils. I have gone through each of them; 35 (2.7%) use the ``docutils.core.publish_string`` function. Of those:

* 8 set ``'output_encoding': 'unicode'``, so would be unaffected by the eventual change to return strings.
  (traitsui, odoo-tools, CodeChat, benchmarkstt, pyfda, pyretis, anna, resplendent)
* 14 use the "UTF-8" encoding and call ``.decode()`` straight after, so would have to refactor but want to use Python strings.
  (Orange-Canvas-Core, galaxy-util, vb2py, gluetool, madgui, pydoc-fork, ScopeSim, bluewhale-canvas-core, rstdoc, galaxy-lib, orange-canvas-core-ml, meditor, jarn.viewdoc, formiko)
* 4 are agnostic to the output type, as they pass the output of ``publish_string`` straight into ``BeautifulSoup()`` or ``xml.etree.ElementTree.fromstring``, both of which accept either bytes or str.
  (doc-warden, testimony, pyLanguagetool, turq)
* 3 use custom writers or the document tree, and don't use the returned output.
  (pydoctor, restview, fairy-slipper)
* 2 are broken by calling ``str()`` on a bytes instance without an ``encoding`` argument.
  (prettyqt, cornice_sphinx)
* 1 ignores the output and just uses the call to check that it doesn't raise any exceptions.
  (rstcheck-core)
* 3 expect ``bytes`` and could use the proposed ``publish_bytes()`` function.
  (awscli, bugrest, quorachallenge)

I am happy to work with the ~17 (14 + 3) projects that would be affected to help them refactor, should we agree on a way forwards. Out of interest, none used an ``output_encoding`` setting other than "unicode" or "utf-8".

I would be content to delay the switch-over of the return type of ``publish_string`` to Docutils 1.0 or 2.0 should more time be needed, but I see the other scenarios as sub-optimal for the long term in one way or another -- e.g. ``publish_str_instance`` is unwieldy to use regularly, and using e.g. ``publish_str`` instead would be confusing when ``publish_string`` still exists.

>>> Regarding the "core.publish_string()" function, I see three possibilities:
>>> Alternatives forward:
>>> 1. [Revert to Docutils 0.19 behaviour, with clearer documentation].
>>> 2. Add a new boolean argument: "encode".
>>> 3. Deprecate "publish_string()" in favour of new, separate
>>>    "publish_unicode_str()" and "publish_bytes()" functions.
>>> 4. [Revert] "publish_string()" [to Docutils 0.19 behaviour].
>>>    New function "publish_str_instance()", say.

> If the documentation is clear about possible return values (and even more after adding type hints) users should be able to live with the unfortunate naming.

Unfortunately, as far as I am aware, type hints cannot express that a setting within a dictionary affects the return type of the function. I agree the documentation should be made clearer.

> OTOH, I see a use-case for a convenience function returning a `str` instance also in cases where an "intended encoding" of the output is given in the "output-encoding" setting.
> This way, a program using this function can export a HTML, XML or LaTeX with an encoding declaration as `str` instance, post-process and finally encode it before handing it to storage or a non-Python processor.

Yes, this is my general view too -- I see ``publish_string`` as a function to be called from other Python programmes.

> If we are going to change the core API functionality regarding the convenience function(s) to publish the output as `str` or `bytes` instance, then we should:
> * do not start this in the middle of a major refactoring of the test suite
>   (where it is hard to spot the changes in expected output from
>   "cosmetic" changes in the test code).

I agree with this; in retrospect it was a poor choice.

> * do it in a "quasi static" manner: both, old and the new behaviour must
>   be accessible over a sequence of two or more stable releases.

This of course makes sense.

> This means that if we want to introduce an explicit `publish_bytes()` convenience function, a corresponding `BytesOutput` class is appropriate.

OK, though it seems this is dependent on the outcome of the ``publish_string`` decision.

-----

Thanks,
Adam
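
P.S. To illustrate the typing problem and the ``str()`` pitfall mentioned above, here is a minimal sketch. The functions ``publish_via_settings`` and ``publish_with_flag`` are hypothetical stand-ins written for this example, not Docutils APIs; they only mimic the "bytes unless ``output_encoding`` is 'unicode'" behaviour discussed in this thread. As far as I can tell, a return type that depends on a plain dictionary value ends up annotated as a union, whereas an explicit boolean argument (alternative 2) can be narrowed by a type checker via ``typing.overload``.

```python
from typing import Literal, Union, overload


def publish_via_settings(source: str, settings_overrides: dict) -> Union[str, bytes]:
    """HYPOTHETICAL stand-in: the return type hinges on a dictionary value."""
    encoding = settings_overrides.get("output_encoding", "utf-8")
    if encoding == "unicode":
        return source  # str, no encoding applied
    return source.encode(encoding)  # bytes


@overload
def publish_with_flag(source: str, encode: Literal[True] = ...,
                      encoding: str = ...) -> bytes: ...
@overload
def publish_with_flag(source: str, encode: Literal[False],
                      encoding: str = ...) -> str: ...
def publish_with_flag(source: str, encode: bool = True,
                      encoding: str = "utf-8") -> Union[str, bytes]:
    """HYPOTHETICAL stand-in for an explicit boolean 'encode' argument."""
    return source.encode(encoding) if encode else source


raw = publish_via_settings("Grüße", {"output_encoding": "utf-8"})
assert isinstance(raw, bytes)

# The pitfall: str() on bytes without an encoding gives the repr of the
# bytes object, not the decoded text.
wrong = str(raw)
right = raw.decode("utf-8")

# With the flag, a type checker can tell these two apart statically.
as_bytes = publish_with_flag("Grüße")
as_text = publish_with_flag("Grüße", encode=False)
```

Here ``wrong`` starts with ``b'`` while ``right`` is the original text, which is the kind of breakage I would expect in the two ``str()`` callers.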