From: Guenter M. <mi...@us...> - 2023-04-24 16:10:14
Dear Docutils developers,

On 2023-04-23, Adam Turner wrote:

...

> For me, the point of this exercise and deprecation process is to
> reach an end-state where ``publish_string`` always returns ``str``.

This would be a clean and simple end state.
It is problematic for ``output_encoding != 'utf-8'`` and/or
``output_encoding_error_handler != 'strict'``.

> To summarise the problem as I understand it:
>
> * Some output formats may contain information about the encoding of
>   the document
>
>   - SGML based markup languages (XML, HTML) may contain an internal
>     encoding declaration.
>   - TeX based languages (LaTeX, XeLaTeX, etc) may contain an internal
>     encoding macro.
>
> * All of these formats have default encodings
>
>   - XML defaults to a UTF-8 encoding if the encoding attribute is not
>     specified, since XML 1.0 (2008)
>     https://www.w3.org/TR/xml/#charencoding
>   - HTML 5 requires a UTF-8 charset
>     https://html.spec.whatwg.org/#charset
>   - LaTeX's default encoding is UTF-8, since 2018
>     https://tug.org/TUGboat/tb39-1/tb121ltnews28.pdf
>   - XeTeX I believe has always defaulted to UTF-8.

* (g|n|t)roff (used for man pages) has no default encoding and (AFAIK)
  no universal syntax for an encoding declaration in the source.
  groff has no built-in support for UTF-8.
  https://www.gnu.org/software/groff/manual/groff.html#Input-Encodings
  There is a pre-processor for UTF-8 encoded sources.
  https://stackoverflow.com/questions/23138930/text-codepage-in-groff
  https://stackoverflow.com/questions/52732988/nroff-groff-does-not-properly-convert-utf-8-encoded-file

* ODT and epub are binary formats without a universal "natural"
  representation as `str` (the output may include bitmap graphics).

* The "output_encoding" setting also decides between literal characters
  and a macro representation for several non-ASCII characters in LaTeX,
  e.g. ``\dag{}`` for the footnote mark † (0x2020).

* For XML and HTML, the "output_encoding_error_handler" setting may
  make or break the output in case of non-encodable characters.
  (With "xmlcharrefreplace", unencodable characters can be used in
  XML/HTML output. The unencodable characters are still present in the
  `str` representation of the output document.)

> * If a user asks for output as a Unicode ``str``, I believe it is
>   reasonable to assume these defaults (UTF-8 encoding).
>
> * If a user asks for output as a Unicode ``str``, but overrides the
>   ``output_encoding`` setting, I believe it is reasonable to assume
>   that the user is now responsible for conversion of the ``str`` to
>   ``bytes`` for serialisation to disk, and we should not support an
>   output format that does this by 'magic'. We could declare this as
>   unsupported behaviour as an alternative, and just issue an error.

Users of applications that utilise the Docutils publisher API may be
unaware of the internals (whether the application calls
``publish_string()`` or uses another part of the API).
Currently, an end user of such an application can customise the output
encoding and error handling in a "docutils.conf" configuration file
(unless this is explicitly forbidden by the application).
Changing the behaviour of the "string I/O" interface should not
silently start ignoring these configuration settings.
Application developers should be made aware of this change before it
bites their downstream users (e.g. in the docstring of the new
function, the "future changes" announcement, and the API docs).
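To illustrate the kind of customisation at stake: a "docutils.conf"
like the following (the setting names are real, the values are just
examples) can currently change what ``publish_string()`` returns
without any change to the application's code::

    # docutils.conf (found via the standard configuration file search)
    [general]
    # re-encode the output; represent unencodable characters as
    # XML character references
    output_encoding: latin-1
    output_encoding_error_handler: xmlcharrefreplace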
> * If a user asks for binary output (a ``bytes`` instance), I think it
>   is reasonable to use ``output_encoding`` and
>   ``output_encoding_error_handler`` to encode the ``str`` instance we
>   use internally to a ``bytes`` instance.
>
> * We therefore need to decide the following end-state positions:
>
>   a) Do we want to support (long-term) outputting ``bytes`` from
>      the core publish API?

I agree with not returning ``bytes`` from a "String I/O" interface.
(The core publish API also provides two functions with "File I/O"
interfaces, as well as publish_parts() and publish_doctree() with
alternative interfaces.)

>   b) Do we want to support (long-term) encodings other than UTF-8?

At least for a medium-term time-frame, I'd keep support for other
encodings (not necessarily for the "String I/O" interface).

> * If (a) is true, we should decide if it is through a dedicated
>   function, or through an overloaded signature (the current status)

... or through publish_parts() (see below).

> You have previously argued for keeping the "core" interface as
> small as possible, and I would strongly advocate against overloaded
> return types, perhaps leading to us not supporting returning
> ``bytes`` from the core publish API.
>
> This may be a reasonable position, as if a user knows that he wants
> bytes output, he should set the output encoding explicitly anyway,
> and therefore he has control over the encoding from ``str`` to
> ``bytes`` as he can e.g. do:
>
> .. code:: python
>
>    encoding = 'latin1'
>    out_str = publish_string(source,
>                  settings_overrides={'output_encoding': encoding}
>                  )
>    assert isinstance(out_str, str)
>    out_bytes = out_str.encode(encoding)
>
> in a hypothetical future where ``publish_string`` always returns
> ``str`` instances.

There are problems with this approach:

The "settings_overrides" dictionary only overrides the "Docutils
defaults" with "programmatic defaults". A different value in a
configuration file would still override this programmatic default.
Applications can disable configuration file parsing, but not for
individual settings. To keep configurability, the application would
need to parse the configuration settings on its own and call
publish_string() with a ``settings`` object:
``publish_string(source, settings=settings, …)``.

An application developer does not need to be the end user (and hence
may not know the desired output encoding); e.g., a 3rd-party Docutils
extension application may want to provide a file I/O interface but do
some post-processing on the document returned from the writer.

However, see the alternative "bytes-output recipe" below.

> * If (b) is false, we could simplify the I/O code a great deal. I
>   think it may be reasonable to expect the user to be responsible
>   for encoding conversions, or to move Docutils' code to handle that
>   away from the core and into the command-line interface, for example.

At least the "File I/O" interface (which is part of the core API)
should, IMO, support a configurable output encoding for the next couple
of versions/years. The command line interface (`core.publish_cmdline()`)
is part of the core API, too.


Proposal
========

Keep it simple:

* Replace `publish_string()` with a new function `publish_str()` that
  returns a `str` instance and raises an error

  - for binary writer output (e.g. ODT writer),
  - if 'output_encoding' is not in ("utf-8", "")

  (a rough sketch follows after this list).

* Accordingly, replace `io.StringOutput` with a new `io.StrOutput`
  class.

* Implement `publish_str()` and `StrOutput` in Docutils 0.21 to give
  them proper testing and time for implementation details to settle
  while getting the bugfixes out now.

* Think about the future of `publish_from_doctree()`.
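To make the intended behaviour concrete, here is a rough sketch of how
such a function could behave. `publish_str()` does not exist yet; the
detour via `publish_parts()`, the exception types, and the default
writer are only illustrative, not part of the proposal.

.. code:: python

    from docutils import core

    def publish_str(source, writer_name='html5', **kwargs):
        """Render `source` and return the output document as `str`.

        Sketch only: reject binary writer output and output encodings
        other than "utf-8" (or the empty string).
        """
        parts = core.publish_parts(source, writer_name=writer_name,
                                   **kwargs)
        if parts['encoding'].lower() not in ('utf-8', 'utf8', ''):
            raise ValueError(
                f'output_encoding "{parts["encoding"]}" is not supported'
                ' by publish_str(); use the file I/O interface instead.')
        whole = parts['whole']
        if not isinstance(whole, str):  # e.g. the ODT writer
            raise TypeError(f'writer "{writer_name}" produces binary '
                            'output; use publish_file() instead.')
        return whole

The real implementation would of course live in `docutils.core` and
use the proposed `io.StrOutput` class instead of going through
`publish_parts()`.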
Rationale:

* The different behaviour of the new string I/O interface merits a new
  function name. Application developers using the string I/O API will
  have to change their code anyway. Applications will at some stage
  break with the old function name (hopefully caught by their
  developers' tests), not only with certain configuration values
  (which may be easily overlooked by developers).

* The confusing co-existence of `publish_str()` vs. `publish_string()`
  is temporary and moderated by the deprecation warning that comes with
  the use of `publish_string()`.


Steps towards 0.20
==================

* Revert the introduction of the "OutString" class.

* Revert the addition of the "auto_encode" attribute.

* Add ['errors'] to the `parts provided by all writers`__.

  __ https://docutils.sourceforge.io/docs/api/publisher.html#parts-provided-by-all-writers

* Mark `core.publish_string()` and `io.StringOutput` as deprecated.
  (This includes deprecation of the special pseudo-encoding value
  "unicode".)

* Document upcoming changes:

  - There will be a new "string I/O interface" in 0.21.

  - The already working and future-proof way to get `str` output is ::

        out_str = publish_parts(...)['whole']
        assert isinstance(out_str, str)  # ODT writer returns `bytes`

    This approach ignores the "output_encoding" and
    "output_encoding_error_handler" settings.

  - The future-proof and configuration-proof way to get `bytes` output
    is ::

        parts = publish_parts(...)
        out = parts['whole']
        if isinstance(out, str):
            out_bytes = out.encode(parts['encoding'], parts['errors'])

    Alternatively, the return value of `publish_file()` (with a "dummy"
    file object) can be used.

Would this be a sensible way forward?

Günter