Re: [Docutils-develop] Release 0.20

Dear Günter, all,

> I finished my work on the preparations towards Docutils 0.20.

> Please check and test.

I have tested on Ubuntu 22.04 LTS and Windows 10, and against Sphinx.
All tests are passing, though I would strongly recommend applying the
attached patch to avoid a false-positive warning during testing.

----

> Engelbert, can you prepare a release for next week?

Please may we release a 0.20b1 first so that I might ask downstream
projects to test, as with the 0.19 release? I am happy to help with
this, though I don't have the ability to upload to Docutils on PyPI
at the moment.

----

>>> I left the decision about the end state of this transition open...

>> A decision to make later, and one that doesn't block the 0.20
>> release!

> Yes and no: if we want to give users advice on a stable recipe to
> avoid being hit by the default-change, we would need agreement on
> what will (most likely) be kept stable.

> The API documentation "publisher.txt" now has the example

>     output = bytes(publish_string(...))

> (which depends on `OutputString` features).

Can we mark this feature as provisional? Personally, I don't think
that we should support this form of ``bytes`` conversion long-term,
and I see the ``OutputString`` as a transitional class, again not
one that will be around for a long time.

For me, the point of this exercise and deprecation process is to
reach an end-state where ``publish_string`` always returns ``str``.
Perhaps we should return to discussing ``publish_str`` and
``publish_bytes`` functions?
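
To make that option concrete, here is a minimal sketch of how a
dedicated pair of functions could look. The names ``publish_str`` and
``publish_bytes`` are only the ones floated above, ``_render`` is a
hypothetical stand-in for the real publisher, and none of this is
existing Docutils API:

```python
def _render(source: str) -> str:
    """Hypothetical stand-in for the publisher: always produces ``str``."""
    return f'<p>{source}</p>\n'


def publish_str(source: str) -> str:
    """Return the output document as text; no encoding is applied."""
    return _render(source)


def publish_bytes(source: str, output_encoding: str = 'utf-8') -> bytes:
    """Return the output document encoded with ``output_encoding``."""
    return _render(source).encode(output_encoding)


text = publish_str('Grüße')
data = publish_bytes('Grüße', output_encoding='latin1')
```

Each function then has a single return type, so callers never need to
inspect the result to learn what they received.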

To summarise the problem as I understand it:

* Some output formats may contain information about the encoding of
  the document

  - SGML-based markup languages (XML, HTML) may contain an internal
    encoding declaration.
  - TeX-based languages (LaTeX, XeLaTeX, etc.) may contain an internal
    encoding macro.

* All of these formats have default encodings

  - XML defaults to a UTF-8 encoding if the encoding attribute is not
    specified, since XML 1.0 (2008)
    https://www.w3.org/TR/xml/#charencoding
  - HTML 5 requires a UTF-8 charset
    https://html.spec.whatwg.org/#charset
  - LaTeX's default encoding is UTF-8, since 2018
    https://tug.org/TUGboat/tb39-1/tb121ltnews28.pdf
  - XeTeX I believe has always defaulted to UTF-8.

* If a user asks for output as a Unicode ``str``, I believe it is
  reasonable to assume these defaults (UTF-8 encoding).

* If a user asks for output as a Unicode ``str``, but overrides the
  ``output_encoding`` setting, I believe it is reasonable to assume
  that the user is now responsible for conversion of the ``str`` to
  ``bytes`` for serialisation to disk, and we should not support an
  output format that does this by 'magic'. We could declare this as
  unsupported behaviour as an alternative, and just issue an error.

* If a user asks for binary output (a ``bytes`` instance), I think it
  is reasonable to use ``output_encoding`` to encode the ``str``
  instance we use internally to a ``bytes`` instance.

* We therefore need to decide the following end-state positions:

  a) Do we want to support (long-term) outputting ``bytes`` from
     the core publish API?
	 
  b) Do we want to support (long-term) encodings other than UTF-8?
  
* If (a) is true, we should decide if it is through a dedicated
  function, or through an overloaded signature (the current status).
  You have previously argued for keeping the "core" interface as
  small as possible, and I would strongly advocate against overloaded
  return types, perhaps leading to us not supporting returning
  ``bytes`` from the core publish API.
  
  This may be a reasonable position: a user who knows that they want
  bytes output should set the output encoding explicitly anyway, and
  therefore has control over the encoding from ``str`` to ``bytes``,
  for example:
  
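  .. code:: python

     from docutils.core import publish_string

     source = 'Some *reStructuredText* input.'
     encoding = 'latin1'
     out_str = publish_string(
         source,
         settings_overrides={'output_encoding': encoding},
     )
     # Assumes the hypothetical end state where ``str`` is returned.
     assert isinstance(out_str, str)
     out_bytes = out_str.encode(encoding)

  This assumes a hypothetical future in which ``publish_string``
  always returns ``str`` instances.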

* If (b) is false, we could simplify the I/O code a great deal. I
  think it may be reasonable to expect the user to be responsible
  for encoding conversions, or to move Docutils' code to handle that
  away from the core and into the command-line interface, for example.
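
The reason the internal declaration matters can be shown with the
standard library alone. This is a minimal sketch, not Docutils code:
the XML strings are made up for illustration, and
``xml.etree.ElementTree`` stands in for any downstream consumer that
trusts the declaration:

```python
import xml.etree.ElementTree as ET

body = '<p>café</p>'

# No declaration: an XML parser assumes UTF-8, so UTF-8 bytes round-trip.
default_utf8 = ET.fromstring(body.encode('utf-8')).text

# Declaration and byte encoding agree: the text also round-trips.
declared = '<?xml version="1.0" encoding="iso-8859-1"?>' + body
matching = ET.fromstring(declared.encode('iso-8859-1')).text

# Declaration says Latin-1 but the bytes are UTF-8: the parser trusts the
# declaration and decodes the two-byte UTF-8 sequence as two characters.
mismatched = ET.fromstring(declared.encode('utf-8')).text
```

Whoever performs the final ``str`` to ``bytes`` conversion therefore
has to use the same encoding that the document declares.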
  
Sorry for the rather long message appended to a release thread, but
as you note, perhaps the decision cannot be delayed, as the
documentation contains a recipe that we may later regret declaring
support for.

----

Thanks,
Adam

----------

From 5031c0ff9923057a5a12a80551b67992dcb2b4df Mon Sep 17 00:00:00 2001
From: Adam Turner <908...@us...>
Date: Sun, 23 Apr 2023 16:50:15 +0100
Subject: [PATCH] Ignore ``CSVTable.HeaderDialect`` deprecation warning

---
 docutils/docutils/parsers/rst/directives/tables.py | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/docutils/docutils/parsers/rst/directives/tables.py b/docutils/docutils/parsers/rst/directives/tables.py
index 446034828..8212f2cfc 100644
--- a/docutils/docutils/parsers/rst/directives/tables.py
+++ b/docutils/docutils/parsers/rst/directives/tables.py
@@ -64,8 +64,11 @@ def process_header_option(self):
         table_head = []
         max_header_cols = 0
         if 'header' in self.options:   # separate table header in option
+            with warnings.catch_warnings():
+                warnings.simplefilter('ignore')
+                header_dialect = self.HeaderDialect()
             rows, max_header_cols = self.parse_csv_data_into_rows(
-                self.options['header'].split('\n'), self.HeaderDialect(),
+                self.options['header'].split('\n'), header_dialect,
                 source)
             table_head.extend(rows)
         return table_head, max_header_cols
-- 
2.40.0.windows.1