[Docutils-checkins] SF.net SVN: docutils:[9691] trunk/docutils

Revision: 9691
          http://sourceforge.net/p/docutils/code/9691
Author:   milde
Date:     2024-05-07 11:24:22 +0000 (Tue, 07 May 2024)
Log Message:
-----------
Doctree validation: new functions to validate Element attribute values.

Attribute validating functions:

* convert string representations to correct data type,
* normalize values,
* raise ValueError for invalid attribute names or values.
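
The three behaviours listed above can be sketched without Docutils installed. This is a minimal illustration only: `sketch_measure`, `sketch_yesorno`, `SKETCH_VALIDATORS`, and `sketch_normalize` are hypothetical names modelled on the functions in the diff below, and the mapping is reduced to three attributes.

```python
import re

# Hypothetical stand-ins for the validating functions added in this commit;
# the names and the reduced mapping are illustrative assumptions.
UNITS = ('em', 'ex', 'px', 'in', 'cm', 'mm', 'pt', 'pc', '%')


def sketch_measure(value):
    """Return `value` without spaces, or raise ValueError."""
    units = '|'.join(re.escape(unit) for unit in UNITS)
    if not re.fullmatch(rf'[-0-9.]+ *(?:{units})?', value):
        raise ValueError(f'"{value}" is no valid measure.')
    return value.replace(' ', '')


def sketch_yesorno(value):
    """Return False for the string "0", else the truth value of `value`."""
    return False if value == '0' else bool(value)


SKETCH_VALIDATORS = {'width': sketch_measure,
                     'colsep': sketch_yesorno,
                     'level': int}


def sketch_normalize(attributes):
    """Return a normalized copy of `attributes`; report all problems at once."""
    normalized, messages = {}, []
    for key, value in attributes.items():
        if key not in SKETCH_VALIDATORS:
            messages.append(f'Attribute "{key}" is not known.')
            continue
        try:
            # convert to the expected data type and normalize the value
            normalized[key] = SKETCH_VALIDATORS[key](value)
        except (ValueError, TypeError) as err:
            messages.append(
                f'Attribute "{key}" has invalid value "{value}".\n{err}')
    if messages:
        raise ValueError('\n'.join(messages))
    return normalized
```

Collecting the messages before raising means that one call reports every invalid attribute of an element instead of stopping at the first.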

The `nodes.Element.validate()` function reports a warning
for validity problems if `self.document.reporter` is available
and raises a ValueError if not.
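
The "warn if a reporter is available, else raise" behaviour can be sketched in isolation. The classes `SketchReporter`, `SketchDocument`, and `SketchElement` are hypothetical stand-ins, not Docutils classes; only the try/except structure follows the diff below.

```python
class SketchReporter:
    """Hypothetical reporter that just stores warnings."""

    def __init__(self):
        self.warnings = []

    def warning(self, message):
        self.warnings.append(message)


class SketchDocument:
    """Hypothetical document owning a reporter."""

    def __init__(self):
        self.reporter = SketchReporter()


class SketchElement:
    """Hypothetical element with a pre-computed list of problems."""

    def __init__(self, tagname, document=None, problems=()):
        self.tagname = tagname
        self.document = document
        self.problems = list(problems)

    def validate(self):
        if not self.problems:
            return
        msg = (f'Element <{self.tagname}> invalid:\n'
               + '\n'.join(self.problems))
        try:
            # works only if the element is attached to a document
            self.document.reporter.warning(msg)
        except AttributeError:
            # no document or no reporter: fall back to an exception
            raise ValueError(msg)
```

An element attached to a document records a warning and processing continues; a detached element (`document=None`) raises `ValueError` with the same message.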

Testing revealed problems with the "recommonmark_wrapper" parser:

* Validating should be done *after* the "clean up" operations.

* One test case uses an invalid class argument (underscore not allowed
  by Docutils). As this sample tests an "only Sphinx" feature,
  we just drop it from the Docutils test suite.
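
Why a class argument such as `eval_rst` fails validation can be illustrated with a rough approximation. The commit itself compares the value with `make_id(value)`; the sketch below instead assumes the pattern `[a-z](-?[a-z0-9]+)*` that the Docutils documentation gives for identifiers, and the helper names are hypothetical.

```python
import re

# Assumption: approximates the Docutils identifier rule with a regular
# expression; the commit uses `value != make_id(value)` instead.
IDENTIFIER = re.compile(r'[a-z](-?[a-z0-9]+)*')


def sketch_is_identifier(value):
    """Return True if `value` looks like a valid id or class name."""
    return IDENTIFIER.fullmatch(value) is not None


def sketch_invalid_classes(classes):
    """Return the class names that would be rejected."""
    return [name for name in classes if not sketch_is_identifier(name)]
```

With this approximation, `sketch_invalid_classes(['code', 'eval_rst'])` flags only `eval_rst`, because the underscore is not part of the allowed character set.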

Modified Paths:
--------------
    trunk/docutils/HISTORY.txt
    trunk/docutils/docs/ref/doctree.txt
    trunk/docutils/docutils/nodes.py
    trunk/docutils/docutils/parsers/recommonmark_wrapper.py
    trunk/docutils/test/test_nodes.py
    trunk/docutils/test/test_parsers/test_recommonmark/test_literal_blocks.py

Modified: trunk/docutils/HISTORY.txt
===================================================================

--- trunk/docutils/HISTORY.txt	2024-05-06 12:41:07 UTC (rev 9690)
+++ trunk/docutils/HISTORY.txt	2024-05-07 11:24:22 UTC (rev 9691)
@@ -27,15 +27,24 @@
   - New `SubStructural` element category class.
   - Fix element categories.
   - New method `Element.validate()` (work in progress).
+  - New "attribute validating functions"
+    convert string representations to correct data type,
+    normalize values,
+    raise ValueError for invalid attribute names or values.
 
+* docutils/parsers/recommonmark_wrapper.py
+
+  - New method `Parser.finish_parse()` to clean up (before validating).
+
 * docutils/transforms/frontmatter.py
 
-  - Adapt `DocInfo` to fixed element categories.
+  - Update `DocInfo` to work with corrected element categories.
 
 * docutils/writers/manpage.py
 
   - Remove code for unused emdash bullets.
 
+
 Release 0.21.2 (2024-04-23)
 ===========================
 

Modified: trunk/docutils/docs/ref/doctree.txt
===================================================================
--- trunk/docutils/docs/ref/doctree.txt	2024-05-06 12:41:07 UTC (rev 9690)
+++ trunk/docutils/docs/ref/doctree.txt	2024-05-07 11:24:22 UTC (rev 9691)
@@ -2587,10 +2587,10 @@
 
 :Parents:    Only the `\<document>`_ element contains <meta>.
 :Children:   The <meta> element has no content.
-:Attributes: The <meta> element contains the attributes *name*,
-             *content*, *http-equiv*, *lang*, *dir*, *media*, and
-             *scheme* that correspond to the respective attributes
-             of the `HTML <meta> element`_.
+:Attributes: The <meta> element contains the attributes
+             *content*, *dir*, *http-equiv*, *lang*, *media*, *name*, and
+             *scheme* that correspond to the respective attributes of the
+             `HTML <meta> element`_.
 
 See also the `\<docinfo>`_ element for displayed meta-data.
 The document's `title attribute`_ stores the metadata document title.
@@ -4630,7 +4630,7 @@
 elements but typically only used on the `root element`_.
 
 .. note:: All ``docutils.nodes.Node`` instances also support an
-   **internal** ``source`` attribute that is used when reporting
+   *internal* ``source`` attribute that is used when reporting
    processing problems.
 
 

Modified: trunk/docutils/docutils/nodes.py
===================================================================
--- trunk/docutils/docutils/nodes.py	2024-05-06 12:41:07 UTC (rev 9690)
+++ trunk/docutils/docutils/nodes.py	2024-05-07 11:24:22 UTC (rev 9691)
@@ -567,6 +567,8 @@
             if value is None:           # boolean attribute
                 parts.append('%s="True"' % name)
                 continue
+            if isinstance(value, bool):
+                value = str(int(value))
             if isinstance(value, list):
                 values = [serial_escape('%s' % (v,)) for v in value]
                 value = ' '.join(values)
@@ -1093,22 +1095,48 @@
         return attr not in cls.common_attributes
 
     def validate_attributes(self):
-        # check for undeclared attributes
-        # TODO: check attribute values
+        """Normalize and validate element attributes.
+
+        Convert string values to expected datatype.
+        Normalize values.
+
+        Raise `ValueError` for invalid attributes or attribute values.
+
+        Provisional.
+        """
+        messages = []
         for key, value in self.attributes.items():
             if key.startswith('internal:'):
                 continue  # see docs/user/config.html#expose-internals
             if key not in self.valid_attributes:
-                raise ValueError(
-                    f'Element <{self.tagname}> has invalid attribute "{key}".')
+                va = ' '.join(self.valid_attributes)
+                messages.append(f'Attribute "{key}" not one of "{va}".')
+                continue
+            try:
+                self.attributes[key] = ATTRIBUTE_VALIDATORS[key](value)
+            except (ValueError, TypeError, KeyError) as e:
+                messages.append(
+                    f'Attribute "{key}" has invalid value "{value}".\n'
+                    + e.args[0])  # message argument
+        if messages:
+            raise ValueError('\n'.join(messages))
 
     def validate(self):
-        # print(f'validating', self.tagname)
-        self.validate_attributes()
+        messages = []
+        try:
+            self.validate_attributes()
+        except ValueError as e:
+            messages.append(e.args[0])  # the message argument
         # TODO: check number of children
         for child in self.children:
             # TODO: check whether child has allowed type
             child.validate()
+        if messages:
+            msg = f'Element <{self.tagname}> invalid:\n' + '\n'.join(messages)
+            try:
+                self.document.reporter.warning(msg)
+            except AttributeError:
+                raise ValueError(msg)
 
 
 # ========
@@ -2443,6 +2471,229 @@
     return value.replace('\\', r'\\').replace(' ', r'\ ')
 
 
+def split_name_list(s):
+    r"""Split a string at non-escaped whitespace.
+
+    Backslashes escape internal whitespace (cf. `serial_escape()`).
+    Return list of "names" (after removing escaping backslashes).
+
+    >>> split_name_list(r'a\ n\ame two\\ n\\ames')
+    ['a name', 'two\\', 'n\\ames']
+
+    Provisional.
+    """
+    s = s.replace('\\', '\x00')         # escape with NULL char
+    s = s.replace('\x00\x00', '\\')     # unescape backslashes
+    s = s.replace('\x00 ', '\x00\x00')  # escaped spaces -> NULL NULL
+    names = s.split(' ')
+    # restore internal spaces, drop other escaping characters
+    return [name.replace('\x00\x00', ' ').replace('\x00', '')
+            for name in names]
+
+
 def pseudo_quoteattr(value):
     """Quote attributes for pseudo-xml"""
     return '"%s"' % value
+
+
+# Functions to validate `Element attribute`__ values.
+
+# Ensure the expected Python `data type`__, normalize, and check for
+# restrictions.
+#
+# The functions can be used to convert `str` values (e.g. from an XML
+# representation) or to validate an existing document tree or node.
+#
+# Cf. `Element.validate_attributes()`, `docutils.parsers.docutils_xml`,
+# and the `ATTRIBUTE_VALIDATORS` mapping below.
+#
+# __ https://docutils.sourceforge.io/docs/ref/doctree.html#attribute-reference
+# __ https://docutils.sourceforge.io/docs/ref/doctree.html#attribute-types
+
+def validate_enumerated_type(*keywords):
+    """
+    Return a function that validates a `str` against given `keywords`.
+
+    Provisional.
+    """
+    def validate_keywords(value):
+        if value not in keywords:
+            allowed = '", \"'.join(keywords)
+            raise ValueError(f'"{value}" is not one of "{allowed}".')
+        return value
+    return validate_keywords
+
+
+def validate_identifier(value):
+    """
+    Validate identifier key or class name.
+
+    Used in `idref.type`__ and for the tokens in `validate_identifier_list()`.
+
+    __ https://docutils.sourceforge.io/docs/ref/doctree.html#idref-type
+
+    Provisional.
+    """
+    if value != make_id(value):
+        raise ValueError(f'"{value}" is no valid id or class name.')
+    return value
+
+
+def validate_identifier_list(value):
+    """
+    A (space-separated) list of ids or class names.
+
+    `value` may be a `list` or a `str` with space separated
+    ids or class names (cf. `validate_identifier()`).
+
+    Used in `classnames.type`__, `ids.type`__, and `idrefs.type`__.
+
+    __ https://docutils.sourceforge.io/docs/ref/doctree.html#classnames-type
+    __ https://docutils.sourceforge.io/docs/ref/doctree.html#ids-type
+    __ https://docutils.sourceforge.io/docs/ref/doctree.html#idrefs-type
+
+    Provisional.
+    """
+    if isinstance(value, str):
+        value = value.split()
+    for token in value:
+        validate_identifier(token)
+    return value
+
+
+def validate_measure(value):
+    """
+    Validate a length measure__ (number + recognized unit).
+
+    __ https://docutils.sourceforge.io/docs/ref/doctree.html#measure
+
+    Provisional.
+    """
+    units = 'em|ex|px|in|cm|mm|pt|pc|%'
+    if not re.fullmatch(f'[-0-9.]+ *({units})?', value):
+        raise ValueError(f'"{value}" is no valid measure. '
+                         f'Valid units: {units.replace("|", " ")}.')
+    return value.replace(' ', '').strip()
+
+
+def validate_NMTOKEN(value):
+    """
+    Validate a "name token": a `str` of letters, digits, and [-._].
+
+    Provisional.
+    """
+    if not re.fullmatch('[-._A-Za-z0-9]+', value):
+        raise ValueError(f'"{value}" is no NMTOKEN.')
+    return value
+
+
+def validate_NMTOKENS(value):
+    """
+    Validate a list of "name tokens".
+
+    Provisional.
+    """
+    if isinstance(value, str):
+        value = value.split()
+    for token in value:
+        validate_NMTOKEN(token)
+    return value
+
+
+def validate_refname_list(value):
+    """
+    Validate a list of `reference names`__.
+
+    Reference names may contain all characters;
+    whitespace is normalized (cf. `whitespace_normalize_name()`).
+
+    `value` may be either a `list` of names or a `str` with
+    space separated names (with internal spaces backslash escaped
+    and literal backslashes doubled, cf. `serial_escape()`).
+
+    Return a list of whitespace-normalized, unescaped reference names.
+
+    Provisional.
+
+    __ https://docutils.sourceforge.io/docs/ref/doctree.html#reference-name
+    """
+    if isinstance(value, str):
+        value = split_name_list(value)
+    return [whitespace_normalize_name(name) for name in value]
+
+
+def validate_yesorno(value):
+    if value == "0":
+        return False
+    return bool(value)
+
+
+ATTRIBUTE_VALIDATORS = {
+    'alt': str,  # CDATA
+    'align': str,
+    'anonymous': validate_yesorno,
+    'auto': str,  # CDATA (only '1' or '*' are used in rST)
+    'backrefs': validate_identifier_list,
+    'bullet': str,  # CDATA (only '-', '+', or '*' are used in rST)
+    'classes': validate_identifier_list,
+    'char': str,  # from Exchange Table Model (CALS), currently ignored
+    'charoff': validate_NMTOKEN,  # from CALS, currently ignored
+    'colname': validate_NMTOKEN,  # from CALS, currently ignored
+    'colnum': int,  # from CALS, currently ignored
+    'cols': int,  # from CALS: "NMTOKEN, […] must be an integer > 0".
+    'colsep': validate_yesorno,
+    'colwidth': int,  # sic! CALS: CDATA (measure or number+'*')
+    'content': str,  # <meta>
+    'delimiter': str,
+    'depth': int,
+    'dir': validate_enumerated_type('ltr', 'rtl', 'auto'),  # <meta>
+    'dupnames': validate_refname_list,
+    'enumtype': validate_enumerated_type('arabic', 'loweralpha', 'lowerroman',
+                                         'upperalpha', 'upperroman'),
+    'format': str,  # CDATA (space separated format names)
+    'frame': validate_enumerated_type('top', 'bottom', 'topbot', 'all',
+                                      'sides', 'none'),  # from CALS, ignored
+    'height': validate_measure,
+    'http-equiv': str,  # <meta>
+    'ids': validate_identifier_list,
+    'lang': str,  # <meta>
+    'level': int,
+    'line': int,
+    'local': validate_yesorno,
+    'ltrim': validate_yesorno,
+    'loading': validate_enumerated_type('embed', 'link', 'lazy'),
+    'media': str,  # <meta>
+    'morecols': int,
+    'morerows': int,
+    'name': whitespace_normalize_name,  # in <reference> (deprecated)
+    # 'name': node_attributes.validate_NMTOKEN,  # in <meta>
+    'names': validate_refname_list,
+    'namest': validate_NMTOKEN,  # start of span, from CALS, currently ignored
+    'nameend': validate_NMTOKEN,  # end of span, from CALS, currently ignored
+    'pgwide': validate_yesorno,  # from CALS, currently ignored
+    'prefix': str,
+    'refid': validate_identifier,
+    'refname': whitespace_normalize_name,
+    'refuri': str,
+    'rowsep': validate_yesorno,
+    'rtrim': validate_yesorno,
+    'scale': int,
+    'scheme': str,
+    'source': str,
+    'start': int,
+    'stub': validate_yesorno,
+    'suffix': str,
+    'title': str,
+    'type': validate_NMTOKEN,
+    'uri': str,
+    'valign': validate_enumerated_type('top', 'middle', 'bottom'),  # from CALS
+    'width': validate_measure,
+    'xml:space': validate_enumerated_type('default', 'preserve'),
+    }
+"""
+Mapping of `attribute names`__ to validating functions.
+
+Provisional.
+
+__ https://docutils.sourceforge.io/docs/ref/doctree.html#attribute-reference
+"""

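The escaping scheme used by `serial_escape()` and `split_name_list()` in the nodes.py diff above can be checked as a round trip. The two functions below are copied from the diff so that the sketch runs without Docutils; `sketch_join_names` is a hypothetical helper added for illustration.

```python
def serial_escape(value):
    """Double backslashes and escape spaces with a backslash."""
    return value.replace('\\', r'\\').replace(' ', r'\ ')


def split_name_list(s):
    """Split `s` at non-escaped spaces and remove the escaping."""
    s = s.replace('\\', '\x00')         # mark every backslash with NULL
    s = s.replace('\x00\x00', '\\')     # doubled backslash -> literal one
    s = s.replace('\x00 ', '\x00\x00')  # escaped space -> NULL NULL
    names = s.split(' ')                # only unescaped spaces remain
    # restore internal spaces, drop remaining escape markers
    return [name.replace('\x00\x00', ' ').replace('\x00', '')
            for name in names]


def sketch_join_names(names):
    """Hypothetical helper: serialize a list of names to one string."""
    return ' '.join(serial_escape(name) for name in names)
```

Splitting the serialized string should give back the original list, including names with internal spaces or literal backslashes.
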
Modified: trunk/docutils/docutils/parsers/recommonmark_wrapper.py
===================================================================
--- trunk/docutils/docutils/parsers/recommonmark_wrapper.py	2024-05-06 12:41:07 UTC (rev 9690)
+++ trunk/docutils/docutils/parsers/recommonmark_wrapper.py	2024-05-07 11:24:22 UTC (rev 9691)
@@ -75,7 +75,9 @@
         return Component.get_transforms(self)  # + [AutoStructify]
 
     def parse(self, inputstring, document):
-        """Use the upstream parser and clean up afterwards.
+        """Wrapper of upstream method.
+
+        Ensure "line-length-limit". Report errors with `document.reporter`.
         """
         # check for exorbitantly long lines
         for i, line in enumerate(inputstring.split('\n')):
@@ -95,9 +97,14 @@
                                             'returned the error:\n%s'%err)
             document.append(error)
 
-        # Post-Processing
-        # ---------------
+    # Post-Processing
+    # ---------------
 
+    def finish_parse(self):
+        """Finalize parse details.  Call at end of `self.parse()`."""
+
+        document = self.document
+
         # merge adjoining Text nodes:
         for node in document.findall(nodes.TextElement):
             children = node.children
@@ -142,6 +149,8 @@
                 reference['name'] = nodes.fully_normalize_name(
                                                     reference.astext())
             node.parent.replace(node, reference)
+        # now we are ready to call the upstream function:
+        super().finish_parse()
 
     def visit_document(self, node):
         """Dummy function to prevent spurious warnings.

Modified: trunk/docutils/test/test_nodes.py
===================================================================
--- trunk/docutils/test/test_nodes.py	2024-05-06 12:41:07 UTC (rev 9690)
+++ trunk/docutils/test/test_nodes.py	2024-05-07 11:24:22 UTC (rev 9691)
@@ -474,14 +474,33 @@
         node.append(nodes.emphasis('', 'emphasised text', ids='emphtext'))
         node.validate()
 
+    def test_validate_attributes(self):
+        # Convert to expected data-type, normalize values,
+        # cf. AttributeTypeTests below for attribute validating function tests.
+        node = nodes.image(classes='my  test-classes',
+                           names='My teST\n\\ \xA0classes',
+                           width='30 mm')
+        node.validate_attributes()
+        self.assertEqual(node['classes'], ['my', 'test-classes'])
+        self.assertEqual(node['names'], ['My', 'teST classes'])
+        self.assertEqual(node['width'], '30mm')
+
     def test_validate_wrong_attribute(self):
         node = nodes.paragraph('', 'text', id='test-paragraph')
         with self.assertRaisesRegex(ValueError,
-                                     'Element <paragraph> '
-                                     'has invalid attribute "id".'):
+                                    'Element <paragraph> invalid:\n'
+                                    'Attribute "id" not one of "ids '):
             node.validate()
 
+    def test_validate_wrong_attribute_value(self):
+        node = nodes.image(uri='test.png', width='20 inch')  # invalid unit
+        with self.assertRaisesRegex(ValueError,
+                                    'Element <image> invalid:\n'
+                                    '.*"width" has invalid value "20 inch".\n'
+                                    '.*Valid units: em ex '):
+            node.validate()
 
+
 class MiscTests(unittest.TestCase):
 
     def test_node_class_names(self):
@@ -807,6 +826,102 @@
             result = nodes.fully_normalize_name(sample)
             self.assertEqual(result, fully)
 
+    def test_split_name_list(self):
+        self.assertEqual(nodes.split_name_list(r'a\ n\ame two\\ n\\ames'),
+                         ['a name', 'two\\', r'n\ames'])
 
+
+class AttributeTypeTests(unittest.TestCase):
+
+    def test_validate_enumerated_type(self):
+        # function factory for "choice validators"
+        food = nodes.validate_enumerated_type('ham', 'spam')
+        self.assertEqual(food('ham'), 'ham')
+        with self.assertRaisesRegex(ValueError,
+                                    '"bacon" is not one of "ham", "spam".'):
+            food('bacon')
+
+    def test_validate_identifier(self):
+        # Identifiers must start with an ASCII letter and may contain
+        # letters, digits and the hyphen
+        # https://docutils.sourceforge.io/docs/ref/doctree.html#idref-type
+        self.assertEqual(nodes.validate_identifier('mo-8b'), 'mo-8b')
+        with self.assertRaisesRegex(ValueError, '"8b-mo" is no valid id'):
+            nodes.validate_identifier('8b-mo')
+
+    def test_validate_identifier_list(self):
+        # list of identifiers (cf. above)
+        # or a `str` of space-separated identifiers.
+        l1 = ['m8-b', 'm8-c']
+        s1 = 'm8-b m8-c'
+        self.assertEqual(nodes.validate_identifier_list(l1), l1)
+        self.assertEqual(nodes.validate_identifier_list(s1), l1)
+        l2 = ['m8-b', 'm8_c']
+        s2 = 'm8-b #8c'
+        with self.assertRaises(ValueError):
+            nodes.validate_identifier_list(l2)
+        with self.assertRaises(ValueError):
+            nodes.validate_identifier_list(s2)
+
+    def test_validate_measure(self):
+        # number (may be decimal fraction) + optional CSS2 length unit
+        self.assertEqual(nodes.validate_measure('8ex'), '8ex')
+        self.assertEqual(nodes.validate_measure('3.5 %'), '3.5%')
+        self.assertEqual(nodes.validate_measure('2'), '2')
+        with self.assertRaisesRegex(ValueError, '"2km" is no valid measure. '
+                                    'Valid units: em ex '):
+            nodes.validate_measure('2km')
+        # negative numbers are currently not supported
+        # TODO: allow? the spec does not mention negative numbers.
+        # but a negative width or height of an image is odd.
+        # nodes.validate_measure('-2')
+
+    def test_validate_NMTOKEN(self):
+        # str with ASCII-letters, digits, hyphen, underscore, and full-stop.
+        self.assertEqual(nodes.validate_NMTOKEN('-8x_.'), '-8x_.')
+        with self.assertRaises(ValueError):
+            nodes.validate_NMTOKEN('why me')
+
+    def test_validate_NMTOKENS(self):
+        # list of NMTOKENS or string with space-separated NMTOKENS
+        l1 = ['8_b', '8.c']
+        s1 = '8_b 8.c'
+        l2 = ['8_b', '8/c']
+        s2 = '8_b #8'
+        self.assertEqual(nodes.validate_NMTOKENS(l1), l1)
+        self.assertEqual(nodes.validate_NMTOKENS(s1), l1)
+        with self.assertRaises(ValueError):
+            nodes.validate_NMTOKENS(l2)
+        with self.assertRaises(ValueError):
+            nodes.validate_NMTOKENS(s2)
+
+    def test_validate_refname_list(self):
+        # list or string of "reference names".
+        l1 = ['*:@', r'"more"\ & \x!']
+        s1 = r'*:@ \"more"\\\ &\ \\x!'  # unescaped backslash is ignored
+        self.assertEqual(nodes.validate_refname_list(l1), l1)
+        self.assertEqual(nodes.validate_refname_list(s1), l1)
+        # whitespace is normalized, case is not normalized
+        l2 = ['LARGE', 'a\t \tc']
+        s2 = r'LARGE a\ \ \c'
+        normalized = ['LARGE', 'a c']
+
+        self.assertEqual(nodes.validate_refname_list(l2), normalized)
+        self.assertEqual(nodes.validate_refname_list(s2), normalized)
+
+    def test_validate_yesorno(self):
+        # False if '0', else bool
+        # TODO: The docs say '0' is false:
+        # * Also return `True` for values that evaluate to `False`?
+        #   Even for `False` and `None`?
+        # * Also return `False` for 'false', 'off', 'no'
+        #   like boolean config settings?
+        self.assertFalse(nodes.validate_yesorno('0'))
+        self.assertFalse(nodes.validate_yesorno(0))
+        self.assertTrue(nodes.validate_yesorno('*'))
+        self.assertTrue(nodes.validate_yesorno(1))
+        # self.assertFalse(nodes.validate_yesorno('no'))
+
+
 if __name__ == '__main__':
     unittest.main()

Modified: trunk/docutils/test/test_parsers/test_recommonmark/test_literal_blocks.py
===================================================================
--- trunk/docutils/test/test_parsers/test_recommonmark/test_literal_blocks.py	2024-05-06 12:41:07 UTC (rev 9690)
+++ trunk/docutils/test/test_parsers/test_recommonmark/test_literal_blocks.py	2024-05-07 11:24:22 UTC (rev 9691)
@@ -204,20 +204,6 @@
         A literal block (fenced code block)
         with *info string*.
 """],
-["""\
-~~~eval_rst
-Evaluating embedded rST blocks requires the AutoStructify component
-in recommonmark. Otherwise this is just a code block
-with class ``eval_rst``.
-~~~
-""",
-"""\
-<document source="test data">
-    <literal_block classes="code eval_rst" xml:space="preserve">
-        Evaluating embedded rST blocks requires the AutoStructify component
-        in recommonmark. Otherwise this is just a code block
-        with class ``eval_rst``.
-"""],
 ]
 
 
