From: Guenter M. <mi...@us...> - 2025-07-21 17:48:54
Dear Viktor,

welcome to the list and thanks for your feedback.

On 2025-06-30, Viktor Ransmayr wrote:

...

> I noted issues with certain IETF mailarchive URIs already a while ago - but
> - only now took the time to follow up & create a simple test file (see
> attachment) demonstrating the issue.

> This file contains two IETF mailarchive URI instances. - The first one is
> processed without an issue - and - the second one is processed with an
> error.

...

> test-IETF-URI-issue.rst:18: (ERROR/3) Unknown target name: "k4-l4mk7qa".

> For me it is not clear, if the second mailarchive URI really does 'violate'
> the reStructuredText Markup Specification - or - if it is a 'docutils'
> issue.

The parsing result conforms to the reStructuredText specification
https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html.

Why?
====

Let us simplify the example to the two paragraphs::

  works: https://example.org/msg/TljW9V_sIzQJ1PpO4axkKmiWCZI/

  fails: https://example.org/msg/k4-L4mK7Qa_-F3svmF6uFKKPZ6I/

Each paragraph is parsed for *inline markup*. According to the
`recognition order`_, standalone hyperlinks are the last to be recognised.
I.e., before looking for a URI, the paragraph is checked for emphasis,
literals, ..., hyperlink references, and interpreted text.

The second URI contains inline markup consistent with the
`hyperlink reference`_ "k4-l4mk7qa".

Unfortunately, the `inline markup recognition rules`_ are rather complex.¹

* A simple hyperlink reference has no start-string; its end-string is "_".

* End-strings must end the text block or be immediately followed by
  whitespace or punctuation (ASCII characters
  ``- . , : ; ! ? \ / ' " ) ] } >`` or similar non-ASCII characters).

In the working example, the underscore is followed by a letter, so it is
not recognised as a hyperlink reference end-string.

In the failing example, the underscore is followed by "-", so it is
interpreted as the end-string of a simple hyperlink reference.
The text from "/" to "_" forms the `simple reference name`_ "k4-L4mK7Qa",
which is normalised to "k4-l4mk7qa" in the error message because reference
names are case-insensitive.
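
The two conditions can be approximated in a few lines of Python. This is
only a rough sketch for illustration, not the docutils implementation: the
function name ``find_simple_reference`` is made up, the character classes
are ASCII-only approximations of the ones in the specification, and all
other inline markup (and proper escape handling) is ignored.

```python
import re

# ASCII-only approximations (assumption): the real rules also accept
# similar non-ASCII punctuation characters.
# The reference name must start the text or follow whitespace/punctuation.
START_PREFIX = r"""(?:^|(?<=[\s\-:/'"<(\[{]))"""
# The "_" end-string must end the text or be followed by
# whitespace/punctuation.
END_SUFFIX = r"""(?=$|[\s\-.,:;!?\\/'")\]}>])"""

# Simple reference name: alphanumerics with isolated internal
# hyphens, underscores, periods, colons, or plus signs.
SIMPLE_NAME = r"[A-Za-z0-9]+(?:[-_.:+][A-Za-z0-9]+)*"

SIMPLE_REF = re.compile(
    START_PREFIX + "(?P<name>" + SIMPLE_NAME + ")_" + END_SUFFIX
)


def find_simple_reference(text):
    """Return the lower-cased name of the first simple hyperlink
    reference this sketch finds in ``text``, or None."""
    match = SIMPLE_REF.search(text)
    return match.group("name").lower() if match else None


if __name__ == "__main__":
    samples = [
        "works: https://example.org/msg/TljW9V_sIzQJ1PpO4axkKmiWCZI/",
        "fails: https://example.org/msg/k4-L4mK7Qa_-F3svmF6uFKKPZ6I/",
    ]
    for sample in samples:
        # Prints None for the first sample and 'k4-l4mk7qa' for the second.
        print(repr(find_simple_reference(sample)))
```

In the first sample the only underscore is followed by the letter "s", so
the look-ahead fails and nothing matches. In the second sample the
underscore is followed by "-", and the name starts right after "/", so the
sketch reports the same name as the error message.
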
(The "violation" of proper rST syntax is that there is no matching target_.)²

In the first URI, the "random" directory name
``TljW9V_sIzQJ1PpO4axkKmiWCZI/`` contains an underscore, too.
However, it is not recognised as an inline markup end-string, because it is
followed by a letter.

Workarounds
===========

Escape_ the underscore::

  https://example.org/msg/k4-L4mK7Qa\_-F3svmF6uFKKPZ6I/

Mark up as a hyperlink reference with an `embedded URI`_::

  `<https://example.org/msg/k4-L4mK7Qa_-F3svmF6uFKKPZ6I/>`__

Wrapping in angle brackets helps for standalone hyperlinks with trailing
punctuation like <https://example.org/msg.> but does not help with
underscores.

I hope this helps to understand and work around the problem,

Günter

¹ They were devised to allow 90% of non-markup uses of \*, \`, _, and |
  without escaping.

² After parsing, the second paragraph contains:

  * the text "fails: ",
  * a reference to "https://example.org/msg/",
  * the nonfunctional reference to the internal target "k4-l4mk7qa",
  * and the text "-F3svmF6uFKKPZ6I/".

.. _recognition order:
   https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html#recognition-order
.. _hyperlink reference:
   https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html#hyperlink-references
.. _inline markup recognition rules:
   https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html#inline-markup-recognition-rules
.. _simple reference name:
   https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html#simple-reference-names
.. _target:
   https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html#hyperlink-targets
.. _escape:
   https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html#escaping-mechanism
.. _embedded URI:
   https://docutils.sourceforge.io/docs/ref/rst/restructuredtext.html#embedded-uris-and-aliases