Menu

#493 Test failure on Windows with embedded images

open-fixed
nobody
None
5
2025-04-14
2024-08-07
Adam Turner
No

xref [r9785], [r9853], [r9855]

Dear @milde,

Thank you for the fix to my recent patch. It seems neither my patch nor the fix addressed the root cause of the test failures, as tests have resumed failing on Windows.

I believe the following demonstrates the problem:

>>> import sys; print(sys.platform)
win32
>>> import urllib.parse, urllib.request
>>> urllib.request.url2pathname('test/data/circle-broken.svg')
'test\\data\\circle-broken.svg'
>>> urllib.parse.unquote('test/data/circle-broken.svg')
'test/data/circle-broken.svg'

Currently, we use imagepath = urllib.request.url2pathname(uri_parts.path), which converts path separators to their platform-native format. On UNIX, url2pathname simply calls unquote, but on Windows it handles UNC paths (\\host\path\) and escaped drive letters (///C|/users/).

I don't know what led to using url2pathname(), as it is quite specialised (the docstring notes "not recommended for general use"). Is it possible to use the simpler unquote() here?

For local file paths (e.g. without a file:/// scheme), should we even be using URI parsing? Perhaps we should use proper path handling if there is no URI scheme (i.e. the user has provided a file-path).

A

Related

Commit: [r9785]
Commit: [r9853]
Commit: [r9855]

Discussion

  • Günter Milde

    Günter Milde - 2024-08-07

    Unfortunately, I don't know what the "test failure on Windows" actually is, could you provide the relevant test output, please?

    Is it possible to use the simpler unquote() here?

    If it works on Windows, using unquote() is a good idea. Could you test with

    diff --git a/docutils/docutils/writers/_html_base.py b/docutils/docutils/writers/_html_base.py
    index ead930e31..2eab36510 100644
    --- a/docutils/docutils/writers/_html_base.py
    +++ b/docutils/docutils/writers/_html_base.py
    @@ -626,7 +626,7 @@ def uri2imagepath(self, uri):
             uri_parts = urllib.parse.urlparse(uri)
             if uri_parts.scheme not in ('', 'file'):
                 raise ValueError('Can only read local images.')
    -        imagepath = urllib.request.url2pathname(uri_parts.path)
    +        imagepath = urllib.parse.unquote(uri_parts.path)
             if imagepath.startswith('/'):  # cf. config.html#root-prefix
                 root_prefix = Path(self.settings.root_prefix)
                 imagepath = (root_prefix/imagepath[1:]).as_posix()
    

    and adapt the test script if required?

    Perhaps we should use proper path handling if there is no URI scheme (i.e. the user has provided a file-path).

    Per definitionem, there is no way to provide a system file-path as image ressource, both rST and the Docutils Doctree explicitely expect an URI.

    As URIs are capable of representing local file paths, I dont see the advantage.
    Interpreting an image URI with undefined scheme as system path would be a complication of both, specification and implementation and a backwards-incompatible change. We would als have to think about images that are only read to get the size but kept external in HTML output.

     
    • Adam  Turner

      Adam Turner - 2024-08-07

      could you provide the relevant test output, please?

      See https://github.com/AA-Turner/docutils/actions/runs/10284219635/job/28459783099 for an online view (I advise collapsing the "alltests" tab and scrolling up to the pytest one, which has better formatted output). The full failure is very verbose, but it is because if imagepath.startswith('/') fails.

      Using unquote() does make the current tests pass on Windows, my question was more about what is the range of URIs that we deem valid? If we use unquote(), then e.g. file:///C|users/john/mallard.png will be improperly translated:

      >>> from urllib.parse import urlparse, unquote
      >>> from urllib.request import url2pathname
      >>> print(url2pathname(urlparse('file:///C|users/john/mallard.png').path))
      C:\users\john\mallard.png
      >>> print(unquote(urlparse('file:///C|users/john/mallard.png').path))
      /C|users/john/mallard.png
      

      Now it seems that (at least from wikipedia) this scheme of using | is not standardised/used, and might just be a Python quirk -- in which case I think just using unquote() is easier. Of note, pathlib does not use the | replacement mechanism.

      Per definitionem, there is no way to provide a system file-path as image ressource, both rST and the Docutils Doctree explicitely expect an URI.

      This seems to be a problem of imprecise language. picture.png as used in the example is not a URI, as it has no scheme. Nor is /data/blue square.png or etc. Indeed, none of the .. image:: directives in docs/ have a scheme, and hence aren't URIs. The actual check that we do is in Image.run(), reference = directives.uri(self.arguments[0]), which just removes unescaped whitespace.

      Interpreting an image URI with undefined scheme as system path

      This goes directly to my point, as the scheme is what makes a URI! (well, the scheme and the path).

      I think that there is a strong argument based on historical practice in favour of documenting that we accept either a URI-with-scheme (file: or https: etc) or a local file path. However, I might be missing something, and I agree we should avoid backwards-incompatible changes. My main motivation is simply to fix the tests, and it seems that using unquote() will work, unless you forsee any other problems.

      A

       
  • Günter Milde

    Günter Milde - 2024-08-07

    I am in favour of using unquote().

    Now it seems that (at least from wikipedia) this scheme of using | is not standardised/used, and might just be a Python quirk

    The "|" seems to be not Python specific, however, it is obsolete. Wiki says:

    On Microsoft Windows systems, the normal colon (:) after a device letter has sometimes been replaced by a vertical bar (|) in file URLs. This reflected the original URL syntax, which made the colon a reserved character in a path part.
    Since Internet Explorer 4, file URIs have been standardized on Windows, and should follow the following scheme.

    RFC 8089 says:

    Historically, some usages of file URIs have included a vertical line
    character "|" instead of a colon ":" in the drive letter construct.
    [RFC3986] forbids the use of the vertical line; however, it may be
    necessary to interpret or update old URIs.

    URI vs. Path

    This seems to be a problem of imprecise language. picture.png as used in the example is not a URI, as it has no scheme.
    ...
    we accept either a URI-with-scheme (file: or https: etc) or a local file path.

    Actually not: by default, the "uri" attribute is written as-is to the HTML <img> element's "src" attribute which expects an URL and may fail for a number of valid system paths.

    It seems what Docutils expects is a URI-Reference¹:

    A URI-reference is either a URI or a relative reference.
    If the URI-reference's prefix does not match the syntax of a scheme followed
    by its colon separator, then the URI-reference is a relative reference.

    In contrast to a system file path, a relative reference still requires forward slashes and quoting of some characters.

    ¹ NOTE: Previous specifications used the terms "partial URI" and
    "relative URI" to denote a relative reference to a URI. [RFC 3989]

     

    Last edit: Günter Milde 2024-08-11
  • Günter Milde

    Günter Milde - 2024-08-07

    Fixed in [r9866]. Documentation updated in [r9865].
    Thank you for report and analysis.

     

    Related

    Commit: [r9865]
    Commit: [r9866]

  • Günter Milde

    Günter Milde - 2024-08-08
    • status: open --> open-fixed
     
  • Adam  Turner

    Adam Turner - 2024-08-08
    • status: open-fixed --> open
     
  • Adam  Turner

    Adam Turner - 2024-08-08

    Sorry, another failure.

    The root cause of this one seems to be the uri_parts.scheme not in ('', 'file') check. On Windows, we get the following, because drives in absolute paths are parsed as URI schemes:

    >>> uri
    'S:/Development/docutils/docutils/test/data/circle-broken.svg'
    >>> urllib.parse.urlparse(uri).scheme
    's'
    

    We could do something like Path(uri).is_file(), but that has the problems with using pathlib that you mentioned earlier...

    Full failure below, also on https://github.com/AA-Turner/docutils/actions/runs/10308701375/job/28536730749

    ============================================================================================================= FAILURES ==============================================================================================================
    _________________________________________________________________________ Html5WriterPublishPartsTestCase.test_publish (id="totest['system_messages'][1]") __________________________________________________________________________
    
    self = <test.test_writers.test_html5_polyglot_parts.Html5WriterPublishPartsTestCase testMethod=test_publish>
    
        def test_publish(self):
            if not with_pygments:
                del totest['syntax_highlight']
            writer = 'html5'
            for name, (settings_overrides, cases) in totest.items():
                for casenum, (case_input, case_expected) in enumerate(cases):
                    with self.subTest(id=f'totest[{name!r}][{casenum}]'):
                        parts = docutils.core.publish_parts(
                            source=case_input,
                            writer=writer,
                            settings_overrides={
                                '_disable_config': True,
                                'strict_visitor': True,
                                'stylesheet': '',
                                'section_self_link': True,
                                **settings_overrides,
                            }
                        )
    >                   self.assertEqual(case_expected, self.format_output(parts))
    E                   AssertionError: {'fragment': '<svg xmlns="http://www.w3.org/2000/svg"\n [410 chars]>\n'} != {'fragment': '<img alt="S:/Development/docutils/docutils[396 chars]>\n'}
    E                   + {'fragment': '<img '
    E                   +              'alt="S:/Development/docutils/docutils/test/data/circle-broken.svg" '
    E                   +              'src="S:/Development/docutils/docutils/test/data/circle-broken.svg" '
    E                   - {'fragment': '<svg xmlns="http://www.w3.org/2000/svg"\n'
    E                   -              '     viewBox="0 0 10 10">\n'
    E                   -              '  <circle cx="5" cy="5" r="4" fill="lightblue" x/>\n'
    E                   -              '</svg>\n'
    E                   -              '\n'
    E                   +              '/>\n'
    E                   ?               ++
    E                   
    E                                  '<aside class="system-message">\n'
    E                                  '<p class="system-message-title">System Message: ERROR/3 (<span '
    E                                  'class="docutils literal">&lt;string&gt;</span>, line 1)</p>\n'
    E                   -              '<p>Cannot parse SVG image '
    E                   ?                         ---- ^^^^
    E                   
    E                   +              '<p>Cannot embed image '
    E                   ?                          ^^^^
    E                   
    E                                  '&quot;S:/Development/docutils/docutils/test/data/circle-broken.svg&quot;:\n'
    E                   -              '  not well-formed (invalid token): line 3, column 48</p>\n'
    E                   +              '  Can only read local images.</p>\n'
    E                                  '</aside>\n'}
    
    docutils\test\test_writers\test_html5_polyglot_parts.py:92: AssertionError
    ________________________________________________________________________ Html5WriterPublishPartsTestCase.test_publish (id="totest['no_system_messages'][1]") ________________________________________________________________________
    
    self = <test.test_writers.test_html5_polyglot_parts.Html5WriterPublishPartsTestCase testMethod=test_publish>
    
        def test_publish(self):
            if not with_pygments:
                del totest['syntax_highlight']
            writer = 'html5'
            for name, (settings_overrides, cases) in totest.items():
                for casenum, (case_input, case_expected) in enumerate(cases):
                    with self.subTest(id=f'totest[{name!r}][{casenum}]'):
                        parts = docutils.core.publish_parts(
                            source=case_input,
                            writer=writer,
                            settings_overrides={
                                '_disable_config': True,
                                'strict_visitor': True,
                                'stylesheet': '',
                                'section_self_link': True,
                                **settings_overrides,
                            }
                        )
    >                   self.assertEqual(case_expected, self.format_output(parts))
    E                   AssertionError: {'fragment': '<svg xmlns="http://www.w3.org/2000/svg"\n [85 chars]n\n'} != {'fragment': '<img alt="S:/Development/docutils/docutils[98 chars]>\n'}
    E                   + {'fragment': '<img '
    E                   +              'alt="S:/Development/docutils/docutils/test/data/circle-broken.svg" '
    E                   +              'src="S:/Development/docutils/docutils/test/data/circle-broken.svg" '
    E                   - {'fragment': '<svg xmlns="http://www.w3.org/2000/svg"\n'
    E                   -              '     viewBox="0 0 10 10">\n'
    E                   -              '  <circle cx="5" cy="5" r="4" fill="lightblue" x/>\n'
    E                   -              '</svg>\n'
    E                   -              '\n'}
    E                   +              '/>\n'}
    E                   ?               ++
    
    docutils\test\test_writers\test_html5_polyglot_parts.py:92: AssertionError
    

    A

     
  • Günter Milde

    Günter Milde - 2024-08-09

    The parsing is correct: The image argument is a "URI reference":

    A URI-reference is either a URI or a relative reference. If the
    URI-reference's prefix does not match the syntax of a scheme followed
    by its colon separator, then the URI-reference is a relative
    reference. (RFC 3986)

    Hence, the example
    ~~~
    .. image:: S:/Development/data/circle-broken.svg
    ~~~
    describes an image accessible under scheme "S".

    Our problem is, that we cannot easily use relative paths (to keep flexibility regarding where to run the tests from).

    A simple solution would be using a "file" scheme in the offending examples.
    If [r9880] fixes the problem, we can mark this bug as fixed.

     

    Related

    Commit: [r9880]


    Last edit: Günter Milde 2024-08-11
  • Adam  Turner

    Adam Turner - 2024-08-09

    The parsing is correct: The image argument is a "URI reference" ...

    The issue here now becomes why the tests were not failing on Linux, I believe. Both Windows and Linux would have had an absolute path (e.g. .. image:: /home/user/development/docutils/docutils/test/data/circle-broken.svg). On Windows, this failed due to misinterpreting the drive letter as a scheme. However, because the default root_prefix is /, when an absolute path is used on Unix, the scheme is '', imagepath.startswith('/') is True, and root_prefix/imagepath[1:] just strips and restores the leading slash.

    I believe that if we are only accepting URI-refereneces, then we should reject absolute paths on UNIX, and the tests should fail.

    --image-loading was added in Docutils 0.18 and :loading: was added in 0.21. Local testing shows that on Linux, absolute paths are allowed with --image-loading embed or :loading: embed in every released version and the current master branch. On Windows, 0.18-0.20 allowed absolute paths, but since 0.21 an error has been raised.

    The following table demonstrates the matrix. image_loading means using --image-loading embed and :loading: means using :loading: embed. My test commands were:

    $ printf ".. image:: /mnt/s/Development/docutils/docutils/test/data/circle.svg" | rst2html5 --link-stylesheet --image-loading embed
    $ printf ".. image:: /mnt/s/Development/docutils/docutils/test/data/circle.svg\n   :loading: embed" | rst2html5 --link-stylesheet
    PS> echo ".. image:: S:/Development/docutils/docutils/test/data/circle.svg" | python rst2html5.py --link-stylesheet --image-loading embed
    PS> echo ".. image:: S:/Development/docutils/docutils/test/data/circle.svg`n   :loading: embed" | python rst2html5.py --link-stylesheet
    

    Absolute path support in the image directive with embed mode

    Linux Linux Windows Windows
    image_loading :loading: image_loading :loading:
    0.18 base64 N/A base64 N/A
    0.19 base64 N/A base64 N/A
    0.20 base64 N/A base64 N/A
    0.21 direct direct error error
    0.22 direct direct error error

    I think we have two options. (1) support absolute paths on every platform, document as such. (2) make absolute paths return an error on Linux. The first restores backwards compatibility, the second is consistent with what we document.

    [r9880]

    A nit-pick here -- this commit leads to URIs like file://s:/... on Windows. From Wikipedia's page on file URIs "file://path (i.e. two slashes, without a hostname) is never correct, but is often used.".

    I think we should use Path.as_uri() instead of .as_posix().

    A

     

    Related

    Commit: [r9880]

    • Günter Milde

      Günter Milde - 2024-08-11

      I think we should use Path.as_uri() instead of .as_posix().

      Done in [r9882].

       

      Related

      Commit: [r9882]

  • Günter Milde

    Günter Milde - 2024-08-10

    I think we have two options.
    (1) support absolute paths on every platform, document as such.
    (2) make absolute paths return an error on Linux.

    (3) support URI-references: full URI | relative reference (relative to a "base URI"), e.g. an "absolute path reference" (starting with one slash) or a "relative path reference" (not starting with a slash).

    The first restores backwards compatibility,

    the second is needlessly restrictive and hard to explain.

    The third is consistent with what we document.
    The different handling of Windows system paths is due to the syntax definition of URIs where the "path" part resembles a POSIX path.

    Another reason to describe the image attribute as "URI reference" are paths with spaces ---
    to get a path with spaces, percent encoding is required:

    .. image:: My%20Documents/first%20drawing.png
    

    As spaces are not allowed in an URI anyway, they are dropped from a link block in reStructuredText to allow easy breaking of long URIs over several lines. For example:

    .. image:: My Documents/first drawing.png
    

    is parsed into the native XML

    <image uri="MyDocuments/firstdrawing.png">
    

    This behaviour is surprising when you describe an image attribute as "URI or local path".

     

    Last edit: Günter Milde 2024-09-11
  • Adam  Turner

    Adam Turner - 2024-08-10

    the second is needlessly restrictive and hard to explain.

    Suppose a user has a directive .. image:: /media/circle.svg. By default this will load the file under /media/circle.svg, as the user might expect if they think they are using filesystem paths, and don't realise the nuances of URI references. At some later date, this user sets root_prefix to /home/user/project/. The image directive will stop working, as it now looks for a file under /home/user/project/media/circle.svg.

    I think this behaviour is hard to explain and quite surprising.

    I have a proposed fix at https://github.com/AA-Turner/docutils/pull/18 (https://github.com/AA-Turner/docutils/pull/18.patch). This does the following:

    • Adds a helper function to convert image URIs to paths
    • Updates all writers (HTML, LaTeX, and ODF/ODT) that support images to use this helper for consistent behaviour
    • Correctly convert file:/// URIs to filesystem paths (using Python 3.13's Path.from_uri() where possible).
    • Validate that file: URIs are absolute paths
    • Correctly convert relative references to filesystem paths
    • Changes the default root_prefix to '' (the empty string), and only applies the root_prefix behaviour when it is non-empty (i.e. a user has configured the setting)

    The second commit in the patch goes further, by rejecting absolute paths when used in relative references. This is optional, but as I argue above I think allowing absolute paths in relative references leads to surprising behaviour, and the resolution is easy if you want to keep using absolute paths (just use file:///). I think this would be a straightforward end-state to explain to users: either the argument to .. image:: (and .. figure::) is a URI or it is a relative filesystem path.

    A

     
    • Günter Milde

      Günter Milde - 2024-08-15

      @aa-turner: Could you create a patch that just

      Changes the default root_prefix to '' (the empty string), and only applies the root_prefix behaviour when it is non-empty (i.e. a user has configured the setting)

      I agree that this is easier to explain (while the current implementation does not need a conditional). As the change is also backwards-compatible we can implement it without advance warning.

       
      • Günter Milde

        Günter Milde - 2024-11-25

        This part of the patch is applied in [r9991].

         

        Related

        Commit: [r9991]

    • Günter Milde

      Günter Milde - 2024-12-04

      A helper function that converts a URI-reference to a filesystem path for HTML and LaTeX writers based on this patch is introduced in [r9996]. The ODT writer will have to wait, as this is a backwards-incompatible change: : The ODT writer expects relative image paths to be relative to the source directory.

       

      Related

      Commit: [r9996]

  • Günter Milde

    Günter Milde - 2024-08-10

    Your analysis with "the uri_parts.scheme not in ('', 'file') check" beeing "the root cause" of the test failure is correct. This check was introduced in [r9501] "Refactor "_html_base" writer."

    However, even before (since introduction of image embedding in 0.18), the drive part was split from the path. To verify this, try PS> echo ".. image:: S:/Development/docutils/docutils/test/data/circle.svg" | python rst2html5.py --link-stylesheet --image-loading embed from a directory in a different drive.

    BTW: both, the introduction of the :loading: directive option and of direct embedding of SVG images did not change path handling (you should get the same behaviour with a PNG or JPG image).

     

    Related

    Commit: [r9501]

  • Günter Milde

    Günter Milde - 2024-08-11

    The second commit in the patch goes further, by rejecting absolute paths when used in relative references. This is optional, but as I argue above I think allowing absolute paths in relative references leads to surprising behaviour the resolution is easy if you want to keep using absolute paths (just use file:///).

    Using the "file" scheme does not work if you want to generate HTML and LaTeX/PDF output from the same source with the HTML to be served via https and linking to images.

    The rejection of absolute paths in "relative references" to the image URI also prevents a use where you read the image size from the file but not embed the image (cf. [feature-requests:#102]):

    .. image:: /pictures/parrot.jpg
              :scale: 100
    

    Both use cases currently work fine in a configuration that uses the same directory (e.g. "/public/wwwfiles/") in "root-prefix" and the http-server's "DocumentRoot".

     

    Last edit: Günter Milde 2024-08-12
  • Günter Milde

    Günter Milde - 2024-08-30

    @aa-turner: Do the tests still fail on Windows?

     
  • Günter Milde

    Günter Milde - 2024-10-15
    • status: open --> pending-works-for-me
     
  • Günter Milde

    Günter Milde - 2024-10-15

    @aa-turner: Can we close this bug?

     
  • Günter Milde

    Günter Milde - 2025-04-14
    • status: pending-works-for-me --> open-fixed
     

Log in to post a comment.

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.