#6 select encoding of html files

Devel (cvs)
closed-fixed
Edward Loper
None
5
2006-03-17
2005-05-19
John R Lenton
No

I sent this to epydoc-devel, and then realized that
I should've just posted it here. Sorry for the
duplication. Quoted from the email:

The attached patch adds an option, encoding, that
selects the encoding of the (x)html files.

I hope the patch format is ok.

Discussion

  • John R Lenton
    2005-05-19

     
  • Logged In: YES
    user_id=1053920

    On saving docs generated by reStructuredText, unicode
    strings are encoded with an ascii encoder. If there is any
    accented letter an error is raised.

    The following patch uses the same encoding to properly
    encode unicode strings.

    There are still problems with the euro sign, which is
    returned as \u20ac even if read as \x80, but I guess I can
    live with it.

    --- restructuredtext.py.bak 2005-10-18 22:32:51.346059200 +0200
    +++ restructuredtext.py 2005-10-18 22:09:26.475955200 +0200
    @@ -152,17 +152,21 @@
             # Inherit docs
             visitor = _EpydocHTMLTranslator(self._document, docstring_linker)
             self._document.walkabout(visitor)
    -        return ''.join(visitor.body)
    +        # Even if Demeter is turning over inside her tomb...
    +        return ''.join(visitor.body).encode(
    +            docstring_linker._docformatter._encoding)

         def to_latex(self, docstring_linker, **options):
             # Inherit docs
             visitor = _EpydocLaTeXTranslator(self._document, docstring_linker)
             self._document.walkabout(visitor)
    -        return ''.join(visitor.body)
    +        return ''.join(visitor.body).encode(
    +            docstring_linker._docformatter._encoding)

         def to_plaintext(self, docstring_linker, **options):
             # This is should be replaced by something better:
    -        return self._document.astext()
    +        return self._document.astext().encode(
    +            docstring_linker._docformatter._encoding)

         def __repr__(self): return '<ParsedRstDocstring: ...>'
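
The failure this comment describes, and the idea behind the patch (encode with a configured codec instead of ascii), can be sketched roughly as follows. This is an illustration only: the sample string is made up, and treating \x80 as a cp1252 byte is an assumption about where the euro sign problem comes from, not something the comment confirms.

```python
# Illustrative only: the sample text and codecs are assumptions.
text = u'caf\u00e9'  # contains an accented letter

# Encoding with the ascii codec fails on any non-ascii character.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding with a codec that defines the character succeeds.
latin1_bytes = text.encode('latin-1')

# In cp1252 the byte \x80 is the euro sign, so decoding it yields \u20ac.
euro = b'\x80'.decode('cp1252')

# latin-1 has no euro sign, so encoding \u20ac with it fails as well.
try:
    euro.encode('latin-1')
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False
```

If the input really was cp1252, encoding the output with cp1252 as well would turn \u20ac back into \x80, while latin-1 cannot represent it at all.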

     
  • Logged In: YES
    user_id=1053920

    oops,

    my previous addition to the patch can only work with HTML.
    Some further work is needed for LaTeX.

    A "\usepackage[latin1]{inputenc}" is also required in the
    LaTeX output for anything but plain ascii. I don't know about
    support for other encodings in LaTeX.

     
  • Edward Loper
    2006-03-17

    Logged In: YES
    user_id=195958

    In epydoc3, I'm currently generating pure ascii output, by
    using encoding='ascii' and errors='xmlcharrefreplace'.
    I.e., any non-ascii character will be replaced by an xml
    entity such as "&#34952;". Is there a reason why you'd
    want to specify a specific encoding, rather than using
    xmlcharrefreplace for non-ascii characters? If so, I can
    certainly add this option to epydoc 3.0; but if it's not a
    useful option, then I'd rather there be one less option
    that people have to read through in the docs. :)
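
For readers unfamiliar with the error handler mentioned above, here is a minimal sketch of what encoding='ascii' with errors='xmlcharrefreplace' does to a string. The sample text is an assumption for illustration and is not epydoc output.

```python
# Illustrative sample; the text is an assumption, not epydoc output.
text = u'caf\u00e9 costs 3\u20ac'

# Each non-ascii character becomes a numeric character reference
# holding its decimal code point; ascii characters pass through.
encoded = text.encode('ascii', 'xmlcharrefreplace')

# The entity quoted in the comment corresponds to U+8888.
quoted_entity = u'\u8888'.encode('ascii', 'xmlcharrefreplace')
```

The result is a plain ascii byte string, so it reads the same whatever charset a browser assumes for the page.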

     
  • Logged In: YES
    user_id=1053920

    What about utf-8 output? Is there any issue about it?

     
  • Edward Loper
    2006-03-17

    Logged In: YES
    user_id=195958

    I'm not sure what you mean by "is there any issue about
    it." Are you asking why I chose to output using ascii w/
    xml charref entities, rather than using utf-8?

    If that's your question, then I had a couple reasons:
    first, ascii w/ charref entities is likely more widely
    supported; and second, to specify the encoding of the
    contents in the html files themselves, we need to add a
    meta tag to the head section. I've seen several places
    that say this should be avoided, eg:

    http://ppewww.ph.gla.ac.uk/%7Eflavell/charset/ns-burp.html

    The only advantage I see to using utf-8 instead of ascii
    w/ xml charrefs is that the file size will be smaller if
    the user has a lot of non-ascii characters in their
    docstrings. But it seems unlikely that it'll affect file
    size that much.

    Is there another advantage to using utf-8 as the encoding
    that I'm missing? Or do you think that the file size
    increase will be significant?

    If I misunderstood your question, then please elaborate on
    what you want to know.
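
The file size point can be checked with a rough sketch like the one below. The helper name encoded_sizes and the sample text are hypothetical; real docstrings will give different numbers.

```python
def encoded_sizes(text):
    """Return (utf8_size, charref_size) in bytes for the same text."""
    utf8_size = len(text.encode('utf-8'))
    charref_size = len(text.encode('ascii', 'xmlcharrefreplace'))
    return utf8_size, charref_size

# Hypothetical docstring text with a few accented letters.
sample = u'D\u00e9j\u00e0 vu: na\u00efve r\u00e9sum\u00e9'
utf8_size, charref_size = encoded_sizes(sample)

# Pure ascii text is the same size under both schemes.
plain_sizes = encoded_sizes(u'plain ascii docstring')
```

Each accented Latin letter costs 2 bytes in utf-8 and 6 bytes as a charref such as &#233;, so the difference only matters when non-ascii characters make up a large share of the text.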

     
  • Edward Loper
    2006-03-17

    Logged In: YES
    user_id=195958

    As evidence that utf-8 isn't as well supported, I just
    tried the following:

    import codecs

    s = (u'<?xml version="1.0" encoding="UTF-8"?>\n'
         u'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML '
         u'1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">\n'
         u'<html>\n<head>\n<META HTTP-EQUIV="Content-Type" '
         u'CONTENT="text/html; charset=UTF-8">\n</head>\n<body>\n'
         u'Unicode test: \u00A9\n</body>\n</html>\n')
    # Write the page as utf-8 bytes.
    codecs.open('unicode-test.html', 'w', 'utf-8').write(s)

    Viewing the generated file via apache, neither IE nor
    Mozilla renders it correctly (unless I explicitly select
    utf-8 as the encoding). But the following does render
    correctly:

    import codecs

    s = (u'<?xml version="1.0" encoding="ascii"?>\n'
         u'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML '
         u'1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">\n'
         u'<html>\n<head>\n</head>\n<body>\n'
         u'Unicode test: \u00A9\n</body>\n</html>\n')
    # Write the page as ascii, replacing non-ascii characters with charrefs.
    codecs.open('unicode-test.html', 'w', 'ascii',
                'xmlcharrefreplace').write(s)

    So unless I'm somehow getting the necessary headers wrong,
    I think it's best to stick with ascii.

     
  • Logged In: YES
    user_id=1053920

    > I'm not sure what you mean by "is there any issue about
    > it." Are you asking why I chose to output using ascii w/
    > xml charref entities, rather than using utf-8?

    Yes, I was asking this.

    The more natural way to express non-ascii characters is to
    use an encoding where they are defined, and utf-8 works for
    any character. So I was just wondering what rationale was
    behind the choice of ascii+html escape.

    > As evidence that utf-8 isn't as well supported, I just
    > tried the following:

    This is a bit stranger: I just tried your utf-8 test
    and it worked as it should (a page with a (C) symbol was
    rendered). I tried with IE6 and Firefox 1.5. Firefox
    reports a UTF-8 encoding in the info page, although I didn't
    do anything to enforce it.

    I opened the page directly from the file system: maybe apache
    is sending an HTTP header which overrides the META tag in
    the page? You could check the Firefox "page property" to see
    what encoding it decided to apply.

    If you are concerned about older browsers, an ascii page is
    probably the best solution anyway. It would be hell
    to hand-update the resulting pages in languages with many
    accents (such as French) or in non-Latin scripts, but of course
    this is not how Epydoc pages are supposed to be used!

     
  • Edward Loper
    2006-03-17

    • assigned_to: nobody --> edloper
    • status: open --> closed-fixed
     
  • Edward Loper
    2006-03-17

    Logged In: YES
    user_id=195958

    Yes, opening it locally will give different results from
    having it served via apache, since apache will send HTTP
    headers, including one that specifies the charset to use.
    I suppose I could also write an .htaccess file to the
    output directory, telling apache what charset to claim the
    file has, but I think I prefer using ascii -- simple is
    good. :)

    So I'm going to go ahead and close this feature request.
    If anyone still wants to argue for why it's better to use
    an output encoding other than ascii+charrefs, post another
    comment here & I'll consider re-opening it.
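
The difference between opening the file locally and having it served by apache comes down to precedence: a charset in the HTTP Content-Type header wins over the META tag. The sketch below models only that one rule. The function names are hypothetical, the ISO-8859-1 header value is an assumed server default, and real browsers apply further steps (such as byte order marks and content sniffing) that are left out.

```python
from email.message import Message

def charset_from_content_type(value):
    """Extract the charset parameter from a Content-Type value, or None."""
    msg = Message()
    msg['Content-Type'] = value
    return msg.get_content_charset()

def effective_charset(http_content_type, meta_content_type):
    """Prefer the HTTP header's charset; fall back to the META tag's."""
    if http_content_type:
        charset = charset_from_content_type(http_content_type)
        if charset:
            return charset
    if meta_content_type:
        return charset_from_content_type(meta_content_type)
    return None

meta = 'text/html; charset=UTF-8'

# Served with a header that names a charset (assumed server default).
served = effective_charset('text/html; charset=ISO-8859-1', meta)

# Opened from the file system: no HTTP header, so the META tag applies.
local = effective_charset(None, meta)
```

Under this model the same utf-8 file is decoded as iso-8859-1 when served and as utf-8 when opened locally, which matches the two observations in the thread. An ascii page with charrefs decodes identically in both cases.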