I sent this to the epydoc-devel list, and then realized
that I should have just posted it here. Sorry for the
duplication. Quoted from the email:
The attached patch adds an option, encoding, that
selects the encoding of the (x)html files.
I hope the patch format is ok.
Logged In: YES
user_id=1053920
On saving docs generated by reStructuredText, unicode
strings are encoded with an ascii encoder. If there is any
accented letter an error is raised.
The following patch uses the same encoding to properly
encode unicode strings.
There are still problems with the euro sign, which is
returned as \u20ac even if read as \x80, but I guess I can
live with it.
--- restructuredtext.py.bak 2005-10-18 22:32:51.346059200 +0200
+++ restructuredtext.py 2005-10-18 22:09:26.475955200 +0200
@@ -152,17 +152,21 @@
         # Inherit docs
         visitor = _EpydocHTMLTranslator(self._document, docstring_linker)
         self._document.walkabout(visitor)
-        return ''.join(visitor.body)
+        # Even if Demeter is turning over inside her tomb...
+        return ''.join(visitor.body).encode(
+            docstring_linker._docformatter._encoding)
     def to_latex(self, docstring_linker, **options):
         # Inherit docs
         visitor = _EpydocLaTeXTranslator(self._document, docstring_linker)
         self._document.walkabout(visitor)
-        return ''.join(visitor.body)
+        return ''.join(visitor.body).encode(
+            docstring_linker._docformatter._encoding)
     def to_plaintext(self, docstring_linker, **options):
         # This is should be replaced by something better:
-        return self._document.astext()
+        return self._document.astext().encode(
+            docstring_linker._docformatter._encoding)
     def __repr__(self): return '<ParsedRstDocstring: ...>'
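For anyone who wants to reproduce the failure without running epydoc, here is a minimal sketch of what goes wrong when a unicode string is pushed through the ascii codec, and of what an explicit encoding changes. It is written for Python 3, the sample strings are made up, and the euro sign part is only one plausible reading of the remark above: it assumes the source was read as Windows-1252 (cp1252) and the output encoding was Latin-1.

```python
# Sketch only: illustrates the failure mode, this is not epydoc code.
text = u'caf\u00e9'  # hypothetical docstring text with an accented letter

# Encoding with the ascii codec fails on the accented letter.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding with an explicit 8-bit encoding succeeds.
latin1_bytes = text.encode('latin-1')

# Assumption: the euro sign was read from a cp1252 source, where it is
# the single byte 0x80. It decodes to U+20AC, which Latin-1 cannot encode.
euro = b'\x80'.decode('cp1252')
try:
    euro.encode('latin-1')
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False

# Encoding back to cp1252 restores the original byte.
euro_roundtrip = euro.encode('cp1252')
```

If that reading is right, the euro sign would only survive when the output encoding is one that actually contains it, such as cp1252 or utf-8.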
Logged In: YES
user_id=1053920
Oops,
the previous patch only works for HTML.
Some further work is needed for LaTeX.
A "\usepackage[latin1]{inputenc}" is also required in the
LaTeX output for anything but plain ascii. I don't know how
well LaTeX supports other encodings.
Logged In: YES
user_id=195958
In epydoc3, I'm currently generating pure ascii output, by
using encoding='ascii' and errors='xmlcharrefreplace'.
I.e., any non-ascii character will be replaced by an xml
entity such as "&#34952;". Is there a reason why you'd
want to specify a specific encoding, rather than using
xmlcharrefreplace for non-ascii characters? If so, I can
certainly add this option to epydoc 3.0; but if it's not a
useful option, then I'd rather there be one less option
that people have to read through in the docs. :)
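As a small illustration of the ascii plus xmlcharrefreplace approach described in this comment, the sketch below encodes a made-up string with Python's built-in codec error handler and then decodes it again the way a browser would. It assumes Python 3; the sample text is hypothetical and this is not epydoc's actual output code.

```python
import html

# Hypothetical docstring text with two non-ascii characters.
text = u'caf\u00e9 \u8888'

# Non-ascii characters become decimal character references.
encoded = text.encode('ascii', 'xmlcharrefreplace')

# A browser (here simulated with html.unescape) recovers the original text.
decoded = html.unescape(encoded.decode('ascii'))
```

Here `encoded` is the byte string `b'caf&#233; &#34952;'`, and `decoded` is equal to the original `text`.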
Logged In: YES
user_id=1053920
What about utf-8 output? Is there any issue about it?
Logged In: YES
user_id=195958
I'm not sure what you mean by "is there any issue about
it." Are you asking why I chose to output using ascii w/
xml charref entities, rather than using utf-8?
If that's your question, then I had a couple reasons:
first, ascii w/ charref entities is likely more widely
supported; and second, to specify the encoding of the
contents in the html files themselves, we need to add a
meta tag to the head section. I've seen several places
that say this should be avoided, eg:
http://ppewww.ph.gla.ac.uk/%7Eflavell/charset/ns-burp.html
The only advantage I see to using utf-8 instead of ascii
w/ xml charrefs is that the file size will be smaller if
the user has a lot of non-ascii characters in their
docstrings. But it seems unlikely that it'll affect file
size that much.
Is there another advantage to using utf-8 as the encoding
that I'm missing? Or do you think that the file size
increase will be significant?
If I misunderstood your question, then please elaborate on
what you want to know.
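To put a rough number on the file size question, the following sketch compares the encoded size of a string under utf-8 and under ascii with character references. The helper name `encoded_sizes` and the sample text are made up for illustration, and Python 3 is assumed.

```python
def encoded_sizes(text):
    """Return (utf-8 size, ascii+charref size) in bytes for text."""
    utf8_size = len(text.encode('utf-8'))
    charref_size = len(text.encode('ascii', 'xmlcharrefreplace'))
    return utf8_size, charref_size

# Hypothetical sample: "deja vu" with two accented letters.
sample = u'd\u00e9j\u00e0 vu'
sizes = encoded_sizes(sample)

# Pure ascii text costs the same either way.
plain_sizes = encoded_sizes(u'plain ascii')
```

For this sample `sizes` is `(9, 17)`: each accented Latin letter takes 2 bytes in utf-8 and 6 bytes as a decimal character reference. The overhead only matters in proportion to how much of the documentation is non-ascii.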
Logged In: YES
user_id=195958
As evidence that utf-8 isn't as well supported, I just
tried the following:
import codecs
s = (u'<?xml version="1.0" encoding="UTF-8"?>\n'
     u'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML '
     u'1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">\n'
     u'<html>\n<head>\n<META HTTP-EQUIV="Content-Type" '
     u'CONTENT="text/html; charset=UTF-8">\n</head>\n<body>\n'
     u'Unicode test: \u00A9\n</body>\n</html>\n')
codecs.open('unicode-test.html', 'w', 'utf-8').write(s)
Viewing the generated file via apache, neither IE nor
Mozilla renders it correctly (unless I explicitly select
utf-8 as the encoding). But the following does render
correctly:
s = (u'<?xml version="1.0" encoding="ascii"?>\n'
     u'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML '
     u'1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd">\n'
     u'<html>\n<head>\n</head>\n<body>\n'
     u'Unicode test: \u00A9\n</body>\n</html>\n')
codecs.open('unicode-test.html', 'w', 'ascii',
            'xmlcharrefreplace').write(s)
So unless I'm somehow getting the necessary headers wrong,
I think it's best to stick with ascii.
Logged In: YES
user_id=1053920
> I'm not sure what you mean by "is there any issue about
> it." Are you asking why I chose to output using ascii w/
> xml charref entities, rather than using utf-8?
Yes, that is what I was asking.
The more natural way to express non-ascii characters is to
use an encoding in which they are defined, and utf-8 works
for any character. So I was just wondering what the
rationale was behind the choice of ascii + html escapes.
> As evidence that utf-8 isn't as well supported, I just
> tried the following:
This is a bit stranger: I just tried your utf-8 test
and it worked as it should (a page with a (C) symbol was
rendered). I tried with IE6 and Firefox 1.5. Firefox
reports a UTF-8 encoding in the info page, and I didn't do
anything to enforce it.
I opened the page directly from the file system: maybe
apache is sending an http header which overrides the META
tag in the page? You can check the Firefox "page
properties" to see what encoding it decided to apply.
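The precedence being guessed at here can be sketched in a few lines. This is a simplified model that assumes the HTML 4.01 rule that a charset parameter in the HTTP Content-Type header takes priority over a META declaration; real browsers apply further heuristics. The function names are made up for illustration, and Python 3 is assumed.

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg['Content-Type'] = header_value
    # Returns the lowercased charset, or None if none is declared.
    return msg.get_content_charset()

def effective_charset(http_content_type, meta_charset):
    """Simplified model: the HTTP header wins over the META tag."""
    if http_content_type:
        declared = charset_from_content_type(http_content_type)
        if declared:
            return declared
    return meta_charset.lower() if meta_charset else None

# Served by a server that declares a charset: the META tag is ignored.
served = effective_charset('text/html; charset=ISO-8859-1', 'UTF-8')
# Opened from the file system: no HTTP header, so the META tag applies.
local = effective_charset(None, 'UTF-8')
```

Under this model `served` is `'iso-8859-1'` and `local` is `'utf-8'`, which would match a page that renders correctly from disk but not through a server that announces a different charset.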
If you are concerned about older browsers, an ascii page
is probably the best solution anyway. It would be hell
to hand-edit the resulting pages in languages with many
accents (such as French) or in non-Latin scripts, but of
course this is not how Epydoc pages are supposed to be used!
Logged In: YES
user_id=195958
Yes, opening it locally will give different results from
having it served via apache, since apache will send HTTP
headers, including one that specifies the charset to use.
I suppose I could also write an .htaccess file to the
output directory, telling apache what charset to claim the
file has, but I think I prefer using ascii -- simple is
good. :)
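For reference, the .htaccess idea mentioned above could look roughly like the sketch below. It is not something epydoc does; the helper name `write_htaccess` is made up, and it assumes an Apache setup where .htaccess files are honoured for this directory (AllowOverride FileInfo) so that the AddDefaultCharset directive takes effect.

```python
import os
import tempfile

def write_htaccess(output_dir, charset='utf-8'):
    """Write a .htaccess asking Apache to announce the given charset."""
    path = os.path.join(output_dir, '.htaccess')
    with open(path, 'w') as f:
        # AddDefaultCharset adds a charset parameter to text/html and
        # text/plain responses that do not already carry one.
        f.write('AddDefaultCharset %s\n' % charset)
    return path

# Demo against a throwaway directory standing in for the output directory.
demo_dir = tempfile.mkdtemp()
htaccess_path = write_htaccess(demo_dir)
```

The trade-off is the one noted in the comment: this ties the output to one web server and its configuration, whereas ascii with character references needs no server cooperation at all.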
So I'm going to go ahead and close this feature request.
If anyone still wants to argue for why it's better to use
an output encoding other than ascii+charrefs, post another
comment here & I'll consider re-opening it.