[Epydoc-devel] Re: Testing input encodings
From: Edward L. <ed...@gr...> - 2006-03-17 20:49:46
Daniele Varrazzo wrote:
> Hello Edward,
>
> I can see that epydoc 3 has support for encoding detection in
> source files, but I can't trigger it.
>
> I am trying to generate docs for the included package, but source
> encoding is not detected. I am using the following command line:
>
> epydoc -o out --docformat=restructuredtext -v inenctest
>
> You may use those files to check extensively whether ascii+xml entities
> work in edge cases: there is a koi8r-encoded file and a fancy utf8
> file with some Arabic and some Hebrew characters thrown in.
If you read PEP 263 [1] carefully, the encoding directive applies to
unicode strings, comments, and identifiers, but *not* to non-unicode
strings. In particular, under "Concepts", bullet 3, sub-bullet 5, it
says that the tokenizer should:
... [create] string objects from the Unicode literal data
by first reencoding the UTF-8 data into 8-bit string data
using the given file encoding
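That tokenizer step can be sketched as follows (my own illustration, not
epydoc or CPython code; the `koi8-r` coding and the Cyrillic sample text
are hypothetical): the source is decoded to unicode internally, and a
plain (non-u) string literal is then re-encoded into the declared file
encoding, so the resulting object holds raw bytes, not characters.

```python
# Hedged sketch of the PEP 263 tokenizer step for non-unicode literals.
file_encoding = 'koi8-r'                    # hypothetical declared coding
literal_text = u'\u0442\u0435\u0441\u0442'  # the literal as the tokenizer sees it
# Python 2 would store the literal re-encoded into the file encoding:
byte_string = literal_text.encode(file_encoding)
# Round-tripping through the same encoding recovers the original text.
assert byte_string.decode(file_encoding) == literal_text
```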
You can see that this is indeed Python's behavior by introspecting one
of the objects in your test modules:
>>> import inenctest.encoding_test_utf8
>>> print `inenctest.encoding_test_utf8.__doc__`
'Encoding epydoc test.\n\nCharacters in 128-155 range:\n\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9\xc2\xa5\xc2\xa9\xc2\xae\n\nGoing east\n\xd1\x82\xd0\xb7\xd0\xb3\xd1\x8f\xd0\xb8\xd1\x95\n\nSouth east\n\xd7\x90\xd7\x91\xd7\x92\xd7\x93\n\nMore south\n\xd8\xb3\xd8\xb4\xd8\xb5\xd8\xb6\xd8\xb7\n\n'
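As a quick check (my addition, not part of the original exchange), the
escapes in that repr are the file's UTF-8 bytes stored verbatim in a
non-unicode docstring; decoding one run of them as UTF-8 recovers the
intended accented characters:

```python
# First run of bytes from the repr above ("128-155 range" section).
raw = b'\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
# Decoding as UTF-8 yields the text the author actually typed.
print(raw.decode('utf-8'))
```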
The correct fix, then, is to use unicode strings as docstrings. That
way, your docstrings will behave correctly both within epydoc and within
Python. To me, this seems analogous to people who use:
    def f(x):
        """Split x on newlines ('\n')"""

where they should be using:

    def f(x):
        r"""Split x on newlines ('\n')"""
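To spell out the analogy (a small illustration of my own), the plain
docstring interprets the escape while the raw one preserves it, so the
two functions end up with different `__doc__` values:

```python
def f(x):
    """Split x on newlines ('\n')"""  # '\n' becomes a real newline
    return x.split('\n')

def g(x):
    r"""Split x on newlines ('\n')"""  # raw string keeps backslash + n
    return x.split('\n')

print('\\n' in f.__doc__)  # the plain docstring has no literal backslash-n
print('\\n' in g.__doc__)  # the raw docstring keeps the two characters
```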
The introspection system, at least, gives a warning message that I hoped
would let people know how to fix their mistake:
Warning: inenctest.encoding_test_utf8's docstring is not a unicode
string, but it contains non-ascii data -- treating it as
latin-1.
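Why latin-1 garbles the text (again my own aside, not from the thread):
latin-1 maps every byte to a character, so decoding UTF-8 bytes with it
never raises an error, it just produces mojibake. A minimal sketch, using
two of the Hebrew letters from the test file:

```python
utf8_bytes = b'\xd7\x90\xd7\x91'           # UTF-8 for two Hebrew letters
print(utf8_bytes.decode('utf-8'))          # the intended text
print(repr(utf8_bytes.decode('latin-1')))  # decodes "successfully", but garbled
```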
It looks like the parser doesn't print a similar warning, so if you
use --parse-only, you won't see it. Maybe I should add a warning
there, although it would be non-trivial to do. :-/
All that being said, I can see an argument for interpreting the
docstrings according to the encoding declared at the top of the module,
even though that's *not* how Python interprets them -- because that's
almost certainly how the author intended the docstrings to be
interpreted. One problem, though, is that it's not clear how I would
figure out the module's encoding if a user runs with --inspect-only.
The easiest way to handle this would be to not re-encode non-unicode
strings; but then variables would be displayed with incorrect values.
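For reference, PEP 263's detection rule itself is simple to sketch
(this is my own illustration, not epydoc's actual detection code): check
for a UTF-8 BOM, then look for a coding cookie in the first two lines.

```python
import re

# PEP 263 coding-cookie pattern: "coding: <name>" or "coding=<name>".
COOKIE = re.compile(rb'coding[:=]\s*([-\w.]+)')

def guess_encoding(source):
    """Guess a Python source file's encoding from its raw bytes."""
    if source.startswith(b'\xef\xbb\xbf'):  # UTF-8 BOM
        return 'utf-8'
    for line in source.splitlines()[:2]:    # cookie must be on line 1 or 2
        match = COOKIE.search(line)
        if match:
            return match.group(1).decode('ascii')
    return 'ascii'  # Python 2's default source encoding
```

Handling the BOM before scanning for a cookie is exactly the case the
p.s. below mentions.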
So at this point, I'm still undecided which way to go -- be compliant
with PEP 263 & Python, or be lenient and "do what I mean, not what I say."
-Edward
p.s., the two of your files that start with a BOM point out that my
parser system currently fails if the file starts with a BOM; that, I
will fix.
[1] http://www.python.org/doc/peps/pep-0263/