[Epydoc-devel] Re: Testing input encodings
From: Edward L. <ed...@gr...> - 2006-03-17 20:49:46
Daniele Varrazzo wrote:
> Hello Edward,
>
> I can see that epydoc 3 has support for encoding detection in source
> files, but I can't trigger it.
>
> I am trying to generate docs for the included package, but the source
> encoding is not detected.  I am using the following command line:
>
>     epydoc -o out --docformat=restructuredtext -v inenctest
>
> You may use those files to check extensively whether ascii+xml
> entities work in edge cases: there is a koi8r-encoded file and a
> fancy utf8 file with some Arabic and some Hebrew characters thrown
> in.

If you read PEP 263 [1] carefully, the encoding directive applies to
unicode strings, comments, and identifiers, but *not* to non-unicode
strings.  In particular, under "Concepts", bullet 3, sub-bullet 5, it
says that the tokenizer should:

    ... [create] string objects from the Unicode literal data by first
    reencoding the UTF-8 data into 8-bit string data using the given
    file encoding

You can see that this is indeed Python's behavior by introspecting one
of the objects in your test modules:

    >>> import inenctest.encoding_test_utf8
    >>> print `inenctest.encoding_test_utf8.__doc__`
    'Encoding epydoc test.\n\nCharacters in 128-155 range:\n\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9\xc2\xa5\xc2\xa9\xc2\xae\n\nGoing east\n\xd1\x82\xd0\xb7\xd0\xb3\xd1\x8f\xd0\xb8\xd1\x95\n\nSouth east\n\xd7\x90\xd7\x91\xd7\x92\xd7\x93\n\nMore south\n\xd8\xb3\xd8\xb4\xd8\xb5\xd8\xb6\xd8\xb7\n\n'

The correct fix, then, is to use unicode strings as docstrings.  That
way, your docstrings will behave correctly both within epydoc and
within Python.  (See the sketch below the footnote.)  To me, this seems
analogous to people who write:

    def f(x):
        """Split x on newlines ('\n')"""

where they should be writing:

    def f(x):
        r"""Split x on newlines ('\n')"""

The introspection system, at least, gives a warning message that I
hoped would let people know how to fix their mistake:

    Warning: inenctest.encoding_test_utf8's docstring is not a unicode
    string, but it contains non-ascii data -- treating it as latin-1.

It looks like the parser doesn't print a similar warning, so if you use
--parse-only, you won't see this warning.  Maybe I should add a warning
there, although it would be non-trivial to do. :-/

All that being said, I can see an argument for interpreting docstrings
according to the encoding specified at the top of the module, despite
the fact that that's *not* how Python interprets them -- because that's
almost certainly how the author intends the docstrings to be
interpreted.  One problem, though, is that it's not clear how I would
determine the module's encoding if a user runs with --inspect-only.
The easiest way to do this would be to not re-encode non-unicode
strings; but then variables would be displayed with incorrect values.

So at this point, I'm still undecided which way to go -- be compliant
with PEP 263 and Python, or be lenient and "do what I mean, not what I
say."

-Edward

p.s.  The two of your files that start with a BOM point out that my
parser system currently fails if the file starts with a BOM; that, I
will fix.

[1] http://www.python.org/doc/peps/pep-0263/
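
For concreteness, here is a minimal sketch of the recommended fix (a
hypothetical module, not one of the inenctest files): declare the
encoding per PEP 263 and make the docstring a unicode literal.

    # -*- coding: utf-8 -*-

    def split_lines(text):
        u"""Split text on newlines ('\\n').

        Because this docstring is a *unicode* literal, the non-ascii
        characters in it (e.g. àèì) are decoded using the declared
        file encoding, so introspection and parsing see the same,
        correctly-decoded text.
        """
        return text.split('\n')

With this change, split_lines.__doc__ is a unicode object rather than
an 8-bit string, so the "treating it as latin-1" warning quoted above
would no longer fire.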
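
On the --inspect-only question, one conceivable approach (purely a
sketch of an assumption, not something epydoc does) would be to re-read
the source file named by the module's __file__ attribute and apply the
coding-declaration pattern given in PEP 263:

    import re

    _CODING_RE = re.compile(r'coding[:=]\s*([-\w.]+)')  # pattern from PEP 263

    def guess_source_encoding(filename):
        # PEP 263 requires the declaration to appear on line 1 or 2.
        f = open(filename, 'rb')
        try:
            for line in [f.readline(), f.readline()]:
                m = _CODING_RE.search(line)
                if m:
                    return m.group(1)
        finally:
            f.close()
        return None

Even so, __file__ may name a .pyc with no source alongside it, which is
presumably part of why this is not clear-cut.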
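
As for the BOM failure mentioned in the postscript, the eventual fix
could be as simple as stripping a leading UTF-8 signature before the
text reaches the tokenizer (again only a sketch of one possible fix,
not epydoc's actual code):

    import codecs

    def strip_utf8_bom(data):
        # A UTF-8 signature at the start of a file implies utf-8 (and,
        # per PEP 263, must agree with any coding declaration); remove
        # it so the tokenizer never sees it.
        if data.startswith(codecs.BOM_UTF8):
            return data[len(codecs.BOM_UTF8):]
        return data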