[Epydoc-devel] Re: Testing input encodings
From: Edward L. <ed...@gr...> - 2006-03-17 20:49:46
Daniele Varrazzo wrote:
> Hello Edward,
>
> I can see that epydoc 3 has support for encoding detection in
> source files, but I can't trigger it.
>
> I am trying to generate docs for the included package, but source
> encoding is not detected. I am using the following command line:
>
> epydoc -o out --docformat=restructuredtext -v inenctest
>
> You may use those files to check extensively whether ascii+xml entities
> work in edge cases: there is a koi8r-encoded file and a fancy utf8
> file with some Arabic and some Hebrew characters thrown in.
If you read PEP 263 [1] carefully, the encoding directive applies to
unicode strings, comments, and identifiers, but *not* to non-unicode
strings. In particular, under "Concepts", bullet 3, sub-bullet 5, it
says that the tokenizer should:
... [create] string objects from the Unicode literal data
by first reencoding the UTF-8 data into 8-bit string data
using the given file encoding
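That tokenizer step can be sketched as follows (my own illustration, not
epydoc or CPython code; the `koi8-r` coding and the Cyrillic sample text
are hypothetical): the source is decoded to unicode internally, and a
plain (non-u) string literal is then re-encoded into the declared file
encoding, so the resulting object holds raw bytes, not characters.

```python
# Hedged sketch of the PEP 263 tokenizer step for non-unicode literals.
file_encoding = 'koi8-r'                    # hypothetical declared coding
literal_text = u'\u0442\u0435\u0441\u0442'  # the literal as the tokenizer sees it
# Python 2 would store the literal re-encoded into the file encoding:
byte_string = literal_text.encode(file_encoding)
# Round-tripping through the same encoding recovers the original text.
assert byte_string.decode(file_encoding) == literal_text
```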
You can see that this is indeed Python's behavior by introspecting one
of the objects in your test modules:
>>> import inenctest.encoding_test_utf8
>>> print `inenctest.encoding_test_utf8.__doc__`
'Encoding epydoc test.\n\nCharacters in 128-155 range:\n\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9\xc2\xa5\xc2\xa9\xc2\xae\n\nGoing east\n\xd1\x82\xd0\xb7\xd0\xb3\xd1\x8f\xd0\xb8\xd1\x95\n\nSouth east\n\xd7\x90\xd7\x91\xd7\x92\xd7\x93\n\nMore south\n\xd8\xb3\xd8\xb4\xd8\xb5\xd8\xb6\xd8\xb7\n\n'
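As a quick check (my addition, not part of the original exchange), the
escapes in that repr are the file's UTF-8 bytes stored verbatim in a
non-unicode docstring; decoding one run of them as UTF-8 recovers the
intended accented characters:

```python
# First run of bytes from the repr above ("128-155 range" section).
raw = b'\xc3\xa0\xc3\xa8\xc3\xac\xc3\xb2\xc3\xb9'
# Decoding as UTF-8 yields the text the author actually typed.
print(raw.decode('utf-8'))
```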
The correct fix, then, is to use unicode strings as docstrings. That
way, your docstrings will behave correctly both within epydoc and within
Python. To me, this seems analogous to people who use:
    def f(x):
        """Split x on newlines ('\n')"""

where they should be using:

    def f(x):
        r"""Split x on newlines ('\n')"""
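To spell out the analogy (a small illustration of my own), the plain
docstring interprets the escape while the raw one preserves it, so the
two functions end up with different `__doc__` values:

```python
def f(x):
    """Split x on newlines ('\n')"""  # '\n' becomes a real newline
    return x.split('\n')

def g(x):
    r"""Split x on newlines ('\n')"""  # raw string keeps backslash + n
    return x.split('\n')

print('\\n' in f.__doc__)  # the plain docstring has no literal backslash-n
print('\\n' in g.__doc__)  # the raw docstring keeps the two characters
```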
The introspection system, at least, gives a warning message that I hoped
would let people know how to fix their mistake:
Warning: inenctest.encoding_test_utf8's docstring is not a unicode
string, but it contains non-ascii data -- treating it as
latin-1.
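Why latin-1 garbles the text (again my own aside, not from the thread):
latin-1 maps every byte to a character, so decoding UTF-8 bytes with it
never raises an error, it just produces mojibake. A minimal sketch, using
two of the Hebrew letters from the test file:

```python
utf8_bytes = b'\xd7\x90\xd7\x91'           # UTF-8 for two Hebrew letters
print(utf8_bytes.decode('utf-8'))          # the intended text
print(repr(utf8_bytes.decode('latin-1')))  # decodes "successfully", but garbled
```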
It looks like the parser doesn't print a similar warning, so if you
use --parse-only, you won't see it. Maybe I should add a warning
there, although it would be non-trivial to do. :-/
All that being said, I can see an argument for interpreting the
docstrings according to the encoding declared at the top of the module,
even though that's *not* how Python interprets them -- because that's
almost certainly how the author intended the docstrings to be
interpreted. One problem, though, is that it's not clear how I would
figure out the module's encoding if a user runs with --inspect-only.
The easiest way to handle this would be to not re-encode non-unicode
strings; but then variables would be displayed with incorrect values.
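For reference, PEP 263's detection rule itself is simple to sketch
(this is my own illustration, not epydoc's actual detection code): check
for a UTF-8 BOM, then look for a coding cookie in the first two lines.

```python
import re

# PEP 263 coding-cookie pattern: "coding: <name>" or "coding=<name>".
COOKIE = re.compile(rb'coding[:=]\s*([-\w.]+)')

def guess_encoding(source):
    """Guess a Python source file's encoding from its raw bytes."""
    if source.startswith(b'\xef\xbb\xbf'):  # UTF-8 BOM
        return 'utf-8'
    for line in source.splitlines()[:2]:    # cookie must be on line 1 or 2
        match = COOKIE.search(line)
        if match:
            return match.group(1).decode('ascii')
    return 'ascii'  # Python 2's default source encoding
```

Handling the BOM before scanning for a cookie is exactly the case the
p.s. below mentions.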
So at this point, I'm still undecided which way to go -- be compliant
with PEP 263 & Python, or be lenient and "do what I mean, not what I say."
-Edward
p.s., the two of your files that start with a BOM point out that my
parser system currently fails if the file starts with a BOM; that, I
will fix.
[1] http://www.python.org/doc/peps/pep-0263/