Trouble encoding French strings

Help
drsss
2007-06-11
2013-04-26
  • drsss
    2007-06-11

    Hi,

    I have trouble encoding French strings.
    My encoding setting in DrPython is: Options > Default Encoding: utf-8

    Here is the code:

    #-*- coding:Latin-1 -*-
    good = u"é"
    print good

    And the output:
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

    If I change the coding line to utf-8:

    #-*- coding:utf-8 -*-
    good = u"é"
    print good

    The output is then:
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 0: unexpected end of data

    If I remove the "u" before the opening quote, it works in DrPython, but not in the DOS console.
    With PyScripter and IDLE I don't have any problem, but I want to continue my exercises with DrPython on both Windows and Linux.

    Do you have any idea? Thanks.
    drsss

     
    • Sorry, ah, the Unicode stuff again. I have added a bug report about that, but I cannot promise when I (we) will be able to deliver a solution. Maybe you have some experience with the Unicode stuff yourself? ;)

       
      • DomDom
        2007-06-12

        I have met the same kind of problem.
        A program may run well with DrPython and not with IDLE, and vice versa.
        The same kind of behavior exists between Win XP and Mac OS X...
        I thought I was the only one to get this!
        I hope the bug will be fixed...
        Dominique

         
        • Unicode is very complex and confusing. Although there are faults in DrPython your main problem is that you may not be using your computer correctly.

          In Python there are two types of character string: the ordinary string, consisting of bytes, and the Unicode string, consisting of integers (code points). When you press a key on your keyboard, the key position number is converted into a byte code in the computer. To display this character on the screen, the byte code is converted to a symbol by looking it up in what is sometimes called a "code page". To find out what encoding you are using, enter:

          >>> import locale
          >>> locale.getdefaultlocale()

          In your case the encoding is probably either "latin-1" or "utf-8". 
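As a slightly fuller sketch of the same check, the snippet below gathers the encodings the interpreter has picked up from its environment. The helper name `report_encodings` is made up for illustration, and it uses `locale.getpreferredencoding()` rather than `getdefaultlocale()` only because the former is available on both old and current interpreters.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python has picked up from the environment."""
    return {
        # What the operating system locale suggests for text data.
        "preferred": locale.getpreferredencoding(),
        # What Python falls back to for implicit conversions.
        "default": sys.getdefaultencoding(),
        # What print will use; may be None when output goes to a pipe.
        "stdout": getattr(sys.stdout, "encoding", None),
        # What is used for file names.
        "filesystem": sys.getfilesystemencoding(),
    }


for name, value in sorted(report_encodings().items()):
    print("%-10s %s" % (name, value))
```

Running it once inside DrPython and once in a plain console should show whether the two environments disagree, in particular on the "stdout" line.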

          Python does nothing special with byte strings, but Unicode strings are different. Python must be able to convert byte strings into integer strings on input and do the reverse on output. The DrPython code gets the terminology the wrong way round. So far as Python is concerned, converting from a byte string to an integer string is called "decoding", and the reverse is "encoding". If you put an accented character into a Unicode literal (u"...") then you must tell the Python interpreter how to decode it. You do this with a comment at the start of the file (but after any "#!" line).

          If you use a Unicode string in a print statement you must tell Python what encoding to use. You might think that Python should know what to do, but unfortunately it defaults to ascii, which does not contain any accented characters. To avoid the UnicodeEncodeError in your first example, edit the file site.py (in my case the full file name is /etc/python2.5/site.py) and, in the setencoding() function, change the first "if 0:" to "if True:". You can ignore the comment about this being experimental, which seems to me to be out of date.
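The encode/decode distinction can be tried directly, without touching site.py. This is only a minimal sketch: the character is written as the escape `\xe9` so that the file's own encoding does not matter, and `try_ascii` is a hypothetical helper, not anything from DrPython.

```python
text = u"\xe9"  # the single character e-acute, as a Unicode string

# Encoding: Unicode string -> byte string. The result depends on the codec.
latin1_bytes = text.encode("latin-1")  # one byte, 0xE9
utf8_bytes = text.encode("utf-8")      # two bytes, 0xC3 0xA9

# Decoding: byte string -> Unicode string, using the same codec.
round_trip = latin1_bytes.decode("latin-1")


def try_ascii(value):
    """Report what happens when a Unicode string is encoded as ascii."""
    try:
        value.encode("ascii")
        return "ok"
    except UnicodeEncodeError:
        return "UnicodeEncodeError"


print("%d %d %s %s" % (len(latin1_bytes), len(utf8_bytes),
                       round_trip == text, try_ascii(text)))
# prints: 1 2 True UnicodeEncodeError
```

The last value is the same failure as in the first example of the question: the ascii codec has no byte for the accented character. Encoding explicitly before output, for example `text.encode("utf-8")`, avoids relying on the default.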

          In the second example you appear to have typed a latin-1 encoded character into a literal and then confused the Python interpreter by claiming that it is utf-8. When you saved the change, DrPython should, ideally, have noticed your comment and saved the script in the appropriate encoding. You could try specifying the encoding in DrPython's output dialog, but this actually does nothing.
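The mismatch described here can be reproduced in a few lines. This is a sketch under the assumption that the editor wrote the character as the single latin-1 byte 0xE9; `decode_or_error` is a hypothetical helper used only to keep the example short.

```python
# The byte an editor would write for e-acute when saving as latin-1.
raw = u"\xe9".encode("latin-1")


def decode_or_error(data, encoding):
    """Decode data, returning None if the bytes are invalid for the codec."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Declared encoding matches the bytes on disk: the character comes back.
as_latin1 = decode_or_error(raw, "latin-1")

# Declared encoding is utf-8 but the byte is latin-1: decoding fails,
# because 0xE9 announces a multi-byte sequence that never arrives.
as_utf8 = decode_or_error(raw, "utf-8")

# The opposite mismatch does not fail, it silently yields two wrong characters.
mojibake = u"\xe9".encode("utf-8").decode("latin-1")

print("%r %r %r" % (as_latin1, as_utf8, mojibake))
```

So the coding comment and the encoding the editor actually saves in have to agree; changing only one of them produces either an error or garbled text.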

          Some differences between applications may occur because they use different edit controls and different means of communicating with the terminal. In wxPython the styled text control (based on Scintilla) uses utf-8, but Unicode strings in Python are effectively utf-16 (on narrow builds such as the Windows one; wide builds store UCS-4). When child processes communicate via pipes, they have no means of telling what encoding the parent process requires. However, if a pseudo terminal is used instead, the child can query the terminal to determine what encoding to use.
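One way a parent process can tell a piped child which encoding to use is the `PYTHONIOENCODING` environment variable. This is an assumption that goes beyond the thread: the variable is only honoured by Python 2.6 and later, so it would not help with the Python 2.5 setup mentioned above. The function name `run_child` is made up for this sketch.

```python
import os
import subprocess
import sys


def run_child(encoding):
    """Run a child interpreter over a pipe and return the raw bytes it wrote."""
    # Copy the parent's environment and add the encoding hint for the child.
    env = dict(os.environ, PYTHONIOENCODING=encoding)
    # The child writes a single e-acute to its standard output.
    code = "import sys; sys.stdout.write(u'\\xe9')"
    proc = subprocess.Popen([sys.executable, "-c", code],
                            stdout=subprocess.PIPE, env=env)
    out, _ = proc.communicate()
    return out


for enc in ("latin-1", "utf-8"):
    print("%s -> %r" % (enc, run_child(enc)))
```

The same character arrives as one byte or two depending on what the parent asked for, which is the information a bare pipe cannot carry by itself.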

           
    • drsss
      2007-06-11

      No, I haven't. Sorry.
      I'm a beginner in programming, and self-taught.
      Don't worry, "The ox is slow, but the ground is patient" (Lao Tseu)

      Drsss

       