UTF-8 Decode of scanString

Anonymous
2012-08-09
2013-05-14

  • Anonymous
    2012-08-09

    Hi All,

    I need a bit of coding help to get UTF-8 strings decoded correctly after they come out of the scanString function.

    In my code :

    # current_line is a text line that may have encoded chars in it

          for the_command, start_pos, end_pos in sentence.scanString(current_line):
            command_name = the_command
           
            print("the_command = %s") % the_command
           
            temp =         # decode all pieces
            print("temp = %s") % temp

            if self.debug:
              print("\n\n%2d) Command = %s  : %s\n") % (line_count, command_name, the_command)

    Here is some output from the above code :

    command_list =    <== Input list - item 1 has encoding in it

    the_command =     <== string has been truncated at the encoded chars

    temp =

    3) Command = CreateFolder  :

    What I need is for all of the strings in the the_command list to be properly encoded.

    TIA!

    Steve

     
  • Paul McGuire
    2012-08-09

    Here is my test code trying to work with this Unicode text:

    s = ur'CreateFolder Simpl\xe9 in Sync'
    print s
    print repr(s)
    print s.decode('UTF-8')
    import codecs
    print codecs.decode(s,'UTF-8')
    print
    s = u'CreateFolder Simpl\xe9 in Sync'
    print s
    print s.decode('UTF-8')
    import codecs
    print codecs.decode(s,'UTF-8')
    

    The output I get is:

    CreateFolder Simpl\xe9 in Sync
    u'CreateFolder Simpl\\xe9 in Sync'
    CreateFolder Simpl\xe9 in Sync
    CreateFolder Simpl\xe9 in Sync
    CreateFolder Simplé in Sync
    Traceback (most recent call last):
      File "dd.py", line 11, in <module>
        print s.decode('UTF-8')
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 18: ordinal not in range(128)
    

    So I'm not sure what it is you are actually getting.  Can you post a snippet of the input text and the expression in your grammar that parses it?

    - Paul
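
    A note on the traceback above, written as a minimal Python 3 sketch rather than a rerun of Paul's Python 2.6 script. In Python 2, calling .decode() on a unicode object first encodes it implicitly with the ASCII codec, which is why a decode call ends in a UnicodeEncodeError. The variable names below are illustrative only.

```python
# Python 3 sketch: bytes get decoded, text (str) gets encoded.
text = "CreateFolder Simpl\xe9 in Sync"    # text, like the u'...' literal in the thread

utf8_bytes = text.encode("utf-8")          # what a UTF-8 encoded file would contain
round_trip = utf8_bytes.decode("utf-8")    # decoding the bytes gives the text back

# Python 2 attempted an implicit ASCII encode before decoding a unicode object.
# Doing that encode explicitly fails at the same character.
try:
    text.encode("ascii")
    ascii_error_position = None
except UnicodeEncodeError as exc:
    ascii_error_position = exc.start       # index of the offending character

print(repr(utf8_bytes))
print(ascii(round_trip))
print(ascii_error_position)
```

    The position reported here lines up with "position 18" in the traceback: the accented character sits at index 18 of the string.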
     

  • Anonymous
    2012-08-09

    Hi There Saintly Paul!

    Thanks so much for your continued dedication to this project… It is just awesome!

    Here is a little more setup on the code that I'm using to parse out the input lines from the file :

        # arguments to the passed in commands - 1 or more single words, or quoted strings
       
        valid_extra_chars = '_$:\\.()|-#?*=+@!%&'                           # add valid argument chars
       
        alphaWord = pyparsing.Word(pyparsing.alphanums + pyparsing.punc8bit + valid_extra_chars)
        args = pyparsing.OneOrMore(alphaWord | pyparsing.quotedString.setParseAction(pyparsing.removeQuotes))
        knownKeyWords = pyparsing.oneOf( valid_command_list, caseless=True )

        sentence = pyparsing.OneOrMore( knownKeyWords + args )              # Generic grammar syntax for a given command context

        for current_line in command_list:                                   # process/dispatch all commands in file in this loop

          if self.debug:
            print('\n\n Raw Command = %s') % (current_line)

          # scan the input line to see if it contains a known command + arguments
         
          for the_command, start_pos, end_pos in sentence.scanString(current_line):
            command_name = the_command
           
    #        the_command = codecs.decode(the_command, 'utf-8')
            print("the_command = %s") % the_command
           

            if self.debug:
              print("\n\n%2d) Command = %s  : %s\n") % (line_count, command_name, the_command)

    My Output :

    3) Raw Command = CreateFolder Simpl├⌐ in Sync  <=== raw input line
    the_command =     <=== truncated input after going through scanString()

    My desire is to get the parsed out command into an encoded state so I can pass that on to all of the defs() that process the commands.
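
    The "├⌐" in the raw command above looks like the two UTF-8 bytes of "é" shown by a console that uses code page 437. That the console used cp437 is an assumption, not something stated in the thread. A small Python 3 sketch of that reading, with illustrative variable names:

```python
# Assumption: the console rendered raw UTF-8 bytes using code page 437.
line = "CreateFolder Simpl\u00e9 in Sync"          # the intended text

utf8_bytes = line.encode("utf-8")                  # the accented letter becomes b'\xc3\xa9'
as_shown = utf8_bytes.decode("cp437")              # how a cp437 console would draw those bytes

# Reversing the two steps recovers the intended text.
repaired = as_shown.encode("cp437").decode("utf-8")

print(ascii(as_shown))
print(repaired == line)
```

    If this reading is right, the input line was a UTF-8 byte string, so each accented letter reached the grammar as two 8-bit characters.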

     
  • Steve Reiss
    2012-08-09

    Hey Paul!

    With A LOT of experimentation, I found that I needed to include a couple of changes to the grammar to allow arguments that have unicode characters in them.

        alphaWord = pyparsing.Word(pyparsing.alphanums + pyparsing.punc8bit + valid_extra_chars)

    TO

        alphaWord = pyparsing.Word(pyparsing.alphanums + pyparsing.punc8bit + pyparsing.alphas8bit + valid_extra_chars)

    This allowed the scanString() to work correctly on the input lines.

    Thanks again!

    Steve
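
    The effect of that change can be reproduced with a cut-down grammar. This is only a sketch under assumptions: it runs on Python 3 with already decoded text, valid_command_list is a hypothetical one-item stand-in for Steve's list, and valid_extra_chars is left out because it does not affect the result.

```python
import pyparsing as pp

valid_command_list = ["CreateFolder"]    # hypothetical stand-in for the real command list


def build_sentence(word_chars):
    """Build a reduced version of the grammar from the thread for a given character set."""
    word = pp.Word(word_chars)
    quoted = pp.quotedString.copy().setParseAction(pp.removeQuotes)
    args = pp.OneOrMore(word | quoted)
    keywords = pp.oneOf(valid_command_list, caseless=True)
    return pp.OneOrMore(keywords + args)


line = "CreateFolder Simpl\u00e9 in Sync"
base_chars = pp.alphanums + pp.punc8bit

# Without alphas8bit the Word stops at the accented letter, so the match is cut short.
before = [tokens.asList()
          for tokens, start, end in build_sentence(base_chars).scanString(line)]

# With alphas8bit the accented letter is a legal Word character.
after = [tokens.asList()
         for tokens, start, end in build_sentence(base_chars + pp.alphas8bit).scanString(line)]

print(ascii(before))
print(ascii(after))
```

    scanString() does not raise when a Word runs into a character outside its set. It simply ends the match there, which fits the truncated output reported earlier in the thread.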

     
  • Paul McGuire
    2012-08-10

    Great, glad you were able to work it out!

    (You can also add in other unicode chars beyond the 8-bit range - just use the u'\u####' notation, or pyparsing's srange function is helpful for this too.)

    Welcome to Pyparsing!
    - Paul
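
    A minimal sketch of the srange suggestion, assuming Python 3 and decoded text. The Latin Extended-A range and the sample word are chosen purely for illustration and do not come from the thread.

```python
import pyparsing as pp

# srange expands a regex-style bracket expression into a string of characters.
# Here: U+0100 through U+017F (Latin Extended-A), which lies beyond the 8-bit range.
latin_extended_a = pp.srange("[\u0100-\u017f]")

word = pp.Word(pp.alphanums + pp.alphas8bit + latin_extended_a)

# Sample word mixing ASCII, 8-bit and Latin Extended-A letters.
tokens = word.parseString("\u0141\u00f3d\u017a").asList()

print(len(latin_extended_a))
print(ascii(tokens))
```

    The same pattern should extend to any other range: build the extra characters once with srange and append them to the Word's character set.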