Python parsing module / Discussion / Help/Open Discussion: UTF-8 Decode of scanString

Anonymous - 2012-08-09

Hi All,

I am in need of a bit of coding help to get the encoding of utf-8 strings correct after they come out of the scanString function.

In my code :

# current_line is a text line that may have encoded chars in it

      for the_command, start_pos, end_pos in sentence.scanString(current_line):
        command_name = the_command

        print("the_command = %s") % the_command

        temp =         # decode all pieces
        print("temp = %s") % temp

        if self.debug:
          print("\n\n%2d) Command = %s : %s\n") % (line_count, command_name, the_command)

Here is some output from the above code :

command_list =    <== Input list - item 1 has encoding in it

the_command =     <== string has been truncated at the encoded chars

temp =

3) Command = CreateFolder :

What I need is for all of the strings in the the_command list to be properly encoded.

TIA!

Steve


Here is my test code trying to work with this Unicode text:

s = ur'CreateFolder Simpl\xe9 in Sync'
print s
print repr(s)
print s.decode('UTF-8')
import codecs
print codecs.decode(s,'UTF-8')
print
s = u'CreateFolder Simpl\xe9 in Sync'
print s
print s.decode('UTF-8')
import codecs
print codecs.decode(s,'UTF-8')

The output I get is:

CreateFolder Simpl\xe9 in Sync
u'CreateFolder Simpl\\xe9 in Sync'
CreateFolder Simpl\xe9 in Sync
CreateFolder Simpl\xe9 in Sync
CreateFolder Simplé in Sync
Traceback (most recent call last):
  File "dd.py", line 11, in <module>
    print s.decode('UTF-8')
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 18: ordinal not in range(128)

So I'm not sure what it is you are actually getting. Can you post a snippet of the input text and the expression in your grammar that parses it?
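
A note on the traceback in the test output above: in Python 2, calling decode() on a string that is already unicode first encodes it implicitly with the ascii codec, which is where the UnicodeEncodeError at position 18 comes from. The following is a minimal sketch, not part of the original test script, that makes that hidden step explicit. The variable names are illustrative only, and it is written so that it should also run on Python 3.

```python
# Sketch only: reproduces the implicit ascii encode that Python 2 performs
# when decode() is called on text that is already unicode.
text = u'CreateFolder Simpl\xe9 in Sync'   # already decoded text

try:
    text.encode('ascii')                   # the hidden step behind the traceback
    hidden_step_failed = False
except UnicodeEncodeError as exc:
    hidden_step_failed = True
    failing_position = exc.start           # index of the first non-ASCII character

# The intended direction: encode text to bytes, decode bytes back to text.
utf8_bytes = text.encode('utf-8')
round_trip = utf8_bytes.decode('utf-8')
```

In other words, decode() belongs on the raw bytes read from the file, and encode() belongs on text that has already been decoded.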

Paul

Anonymous - 2012-08-09

Hi There Saintly Paul!

Thanks so much for your continued dedication to this project… It is just awesome!

Here is a little more of the setup code that I use to parse the input lines from the file:

    # arguments to the passed in commands - 1 or more single words, or quoted strings

    valid_extra_chars = '_$:\\.()|-#?*=+@!%&'                           # add valid argument chars

    alphaWord = pyparsing.Word(pyparsing.alphanums + pyparsing.punc8bit + valid_extra_chars)
    args = pyparsing.OneOrMore(alphaWord | pyparsing.quotedString.setParseAction(pyparsing.removeQuotes))
    knownKeyWords = pyparsing.oneOf( valid_command_list, caseless=True )

    sentence = pyparsing.OneOrMore( knownKeyWords + args )              # Generic Grammar syntax for a given command context

    for current_line in command_list:                                   # process/dispatch all commands in file in this loop

      if self.debug:
        print('\n\n Raw Command = %s') % (current_line)

      # scan the input line to see if it contains a known command + arguments

      for the_command, start_pos, end_pos in sentence.scanString(current_line):
        command_name = the_command

#        the_command = codecs.decode(the_command, 'utf-8')
        print("the_command = %s") % the_command


        if self.debug:
          print("\n\n%2d) Command = %s : %s\n") % (line_count, command_name, the_command)

My Output :

3) Raw Command = CreateFolder Simpl├⌐ in Sync <=== raw input line
the_command =     <=== truncated input after going through scanString()

What I want is to get the parsed command into a properly encoded state so I can pass it on to all of the functions that process the commands.
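
An aside on the raw command line shown in the output above: the characters ├⌐ are what the two UTF-8 bytes for é look like when they are displayed using code page 437, which suggests (this is an inference, not something stated in the thread) that current_line holds undecoded UTF-8 bytes. A small sketch of that observation, with illustrative variable names:

```python
# Sketch only: the same bytes viewed through two different codecs.
raw_bytes = b'CreateFolder Simpl\xc3\xa9 in Sync'   # assumed: UTF-8 bytes as read from the file

as_console_showed = raw_bytes.decode('cp437')       # how a cp437 console would render the bytes
as_text = raw_bytes.decode('utf-8')                 # the text the bytes actually represent
```

Decoding each line to text once, before handing it to scanString(), is one way to make sure the tokens that come out are already unicode.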


Steve Reiss - 2012-08-09

Hey Paul!

With A LOT of experimentation, I found that I needed to include a couple of changes to the grammar to allow arguments that have unicode characters in them.

alphaWord = pyparsing.Word(pyparsing.alphanums + pyparsing.punc8bit + valid_extra_chars)

TO

alphaWord = pyparsing.Word(pyparsing.alphanums + pyparsing.punc8bit + pyparsing.alphas8bit + valid_extra_chars)

This allowed the scanString() to work correctly on the input lines.
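
To illustrate the difference between the two definitions, here is a cut-down sketch of the grammar run against decoded text. It assumes a pyparsing install that still provides the camelCase names used in this thread, it leaves out the quotedString alternative, and the contents of valid_command_list are made up for the example.

```python
import pyparsing

valid_extra_chars = '_$:\\.()|-#?*=+@!%&'
valid_command_list = ['CreateFolder', 'DeleteFolder']   # hypothetical command names

# Word definition before and after the change described above
old_word = pyparsing.Word(pyparsing.alphanums + pyparsing.punc8bit + valid_extra_chars)
new_word = pyparsing.Word(pyparsing.alphanums + pyparsing.punc8bit
                          + pyparsing.alphas8bit + valid_extra_chars)

def scan(word, line):
    """Return the token list of every command found in line."""
    keyword = pyparsing.oneOf(valid_command_list, caseless=True)
    sentence = pyparsing.OneOrMore(keyword + pyparsing.OneOrMore(word))
    return [tokens.asList() for tokens, start, end in sentence.scanString(line)]

line = u'CreateFolder Simpl\xe9 in Sync'
truncated = scan(old_word, line)   # stops at the accented character
complete = scan(new_word, line)    # keeps the whole argument list
```

With the old definition the match ends at "Simpl", which is the truncation seen earlier. As far as I can tell, the same change also helps on undecoded Python 2 byte strings because the two UTF-8 bytes for é, \xc3 and \xa9, fall inside alphas8bit and punc8bit respectively.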

Thanks again!

Steve


Paul McGuire - 2012-08-10

Great, glad you were able to work it out!

(You can also add in other unicode chars beyond the 8-bit range - just use the u'\u####' notation, or pyparsing's srange function is helpful for this too.)
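
As a rough sketch of the srange suggestion, assuming the input has already been decoded to unicode text: the range below (Latin Extended-A, U+0100 to U+017F) and the sample word are picked only for illustration.

```python
import pyparsing

# Expand a character range beyond the 8-bit set into a plain string of characters
latin_extended_a = pyparsing.srange(u'[\u0100-\u017f]')

wide_word = pyparsing.Word(pyparsing.alphanums + pyparsing.alphas8bit + latin_extended_a)

name = u'\u0141\xf3d\u017a'                 # sample word mixing ASCII, 8-bit and wider characters
matched = wide_word.parseString(name)[0]
```

The string returned by srange can be concatenated with alphanums, alphas8bit and so on, in the same way as valid_extra_chars is in the grammar above.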

Welcome to Pyparsing!
- Paul
