Here is my test code trying to work with this Unicode text:
s = ur'CreateFolder Simpl\xe9 in Sync'
print s
print repr(s)
print s.decode('UTF-8')
import codecs
print codecs.decode(s, 'UTF-8')
print
s = u'CreateFolder Simpl\xe9 in Sync'
print s
print s.decode('UTF-8')
import codecs
print codecs.decode(s, 'UTF-8')
The output I get is:
CreateFolder Simpl\xe9 in Sync
u'CreateFolder Simpl\\xe9 in Sync'
CreateFolder Simpl\xe9 in Sync
CreateFolder Simpl\xe9 in Sync
CreateFolder Simplé in Sync
Traceback (most recent call last):
  File "dd.py", line 11, in <module>
    print s.decode('UTF-8')
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 18: ordinal not in range(128)
So I'm not sure what it is you are actually getting. Can you post a snippet of the input text and the expression in your grammar that parses it?
Paul
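For anyone following along on Python 3, here is a rough sketch of what the test above is showing. It assumes Python 3 rather than the Python 2.6 used in the thread, and the names raw, text, data and ascii_error are illustrative only. In Python 2, calling .decode() on a unicode object first encodes it implicitly with the ASCII codec, which is where the UnicodeEncodeError in the traceback comes from.

```python
# Python 3 sketch (assumption: the thread itself used Python 2.6).
raw = r'CreateFolder Simpl\xe9 in Sync'   # raw literal: backslash, 'x', 'e', '9' stay as 4 chars
text = 'CreateFolder Simpl\xe9 in Sync'   # normal literal: \xe9 becomes one accented character

print(len(raw), len(text))

# Encoding turns text into bytes; decoding turns bytes back into text.
data = text.encode('utf-8')
assert data.decode('utf-8') == text

# Python 2's unicode.decode() implicitly did the equivalent of this ASCII
# encode first, which fails on the accented character.
try:
    text.encode('ascii')
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

print(ascii_error)
```

The takeaway is that text which is already Unicode does not need another decode; only byte strings do.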
Anonymous
-
2012-08-09
Hi All,
I need a bit of coding help to get UTF-8 strings correctly encoded after they come out of the scanString function.
In my code :
# current_line is a text line that may have encoded chars in it
for the_command, start_pos, end_pos in sentence.scanString(current_line):
    command_name = the_command
    print("the_command = %s") % the_command
    temp = # decode all pieces
    print("temp = %s") % temp
    if self.debug:
        print("\n\n%2d) Command = %s : %s\n") % (line_count, command_name, the_command)
Here is some output from the above code :
command_list = <== Input list - item 1 has encoding in it
the_command = <== string has been truncated at the encoded chars
temp =
3) Command = CreateFolder :
What I need is for all of the strings in the the_command list to be properly encoded.
TIA!
Steve
Hi There Saintly Paul!
Thanks so much for your continued dedication to this project… It is just awesome!
Here is a little more of the setup code that I use to parse the input lines from the file:
# arguments to the passed in commands - 1 or more single words, or quoted strings
valid_extra_chars = '_$:\\.()|-#?*=+@!%&' # add valid argument chars
alphaWord = pyparsing.Word(pyparsing.alphanums + pyparsing.punc8bit + valid_extra_chars)
args = pyparsing.OneOrMore(alphaWord | pyparsing.quotedString.setParseAction(pyparsing.removeQuotes))
knownKeyWords = pyparsing.oneOf( valid_command_list, caseless=True )
sentence = pyparsing.OneOrMore( knownKeyWords + args ) # Generic grammar syntax for a given command context
for current_line in command_list: # process/dispatch all commands in file in this loop
    if self.debug:
        print('\n\n Raw Command = %s') % (current_line)
    # scan the input line to see if it contains a known command + arguments
    for the_command, start_pos, end_pos in sentence.scanString(current_line):
        command_name = the_command
        # the_command = codecs.decode(the_command, 'utf-8')
        print("the_command = %s") % the_command
        if self.debug:
            print("\n\n%2d) Command = %s : %s\n") % (line_count, command_name, the_command)
My Output :
3) Raw Command = CreateFolder Simplé in Sync <=== raw input line
the_command = <=== truncated input after going through scanString()
What I want is to get the parsed command into a properly encoded state so I can pass it on to all of the functions that process the commands.
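One possible shape for a "decode all pieces" step, sketched for Python 3. This is not the expression from the posted code: the helper name to_text and the sample token list are hypothetical, made up only to illustrate decoding byte strings while leaving text untouched.

```python
def to_text(token, encoding='utf-8'):
    """Return token as text, decoding it only if it is a byte string."""
    if isinstance(token, bytes):
        return token.decode(encoding)
    return token

# Hypothetical parsed result: a mix of byte strings and text.
the_command = [b'CreateFolder', 'Simpl\xe9'.encode('utf-8'), 'in', b'Sync']

# Decode every piece so downstream functions only ever see text.
temp = [to_text(piece) for piece in the_command]
print(temp)
```

A step like this only helps once the parser returns the full tokens; it cannot restore text that the grammar never matched.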
Hey Paul!
After A LOT of experimentation, I found that I needed to make a couple of changes to the grammar to allow arguments that have Unicode characters in them. I changed
alphaWord = pyparsing.Word(pyparsing.alphanums + pyparsing.punc8bit + valid_extra_chars)
to
alphaWord = pyparsing.Word(pyparsing.alphanums + pyparsing.punc8bit + pyparsing.alphas8bit + valid_extra_chars)
This allowed scanString() to work correctly on the input lines.
Thanks again!
Steve
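To see why that one extra term matters, here is a small sketch that uses Python's re module as a stand-in for pyparsing.Word, so it runs without pyparsing installed. The two Latin-1 ranges are assumptions about what pyparsing's punc8bit and alphas8bit constants contain, and word_pattern is a hypothetical helper, not part of any library.

```python
import re
import string

# Assumed Latin-1 ranges behind pyparsing's punc8bit and alphas8bit constants.
PUNC8BIT = '\xa1-\xbf\xd7\xf7'
ALPHAS8BIT = '\xc0-\xd6\xd8-\xf6\xf8-\xff'
ALPHANUMS = string.ascii_letters + string.digits
VALID_EXTRA_CHARS = '_$:\\.()|-#?*=+@!%&'

def word_pattern(extra_ranges):
    """Build a regex matching one run of allowed characters, like a Word."""
    literal = re.escape(ALPHANUMS + VALID_EXTRA_CHARS)
    return re.compile('[' + literal + extra_ranges + ']+')

before = word_pattern(PUNC8BIT)               # original grammar
after = word_pattern(PUNC8BIT + ALPHAS8BIT)   # grammar with accented letters added

line = 'Simpl\xe9'
print(before.match(line).group())  # stops before the accented letter
print(after.match(line).group())   # consumes the whole word
```

Under these assumptions the accented letter falls in the alphas8bit range but not in punc8bit, which would explain the truncated match.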
Great, glad you were able to work it out!
(You can also add in other unicode chars beyond the 8-bit range - just use the u'\u####' notation, or pyparsing's srange function is helpful for this too.)
Welcome to Pyparsing!
- Paul
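As a rough illustration of the \u notation mentioned above, the sketch below builds a character set that goes beyond the 8-bit range. It is written for Python 3 and uses only the standard library; char_range and is_word are hypothetical helpers, and the Greek range is just an example choice. A string like greek_lower could in principle be appended to the characters given to Word, in the same way alphas8bit was.

```python
def char_range(first, last):
    """Return a string of every character from first to last, inclusive."""
    return ''.join(chr(code) for code in range(ord(first), ord(last) + 1))

# \u escapes name code points above 0xFF, here the Greek lowercase letters.
greek_lower = char_range('\u03b1', '\u03c9')
allowed = set('abcdefghijklmnopqrstuvwxyz' + greek_lower)

def is_word(text):
    """True if text is non-empty and uses only allowed characters."""
    return bool(text) and all(ch in allowed for ch in text)

print(is_word('\u03bb\u03bf\u03b3\u03bf\u03c2'))
print(is_word('Simpl\xe9'))
```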