Re: [Pyparsing] PayPal IPN message parsing
From: Eike W. <eik...@gm...> - 2011-01-06 18:41:04
Hello Werner!

On Thursday 06.01.2011 12:44:53 Werner F. Bruhin wrote:
> I am having some problems decoding these messages.
>
> The data comes in as an email message with a defined content type as
> "Content-Type: text/plain", however it is really Content-Type:
> text/plain; charset="windows-1252", so I read it in with
>
> thisfile = codecs.open(regFile, "r", "windows-1252").

I think this is correct. You convert the file from "windows-1252" to Unicode prior to parsing. In your own code you must then write string constants as Unicode literals, like `u"Göran"`.

You should IMHO also encode your program's source code with UTF-8 and have the following as the first line:

    # -*- coding: utf-8 -*-

IMHO IPython has additional Unicode problems. This confused me while I was writing this e-mail; maybe something similar is happening on your computer too.

> The parsing works fine except on things like:
>
> address_name = Göran Petterson
>
> Which I parse with:
> alphanums = pyp.Word(pyp.alphanums)
>
> # address
> str_add_name = pyp.Literal("address_name =").suppress() +\
>     alphanums + pyp.restOfLine
> add_name = str_add_name.setParseAction(self.str_add_nameAction)
>
> But I get in str_add_nameAction:
> ([u'G', u'\xf6ran Petterson\r'], {})

`pyp.alphanums` is a string, and it does not contain the character "ö", so `pyp.Word(pyp.alphanums)` stops matching after the "G". See:

    In [1]: import pyparsing as pyp

    In [2]: pyp.alphanums
    Out[2]: 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'

I think getting a suitable parser with the least typing would be something like:

    alphanums = pyp.CharsNotIn(" ,.")

or

    str_add_name = pyp.Literal("address_name =").suppress() +\
                   pyp.restOfLine

And keep in mind that foreigners write their names in funny ways. Older Germans, for example, frequently have forenames with hyphens, like "Karl-Heinz" or "Franz-Josef".

> The raw data at this point is "address_name = G\xf6ran Petterson"

The code for "ö" is F6 both in windows-1252 and in Unicode, so I think this is correct. The "\xf6" is simply how repr(u"Göran Petterson") displays the "ö".

http://en.wikipedia.org/wiki/Windows-1252
http://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF

All the best,
Eike.
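
The point about `pyp.alphanums` can be checked without pyparsing installed. The following is only a minimal sketch using the standard library, assuming Python 3. The helpers `leading_run` and `leading_run_excluding` are made-up names that merely imitate how `pyp.Word(...)` and `pyp.CharsNotIn(...)` consume characters; they are not pyparsing's implementation.

```python
import itertools
import string

# Same characters as the pyp.alphanums string shown above: ASCII only.
ASCII_ALPHANUMS = string.ascii_letters + string.digits


def leading_run(text, allowed):
    """Hypothetical helper: longest prefix made only of characters in `allowed`."""
    return "".join(itertools.takewhile(lambda ch: ch in allowed, text))


def leading_run_excluding(text, excluded):
    """Hypothetical helper: longest prefix containing no character of `excluded`."""
    return "".join(itertools.takewhile(lambda ch: ch not in excluded, text))


# The line as raw windows-1252 bytes, decoded to Unicode before any parsing.
raw = b"address_name = G\xf6ran Petterson\r"
line = raw.decode("windows-1252")
name = line.split("=", 1)[1].strip()

# An ASCII-only character set stops in front of the "ö" ...
ascii_prefix = leading_run(name, ASCII_ALPHANUMS)
# ... while "everything except separators" keeps the whole first name.
first_word = leading_run_excluding(name, " ,.")

# ascii() escapes non-ASCII characters, so this prints safely on any console.
print(ascii(ascii_prefix))  # 'G'
print(ascii(first_word))    # 'G\xf6ran'
```

This mirrors the split seen in the parse action: the ASCII-only set yields "G", and the remainder of the name is left for `restOfLine`.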