Thread: [Pyparsing] PayPal IPN message parsing
From: Werner F. B. <wer...@fr...> - 2011-01-06 14:10:18
I am having some problems decoding these messages.

The data comes in as an email message with a defined content type as
"Content-Type: text/plain", however it is really
Content-Type: text/plain; charset="windows-1252", so I read it in with

    thisfile = codecs.open(regFile, "r", "windows-1252")

The parsing works fine except on things like:

    address_name = Göran Petterson

Which I parse with:

    alphanums = pyp.Word(pyp.alphanums)

    # address
    str_add_name = pyp.Literal("address_name =").suppress() +\
                   alphanums + pyp.restOfLine
    add_name = str_add_name.setParseAction(self.str_add_nameAction)

But I get in str_add_nameAction:

    ([u'G', u'\xf6ran Petterson\r'], {})

The raw data at this point is "address_name = G\xf6ran Petterson"

What am I doing wrong in all this?

I tried using pyp.printables instead of alphanums but with the same result.

A tip would be very much appreciated.

Werner

P.S.
Happy New Year to you all.
From: Eike W. <eik...@gm...> - 2011-01-06 18:41:04
Hello Werner!

On Thursday 06.01.2011 12:44:53 Werner F. Bruhin wrote:
> I am having some problems decoding these messages.
>
> The data comes in as an email message with a defined content type as
> "Content-Type: text/plain", however it is really Content-Type:
> text/plain; charset="windows-1252", so I read it in with
>
> thisfile = codecs.open(regFile, "r", "windows-1252").

I think this is correct. You convert the file from "windows-1252" to Unicode
prior to parsing. You must write constants as `u"Göran"`. You should IMHO
also encode your program's source code with UTF-8 and have the following as
the first line:

    # -*- coding: utf-8 -*-

IMHO IPython has additional Unicode problems. This confused me while I was
writing this e-mail; maybe something similar is happening on your computer
too.

> The parsing works fine except on things like:
>
> address_name = Göran Petterson
>
> Which I parse with:
> alphanums = pyp.Word(pyp.alphanums)
>
> # address
> str_add_name = pyp.Literal("address_name =").suppress() +\
> alphanums + pyp.restOfLine
> add_name = str_add_name.setParseAction(self.str_add_nameAction)
>
> But I get in str_add_nameAction:
> ([u'G', u'\xf6ran Petterson\r'], {})

`pyp.alphanums` is a string, and it does not contain the character "ö", so
the Word stops matching after the "G". See:

    In [1]: import pyparsing as pyp

    In [2]: pyp.alphanums
    Out[2]: 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'

I think getting a suitable parser with the least typing would be something
like:

    alphanums = pyp.CharsNotIn(" ,.")

or

    str_add_name = pyp.Literal("address_name =").suppress() +\
                   pyp.restOfLine

And keep in mind that people in other countries write their names in ways
you may not expect. Older Germans, for example, frequently have forenames
with hyphens, like "Karl-Heinz" or "Franz-Josef".

> The raw data at this point is "address_name = G\xf6ran Petterson"

The code for "ö" is F6 both in windows-1252 and in Unicode, so I think this
is correct. What you see is simply repr(u"Göran Petterson").

http://en.wikipedia.org/wiki/Windows-1252
http://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF

All the best,
Eike.
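To make the explanation above concrete, here is a small standard-library sketch (Python 3, no pyparsing needed) of why the match stops after "G". It only imitates what a Word built from an ASCII-only character string does; the names `ASCII_ALPHANUMS`, `word` and `rest` are made up for this illustration and are not part of pyparsing. As far as I know `pyp.printables` is likewise ASCII-only, which would explain why swapping it in made no difference.

```python
import string
from itertools import takewhile

# Same characters as the pyp.alphanums string shown above (ASCII only).
ASCII_ALPHANUMS = string.ascii_letters + string.digits

# The bytes as they sit in the file, and the text after decoding them,
# which is roughly what codecs.open(..., "windows-1252") hands back.
raw = b"address_name = G\xf6ran Petterson\r\n"
text = raw.decode("windows-1252")

# Drop the "address_name =" prefix and the whitespace that follows it.
value = text.split("=", 1)[1].lstrip()

# Imitate Word(alphanums): consume characters while they are in the set.
word = "".join(takewhile(lambda ch: ch in ASCII_ALPHANUMS, value))

# Imitate restOfLine: everything up to, but not including, the newline.
rest = value[len(word):].split("\n", 1)[0]

print(repr(word), repr(rest))
```

The decoding step is fine; the split happens only because "ö" is not in the allowed character set. The trailing "\r" comes from the CRLF line ending and can be removed with `rest.rstrip("\r")` if it is not wanted.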
From: Paul M. <pt...@au...> - 2011-01-07 01:34:32
I'm not super-keen on your variable naming (using alphanums as a Word
expression, overloading the alphanums string defined in pyparsing), but let's
go with it. The alphanums string in pyparsing is purely 7-bit ASCII
characters. As a first pass, try changing this to add the alphas8bit string:

    alphanums = pyp.Word(pyp.alphanums + pyp.alphas8bit)

This should handle your posted question.

If you need to handle more of the Unicode set (beyond chr(256)), then you'll
need to use these definitions:

    >>> alphas = u''.join(unichr(c) for c in range(65536) if unichr(c).isalpha())
    >>> len(alphas)
    47672
    >>> nums = u''.join(unichr(c) for c in range(65536) if unichr(c).isdigit())
    >>> len(nums)
    404

So if you go to embracing all Unicode strings, there are actually over 400
characters that are considered to be numeric digits. But I think alphas8bit
should carry you along for a while.

-- Paul
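For anyone reading this on Python 3, where unichr() no longer exists and every str is Unicode, the definitions above might be written as in the sketch below. It uses only the standard library. `latin1_letters` is my own approximation of what alphas8bit covers (the letters in 0xC0 to 0xFF), and `leading_word` is a hypothetical helper that imitates a Word match; neither is pyparsing API.

```python
import string

# Python 3: chr() replaces unichr(). Same range(65536) as in the post.
BMP = range(0x10000)

alphas = "".join(chr(c) for c in BMP if chr(c).isalpha())
nums = "".join(chr(c) for c in BMP if chr(c).isdigit())

# Approximation of the 8-bit letters: 0xC0..0xFF, where isalpha() skips
# the multiplication sign (0xD7) and the division sign (0xF7).
latin1_letters = "".join(
    chr(c) for c in range(0xC0, 0x100) if chr(c).isalpha()
)

# ASCII alphanumerics plus the Latin-1 letters, as a set for fast lookup.
name_chars = set(string.ascii_letters + string.digits + latin1_letters)


def leading_word(text, allowed):
    """Return the longest prefix of text made only of allowed characters."""
    end = 0
    while end < len(text) and text[end] in allowed:
        end += 1
    return text[:end]


print(leading_word("Göran Petterson", name_chars))
print(len(alphas), len(nums))
```

The lengths printed on the last line depend on the Unicode database shipped with the interpreter, so they need not match the 47672 and 404 quoted above.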
From: Werner F. B. <wer...@fr...> - 2011-01-07 09:07:28
Paul and Eike,

Thanks for your pointers.

On 07/01/2011 02:18, Paul McGuire wrote:
> I'm not super-keen on your variable naming (using alphanums as a Word
> expression, overloading the alphanums string defined in pyparsing), but
> let's go with it.

As I import pyparsing as pyp it didn't cause me any problems, but you are
right; I have changed it.

> The alphanums string in pyparsing is purely 7-bit ASCII
> characters. As a first pass, try changing this to add the alphas8bit string:
>
>     alphanums = pyp.Word(pyp.alphanums + pyp.alphas8bit)
>
> This should handle your posted question.

That did it, but I will probably go with the definitions below.

Thanks
Werner

> If you need to handle more of the Unicode set (beyond chr(256)), then you'll
> need to use these definitions:
>
>     >>> alphas = u''.join(unichr(c) for c in range(65536) if unichr(c).isalpha())
>     >>> len(alphas)
>     47672
>     >>> nums = u''.join(unichr(c) for c in range(65536) if unichr(c).isdigit())
>     >>> len(nums)
>     404
>
> So if you go to embracing all Unicode strings, there are actually over 400
> characters that are considered to be numeric digits. But I think alphas8bit
> should carry you along for a while.
>
> -- Paul