I am a total newbie with Python and especially PyParsing.
I am trying to write a little script that reads a file in the format:
name tel
street zip city
In between the fields there might be special characters from time to time. So far I have managed to write a PyParsing grammar that recognizes names (single names, names made up of more than one word, and names including more than one word and special characters), tel (always in a format like x-y, where x and y can vary in length), street (same as name: one or more words, special characters recognized), zip (always a 5-digit number) and city (one or more words, no special characters).
Everything works fine on test data that has the two kinds of lines separated:
name tel
name tel
.
.
.
or
street zip city
street zip city
I have not been able to get this to work on files that are in the required format:
name tel
street zip city
name tel
street zip city
.
.
.
I did read the post https://sourceforge.net/forum/forum.php?thread_id=1224566&forum_id=337293 but have not been able to get it to work. When I print the result, only the very last entry is printed (actually only the last 2 words of the last city, which should contain 3 words). Again, when parsing a test file it works just fine.
The other big problem is getting everything written to a file. I tried to just pickle my result (pickle dump (output, result)), but Python just complains about " 'str' object is not callable ".
Any idea on either problem would be greatly appreciated.
Nils
Sorry about this second post. I meant to write pickle.dump(result, output)
1. Are you using scanString or parseString? Posting some code and/or grammar fragments, plus some sample data, may be helpful here.
2. Parsing results are returned as ParseResults objects, which may not pickle nicely. Try pickling results.asList(), which will collapse the tokens down to a nested list.
-- Paul
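A minimal sketch of the pickling suggestion in point 2. The toy `pair` grammar and the sample string are made up for illustration and are not from this thread, and an in-memory buffer stands in for a real output file opened in binary mode:

```python
import io
import pickle

from pyparsing import Group, OneOrMore, Word, alphas, nums

# Hypothetical toy grammar: a word followed by a number, repeated.
pair = Group(Word(alphas) + Word(nums))
results = OneOrMore(pair).parseString("abc 123 def 456")

# asList() collapses the ParseResults into plain nested lists.
plain = results.asList()

# Stand-in for: output = open("some_file", "wb")
output = io.BytesIO()

# Argument order: the object to pickle first, then the open file.
pickle.dump(plain, output)

# Read it back to confirm the round trip.
output.seek(0)
restored = pickle.load(output)
print(restored)
```

Note that the results names (such as "phone" or "zip") are not part of the nested list, so anything that relies on them has to be pulled out before calling asList().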
Thanks for your reply, and sorry that I could not get back earlier.
I do not have access to the actual data right now, but I'll try to give an example.
Burger King 1234-45679
Rue de Fontane 12 45458 Chablis
PennyArcade 2000 02315-4567897
Highway 15 32154 Dollarville
B&O 44444-7874564
Ruppstr. 44 45454 Whateverville
What I basically need is just to have the 5 fields in a single line, separated by semicolons. Is there something like "beginningOfLine"? If there were, it would be a lot easier for me. As it is now, I am struggling with getting the name as a single string (when I use Combine, only the very first word is in the string, and the rest of the name is thrown into the rest of the data randomly). Same goes for the street. How do I write a grammar that parses the name and returns a single string? The tel. no. does work, the zip code does (big thing :)) and the city is basically a restOfLine in the second line. So if anyone could give me a hint (or a hand), I'd appreciate it.
Okay, so here's how I interpret your grammar:
Field 1 - unpredictable, everything up to the phone number
Field 2 - phone number, format is some digits, a dash, and some more digits, all adjacent
Field 3 - unpredictable, everything up to a zip code
Field 4 - zip code, 5 adjacent digits
Field 5 - everything following the zip code, up to the end of line
Ok, so you have working grammar fragments for phone number and zip code, probably something like:
phoneNum = Combine(Word(nums) + "-" + Word(nums)).setResultsName("phone")
zipCode = Word(nums,exact=5).setResultsName("zip")
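As a quick check of these two fragments on their own, here is a small sketch using values from the sample data. The variable names `phone_result`, `zip_result` and `spaced_ok` are just for illustration. Combine requires the digits and the dash to be adjacent, so a phone number written with spaces around the dash should be rejected:

```python
from pyparsing import Combine, ParseException, Word, nums

phoneNum = Combine(Word(nums) + "-" + Word(nums)).setResultsName("phone")
zipCode = Word(nums, exact=5).setResultsName("zip")

phone_result = phoneNum.parseString("1234-45679")
zip_result = zipCode.parseString("45458")

print(phone_result["phone"])  # 1234-45679
print(zip_result["zip"])      # 45458

# Combine does not allow whitespace between its parts,
# so this spaced-out form does not parse as a phone number.
try:
    phoneNum.parseString("1234 - 45679")
    spaced_ok = True
except ParseException:
    spaced_ok = False
print(spaced_ok)  # False
```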
Let's try SkipTo for the fields that we can only define as "everything up to X":
name = SkipTo( phoneNum ).setResultsName("name")
street = SkipTo( zipCode ).setResultsName("street")
city = restOfLine.setResultsName("city")
Now a single entry looks like:
entry = Group( name + phoneNum + street + zipCode + city )
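To see the pieces working together, here is a sketch that parses just one two-line record from the sample data. The `record` and `cleaned` names are made up for this illustration. Since SkipTo and restOfLine may keep surrounding whitespace in the text they return, the sketch strips each field rather than assuming exactly what whitespace comes back:

```python
from pyparsing import Combine, Group, SkipTo, Word, nums, restOfLine

phoneNum = Combine(Word(nums) + "-" + Word(nums)).setResultsName("phone")
zipCode = Word(nums, exact=5).setResultsName("zip")
name = SkipTo(phoneNum).setResultsName("name")
street = SkipTo(zipCode).setResultsName("street")
city = restOfLine.setResultsName("city")
entry = Group(name + phoneNum + street + zipCode + city)

# One record: "name tel" on the first line, "street zip city" on the second.
record = "B&O 44444-7874564\nRuppstr. 44 45454 Whateverville\n"

# entry is a Group, so the five named fields live in the first token.
fields = entry.parseString(record)[0]

# Strip leading/trailing whitespace from each field.
cleaned = {
    key: fields[key].strip()
    for key in ("name", "phone", "street", "zip", "city")
}
print(cleaned)
```

Each SkipTo returns its skipped text as one string, so a multi-word name or street comes back as a single field rather than as separate words.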
Here's your whole program with all the plumbing:
testdata = """
Burger King 1234-45679
Rue de Fontane 12 45458 Chablis
PennyArcade 2000 02315-4567897
Highway 15 32154 Dollarville
B&O 44444-7874564
Ruppstr. 44 45454 Whateverville
"""
from pyparsing import *
phoneNum = Combine(Word(nums) + "-" + Word(nums)).setResultsName("phone")
zipCode = Word(nums,exact=5).setResultsName("zip")
name = SkipTo( phoneNum ).setResultsName("name")
street = SkipTo( zipCode ).setResultsName("street")
city = restOfLine.setResultsName("city")
# Use these alternate forms to skip whitespace before characters
#~ name = Combine(empty + SkipTo( phoneNum ) ).setResultsName("name")
#~ street = Combine(empty + SkipTo( zipCode ) ).setResultsName("street")
#~ city = Combine( empty + restOfLine ).setResultsName("city")
entry = Group( name + phoneNum + street + zipCode + city )
results = OneOrMore(entry).parseString( testdata )
for r in results:
    print("-", r.name)
    print("-", r.street)
    print("-", r.city)
    print("-", r.zip)
    print("-", r.phone)
    print()
Good luck,
-- Paul
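For the original goal of one line per entry with the five fields separated by semicolons, something along these lines might work. It is only a sketch built on the grammar above. The helper name `to_semicolon_lines` and the field order are assumptions, not something specified in this thread, and each field is stripped because the skipped text can carry stray whitespace:

```python
from pyparsing import Combine, Group, OneOrMore, SkipTo, Word, nums, restOfLine

testdata = """
Burger King 1234-45679
Rue de Fontane 12 45458 Chablis
PennyArcade 2000 02315-4567897
Highway 15 32154 Dollarville
B&O 44444-7874564
Ruppstr. 44 45454 Whateverville
"""

phoneNum = Combine(Word(nums) + "-" + Word(nums)).setResultsName("phone")
zipCode = Word(nums, exact=5).setResultsName("zip")
name = SkipTo(phoneNum).setResultsName("name")
street = SkipTo(zipCode).setResultsName("street")
city = restOfLine.setResultsName("city")
entry = Group(name + phoneNum + street + zipCode + city)


def to_semicolon_lines(text):
    """Hypothetical helper: return one 'name;phone;street;zip;city' line per entry."""
    lines = []
    for r in OneOrMore(entry).parseString(text):
        fields = [r[key].strip() for key in ("name", "phone", "street", "zip", "city")]
        lines.append(";".join(fields))
    return lines


lines = to_semicolon_lines(testdata)
for line in lines:
    print(line)
```

The resulting lines can then be written out with an ordinary text file, for example by joining them with newlines and calling write() on a file opened in "w" mode.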