Thread: [Pyparsing] whitespace related question
From: stefaan.himpe <ste...@gm...> - 2008-06-22 05:06:47
Hello list,

I am stumped by some unexpected behaviour. I want to parse tables of the following form:

    table = """
    # NAME # col1 # col2 # col3 ## cola # colb #
    # Test1 # 1 # 2 # 3 ## a # b #
    # Test_2 # 4 # 5 # 6 ## c # d #
    """

For this, I have specified a TableParser (code follows after this mail). At first sight, the TableParser does exactly what I want. But I found out that parsing stops if one of the table rows contains a space after the last "#", and I do not understand why. I expected the p.restOfLine to take care of this.

This is with pyparsing 1.4.12. Any ideas?

Best regards,
Stefaan.

    import pyparsing as p

    identifier = p.Word(p.alphas + "_", p.alphas + p.nums + "_")
    col = p.Literal("#").suppress()
    list_of_cols = p.delimitedList(p.CharsNotIn("#\n\r"), "#")

    left_table_header = col + p.ZeroOrMore(identifier).setResultsName("TestColumnName") + col + \
        list_of_cols.setResultsName("HeaderSetupDataColumns")
    right_table_header = list_of_cols.setResultsName("HeaderCheckDataColumns") + \
        p.restOfLine.suppress()
    table_header = left_table_header.setResultsName("LeftTableHeader") + \
        p.Literal("##").suppress() + \
        right_table_header.setResultsName("RightTableHeader") + \
        p.lineEnd.suppress()

    left_table_row = col + \
        identifier.setResultsName("TestName") + \
        col + \
        list_of_cols.setResultsName("RowSetupDataColumns")
    right_table_row = list_of_cols.setResultsName("RowCheckDataColumns") + \
        p.restOfLine.suppress()
    table_row = left_table_row.setResultsName("LeftTableRow") + \
        p.Literal("##").suppress() + \
        right_table_row.setResultsName("RightTableRow") + \
        p.lineEnd.suppress()

    TableParser = table_header + \
        p.OneOrMore(p.Group(table_row)).setResultsName("Rows")

From: Paul M. <pt...@au...> - 2008-06-23 05:44:13
Stefaan -

You are correct, this has to do with whitespace skipping in pyparsing. The culprit turns out to be your loose definition of list_of_cols:

    list_of_cols = p.delimitedList(p.CharsNotIn("#\n\r"), "#")

Pyparsing defaults *in most cases* to skipping whitespace before trying to match any expression. Whitespace skipping gets suppressed if you have wrapped code within a Combine, or have called leaveWhitespace, *OR* if you use CharsNotIn. CharsNotIn started out as a sort of AntiWord, in that you could define a Word composed of any characters *not* in the given set. When I created CharsNotIn, I decided that I would *not* automatically skip whitespace before matching one of these, since whitespace could conceivably be one of the characters to be avoided, and if I skipped over it before matching, I would make a false positive.

One alternative is to add "Empty()" (or the pyparsing constant "empty") to your expression of what can be found in a list of cols, as in the following:

    list_of_cols = p.delimitedList(p.empty+p.CharsNotIn("#\n\r"), "#")

Empty() *does* advance past whitespace, consumes no actual characters, and always succeeds, so adding Empty() is a way to explicitly jump over some whitespace.

Or you could use the Regex expression, which also skips over whitespace before matching, and use the re notation of "[^...]", replacing '...' with the characters to exclude from matching:

    list_of_cols = p.delimitedList(p.Regex(r"[^#\n\r]+"), "#")

With the sample you sent, either of the options works; choose whichever you are more comfortable with.

Alternatively, you could also try tightening up your definition of list_of_cols, too, to match just integers on the left side of the table, and contiguous alphanumeric words on the right side of the table.

Best of luck, and keep on pyparsing!

-- Paul

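For reference, the three variants described in the reply above can be compared side by side. This is only a minimal sketch: it assumes a pyparsing release in which the camelCase names used in this thread (parseString, asList) are still available, and the one-cell sample string and the variable names are made up for illustration.

```python
import pyparsing as pp

hash_mark = pp.Literal("#").suppress()
sample = "# abc"  # hypothetical one-cell input, not taken from the thread

# CharsNotIn does not skip leading whitespace, so the space after "#"
# becomes part of the matched cell.
raw_cell = hash_mark + pp.CharsNotIn("#\n\r")
raw_tokens = raw_cell.parseString(sample).asList()

# Empty() skips whitespace and consumes nothing, so CharsNotIn starts at "a".
empty_cell = hash_mark + pp.Empty() + pp.CharsNotIn("#\n\r")
empty_tokens = empty_cell.parseString(sample).asList()

# Regex skips leading whitespace by default before it tries to match.
regex_cell = hash_mark + pp.Regex(r"[^#\n\r]+")
regex_tokens = regex_cell.parseString(sample).asList()

print(raw_tokens, empty_tokens, regex_tokens)
```

Only leading whitespace is affected by these choices; any whitespace that follows the cell text is still part of the match in all three variants.
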
From: Stefaan H. <ste...@gm...> - 2008-06-23 09:12:00
> Alternatively, you could also try tightening up your definition of
> list_of_cols, too, to match just integers on the left side of the table,
> and contiguous alphanumeric words on the right side of the table.

Hello, and thank you so much for your clarification.

Tightening up the definition of the list_of_cols is not an option, however, as in the real-life application the cells can contain random characters/numbers/whitespace/... (templates for code generation).

From: stefaan.himpe <ste...@gm...> - 2008-06-24 21:19:31
Hello, I have one follow-up question.

The solution I had was able to parse tables with empty cells, but after incorporating your suggestions this no longer works. At first I expected it would be trivial to extend the parser to handle empty cells, but so far I haven't managed to get something working :(

I have tried to extend the list_of_cols definition in many ways and I have tried to replace + with ^ in some places... I am still missing some fundamental insights into how pyparsing works to unravel this little mystery. I'd be really grateful for some input.

Best regards,
Stefaan.

To give an idea of some of my many attempts (replace list_of_cols in the earlier posted code):

    list_of_cols = p.delimitedList(p.Word(" \t") | p.Regex(r"[^#\n\r]+"), "#")

or

    list_of_cols = p.delimitedList(p.Regex(r"[^#\n\r]*"), "#")

From: Paul M. <pt...@au...> - 2008-06-25 01:01:38
Stefaan -

First off, '^' and '+' are not really interchangeable - '+' is used to indicate a succession of expressions that must occur in the given order. '^' indicates a list of alternatives, and that the parser should evaluate all of the alternatives and select the longest match. '|' is like '^', but short-cuts evaluation, stopping when the first alternative match is found. So replacing '+' with '^' will just make things worse.

Secondly, Word("string of whitespace characters") does not work, and I should think would give you a compiler warning. If you absolutely *must* parse for whitespace, use the pyparsing White() class. (But read on - you don't really need White().)

Overall, this *is* a mysterious parser, because you have a *lot* going on! Here was your expression for a list of columns the last time we spoke:

    list_of_cols = p.delimitedList(p.Regex(r"[^#\n\r]+"), "#")

And here is a sample table:

    table = """
    # NAME # col1 # col2 # col3 ## cola # colb #
    # Test1 # 1 # 2 # 3 ## a # b #
    # Test_2 # 4 # 5 # 6 ## c # d #
    """

You now want to add "optionality" to the entries in the table, so I've added another row with some blank cells:

    table = """
    # NAME # col1 # col2 # col3 ## cola # colb #
    # Test1 # 1 # 2 # 3 ## a # b #
    # Test_2 # 4 # 5 # 6 ## c # d #
    # Test_3 # 7 # 8 # ## # e #
    """

My first pass was to modify the elements of the delimited list, to indicate that list elements could be blank - up till now, this was easily done by wrapping the expression in a pyparsing Optional:

    list_of_cols = p.delimitedList(p.Optional(p.Regex(r"[^#\n\r]+")), "#")

But this results in the exception:

    pyparsing.ParseException: Expected "##" (at char 231), (line:6, col:5)

Why? Because now, your "##" table separator is being interpreted as two column separators with an empty cell.

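The failure described here can be reproduced on a single shortened row. This is a rough sketch only, assuming a pyparsing release that still provides the camelCase delimitedList and parseString names; the sample row and the variable names are invented for illustration.

```python
import pyparsing as pp

# Optional cells separated by a plain "#" delimiter, as in the first pass.
loose_cols = pp.delimitedList(pp.Optional(pp.Regex(r"[^#\n\r]+")), "#")

line = "1 # 2 ## a # b"  # hypothetical shortened row, not from the thread

# The list does not stop at "##": each "#" of the separator is taken as a
# column delimiter, with an empty (and unreported) cell in between.
cells = [c.strip() for c in loose_cols.parseString(line)]

# As a result, a "##" that is expected after the list can no longer be found.
try:
    (loose_cols + pp.Literal("##")).parseString(line)
    separator_found = True
except pp.ParseException:
    separator_found = False

print(cells, separator_found)
```

The cells from both sides of the table end up in one list, which is why the enclosing grammar then fails while looking for the "##" separator.
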
So we need to expand our notion of a delimiter, that we *only* want to accept '#' delimiters after first determining that the '#' is not the first character of a '##' table separator:

    list_of_cols = p.delimitedList(p.Optional(p.Regex(r"[^#\n\r]+")),
                                   ~p.Literal("##")+"#")

This now parses our table, but we lose track of the empty cells. I assume that the cell's presence is significant, so we add a default value to the definition of the Optional:

    list_of_cols = p.delimitedList(p.Optional(p.Regex(r"[^#\n\r]+"),default=""),
                                   ~p.Literal("##")+"#")

We are also not properly handling the newlines, since p.Optional is skipping over them as its default whitespace-skipping behavior. So let's use another negated lookahead to prevent matching a LineEnd() as part of the content of the delimited list:

    list_of_cols = p.delimitedList(~p.LineEnd()+p.Optional(p.Regex(r"[^#\n\r]+"),default=""),
                                   ~p.Literal("##")+"#")

This probably is enough for you to proceed. As a matter of style, I tend to group lists of things using a pyparsing Group:

    list_of_cols = p.Group(p.delimitedList(~p.LineEnd()+p.Optional(p.Regex(r"[^#\n\r]+"),default=""),
                                           ~p.Literal("##")+"#"))

Tables of data aren't ordinarily this complicated to parse - it's just that in this case that you've chosen/been given, there are some tricky stumbling blocks due to the nature of your delimiting punctuation.

-- Paul

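To see the final definition at work, here is a small sketch applied to the row with blank cells. It is an illustration under assumptions rather than the full TableParser: the row expression is a simplified stand-in, the names cell, delimiter and row are invented here, and a pyparsing release with the camelCase names (delimitedList, parseString) is assumed.

```python
import pyparsing as pp

# A cell may be empty, but it must not start at the end of the line.
cell = ~pp.LineEnd() + pp.Optional(pp.Regex(r"[^#\n\r]+"), default="")
# A "#" only counts as a column delimiter when it does not begin a "##".
delimiter = ~pp.Literal("##") + "#"
list_of_cols = pp.Group(pp.delimitedList(cell, delimiter))

# Simplified row (an assumption): leading "#", left columns, "##", right columns.
row = pp.Suppress("#") + list_of_cols + pp.Suppress("##") + list_of_cols

result = row.parseString("# Test_3 # 7 # 8 # ## # e #")

# The Regex keeps whitespace that follows the cell text, so strip it here.
left = [c.strip() for c in result[0]]
right = [c.strip() for c in result[1]]
print(left, right)
```

The empty cells show up as empty strings, so the position of every column is preserved on both sides of the "##" separator. The final "#" of the row is left unparsed by this simplified row expression; in the original grammar it would be picked up by restOfLine.
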
From: Stefaan H. <ste...@gm...> - 2008-06-25 08:07:29
Hello Paul,

Thanks a lot! I indeed had understood that ## caused problems in combination with the implicit whitespace parsing, and I seriously doubt I could have come up with the full solution myself...

As for mixing ^ and + -- I actually knew the difference -- but I was getting tired. (In maths and electronics, + usually means OR, which probably explains my confusion ;) )

It seems to work now! (Well, this part of my parsing problem at least -- but I will first try to continue myself.)

Best regards, and thanks again,
Stefaan.