Re: [Pyparsing] whitespace related question
From: Paul M. <pt...@au...> - 2008-06-25 01:01:38
Stefaan -

First off, '^' and '+' are not really interchangeable - '+' is used to indicate a succession of expressions that must occur in the given order. '^' indicates a list of alternatives, and that the parser should evaluate all of the alternatives and select the longest match. '|' is like '^', but short-cuts evaluation, stopping when the first alternative match is found. So replacing '+' with '^' will just make things worse.

Secondly, Word("string of whitespace characters") does not work, and I should think would give you a compiler warning. If you absolutely *must* parse for whitespace, use the pyparsing White() class. (But read on - you don't really need White().)

Overall, this *is* a mysterious parser, because you have a *lot* going on!

Here was your expression for a list of columns the last time we spoke:

    list_of_cols = p.delimitedList(p.Regex(r"[^#\n\r]+"), "#")

And here is a sample table:

    table = """
    # NAME # col1 # col2 # col3 ## cola # colb #
    # Test1 # 1 # 2 # 3 ## a # b #
    # Test_2 # 4 # 5 # 6 ## c # d #
    """

You now want to add "optionality" to the entries in the table, so I've added another row with some blank cells:

    table = """
    # NAME # col1 # col2 # col3 ## cola # colb #
    # Test1 # 1 # 2 # 3 ## a # b #
    # Test_2 # 4 # 5 # 6 ## c # d #
    # Test_3 # 7 # 8 # ## # e #
    """

My first pass was to modify the elements of the delimited list, to indicate that list elements could be blank - up till now, this was easily done by wrapping the expression in a pyparsing Optional:

    list_of_cols = p.delimitedList(p.Optional(p.Regex(r"[^#\n\r]+")), "#")

But this results in the exception:

    pyparsing.ParseException: Expected "##" (at char 231), (line:6, col:5)

Why? Because now, your "##" table separator is being interpreted as two column separators with an empty cell.
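
The misreading is easy to reproduce on a fragment of one row. This is only a minimal sketch, using the camelCase pyparsing names from this thread; the default="" argument and the sample string are assumptions added here so that the empty cell shows up in the output:

```python
import pyparsing as p

# Assumption for illustration: default="" makes an empty cell visible
# as an empty string instead of being dropped from the results.
cell = p.Optional(p.Regex(r"[^#\n\r]+"), default="")

# Plain "#" delimiter, as in the first attempt above.
naive_cols = p.delimitedList(cell, "#")

# Hypothetical fragment: last cell of one sub-table, "##", first cell of the next.
cells = naive_cols.parseString("3 ## a").asList()

# The Regex keeps trailing blanks, so strip them for display.
print([c.strip() for c in cells])  # ['3', '', 'a']
```

The middle entry is the phantom cell found between the two '#' characters of the "##" separator, which leaves no "##" for the rest of the grammar to match.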
So we need to expand our notion of a delimiter: we *only* want to accept a '#' delimiter after first determining that the '#' is not the first character of a '##' table separator:

    list_of_cols = p.delimitedList(p.Optional(p.Regex(r"[^#\n\r]+")),
                                   ~p.Literal("##")+"#")

This now parses our table, but we lose track of the empty cells. I assume that the presence of a cell is significant, so we add a default value to the definition of the Optional:

    list_of_cols = p.delimitedList(p.Optional(p.Regex(r"[^#\n\r]+"),default=""),
                                   ~p.Literal("##")+"#")

We are also not properly handling the newlines, since p.Optional is skipping over them as part of its default whitespace-skipping behavior. So let's use another negated lookahead to prevent matching a LineEnd() as part of the content of the delimited list:

    list_of_cols = p.delimitedList(~p.LineEnd()+p.Optional(p.Regex(r"[^#\n\r]+"),default=""),
                                   ~p.Literal("##")+"#")

This probably is enough for you to proceed. As a matter of style, I tend to group lists of things using a pyparsing Group:

    list_of_cols = p.Group(p.delimitedList(~p.LineEnd()+p.Optional(p.Regex(r"[^#\n\r]+"),default=""),
                                           ~p.Literal("##")+"#"))

Tables of data aren't ordinarily this complicated to parse - it's just that in the case you've chosen/been given, there are some tricky stumbling blocks due to the nature of your delimiting punctuation.

-- Paul
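
For reference, a minimal sketch of the final grouped expression applied to a fragment of the Test_3 row. The `row` expression and the sample string are assumptions made up for illustration; they are not part of the original grammar, which is not shown in this thread:

```python
import pyparsing as p

# A cell may be empty (default=""), but must not start at a line end.
cell = ~p.LineEnd() + p.Optional(p.Regex(r"[^#\n\r]+"), default="")

# A column separator is a "#" that does not begin a "##" table separator.
col_sep = ~p.Literal("##") + "#"

list_of_cols = p.Group(p.delimitedList(cell, col_sep))

# Hypothetical row: two column lists joined by the "##" table separator.
row = list_of_cols + p.Suppress("##") + list_of_cols

result = row.parseString("7 # 8 # ## # e")

# The Regex keeps trailing blanks, so strip each cell for display.
left, right = [[c.strip() for c in group] for group in result.asList()]
print(left)   # ['7', '8', '']
print(right)  # ['', 'e']
```

Each Group keeps its cells together as a nested list, and the empty strings mark the blank cells on either side of the "##".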