Re: [Pyparsing] fixed-length field preceded by length?

The countedText() pattern I posted earlier in this thread did not
work when the literal text extended over multiple lines.  Below is
an updated version that fixes that.  I also expanded the comments
and cleaned up some details.
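
For anyone wondering why the multi-line case broke: by default the
regex "." does not match a newline, so a fixed-width pattern such as
".{6}" cannot match when the text wraps.  A minimal sketch using only
the standard re module (the sample string is made up for illustration):

```python
import re

text = "123\n56"   # six characters, one of them a newline (illustrative)

# Without re.DOTALL, "." refuses to match "\n", so ".{6}" cannot match.
plain = re.match(r".{6}", text)

# With re.DOTALL, "." matches any character, newline included.
dotall = re.match(r".{6}", text, re.DOTALL)

print(plain)                     # None
print(dotall.group(0) == text)   # True
```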

Best regards,
John Shipman (jo...@nm...), Applications Specialist, NM Tech Computer Center,
Speare 146, Socorro, NM 87801, (575) 835-5735, http://www.nmt.edu/~john
   ``Let's go outside and commiserate with nature.''  --Dave Farber
================================================================
#!/usr/bin/env python3
#================================================================
# countedtext: Using pyparsing for a string preceded by a count
#----------------------------------------------------------------
# Author: John W. Shipman (jo...@nm...)
#         New Mexico Tech Computer Center
#         Socorro, NM 87801
#
# Problem: Sanjay Ghemawat's venerable 'ical' calendar utility
# (http://en.wikipedia.org/wiki/Ical_%28Unix%29) saves events
# in a .calendar file, in which the description of an event
# is saved in a line like this:
#
#   Text [6 [Easter]]
#
# The problem is to write a pyparsing pattern that parses the
# count and the bracketed string.  The shortcut method is to
# use QuotedString(quoteChar='[', endQuoteChar=']'), but this
# fails if the literal string contains a ']' character.
#
# Paul McGuire responded immediately to my post on the pyparsing
# mailing list, suggesting that I study the implementation of the
# countedArray() helper.  Based on this advice, I offer this
# implementation of a countedText() pattern that matches an
# integer followed by a literal string in brackets, complete
# with a test driver.
#   2012-01-03: Now allows newlines in the literal string.
#     Also expanded the comments and simplified some code.
#     To convert to Python 2.7:
#       - Remove '3' from the end of the first line.
#       - Uncomment the __future__ import just below.
#   2012-01-01: Initial version.
#----------------------------------------------------------------

####from __future__ import print_function
import sys
import re
import pyparsing as pp

def countedText():
     '''Defines a pattern of the form:
          N "[" char* "]"
        where N is an integer that specifies the length of the
        following bracketed string literal.
        Example: "6 [Easter]"
     '''
     #--
     # The basic trick is to use Forward to create a dummy token
     # whose content can be filled in later.  The time sequence:
     #   A. When countedText() is called:
     #      - Define a pattern 'intExpr' for N, and attach a parse
     #        action to it that converts to type int.
     #      - Use Forward() to create a dummy token 'stringExpr'
     #        that will eventually match the (char*) of the pattern.
     #      - Create a closure named 'countedParseAction' and attach it
     #        as a parse action to intExpr.
     #      - Return a pattern that matches the whole construct, with
     #        the dummy token at the position of the (char*) part.
     #   B. When intExpr is matched:
     #      - Its first parse action converts the value to type int.
     #      - Its second parse action is the countedParseAction()
     #        closure, which extracts N from the token list t.
     #      - It creates a pattern that matches exactly N characters,
     #        including newlines.  The '<<' operator for the Forward
     #        class is overloaded to drop the real pattern in place
     #        of the dummy pattern.
     #--
     intExpr = pp.Word(pp.nums).setParseAction(lambda t: int(t[0]))
     stringExpr = pp.Forward()

     def countedParseAction(s, l, t):
         '''Parse action that sets up the count in stringExpr.

           To match the part between the brackets, we use a regex of
           the form ".{N}".  This works even for N=0.  The re.DOTALL
           flag makes "." match any character, even newline.
         '''
         n = int(t[0])
         stringExpr << pp.Combine(
             pp.Suppress("[") +
             pp.Regex(".{{{0:d}}}".format(n), re.DOTALL) +
             pp.Suppress("]"))
         return []

     #--
     # The second parse action uses the count to define the
     # stringExpr pattern using the actual value of the count.
     #--
     intExpr.addParseAction(countedParseAction)
     return (intExpr + stringExpr)

# - - - - -   m a i n

testLines = [                # Test output
     "0 []",                  # ['']
     "11 [abcdefghijk]",      # ['abcdefghijk']
     "6 [Easter]",            # ['Easter']
     "4 []]]]]",              # [']]]]']
     "6 [123\n56]",           # ['123\n56']
     "6 [ abcdef]"            # Fails: the space counts, so the text has 7 chars
     ]

LINE_PAT = countedText()

def main():
     '''Run test() on each line of testLines.
     '''
     for line in testLines:
         test(line)

def test(line):
     '''Parse one line; print the result, or mark where parsing failed.
     '''
     try:
         result = LINE_PAT.parseString(line, parseAll=True)
         print("/{0}/ -> {1}".format(line, result))
     except pp.ParseException as x:
         print("{0}\n{1}^ Fail".format(line, " "*(x.column-1)))

# - - - - -   E p i l o g u e

if __name__ == "__main__":
     main()
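
As a cross-check on the expected results listed beside testLines, here
is a hand-rolled version of the same "count, then exactly N characters
in brackets" rule that does not use pyparsing at all.  This is only a
sketch: the name counted_text_reference is made up for this example,
and its whitespace handling is an assumption that may not match
pyparsing's in every corner case.

```python
import re

# Leading count, optional whitespace, then the opening bracket.
_HEAD = re.compile(r"\s*(\d+)\s*\[")

def counted_text_reference(line):
    """Return the bracketed text of 'N [text]', or None if the line
    does not have exactly N characters between the brackets."""
    m = _HEAD.match(line)
    if m is None:
        return None
    n = int(m.group(1))
    start = m.end()
    end = start + n
    # After the N characters there must be a "]" and nothing else
    # except trailing whitespace.
    if line[end:end + 1] != "]" or line[end + 1:].strip():
        return None
    return line[start:end]

samples = ["0 []", "6 [Easter]", "4 []]]]]", "6 [123\n56]", "6 [ abcdef]"]
for sample in samples:
    print(repr(sample), "->", repr(counted_text_reference(sample)))
```

Slicing by the count is what lets "]" appear inside the text: the
closing bracket is located by position, not by searching for it.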