Problems with LineEnd()

  • Anonymous - 2011-12-03

    I am having trouble getting this code fragment to work properly. I have a grammar where newlines are important, but LineEnd and LineStart do not seem to work properly.

    import os
    from collections import namedtuple
    from pyparsing import *

    eol = Suppress(LineEnd()).setDebug()
    sol = Suppress(LineStart())
    testFilePath = os.path.join("/Users","croaker","SampleData","2011.04.pairings.html")
    SMALL_TEST = 1
    BIG_TEST = 2
    TestBNF = namedtuple("TestBNF", "testData bnf")
    def foo():
        #Hotel Phone
        testTextHotelPhone = """\
        (407) 996-9840 TRANS: Trio Trans Inc (407) 248-1900
        (775) 329-0711
        52-998-8915555 TRANS: Intermar Caribe 52-998-881-0000
        52-555-1305300 TRANS: Corad 52 1 55 35 22 85 21
        """
        phoneDigits = Word(nums + "()-. ")
        phoneNumber = OneOrMore(phoneDigits)("PhoneNumber").setName("PhoneNumber")
        # simple_punc is not defined in this fragment
        transName = OneOrMore(Word(alphas + simple_punc))("TransName")
        transportationNumber = Suppress("TRANS:") + transName + phoneNumber("TransPhone")
        hotelPhone = phoneNumber + Optional(Suppress("TRANS:") + transName + phoneNumber)
        hotelPhoneTuple = TestBNF(testTextHotelPhone, hotelPhone)
        return hotelPhoneTuple
    def testRig(sectionName, type):
        testText = None
        testHarness = None
        testHarness = foo()
        if type == SMALL_TEST:
            testText = testHarness.testData
        if type == BIG_TEST:
            try:
                testFile = open(testFilePath,'r')
                testData = "".join( testFile.readlines() )
                testText = testData
            except Exception as err:
                ...
        try:
            ParserElement.setDefaultWhitespaceChars(' \t\r')
            testBNF = testHarness.bnf.setDebug(False)
            count = 0
            for match, start, end in testBNF.scanString(testText):
                count = count + 1
                if (count % 10) == 0:
                    ...
            print(sectionName + " Output: " + str(count))
        except Exception as err:
            ...

    Gets me this:
      <PhoneNumber>(407) 996-9840 </PhoneNumber>
      <PhoneNumber>(407) 248-1900</PhoneNumber>
      <ITEM>(775) 329-0711</ITEM>
      <ITEM>52-998-8915555 </ITEM>

      <ITEM>52-555-1305300 </ITEM>
      <PhoneNumber>52 1 55 35 22 85 21</PhoneNumber>
    hotelPhone Output: 2

    Only two matches, and the second match steals a phone number from line 3.

    Changing

    hotelPhone = phoneNumber + Optional(Suppress("TRANS:") + transName + phoneNumber)

    to

    hotelPhone = phoneNumber + Optional(Suppress("TRANS:") + transName + phoneNumber) + eol

    gets this:
      <ITEM>52-555-1305300 </ITEM>
      <PhoneNumber>52 1 55 35 22 85 21</PhoneNumber>
    hotelPhone Output: 1

    Trying sol at the beginning of my line definition gets me nothing.

    I really need to get newline detection working to properly parse this data. Any suggestions would be appreciated.

    Also, this is a fragment of a larger parser; I hope the example runs :). This is also my first Python project, so some things might need extra explaining.


  • Paul McGuire - 2011-12-04

    First off, welcome to Pyparsing and Python!  I'm always gratified when newcomers to Python take on pyparsing in one of their first programs.

    To address your specific question, I was able to make significant improvement to your program simply by moving this line:

    ParserElement.setDefaultWhitespaceChars(' \t\r')

    to immediately follow the import of pyparsing.  This is important because the whitespace character set is a per-instance attribute, and is set at construction time.  So you want to set the default set of whitespace characters before defining any other elements of your parser.
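
    A minimal sketch of why the ordering matters, assuming a pyparsing release that still accepts the camelCase method names used in this thread. The sample text and the names before and after are made up for illustration:

```python
from pyparsing import OneOrMore, ParserElement, Word, nums

# Built while newline is still in the default whitespace set,
# so this expression skips over "\n" between tokens.
before = OneOrMore(Word(nums))

# Change the default: newline is no longer skippable whitespace.
ParserElement.setDefaultWhitespaceChars(" \t\r")

# Built after the change, so this expression stops at "\n".
after = OneOrMore(Word(nums))

sample = "1 2\n3 4"
before_tokens = before.parseString(sample).asList()
after_tokens = after.parseString(sample).asList()
print(before_tokens)  # runs across the newline
print(after_tokens)   # stops at the end of the first line

# Restore the library default so later code is unaffected.
ParserElement.setDefaultWhitespaceChars(" \n\t\r")
```

    The element created before the call keeps treating newline as whitespace, which is the same effect as a match running on into the next line of the input.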

    Here are some other pyparsing usage suggestions:

    I'm not sure whether you are using asXML as a way to quickly see your parsed output, or if you really want XML.  If you just want the parsed output with the names you've given things, use dump() instead.  Using dump(), you'll get output like this:

    ['(407) 996-9840 ', 'Trio', 'Trans', 'Inc', '(407) 248-1900 (775) 329-0711']
    - PhoneNumber: ['(407) 248-1900 (775) 329-0711']
    - TransName: ['Trio', 'Trans', 'Inc']

    It shows you the list of matched tokens, followed by a hierarchical tree of any named results.
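
    As a rough illustration of dump(), here is a small sketch with a made-up two-part grammar (the names trans_name, trans_phone and entry are hypothetical, not from your parser). The exact layout of the dump() text can differ between pyparsing versions:

```python
from pyparsing import OneOrMore, Word, alphas, nums

# Hypothetical mini-grammar: a company name followed by a phone number.
trans_name = OneOrMore(Word(alphas))("TransName")
trans_phone = Word(nums + "-")("TransPhone")
entry = trans_name + trans_phone

result = entry.parseString("Trio Trans Inc 248-1900")

# dump() returns a string: the token list, then the named results.
dumped = result.dump()
print(dumped)

# Named results are also available directly on the ParseResults.
name_words = result["TransName"].asList()
phone = result["TransPhone"]
print(name_words, phone)
```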

    You may want to clean up the fields that have embedded spaces, like transName, which as you can see returns the list ['Trio', 'Trans', 'Inc'], but I suspect you really want the string "Trio Trans Inc".  To do this, you have a few options:
    - use a parse action to rejoin the separate words
    - use originalTextFor to indicate that you want the originally formatted substring of the matching tokens, not the whitespace-skipped pieces that got parsed. 

    Here are the two options respectively:

        transName.setParseAction(lambda tokens : ' '.join(tokens))


        transName = originalTextFor(OneOrMore(Word(alphas + simple_punc)))("TransName")

    You only need one or the other, not both.  If you are going to continue using asXML, this should also help clean up some of those ugly <ITEM> tags.  I apologize if using a lambda is a more advanced Python feature that you might not be used to. It is just a shortcut for:

    def action(tokens):
        return ' '.join(tokens)

    For parse actions that are simple joining or string upper/lower casing, lambdas are very convenient.
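
    To make the difference between the two options concrete, here is a small sketch. It uses plain Word(alphas) in place of your simple_punc definition, and the sample string with extra spaces is invented for the comparison:

```python
from pyparsing import OneOrMore, Word, alphas, originalTextFor

words = OneOrMore(Word(alphas))

# Option 1: a parse action that rejoins the separately parsed words.
joined = words.copy().setParseAction(lambda tokens: " ".join(tokens))

# Option 2: return the original substring covered by the match.
original = originalTextFor(OneOrMore(Word(alphas)))

sample = "Trio   Trans Inc"  # note the extra spaces after "Trio"
joined_result = joined.parseString(sample).asList()
original_result = original.parseString(sample).asList()
print(joined_result)
print(original_result)
```

    The parse action normalizes the spacing to single blanks, while originalTextFor keeps whatever spacing was in the input. Which one you want depends on whether the original formatting matters to you.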

    A couple of other "style" points on your Python code:

    namedtuple is a nice feature to have learned, but is not totally necessary the way you have used it here, which seems to be to build a mini-class for returning multiple values from a function. The same capability has been in Python for years before namedtuple was introduced, where you simply return the multiple values. In your case this would look like:

        return testText, hotelPhone

    which returns a tuple that can be unpacked in the caller - you don't even need the enclosing ()'s.  The caller of your function would look like:

        testText, testBnf = foo()

    Here's another example: the built-in divmod function returns the quotient and remainder of the division of 2 numbers, and you call it like this:

        quotient,remainder = divmod(10,7)

    after which quotient has the value 1 and remainder the value 3. 

    In contrast, namedtuple is really a shortcut for defining classes that are little more than structs, like:

        Point3D = namedtuple("Point3D", "x y z")

    No setters, no getters, just 3 attributes on an immutable tuple.  There's nothing wrong with the code you've written, and namedtuple is a good tool to have in your kit, but you don't need to overdo it.
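
    A short side-by-side sketch of the two styles, using placeholder strings instead of real test text and grammar (plain_result and named_result are hypothetical helper names):

```python
from collections import namedtuple

def plain_result():
    # Returning two values packs them into an ordinary tuple.
    return "sample text", "sample grammar"

TestBNF = namedtuple("TestBNF", "testData bnf")

def named_result():
    # Same two values, but with attribute names attached.
    return TestBNF("sample text", "sample grammar")

text, grammar = plain_result()      # unpack by position
named = named_result()
text2, grammar2 = named             # a namedtuple unpacks the same way
print(text, grammar)
print(named.testData, named.bnf)    # or read the fields by name
```

    Since a namedtuple is still a tuple, callers can switch between the two styles without changing how they unpack the result.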

    Opening files and reading data are so common that it's a good idea to learn the best idiom right from the beginning.  Two points about your file-reading code: use read() to read the complete contents of a text file as a string (instead of ''.join(f.readlines())), and get in the habit of using context managers and the 'with' statement, which in this case would take care of closing the file after you are done with it.  Here is your file-reading code:

            try:
                testFile = open(testFilePath,'r')
                testData = "".join( testFile.readlines() )
                testText = testData
            except Exception as err:
                ...

    To get the same benefit of a context manager using old-style try-except-finally, it would look like:

            try:
                testFile = open(testFilePath,'r')
                try:
                    testData = "".join( testFile.readlines() )
                    testText = testData
                finally:
                    testFile.close()
            except Exception as err:
                ...

    Here is your code using with and read():

            try:
                with open(testFilePath) as testFile:
                    testData = testFile.read()
                    testText = testData
            except Exception as err:
                ...

    If you wanted to process each line of the input file (formerly done by iterating over the list returned by readlines), you can use the file object itself as an iterator:

            with open(testFilePath) as testFile:
                for testLine in testFile:
                    # do something here with testLine
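
    Both reading styles can be tried without your data file. This sketch writes a small throwaway file first (the temporary file and its two lines are just stand-ins for your real input):

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("(775) 329-0711\n52-998-8915555\n")
    path = tmp.name

try:
    # read() returns the whole file as one string.
    with open(path) as f:
        whole_text = f.read()

    # Iterating over the file object yields one line at a time.
    with open(path) as f:
        lines = [line.rstrip("\n") for line in f]
finally:
    os.remove(path)

print(repr(whole_text))
print(lines)
print(f.closed)  # the with statement closed the file for us
```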

    Context managers are excellent Python style, and show that you comprehend the need for proper closure/cleanup of the resources you are using. Use them for things like locks, database transactions, request handlers - anything that has paired open/close or get/give-back statements, to ensure that the second statement always gets done. (If you are familiar with C++, it's like using auto_ptr to automatically release your allocated memory.)
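
    If you ever want this behavior for your own paired get/give-back code, the standard library's contextlib makes it fairly easy. A minimal sketch, where borrowed and the events list are invented names that only record what happens:

```python
from contextlib import contextmanager

events = []

@contextmanager
def borrowed(name):
    # The "get" step runs when the with block is entered.
    events.append("acquire " + name)
    try:
        yield name
    finally:
        # The "give-back" step runs even if the body raises.
        events.append("release " + name)

try:
    with borrowed("resource") as r:
        events.append("using " + r)
        raise ValueError("something went wrong")
except ValueError:
    events.append("error handled")

print(events)
```

    The release step is recorded before the exception reaches the except clause, which is the guarantee the with statement gives you.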

    So that was some "free advice", worth what you paid for it. Feel free to use/ignore any or all of it. :)

    - Paul

  • Anonymous - 2012-01-10

    WOW! That was so much more than I had hoped for, thank you very much. Sorry it took so long to get back to this thread. This is an on-again, off-again project for me, and I still haven't figured this one out on my own, so I am eager to try out the things you suggested. The only formal programming training I have had was a class in BASIC nearly twenty-five years ago, so people like you who go the extra mile to help out the new guy have meant a lot to me. Thanks. And thanks as well for this great parsing tool; I'm learning Python just to use it.

