Thread: [Pyparsing] Problems with white space in line break aware parsing

Hi,

After years of writing hand-crafted parsers for many purposes, I took on a 
new task that looked like a good candidate for starting with pyparsing. The 
first steps look very promising, BTW. Fiddling with regular expressions can 
be mind-boggling, while using such more or less simple Python expressions is 
much handier.

I have to process some machine-generated PDF content, and I have no 
influence on the generating side.

After extracting the text with PDFMiner, I have to parse what some people 
would call an unholy mess. The major point is that it depends on line 
breaks and empty lines.

Attached is my starting point. Please excuse some German labels.

The script tries to parse the address data in three different forms, but 
address1 is the one that creates problems. The 4th address in the test data 
contains such a beast. The problem here is that the line between 
"Herr Pumuckl" and "Bibi Blocksbergstrasse" contains a blank. I try to 
detect an empty line with:

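from pyparsing import ParserElement, LineEnd
# Skip blanks, tabs and carriage returns, but keep newlines significant
ParserElement.setDefaultWhitespaceChars(' \t\r')
NL = LineEnd().suppress()
# Two consecutive line ends are meant to mark an empty line
empty = (NL + NL).suppress()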

Although the blank is part of the default whitespace chars, it seems to get 
in the way of the empty expression test. Why?
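
One possible workaround, independent of the answer to that question, might 
be to normalize the extracted text before it reaches the parser, so that a 
line containing only blanks or tabs becomes truly empty. The following is 
only a minimal sketch under that assumption: normalize_blank_lines is a 
hypothetical helper name, not part of pyparsing or of the attached script, 
and it uses only the standard re module.

```python
import re

# Hypothetical helper, not part of pyparsing or the attached script.
# Matches lines that consist only of blanks and/or tabs.
_BLANK_ONLY_LINE = re.compile(r"^[ \t]+$", re.MULTILINE)

def normalize_blank_lines(text):
    """Remove blanks and tabs from lines that contain nothing else.

    A sequence such as "\n \n" becomes "\n\n". Lines with other content
    are left untouched. Assumes "\n" line endings.
    """
    return _BLANK_ONLY_LINE.sub("", text)

# Illustrative sample modelled on the problem described above.
sample = "Herr Pumuckl\n \nBibi Blocksbergstrasse\n"
cleaned = normalize_blank_lines(sample)
print(repr(cleaned))
```

The cleaned string could then be passed to the grammar instead of the raw 
PDFMiner output. This sidesteps the whitespace question rather than 
explaining it.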

Let me know if the script is still too complex; I can reduce it, but it 
might help those who try to achieve something similar.

Thanks in advance,
Pete

pyparsing-users