[Pyparsing] Problems with white space in line break aware parsing
From: Hans-Peter J. <hp...@ur...> - 2013-09-23 12:45:43
Hi,

after years of creating hand-crafted parsers for many reasons, a new task smelled like a good candidate for starting with pyparsing. The first steps look very promising, BTW. Fiddling with regexps can be mind-boggling, while using such more or less simple Python expressions is much handier.

I have to process some machine-generated PDF content, where I don't have any influence on the creating side. After extracting the text with PDFMiner, I have to parse what some people would call an unholy mess. The major point is that it depends on line breaks and empty lines.

Attached is my starting point. Please excuse some German labels. The script tries to parse the address data in three different forms, but address1 is the one that creates problems. The 4th address in the test data contains such a beast. The problem here is that the line between "Herr Pumuckl" and "Bibi Blocksbergstrasse" contains a blank.

I try to detect an empty line with:

    ParserElement.setDefaultWhitespaceChars(' \t\r')
    NL = LineEnd().suppress()
    empty = (NL + NL).suppress()

Although the blank is part of the default whitespace chars, it seems to get in the way of the empty expression test. Why?

Let me know if the script is still too complex; I can reduce it, but this might help those who try to achieve something similar.

Thanks in advance,
Pete
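For readers with the same kind of input, here is a minimal sketch of one possible workaround. It does not explain the pyparsing behaviour asked about above. It only strips trailing blanks from every line before parsing, so that a whitespace-only line becomes a truly empty one. The function name `normalize_blank_lines` is hypothetical and not part of the attached script, and the sample text is reduced to the two lines quoted in the message.

```python
def normalize_blank_lines(text):
    """Strip trailing spaces, tabs and carriage returns from every line.

    Hypothetical helper (not from the attached script): a line that holds
    only blanks becomes an empty line, so two line ends follow each other
    directly in the result.
    """
    return "\n".join(line.rstrip(" \t\r") for line in text.split("\n"))


# Reduced sample: the middle line contains a single blank.
sample = "Herr Pumuckl\n \nBibi Blocksbergstrasse\n"
cleaned = normalize_blank_lines(sample)
print(repr(cleaned))
```

The cleaned text could then be handed to the parser in place of the raw PDFMiner output. Whether that is acceptable depends on whether trailing blanks carry any meaning in the data, which is an assumption here.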