Thread: [Pyparsing] Problems with white space in line break aware parsing
From: Hans-Peter J. <hp...@ur...> - 2013-09-23 12:45:43
Hi,

after years of hand-crafting parsers for many purposes, a new task looked like a good candidate for starting with pyparsing. The first steps look very promising, BTW. Fiddling with regexps can be mind-boggling, while composing more or less simple Python expressions is much handier.

I have to process some machine-generated PDF content, where I have no influence on the generating side.

After extracting the text with PDFMiner, I have to parse what some people would call an unholy mess. The major point is that it depends on line breaks and empty lines.

Attached is my starting point. Please excuse some German labels.

The script tries to parse the address data in three different forms, but address1 is the one that creates problems. The 4th address in the test data contains such a beast. The problem here is that the line between "Herrn Pumuckl" and "Bibi Blocksberggasse 1" contains a blank. I try to detect an empty line with:

    ParserElement.setDefaultWhitespaceChars(' \t\r')
    NL = LineEnd().suppress()
    empty = (NL + NL).suppress()

Although the blank is part of the default whitespace chars, it seems to get in the way of the empty expression test. Why?

Let me know if the script is still too complex; I can reduce it, but it might help those who try to achieve something similar.

Thanks in advance,
Pete
From: Hans-Peter J. <hp...@ur...> - 2013-09-23 13:04:41
Something removed the script. Hmm. Inlined below.

# -*- coding: utf-8 -*-
import logging

from pyparsing import *

log = logging.getLogger(__name__)

# Newlines are significant: only blanks, tabs and CRs are skipped as whitespace.
ParserElement.setDefaultWhitespaceChars(' \t\r')

NL = LineEnd().suppress()
empty = (NL + NL).suppress()
line = restOfLine + NL
line.setParseAction(lambda t: [t[0].strip()])

name1 = line('name1')
name2 = line('name2')
strasse = line('strasse')                    # street
plz = Word(alphanums).setResultsName('plz')  # postal code
ort = line('ort')                            # town
land = line('land')                          # country
bestimmt = Literal(u'Bestimmt für').suppress()  # "destined for"

address1 = Group(name1 + name2 + empty + strasse + plz + ort + land) + empty
address2 = Group(name1 + name2 + strasse + plz + ort + land) + bestimmt
address3 = Group(name1 + strasse + plz + ort + land) + bestimmt
# 'Warenempfänger' = consignee
address = empty + Suppress(u'Warenempfänger') + empty + (address1 ^ address2 ^ address3)

# The line between 'Herrn Pumuckl' and 'Bibi Blocksberggasse 1' contains a single blank.
teststr = u"""

Warenempfänger

Metronom Tick-Tack
12, Zone Industrielle Schéleck 22
3225 Bettembourg
Luxemburg
Bestimmt für

Warenempfänger

Humfti-Bumfti AG
Herr Wichtig
Landwehrstr. 1
34454 Bad Arolsen-Mengeringhausen
Deutschland
Bestimmt für

Warenempfänger

Fa. Simsalabim
Im Acker 88
76437 Rastatt
Deutschland
Bestimmt für

Warenempfänger

Hotzenplotz GmbH
Herrn Pumuckl
 
Bibi Blocksberggasse 1
66955 Pirmasens
Deutschland

Warenempfänger

Uga Uga
Am Nashorn 66
66424 Homburg / Saar
Deutschland
Bestimmt für
"""

for idx, (tok, sloc, eloc) in enumerate(address.scanString(teststr)):
    try:
        print('page %s: (0x%x, 0x%x): \n%s' % (idx, sloc, eloc, tok[0].asDict()))
    except ParseException as err:
        log.error('page %s: %s' % (idx, err))
        log.error(err.line)
        log.error(' ' * (err.column - 1) + '^')
From: Hans-Peter J. <hp...@ur...> - 2013-09-23 23:28:40
Got it: it was a matter of excluding the right things.

Sorry for the disturbance,
Pete
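
The thread does not say what was excluded in the end. As a rough sketch of one alternative (an assumption, not the fix Pete found), whitespace-only lines can be turned into truly empty ones before the text reaches the grammar, so that an empty line is always just two consecutive newlines. The helper name `normalize_blank_lines` and the sample string are made up for illustration; only the standard `re` module is used.

```python
import re

# Hypothetical helper, not taken from the thread: matches trailing blanks,
# tabs and carriage returns at the end of every line (never the newline itself).
_TRAILING_WS = re.compile(r'[ \t\r]+$', re.MULTILINE)


def normalize_blank_lines(text):
    """Strip trailing whitespace per line, so blank-only lines become empty."""
    return _TRAILING_WS.sub('', text)


# Illustrative sample: the third line holds a single blank, as in the 4th address.
sample = u"Hotzenplotz GmbH\nHerrn Pumuckl\n \nBibi Blocksberggasse 1\n"
cleaned = normalize_blank_lines(sample)
print(repr(cleaned))
```

The cleaned text could then be passed to `address.scanString(...)` in place of the raw PDFMiner output. Whether this is preferable to adjusting the grammar itself depends on whether trailing whitespace carries any meaning in the extracted text.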