Re: [Pyparsing] Problems with white space in line break aware parsing
From: Hans-Peter J. <hp...@ur...> - 2013-09-23 13:04:41
Something removed the script. Hmm. Inlined below.

On Monday, 23 September 2013 14:19:40 Hans-Peter Jansen wrote:
> Hi,
>
> after years of creating hand-crafted parsers for many reasons, a new task
> smelled like a good candidate for starting with pyparsing. The first
> steps look very promising, BTW. Fiddling with regexps can be very
> mind-boggling, while using such more or less simple Python expressions is
> much handier.
>
> I have to process some machine-generated PDF content, where I don't have any
> influence on the creating side.
>
> After extracting the text with PDFMiner, I have to parse what some people
> would call an unholy mess. The major point is that it depends on line
> breaks and empty lines.
>
> Attached is my starting point. Please excuse some German labels.
>
> The script tries to parse the address data in three different forms, but
> address1 is the one that creates problems. The 4th address in the test data
> contains such a beast. The problem here is that the line between
> "Herrn Pumuckl" and "Bibi Blocksberggasse 1" contains a blank. I try to
> detect an empty line with:
>
> ParserElement.setDefaultWhitespaceChars(' \t\r')
> NL = LineEnd().suppress()
> empty = (NL + NL).suppress()
>
> Although the blank is part of the default whitespace chars, it seems to get
> in the way of the empty expression test. Why?
>
> Let me know if the script is still too complex; I can reduce it, but it
> might help those who try to achieve something similar.
>
> Thanks in advance,
> Pete

# -*- coding: utf-8 -*-
import logging

from pyparsing import *

log = logging.getLogger(__name__)

# Keep '\n' out of the skipped whitespace so line breaks stay significant.
ParserElement.setDefaultWhitespaceChars(' \t\r')

NL = LineEnd().suppress()
empty = (NL + NL).suppress()
line = restOfLine + NL
line.setParseAction(lambda t: [t[0].strip()])

# German labels: strasse = street, plz = postal code, ort = city, land = country
name1 = line('name1')
name2 = line('name2')
strasse = line('strasse')
plz = Word(alphanums).setResultsName('plz')
ort = line('ort')
land = line('land')
bestimmt = Literal(u'Bestimmt für').suppress()

# The three address forms; address1 is the problematic one.
address1 = Group(name1 + name2 + empty + strasse + plz + ort + land) + empty
address2 = Group(name1 + name2 + strasse + plz + ort + land) + bestimmt
address3 = Group(name1 + strasse + plz + ort + land) + bestimmt
address = empty + Suppress(u'Warenempfänger') + empty + (address1 ^ address2 ^ address3)

teststr = u"""

Warenempfänger

Metronom Tick-Tack
12, Zone Industrielle Schéleck 22
3225 Bettembourg
Luxemburg
Bestimmt für

Warenempfänger

Humfti-Bumfti AG
Herr Wichtig
Landwehrstr. 1
34454 Bad Arolsen-Mengeringhausen
Deutschland
Bestimmt für

Warenempfänger

Fa. Simsalabim
Im Acker 88
76437 Rastatt
Deutschland
Bestimmt für

Warenempfänger

Hotzenplotz GmbH
Herrn Pumuckl
 
Bibi Blocksberggasse 1
66955 Pirmasens
Deutschland

Warenempfänger

Uga Uga
Am Nashorn 66
66424 Homburg / Saar
Deutschland
Bestimmt für
"""

for idx, (tok, sloc, eloc) in enumerate(address.scanString(teststr)):
    try:
        print 'page %s: (0x%x, 0x%x): \n%s' % (idx, sloc, eloc, tok[0].asDict())
    except ParseException, err:
        log.error('page %s: %s' % (idx, err))
        log.error(err.line)
        log.error(' ' * (err.column - 1) + '^')
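For comparison, the empty-line test described in the message can be sketched outside pyparsing. This is only a rough illustration, assuming that a line holding nothing but blanks, tabs or carriage returns should count as empty. It uses the standard `re` module only; `BLANK_LINE` and `split_blocks` are hypothetical names that appear neither in the script nor in pyparsing, and the sample is just the 4th address from the test data.

```python
# -*- coding: utf-8 -*-
import re

# Assumption: a separator is a line break, then one or more lines that hold
# only blanks, tabs or carriage returns, each ended by another line break.
BLANK_LINE = re.compile(r'\n(?:[ \t\r]*\n)+')


def split_blocks(text):
    """Split text into blocks separated by empty or whitespace-only lines.

    Hypothetical helper for illustration; returns a list of blocks, each a
    list of stripped, non-empty lines.
    """
    blocks = []
    for chunk in BLANK_LINE.split(text.strip('\n')):
        lines = [ln.strip() for ln in chunk.split('\n')]
        lines = [ln for ln in lines if ln]
        if lines:
            blocks.append(lines)
    return blocks


# The 4th address, with a single blank on the line after "Herrn Pumuckl".
sample = (u"Hotzenplotz GmbH\nHerrn Pumuckl\n \n"
          u"Bibi Blocksberggasse 1\n66955 Pirmasens\nDeutschland\n")
blocks = split_blocks(sample)
```

At the character level "\n \n" and "\n\n" are different strings, so the sketch allows optional blanks between the two line breaks explicitly instead of relying on whitespace skipping.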