Re: [Pyparsing] Problems with white space in line break aware parsing

Something removed the script. Hmm.

Inlined below..

On Monday, 23 September 2013 14:19:40 Hans-Peter Jansen wrote:
> Hi,
> 
> after years of hand-crafting parsers for many reasons, a new task
> smelled like a good candidate for starting with pyparsing. The first
> steps look very promising, BTW. Fiddling with regexps can be very mind
> boggling, while using such more or less simple Python expressions is much
> handier.
> 
> I have to process some machine-generated PDF content, where I don't have any
> influence on the generating side.
> 
> After extracting the text with PDFMiner, I have to parse what some
> people would call an unholy mess. The major point is that it depends on line
> breaks and empty lines.
> 
> Attached is my starting point. Please excuse some German labels...
> 
> The script tries to parse the address data in three different forms, but
> address1 is the one that creates problems. The 4th address in the test data
> contains such a beast. The problem here is that the line between "Herrn Pumuckl"
> and "Bibi Blocksberggasse 1" contains a blank. I try to detect an empty line
> with:
> 
> ParserElement.setDefaultWhitespaceChars(' \t\r')
> NL = LineEnd().suppress()
> empty = (NL + NL).suppress()
> 
> Although the blank is part of the default whitespace chars, it seems to get in
> the way of the empty expression test. Why?
> 
> Let me know if the script is still too complex; I can reduce it, but this
> might help those who try to achieve something similar.
> 
> Thanks in advance,
> Pete

# -*- coding: utf-8 -*-

from pyparsing import *

ParserElement.setDefaultWhitespaceChars(' \t\r')
NL = LineEnd().suppress()
empty = (NL + NL).suppress()

line = restOfLine + NL
line.setParseAction(lambda t: [t[0].strip()])

name1 = line('name1')
name2 = line('name2')
strasse = line('strasse')
plz = Word(alphanums).setResultsName('plz')
ort = line('ort')
land = line('land')
bestimmt = Literal(u'Bestimmt für').suppress()

address1 = Group(name1 + name2 + empty + strasse + plz + ort + land) + empty
address2 = Group(name1 + name2 + strasse + plz + ort + land) + bestimmt
address3 = Group(name1 + strasse + plz + ort + land) + bestimmt

address = empty + Suppress(u'Warenempfänger') + empty + (address1 ^ address2 ^ address3)

teststr = u"""

Warenempfänger

Metronom Tick-Tack
12, Zone Industrielle Schéleck 22
3225 Bettembourg
Luxemburg
Bestimmt für

Warenempfänger

Humfti-Bumfti AG
Herr Wichtig
Landwehrstr. 1
34454 Bad Arolsen-Mengeringhausen
Deutschland
Bestimmt für

Warenempfänger

Fa. Simsalabim

Im Acker 88
76437 Rastatt
Deutschland
Bestimmt für

Warenempfänger

Hotzenplotz GmbH
Herrn Pumuckl

Bibi Blocksberggasse 1
66955 Pirmasens
Deutschland

Warenempfänger

Uga Uga

Am Nashorn 66
66424 Homburg / Saar
Deutschland
Bestimmt für

"""

import logging

# Logger used for reporting parse errors
log = logging.getLogger(__name__)

for idx, (tok, sloc, eloc) in enumerate(address.scanString(teststr)):
    try:
        print('page %s: (0x%x, 0x%x): \n%s' % (idx, sloc, eloc, tok[0].asDict()))
    except ParseException as err:
        log.error('page %s: %s' % (idx, err))
        log.error(err.line)
        log.error(' ' * (err.column - 1) + '^')
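
On the blank-line question above: one possible workaround, offered only as a sketch and not as pyparsing's intended solution, is to strip trailing blanks from every line before parsing, so that a line holding only a blank becomes truly empty. The helper name normalize_blank_lines and the sample string are made up for illustration; only the standard re module is used.

```python
import re


def normalize_blank_lines(text):
    """Remove trailing spaces, tabs and carriage returns from every line.

    A line that contains only whitespace becomes an empty line, so two
    consecutive newlines mark it unambiguously.
    """
    # With re.MULTILINE, '$' matches just before each newline and at the
    # end of the string, so the pattern only touches trailing whitespace.
    return re.sub(r'[ \t\r]+$', '', text, flags=re.MULTILINE)


# Hypothetical sample: the line between the two names holds a single blank.
sample = u"Hotzenplotz GmbH\nHerrn Pumuckl\n \nBibi Blocksberggasse 1\n"
cleaned = normalize_blank_lines(sample)
print(repr(cleaned))
```

The cleaned text could then be passed to address.scanString() in place of teststr; whether that alone is enough for address1 to match depends on the rest of the grammar.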