
Pyparsing and UTF-8 (Russian, Cyrillic)

2017-04-06
  • Egor Ovcharenko

    Egor Ovcharenko - 2017-04-06

    I need to do some simple text parsing, and everything works fine except for printing the output.

    The code is the following:

    # -*- coding: utf-8 -*-
    from pyparsing import Word, OneOrMore
    inFilename = 'out1.txt'
    FIN = open(inFilename, 'r')
    TEXT = FIN.read()
    myDigits = '0123456789'
    eng_alphas = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
    rus_alphas = 'йцукенгшщзхъфывапролджэячсмитьбюЙЦУКЕНГШЩЗХЪФЫВАПРОЛДЖЭЯЧСМИТЬБЮ'
    punctuation = '.,:;'
    myPrintables = myDigits + eng_alphas + rus_alphas + punctuation
    aWord = Word(myPrintables)
    someText = OneOrMore(aWord)
    outputText = someText.parseString(TEXT)
    print outputText
    

    or here:
    https://github.com/evovch/Useful/blob/master/test2.py

    I provide an input text file containing one line:

    восстановление короткоживущих частиц, включая очень редкие, по продуктам их распадов;
    

    And get the following output:

    ['\xd0\xb2\xd0\xbe\xd1\x81\xd1\x81\xd1\x82\xd0\xb0\xd0\xbd\xd0\xbe\xd0\xb2\xd0\xbb\xd0\xb5\xd0\xbd\xd0\xb8\xd0\xb5', '\xd0\xba\xd0\xbe\xd1\x80\xd0\xbe\xd1\x82\xd0\xba\xd0\xbe\xd0\xb6\xd0\xb8\xd0\xb2\xd1\x83\xd1\x89\xd0\xb8\xd1\x85', '\xd1\x87\xd0\xb0\xd1\x81\xd1\x82\xd0\xb8\xd1\x86,', '\xd0\xb2\xd0\xba\xd0\xbb\xd1\x8e\xd1\x87\xd0\xb0\xd1\x8f', '\xd0\xbe\xd1\x87\xd0\xb5\xd0\xbd\xd1\x8c', '\xd1\x80\xd0\xb5\xd0\xb4\xd0\xba\xd0\xb8\xd0\xb5,', '\xd0\xbf\xd0\xbe', '\xd0\xbf\xd1\x80\xd0\xbe\xd0\xb4\xd1\x83\xd0\xba\xd1\x82\xd0\xb0\xd0\xbc', '\xd0\xb8\xd1\x85', '\xd1\x80\xd0\xb0\xd1\x81\xd0\xbf\xd0\xb0\xd0\xb4\xd0\xbe\xd0\xb2;']
    

    How can I convert this into readable text?
    I experimented a lot with encode/decode but could not get a usable result.

     
  • Paul McGuire

    Paul McGuire - 2017-04-06

    It looks like you are on Python 2; the Unicode handling in Python 3 is much better. Can you try this?

    for wd in outputText:
        print(wd)
    

    or

    print(u' '.join(outputText))
    

    You may be getting this because you are printing the parse results directly, which uses Python's repr() function to display the strings, so non-ASCII bytes show up as \x escapes.

    -- Paul
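
The repr behaviour Paul describes can be reproduced without pyparsing. The following is a minimal sketch, not the original script: the two sample words and all variable names are chosen here for illustration, and the byte strings stand in for what a Python 2 `str` holds after reading a UTF-8 file.

```python
# -*- coding: utf-8 -*-
# Two sample words taken from the input line in the question.
words = [u"по", u"их"]

# UTF-8 byte strings, standing in for Python 2 str values read from the file.
raw = [w.encode("utf-8") for w in words]

# Displaying a list uses repr() on each element, so every non-ASCII byte
# is rendered as a \x escape instead of a readable character.
list_display = repr(raw)

# Decoding each item as UTF-8 turns the bytes back into text.
decoded = [item.decode("utf-8") for item in raw]
readable = u" ".join(decoded)
```

Here `list_display` contains the same `\xd0\xbf\xd0\xbe` escapes seen in the question, while `readable` holds the Cyrillic text and prints normally on a terminal that supports it.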

     
  • Egor Ovcharenko

    Egor Ovcharenko - 2017-04-06

    Thank you, Paul!
    The first solution works nicely, while in the second I had to leave out the 'u' prefix.

    So:

    finalOutput = ' '.join(outputText)
    print finalOutput
    

    It seems illogical, but OK...
    Do you think upgrading to Python 3 would help?
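
A likely reason the `u` prefix had to go: in Python 2, combining a unicode string with byte strings makes the interpreter decode the byte strings with the default ASCII codec, which cannot handle Cyrillic UTF-8 bytes. The thread does not show the actual error, so this is an assumption. The sketch below performs that ASCII decode explicitly and then shows one way around it, reading the file as text with `io.open` and an explicit encoding, which behaves the same on Python 2.7 and Python 3. The temporary file is a stand-in for out1.txt, and all names are illustrative.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

line = u"по продуктам их распадов;"

# Write a small UTF-8 sample file (a stand-in for the real input file).
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="utf-8") as f:
    f.write(line)

# Binary read: the tokens are UTF-8 byte strings, like Python 2 str values.
with io.open(path, "rb") as f:
    raw_words = f.read().split()

# Python 2 decodes byte strings as ASCII when they meet a unicode string.
# Doing that decode explicitly shows why it cannot succeed for Cyrillic.
try:
    raw_words[0].decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Text read with an explicit encoding: every token is unicode from the start,
# so joining with a unicode separator is safe.
with io.open(path, "r", encoding="utf-8") as f:
    text_words = f.read().split()

os.remove(path)
joined = u" ".join(text_words)
```

In Python 3 the built-in `open()` accepts the same `encoding` argument and `str` is always Unicode text, so the mix of byte strings and unicode strings that causes trouble here does not arise as long as the file is read in text mode with the right encoding.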

     

