
How to remove the Byte Order Mark \ufeff

Chris
2017-11-09
  • Chris

    Chris - 2017-11-09

    Hi all,

    I am new to Python. I wrote a script that reads a text file (d:\subsitutions.txt) and searches for and replaces its entries in all files in the target folder (d:\temp\a). Nothing is found, however, because the search string has the Byte Order Mark in front of it.

    The subsitutions.txt file has the following structure and is saved as UTF-8 with BOM:

    an8 an7
    

    My program currently looks like this; for now it only searches and shows the matching lines:

    #coding: UTF-8
    import os
    import sys
    
    console.write('Program Start !!\n')
    filePathSrc = u'D:\\TEMP\\a'
    subsitutionFile = u'D:\\subsitutions.txt'
    console.write(u'Source: ' + filePathSrc + '\n')
    for root, dirs, files in os.walk(filePathSrc):
        console.write('Searching ' + root + '\n')
        for fn in files:
            fileName = root + '\\' + fn
            console.write(u'fileName: ' + fileName + '\n')
            notepad.open(fileName.encode('utf-8'))
            # replace values from the substitution file; values are separated by a space
            # Format:
            # A B
            with open(subsitutionFile) as f:
                for l in f:
                    if len(l) > 1:
                        s = l.split()
                        console.write('from:' + '"' + s[0] + '"' + '\t to:' + '"' + s[1] + '"' + '\n')
                        startPos = 0
                        while True:
                            pos = editor.findText(FINDOPTION.REGEXP, startPos, editor.getLength(), s[0])
                            if pos is None:
                                if startPos == 0:
                                    console.write(s[0] + ' not found !!\n')
                                break
                            else:
                                editor.gotoPos(pos[0])
                                console.write(str(editor.lineFromPosition(editor.getCurrentPos())) + '[' + str(editor.getCurrentPos()) + ']: ' + editor.getCurLine())
                                startPos = pos[0] + 1
                        #editor.replace(s[0], s[1])
            f.close()
            #notepad.save()
            notepad.close()
    

    Result in the console:

    Program Start !!
    Source: D:\TEMP\a
    Searching D:\TEMP\a
    fileName: D:\TEMP\a\testing.txt
    from:"an8"  to:"an7"
    an8 not found !!
    

    The variable s[0] has \ufeff in front of an8.
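
    For illustration, a minimal sketch (plain Python, outside Notepad++) of how the BOM ends up attached to the first token. The sample bytes are an assumption that mirrors the one-line file above. Read as raw bytes, the token starts with EF BB BF; decoded as UTF-8, it starts with \ufeff. Neither form equals an8, which would explain the failed search.

```python
import codecs

# Assumed file content: UTF-8 BOM followed by the single line "an8 an7".
raw = codecs.BOM_UTF8 + b'an8 an7\n'

# First token when the file is read as bytes (what open() gives in Python 2).
first_raw = raw.split()[0]

# First token after decoding as plain UTF-8: the BOM survives as U+FEFF,
# because U+FEFF is not treated as whitespace by split().
first_text = raw.decode('utf-8').split()[0]

print(repr(first_raw))
print(repr(first_text))
```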

    Because this substitution file will eventually contain non-English characters (Chinese words), I want to keep it in UTF-8 encoding.

    Thank you very much for your help.

    Chris

     
  • CFrank

    CFrank - 2017-11-09

    Why not convert your file to UTF-8 (without the BOM)?

    Cheers
    Claudia

     
    • Chris

      Chris - 2017-11-10

      You are right: when I convert it to UTF-8, the issue is solved. But I am wondering how to solve it in the program, so that it can handle different Unicode formats.

       
  • CFrank

    CFrank - 2017-11-10

    Well, then you need to find out which encoding has been used, which, btw, cannot be done
    in a 100% safe manner.
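
    As a rough sketch of the part that can be checked reliably, the leading bytes can be compared against the BOM constants in the codecs module. The helper name sniff_bom and the fallback to utf-8 are assumptions for illustration; a file without a BOM still has to be guessed, which is where the uncertainty comes from.

```python
import codecs

# Order matters: the UTF-32 LE BOM begins with the same bytes as the
# UTF-16 LE BOM, so the UTF-32 entries must be tested first.
_BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def sniff_bom(data, default='utf-8'):
    """Return a codec name based on a leading BOM, or `default` if none.

    Hypothetical helper: `data` is the first few raw bytes of the file.
    The returned codecs all strip the BOM while decoding.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default  # no BOM found: this is only a guess
```

    It could be used as sniff_bom(open(path, 'rb').read(4)), with the result passed as the encoding when reopening the file.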

    Npp uses chardet to identify the encoding; chardet is also available as a Python module.
    If only UTF-8 with or without BOM is used, then you can use the codecs module and do
    something like

    import codecs
    
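    # Python 2: without an encoding argument, codecs.open yields byte strings
    with codecs.open(u'FILENAME') as f:
        for l in f:
            if len(l) > 1:
                s = l.split()
                if s[0].startswith(codecs.BOM_UTF8):
                    # drop the BOM bytes, then decode to unicode
                    _s = s[0][len(codecs.BOM_UTF8):].decode('utf-8')
                else:
                    # no BOM: decode as well, so _s is always unicode
                    _s = s[0].decode('utf-8')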
                ...
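
    A possible alternative, sketched under the same assumption that the file is always UTF-8 with or without BOM, is to let the standard utf-8-sig codec skip the BOM while decoding. The function name read_substitutions is made up for this example; io.open is used so the same code runs on Python 2 and 3.

```python
import io


def read_substitutions(path):
    """Return a list of (search, replace) unicode pairs from `path`.

    Hypothetical helper. The 'utf-8-sig' codec drops a leading BOM if
    one is present and otherwise behaves like plain 'utf-8'.
    """
    pairs = []
    with io.open(path, encoding='utf-8-sig') as f:
        for line in f:
            parts = line.split()
            # skip blank or incomplete lines instead of raising IndexError
            if len(parts) >= 2:
                pairs.append((parts[0], parts[1]))
    return pairs
```

    With this, the first search string comes back without \ufeff, whether or not the file was saved with a BOM.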
    

    Cheers
    Claudia

     
