Notepad++ Python Script / Discussion / Help: How to remove the Byte Order Mark \ufeff

How to remove the Byte Order Mark \ufeff

Forum: Help

Created: 2017-11-09

Updated: 2017-11-09

Hi all,

I am a newbie in Python. I wrote a script that reads a text file (d:\subsitutions.txt) and searches for and replaces its entries in all files in the target folder (d:\temp\a), but nothing is found because the search string has the Byte Order Mark in front of it.

The subsitutions.txt file is structured as follows and is saved as UTF-8 with BOM:

an8 an7

My program currently looks like this; for now it only searches and prints the matching lines:

#coding: UTF-8
import os
import sys

console.write('Program Start !!\n')
filePathSrc = u'D:\\TEMP\\a'
subsitutionFile = u'D:\\subsitutions.txt'
console.write(u'Source: ' + filePathSrc + '\n')
for root, dirs, files in os.walk(filePathSrc):
    console.write('Searching ' + root + '\n')
    for fn in files:
        fileName = root + '\\' + fn
        console.write(u'fileName: ' + fileName + '\n')
        notepad.open(fileName.encode('utf-8'))
        # Replace the values listed in the substitution file; the two values
        # on each line are separated by a space.
        # Format:
        # A B
        with open(subsitutionFile) as f:
            for l in f:
                if len(l) > 1:
                    s = l.split()
                    console.write('from:' + '"' + s[0] + '"' + '\t to:' + '"' + s[1] + '"' + '\n')
                    startPos = 0
                    while True:
                        pos = editor.findText(FINDOPTION.REGEXP, startPos, editor.getLength(), s[0])
                        if pos is None:
                            if startPos == 0:
                                console.write(s[0] + ' not found !!\n')
                            break
                        else:
                            editor.gotoPos(pos[0])
                            console.write(str(editor.lineFromPosition(editor.getCurrentPos())) + '[' + str(editor.getCurrentPos()) + ']: ' + editor.getCurLine())
                            startPos = pos[0] + 1
                    #editor.replace(s[0], s[1])
        # No f.close() needed here: the with block has already closed f.
        #notepad.save()
        notepad.close()

Result in the Console

Program Start !!
Source: D:\TEMP\a
Searching D:\TEMP\a
fileName: D:\TEMP\a\testing.txt
from:"an8"  to:"an7"
an8 not found !!

The variable s[0] has \ufeff in front of an8.
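
The symptom can be reproduced outside Notepad++ with a minimal sketch. It assumes the first line of the file is the three UTF-8 BOM bytes followed by `an8 an7`; `raw_line` is a hypothetical in-memory stand-in for that line, not the real file.

```python
import codecs

# Hypothetical stand-in for the first line of a file saved as UTF-8 with BOM.
raw_line = codecs.BOM_UTF8 + b'an8 an7\n'

# Splitting on whitespace leaves the BOM bytes glued to the first token.
first_token = raw_line.split()[0]
has_bom = first_token.startswith(codecs.BOM_UTF8)

# Decoding as plain UTF-8 keeps the BOM as the character U+FEFF ...
decoded = first_token.decode('utf-8')
# ... while the 'utf-8-sig' codec drops a leading BOM when decoding.
clean = first_token.decode('utf-8-sig')
```

Only the first token of the first line is affected, because the BOM appears once, at the very start of the file.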

Because the substitution file will eventually contain non-English characters (Chinese words), I want to keep it in UTF-8 encoding.

Thank you very much for your help.

Chris

CFrank - 2017-11-09

Why not convert your file to UTF-8 (without the BOM)?
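
The conversion can also be scripted instead of done by hand in the editor. The following is only a sketch: `strip_utf8_bom` is a hypothetical helper name, and it assumes the file is UTF-8 and small enough to read into memory.

```python
import codecs
import io


def strip_utf8_bom(path):
    """Rewrite the file at `path` without a leading UTF-8 BOM.

    Returns True if a BOM was found and removed, False if the file
    was left untouched.
    """
    with io.open(path, 'rb') as f:
        data = f.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    # Write back everything after the three BOM bytes, unchanged.
    with io.open(path, 'wb') as f:
        f.write(data[len(codecs.BOM_UTF8):])
    return True
```

Working on raw bytes means the rest of the content, including any Chinese text, is not re-encoded.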

Cheers
Claudia

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Chris - 2017-11-10
  
  You are right, when I convert it to UTF-8, this issue solved, but I am thinking that how can solve it in program to let it can face different unicode format.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

CFrank - 2017-11-10

Well, then you need to find out which encoding has been used, which, btw, cannot be done
in a 100% safe manner.
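
One partial approach is to look for a BOM at the start of the raw bytes. This is a heuristic sketch, not a full detector: `sniff_bom` and `_BOMS` are hypothetical names, and a file without a BOM simply yields `None`, which is where the uncertainty mentioned above begins.

```python
import codecs

# Order matters: the UTF-32 LE BOM begins with the same two bytes as the
# UTF-16 LE BOM, so the longer marks are checked first.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def sniff_bom(data):
    """Return a codec name if `data` starts with a known BOM, else None.

    The returned codecs ('utf-8-sig', 'utf-16', 'utf-32') consume the BOM
    while decoding, so the decoded text does not start with U+FEFF.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

A caller would read the file in binary mode, pass the first few bytes to `sniff_bom`, and fall back to a default encoding or a detector such as chardet when it returns `None`.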

Npp uses chardet to identify the encoding; chardet is also available as a Python module.
If only UTF-8 with or without BOM is used, then you can use the codecs module and do
something like

import codecs

with codecs.open(u'FILENAME') as f:
    for l in f:
        if len(l) > 1:
            s = l.split()
            if s[0].startswith(codecs.BOM_UTF8):
                _s = s[0][len(codecs.BOM_UTF8):].decode('utf-8')
            else:
                _s = s[0]
            ...
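
If only UTF-8 with or without BOM needs to be handled, another option is to let the decoder drop the BOM by opening the file with the `utf-8-sig` codec. This is a sketch under that assumption; `read_substitutions` is a hypothetical helper name, and it returns unicode strings, which may need to be encoded again before being passed to the editor functions.

```python
import io


def read_substitutions(path):
    """Return a list of (search, replace) unicode pairs read from `path`.

    Lines with fewer than two whitespace-separated values are skipped.
    """
    pairs = []
    # 'utf-8-sig' removes a leading BOM if there is one and otherwise
    # decodes exactly like 'utf-8'.
    with io.open(path, encoding='utf-8-sig') as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                pairs.append((parts[0], parts[1]))
    return pairs
```

With this, the same script works whether the substitution file was saved with or without the BOM.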

Cheers
Claudia