Notepad++ Python Script / Discussion / Help: Read file encoding using python script

mrpaul1 - 2010-12-14

Hi,

I have a large file structure that I would like to read in using python to detect the file encoding the same way that Notepad++ automatically selects the encoding from its menu. Mainly, we have a bunch of files that are detected as "UTF8 without BOM" and we would like to convert them to UTF-8 (thus adding the BOM), but we need to find where those files reside. If we manually open each file in Notepad++ and check the Encoding menu, the selection tells us which encoding is detected but we are trying to automate this.

Using the following code, I can use the Notepad++ Python Script to convert each file using the Menu Option "Convert to UTF-8":

import os
import sys

filePathSrc = "C:\\FilePath"
for fn in os.listdir(filePathSrc):
    if fn[-4:] == '.htm' or fn[-5:] == '.html':
        notepad.open(filePathSrc + "\\" + fn)
        notepad.runMenuCommand("Encoding", "Convert to UTF-8")
        notepad.save()
        notepad.close()

Now I am trying to print out the encoding of the open file, as detected by Notepad++. I have tried adding the line:

print "fileName: " +fn +" :: encoding: " + str(notepad.getEncoding())

but that prints out COOKIE (for files that are actually detected as "UTF-8 without BOM") or ENC8BIT (for files that are actually detected as "ANSI"). I am also unsure if this will be consistent for each file.

Any idea how to print out the Encoding menu selection for each file using this plugin?

Dave Brotherstone - 2010-12-15

Yes, the constants for the enums were all generated from Notepad++'s internal enums - unfortunately they're not all sensibly named.
See the enum definition in the docs:
http://npppythonscript.sourceforge.net/docs/latest/enums.html?highlight=encoding#BUFFERENCODING

COOKIE refers to a "guessed" UTF-8 encoding, ENC8BIT refers to ANSI. It will be consistent for each file, but only as consistent as Notepad++ is at detecting the encoding. It only checks characters in the first 128k, and there have to be some UTF-8 encoded (multi-byte) characters in there.

Your best bet is just to map the constants to what you want to say.

encodingMap = {
    BUFFERENCODING.COOKIE: 'UTF-8 without BOM',
    BUFFERENCODING.ENC8BIT: 'ANSI',
}
# ... your code ...
print "fileName: " + fn + " :: encoding: " + encodingMap.get(notepad.getEncoding(), 'Unknown')

Depending on what you're trying to achieve, you might want to console.write() instead of "print", unless you've redirected sys.stdout somewhere.

Cheers,
Dave.
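
The rule Dave describes (scan only the first 128k and require at least one multi-byte UTF-8 character) can be approximated outside Notepad++. The following is a minimal sketch under that description, not Notepad++'s actual detection code; the function name `looks_like_utf8_without_bom` and the `SCAN_LIMIT` constant are made up for illustration.

```python
import codecs

SCAN_LIMIT = 128 * 1024  # the "first 128k" mentioned above


def looks_like_utf8_without_bom(data, limit=SCAN_LIMIT):
    """Return True if the scanned prefix has no BOM, is valid UTF-8,
    and contains at least one non-ASCII (multi-byte) character."""
    chunk = data[:limit]
    if chunk.startswith(codecs.BOM_UTF8):
        return False  # has a BOM, so it is not "without BOM"
    if all(byte < 0x80 for byte in bytearray(chunk)):
        return False  # pure ASCII: nothing to base a UTF-8 guess on
    decoder = codecs.getincrementaldecoder("utf-8")()
    is_whole_file = len(data) <= limit
    try:
        # final=False tolerates a multi-byte sequence cut off at the scan limit
        decoder.decode(chunk, is_whole_file)
    except UnicodeDecodeError:
        return False
    return True
```

Called on the bytes of a file opened in `"rb"` mode, this returns False for plain ASCII and for Latin-1 text with accented characters, which is why such files cannot come back as a "guessed" UTF-8 under this rule.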

mrpaul1 - 2010-12-15

Just the confirmation that I needed. Thank you for the prompt reply and for a great tool, Dave.


mrpaul1 - 2011-04-18

I have an additional question on this topic. How would I go about detecting files that do not have an encoding option selected in the menu? For example, we have certain files that, when loaded into Notepad++, do not have an encoding option selected in the menu, but printing out notepad.getEncoding() in Python displays BUFFERENCODING.COOKIE. Files that are UTF-8 without a BOM also display BUFFERENCODING.COOKIE, but we need to differentiate the two. We're trying to automate this because we have thousands of files. Any idea?

Thank you.


mrpaul1 - 2011-04-18

Addendum to my previous post: I noticed that the status bar in Notepad++ is detecting these files as ISO-8859-1 (bottom right), but the "Encoding" menu does not have anything selected. Is there any way to detect this encoding using Python Script?


Dave Brotherstone - 2011-04-19

(I did see the post - Sourceforge's email notification system does work 99% of the time :)

Short answer : No - there's no API in Notepad++ to get the text on the status bar, and as you've discovered, it only reports ENC8BIT or COOKIE for ANSI or UTF8 w/o BOM files, and still reports COOKIE for ISO-8859-1.

Long(er) answer: That, to my view, is a bug in N++; however, messing with the encoding code is not something I fancy getting into! However, getting the status bar text has been requested more than once, so I've made a patch that would enable that from N++. I'll try and test it later, and if it works OK, I'll submit it. Patches sometimes sit there forever, so I'll post a note on the Notepad-Plus Open Discussion forum and here, but then I'll leave it to you to "market it". Once it's in N++, adding it to Python Script is a 10 minute job.

Alternative answer: N++ encoding detection is sketchy at best, so you might want to look at more specialized tools for encoding conversion/detection (Kaboom has been mentioned several times, although I have not used it myself).
If you know Java or C, you might also want to look at the ICU (International Components for Unicode) library. It supports every encoding under the sun and can happily convert between them. And its encoding detection is the best there is.

You might also be able to do this quite easily in Python, looking at the file itself - this stackoverflow question has a few good suggestions - http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file

Hope that helps,
Dave.
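
As a small illustration of "looking at the file itself": a byte order mark, when present, can be read straight from the first few bytes. This is only a sketch of BOM sniffing (it says nothing about files without a BOM), and the names `detect_bom` and `_BOMS` are hypothetical, not part of Python Script or Notepad++.

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def detect_bom(data):
    """Return the name of the encoding announced by a leading BOM, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

Reading the first four bytes of a file in `"rb"` mode is enough input for `detect_bom`.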


mrpaul1 - 2011-04-19

Thanks again for your prompt reply Dave. I ended up finding a solution using an encoding algorithm that I found called the "Universal Encoding Detector", which was also written in python: http://chardet.feedparser.org/

Basically, I am relying on Notepad++ to tell me which files are being detected as UTF-8 without a BOM as these were the ones causing issues for us in the first place. The problem, as you stated, is that when Notepad++ cannot detect the encoding (thus not selecting a menu item), notepad.getEncoding() returns COOKIE, which is the same result for files that are being detected as UTF-8 without a BOM. Therefore, I couldn't differentiate the two… until now.

For the files that Notepad++ couldn't detect, I noticed that the status bar was showing an encoding of ISO-8859-1. Here's the problem with those files: for some reason, when you load these files into Notepad++, it "hides" some of the stranger characters like the angled apostrophe, longer dashes or angled double quotes. When my algorithm tried converting them to UTF-8 automatically, these characters were lost forever (and they show up as spaces in the browser).

What I noticed is that for Notepad++ to "unhide" these characters before calling the Convert to UTF-8, I have to call the "Encode in ANSI" menu option first. Then when you convert to UTF-8, all is well. I couldn't do this for every situation though because files that were actually detected as UTF-8 without a BOM, would get messed up when selecting Encoding in ANSI first. The Universal Encoding Detector algorithm seems to be able to differentiate these types of files, so I was able to easily integrate it to call "Encode in ANSI" first, before calling "Convert to UTF-8". What's neat about this library is that it outputs a confidence value on how sure it is on the encoding.
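
One plausible explanation for the "hidden" characters, which the thread itself does not confirm: curly quotes and long dashes sit at bytes 0x80 to 0x9F in Windows-1252 (what Notepad++ calls ANSI on a Western European system), while ISO-8859-1 maps those same bytes to invisible C1 control characters. The sketch below shows the difference in plain Python; the sample bytes and the helper name `c1_control_count` are illustrative assumptions.

```python
def c1_control_count(text):
    """Count characters in the C1 control range U+0080..U+009F."""
    return sum(1 for ch in text if u"\u0080" <= ch <= u"\u009f")


# Bytes as a Windows-1252 editor would save: It's here - "quoted"
# (with a curly apostrophe, an em dash and curly double quotes)
raw = b"It\x92s here \x97 \x93quoted\x94"

as_latin1 = raw.decode("iso-8859-1")  # 0x92, 0x97, 0x93, 0x94 become control characters
as_cp1252 = raw.decode("cp1252")      # the same bytes become the intended punctuation

print(c1_control_count(as_latin1))
print(c1_control_count(as_cp1252))
```

A non-zero count after decoding as ISO-8859-1 is a hint that the file was really written as Windows-1252, which would match the observation that "Encode in ANSI" has to come first.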

Another thing worth mentioning is that I cannot rely on the Universal Encoding Detector alone because it doesn't seem to be able to differentiate between files that are "UTF-8" and "UTF-8 without a BOM", which also causes problems for us. It reports them all as UTF-8. So using a combination of both scripts, I seem to have a solid algorithm that can detect files that are causing problems for us, and converts them all to UTF-8. I will post my algorithm in a few days after some additional testing, for anyone having similar problems!
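
Since the end goal in this thread is a BOM on every UTF-8 file, the "with or without BOM" question can also be settled on the raw bytes, independently of what the detector reports. A minimal sketch, assuming the input is already valid UTF-8; `ensure_utf8_bom` is a hypothetical helper name.

```python
import codecs


def ensure_utf8_bom(data):
    """Return data with a UTF-8 BOM prepended, unless one is already there."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    # Refuse to label non-UTF-8 bytes as UTF-8: this raises UnicodeDecodeError
    data.decode("utf-8")
    return codecs.BOM_UTF8 + data
```

Files in other encodings (for example the ISO-8859-1 cases above) would still need to be transcoded first; this helper deliberately raises for them instead of guessing.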


As promised, here is my script for anyone having similar issues.

A couple of notes:
- The .py script is saved under ~Notepad++\plugins\PythonScript\scripts wherever Notepad++ is installed on your machine
- I had to run Notepad++ under an Administrator account (on Vista anyways)
- To see the script in action, make sure to select the following menu option: Plugins > Python Script > Show Console
- You'll have to install the Universal Encoding Detector for this to work properly: http://chardet.feedparser.org/
- This script scans the directory for html files, detects those that could not be identified by the Encoding menu or were identified as "UTF-8 without BOM", and converts them to UTF-8. These were the files that were causing problems for us. Files properly detected by Notepad++ as ANSI were converted through another mechanism, so this script doesn't handle them (but it can easily be modified to do so)

Here you go:

import os;
import sys;
import re;
import chardet;
### User Defined Variables ###
filePathSrc='C:\\Path\\ToScan'
logFile = open("C:\\EncodingFix.log", "w")
foundCount = 1
encodingMap = { BUFFERENCODING.COOKIE : 'UTF-8 without BOM', BUFFERENCODING.ENC8BIT : 'ANSI', BUFFERENCODING.UTF8: 'UTF-8' }
textToWrite = "Starting Script...\n"
console.write(textToWrite)
logFile.write(textToWrite)
for root, subFolders, files in os.walk(filePathSrc): # searches file path recursively
    textToWrite = "Scanning: " +root + "\n"
    console.write(textToWrite)
    logFile.write(textToWrite)
    for file in files:
        filePath = os.path.join(root,file)
        # only do this for html files
        if file[-4:].lower() == '.htm' or file[-5:].lower() == '.html':
            notepad.open(filePath.decode(sys.getfilesystemencoding()).encode('utf8'))           
            # BUFFERENCODING.COOKIE is returned for files that are "UTF-8 without BOM" or no Encoding menu option selected
            if (notepad.getEncoding() == BUFFERENCODING.COOKIE):
                # use the Universal Encoding Detector (http://chardet.feedparser.org)
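                # read raw bytes ("rb" avoids text-mode translation on Windows) and close the file afterwards
                with open(filePath, "rb") as rawFile:
                    rawdata = rawFile.read()
                UED_Result = chardet.detect(rawdata)
                # chardet reports None as the encoding when it cannot decide (e.g. empty files)
                UED_Result_Encoding = UED_Result.get("encoding") or ""
                UED_Result_Confidence = UED_Result.get("confidence")
                if UED_Result_Encoding.startswith("ISO-8859") or UED_Result_Encoding.startswith("ascii"):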
                    textToWrite = "%d: %s %f %s" % (foundCount, "Chardet Detection   -> " +filePath +": " + UED_Result_Encoding + " [ Confidence:",
                    UED_Result_Confidence, "]\n")
                    console.write(textToWrite)
                    logFile.write(textToWrite)
                    notepad.runMenuCommand("Encoding", "Encode in ANSI") #IMPORTANT: preserve certain chars (Notepad++ seems to hide them)
                else: # Notepad++ detected this file as UTF-8 without a BOM
                    textToWrite = "%d: %s" % (foundCount, "Notepad++ Detection -> " +filePath +": " +encodingMap.get(notepad.getEncoding(),'UNKNOWN') + "\n")
                    console.write(textToWrite)
                    logFile.write(textToWrite)                      
                notepad.runMenuCommand("Encoding", "Convert to UTF-8")                  
                notepad.save()
                foundCount += 1             
            notepad.close()
textToWrite = "Program Completed successfully!\n"
console.write(textToWrite)
logFile.write(textToWrite)
logFile.close()

mrpaul1 - 2011-04-28

^^^When you try to copy the code above and paste it, the line breaks get removed (which is annoying). Select the code in firefox, right click and choose "View Selected Source" and you'll be able to copy/paste while preserving line breaks.


André Lieske - 2016-02-16

Hello experts,
my English is unfortunately very poor, so I will try in German.
I would like to convert the contents of a directory (c:stage) from ANSI to UTF-8
and have the following code:

import os;
import sys;
filePathSrc="C:\\stage\\"
for root, dirs, files in os.walk(filePathSrc):
    for fn in files:
        if fn[-5:] == '.html':
        notepad.open(root + "\\" + fn)
        notepad.runMenuCommand("Encoding", "Encode in ANSI")
        notepad.runMenuCommand("Encoding", "Convert to UTF-8 without BOM")
        notepad.save()
        notepad.close()

Unfortunately, nothing happens when I run it.
What could be the reason?
Notepad++ v6.8.8

Many thanks in advance
Regards, André

Last edit: André Lieske 2016-02-16

Hello,
I hope my German is good enough.
In Python you have to pay attention to tabs and spaces.
Your syntax is wrong here, because the lines that start with notepad have to be indented one level further, under the if,
like this:

import os;
import sys;
filePathSrc="C:\\stage\\" 
for root, dirs, files in os.walk(filePathSrc):
    for fn in files:
        if fn[-5:] == '.html': 
            notepad.open(root + "\\" + fn)
            notepad.runMenuCommand("Encoding", "Encode in ANSI")
            notepad.runMenuCommand("Encoding", "Convert to UTF-8 without BOM")
            notepad.save()
            notepad.close()

Regards,
Claudia

André Lieske - 2016-02-16

Hello Claudia,
your German is very good.
The script now runs through, but the files in the folder are still encoded in ISO 8859-1.
What am I doing wrong?

Many thanks in advance
Regards, André

Last edit: André Lieske 2016-02-16


CFrank - 2016-02-17

Hello André,
in Notepad++ 6.8.8 it has to be

notepad.runMenuCommand("Encoding", "Convert to UTF-8")

and not

notepad.runMenuCommand("Encoding", "Convert to UTF-8 without BOM")

Another possibility: if you use the German language in Notepad++,
then you have to use the German menu names.

Cheers
Claudia

Last edit: CFrank 2016-02-17


Hello Claudia,
many thanks, that was it.

import os;
import sys;
filePathSrc="C:\\stage\\" 
for root, dirs, files in os.walk(filePathSrc):
    for fn in files:
        if fn[-4:] == '.php': 
            notepad.open(root + "\\" + fn)
            notepad.runMenuCommand("Kodierung", "Konvertiere zu ANSI")
            notepad.runMenuCommand("Kodierung", "Konvertiere zu UTF-8")
            notepad.save()
            notepad.close()

Is it also possible to change the content in the same script?
from:

<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" >

to:

<meta http-equiv="content-type" content="text/html; charset=UTF-8" >

Many thanks in advance
Regards, André

CFrank - 2016-02-17

Hello André,

yes, insert the following line before notepad.save():

editor.replace('iso-8859-1', 'UTF-8')

Regards,
Claudia
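
For comparison, the same substitution can be written in plain Python so that it only touches a `charset=` declaration rather than every occurrence of the string. This is a sketch outside Notepad++ (it does not use the `editor` object), and `declare_utf8` is a hypothetical helper name.

```python
import re

# Match "charset=iso-8859-1" in any letter case, allowing spaces around "="
_META_CHARSET = re.compile(r"(charset\s*=\s*)iso-8859-1", re.IGNORECASE)


def declare_utf8(html):
    """Rewrite an ISO-8859-1 charset declaration to UTF-8, keeping the rest as-is."""
    return _META_CHARSET.sub(r"\1UTF-8", html)
```

Text that merely mentions iso-8859-1 without a preceding `charset=` is left unchanged.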

Hello Claudia,
Python does not like this code;
I also tried double quotes.

import os;
import sys;
filePathSrc="C:\\stage\\" 
for root, dirs, files in os.walk(filePathSrc):
    for fn in files:
        if fn[-4:] == '.php': 
            notepad.open(root + "\\" + fn)
            notepad.runMenuCommand("Kodierung", "Konvertiere zu ANSI")
            notepad.runMenuCommand("Kodierung", "Konvertiere zu UTF-8")
            editor.replace('iso-8859-1', 'UTF-8')
            notepad.save()
            notepad.close()

Do you have another tip?
Many thanks in advance
Regards, André

CFrank - 2016-02-17

Hello André,
which error do you get?
Open the Python console (Plugins->PythonScript->Show Console) and enter the
statement directly. Does that work?
Of course, a document that contains both texts has to be open.

Regards,
Claudia


Hello Claudia,
I did that; there is no error message

Python 2.7.6-notepad++ r2 (default, Apr 21 2014, 19:26:54) [MSC v.1600 32 bit (Intel)]
Initialisation took 47ms
Ready.
>>> editor.replace('iso-8859-1', 'UTF-8')

When I put the code into the script,

import os;
import sys;
filePathSrc="C:\\Users\\Andre\\Documents\\SmartStore.biz Projekte\\SM6\\Lieske Andre\\Stage\\" 
for root, dirs, files in os.walk(filePathSrc):
    for fn in files:
        if fn[-4:] == '.php': 
            notepad.open(root + "\\" + fn)
            notepad.runMenuCommand("Kodierung", "Konvertiere zu ANSI")
            notepad.runMenuCommand("Kodierung", "Konvertiere zu UTF-8")
            editor.replace('iso-8859-1', 'UTF-8')
            notepad.save()
            notepad.close()

I get the following error

Python 2.7.6-notepad++ r2 (default, Apr 21 2014, 19:26:54) [MSC v.1600 32 bit (Intel)]
Initialisation took 47ms
Ready.
>>> editor.replace('iso-8859-1', 'UTF-8')
  File "C:\Users\Andre\AppData\Roaming\Notepad++\plugins\Config\PythonScript\scripts\Convert Stage Ordner.py", line 10
    editor.replace('iso-8859-1', 'UTF-8')
    ^
IndentationError: unexpected indent

Regards, André

Last edit: André Lieske 2016-02-17

CFrank - 2016-02-17

Hello André,

then I suspect that you have a mix of tabs and spaces, which is not allowed.
Enable "Show All Characters" (the reversed P); then you should see whether you
have tabs. If everything is spaces, then the count is slightly off, e.g. the previous
line has 8 spaces and the next one only 7 or so.

When you work with Python, you should enable the "Replace by space" checkbox
under Settings->Preferences->Tab Settings.

Regards,
Claudia

Last edit: CFrank 2016-02-17
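
The visual check described above can also be scripted. The following sketch groups indented lines by whether their leading whitespace uses tabs, spaces, or both; `indentation_report` is a hypothetical helper, and it only looks at whitespace, so it will not catch an off-by-one space count.

```python
def indentation_report(source):
    """Return 1-based line numbers grouped by the kind of leading whitespace."""
    report = {"tabs": [], "spaces": [], "mixed": []}
    for number, line in enumerate(source.splitlines(), 1):
        indent = line[:len(line) - len(line.lstrip(" \t"))]
        if not indent or not line.strip():
            continue  # skip unindented and blank lines
        if " " in indent and "\t" in indent:
            report["mixed"].append(number)
        elif "\t" in indent:
            report["tabs"].append(number)
        else:
            report["spaces"].append(number)
    return report
```

If both the "tabs" and the "spaces" lists are non-empty for one script, or "mixed" has any entry, the file is a candidate for the IndentationError shown earlier.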


André Lieske - 2016-02-17

Hello Claudia,
you are a treasure.
Many, many THANKS
Regards, André


André Lieske - 2016-02-20

Hello Claudia,
I have one more problem after all.
I would like to change the document header:

editor.replace("<!DOCTYPE html>", "<?php header('Content-Type: text/html;charset=UTF-8');?><!DOCTYPE html>")

The result looks like this:
the opening parenthesis at header and the closing one at the end are missing.
Result:

<?php header'Content-Type: text/html;charset=UTF-8';?><!DOCTYPE html>

Do you have a tip for me?
Regards, André

Dave Brotherstone - 2016-02-20

You have to escape the parentheses, like this:

editor.replace("<!DOCTYPE html>", "<?php header\\('Content-Type: text/html;charset=UTF-8'\\);?><!DOCTYPE html>")

There are two '\' because Python interprets them as well. I think you could also do it with a "raw string":

editor.replace("<!DOCTYPE html>", r"<?php header\('Content-Type: text/html;charset=UTF-8'\);?><!DOCTYPE html>")

Parentheses have a special meaning in Notepad++ (they can be used to pick up groups from the search string, and so on), so they always have to be escaped.
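
To see that the two spellings above hand exactly the same characters to the replace call, the literals can be compared in plain Python. This only demonstrates Python's string escaping and makes no claim about how Notepad++ processes the text afterwards; the variable names are illustrative.

```python
# A doubled backslash in a normal literal and a single backslash in a raw
# literal both produce one backslash character in the resulting string.
escaped = "<?php header\\('Content-Type: text/html;charset=UTF-8'\\);?><!DOCTYPE html>"
raw = r"<?php header\('Content-Type: text/html;charset=UTF-8'\);?><!DOCTYPE html>"

same = (escaped == raw)
backslashes = raw.count("\\")

print(same)
print(raw)
```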

André Lieske - 2016-02-20

Hello Claudia,
many thanks.
I had problems with my shopping cart; it was always empty.

editor.replace('iso-8859-1', 'UTF-8')

I have now entered "UTF-8 ohne BOM" (UTF-8 without BOM),
and now my shopping cart works as well.

Regards, André

André Lieske - 2016-02-20

I GOT IT WORKING

Hello Claudia,
how do I get this replace to work?

#editor.replace("Ihr Warenkorb enth\u00e4lt keine Eintr\u00e4ge", r"Ihr Warenkorb enthält keine Einträge")

This is the original code as it appears in a file:

Ihr Warenkorb enth\u00e4lt keine Eintr\u00e4g

Many thanks in advance
Regards, André

Last edit: André Lieske 2016-02-20
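
André did not post how he solved this. One possible approach, sketched here in plain Python rather than through the `editor` object, is to turn every literal `\uXXXX` sequence in the text into the character it names. `unescape_u` is a hypothetical helper; it handles four-digit escapes only and does not combine surrogate pairs.

```python
import re

# A literal backslash, the letter u, then exactly four hex digits
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_u(text):
    """Replace literal \\uXXXX sequences with the corresponding character."""
    # Python 3; on Python 2.7 use unichr instead of chr
    return _U_ESCAPE.sub(lambda match: chr(int(match.group(1), 16)), text)
```

Applied to the whole file content, this converts all such escapes in one pass instead of one replace call per phrase.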