I have a large file structure that I would like to read in using python to detect the file encoding the same way that Notepad++ automatically selects the encoding from its menu. Mainly, we have a bunch of files that are detected as "UTF8 without BOM" and we would like to convert them to UTF-8 (thus adding the BOM), but we need to find where those files reside. If we manually open each file in Notepad++ and check the Encoding menu, the selection tells us which encoding is detected but we are trying to automate this.
Using the following code, I can use the Notepad++ Python Script to convert each file using the Menu Option "Convert to UTF-8":
    import os
    import sys

    filePathSrc = "C:\\FilePath"
    for fn in os.listdir(filePathSrc):
        if fn[-4:] == '.htm' or fn[-5:] == '.html':
            notepad.open(filePathSrc + "\\" + fn)
            notepad.runMenuCommand("Encoding", "Convert to UTF-8")
            notepad.save()
            notepad.close()
Now I am trying to print out the encoding of the open file, as detected by Notepad++. I have tried adding the line:
but that prints out COOKIE (for files that are actually detected as "UTF-8 without BOM") or ENC8BIT (for files that are actually detected as "ANSI"). I am also unsure if this will be consistent for each file.
Any idea how to print out the Encoding menu selection for each file using this plugin?
COOKIE refers to a "guessed" UTF8 encoding, ENC8BIT refers to ANSI. It will be consistent for each file, but, only as consistent as Notepad++ is at detecting the encoding. It only checks characters in the first 128k, and there has to be some UTF8 encoded (multi-byte) characters in there.
Your best bet is just to map the constants to what you want to say.
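A minimal sketch of such a mapping, in case it helps. The table and the function name `encoding_label` are made up for illustration; the only thing taken from this thread is which constant corresponds to which menu entry. It accepts either the constant or its printed name, on the assumption that the value prints as `COOKIE` or `BUFFERENCODING.COOKIE`.

```python
# Hypothetical lookup table: keys are the BUFFERENCODING member names
# mentioned in this thread, values are the Encoding menu labels.
ENCODING_LABELS = {
    "COOKIE": "UTF-8 without BOM",
    "ENC8BIT": "ANSI",
    "UTF8": "UTF-8",
}


def encoding_label(encoding):
    """Return a readable label for a BUFFERENCODING value or its name."""
    # Assumption: str() of the value looks like 'COOKIE' or
    # 'BUFFERENCODING.COOKIE'; keep only the part after the last dot.
    name = str(encoding).rsplit(".", 1)[-1]
    return ENCODING_LABELS.get(name, "UNKNOWN (" + name + ")")
```

Inside a Python Script this could be used as `console.write(encoding_label(notepad.getEncoding()) + "\n")`.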
I have an additional question on this topic. How would I go about detecting files that do not have an encoding option selected in the menu? For example, we have certain files that, when loaded into Notepad++, do not have an encoding option selected in the menu, but printing out notepad.getEncoding() in Python displays BUFFERENCODING.COOKIE. Files that are UTF-8 without a BOM also display BUFFERENCODING.COOKIE, but we need to differentiate the two. We're trying to automate this because we have thousands of files. Any idea?
Thank you.
Addendum to my previous post: I noticed that the status bar in Notepad++ is detecting these files as ISO-8859-1 (bottom right), but the "Encoding" menu command does not have anything selected. Is there any way to detect this encoding using python++?
(I did see the post - Sourceforge's email notification system does work 99% of the time :)
Short answer: No. There's no API in Notepad++ to get the text on the status bar, and as you've discovered, it only reports ENC8BIT or COOKIE for ANSI or UTF8 w/o BOM files, and still reports COOKIE for ISO-8859-1.
Long(er) answer: That, in my view, is a bug in N++; however, messing with the encoding code is not something I fancy getting into! Getting the status bar text has been requested more than once, though, so I've made a patch that would enable that from N++. I'll try and test it later, and if it works OK, I'll submit it. I'll post a note here. Patches sometimes sit there forever, so I'll post a note on the Notepad-Plus Open Discussion forum, but then I'll leave it to you to "market it". Once it's in N++, adding it to Python Script is a 10 minute job.
Alternative answer: N++ encoding detection is sketchy at best, so you might want to look at more specialized tools to do encoding conversion/detection (Kaboom has been mentioned several times, although I have not used it myself).
If you know Java or C, you might also want to look at the ICU (International Components for Unicode) library. That has support for every encoding under the sun, and can happily convert between them. And their encoding detection is the best there is.
Thanks again for your prompt reply Dave. I ended up finding a solution using an encoding algorithm that I found called the "Universal Encoding Detector", which was also written in python: http://chardet.feedparser.org/
Basically, I am relying on Notepad++ to tell me which files are being detected as UTF-8 without a BOM as these were the ones causing issues for us in the first place. The problem, as you stated, is that when Notepad++ cannot detect the encoding (thus not selecting a menu item), notepad.getEncoding() returns COOKIE, which is the same result for files that are being detected as UTF-8 without a BOM. Therefore, I couldn't differentiate the two… until now.
For the files that Notepad++ couldn't detect, I noticed that the status bar was showing an encoding of ISO-8859-1. Here's the problem with those files: for some reason, when you load these files into Notepad++, it "hides" some of the stranger characters like the angled apostrophe, longer dashes or angled double quotes. When my algorithm tried converting them to UTF-8 automatically, these characters were lost forever (and they show up as spaces in the browser).
What I noticed is that for Notepad++ to "unhide" these characters before calling the Convert to UTF-8, I have to call the "Encode in ANSI" menu option first. Then when you convert to UTF-8, all is well. I couldn't do this for every situation though because files that were actually detected as UTF-8 without a BOM, would get messed up when selecting Encoding in ANSI first. The Universal Encoding Detector algorithm seems to be able to differentiate these types of files, so I was able to easily integrate it to call "Encode in ANSI" first, before calling "Convert to UTF-8". What's neat about this library is that it outputs a confidence value on how sure it is on the encoding.
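To make that decision step concrete, here is a rough sketch. The function name `needs_ansi_first` and the `min_confidence` threshold are hypothetical and not part of chardet; the function only inspects a dict shaped like the one `chardet.detect()` returns, with its `encoding` and `confidence` keys.

```python
def needs_ansi_first(result, min_confidence=0.5):
    """Decide whether to run "Encode in ANSI" before "Convert to UTF-8".

    `result` is a dict shaped like the return value of chardet.detect(),
    e.g. {"encoding": "ISO-8859-2", "confidence": 0.87}.
    `min_confidence` is an assumed cut-off, not a chardet setting.
    """
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < min_confidence:
        return False  # detector was unsure: do not force ANSI
    name = encoding.lower()
    return name.startswith("iso-8859") or name == "ascii"
```

With chardet installed, the call would look something like `needs_ansi_first(chardet.detect(open(path, "rb").read()))`.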
Another thing worth mentioning is that I cannot rely on the Universal Encoding Detector alone because it doesn't seem to be able to differentiate between files that are "UTF-8" and "UTF-8 without a BOM", which also causes problems for us. It reports them all as UTF-8. So using a combination of both scripts, I seem to have a solid algorithm that can detect files that are causing problems for us, and converts them all to UTF-8. I will post my algorithm in a few days after some additional testing, for anyone having similar problems!
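For what it's worth, the BOM itself is easy to check without any detector, since it is just the three bytes EF BB BF at the start of the file. A small sketch using only the standard library (the function names are made up for this example):

```python
import codecs


def has_utf8_bom(data):
    """True if the byte string starts with the UTF-8 byte order mark."""
    return data.startswith(codecs.BOM_UTF8)


def file_has_utf8_bom(path):
    # Only the first three bytes of the file are needed.
    with open(path, "rb") as handle:
        return has_utf8_bom(handle.read(3))
```

This only answers "BOM or no BOM"; it says nothing about whether the rest of the file really is UTF-8.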
As promised, here is my script for anyone having similar issues.
A couple of notes:
- The .py script is saved under ~Notepad++\plugins\PythonScript\scripts, wherever Notepad++ is installed on your machine
- I had to run Notepad++ under an Administrator account (on Vista anyway)
- To see the script in action, make sure to select the following menu option: Plugins > Python Script > Show Console
- You'll have to install the Universal Encoding Detector for this to work properly: http://chardet.feedparser.org/
- This script will scan the directory for html files, detect those that could not be identified by the Encoding menu or were identified as "UTF-8 without BOM", and convert them to UTF-8. These were the files that were causing problems for us. Files properly detected by Notepad++ as ANSI were converted through another mechanism, so this script doesn't do that (but can easily be modified to handle these cases)
Here you go:
    import os
    import sys
    import re
    import chardet

    ### User Defined Variables ###
    filePathSrc = 'C:\\Path\\ToScan'
    logFile = open("C:\\EncodingFix.log", "w")

    foundCount = 1
    encodingMap = {
        BUFFERENCODING.COOKIE: 'UTF-8 without BOM',
        BUFFERENCODING.ENC8BIT: 'ANSI',
        BUFFERENCODING.UTF8: 'UTF-8'
    }

    textToWrite = "Starting Script...\n"
    console.write(textToWrite)
    logFile.write(textToWrite)

    for root, subFolders, files in os.walk(filePathSrc):  # searches file path recursively
        textToWrite = "Scanning: " + root + "\n"
        console.write(textToWrite)
        logFile.write(textToWrite)
        for file in files:
            filePath = os.path.join(root, file)
            # only do this for html files
            if file[-4:].lower() == '.htm' or file[-5:].lower() == '.html':
                notepad.open(filePath.decode(sys.getfilesystemencoding()).encode('utf8'))
                # BUFFERENCODING.COOKIE is returned for files that are "UTF-8 without BOM" or no Encoding menu option selected
                if (notepad.getEncoding() == BUFFERENCODING.COOKIE):
                    # use the Universal Encoding Detector (http://chardet.feedparser.org)
                    # read in binary mode so Windows does not alter the raw bytes
                    rawdata = open(filePath, "rb").read()
                    UED_Result = chardet.detect(rawdata)
                    # chardet may report no encoding at all; fall back to an empty string
                    UED_Result_Encoding = UED_Result.get("encoding") or ""
                    UED_Result_Confidence = UED_Result.get("confidence")
                    if UED_Result_Encoding.startswith("ISO-8859") or UED_Result_Encoding.startswith("ascii"):
                        textToWrite = "%d: %s%f%s" % (foundCount, "Chardet Detection -> " + filePath + ": " + UED_Result_Encoding + " [ Confidence:", UED_Result_Confidence, "]\n")
                        console.write(textToWrite)
                        logFile.write(textToWrite)
                        # IMPORTANT: preserve certain chars (Notepad++ seems to hide them)
                        notepad.runMenuCommand("Encoding", "Encode in ANSI")
                    else:
                        # Notepad++ detected this file as UTF-8 without a BOM
                        textToWrite = "%d: %s" % (foundCount, "Notepad++ Detection -> " + filePath + ": " + encodingMap.get(notepad.getEncoding(), 'UNKNOWN') + "\n")
                        console.write(textToWrite)
                        logFile.write(textToWrite)
                    notepad.runMenuCommand("Encoding", "Convert to UTF-8")
                    notepad.save()
                    foundCount += 1
                notepad.close()

    textToWrite = "Program Completed successfully!\n"
    console.write(textToWrite)
    logFile.write(textToWrite)
    logFile.close()
^^^When you try to copy the code above and paste it, the line breaks get removed (which is annoying). Select the code in firefox, right click and choose "View Selected Source" and you'll be able to copy/paste while preserving line breaks.
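As an aside, for files that are already valid UTF-8 and only lack the BOM, the "add the BOM" step could also be sketched outside Notepad++ in plain Python. This is not what the script above does (it drives the Notepad++ menu); `with_utf8_bom` is a hypothetical helper that works on the raw bytes of a file.

```python
import codecs


def with_utf8_bom(data):
    """Return `data` with a UTF-8 BOM prepended, if it is valid UTF-8 without one."""
    if data.startswith(codecs.BOM_UTF8):
        return data  # already has a BOM, nothing to do
    # Strict decode as a sanity check: raises UnicodeDecodeError when the
    # bytes are not valid UTF-8 (e.g. ISO-8859-1 text with accented letters).
    data.decode("utf-8")
    return codecs.BOM_UTF8 + data
```

A caller would read the file with `open(path, "rb")`, pass the bytes through this function, and write the result back in `"wb"` mode, skipping any file that raises `UnicodeDecodeError`.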
Hello experts,
unfortunately my English is very poor, so I'll try it in German.
I would like to convert the contents of a directory (c:stage) from ANSI to UTF-8.
I have the following code:

    import os
    import sys

    filePathSrc = "C:\\stage\\"
    for root, dirs, files in os.walk(filePathSrc):
        for fn in files:
            if fn[-5:] == '.html':
            notepad.open(root + "\\" + fn)
            notepad.runMenuCommand("Encoding", "Encode in ANSI")
            notepad.runMenuCommand("Encoding", "Convert to UTF-8 without BOM")
            notepad.save()
            notepad.close()

Unfortunately, nothing happens when I run it.
What could be the cause?
Notepad++ v6.8.8
Many thanks in advance,
Regards, André
Last edit: André Lieske 2016-02-16
Hello,
I hope my German is good enough.
In Python you have to pay attention to tabs and spaces.
Your syntax is wrong here, because the lines that start with notepad have to be indented further, under the if.
Like this:

    import os
    import sys

    filePathSrc = "C:\\stage\\"
    for root, dirs, files in os.walk(filePathSrc):
        for fn in files:
            if fn[-5:] == '.html':
                notepad.open(root + "\\" + fn)
                notepad.runMenuCommand("Encoding", "Encode in ANSI")
                notepad.runMenuCommand("Encoding", "Convert to UTF-8 without BOM")
                notepad.save()
                notepad.close()

Regards,
Claudia
Hello Claudia,
your German is very good.
The script now runs through, but the files in the folder are still encoded in ISO 8859-1.
What am I doing wrong?
Many thanks in advance,
Regards, André
Last edit: André Lieske 2016-02-16
    import os
    import sys

    filePathSrc = "C:\\stage\\"
    for root, dirs, files in os.walk(filePathSrc):
        for fn in files:
            if fn[-4:] == '.php':
                notepad.open(root + "\\" + fn)
                notepad.runMenuCommand("Kodierung", "Konvertiere zu ANSI")
                notepad.runMenuCommand("Kodierung", "Konvertiere zu UTF-8")
                notepad.save()
                notepad.close()

Is it also possible to change the content in the same script?
From:
Hello Claudia,
Python does not like this code;
I have also tried double quotes.

    import os
    import sys

    filePathSrc = "C:\\stage\\"
    for root, dirs, files in os.walk(filePathSrc):
        for fn in files:
            if fn[-4:] == '.php':
                notepad.open(root + "\\" + fn)
                notepad.runMenuCommand("Kodierung", "Konvertiere zu ANSI")
                notepad.runMenuCommand("Kodierung", "Konvertiere zu UTF-8")
                editor.replace('iso-8859-1', 'UTF-8')
                notepad.save()
                notepad.close()

Do you have another tip?
Many thanks in advance,
Regards, André
Hello André,
which error do you get?
Open the Python console (Plugins->PythonScript->Show Console) and enter the statement directly. Does that work?
Of course, a document that contains both texts has to be open.
Regards,
Claudia
Hello Claudia,
I did that, and no error message appears:

    Python 2.7.6-notepad++ r2 (default, Apr 21 2014, 19:26:54) [MSC v.1600 32 bit (Intel)]
    Initialisation took 47ms
    Ready.
    >>> editor.replace('iso-8859-1', 'UTF-8')

When I put the code into the script,

    import os
    import sys

    filePathSrc = "C:\\Users\\Andre\\Documents\\SmartStore.biz Projekte\\SM6\\Lieske Andre\\Stage\\"
    for root, dirs, files in os.walk(filePathSrc):
        for fn in files:
            if fn[-4:] == '.php':
                notepad.open(root + "\\" + fn)
                notepad.runMenuCommand("Kodierung", "Konvertiere zu ANSI")
                notepad.runMenuCommand("Kodierung", "Konvertiere zu UTF-8")
                editor.replace('iso-8859-1', 'UTF-8')
                notepad.save()
                notepad.close()
then I suspect that you have a mix of tabs and spaces, which is not allowed.
Enable "Show All Characters" (the reversed P); then you should see whether you have tabs. If everything is spaces, then the count is slightly off, e.g. the previous line has 8 spaces and the next one only 7 or so.
When you work with Python, you should enable the "Replace by space" checkbox(?) under Settings->Preferences->Tab Settings.
Regards,
Claudia
Last edit: CFrank 2016-02-17
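To spot this kind of problem without eyeballing the whitespace symbols, a small check along these lines might help. It is only a sketch; `indentation_report` is a made-up helper that lists which lines are indented with tabs, with spaces, or with both.

```python
def indentation_report(source):
    """Group the 1-based numbers of indented lines by indentation style."""
    report = {"tabs": [], "spaces": [], "mixed": []}
    for number, line in enumerate(source.splitlines(), 1):
        # Leading run of spaces and tabs on this line.
        indent = line[:len(line) - len(line.lstrip(" \t"))]
        if not indent or not line.strip():
            continue  # unindented or blank line
        if "\t" in indent and " " in indent:
            report["mixed"].append(number)
        elif "\t" in indent:
            report["tabs"].append(number)
        else:
            report["spaces"].append(number)
    return report
```

If both the "tabs" and "spaces" lists are non-empty, or "mixed" has entries, the script is a likely candidate for an indentation error.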
Yes, the constants for the enums were all generated from Notepad++'s internal enums - unfortunately they're not all sensibly named.
See the enum definition in the docs:
http://npppythonscript.sourceforge.net/docs/latest/enums.html?highlight=encoding#BUFFERENCODING
Depending on what you're trying to achieve, you might want to console.write() instead of "print", unless you've redirected sys.stdout somewhere.
Cheers,
Dave.
Just the confirmation that I needed. Thank you for the prompt reply and for a great tool, Dave.
You might also be able to do this quite easily in Python, looking at the file itself - this stackoverflow question has a few good suggestions - http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file
Hope that helps,
Dave.
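One of the simpler ideas in that direction, sketched here under the assumption that you only need to tell UTF-8 apart from "some other 8-bit encoding": check for the BOM, then try strict decoding. The function name `classify_bytes` and its labels are made up for this example, and it cannot say which 8-bit encoding (ISO-8859-1, Windows-1252, ...) a non-UTF-8 file uses.

```python
import codecs


def classify_bytes(data):
    """Rough classification of a byte string; returns a descriptive label."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8 with BOM"
    try:
        data.decode("ascii")
        return "ASCII"  # pure 7-bit: valid in ANSI and UTF-8 alike
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "UTF-8 without BOM"
    except UnicodeDecodeError:
        return "8-bit (not UTF-8)"
```

Unlike the Notepad++ check described above, this looks at the whole content that is passed in rather than only the first 128k.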
Hello Andre,
in Notepad++ 6.8.8 it has to be
notepad.runMenuCommand("Encoding", "Convert to UTF-8")
and not
notepad.runMenuCommand("Encoding", "Convert to UTF-8 without BOM")
Another possibility: if you use the German language in Notepad++, then you have to use the German terms (?).
Cheers
Claudia
Last edit: CFrank 2016-02-17
Hello Claudia,
many thanks, that was it.
Is it also possible to change the content in the same script?
From:
to
Many thanks in advance,
Regards, André
Hello Andre,
yes, insert the following line before the notepad.save()
Regards,
Claudia
Hello Claudia,
I did that, and no error message appears.
When I put the code into the script,
I get the following error
Regards, André
Last edit: André Lieske 2016-02-17
Hello Claudia,
you are a treasure.
Many, many THANKS
Regards, André
Hello Claudia,
I do have one more problem after all.
I would like to change the document header.
The result looks like this:
the parenthesis before the header and at the end is missing.
Result:
Do you have a tip for me?
Regards, André
You have to escape the parenthesis, like this:
There are two '\' because Python interprets them as well. I think you could also do it with a "raw string":
Parentheses have a special meaning in Notepad++: you can use them to capture groups of the search string and so on, which is why they always have to be escaped.
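The same effect can be shown with Python's own `re` module, as a sketch only: the regex engine Notepad++ uses for its searches is a different one and may not behave identically. The helper `replace_literal` and the sample string `call(arg)` are made up for illustration.

```python
import re


def replace_literal(text, old, new):
    """Replace every occurrence of `old`, treating it as plain text, not a regex."""
    # re.escape() adds the backslashes in front of '(' and ')' for us;
    # the lambda keeps `new` from being parsed for backslash sequences.
    return re.sub(re.escape(old), lambda match: new, text)


sample = "a call(arg) b"
# Unescaped, the parentheses form a group, so the literal "(" is never matched.
unescaped = re.sub("call(arg)", "X", sample)
# Escaped by hand: doubled backslashes in a normal string, single ones in a raw string.
doubled = re.sub("call\\(arg\\)", "X", sample)
raw = re.sub(r"call\(arg\)", "X", sample)
```

Here `unescaped` leaves the text unchanged, while `doubled`, `raw` and `replace_literal(sample, "call(arg)", "X")` all replace the call.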
Hello Claudia,
many thanks.
I had problems with my shopping cart; it was always empty.
I have now entered "UTF-8 ohne BOM" (UTF-8 without BOM),
and now my shopping cart works as well.
Regards, André
I GOT IT WORKING
Hello Claudia,
how do I get this replace to work?
This is the original code that is in a file
Many thanks in advance,
Regards, André
Last edit: André Lieske 2016-02-20