Read file encoding using python script

mrpaul1
2010-12-14
2018-10-09
1 2 3 > >> (Page 1 of 3)
  • mrpaul1

    mrpaul1 - 2010-12-14

    Hi,

    I have a large file structure that I would like to read in using python to detect the file encoding the same way that Notepad++ automatically selects the encoding from its menu. Mainly, we have a bunch of files that are detected as "UTF8 without BOM" and we would like to convert them to UTF-8 (thus adding the BOM), but we need to find where those files reside. If we manually open each file in Notepad++ and check the Encoding menu, the selection tells us which encoding is detected but we are trying to automate this.

    Using the following code, I can use the Notepad++ Python Script to convert each file using the Menu Option "Convert to UTF-8":

    import os
    filePathSrc="C:\\FilePath"
    for fn in os.listdir(filePathSrc):
        # only process .htm / .html files
        if fn.endswith(('.htm', '.html')):
            notepad.open(os.path.join(filePathSrc, fn))
            notepad.runMenuCommand("Encoding", "Convert to UTF-8")
            notepad.save()
            notepad.close()
    

    Now I am trying to print out the encoding of the open file, as detected by Notepad++. I have tried adding the line:

    print "fileName: " +fn +" :: encoding: " + str(notepad.getEncoding())
    

    but that prints out COOKIE (for files that are actually detected as "UTF-8 without BOM") or ENC8BIT (for files that are actually detected as "ANSI"). I am also unsure if this will be consistent for each file.

    Any idea how to print out the Encoding menu selection for each file using this plugin?

     
  • Dave Brotherstone

    Yes, the constants for the enums were all generated from Notepad++'s internal enums - unfortunately they're not all sensibly named. 
    See the enum definition in the docs:
    http://npppythonscript.sourceforge.net/docs/latest/enums.html?highlight=encoding#BUFFERENCODING

    COOKIE refers to a "guessed" UTF-8 encoding; ENC8BIT refers to ANSI.  It will be consistent for each file, but only as consistent as Notepad++ is at detecting the encoding.  It only checks characters in the first 128k, and there have to be some UTF-8 encoded (multi-byte) characters in there.

    Your best bet is just to map the constants to what you want to say.

    encodingMap = { BUFFERENCODING.COOKIE : 'UTF-8 without BOM', BUFFERENCODING.ENC8BIT : 'ANSI' }
    ... your code...
    print "fileName: " +fn +" :: encoding: " + encodingMap.get(notepad.getEncoding(), 'Unknown')
    

    Depending on what you're trying to achieve, you might want to console.write() instead of "print", unless you've redirected sys.stdout somewhere.
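For anyone who wants to reproduce that heuristic outside Notepad++, here is a minimal sketch in plain Python. It only approximates the behaviour described above (look at the first 128k, require at least one multi-byte UTF-8 sequence); it is not Notepad++'s actual code, and the function name and the exact byte limit are assumptions made for illustration.

```python
import codecs

SAMPLE_SIZE = 128 * 1024  # assumed to correspond to the "first 128k" mentioned above


def looks_like_utf8_without_bom(data):
    """Return True if the sample is valid UTF-8, has no BOM and is not plain ASCII."""
    sample = data[:SAMPLE_SIZE]
    if sample.startswith(codecs.BOM_UTF8):
        return False  # this would be "UTF-8" (with BOM), not the guessed variant
    decoder = codecs.getincrementaldecoder("utf-8")()
    # If the sample was cut at SAMPLE_SIZE, tolerate a multi-byte character
    # that was split at the boundary; otherwise require a complete sequence.
    is_whole_input = len(data) <= SAMPLE_SIZE
    try:
        text = decoder.decode(sample, final=is_whole_input)
    except UnicodeDecodeError:
        return False  # invalid UTF-8, so probably some 8-bit (ANSI) encoding
    # Pure ASCII decodes fine too, so insist on at least one non-ASCII character.
    return any(ord(ch) > 127 for ch in text)
```

Pass it the raw bytes of a file, read with open(path, "rb"). A True result roughly corresponds to the COOKIE case, with the caveat that Notepad++ may decide differently on edge cases.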

    Cheers,
    Dave.

     
  • mrpaul1

    mrpaul1 - 2010-12-15

    Just the confirmation that I needed. Thank you for the prompt reply and for a great tool, Dave.

     
  • mrpaul1

    mrpaul1 - 2011-04-18

    I have an additional question on this topic. How would I go about detecting files that do not have an encoding option selected in the menu? For example, we have certain files that, when loaded into Notepad++, do not have an encoding option selected in the menu, but printing out notepad.getEncoding() in python displays BUFFERENCODING.COOKIE. Files that are UTF-8 without a BOM also display BUFFERENCODING.COOKIE, but we need to differentiate the two. We're trying to automate this because we have thousands of files. Any idea?

    Thank you.

     
  • mrpaul1

    mrpaul1 - 2011-04-18

    Addendum to my previous post: I noticed that the status bar in Notepad++ is detecting these files as ISO-8859-1 (bottom right), but the "Encoding" menu does not have anything selected. Is there any way to detect this encoding using Python Script?

     
  • Dave Brotherstone

    (I did see the post - Sourceforge's email notification system does work 99% of the time :)

    Short answer : No - there's no API in Notepad++ to get the text on the status bar, and as you've discovered, it only reports ENC8BIT or COOKIE for ANSI or UTF8 w/o BOM files, and still reports COOKIE for ISO-8859-1.

    Long(er) answer:  That, to my view, is a bug in N++; however, messing with the encoding code is not something I fancy getting into! However, getting the status bar text has been requested more than once, so I've made a patch that would enable that from N++.  I'll try and test it later, and if it works OK, I'll submit it.  I'll post a note here.  Patches sometimes sit there forever, so I'll post a note on the Notepad-Plus Open Discussion forum, but then I'll leave it to you to "market it".  Once it's in N++, adding it to Python Script is a 10 minute job.

    Alternative answer:  N++ encoding detection is sketchy at best, so you might want to look at more specialized tools for encoding conversion/detection (Kaboom has been mentioned several times, although I have not used it myself). 
    If you know Java or C, you might also want to look at the ICU (International Components for Unicode) library.  That has support for every encoding under the sun, and can happily convert between them. And its encoding detection is the best there is.

    You might also be able to do this quite easily in Python, looking at the file itself - this stackoverflow question has a few good suggestions - http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file
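As a starting point for the "look at the file itself" approach, here is a small sketch that only checks for a byte order mark. It cannot guess the encoding of files without a BOM; it merely separates "UTF-8" from "UTF-8 without BOM" candidates. The function names are made up for this example.

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM starts with the same bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def bom_name(head):
    """Return the encoding announced by a BOM at the start of `head`, or None."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


def detect_bom(path):
    """Read the first bytes of the file at `path` and report its BOM, if any."""
    with open(path, "rb") as f:
        return bom_name(f.read(4))  # the longest BOM is four bytes
```

A file for which detect_bom() returns None is either plain ASCII, UTF-8 without a BOM, or some 8-bit encoding, so a second step (for example a detection library) is still needed to tell those apart.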

    Hope that helps,
    Dave.

     
  • mrpaul1

    mrpaul1 - 2011-04-19

    Thanks again for your prompt reply Dave. I ended up finding a solution using an encoding algorithm that I found called the "Universal Encoding Detector", which was also written in python: http://chardet.feedparser.org/

    Basically, I am relying on Notepad++ to tell me which files are being detected as UTF-8 without a BOM as these were the ones causing issues for us in the first place. The problem, as you stated, is that when Notepad++ cannot detect the encoding (thus not selecting a menu item), notepad.getEncoding() returns COOKIE, which is the same result for files that are being detected as UTF-8 without a BOM. Therefore, I couldn't differentiate the two… until now.

    For the files that Notepad++ couldn't detect, I noticed that the status bar was showing an encoding of ISO-8859-1. Here's the problem with those files: for some reason, when you load these files into Notepad++, it "hides" some of the stranger characters like the angled apostrophe, longer dashes or angled double quotes. When my algorithm tried converting them to UTF-8 automatically, these characters were lost forever (and they show up as spaces in the browser).

    What I noticed is that for Notepad++ to "unhide" these characters before calling "Convert to UTF-8", I have to call the "Encode in ANSI" menu option first. Then when you convert to UTF-8, all is well. I couldn't do this for every situation though, because files that were actually detected as UTF-8 without a BOM would get messed up when selecting "Encode in ANSI" first. The Universal Encoding Detector algorithm seems to be able to differentiate these types of files, so I was able to easily integrate it to call "Encode in ANSI" first, before calling "Convert to UTF-8". What's neat about this library is that it outputs a confidence value on how sure it is on the encoding.
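One possible explanation for the "hidden" characters (an assumption, not something confirmed in this thread) is that the files are really Windows-1252 ("ANSI" on Western European Windows) while being displayed as ISO-8859-1. The two encodings differ only in the byte range 0x80-0x9F, which is exactly where Windows-1252 keeps curly quotes and long dashes. The sample bytes below are invented for the illustration.

```python
import unicodedata

# Bytes as they might appear in such a file: curly quotes and an en dash in Windows-1252.
raw = b"\x93quoted\x94 \x96 it\x92s"

as_latin1 = raw.decode("iso-8859-1")
as_cp1252 = raw.decode("cp1252")

# ISO-8859-1 maps 0x80-0x9F to invisible C1 control characters ...
hidden = [ch for ch in as_latin1 if unicodedata.category(ch) == "Cc"]
# ... while Windows-1252 maps the same bytes to printable punctuation.

# Re-encoding the Windows-1252 text as UTF-8 with a BOM keeps the punctuation.
utf8_with_bom = as_cp1252.encode("utf-8-sig")
```

If this is indeed the cause, switching the buffer to ANSI first would make the editor interpret those bytes as punctuation again, which would match the behaviour described above.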

    Another thing worth mentioning is that I cannot rely on the Universal Encoding Detector alone because it doesn't seem to be able to differentiate between files that are "UTF-8" and "UTF-8 without a BOM", which also causes problems for us. It reports them all as UTF-8. So using a combination of both scripts, I seem to have a solid algorithm that can detect files that are causing problems for us, and converts them all to UTF-8. I will post my algorithm in a few days after some additional testing, for anyone having similar problems!
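The decision logic described in the paragraphs above can be summarised roughly as follows. This is only a sketch of the idea: the function name is hypothetical, and plain strings stand in for the BUFFERENCODING constants so that the snippet runs outside Notepad++.

```python
def plan_menu_commands(npp_encoding, chardet_encoding):
    """Return the Encoding menu commands to run for one file, in order.

    npp_encoding     -- name of the value from notepad.getEncoding(), e.g. "COOKIE"
    chardet_encoding -- the "encoding" value reported by chardet, or None
    """
    if npp_encoding != "COOKIE":
        return []  # ANSI or UTF-8 with BOM: not handled by this approach
    commands = []
    guess = (chardet_encoding or "").lower()
    if guess.startswith("iso-8859") or guess == "ascii":
        # Notepad++ could not really identify the file: switch to ANSI first
        commands.append("Encode in ANSI")
    commands.append("Convert to UTF-8")
    return commands
```

For example, plan_menu_commands("COOKIE", "ISO-8859-2") yields both commands, while a file that chardet reports as "utf-8" only gets "Convert to UTF-8".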

     
  • mrpaul1

    mrpaul1 - 2011-04-28

    As promised, here is my script for anyone having similar issues.

    A couple of notes:
    - The .py script is saved under ~Notepad++\plugins\PythonScript\scripts wherever Notepad++ is installed on your machine
    - I had to run Notepad++ under an Administrator account (on Vista anyways)
    - To see the script in action, make sure to select the following menu option: Plugins > Python Script > Show Console
    - You'll have to install the Universal Encoding Detector for this to work properly: http://chardet.feedparser.org/
    - This script will scan the directory for html files, detect those that could not be identified by the Encoding menu or were identified as "UTF-8 without BOM", and convert them to UTF-8. These were the files that were causing problems for us. Files properly detected by Notepad++ as ANSI were converted through another mechanism, so this script doesn't do that (but it can easily be modified to handle these cases)

    Here you go:

    import os
    import sys
    import chardet
    ### User Defined Variables ###
    filePathSrc='C:\\Path\\ToScan'
    logFile = open("C:\\EncodingFix.log", "w")
    foundCount = 1
    encodingMap = { BUFFERENCODING.COOKIE : 'UTF-8 without BOM', BUFFERENCODING.ENC8BIT : 'ANSI', BUFFERENCODING.UTF8: 'UTF-8' }
    textToWrite = "Starting Script...\n"
    console.write(textToWrite)
    logFile.write(textToWrite)
    for root, subFolders, files in os.walk(filePathSrc): # searches file path recursively
        textToWrite = "Scanning: " +root + "\n"
        console.write(textToWrite)
        logFile.write(textToWrite)
        for file in files:
            filePath = os.path.join(root,file)
            # only do this for html files
            if file[-4:].lower() == '.htm' or file[-5:].lower() == '.html':
                notepad.open(filePath.decode(sys.getfilesystemencoding()).encode('utf8'))           
                # BUFFERENCODING.COOKIE is returned for files that are "UTF-8 without BOM" or no Encoding menu option selected
                if (notepad.getEncoding() == BUFFERENCODING.COOKIE):
                    # use the Universal Encoding Detector (http://chardet.feedparser.org)
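                    # read in binary mode so that chardet sees the raw bytes
                    with open(filePath, "rb") as f:
                        rawdata = f.read()
                    UED_Result = chardet.detect(rawdata)
                    # chardet may report None as the encoding; fall back to "" so startswith() is safe
                    UED_Result_Encoding = UED_Result.get("encoding") or ""
                    UED_Result_Confidence = UED_Result.get("confidence")
                    if UED_Result_Encoding.startswith("ISO-8859") or UED_Result_Encoding.startswith("ascii"):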
                        textToWrite = "%d: %s %f %s" % (foundCount, "Chardet Detection   -> " +filePath +": " + UED_Result_Encoding + " [ Confidence:",
                        UED_Result_Confidence, "]\n")
                        console.write(textToWrite)
                        logFile.write(textToWrite)
                        notepad.runMenuCommand("Encoding", "Encode in ANSI") #IMPORTANT: preserve certain chars (Notepad++ seems to hide them)
                    else: # Notepad++ detected this file as UTF-8 without a BOM
                        textToWrite = "%d: %s" % (foundCount, "Notepad++ Detection -> " +filePath +": " +encodingMap.get(notepad.getEncoding(),'UNKNOWN') + "\n")
                        console.write(textToWrite)
                        logFile.write(textToWrite)                      
                    notepad.runMenuCommand("Encoding", "Convert to UTF-8")                  
                    notepad.save()
                    foundCount += 1             
                notepad.close()
    textToWrite = "Program Completed successfully!\n"
    console.write(textToWrite)
    logFile.write(textToWrite)
    logFile.close()
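
For comparison, the conversion itself can also be sketched without Notepad++, working directly on the file bytes. This is not the method used in the script above, just an alternative under the assumption that every file which is not valid UTF-8 is Windows-1252; the function name and the fallback parameter are invented for the example.

```python
import codecs


def to_utf8_with_bom(raw, fallback="cp1252"):
    """Return the bytes of `raw` re-encoded as UTF-8 with a BOM.

    `fallback` is the encoding assumed for input that is not valid UTF-8;
    cp1252 is only a guess that suits typical Western European Windows files.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return raw  # already UTF-8 with BOM, nothing to do
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" because cp1252 leaves a few byte values undefined
        text = raw.decode(fallback, "replace")
    return codecs.BOM_UTF8 + text.encode("utf-8")
```

To apply it, read a file with open(path, "rb"), pass the bytes through the function and write the result back with open(path, "wb"). Work on a backup copy first, because a wrong fallback encoding silently produces wrong characters.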
    
     
  • mrpaul1

    mrpaul1 - 2011-04-28

    ^^^When you try to copy the code above and paste it, the line breaks get removed (which is annoying). Select the code in Firefox, right-click and choose "View Selected Source" and you'll be able to copy/paste while preserving line breaks.

     
  • André Lieske

    André Lieske - 2016-02-16

    Hello experts,
    unfortunately my English is very poor, so I am writing in German.
    I would like to convert the contents of a directory (c:\stage) from ANSI to UTF-8.
    I have the following code:

    import os;
    import sys;
    filePathSrc="C:\\stage\\" 
    for root, dirs, files in os.walk(filePathSrc):
        for fn in files:
            if fn[-5:] == '.html': 
       notepad.open(root + "\\" + fn)
       notepad.runMenuCommand("Encoding", "Encode in ANSI")
       notepad.runMenuCommand("Encoding", "Convert to UTF-8 without BOM")
       notepad.save()
       notepad.close()
    

    Unfortunately, nothing happens when I run it.
    What could be the reason?
    Notepad++ v6.8.8

    Many thanks in advance
    Regards, André

     

    Last edit: André Lieske 2016-02-16
  • CFrank

    CFrank - 2016-02-16

    Hello,
    I hope my German is good enough.
    In Python you have to pay attention to tabs and spaces.
    Your syntax is wrong here, because the lines that start with notepad have to be
    indented one level further, under the if.
    Like this:

    import os;
    import sys;
    filePathSrc="C:\\stage\\" 
    for root, dirs, files in os.walk(filePathSrc):
        for fn in files:
            if fn[-5:] == '.html': 
                notepad.open(root + "\\" + fn)
                notepad.runMenuCommand("Encoding", "Encode in ANSI")
                notepad.runMenuCommand("Encoding", "Convert to UTF-8 without BOM")
                notepad.save()
                notepad.close()
    

    Regards,
    Claudia

     
  • André Lieske

    André Lieske - 2016-02-16

    Hello Claudia,
    your German is very good.
    The script now runs through, but the files in the folder are still encoded as ISO 8859-1.
    What am I doing wrong?

    Many thanks in advance
    Regards, André

     

    Last edit: André Lieske 2016-02-16
  • CFrank

    CFrank - 2016-02-17

    Hello Andre,
    in Notepad++ 6.8.8 it has to be

    notepad.runMenuCommand("Encoding", "Convert to UTF-8")

    and not

    notepad.runMenuCommand("Encoding", "Convert to UTF-8 without BOM")

    Another possibility: are you using the German language in Notepad++?
    If so, you have to use the German menu terms.
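
Because runMenuCommand matches the visible menu text, a script that has to run under more than one user-interface language could keep the labels in a small table. The sketch below is only an illustration: the German strings are the ones reported later in this thread for Notepad++ 6.8.8, other versions or translations may use different wording, and the helper name is hypothetical.

```python
# Menu labels per user-interface language (menu name, menu item).
MENU_LABELS = {
    "english": ("Encoding", "Convert to UTF-8"),
    "german": ("Kodierung", "Konvertiere zu UTF-8"),
}


def convert_to_utf8_command(ui_language):
    """Return the (menu, item) pair to pass to notepad.runMenuCommand()."""
    try:
        return MENU_LABELS[ui_language]
    except KeyError:
        raise ValueError("no menu labels known for %r" % (ui_language,))
```

Inside PythonScript the result could then be used as notepad.runMenuCommand(*convert_to_utf8_command("german")).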

    Cheers
    Claudia

     

    Last edit: CFrank 2016-02-17
  • André Lieske

    André Lieske - 2016-02-17

    Hello Claudia,
    many thanks, that was it.

    import os;
    import sys;
    filePathSrc="C:\\stage\\" 
    for root, dirs, files in os.walk(filePathSrc):
        for fn in files:
            if fn[-4:] == '.php': 
                notepad.open(root + "\\" + fn)
                notepad.runMenuCommand("Kodierung", "Konvertiere zu ANSI")
                notepad.runMenuCommand("Kodierung", "Konvertiere zu UTF-8")
                notepad.save()
                notepad.close()
    

    Is it also possible to change the content in the same script?
    From:

    <meta http-equiv="content-type" content="text/html; charset=iso-8859-1" >
    

    to

    <meta http-equiv="content-type" content="text/html; charset=UTF-8" >
    

    Many thanks in advance
    Regards, André

     
  • CFrank

    CFrank - 2016-02-17

    Hello Andre,

    yes, insert the following line before notepad.save():

    editor.replace('iso-8859-1', 'UTF-8')
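
Note that a plain string replacement like this changes every occurrence of 'iso-8859-1' in the document, not only the one in the meta tag. If that matters, a more targeted variant can be sketched with Python's re module, working on the text outside the editor. The function name and the pattern are assumptions made for this example and have not been tested against real-world HTML beyond the simple cases shown.

```python
import re

# Matches charset=<value> inside a meta tag; the value ends at a quote, space, ; or >.
_CHARSET = re.compile(r"""(<meta\b[^>]*\bcharset\s*=\s*["']?)([^"'\s;>]+)""", re.IGNORECASE)


def set_meta_charset(html, charset="UTF-8"):
    """Return html with the charset of every <meta ... charset=...> tag replaced."""
    return _CHARSET.sub(lambda m: m.group(1) + charset, html)
```

Text that merely mentions iso-8859-1 outside a meta tag is left untouched by this version.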
    

    Regards,
    Claudia

     
  • André Lieske

    André Lieske - 2016-02-17

    Hello Claudia,
    Python does not like this code;
    I have also tried double quotes.

    import os;
    import sys;
    filePathSrc="C:\\stage\\" 
    for root, dirs, files in os.walk(filePathSrc):
        for fn in files:
            if fn[-4:] == '.php': 
                notepad.open(root + "\\" + fn)
                notepad.runMenuCommand("Kodierung", "Konvertiere zu ANSI")
                notepad.runMenuCommand("Kodierung", "Konvertiere zu UTF-8")
                editor.replace('iso-8859-1', 'UTF-8')
                notepad.save()
                notepad.close()
    

    Do you have another tip?
    Many thanks in advance
    Regards, André

     
  • CFrank

    CFrank - 2016-02-17

    Hello André,
    which error do you get?
    Open the Python console (Plugins->PythonScript->Show Console) and enter the
    statement directly. Does that work?
    Of course, a document that contains the text to be replaced has to be open.

    Regards,
    Claudia

     
  • André Lieske

    André Lieske - 2016-02-17

    Hello Claudia,
    I did that; there is no error message:

    Python 2.7.6-notepad++ r2 (default, Apr 21 2014, 19:26:54) [MSC v.1600 32 bit (Intel)]
    Initialisation took 47ms
    Ready.
    >>> editor.replace('iso-8859-1', 'UTF-8')
    

    When I put the code into the script,

    import os;
    import sys;
    filePathSrc="C:\\Users\\Andre\\Documents\\SmartStore.biz Projekte\\SM6\\Lieske Andre\\Stage\\" 
    for root, dirs, files in os.walk(filePathSrc):
        for fn in files:
            if fn[-4:] == '.php': 
                notepad.open(root + "\\" + fn)
                notepad.runMenuCommand("Kodierung", "Konvertiere zu ANSI")
                notepad.runMenuCommand("Kodierung", "Konvertiere zu UTF-8")
                editor.replace('iso-8859-1', 'UTF-8')
                notepad.save()
                notepad.close()
    

    I get the following error:

    Python 2.7.6-notepad++ r2 (default, Apr 21 2014, 19:26:54) [MSC v.1600 32 bit (Intel)]
    Initialisation took 47ms
    Ready.
    >>> editor.replace('iso-8859-1', 'UTF-8')
      File "C:\Users\Andre\AppData\Roaming\Notepad++\plugins\Config\PythonScript\scripts\Convert Stage Ordner.py", line 10
        editor.replace('iso-8859-1', 'UTF-8')
        ^
    IndentationError: unexpected indent
    

    Regards, André

     

    Last edit: André Lieske 2016-02-17
  • CFrank

    CFrank - 2016-02-17

    Hello André,

    then I suspect that you have a mix of tabs and spaces, which is not allowed.
    Enable "Show All Characters" ("zeige alle Symbole", the pilcrow icon ¶); then you should see whether you
    have tabs. If everything is spaces, then the count is slightly off, e.g. the previous
    line has 8 spaces and the next one only 7 or so.

    When you work with Python, you should enable the "Replace by space" ("Durch Leerzeichen ersetzen")
    checkbox under Settings->Preferences->Tab Settings (Einstellungen->Optionen->Tabulatoren).

    Regards,
    Claudia
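
The check described above can also be automated. The following sketch scans a script's source text and reports lines whose indentation contains tabs or whose number of leading spaces is not a multiple of four. It assumes the script is meant to use four spaces per level, so it will produce false positives for tab-indented scripts or for continuation lines aligned with an opening bracket. The function name is made up for this example.

```python
def find_indentation_problems(source):
    """Return (line_number, reason) pairs for suspicious indentation in `source`."""
    problems = []
    for number, line in enumerate(source.splitlines(), 1):
        stripped = line.lstrip(" \t")
        if not stripped:
            continue  # blank lines cannot cause an IndentationError
        indent = line[:len(line) - len(stripped)]
        if "\t" in indent and " " in indent:
            problems.append((number, "mixes tabs and spaces"))
        elif "\t" in indent:
            problems.append((number, "uses tabs"))
        elif len(indent) % 4:
            problems.append((number, "indent of %d spaces is not a multiple of 4" % len(indent)))
    return problems
```

Python's standard library also ships the tabnanny module (python -m tabnanny script.py), which checks for ambiguous indentation in a similar spirit.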

     

    Last edit: CFrank 2016-02-17
  • André Lieske

    André Lieske - 2016-02-17

    Hello Claudia,
    you are a treasure.
    Many, many THANKS
    Regards, André

     
  • André Lieske

    André Lieske - 2016-02-20

    Hello Claudia,
    I have one more problem after all.
    I would like to change the document header:

    editor.replace("<!DOCTYPE html>", "<?php header('Content-Type: text/html;charset=UTF-8');?><!DOCTYPE html>")
    

    The result looks like this:
    the parenthesis before header and the one at the end are missing.
    Result:

    <?php header'Content-Type: text/html;charset=UTF-8';?><!DOCTYPE html>
    

    Do you have a tip for me?
    Regards, André

     
  • Dave Brotherstone

    You have to escape the parentheses, like this:

    editor.replace("<!DOCTYPE html>", "<?php header\\('Content-Type: text/html;charset=UTF-8'\\);?><!DOCTYPE html>")
    

    There are two '\' because Python interprets them as well. I think you could also do it with a "raw string":

    editor.replace("<!DOCTYPE html>", r"<?php header\('Content-Type: text/html;charset=UTF-8'\);?><!DOCTYPE html>")
    

    Parentheses have a special meaning in Notepad++: they can be used to capture groups from the search string and so on, which is why they always have to be escaped.
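
That the two spellings are equivalent on the Python side can be checked in any Python interpreter, without Notepad++. The variable names below exist only for this illustration.

```python
doubled = "<?php header\\('Content-Type: text/html;charset=UTF-8'\\);?><!DOCTYPE html>"
raw = r"<?php header\('Content-Type: text/html;charset=UTF-8'\);?><!DOCTYPE html>"

# Both literals produce the same string, with one backslash before each parenthesis.
same = (doubled == raw)
backslashes = raw.count("\\")
```

In both cases the string that reaches editor.replace contains a single backslash in front of each parenthesis; the raw string just saves the doubling.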

     
  • André Lieske

    André Lieske - 2016-02-20

    Hello Claudia,
    many thanks.
    I had problems with my shopping cart; it was always empty.

    editor.replace('iso-8859-1', 'UTF-8')
    

    I have now entered "UTF-8 ohne BOM" (UTF-8 without BOM) instead,
    and now my shopping cart works as well.

    Regards, André

     
  • André Lieske

    André Lieske - 2016-02-20

    I GOT IT WORKING

    Hello Claudia,
    how do I get this replace to work?

    #editor.replace("Ihr Warenkorb enth\u00e4lt keine Eintr\u00e4ge", r"Ihr Warenkorb enthält keine Einträge")
    

    This is the original code as it appears in a file:

    Ihr Warenkorb enth\u00e4lt keine Eintr\u00e4g
    

    Many thanks in advance
    Regards, André
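
The post does not say how the problem was solved. One way to turn such literal \uXXXX sequences into real characters outside the editor is sketched below. It is not taken from the thread, the function name is hypothetical, and it handles only four-digit escapes (no surrogate pairs).

```python
import re

_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

try:
    unichr  # Python 2, as used by the PythonScript plugin
except NameError:
    unichr = chr  # Python 3


def decode_u_escapes(text):
    """Replace every literal \\uXXXX sequence in `text` with the character it names."""
    return _ESCAPE.sub(lambda m: unichr(int(m.group(1), 16)), text)
```

Applied to the line quoted above, the two \u00e4 sequences become the letter ä, and text without such sequences is returned unchanged.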

     

    Last edit: André Lieske 2016-02-20