UniversalCharDetCS
====================
UniversalCharDetCS is a standalone program (written in C#) for automatic charset / encoding detection of text files and web pages.
If automatic detection does not give good results, you can select the encoding manually and evaluate the results visually.
In manual mode you can also select the language; the list of encodings available for selection is then narrowed accordingly.
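
The manual mode described above can be pictured with a minimal Python sketch. This is an illustration only, not the program's C# code. It assumes a hypothetical table CANDIDATES that maps a language to some of the encodings listed in this readme (written as Python codec names), and a hypothetical helper preview() that decodes the same bytes with every candidate so the results can be compared visually.

```python
# Hypothetical language -> candidate encodings table (Python codec names),
# filled in from a few rows of the charset list in this readme.
CANDIDATES = {
    "Russian": ["cp855", "cp866", "iso8859_5", "koi8_r", "cp1251", "mac_cyrillic"],
    "Greek": ["iso8859_7", "cp1253"],
    "Hebrew": ["iso8859_8", "cp1255"],
}


def preview(data: bytes, language: str) -> dict:
    """Decode the same bytes with every candidate encoding for one language."""
    results = {}
    for name in CANDIDATES[language]:
        # errors="replace" keeps going on invalid bytes and marks them with U+FFFD
        results[name] = data.decode(name, errors="replace")
    return results


if __name__ == "__main__":
    sample = "Привет".encode("koi8_r")  # pretend the encoding is unknown
    for name, text in preview(sample, "Russian").items():
        print(f"{name:14} {text}")
```

Selecting a language first keeps the list of previews short, which is the point of the narrowing step: in this example only the koi8_r line should read as Russian.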

Charsets automatically recognized
=================================
Language / Name		Alias			CodePage	Remarks

Unicode
UTF-8			utf-8			65001	Unicode (UTF-8)
UTF-16LE		utf-16			1200	Unicode UTF-16, little endian byte order (BMP of ISO 10646)
UTF-16BE		unicodeFFFE		1201	Unicode UTF-16, big endian byte order
UTF-32LE		utf-32			12000	Unicode UTF-32, little endian byte order	Available only to managed applications
UTF-32BE		utf-32BE		12001	Unicode UTF-32, big endian byte order	Available only to managed applications
X-ISO-10646-UCS-4-2143	utf-32			12000	Unusual BOM (2143 order)	Not supported on MS Windows. UTF-32LE is very similar
X-ISO-10646-UCS-4-3412	utf-32BE		12001	Unusual BOM (3412 order)	Not supported on MS Windows. UTF-32BE is very similar

Bulgarian
ISO-8859-5		iso-8859-5		28595	ISO 8859-5 Cyrillic
windows-1251		windows-1251		1251	ANSI Cyrillic, Cyrillic (Windows)

Chinese
Big5			big5			950	ANSI/OEM Traditional Chinese (Taiwan, Hong Kong SAR, PRC)
GB18030			GB18030			54936	Simplified Chinese (4 byte), Chinese Simplified (GB18030)	Windows XP and later
HZ-GB-2312		hz-gb-2312		52936	HZ-GB2312 Simplified Chinese, Chinese Simplified (HZ)	
ISO-2022-CN		x-cp50227		50227	ISO 2022 Simplified Chinese, Chinese Simplified (ISO 2022)	
x-euc-tw		EUC-CN			51936	EUC Simplified Chinese, Chinese Simplified (EUC)	

Greek
ISO-8859-7		iso-8859-7		28597	ISO 8859-7 Greek
windows-1253		windows-1253		1253	ANSI Greek, Greek (Windows) 

Hebrew
ISO-8859-8		iso-8859-8		28598	ISO 8859-8 Hebrew, Hebrew (ISO-Visual)
windows-1255		windows-1255		1255	ANSI Hebrew, Hebrew (Windows)

Japanese
EUC-JP			euc-jp			51932	EUC Japanese	
ISO-2022-JP		csISO2022JP		50222	ISO 2022 Japanese JIS X 0201-1989, Japanese (JIS-Allow 1 byte Kana - SO/SI)	or 50221? or 50220? 
Shift_JIS		shift_jis		932	ANSI/OEM Japanese, Japanese (Shift-JIS)	

Korean
EUC-KR			euc-kr			51949	EUC Korean
ISO-2022-KR		iso-2022-kr		50225	ISO 2022 Korean

Russian
IBM855			IBM855			855	OEM Cyrillic (primarily Russian)
IBM866			cp866			866	OEM Russian, Cyrillic (DOS)
ISO-8859-5		iso-8859-5		28595	ISO 8859-5 Cyrillic
KOI8-R			koi8-r			20866	Russian (KOI8-R), Cyrillic (KOI8-R)
windows-1251		windows-1251		1251	ANSI Cyrillic, Cyrillic (Windows) 
x-mac-cyrillic		x-mac-cyrillic		10007	Cyrillic (Mac)

Thai
TIS-620			ISO 8859-11		874	TIS-620 (8-bit Thai) = ISO 8859-11	Code page 28601 is not supported on MS Windows; windows-874 is used instead

Others
ASCII			us-ascii		20127	US-ASCII (7-bit)
windows-1252		windows-1252		1252	ANSI Latin 1, Western European (Windows) 
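
For the Unicode rows in the list above, detection can start from the byte order mark (BOM). The following Python sketch is an illustration under stated assumptions, not the program's C# code: the function name detect_bom is hypothetical, the byte patterns are the standard BOM values, and the two unusual UCS-4 orders are obtained by permuting the four bytes of U+FEFF. Four-byte patterns are checked first because two of them begin with a UTF-16 BOM.

```python
# BOM byte patterns, longest first so that four-byte marks are not
# mistaken for the UTF-16 marks that two of them start with.
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\x00\x00\xff\xfe", "X-ISO-10646-UCS-4-2143"),
    (b"\xfe\xff\x00\x00", "X-ISO-10646-UCS-4-3412"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]


def detect_bom(data: bytes):
    """Return the charset name implied by a leading BOM, or None."""
    for mark, name in BOMS:
        if data.startswith(mark):
            return name
    return None
```

A UTF-16LE text whose first character after the BOM is U+0000 would be reported as UTF-32LE by this sketch. Text without a BOM returns None and would have to be handled by statistical detection.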

Information
===========
The software is based on the Mozilla Universal Charset Detector:
http://mxr.mozilla.org/mozilla/source/extensions/universalchardet/src/
Techniques used by universalchardet are described at:
http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html

Most of the basic code was taken from Ude (a C# port):
http://code.google.com/p/ude/

Related works (from which some ideas were taken):
(Pascal) http://chsdet.sourceforge.net/
(C#) http://code.google.com/p/nuniversalchardet/
(Java) http://code.google.com/p/juniversalchardet/

Code Page Identifiers: http://msdn.microsoft.com/en-us/library/windows/desktop/dd317756%28v=vs.85%29.aspx

License
=======
The software is subject to the Mozilla Public License Version 1.1.
Alternatively, the software may be used under the terms of either the GNU General Public License Version 2 or later,
or the GNU Lesser General Public License 2.1 or later.

Copyright (C) 2012 by Pawel57 <pawel57(at)users(dot)sourceforge(dot)net>
http://sourceforge.net/projects/streaman/files/Useful_Tools/

Source: Readme.txt, updated 2012-04-03