charset detection

Authors:

Overview
Algorithm
Implementations

Overview

Character set detection is the process of determining the character set, or encoding, of character data in an unknown format. This is, at best, an imprecise operation using statistics and heuristics. In some cases, the language can be determined along with the encoding.

Algorithm

The input is assumed to be an array of bytes. For each detectable encoding, the following procedure is executed:
The bytes are assumed to represent text in this encoding.
The byte sequence is checked for patterns that are legal in this encoding. If the test fails, the encoding is marked as "not me" and the next step is skipped.
If applicable (single-byte and multi-byte encodings), a statistical analysis is performed, such as the frequency distribution of trigraphs (three-letter groups) for each language that can be written in that encoding. The result of the analysis is a confidence: the estimated probability that this language and this encoding are in use.
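
The statistical step can be illustrated with a small Python sketch. It is not taken from any of the detectors listed on this page: the function names, the profile size of 300 and the scoring rule (the share of the text's three-letter groups that also occur in a reference profile) are assumptions chosen for illustration only. Real detectors use profiles built from large corpora and more careful scoring.

```python
from collections import Counter


def trigram_counts(text: str) -> Counter:
    """Count overlapping three-character groups in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def build_profile(sample: str, top_n: int = 300) -> frozenset:
    """Keep the most frequent groups of a reference sample as a language profile.

    top_n is an arbitrary illustrative value, not a tuned constant.
    """
    return frozenset(t for t, _ in trigram_counts(sample).most_common(top_n))


def trigram_confidence(text: str, profile: frozenset) -> float:
    """Return the share of the text's groups found in the profile (0.0 to 1.0)."""
    counts = trigram_counts(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # text shorter than three characters: nothing to measure
    hits = sum(n for t, n in counts.items() if t in profile)
    return hits / total
```

In this sketch the text has already been decoded with the candidate encoding, so a wrong encoding tends to produce character groups that are rare in every language profile and therefore a low confidence.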

After the loop has finished, the encoding with the highest confidence is reported.
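
The overall loop might be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code of any implementation listed below: `detect` and `printable_ratio` are hypothetical names, the legality check is delegated to Python's strict decoder, and the scorer is a deliberately simple stand-in for a real statistical analysis.

```python
from typing import Callable, Iterable, Optional, Tuple


def printable_ratio(text: str) -> float:
    """Stand-in scorer: share of characters that are printable or common whitespace."""
    if not text:
        return 0.0
    good = sum(1 for ch in text if ch.isprintable() or ch in "\r\n\t")
    return good / len(text)


def detect(
    data: bytes,
    candidates: Iterable[str],
    score: Callable[[str], float] = printable_ratio,
) -> Tuple[Optional[str], float]:
    """Try each candidate encoding and report the one with the highest confidence."""
    best_name, best_conf = None, 0.0
    for name in candidates:
        try:
            # Legality check: strict decoding rejects byte patterns
            # that are invalid in this encoding.
            text = data.decode(name, errors="strict")
        except UnicodeDecodeError:
            continue  # "not me": skip the statistical step
        conf = score(text)
        if conf > best_conf:
            best_name, best_conf = name, conf
    return best_name, best_conf
```

For example, `detect("héllo wörld".encode("utf-8"), ["ascii", "utf-8"])` rejects `ascii` at the legality check and reports `utf-8`. The candidate names must be codec names known to Python. Many single-byte encodings accept almost any byte sequence, so the legality check alone cannot separate them; that is where the statistical step matters.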

Implementations

Mozilla charset detector

enca

MS Windows API

ICU charset detector

Lazarus charset detector

Comparison

Encoding	Mozilla	enca	Win API	ICU	Lazarus
UTF-7	+	+	+	+	+
UTF-8	+	+	+	+	+
UTF-16LE	+	+	+	+	+
