Character set detection is the process of determining the character set, or encoding, of character data in an unknown format. This is, at best, an imprecise operation using statistics and heuristics. In some cases, the language can be determined along with the encoding.
Assume that the input is an array of bytes. For each detectable encoding, the following procedure is executed:
* Assume that the bytes represent text in this encoding.
* Check the byte sequence for patterns that are legal in this encoding. If the test fails, the encoding is marked as "not me" and the next step is skipped.
* If applicable (single-byte and multi-byte encodings), run a statistical analysis, such as the frequency distribution of trigraphs (three-letter groups), for each language that can be written in that encoding. The result of the analysis is a confidence value: the probability that this language and this encoding are used.
After the loop has evaluated every encoding, the encoding with the highest confidence is reported.
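The procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of any detector listed below. The names `PROFILES`, `CANDIDATES`, `confidence` and `detect` are hypothetical, and the trigraph tables are tiny hand-picked sets, whereas real detectors use large trained tables. The legality check is delegated to Python's strict decoder, and the confidence is simply the share of trigraphs found in a language profile, which only approximates a probability.

```python
# Hypothetical trigraph profiles: a few frequent trigraphs per language.
# Illustrative only; real detectors use much larger, trained tables.
PROFILES = {
    "en": {" th", "the", "he ", "ing", "nd ", "and", " an", "ion"},
    "de": {"en ", "er ", "ch ", "der", "ein", "sch", "ich", "die", "und"},
}

# Hypothetical list of detectable encodings and the languages tried for each.
CANDIDATES = {
    "utf-8": ["en", "de"],
    "utf-16-le": ["en", "de"],
    "iso-8859-1": ["en", "de"],
}


def trigraphs(text):
    """Return every three-character group of the lower-cased text."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def confidence(text, profile):
    """Share of trigraphs in the text that appear in the language profile."""
    grams = trigraphs(text)
    if not grams:
        return 0.0
    hits = sum(1 for gram in grams if gram in profile)
    return hits / len(grams)


def detect(data):
    """Return (encoding, language, confidence) for the best candidate."""
    best = (None, None, 0.0)
    for encoding, languages in CANDIDATES.items():
        # Steps 1 and 2: assume this encoding and check that the bytes are legal.
        try:
            text = data.decode(encoding)  # strict decoding rejects illegal bytes
        except UnicodeDecodeError:
            continue  # "not me": skip the statistical analysis
        # Step 3: statistical analysis for each language of this encoding.
        for language in languages:
            score = confidence(text, PROFILES[language])
            if score > best[2]:
                best = (encoding, language, score)
    return best


if __name__ == "__main__":
    sample = "the thing and the café".encode("utf-8")
    print(detect(sample))
```

Note that a single-byte encoding such as ISO-8859-1 accepts every byte sequence, so the legality check never rejects it and the decision rests entirely on the statistical step. The sketch also keeps the first candidate on a tie, which is an arbitrary choice.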
Comparison
Encoding | Mozilla | enca | Win API | ICU | Lazarus |
UTF-7 | + | + | + | + | + |
UTF-8 | + | + | + | + | + |
UTF-16LE | + | + | + | + | + |