The ZIP archives in this directory contain source and pre-built
language models. LA-Strings now knows how to identify 1119 languages
and 3611 language/encoding pairs. Using optional trigram models from
the An Crubadan web crawler project, an additional 150+ languages in
UTF-8 can potentially be identified. Alternatively, a subset of 101 of
the most-spoken languages can be used for greater speed.

RECENT CHANGES
==============
v1.21 2013-05-31:
Added -16b/-16l flags to 'whatlang' to permit input of UTF16BE and
UTF16LE text in line-by-line mode (echoed text is converted to
UTF8). Upgraded evaluation scripts to support testing of UTF16
language identification from UTF8 test/key files with new
--utf16be and --utf16le flags.
Added -l and -L flags to langident/subsample to sample lines by
length in bytes, and -b flag to sample uniformly with a target
size in bytes instead of number of lines.
Added -S flag to langident/mklangid to allow score smoothing power
to be set from the command line for tuning experiments.
Added support for running Shuyo's LangIdent from eval.sh.
Corrected training error for UTF16BE and UTF16LE models for Polish,
Hakka, and Pampangan.
[[Added models for 23 further languages; updated models for three more.]]
[languages.db: total languages=1119+2, total models=3737+3, lang/code
pairs=3611+3]
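
The UTF16-to-UTF8 echo described for v1.21 can be illustrated with a
short Python sketch. This is not the 'whatlang' implementation, only an
illustration of converting UTF-16 input to UTF-8 one line at a time.
The function name utf16_lines_to_utf8 and its big_endian parameter are
hypothetical and stand in for the choice between -16b and -16l.

```python
def utf16_lines_to_utf8(data: bytes, big_endian: bool) -> list[bytes]:
    """Decode UTF-16 input and return each line re-encoded as UTF-8."""
    # Hypothetical stand-in for choosing between the -16b and -16l flags.
    codec = "utf-16-be" if big_endian else "utf-16-le"
    text = data.decode(codec)
    # splitlines() drops the line terminators; note that it also breaks
    # on other Unicode line boundaries, not only on "\n".
    return [line.encode("utf-8") for line in text.splitlines()]


sample = "Dzie\u0144 dobry\nhello\n".encode("utf-16-be")
for line in utf16_lines_to_utf8(sample, big_endian=True):
    print(line.decode("utf-8"))
```

The explicit -be/-le codecs are used because they do not expect a byte
order mark, which matches input whose byte order is given by a flag.
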
v1.20 2012-11-21:
Corrected search order for language database file.
Corrected some language and country codes.
Corrected support for XZ-compressed input files for MkLangID.
Added support for a separate set of language models for
character-set identification to la-strings. Added -C option to
MkLangID to convert a language database into another database
containing a smaller number of models merged from the models in
the original database (initially a single model for each distinct
character encoding). This nearly doubles la-strings throughput
on random data with language identification for the full
languages.db and increases it by 25% for top100.db.
[[Added models for 16 further languages; increased data for 4 others.]]
[[Updated numerous language models, many of them now built from
known-redistributable data.]]
[languages.db: total languages=1096+1, total models=3653+2, lang/code
pairs=3530+2]
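
The idea behind the -C option in v1.20, merging many language models
into one model per character encoding, can be sketched as follows.
This is an assumption-laden illustration, not MkLangID's method: it
treats a model as a plain table of n-gram counts and merges by adding
counts, and the names merge_by_encoding and models are hypothetical.

```python
from collections import Counter


def merge_by_encoding(models):
    """Collapse {(language, encoding): n-gram counts} to one model per encoding."""
    merged = {}
    for (language, encoding), ngrams in models.items():
        # Assumption: merging is done by summing raw n-gram counts.
        merged.setdefault(encoding, Counter()).update(ngrams)
    return merged


# Toy data, invented for illustration only.
models = {
    ("en", "utf-8"): Counter({"the": 5, "ing": 3}),
    ("de", "utf-8"): Counter({"der": 4, "ing": 1}),
    ("ru", "koi8-r"): Counter({"\u043f\u0440\u043e": 2}),
}
charset_models = merge_by_encoding(models)
print(sorted(charset_models))
```

With fewer models to score against, a character-set check has less
work per input block, which is consistent with the throughput gain the
entry reports.
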
v1.19 2012-10-21:
Reallocated bits in language database's frequency records to permit
8191 language models instead of 4095; moved computation of
frequency smoothing to database creation time to reduce
quantization error with the reduced available range after bit
reallocation.
Recoded innermost n-gram counting loops for a 40% reduction in
language identification time.
Fixed truncation error when generating a fake UTF-8 language model
from a Unicode codepoint range.
Removed executables from distribution archive.
[[Added 48 additional languages, added more data for 11 languages.]]
[[Added second script for two languages.]]
[languages.db: total languages=1080+1, total models=3598+2, lang/code
pairs=3480+2]
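
The bit reallocation in v1.19 can be pictured with a small packing
sketch. The real record layout is not described here, so everything
below is an assumption: a 32-bit record, the model ID in the low bits,
and the names ID_BITS, FREQ_BITS, pack and unpack. The only point
taken from the entry is that widening the ID field by one bit leaves
one bit less for the frequency value.

```python
ID_BITS = 13                   # one bit wider than a 12-bit ID field
FREQ_BITS = 32 - ID_BITS       # assumption: 32-bit record, so 19 bits remain
ID_MASK = (1 << ID_BITS) - 1   # largest representable model ID
FREQ_MAX = (1 << FREQ_BITS) - 1


def pack(model_id: int, freq: int) -> int:
    """Pack a model ID and a quantized frequency into one integer record."""
    if not 0 <= model_id <= ID_MASK:
        raise ValueError("model_id does not fit in the ID field")
    if not 0 <= freq <= FREQ_MAX:
        raise ValueError("freq does not fit in the frequency field")
    return (freq << ID_BITS) | model_id


def unpack(record: int) -> tuple[int, int]:
    """Return (model_id, freq) from a packed record."""
    return record & ID_MASK, record >> ID_BITS


record = pack(8191, 12345)
print(unpack(record))
```

A 13-bit field has 8192 distinct values where a 12-bit field has 4096.
Under this assumed layout the frequency field shrinks by one bit, which
halves its range and is the kind of loss that makes quantization error
worth reducing.
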