The ZIP archives in this directory contain source and pre-built
language models. LA-Strings now knows how to identify 1193 languages
and 3860 language/encoding pairs. Using optional trigram models from
the An Crubadan web crawler project, an additional 140+ languages in
UTF-8 can potentially be identified. Alternatively, a subset of 101 of
the most-spoken languages can be used for greater speed.
Added missing initializations to scan_strings so that the
bulk_extractor plugin extracts the same strings as the standalone
version. Fixed multi-threading crash.
Added setlocale("en_US.UTF-8") for systems on which
Added --batch flag to eval.sh to run all identifications in a
single invocation of the identification program, to avoid startup
overhead (particularly for LangDetect and langid.py). Added
util/score.C to perform bulk scoring on batch-mode identification
Fixed eval.sh line-counting with --utf16be and --utf16le.
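Line counting is easy to get wrong for UTF-16 input: the newline is a
two-byte code unit, and the byte 0x0A can also occur inside other
characters. This changelog does not say what the actual fault in eval.sh
was, so the Python sketch below only illustrates the general pitfall. The
helper name count_lines_utf16 is hypothetical and is not part of LA-Strings.

```python
def count_lines_utf16(data: bytes, big_endian: bool = True) -> int:
    """Count U+000A code units in UTF-16 data, stepping on 2-byte boundaries.

    Assumes the data holds no BOM and no partial code unit at the end.
    """
    newline = b"\x00\x0a" if big_endian else b"\x0a\x00"
    return sum(1 for i in range(0, len(data) - 1, 2) if data[i:i + 2] == newline)


# U+0A05 (a Gurmukhi letter) contains the byte 0x0A, so a plain byte count
# of 0x0A sees more "newlines" than the text really has.
sample = "a\n\u0a05\n".encode("utf-16-be")
naive = sample.count(b"\x0a")        # byte-level count, not aligned
aligned = count_lines_utf16(sample)  # code-unit-level count
```

Here the aligned count matches the two newlines in the text, while the
byte-level count is inflated by the stray 0x0A inside U+0A05.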
Added pseudo-model for HTML markup and five additional languages.
[languages.db: total languages=1193+3, total models=4004+5, lang/code
pairs=3860+5, total encodings=38]
Can now be compiled as a bulk_extractor plugin.
Added support for running langid.py
(https://github.com/saffsd/langid.py) from eval.sh, and
--minlen/--maxlen options for characterizing error rates at
varying string lengths.
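To picture what length-windowed evaluation measures, the Python sketch below
computes an error rate over only those test strings whose length falls in a
given window. It is not the eval.sh implementation: the function name, the
(text, expected, predicted) record layout, and measuring length in UTF-8
bytes are all assumptions made for illustration.

```python
def error_rate(records, minlen=0, maxlen=None):
    """Fraction of misidentified strings with byte length in [minlen, maxlen].

    records: iterable of (text, expected_lang, predicted_lang) tuples.
    Returns None when no record falls inside the window.
    """
    total = wrong = 0
    for text, expected, predicted in records:
        n = len(text.encode("utf-8"))  # assumption: length counted in bytes
        if n < minlen or (maxlen is not None and n > maxlen):
            continue
        total += 1
        if expected != predicted:
            wrong += 1
    return wrong / total if total else None


# Made-up sample data, purely for demonstration.
records = [
    ("hi", "en", "de"),
    ("ok", "en", "en"),
    ("hello world", "en", "en"),
    ("bonjour", "fr", "fr"),
]
short_rate = error_rate(records, minlen=0, maxlen=3)
long_rate = error_rate(records, minlen=4)
```

Running the same records through several windows gives one error rate per
length band, which is the kind of comparison the options are meant to support.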
Corrected problem with line counting in eval.sh when using
--utf16be and --utf16le. Fixed bug in mklangid when building
UTF16 models without using -2b/-2l/-8b/-8l.
MkLangID was not correctly filtering ngrams for ASCII-16BE and
UTF-16BE, because the deciding byte of the second character is
the fourth byte of the ngram, and the filtering took place at the
Added make targets for language databases omitting UTF16 models.
Modified build process to eliminate warnings for "top100" databases.
Added several missing UTF16 models. Added romanized Hindi language
models. Updated Marshallese training data to full Bible.
[[69 additional languages.]]
[languages.db: total languages=1188+2, total models=3980+4, lang/code
pairs=3843+4, total encodings=38]
Added -16b/-16l flags to 'whatlang' to permit input of UTF16BE and
UTF16LE text in line-by-line mode (echoed text is converted to
UTF8). Upgraded evaluation scripts to support testing of UTF16
language identification from UTF8 test/key files with new
--utf16be and --utf16le flags.
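Testing UTF-16 identification from UTF-8 test files amounts to transcoding
each test line before it reaches the identifier, while the key file can stay
in UTF-8. A minimal Python sketch of that transcoding step follows. The
helper name utf8_to_utf16 is hypothetical, and this is not the code used by
the evaluation scripts.

```python
def utf8_to_utf16(utf8_bytes: bytes, big_endian: bool = True) -> bytes:
    """Transcode UTF-8 text to UTF-16BE or UTF-16LE without a byte-order mark."""
    # The explicit-endianness codecs do not emit a BOM; plain "utf-16" would.
    codec = "utf-16-be" if big_endian else "utf-16-le"
    return utf8_bytes.decode("utf-8").encode(codec)


line = "h\u00e9\n".encode("utf-8")
as_be = utf8_to_utf16(line, big_endian=True)
as_le = utf8_to_utf16(line, big_endian=False)
```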
Added -l and -L flags to langident/subsample to sample lines by
length in bytes, and -b flag to sample uniformly with a target
size in bytes instead of a number of lines.
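Sampling to a byte budget rather than a line count could look like the
Python sketch below. It illustrates one plausible reading only: filter lines
by a length window, then keep each remaining line with probability
target/total so the expected output size is close to the target. The
function and parameter names are hypothetical and do not describe how
subsample actually implements -l, -L, or -b.

```python
import random


def sample_by_bytes(lines, target_bytes, min_len=0, max_len=None, seed=0):
    """Uniformly sample byte-string lines so the result is about target_bytes.

    Lines outside [min_len, max_len] bytes are dropped before sampling.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    eligible = [ln for ln in lines
                if len(ln) >= min_len and (max_len is None or len(ln) <= max_len)]
    total = sum(len(ln) for ln in eligible)
    if total == 0:
        return []
    keep_prob = min(1.0, target_bytes / total)
    return [ln for ln in eligible if rng.random() < keep_prob]


corpus = [b"x" * 10] * 1000            # 10,000 bytes of made-up input
subset = sample_by_bytes(corpus, 2000)  # aim for roughly 2,000 bytes
```

Because each line is kept independently, the output size is only
approximately equal to the target; an exact budget would need a different
scheme, such as shuffling and taking lines until the budget is reached.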
Added -S flag to langident/mklangid to allow score smoothing power
to be set from the command line for tuning experiments.
Added support for running Shuyo's LangIdent from eval.sh.
Corrected training error for UTF16BE and UTF16LE models for Polish,
Hakka, and Pampangan.
[[Added models for 23 further languages; updated models for three more.]]
[languages.db: total languages=1119+2, total models=3737+3, lang/code
pairs=3614+3, total encodings=38]