The ZIP archives in this directory contain source and pre-built
language models. LA-Strings now knows how to identify 1371 languages
and 4466 language/encoding pairs. Using optional trigram models from
the An Crubadan web crawler project, an additional 120+ languages in
UTF-8 can potentially be identified. Alternatively, a subset of 103 of
the most-spoken languages can be used for greater speed.
Improved n-gram weighting for language identification yields ~3%
relative reduction in classification errors in preliminary
Improved byte-based subsampling in 'subsample', so that the size of
the result is now much more accurate. Eliminated subsample.C's
dependencies on FramepaC to allow standalone distribution.
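
As an illustration of what byte-based subsampling means, the sketch
below keeps lines so that the running total of kept bytes tracks a
pro-rata share of the bytes read so far. It is not the algorithm in
subsample.C; the function name `subsample_by_bytes` and its parameters
are hypothetical, chosen only for this example.

```python
def subsample_by_bytes(lines, target_bytes):
    """Pick a subset of lines whose total size approximates target_bytes.

    Hypothetical sketch, not the code in subsample.C: a line is kept
    whenever keeping it moves the kept size closer to the pro-rata goal
    (bytes seen so far * target / total).
    """
    total = sum(len(line) for line in lines)
    if total == 0 or target_bytes >= total:
        return list(lines)
    ratio = target_bytes / total
    kept = []
    kept_bytes = 0
    seen_bytes = 0
    for line in lines:
        seen_bytes += len(line)
        # Keep the line if at least half of it still fits under the goal.
        if kept_bytes + len(line) / 2 <= seen_bytes * ratio:
            kept.append(line)
            kept_bytes += len(line)
    return kept
```

Because the decision is made per line against a running goal, the
result in this sketch stays within about one line length of the
requested size, rather than drifting as a fixed every-Nth-line
selection can when line lengths vary.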
Added option to MkLangID to perform frequency smoothing using
a logarithmic mapping rather than P^y mapping; specify a
negative smoothing value for -S to activate.
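
The difference between the two kinds of mapping can be sketched as
follows. Only the P^y form and the negative-value convention for -S
come from the note above; the particular logarithmic formula, the use
of the magnitude of the value as a scale factor, and all function names
are assumptions made for illustration, not MkLangID's actual code.

```python
import math

def smooth_power(p, y):
    """Power mapping: relative frequency p in [0, 1] becomes p ** y."""
    return p ** y

def smooth_log(p, scale):
    """Hypothetical logarithmic mapping (assumed formula).

    log1p(scale * p) / log1p(scale) maps 0 to 0 and 1 to 1, and boosts
    small frequencies relative to large ones.
    """
    return math.log1p(scale * p) / math.log1p(scale)

def smooth(p, s):
    """Dispatch on the sign of s, mirroring the -S convention:
    a negative value selects the logarithmic mapping."""
    if s < 0:
        return smooth_log(p, -s)  # assumption: |s| acts as the scale
    return smooth_power(p, s)
```

Both mappings are monotonic, so the ranking of n-grams by frequency is
unchanged; only the spread between frequent and rare n-grams differs.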
Boosted model size for a small number of highly confusable
language sets to 15000 n-grams for better discrimination.
Added --yali flag to eval.sh and counts.sh to support use of
Majlis's Yet Another Language Identifier as the language
identification program in evaluations.
Fixed some compiler warnings for GCC 4.3.2 and 4.8.
Replaced numerous Bibles with known-redistributable copies.
[[Added 178 languages; updated dozens more.]]
[languages.db: total languages=1371+3, total models=4652+5, lang/code
pairs=4466+5, total encodings=38]
Added missing initializations to scan_strings so that
bulk_extractor plugin extracts the same strings as the standalone
version. Fixed multi-threading crash.
Added setlocale("en_US.UTF-8") for systems on which
Added --batch flag to eval.sh to run all identifications in a
single invocation of the identification program, to avoid startup
overhead (particularly for LangDetect and langid.py). Added
util/score.C to perform bulk scoring on batch-mode identification
Fixed eval.sh line-counting with --utf16be and --utf16le.
Added pseudo-model for HTML markup.
[languages.db: total languages=1193+3, total models=4004+5, lang/code
pairs=3860+5, total encodings=38]
Fleshed out bulk_extractor interface in scan_strings.C. Requires
bulk_extractor v1.4.0 (tested with beta3).
Added support for running langid.py
(https://github.com/saffsd/langid.py) from eval.sh, and
--minlen/--maxlen options for characterizing error rates at
varying string lengths.
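
Characterizing error rates by string length amounts to restricting the
scored strings to a length window and computing the error rate within
it. The sketch below shows that computation only; it is not how eval.sh
is implemented, and the function name, the tuple layout, and the choice
to measure length in characters rather than bytes are all assumptions.

```python
def error_rate(results, minlen=0, maxlen=None):
    """Error rate over strings whose length lies in [minlen, maxlen].

    results: iterable of (text, gold_language, predicted_language).
    Returns None when no string falls inside the window.
    Hypothetical helper; length is counted in characters (assumption).
    """
    selected = [
        (gold, predicted)
        for text, gold, predicted in results
        if len(text) >= minlen and (maxlen is None or len(text) <= maxlen)
    ]
    if not selected:
        return None
    errors = sum(1 for gold, predicted in selected if gold != predicted)
    return errors / len(selected)
```

Calling such a function repeatedly with adjacent windows (for example
1 to 10, 11 to 20, and so on) gives a curve of error rate against
string length.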
Corrected problem with line counting in eval.sh when using
--utf16be and --utf16le. Fixed bug in mklangid when building
UTF16 models without using -2b/-2l/-8b/-8l.
MkLangID was not correctly filtering ngrams for ASCII-16BE and
UTF-16BE, because the deciding byte of the second character is
the fourth byte of the ngram, and the filtering took place at the
Added make targets for language databases omitting UTF16 models.
Modified build process to eliminate warnings for "top100" databases.
Added several missing UTF16 models. Added romanized Hindi language
models. Updated Marshallese training data to full Bible.
Replaced Sumo-Mayangna (sum_NI) by Mayangna (yan_NI), following ISO
[[69 additional languages.]]
[languages.db: total languages=1188+2, total models=3980+4, lang/code
pairs=3843+4, total encodings=38]