Name                 Modified     Size
Language-Data        2012-11-22
Old-Versions         2012-08-24
README               2013-06-11   3.1 kB
lastrings-1.21.zip   2013-06-05   78.0 MB
lastrings-1.20.zip   2012-11-22   76.7 MB
lastrings-1.19.zip   2012-10-22   78.8 MB
Totals: 6 Items                   233.5 MB

The ZIP archives in this directory contain source and pre-built language models. LA-Strings now knows how to identify 1196 languages and 3611 language/encoding pairs. Using optional trigram models from the An Crubadan web crawler project, an additional 150+ languages in UTF-8 can potentially be identified. Alternatively, a subset of 101 of the most-spoken languages can be used for greater speed.

RECENT CHANGES
==============

v1.21 2013-05-31:
  Added -16b/-16l flags to 'whatlang' to permit input of UTF16BE and UTF16LE text in line-by-line mode (echoed text is converted to UTF8).
  Upgraded evaluation scripts to support testing of UTF16 language identification from UTF8 test/key files with new --utf16be and --utf16le flags.
  Added -l and -L flags to langident/subsample to sample lines by length in bytes, and -b flag to sample uniformly with a target size in bytes instead of number of lines.
  Added -S flag to langident/mklangid to allow score smoothing power to be set from the commandline for tuning experiments.
  Added support for running Shuyo's LangIdent from eval.sh.
  Corrected training error for UTF16BE and UTF16LE models for Polish, Hakka, and Pampangan.
  [[Added models for 23 further languages; updated models for three more.]]
  [languages.db: total languages=1119+2, total models=3737+3, lang/code pairs=3611+3]

v1.20 2012-11-21:
  Corrected search order for language database file.
  Corrected some language and country codes.
  Corrected support for XZ-compressed input files for MkLangID.
  Added support for a separate set of language models for character-set identification to la-strings.
  Added -C option to MkLangID to convert a language database into another database containing a smaller number of models merged from the models in the original database (initially a single model for each distinct character encoding). This nearly doubles la-strings throughput on random data with language identification for the full languages.db and increases it by 25% for top100.db.
  [[Added models for 16 further languages; increased data for 4 others.]]
  [[Updated numerous language models, many of them now built from known-redistributable data.]]
  [languages.db: total languages=1096+1, total models=3653+2, lang/code pairs=3530+2]

v1.19 2012-10-21:
  Reallocated bits in language database's frequency records to permit 8191 language models instead of 4095; moved computation of frequency smoothing to database creation time to reduce quantization error with the reduced available range after bit reallocation.
  Recoded innermost n-gram counting loops for a 40% reduction in language identification time.
  Fixed truncation error when generating fake UTF-8 language model from Unicode codepoint range.
  Removed executables from distribution archive.
  [[Added 48 additional languages, added more data for 11 languages.]]
  [[Added second script for two languages.]]
  [languages.db: total languages=1080+1, total models=3598+2, lang/code pairs=3480+2]
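
The v1.19 note about reallocating bits can be illustrated with a little arithmetic: 4095 and 8191 are 2^12 - 1 and 2^13 - 1, so the change is consistent with widening a model-ID field from 12 to 13 bits and leaving one bit less for the frequency value. The Python sketch below is only an illustration of that trade-off. The 32-bit record width, the field order, and the names pack/unpack/max_models are assumptions made for this example; the actual languages.db record layout is not described in this README.

```python
# Illustrative sketch only. RECORD_BITS and the field order are assumptions,
# not the real languages.db format.
RECORD_BITS = 32


def max_models(id_bits: int) -> int:
    """Largest value that fits in an unsigned field of id_bits bits."""
    return (1 << id_bits) - 1


def pack(model_id: int, freq: int, id_bits: int) -> int:
    """Pack a model ID (low bits) and a frequency (high bits) into one record."""
    freq_bits = RECORD_BITS - id_bits
    if not 0 <= model_id < (1 << id_bits):
        raise ValueError("model_id does not fit in the ID field")
    if not 0 <= freq < (1 << freq_bits):
        raise ValueError("freq does not fit in the frequency field")
    return (freq << id_bits) | model_id


def unpack(record: int, id_bits: int) -> tuple[int, int]:
    """Return (model_id, freq) from a record built by pack()."""
    return record & ((1 << id_bits) - 1), record >> id_bits


if __name__ == "__main__":
    for id_bits in (12, 13):
        freq_bits = RECORD_BITS - id_bits
        print(f"{id_bits}-bit ID: up to {max_models(id_bits)} models, "
              f"{freq_bits} bits ({1 << freq_bits} levels) left for frequency")
```

Under these assumed widths, the extra ID bit halves the number of distinct frequency levels, which is the kind of reduced range the changelog entry refers to when it mentions quantization error.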
Source: README, updated 2013-06-11