Name                 Modified     Size       Downloads / Week
Totals: 7 Items                   238.1 MB   6
Evaluation           2013-08-28              1
Language-Data        2013-03-05              2
Old-Versions         2012-11-22              23
README               2013-08-28   3.3 kB     33
lastrings-1.23.zip   2013-08-28   80.1 MB    11
lastrings-1.22.zip   2013-07-30   80.0 MB    11
lastrings-1.21.zip   2013-06-05   78.0 MB    11
The ZIP archives in this directory contain the source code and pre-built language models. LA-Strings now knows how to identify 1193 languages and 3860 language/encoding pairs. Using optional trigram models from the An Crubadan web crawler project, an additional 140+ languages in UTF-8 can potentially be identified. Alternatively, a subset of 101 of the most-spoken languages can be used for greater speed.

v1.23 2013-08-28:
Added missing initializations to scan_strings so that the bulk_extractor plugin extracts the same strings as the standalone version.
Fixed multi-threading crash.
Added setlocale("en_US.UTF-8") for systems on which setlocale("UTF-8") fails.
Added --batch flag to eval.sh to run all identifications in a single invocation of the identification program, to avoid startup overhead (particularly for LangDetect and langid.py).
Added util/score.C to perform bulk scoring on batch-mode identification output.
Fixed eval.sh line-counting with --utf16be and --utf16le.
Added pseudo-model for HTML markup and five additional languages.
[languages.db: total languages=1193+3, total models=4004+5, lang/code pairs=3860+5, total encodings=38]

v1.22 2013-07-30:
Can now be compiled as a bulk_extractor plugin.
Added support for running langid.py (https://github.com/saffsd/langid.py) from eval.sh, and --minlen/--maxlen options for characterizing error rates at varying string lengths.
Corrected problem with line counting in eval.sh when using --utf16be and --utf16le.
Fixed bug in mklangid when building UTF16 models without using -2b/-2l/-8b/-8l.
MkLangID was not correctly filtering ngrams for ASCII-16BE and UTF-16BE, because the deciding byte of the second character is the fourth byte of the ngram, and the filtering took place at the trigram-counting stage.
Added make targets for language databases omitting UTF16 models.
Modified build process to eliminate warnings for "top100" databases.
Added several missing UTF16 models.
Added romanized Hindi language models.
Updated Marshallese training data to full Bible.
[[69 additional languages.]]
[languages.db: total languages=1188+2, total models=3980+4, lang/code pairs=3843+4, total encodings=38]

v1.21 2013-05-31:
Added -16b/-16l flags to 'whatlang' to permit input of UTF16BE and UTF16LE text in line-by-line mode (echoed text is converted to UTF8).
Upgraded evaluation scripts to support testing of UTF16 language identification from UTF8 test/key files with new --utf16be and --utf16le flags.
Added -l and -L flags to langident/subsample to sample lines by length in bytes, and -b flag to sample uniformly with a target size in bytes instead of number of lines.
Added -S flag to langident/mklangid to allow score smoothing power to be set from the command line for tuning experiments.
Added support for running Shuyo's LangIdent from eval.sh.
Corrected training error for UTF16BE and UTF16LE models for Polish, Hakka, and Pampangan.
[[Added models for 23 further languages; updated models for three more.]]
[languages.db: total languages=1119+2, total models=3737+3, lang/code pairs=3614+3, total encodings=38]
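
The v1.22 note about ngram filtering for ASCII-16BE and UTF-16BE may be easier to follow with the byte layout spelled out. The sketch below is only an illustration in Python and is not taken from mklangid; the helper names utf16be_bytes and leading_ngram are hypothetical. It assumes ASCII-range text, where each UTF-16BE character is a zero byte followed by the character's own byte, so the byte that distinguishes the second character sits at position four and a three-byte ngram cannot see it.

```python
def utf16be_bytes(text: str) -> bytes:
    """Encode text as UTF-16BE without a byte-order mark."""
    return text.encode("utf-16-be")


def leading_ngram(text: str, n: int) -> bytes:
    """Return the first n bytes of the UTF-16BE encoding of text.

    Hypothetical helper for illustration only; mklangid's real
    ngram counting is not reproduced here.
    """
    return utf16be_bytes(text)[:n]


# Two strings that differ only in their second character.
first, second = "ab", "ac"

print(utf16be_bytes(first).hex(" "))   # 00 61 00 62
print(utf16be_bytes(second).hex(" "))  # 00 61 00 63

# A trigram covers bytes 1-3, so both strings look identical at that stage.
print(leading_ngram(first, 3) == leading_ngram(second, 3))  # True

# Only a 4-byte ngram includes the deciding byte of the second character.
print(leading_ngram(first, 4) == leading_ngram(second, 4))  # False
```

Under these assumptions, any decision that depends on the second character has to be made on ngrams of at least four bytes, which is consistent with the README's remark that filtering at the trigram-counting stage came too early.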
Source: README, updated 2013-08-28