Name                 Modified    Size      Downloads / Week
Language-Data        2014-11-21            2
Evaluation           2013-08-28            0
Old-Versions         2012-11-22            0
README               2015-09-28  2.5 kB    0
lastrings-1.25.zip   2015-09-28  102.9 MB  0
lastrings-1.24.zip   2014-08-19  96.0 MB   0
lastrings-1.23.zip   2013-08-28  80.1 MB   0
lastrings-1.22.zip   2013-07-30  80.0 MB   0
lastrings-1.21.zip   2013-06-05  78.0 MB   0
Totals: 9 Items                  437.0 MB  2
The ZIP archives in this directory contain source and pre-built language models. LA-Strings now knows how to identify 1475 languages and 4819 language/encoding pairs. Using optional trigram models from the An Crubadan web crawler project, an additional 110+ languages in UTF-8 can potentially be identified. Alternatively, a subset of 103 of the most-spoken languages can be used for greater speed.

v1.25 2015-09-28:
    Made scan_strings compatible with BulkExtractor v1.5.5.
    [[Added 104 languages; updated numerous others.]]
    [languages.db: total languages=1475+3, total models=5003+5, lang/code pairs=4819+5, total encodings=38]

v1.24 2014-08-19:
    Improved n-gram weighting for language identification yields ~3% relative reduction in classification errors in preliminary testing.
    Improved byte-based subsampling in 'subsample' results in much more accurate size of result.
    Eliminated subsample.C dependencies on FramepaC to allow standalone distribution.
    Added option to MkLangID to perform frequency smoothing using a logarithmic mapping rather than P^y mapping; specify a negative smoothing value for -S to activate.
    Boosted model size for a small number of highly-confusible language sets to 15000 n-grams for better discrimination.
    Added --yali flag to eval.sh and counts.sh to support use of Majlis's Yet Another Language Identifier as language identification program in evaluations.
    Fixed some compiler warnings for GCC 4.3.2 and 4.8.
    Replaced numerous Bibles with known-redistributable copies.
    [[Added 178 languages; updated dozens more.]]
    [languages.db: total languages=1371+3, total models=4652+5, lang/code pairs=4466+5, total encodings=38]

v1.23 2013-08-28:
    Added missing initializations to scan_strings so that bulk_extractor plugin extracts the same strings as the standalone version.
    Fixed multi-threading crash.
    Added setlocale("en_US.UTF-8") for systems on which setlocale("UTF-8") fails.
    Added --batch flag to eval.sh to run all identifications in a single invocation of the identification program, to avoid startup overhead (particularly for LangDetect and langid.py).
    Added util/score.C to perform bulk scoring on batch-mode identification output.
    Fixed eval.sh line-counting with --utf16be and --utf16le.
    Added pseudo-model for HTML markup.
    [languages.db: total languages=1193+3, total models=4004+5, lang/code pairs=3860+5, total encodings=38]
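
The setlocale fallback mentioned under v1.23 can be illustrated with a short sketch. This is not the LA-Strings code (which is written in C++); it is a Python illustration of the same idea, trying "UTF-8" first and falling back to "en_US.UTF-8". The helper name set_first_available_locale and the choice of the LC_CTYPE category are assumptions made for the example, since the release notes do not say which category is set.

```python
import locale


def set_first_available_locale(candidates, category=locale.LC_CTYPE):
    """Try each locale name in order and return the first one the system
    accepts, or None if none of them can be set.

    Hypothetical helper for illustration; LC_CTYPE is an assumed category.
    """
    for name in candidates:
        try:
            locale.setlocale(category, name)
            return name
        except locale.Error:
            # This name is not supported on this system; try the next one.
            continue
    return None


if __name__ == "__main__":
    # Same order as described in the v1.23 notes: plain "UTF-8" first,
    # then "en_US.UTF-8" for systems where the first name fails.
    chosen = set_first_available_locale(["UTF-8", "en_US.UTF-8"])
    print("locale set to:", chosen)
```

Which names succeed depends on the locales installed on the machine, so the printed result will vary from system to system.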
Source: README, updated 2015-09-28
