Language-Aware String Extractor - Browse Files at SourceForge.net

Name	Modified	Size	Downloads / Week
Language-Data	2023-07-25		3
Evaluation	2013-08-28		0
Old-Versions	2012-11-22		0
README	2015-09-28	2.5 kB	0
lastrings-1.25.zip	2015-09-28	102.9 MB	0
lastrings-1.24.zip	2014-08-19	96.0 MB	0
lastrings-1.23.zip	2013-08-28	80.1 MB	0
lastrings-1.22.zip	2013-07-30	80.0 MB	0
lastrings-1.21.zip	2013-06-05	78.0 MB	0
Totals: 9 Items		437.0 MB	3

The ZIP archives in this directory contain source and pre-built
language models. LA-Strings now knows how to identify 1475 languages
and 4819 language/encoding pairs. Using optional trigram models from
the An Crubadan web crawler project, an additional 110+ languages in
UTF-8 can potentially be identified. Alternatively, a subset of 103 of
the most-spoken languages can be used for greater speed.

v1.25 2015-09-28:
   Made scan_strings compatible with BulkExtractor v1.5.5.
   [[Added 104 languages; updated numerous others.]]
   [languages.db: total languages=1475+3, total models=5003+5, lang/code
     pairs=4819+5, total encodings=38]

v1.24 2014-08-19:
   Improved n-gram weighting for language identification yields ~3%
     relative reduction in classification errors in preliminary
     testing.
   Improved byte-based subsampling in 'subsample' results in much more
     accurate size of result.  Eliminated subsample.C dependencies
     on FramepaC to allow standalone distribution.
   Added option to MkLangID to perform frequency smoothing using
     a logarithmic mapping rather than P^y mapping; specify a
     negative smoothing value for -S to activate.
   Boosted model size for a small number of highly-confusible
     language sets to 15000 n-grams for better discrimination.
   Added --yali flag to eval.sh and counts.sh to support use of
     Majlis's Yet Another Language Identifier as language
     identification program in evaluations.
   Fixed some compiler warnings for GCC 4.3.2 and 4.8.
   Replaced numerous Bibles with known-redistributable copies.
   [[Added 178 languages; updated dozens more.]]
   [languages.db: total languages=1371+3, total models=4652+5, lang/code
     pairs=4466+5, total encodings=38]
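
The v1.24 note on byte-based subsampling does not describe the algorithm used in subsample.C. As an illustration only, the sketch below shows one generic way to make the output size track a byte target: selection sampling in which each line is kept with probability (bytes still needed) / (bytes still available). The function name, parameters, and seed are hypothetical and are not taken from LA-Strings.

```python
import random


def subsample_by_bytes(lines, target_bytes, seed=0):
    """Keep a random subset of lines totalling roughly target_bytes.

    Illustrative sketch only; NOT the method used by LA-Strings' subsample.
    """
    rng = random.Random(seed)                 # seeded so runs are repeatable
    remaining = sum(len(line) for line in lines)  # bytes not yet examined
    needed = target_bytes                     # bytes still to be selected
    kept = []
    for line in lines:
        size = len(line)
        # Keep with probability needed/remaining; never keep once the
        # budget is used up, always keep once every remaining byte is needed.
        if remaining > 0 and rng.random() < max(needed, 0) / remaining:
            kept.append(line)
            needed -= size
        remaining -= size
    return kept


if __name__ == "__main__":
    data = [b"x" * 9 + b"\n" for _ in range(100)]   # 100 lines, 10 bytes each
    sample = subsample_by_bytes(data, 250)
    print(len(sample), sum(len(line) for line in sample))
```

With equal-length lines the result matches the target exactly; with variable-length lines it can miss by up to about one line's length, because the last line kept may overshoot the budget.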

v1.23 2013-08-28:
   Added missing initializations to scan_strings so that
     bulk_extractor plugin extracts the same strings as the standalone
     version.  Fixed multi-threading crash.
   Added setlocale("en_US.UTF-8") for systems on which
     setlocale("UTF-8") fails.
   Added --batch flag to eval.sh to run all identifications in a
     single invocation of the identification program, to avoid startup
     overhead (particularly for LangDetect and langid.py).  Added
     util/score.C to perform bulk scoring on batch-mode identification
     output.
   Fixed eval.sh line-counting with --utf16be and --utf16le.
   Added pseudo-model for HTML markup.
   [languages.db: total languages=1193+3, total models=4004+5, lang/code
     pairs=3860+5, total encodings=38]
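
The setlocale fallback in the v1.23 notes can be sketched as follows. LA-Strings itself is not written in Python, and the README does not say which locale category is set, so the use of LC_ALL and the helper name set_utf8_locale are assumptions made for illustration.

```python
import locale


def set_utf8_locale(candidates=("UTF-8", "en_US.UTF-8")):
    """Try each locale name in order; return the one that was set, or None.

    Sketch of the fallback described in the v1.23 notes. LC_ALL is an
    assumption; the README does not name the category.
    """
    for name in candidates:
        try:
            return locale.setlocale(locale.LC_ALL, name)
        except locale.Error:
            # This name is not available on the system; try the next one.
            continue
    return None


if __name__ == "__main__":
    chosen = set_utf8_locale()
    print("locale in effect:", chosen if chosen else "unchanged")
```

Trying the bare "UTF-8" name first and only then "en_US.UTF-8" mirrors the order given in the changelog entry.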

Source: README, updated 2015-09-28

Language-Aware String Extractor Files

multi-encoding strings(1) replacement with language identification
