NGramJ Code
Brought to you by:
nestefan
| File | Date | Author | Commit |
|---|---|---|---|
| samples | 2009-12-23 | nestefan | [r13] * See ChangeLog.txt |
| src | 2009-12-23 | nestefan | [r13] * See ChangeLog.txt |
| ChangeLog.txt | 2009-12-23 | nestefan | [r13] * See ChangeLog.txt |
| LICENSE.txt | 2006-03-27 | nestefan | [r2] Added keywords for this project. |
| README-license.txt | 2006-03-29 | nestefan | [r4] File was still refering to the Jacson project. |
| README.txt | 2006-03-27 | nestefan | [r2] Added keywords for this project. |
| build.number | 2009-12-23 | nestefan | [r14] |
| build.properties | 2009-12-23 | nestefan | [r14] |
| build.xml | 2009-07-26 | nestefan | [r12] * cngram: Removed double locked check for null ... |
This is
NGramJ
it is actually two independant sets of java classes:
1. the ngramj part, which is actually an rebuild of the text_cat PERL stuff
(see http://odur.let.rug.nl/~vannoord/TextCat/) in Java. It tries to determine
the encoding and language to a sequence of bytes. In symbols:
ngramj : byte[] --> (Language, Encoding)
2. the cngram this is the newer but right now more mature part. It's basic
function is to determine the language of a sequence of characters.
cngram : char[] --> Language
Note 1: This means given a file, ngramj can be immediately be applied, but
cngram needs additional information about the encoding. On the other hand if
you know the encoding, why let ngramj determine it. So both algorithms have
their applications.
Note 2: The basic principle of both ngram algorithms is statistical not to say
heuristical. Therefore you will not likely achieve 100% results. However given
enough text the methods get very, very close.
NGramJ is Open Source software released under the terms
of the GNU Lesser General Public License. It is hosted
on Sourceforge. Use
http://ngramj.sourceforge.net/
as an entry point.
Enjoy,
Frank
Installation:
1.) Phoner:
java -classpath ngramj.jar de.spieleck.ngram.phoner.Phoner frank.lm a_phone_number
tries to convert a_phone_number into a easier to memorize string. Can be very
slow for long number, depends on the language resource given
(try LM/English.lm instead of frank.lm)
2.) Langage classification:
java -classpath ngramj.jar de.spieleck.ngram.lm.CathegorizerImpl LM a_text_file
:tries to figure the language/encoding of the text file a_text_file
3.) Generate new resource:
java -classpath ngramj.jar de.spieleck.ngram.lm.LMWriter a_text_file a_resource_name.lm
converts the text file a_text_file into the resource a_resource_name, which
can be used for classification task like the resources included.