NGramJ Code
Brought to you by:
nestefan
File | Date | Author | Commit |
---|---|---|---|
samples | 2009-12-23 | nestefan | [r13] * See ChangeLog.txt |
src | 2009-12-23 | nestefan | [r13] * See ChangeLog.txt |
ChangeLog.txt | 2009-12-23 | nestefan | [r13] * See ChangeLog.txt |
LICENSE.txt | 2006-03-27 | nestefan | [r2] Added keywords for this project. |
README-license.txt | 2006-03-29 | nestefan | [r4] File was still refering to the Jacson project. |
README.txt | 2006-03-27 | nestefan | [r2] Added keywords for this project. |
build.number | 2009-12-23 | nestefan | [r14] |
build.properties | 2009-12-23 | nestefan | [r14] |
build.xml | 2009-07-26 | nestefan | [r12] * cngram: Removed double locked check for null ... |
This is NGramJ. It is actually two independent sets of Java classes:

1. The ngramj part, which is a rebuild of the text_cat Perl tool
   (see http://odur.let.rug.nl/~vannoord/TextCat/) in Java.
   It tries to determine the encoding and language of a sequence of bytes.
   In symbols:

       ngramj : byte[] --> (Language, Encoding)

2. The cngram part, which is the newer but currently more mature part.
   Its basic function is to determine the language of a sequence of
   characters. In symbols:

       cngram : char[] --> Language

Note 1: This means that, given a file, ngramj can be applied immediately,
but cngram needs additional information about the encoding. On the other
hand, if you already know the encoding, there is no reason to let ngramj
determine it. So both algorithms have their applications.

Note 2: The basic principle of both ngram algorithms is statistical, not
to say heuristic. Therefore you are unlikely to achieve 100% correct
results. However, given enough text, the methods get very, very close.

NGramJ is Open Source software released under the terms of the
GNU Lesser General Public License. It is hosted on Sourceforge.
Use http://ngramj.sourceforge.net/ as an entry point.

Enjoy,
Frank

Installation:

1.) Phoner:

    java -classpath ngramj.jar de.spieleck.ngram.phoner.Phoner frank.lm a_phone_number

    Tries to convert a_phone_number into an easier to memorize string.
    Can be very slow for long numbers; the speed depends on the language
    resource given (try LM/English.lm instead of frank.lm).

2.) Language classification:

    java -classpath ngramj.jar de.spieleck.ngram.lm.CathegorizerImpl LM a_text_file

    Tries to figure out the language/encoding of the text file a_text_file.

3.) Generate new resource:

    java -classpath ngramj.jar de.spieleck.ngram.lm.LMWriter a_text_file a_resource_name.lm

    Converts the text file a_text_file into the resource a_resource_name.lm,
    which can be used for classification tasks like the resources included.
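
Illustration of the statistical principle mentioned in Note 2: the sketch below is NOT NGramJ code and does not use NGramJ's classes or its .lm resource format. It is a minimal, hypothetical example of ranked character n-gram profiles compared with an "out-of-place" rank distance, the general approach usually associated with text_cat. The class name NGramProfileSketch, its methods, and the parameters maxN and limit are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of n-gram language guessing; not part of NGramJ. */
final class NGramProfileSketch {

    /** Returns the character n-grams of length 1..maxN, most frequent first, cut to 'limit' entries. */
    static List<String> profile(CharSequence text, int maxN, int limit) {
        String s = text.toString().toLowerCase(Locale.ROOT);
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= s.length(); i++) {
                counts.merge(s.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        // Sort by descending count; break ties alphabetically so the result is deterministic.
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return ranked.size() > limit ? new ArrayList<>(ranked.subList(0, limit)) : ranked;
    }

    /** Out-of-place distance: sum of rank differences; n-grams missing from 'lang' cost lang.size(). */
    static int distance(List<String> doc, List<String> lang) {
        Map<String, Integer> rank = new HashMap<>();
        for (int i = 0; i < lang.size(); i++) {
            rank.put(lang.get(i), i);
        }
        int total = 0;
        for (int i = 0; i < doc.size(); i++) {
            Integer r = rank.get(doc.get(i));
            total += (r == null) ? lang.size() : Math.abs(r - i);
        }
        return total;
    }

    /** Returns the name of the profile with the smallest distance to the text, or null if there are none. */
    static String classify(CharSequence text, Map<String, List<String>> profiles, int maxN, int limit) {
        List<String> doc = profile(text, maxN, limit);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<String, List<String>> e : profiles.entrySet()) {
            int d = distance(doc, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy training texts; real profiles need far more text to be reliable.
        Map<String, List<String>> profiles = new HashMap<>();
        profiles.put("en", profile("the quick brown fox jumps over the lazy dog and then the cat", 3, 300));
        profiles.put("de", profile("der schnelle braune fuchs springt ueber den faulen hund und die katze", 3, 300));
        System.out.println(classify("the dog and the fox", profiles, 3, 300));
    }
}
```

The sketch takes characters as input, so it corresponds to the cngram : char[] --> Language signature. To feed it a file, the bytes must first be decoded with a known encoding, which is the extra information Note 1 refers to.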