NGramJ Code

Brought to you by: nestefan

File	Date	Author	Commit
samples	2009-12-23	nestefan	[r13] * See ChangeLog.txt
src	2009-12-23	nestefan	[r13] * See ChangeLog.txt
ChangeLog.txt	2009-12-23	nestefan	[r13] * See ChangeLog.txt
LICENSE.txt	2006-03-27	nestefan	[r2] Added keywords for this project.
README-license.txt	2006-03-29	nestefan	[r4] File was still refering to the Jacson project.
README.txt	2006-03-27	nestefan	[r2] Added keywords for this project.
build.number	2009-12-23	nestefan	[r14]
build.properties	2009-12-23	nestefan	[r14]
build.xml	2009-07-26	nestefan	[r12] * cngram: Removed double locked check for null ...

Read Me

This is 

        NGramJ

It actually consists of two independent sets of Java classes:

1. The ngramj part, which is a rebuild of the text_cat Perl code
(see http://odur.let.rug.nl/~vannoord/TextCat/) in Java. It tries to determine
the encoding and language of a sequence of bytes. In symbols:

    ngramj : byte[]  -->  (Language, Encoding)

2. The cngram part, which is the newer but right now the more mature one. Its
basic function is to determine the language of a sequence of characters.

    cngram : char[]  --> Language

Note 1: This means that, given a file, ngramj can be applied immediately,
whereas cngram needs additional information about the encoding. On the other
hand, if you already know the encoding, there is no reason to let ngramj
determine it. So both algorithms have their applications.
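
To illustrate Note 1, here is a minimal sketch of the extra step cngram-style
input needs: turning the bytes of a file into characters with an encoding you
already know. It uses only the standard java.nio.charset classes. The class
and method names (DecodeSketch, decode) are made up for this example and are
not part of NGramJ.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of NGramJ: shows the byte[] --> char[] step
// that has to happen before a character-based classifier can run.
public class DecodeSketch {

    /** Decodes raw bytes into chars using an encoding the caller already knows. */
    public static char[] decode(byte[] raw, Charset knownEncoding) {
        return new String(raw, knownEncoding).toCharArray();
    }

    public static void main(String[] args) {
        // Sample text with non-ASCII letters, written with Unicode escapes so
        // the source file itself stays plain ASCII.
        String sample = "Gr\u00fc\u00dfe aus K\u00f6ln";
        byte[] raw = sample.getBytes(StandardCharsets.ISO_8859_1);

        char[] chars = decode(raw, StandardCharsets.ISO_8859_1);
        System.out.println(raw.length + " bytes decoded to " + chars.length + " chars");
    }
}
```

If the encoding passed to decode is wrong, the resulting characters are
garbage and any language guess based on them will suffer, which is exactly
why a byte-based approach like ngramj is useful when the encoding is unknown.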

Note 2: The basic principle of both ngram algorithms is statistical, not to say
heuristic. Therefore you are unlikely to achieve 100% correct results. However,
given enough text, the methods get very, very close.
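
To give an idea of what "statistical" means in Note 2, here is a small sketch
of the general n-gram idea in the style of text_cat: build a ranked list of the
most frequent character n-grams of a text, then compare it to the ranked list
of each language by summing up how far each n-gram is out of place. This is an
illustration under simplifying assumptions, not the actual NGramJ code; all
names (NGramProfileSketch, profile, distance) and the tiny sample texts are
invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only; not the NGramJ implementation.
public class NGramProfileSketch {

    /** Returns up to 'limit' n-grams (length 1..maxN), most frequent first. */
    public static List<String> profile(String text, int maxN, int limit) {
        // Lower-case the text and collapse every run of non-letters into '_'.
        String padded = "_" + text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}]+", "_") + "_";
        Map<String, Integer> counts = new HashMap<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= padded.length(); i++) {
                counts.merge(padded.substring(i, i + n), 1, Integer::sum);
            }
        }
        List<String> grams = new ArrayList<>(counts.keySet());
        // Sort by descending count; break ties alphabetically for stable output.
        grams.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(grams.subList(0, Math.min(limit, grams.size())));
    }

    /** Sums rank differences; an n-gram missing from 'lang' costs lang.size(). */
    public static int distance(List<String> doc, List<String> lang) {
        int total = 0;
        for (int i = 0; i < doc.size(); i++) {
            int j = lang.indexOf(doc.get(i));
            total += (j < 0) ? lang.size() : Math.abs(i - j);
        }
        return total;
    }

    public static void main(String[] args) {
        // Toy "language resources"; real ones are built from much more text.
        List<String> english = profile("the quick brown fox jumps over the lazy dog and the cat", 3, 300);
        List<String> german = profile("der schnelle braune fuchs springt und der hund schlaeft", 3, 300);
        List<String> unknown = profile("the dog and the fox", 3, 300);

        System.out.println("distance to english: " + distance(unknown, english));
        System.out.println("distance to german:  " + distance(unknown, german));
    }
}
```

The language whose profile has the smallest distance is the guess. With so
little text the ranks are noisy, which is why more input gives better results.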


NGramJ is Open Source software released under the terms 
of the GNU Lesser General Public License. It is hosted
on Sourceforge. Use

    http://ngramj.sourceforge.net/

as an entry point.

Enjoy,
Frank

Usage:

1.) Phoner:
    java -classpath ngramj.jar de.spieleck.ngram.phoner.Phoner frank.lm a_phone_number

tries to convert a_phone_number into an easier-to-memorize string. This can be
very slow for long numbers, depending on the language resource given
(try LM/English.lm instead of frank.lm).

2.) Language classification:
    java -classpath ngramj.jar de.spieleck.ngram.lm.CathegorizerImpl LM a_text_file

tries to figure out the language/encoding of the text file a_text_file.

3.) Generate new resource:
    java -classpath ngramj.jar de.spieleck.ngram.lm.LMWriter a_text_file a_resource_name.lm

converts the text file a_text_file into the resource a_resource_name.lm, which
can be used for classification tasks just like the included resources.
