[ciphertool-devel] New scoring methods
From: Wart <wa...@ko...> - 2004-03-08 04:33:15
The CVS repository now contains the new pluggable scoring system.
Here's a short summary of the new commands:
score create <type>
Creates a new scoring object. The type must be one of "digramlog",
"digramcount", "trigramlog", "trigramcount", "ngramlog", "ngramcount",
or "wordtree". The return value is the name of a new Tcl command that
will be used to store the scoring data.
$scoreObj add <element> <value>
Add values to the new scoring table using the new scoring command.
$scoreObj normalize
Normalize the data in the scoring table. This is not a normalization in
the mathematical sense; it is whatever whole-table operation the scoring
type defines. For example, the "normalize" subcommand for the digramlog
scoring table takes the natural logarithm of each element in the table,
while the "normalize" subcommand for the digram, trigram, and n-gram
count scoring tables does nothing.
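To make the "normalize" step concrete, here is a minimal plain-Tcl sketch of the behaviour described above. It does not use the real scoring objects: the table is modelled as an ordinary dict (Tcl 8.5 or later), and the proc names normalizeLog and normalizeCount are hypothetical, chosen only for illustration.

```tcl
# Hypothetical stand-in for a scoring table: a dict of element -> raw count.
# The real score objects live inside ciphertool; this is plain Tcl only.
proc normalizeLog {table} {
    set result [dict create]
    dict for {element count} $table {
        # Natural logarithm of each entry, as described for "digramlog".
        dict set result $element [expr {log($count)}]
    }
    return $result
}

# The count-style tables are described as unchanged by "normalize".
proc normalizeCount {table} {
    return $table
}

set raw [dict create er 100 ab 20]
set logTable [normalizeLog $raw]
puts [format "er -> %.4f" [dict get $logTable er]]
puts [format "ab -> %.4f" [dict get $logTable ab]]
```

Because the log tables store logarithms, adding table values together corresponds to multiplying the underlying counts.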
$scoreObj value <element>
Get a value from the scoring table.
Sample usage:
set scoreObj [score create digramlog]
$scoreObj add er 100
$scoreObj add ab 20
$scoreObj normalize
set plaintextValue [$scoreObj value "mydoghasfleas"]
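The sample passes a whole plaintext string to "value". One plausible reading, and it is only an assumption about the implementation, is that the score is the sum of the table entries for every overlapping n-gram in the string. The plain-Tcl sketch below illustrates that idea; scoreText is a hypothetical helper, and treating n-grams that are missing from the table as 0 is also an assumption.

```tcl
# Hypothetical helper: sum the table value of every overlapping n-gram.
# N-grams missing from the table contribute 0 here; the real tool may differ.
proc scoreText {table text n} {
    set total 0.0
    set last [expr {[string length $text] - $n}]
    for {set i 0} {$i <= $last} {incr i} {
        set gram [string range $text $i [expr {$i + $n - 1}]]
        if {[dict exists $table $gram]} {
            set total [expr {$total + [dict get $table $gram]}]
        }
    }
    return $total
}

# Same two entries as the sample above, already log-normalized.
set table [dict create er [expr {log(100)}] ab [expr {log(20)}]]
puts [scoreText $table "mydoghasfleas" 2]
puts [scoreText $table "aberdeen" 2]
```

With only "er" and "ab" in the table, "mydoghasfleas" contains neither digram, so a realistic table needs to be generated from a large text sample before the scores mean anything.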
There are also three new Tcl procedures for loading and saving scoring
tables:
Scoredata::generate <scoreObj> <filename>
Generate scoring data from a text file.
Scoredata::saveData <scoreObj> <filename>
Save a scoring table to a file so that it can be loaded at a later
time.
Scoredata::loadData <scoreObj> [filename]
Load a scoring table that was previously saved. If no filename is
given, an attempt is made to locate an appropriate default scoring
table.
One new program was also added to generate scoring tables from a text
file:
genscores -type <scoretype> -output <outfile> [-verbose] [-elemsize n]
[-nonormalize] [-validchars ...] <textfile>
-type must be one of the known scoring types above.
-output is the name of the file where the table will be saved.
-elemsize must be used with the ngramlog and ngramcount types to
indicate the size of the ngrams.
-nonormalize skips the step of normalizing the data. By default all
tables are normalized before saving.
-validchars is the set of characters that are allowed in the scoring
table. By default this is a-z, but it can be any set of ASCII
characters.
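As a rough illustration of what the -elemsize and -validchars options control, here is a plain-Tcl sketch of counting n-grams in a piece of text. It is not the genscores implementation. The proc name countNgrams is hypothetical, and lowercasing the input and silently dropping characters outside the valid set are assumptions made for this example.

```tcl
# Hypothetical sketch: count n-grams of size $n in $text, keeping only
# characters listed in $validChars (all other characters are dropped).
proc countNgrams {text n {validChars "abcdefghijklmnopqrstuvwxyz"}} {
    set clean ""
    foreach ch [split [string tolower $text] ""] {
        if {[string first $ch $validChars] >= 0} {
            append clean $ch
        }
    }
    set counts [dict create]
    set last [expr {[string length $clean] - $n}]
    for {set i 0} {$i <= $last} {incr i} {
        # dict incr creates the key with value 1 if it does not exist yet.
        dict incr counts [string range $clean $i [expr {$i + $n - 1}]]
    }
    return $counts
}

set counts [countNgrams "The theme, then." 3]
puts "the: [dict get $counts the]"
puts "eth: [dict get $counts eth]"
```

Passing a different third argument, for example a string that includes digits, widens the set of characters that can appear in the table.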
The default scoring tables were generated from the reference data with
the following commands:
genscores -type digramlog -output digramlogData.tcl frank14.txt
genscores -type digramcount -output digramcountData.tcl frank14.txt
genscores -type trigramlog -output trigramlogData.tcl frank14.txt
genscores -type trigramcount -output trigramcountData.tcl frank14.txt
I'll add HTML documentation over the next few days, including tips on
writing your own scoring routine.
--Wart