[ciphertool-devel] Re: language statistics
From: Wart <wa...@ko...> - 2004-03-05 00:45:57
Hi Alex,

I'm forwarding to the ciphertool-devel list since this is relevant to
ciphertool development.

On Thu, 2004-03-04 at 16:13, Alex Griffing wrote:
> Hi Mike,
>
> > I had a concern about the characters that should be allowed in the
> > n-grams. Currently any ascii value is allowed for the digrams and
> > general n-grams, but only a-z is allowed for the trigrams. I only
> > allowed a-z for the trigrams to preserve space. If I allow any
> > ascii character for the trigrams, the space requirements go up
> > from 140k (26^3*8) to 134M (256^3*8) for a single trigram table.
> > However, the general n-gram data structure is better suited for
> > sparse matrices, so if you need full ascii support for trigrams,
> > just use a general n-gram with an element size of 3. I don't use
> > the general n-gram data structure for digrams because the digram
> > lookups (simple array lookups) are much, much faster than the
> > general n-gram lookup.
>
> Ahh, sparse general n-grams. Woo hoo! I don't know what data
> structure you're using, but I've tried both a tree structure and a
> hash structure. The tree structure might save a little memory, but
> the hash structure is so much faster that it's not even a contest.
> In fact, I've empirically found that a hash lookup takes about the
> same time as a table lookup. I'm imagining the cpu pipeline
> calculating the next hash while it waits for the current lookup to
> come back from the memory fetch, if that makes sense. If not,
> ignore me :) Also, you can do blazing fast hashes with 4-, 5-, and
> 6-grams by packing [a..z]s into a 32-bit word before hashing.
> (|a..z| == 26 <= 32 == 2^5, so 5 bits per letter; 6 (gram) * 5 bits
> == 30 bits <= 32 == computer word.) /rambling

You're right, a hash table would have been faster. I already use a
tree structure for storing the entire dictionary in memory, so I
figured I would just reuse it for the general n-grams. I've found
that the tree takes up about 1/3 the space of storing each word in
its entirety. Maybe if I get motivated I'll try out a hash table for
the n-grams and compare the memory usage. But since I just got
n-grams working with the tree structure, I'm going to leave it alone
for now.

> > However, the utility procedures that I wrote for populating the
> > scoring tables restrict the contents to a-z. I've always used
> > only a-z in my n-gram tables because I only ever use the n-grams
> > for solving ciphers that don't expose the word divisions. I'll
> > modify it so that you can specify a list of allowed characters
> > for the n-grams, but default it to "a-z" if you don't specify a
> > character list.
>
> That makes sense, since you use wordlists instead for dealing with
> word divisions. I'm fine with just using a-z since it's simpler.
> In fact, it would be even better for scoring comparisons.

I've used n-grams for a few aristocrats, and the ragbaby, and I
haven't noticed that the lack of word divisions in the n-gram tables
has had any effect on the outcome. This is something that would be
interesting to test.
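A minimal sketch of what the allowed-character filtering could look
like when building tables from a plaintext sample, defaulting to a-z.
The names here are hypothetical, not the actual ciphertool interface:

  #include <ctype.h>
  #include <string.h>

  #define DEFAULT_ALLOWED "abcdefghijklmnopqrstuvwxyz"

  /* Walk 'text' and hand every n-gram made entirely of allowed
   * characters to 'emit'; grams containing any other character are
   * simply skipped. */
  static void
  count_ngrams(const char *text, int n, const char *allowed,
               void (*emit)(const char *gram, int n))
  {
      if (allowed == NULL)
          allowed = DEFAULT_ALLOWED;          /* default to a-z */

      size_t len = strlen(text);
      for (size_t i = 0; i + n <= len; i++) {
          int ok = 1;
          for (int j = 0; j < n; j++) {
              if (strchr(allowed,
                         tolower((unsigned char)text[i + j])) == NULL) {
                  ok = 0;
                  break;
              }
          }
          if (ok)
              emit(text + i, n);   /* gram is length n, not NUL-terminated */
      }
  }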
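For reference, the dense a-z trigram table Mike describes above is
just a flat array: 26^3 = 17576 cells of 8 bytes each is about 140k,
while 256^3 cells would be about 134M. A sketch of the lookup (names
hypothetical):

  /* 26^3 = 17576 entries * 8 bytes ~= 140k; 256^3 would be ~134M. */
  static double trigram_freq[26 * 26 * 26];

  /* Characters must already be lowercase a-z. */
  static double
  trigram_score(char c0, char c1, char c2)
  {
      int i = ((c0 - 'a') * 26 + (c1 - 'a')) * 26 + (c2 - 'a');
      return trigram_freq[i];
  }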
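And a sketch of the packing trick Alex mentions: with 26 letters,
5 bits per letter is enough, so up to six letters fit in a 32-bit
word, which can then be fed to any cheap integer hash. The names are
hypothetical; the multiplier is Knuth's multiplicative hash constant:

  #include <stdint.h>

  /* Pack up to six lowercase letters into one 32-bit word:
   * 5 bits per letter * 6 letters = 30 bits <= 32. */
  static uint32_t
  pack_ngram(const char *gram, int n)        /* n <= 6, chars in a-z */
  {
      uint32_t key = 0;
      for (int i = 0; i < n; i++)
          key = (key << 5) | (uint32_t)(gram[i] - 'a');
      return key;
  }

  /* Index into a table of 2^table_bits buckets (table_bits in
   * 1..31), hashing the packed key. */
  static uint32_t
  hash_ngram(const char *gram, int n, int table_bits)
  {
      return (pack_ngram(gram, n) * 2654435761u) >> (32 - table_bits);
  }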
> > I didn't write it so that it would "wrap" the corpus because the
> > total contribution from the wrapped portion is minuscule compared
> > to the contribution of the rest of the text.
>
> Right. Do you take the 'wrapped' ngrams of a potential *decryption*
> (as opposed to corpus) into account when scoring? I've noticed that
> if I forget to do this that I end up with a decryption that either
> starts or ends with a bunch of junk letters, since ngrams would be
> penalized relatively less in those places.

Nope. I hadn't thought about this either. I've never encountered the
extra junk letters at the start/end of a decryption. I suspect this
is because the plaintext at the very start/end is still highly
correlated with the plaintext in the middle of the cipher. I can see
how this would happen with shorter ciphers, though.

I just checked in the last changes for general n-grams. The few
items left to fix before I'm done are:

1) Allow custom sets of allowed characters to be specified when
   building scoring tables from samples of plaintext.
2) Add a "saveData" command for saving an existing scoring table to
   disk so that it can be loaded again at a later time.
3) Modify the existing programs to use the new scoring system
   instead of the older "stat digram" and "stat trigram" methods.
4) Generate new default 2-, 3-, and 4-gram tables from the standard
   text.
5) Add documentation for the new commands and procedures.

These shouldn't take too long to fix, but they also shouldn't keep
you from being able to use the new scoring system if you're going to
be comparing scoring functions and corpuses (corpi?).

--Wart
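P.S. A sketch of the 'wrapped' decryption scoring Alex describes:
treat the candidate plaintext as circular, so the n-grams spanning
the end/start boundary get scored too and the edges are penalized
the same as the middle. ngram_score is a stand-in for whatever
scorer is in use:

  #include <stddef.h>

  extern double ngram_score(const char *gram, int n);

  static double
  score_wrapped(const char *text, size_t len, int n)
  {
      double total = 0.0;
      char gram[8];                            /* assumes n < 8 */

      for (size_t i = 0; i < len; i++) {       /* one gram per position */
          for (int j = 0; j < n; j++)
              gram[j] = text[(i + j) % len];   /* wrap past the end */
          total += ngram_score(gram, n);
      }
      return total;
  }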