Re: [ciphertool-devel] more ideas for standards
From: Wart <wa...@ko...> - 2004-03-03 07:07:47
On Tue, 2004-03-02 at 21:04, Alex Griffing wrote:
> Hi,
>
> I suggest that we identify some standard sources of language statistics
> for the purposes of comparing search methods and scoring methods. Since
> ciphertools already uses Frankenstein for n-grams, I propose that we use
> it as a standard. Specifically,
> http://www.gutenberg.net/etext93/frank14.txt
> with MD5:
> dc18c8d4c9ef449796f85e138f38d6f5

This sounds fine to me. The current n-gram tables in ciphertool aren't
based on the exact text in this file, but it's pretty close. Chapter
headings and the legal notice were removed before generating the
n-grams. I think we should continue to use the above text, without the
legalese at the top of the file, and distribute the text as an
additional add-on package to future releases of ciphertool so that
others can reproduce the n-grams. Luckily, the text is in the public
domain, so we can redistribute it freely.

> Presumably we will want to create another standard in the future based
> on more texts.

I agree. This is a good start though. My experience has been that the
2,3-gram frequencies from this text are good enough to solve most
classical ciphers that aren't maliciously created.

> A few standard word lists would also be useful for comparing methods
> that act on words as the fundamental units.
> This page might be a useful start:
> http://wordlist.sourceforge.net/
> The SCOWL project (Spell Checker Oriented Word Lists) listed there looks
> particularly interesting.
>
> Any specific suggestions or other thoughts?

The word list that I've been using for my personal solving is version 15
of the UKACD:
http://www.ori.org/~kenl/projects/wordlist/UKACD-readme.htm

I've made some slight modifications, adding a few new words and removing
a few uncommon others. The nice thing about this word list is that it
includes plurals and all of the various verb conjugations. I've
rearranged it so that it spans multiple files, each sorted by word
length.
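
For anyone who wants to try a similar arrangement, here is a rough sketch of grouping a word list by word length. It is not the script that produced the files described above. The function name, the lowercasing, and the removal of duplicates are all assumptions made for illustration.

```python
from collections import defaultdict


def bucket_by_length(words):
    """Group words by length (hypothetical helper, not part of ciphertool).

    Assumptions: entries are lowercased, surrounding whitespace is
    stripped, blank entries are skipped, and duplicates are dropped.
    """
    buckets = defaultdict(set)
    for raw in words:
        word = raw.strip().lower()
        if word:
            buckets[len(word)].add(word)
    # One alphabetically sorted list per word length, shortest length first.
    return {length: sorted(group) for length, group in sorted(buckets.items())}
```

Each value in the returned dictionary could then be written to its own file, one per word length. A solver that already knows the length of a ciphertext word would only need to read the matching file.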
This word list is also redistributable with the obligatory copyright
acknowledgement. This should be included as part of the new add-on data
package with the Frankenstein text above.

If there are other word lists that look interesting, we should simply
merge them with the existing word list to form one large complete list.
Or do you think there is a need to keep separate word lists for
different uses?

Foreign-language word lists would be very useful in the future. I've got
a few already for Czech, Danish, Esperanto, French, German, Interlingua,
Italian, Latin, Norwegian, Spanish, and Swedish. They're not as complete
as the UKACD English word list, but still very useful. I have to check
the licensing on these word lists before we start redistributing them.

--Mike
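
As a footnote to the n-gram discussion earlier in this message, the following is a minimal sketch of how someone might check a downloaded copy of the text against a published MD5 and then count n-grams. It is not ciphertool's actual table generator. The helper names are made up, and counting letters only (so that n-grams run across word boundaries) is an assumption about how the tables might be built. Removing the legal notice and chapter headings is left to the caller.

```python
import hashlib
from collections import Counter


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a lowercase hex string."""
    return hashlib.md5(data).hexdigest()


def ngram_counts(text: str, n: int) -> Counter:
    """Count letter n-grams in text (hypothetical helper).

    Assumption: only the letters a-z are kept, case is folded, and
    everything else is dropped, so n-grams span word boundaries.
    """
    letters = [c for c in text.lower() if "a" <= c <= "z"]
    return Counter(
        "".join(letters[i:i + n]) for i in range(len(letters) - n + 1)
    )
```

To use it, read the downloaded file in binary mode, compare md5_hex() of its contents with the checksum quoted in the thread, then decode the bytes and pass the cleaned-up text to ngram_counts() with n set to 2 or 3.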