[ciphertool-devel] more ideas for standards
From: Alex G. <xa...@us...> - 2004-03-03 05:18:12
Hi,

I suggest that we identify some standard sources of language statistics for the purposes of comparing search methods and scoring methods. Since ciphertools already uses Frankenstein for n-grams, I propose that we use it as a standard. Specifically,

http://www.gutenberg.net/etext93/frank14.txt
with MD5: dc18c8d4c9ef449796f85e138f38d6f5

Presumably we will want to create another standard in the future based on more texts.

A few standard word lists would also be useful for comparing methods that act on words as the fundamental units. This page might be a useful start:

http://wordlist.sourceforge.net/

The SCOWL project (Spell Checker Oriented Word Lists) listed there looks particularly interesting.

Any specific suggestions or other thoughts?

- Alex
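For anyone who wants to check that their local copy matches the proposed standard before building statistics from it, here is a rough sketch in Python. It is not how ciphertool computes its n-grams; the function names, the local file name `frank14.txt`, and the choice to lowercase the text and drop everything except a-z (so n-grams run across word boundaries) are all assumptions made for illustration.

```python
import hashlib
from collections import Counter

# MD5 quoted in the message above for frank14.txt.
EXPECTED_MD5 = "dc18c8d4c9ef449796f85e138f38d6f5"


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in binary chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ngram_counts(text, n):
    """Count letter n-grams in text.

    Assumption: the text is lowercased and every character outside
    a-z is discarded, so n-grams span word boundaries.
    """
    letters = "".join(c for c in text.lower() if "a" <= c <= "z")
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))


def load_standard_corpus(path="frank14.txt"):
    """Read the corpus only if its MD5 matches the agreed value."""
    actual = md5_of_file(path)
    if actual != EXPECTED_MD5:
        raise ValueError(f"MD5 mismatch: got {actual}, expected {EXPECTED_MD5}")
    # latin-1 maps every byte to a character, so decoding cannot fail.
    with open(path, "r", encoding="latin-1") as fh:
        return fh.read()
```

With a verified copy in the current directory, `ngram_counts(load_standard_corpus(), 3).most_common(10)` would list the ten most frequent trigrams. Comparing the digest over the raw bytes means that a copy with converted line endings will fail the check, which seems desirable if the point is for everyone to score against an identical file.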