[Algorithms] Algorithm for determining 'word difficulty'

I have an interesting little project I'm working on and I thought I would
solicit the list to see if anyone else has some ideas.

I'm creating an educational word game that focuses on spelling and
vocabulary; it is designed to run on mobile devices (iPad, iPhone, Droid,
etc.).  This is just a fun little side project I'm doing so my son can learn
more hands-on programming.  My daughter is doing the artwork, so we are
making it a little family project.

I first wrote this game for an Apple II in 1983 so it's kind of fun to be
making a new version for today's devices.  Back then, I didn't have enough
memory to store a really large word list.  Today I have the ability to store
the entire English dictionary: not just the words, but also every
component associated with each word (synonyms, etymology, definitions, etc.).

The algorithm I am looking for is how to automatically come up with a
'difficulty' metric for each word in the English language.

My thoughts are that I could consider the following:

(1) Length of the word, though to be honest very short words can be
difficult too if they are obscure.
(2) Number of definitions.
(3) Field of study of the word (biology, physics, etc.) The open source
English dictionary I have access to provides this data.
(4) Whether the word is a verb, noun, etc.
(5) Cross reference each word against a thesaurus and consider the
difficulty/obscurity based on how many synonyms and antonyms there are
total.
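
A minimal sketch of how these five factors might be combined into a single
number.  The WordEntry record, its field names, the weights, and the
direction of each factor (for example, that fewer definitions means more
obscure) are all assumptions made for illustration; they do not come from
any particular dictionary format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WordEntry:
    # Hypothetical record; a real dictionary will use its own layout.
    word: str
    num_definitions: int
    field_of_study: Optional[str]  # e.g. "biology"; None for general words
    part_of_speech: str            # e.g. "noun", "verb", "adjective"
    num_synonyms: int
    num_antonyms: int

def heuristic_difficulty(e: WordEntry) -> float:
    """Higher means harder. All weights are guesses, meant to be tuned."""
    score = 0.0
    # (1) Length, capped so very long words do not dominate.
    score += min(len(e.word), 15) / 15.0
    # (2) Assumption: words with few definitions are more obscure.
    score += 1.0 / (1 + e.num_definitions)
    # (3) Assumption: a specialist field tag makes a word harder.
    score += 0.5 if e.field_of_study else 0.0
    # (4) Assumption: nouns and verbs are slightly easier than the rest.
    score += 0.0 if e.part_of_speech in ("noun", "verb") else 0.25
    # (5) Assumption: few synonyms/antonyms suggests an obscure word.
    score += 1.0 / (1 + e.num_synonyms + e.num_antonyms)
    return score

cat = WordEntry("cat", 5, None, "noun", 10, 0)
mito = WordEntry("mitochondrion", 1, "biology", "noun", 0, 0)
print(heuristic_difficulty(cat) < heuristic_difficulty(mito))
```

Keeping each factor as a separate term makes it easy to drop or re-weight
one after trying the scores out on real players.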

One thing that would help immensely is access to a word list of the
'most common' words in the English language.  Hopefully I can find such a
list, since it would give me an excellent first guess at whether or not a
word is obscure.
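
A minimal sketch of how such a list might be turned into an obscurity
score, assuming the list is ordered from most to least common.  The
function name and the log scaling are assumptions made for illustration;
words missing from the list are simply treated as maximally obscure.

```python
import math

def build_obscurity(common_words):
    """Return a function mapping a word to an obscurity value in [0, 1].

    common_words is assumed to be ordered from most to least common.
    """
    n = len(common_words)
    ranks = {}
    for i, w in enumerate(common_words):
        # Keep the first (most common) position if a word is repeated.
        ranks.setdefault(w.lower(), i + 1)

    def obscurity(word):
        rank = ranks.get(word.lower())
        if rank is None:
            return 1.0  # assumption: unlisted words are maximally obscure
        # Log scale, so the gap between rank 10 and rank 100 counts about
        # as much as the gap between rank 1,000 and rank 10,000.
        return math.log(rank) / math.log(n + 1)

    return obscurity

# Tiny stand-in for a real frequency list.
obscurity = build_obscurity(["the", "of", "and", "to"])
print(obscurity("the"), obscurity("to"), obscurity("zyzzyva"))
```

The log scale is only one choice; a plain rank divided by list size would
also work but would squeeze nearly all everyday words toward zero.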

When you play the game you get to choose the difficulty level you want to
play at, and that level really could have two metrics: difficulty to spell,
or difficulty in terms of knowing or recognizing the word.  (The game itself
more or less works like Wheel of Fortune or Hangman; you are just trying to
guess a single word rather than a phrase.)

Any thoughts on an algorithm which could more or less automatically score
the entire English language by 'difficulty to spell' and 'difficulty to
recognize', assuming you have as input all of the data in a standard
dictionary and thesaurus?
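
To make the question concrete, here is a rough sketch of a 'difficulty to
spell' score kept separate from any recognition score, plus a helper that
splits scored words into evenly sized game levels.  The letter patterns,
the weights, and both function names are assumptions chosen purely for
illustration, not a tested model of spelling difficulty.

```python
import re

# Illustrative letter groups that are often misspelled; far from complete.
TRICKY_PATTERNS = ("ough", "ei", "ie", "ph", "rh", "kn", "gn")

def spelling_difficulty(word):
    """Higher means harder to spell. The weights are guesses."""
    w = word.lower()
    length_part = min(len(w), 15) / 15.0
    # Count doubled letters such as the "cc" and "mm" in "accommodate".
    doubled = len(re.findall(r"(.)\1", w))
    tricky = sum(w.count(p) for p in TRICKY_PATTERNS)
    return length_part + 0.3 * doubled + 0.3 * tricky

def assign_levels(scores, num_levels=3):
    """Map each word to a level from 1 (easiest) to num_levels (hardest).

    scores is a dict of word -> difficulty; levels are equal-sized buckets
    of the words sorted by score, so it works for either metric.
    """
    ordered = sorted(scores, key=scores.get)
    n = len(ordered)
    return {w: i * num_levels // n + 1 for i, w in enumerate(ordered)}

words = ["cat", "rhythm", "accommodate"]
levels = assign_levels({w: spelling_difficulty(w) for w in words})
print(levels)
```

Because assign_levels only looks at the ordering of the scores, the same
helper could bucket a 'difficulty to recognize' score without changes.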

Thanks,

John