Thread: [Algorithms] Algorithm for determining 'word difficulty'
From: John R. <jra...@gm...> - 2010-06-18 17:18:21
|
I have an interesting little project I'm working on, and I thought I would solicit the list to see if anyone else has some ideas.

I'm creating an educational word game that focuses on spelling and vocabulary; it is designed to run on mobile devices (iPad, iPhone, Droid, etc.). This is just a fun little side project I'm doing so my son can learn more hands-on programming. My daughter is doing the artwork, so we are making it a little family project.

I first wrote this game for an Apple II in 1983, so it's kind of fun to be making a new version for today's devices. Back then, I didn't have enough memory to store a really large word list. Today I have the ability to store the entire English dictionary; and not just the words, but also every component associated with each word (synonyms, etymology, definitions, etc.).

The algorithm I am looking for is how to automatically come up with a 'difficulty' metric for each word in the English language. My thoughts are that I could consider the following:

(1) Length of the word, though to be honest very short words can be difficult too if they are obscure.
(2) Number of definitions.
(3) Field of study of the word (biology, physics, etc.); the open source English dictionary I have access to provides this data.
(4) Whether the word is a verb, noun, etc.
(5) Cross-reference each word against a thesaurus and consider the difficulty/obscurity based on how many synonyms and antonyms there are in total.

One thing that would help immensely is if I had access to a word list of the 'most common' words in the English language. Hopefully I can find such a list; it would provide me an excellent first guess at whether or not a word is obscure.

When you play the game you get to choose the difficulty level you want to play at, so there really could be two metrics: difficulty to spell, and difficulty in terms of knowing/recognizing the word.
(The game itself works more or less like Wheel of Fortune or hangman; you are just trying to guess a single word rather than a phrase.) Any thoughts on an algorithm which could more or less automatically score the entire English language by 'difficulty to spell' and 'difficulty to recognize', assuming you have as input all of the data in a standard dictionary and thesaurus? Thanks, John |
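The factor list above suggests a simple weighted score as a starting point. A minimal sketch, assuming per-word feature data pulled from the dictionary and thesaurus; every weight and feature name here is invented for illustration and would need tuning against a real word list:

```python
# Hypothetical composite difficulty score combining the factors (1)-(5).
# All weights are placeholders; tune them against known-graded words.

def difficulty_score(word, num_definitions, num_synonyms,
                     is_technical=False, corpus_frequency=0.0):
    """Return a rough 0..1 difficulty estimate for `word`.

    corpus_frequency: relative frequency in a large English corpus
    (0.0 means the word never appears, i.e. maximally obscure).
    """
    score = 0.0
    # Longer words tend to be harder to spell, but cap the effect.
    score += min(len(word), 12) / 12 * 0.25
    # Few definitions / few synonyms suggest an obscure word.
    score += 0.20 / (1 + num_definitions)
    score += 0.20 / (1 + num_synonyms)
    # Technical jargon (biology, physics, ...) is usually harder.
    if is_technical:
        score += 0.15
    # Rare words are harder to recognize; frequency dominates here.
    score += 0.20 * (1.0 - min(corpus_frequency * 1e4, 1.0))
    return score
```

A common everyday word should land near 0 and an obscure technical term near 1; the 0.25/0.20/0.15 split between factors is pure guesswork until checked against graded word lists.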
From: Jeff R. <je...@8m...> - 2010-06-18 17:27:12
|
My thought was the same as yours as far as the frequency of use entering into it. If you can't find any good data on that, it might not be hard to generate it if you have a good deal of text on hand. Run a number of novels or something through some simple app that tracks word frequencies, and you might have the start of a database at least.

Jeff

--
Jeff Russell
Engineer, 8monkey Labs
www.8monkeylabs.com

_______________________________________________
GDAlgorithms-list mailing list
GDA...@li...
https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list
Archives: http://sourceforge.net/mailarchive/forum.php?forum_name=gdalgorithms-list |
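Jeff's frequency-tracking app can be a few lines. A sketch, assuming plain-text input such as public-domain novels:

```python
# Accumulate per-word counts across any amount of raw text.
import re
from collections import Counter

def count_words(text, counts=None):
    """Lower-case, tokenize on letters/apostrophes, and tally."""
    counts = counts if counts is not None else Counter()
    counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

# Call repeatedly, once per book, passing the same Counter back in.
counts = count_words("The cat sat on the mat. The cat purred.")
```

With a pile of Project Gutenberg texts fed through this, `counts.most_common()` gives exactly the 'most common words' list John was hoping to find.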
From: John R. <jra...@gm...> - 2010-06-18 17:37:01
|
Great idea, Jeff; I didn't think about that. I could download a number of open source books at various reading levels, and then build a database against each. That actually sounds like the perfect solution! Project Gutenberg, here I come with my big-ass Perl script... |
From: Conor S. <bor...@ya...> - 2010-06-18 17:44:18
|
For difficulty, I would say put in a load of very "common" language texts, then some more technical and advanced texts (even some philology), and then use a differential histogram between the two to pick out the more difficult words. Language in the brain is a contextual hierarchy: the brain pattern-matches based on the right context. As such, more difficult words are generally those with a more difficult context.

Cheers,
Conor |
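Conor's differential histogram could be realized as a smoothed log-ratio of a word's relative frequency in a technical corpus versus an everyday one. The tiny corpora below are placeholders for real frequency tables:

```python
# Words far more common in technical text than everyday text score as
# "difficult". Add-one smoothing avoids division by zero for unseen words.
import math
from collections import Counter

def difficulty_by_contrast(word, common_counts, technical_counts):
    """Log-ratio of technical vs. everyday relative frequency.
    Positive -> skews technical/obscure; negative -> everyday word."""
    common_total = sum(common_counts.values())
    tech_total = sum(technical_counts.values())
    p_common = (common_counts[word] + 1) / (common_total + 1)
    p_tech = (technical_counts[word] + 1) / (tech_total + 1)
    return math.log(p_tech / p_common)

# Placeholder corpora; in practice these would be counts over many books.
common = Counter({"the": 100, "cat": 10, "run": 8})
technical = Counter({"the": 90, "mitosis": 12, "enzyme": 7})
```

Function words like "the" land near zero, which is the useful property: the ratio cancels out sheer frequency and isolates how skewed a word is toward difficult contexts.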
From: Jonathan S. <jon...@gm...> - 2010-06-18 17:34:03
|
Hello,

> Any thoughts on an algorithm which could more or less automatically
> score the entire English language by 'difficultly to spell' and
> 'difficulty to recognize'? Assuming you have as input all of the
> data in a standard dictionary and thesaurus?

Maybe you could use Scrabble's scoring to rank words.

Jonathan |
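For reference, a word's Scrabble tile score is trivial to compute from the standard English letter values:

```python
# Standard English Scrabble letter values (blanks ignored).
SCRABBLE_POINTS = {
    **dict.fromkeys("aeilnorstu", 1),
    **dict.fromkeys("dg", 2),
    **dict.fromkeys("bcmp", 3),
    **dict.fromkeys("fhvwy", 4),
    "k": 5,
    **dict.fromkeys("jx", 8),
    **dict.fromkeys("qz", 10),
}

def scrabble_score(word):
    """Sum of tile values; a cheap proxy for rare-letter content."""
    return sum(SCRABBLE_POINTS[c] for c in word.lower())
```

As John notes in his reply, this only measures letter rarity, not obscurity, so it works better as one input to a combined score than as the score itself.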
From: John R. <jra...@gm...> - 2010-06-18 17:40:35
|
Yes, I'm actually using Scrabble's scoring for the game itself; however, that only gives you an indication of letter frequency. The more I think about it, the more I believe the perfect solution is to generate a database of word frequencies against a number of books with a known 'reading level'. I'm pretty sure that is the cleanest solution to the problem.

John |
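John's graded-corpus plan might be sketched like this: label each word with the lowest reading level whose corpus uses it often enough. The corpora and threshold below are placeholders:

```python
# Given per-grade frequency tables built from books with known reading
# levels, assign each word the lowest grade where it appears regularly.
from collections import Counter

def grade_level(word, counts_by_level, min_count=3):
    """counts_by_level: {grade: Counter} built from graded texts.
    Returns the lowest grade whose corpus contains the word at least
    min_count times, or None if the word is rarer than that everywhere."""
    for grade in sorted(counts_by_level):
        if counts_by_level[grade][word] >= min_count:
            return grade
    return None

# Placeholder tables; real ones would come from counting graded books.
levels = {
    1: Counter({"cat": 50, "the": 200}),
    5: Counter({"cat": 20, "molecule": 4}),
}
```

Words that never clear the threshold at any level (the `None` case) are a natural candidate pool for the hardest difficulty setting.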
From: jhorton <jh...@ro...> - 2010-06-18 17:54:19
|
My daughter has brought home dozens of books at various difficulty levels from school. I'm sure you could search for things like "grade 1 word lists" and so on. Mostly Scholastic books, it seems. |
From: Samuel M. <sam...@go...> - 2010-06-18 18:12:32
|
I think a really good metric for difficulty would be how often other people spell a particular word wrong. So you could take a lot of text that *includes misspelled words* and count the odds. Public forums or other websites would be perfect. You can even pick your target group ;)

The difficulty here is to find out which word was actually meant by a misspelled one. But on the other hand, just take any reasonable metric for the distance between words (there should be literature on that regarding automatic spell correction) and assume the closest one (or the N closest words that are not too far away) was meant.

Maybe you could also use Google: generate a few variations on each word and count the number of Google results ;)

Greets,
Samuel |
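Matching a misspelling back to the intended word needs the string-distance metric Samuel mentions; Levenshtein edit distance is the classic starting point. A sketch, not tuned for production spell correction:

```python
# Classic two-row dynamic-programming Levenshtein edit distance.
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def closest_word(misspelling, vocabulary):
    """Pick the dictionary word nearest to the misspelling."""
    return min(vocabulary, key=lambda w: levenshtein(misspelling, w))
```

For a full dictionary this brute-force `min` is slow; limiting candidates to words of similar length, or using a BK-tree, is the usual refinement.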
From: Matthew H. <ma...@ev...> - 2010-06-18 18:39:48
|
As nice as "automatic" would be, I think if you are targeting very young kids you'll also want to add a "human-tuned" metric to the weighting system. As you've no doubt seen, each school grade level has a list of "sight words" that kids are supposed to know, as well as other common or high-frequency words that they are supposed to be picking up. I'd guess this information is available somewhere or somehow, even if it means asking some teachers for help. (For that matter, it's possible there are state-mandated lists of words that kids must know at different grade levels.)

I'd guess that the younger the age target, the more "human-tuned" the word selection is going to need to be. As they get older, you can probably rely more on purely statistical metrics. |
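One way to combine Matthew's human-tuned lists with an automatic score: trust a curated grade when one exists and fall back to the statistical estimate otherwise. All data below is hypothetical:

```python
# Curated sight-word grades override the statistical score; the automatic
# estimate only fills in words no teacher list covers.
def blended_difficulty(word, curated_grades, auto_score, max_grade=12):
    """curated_grades: {word: grade} from sight-word lists.
    Returns 0..1, preferring the curated grade when available."""
    if word in curated_grades:
        return curated_grades[word] / max_grade
    return auto_score(word)

# Placeholder data for illustration only.
sight_words = {"the": 1, "because": 3}
fallback = lambda w: 0.8  # stand-in for a real statistical score
```

This matches Matthew's observation directly: the curated table dominates at the young end where it has coverage, and the statistical score takes over for the long tail.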
From: Binh N. <ng...@gm...> - 2010-06-18 21:33:21
|
You can also use some online metrics:
+ Google News (a syndicated list of the most common "news" terms, plus rank)
+ Number of hits per search

--------------------------------------------------
Binh Nguyen
Computer Science Department
Rensselaer Polytechnic Institute
Troy, NY, 12180
-------------------------------------------------- |