
Great Tool!

TimIL2007
2007-06-01
2013-03-21
  • TimIL2007

    TimIL2007 - 2007-06-01

    Thank you for providing this tool.  It is easy to use and runs quickly.  I especially appreciate that it is usable 'out of the box'.
    Thank you!

     
    • John Joseph Bowers

      Hey, no problem.  I just did it to satisfy my own curiosity.  Actually the first thing I did was the "Three Word Phrase" Generator.  From there I thought I should just throw in a few more things and see if anyone else found it useful.

      If you have any suggestions as to what you find most useful (or what else would be useful to add), please post them.

       
    • TimIL2007

      TimIL2007 - 2007-06-04

      I'm using it initially for the quick word counts and would like to exploit the three word phrases, but I don't understand what the phrase variables mean.  Do you have some documentation/comments on the 'three word phrase' generator and what its user controls do (words, # before/after)?

      Also, as you noted in your feature request, it would be helpful to see what is considered whitespace/stopwords/characters.

      I'm only beginning to learn some programming, so exploring the source doesn't make this information clear to me...

      Thanks,
      Tim

       
    • John Joseph Bowers

      I am rather lacking on documentation.  I guess I can take this opportunity to throw out some explanation for the "Likely 3 Word Phrases" feature.

      The three word phrases generator basically attempts to come up with interesting three word phrases that would be likely to be found in the text.  It does this by taking the top X words found in the text (where X is the number specified in the words box).  It then locates each occurrence of these top X words in the text and determines the words that most frequently occur immediately before and after each one.  It then puts these together to form three word phrases.  The "# Before" and "# After" values are the number of most frequent preceding and following words you would like to use to generate the 3 word phrases.

      Example:

      Say you are evaluating a novel and your main character's name is "Darcy".  If you specified "300" in the words box, Darcy would likely show up as one of the top 300 words.  If you specified 2 in the "# Before" and "# After" boxes, you would locate the top two words that occurred before Darcy and the top two that occurred after it (and likewise for every other one of the top 300 words), and you would create three word phrases from every possible combination.

      So let's say the top 2 words before "Darcy" were "Mr." and "Lord", and the top 2 occurring after were "said" and "pranced".  You would end up with 4 phrases for this word:

      Lord Darcy Said
      Lord Darcy Pranced
      Mr. Darcy Said
      Mr. Darcy Pranced
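
      The steps described so far can be sketched in a few lines of Python.  This is only an illustration of the idea, not the tool's actual source: the function name likely_phrases, its parameter names, and the whitespace-only tokenization are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def likely_phrases(text, top_words=300, n_before=2, n_after=2):
    """Build candidate three word phrases around the most frequent words.

    Hypothetical sketch: the real generator may tokenize and rank differently.
    """
    # Assumption: tokens are whatever is separated by whitespace.
    tokens = text.split()

    # The top X words, where X corresponds to the "words" box.
    top = [word for word, _ in Counter(tokens).most_common(top_words)]

    # Count which words appear directly before and after each top word.
    before = {word: Counter() for word in top}
    after = {word: Counter() for word in top}
    for i, word in enumerate(tokens):
        if word in before:
            if i > 0:
                before[word][tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                after[word][tokens[i + 1]] += 1

    # Combine the most frequent neighbours into every possible phrase.
    phrases = []
    for word in top:
        firsts = [w for w, _ in before[word].most_common(n_before)]
        lasts = [w for w, _ in after[word].most_common(n_after)]
        phrases.extend(f"{a} {word} {b}" for a, b in product(firsts, lasts))
    return phrases
```

      With "# Before" and "# After" both set to 2, each top word contributes at most 2 x 2 = 4 phrases, which matches the Darcy example.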

      You would repeat this for each of the top 300 words, ending up with up to 300 x 4 = 1200 three word phrases.  I also included 3 "ignore files" in the directory where the generator binary is installed: ignore_words.txt, ignore_first.txt, and ignore_last.txt.

      If you are seeing a word that doesn't seem interesting to you, you can make sure you no longer see it in the first spot by adding it to ignore_first.txt, in the second spot by adding it to ignore_words.txt, and in the final spot by adding it to ignore_last.txt.

      The idea is that some words don't make good "interesting" phrases when they are included in some spots.  For example, "The" is nearly always the most common word in any English text, but a phrase with "The" as the middle word isn't very interesting.
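
      As a rough illustration of how the three ignore lists relate to the three spots in a phrase, here is a small Python sketch.  It is not taken from the tool: the helper names load_ignore_list and filter_phrases, the one-word-per-line file format, and the case-insensitive matching are all assumptions, and the real generator may apply the lists at a different stage.

```python
from pathlib import Path


def load_ignore_list(path):
    """Read an ignore file into a set of lowercase words.

    Assumption: one word per line. A missing file yields an empty set.
    """
    file = Path(path)
    if not file.exists():
        return set()
    lines = file.read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


def filter_phrases(phrases, ignore_first, ignore_middle, ignore_last):
    """Drop phrases whose first, middle, or last word is on the matching list."""
    kept = []
    for phrase in phrases:
        parts = phrase.split()
        if len(parts) != 3:
            continue  # only three word phrases are considered here
        first, middle, last = (word.lower() for word in parts)
        if first in ignore_first or middle in ignore_middle or last in ignore_last:
            continue
        kept.append(phrase)
    return kept
```

      In this sketch, ignore_first would be loaded from ignore_first.txt, ignore_middle from ignore_words.txt, and ignore_last from ignore_last.txt.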

      I would eventually like to add a UI for modifying the ignore lists or come up with a better statistical way to determine "interesting" 3 word phrases, but for now this is what the generator has.  Manually modify the .txt files and restart the application to pick up the changes.

       
    • John Joseph Bowers

      I might also note that the three word phrases feature probably isn't really interesting on very small documents.  It is also probably the slowest thing to run, and it can take quite a bit of time on very large texts.  Running it on the Bible (the largest text I have run it on) took a few minutes to complete.

       
