Word Frequency

2. Help
WandersFar
2013-05-06
2013-05-22
  • WandersFar
    2013-05-06

    Just wondering if anyone knows of an N++ plugin that does this: count how many times each word is used and output a list of the results, from most frequent to least frequent.

    Basically an N++ version of this site:
    http://writewords.org.uk/word_count.asp

     
  • azrafe7
    2013-05-09

    With PythonScript you can use a script like this:

    from operator import itemgetter
    from collections import OrderedDict
    import inspect
    import os
    import re
    
    __author__ = 'azrafe7'
    
    SCRIPT_NAME = os.path.basename(inspect.getfile(inspect.currentframe()))
    
    SHOW_TOP = 0                # 0 means show all words. Change this to e.g. 10 to show only the ten most frequent words
    WORD_PATTERN = r'(\w+)'     # change this to modify the word regex pattern, e.g. to r'([a-zA-Z]+)'
    
    console.show()
    
    console.write("\n\nExecuting [%s] PythonScript:\n" % SCRIPT_NAME)
    
    stats = OrderedDict()
    totalWords = 0
    
    # update word occurrences (forEachLine passes the line text, its number and the total line count)
    def updateStats(text, lineNum, totalLines):
        global totalWords
        for word in re.findall(WORD_PATTERN, text):
            word = word.lower()
            stats.setdefault(word, 0)
            stats[word] += 1
            totalWords += 1
    
    # run updateStats on each line of current doc
    editor.forEachLine(updateStats)
    
    # sort word occurrences (more frequent words shown first)
    sortedStats = sorted(stats.items(), key=itemgetter(1), reverse=True)
    
    count = 0
    for item in sortedStats:
        wordFreq = (100.0 * item[1]) / totalWords
        console.write("  '%s': %d (%.2f%%)\n" % (item[0], item[1], wordFreq))
        count += 1
        if SHOW_TOP > 0 and count >= SHOW_TOP:
            break
    
    console.writeError("Total words: %d. Unique words: %d." % (totalWords, len(sortedStats)))
    
    
    

    You can fine-tune the behavior by modifying these two lines:

    SHOW_TOP = 0                # 0 means show all words. Change this to e.g. 10 to show only the ten most frequent words
    WORD_PATTERN = r'(\w+)'     # change this to modify the word regex pattern, e.g. to r'([a-zA-Z]+)'
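
    To see what the two settings do without opening Notepad++, here is a minimal sketch in plain Python. It is not the PythonScript above: it assumes collections.Counter in place of the OrderedDict, and the function name word_frequencies and the sample text are made up for illustration.

```python
import re
from collections import Counter


def word_frequencies(text, word_pattern=r'(\w+)', show_top=0):
    """Return (word, count, percent) tuples, most frequent first.

    word_pattern and show_top mirror WORD_PATTERN and SHOW_TOP;
    show_top=0 means return every word.
    """
    words = [w.lower() for w in re.findall(word_pattern, text)]
    counts = Counter(words)
    total = len(words)
    # most_common(None) returns all entries, most_common(n) only the top n
    top = counts.most_common(show_top if show_top > 0 else None)
    return [(word, n, 100.0 * n / total) for word, n in top]


sample = "The cat saw the other cat in 2013"  # hypothetical sample text

# Default pattern: digits and underscores count as word characters
for word, n, pct in word_frequencies(sample, show_top=2):
    print("'%s': %d (%.2f%%)" % (word, n, pct))

# Letters-only pattern: "2013" is no longer counted as a word
letters_only = word_frequencies(sample, word_pattern=r'([a-zA-Z]+)')
```

    With the default pattern the sample yields eight words, six of them unique; with the letters-only pattern "2013" drops out of the count.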
    

    I'm really enjoying PythonScript!

     
  • Loreia2
    2013-05-15

    Hi azrafe7,

    Really nice script.

    "I'm really enjoying PythonScript!"
    

    I couldn't agree more.

    BR,
    Loreia

     
  • azrafe7
    2013-05-19

    Glad to know you're enjoying that plugin as much as I am!

    As a side note, I'd like to let you know that I really appreciate your work on UDL, and to point you to http://sourceforge.net/p/notepad-plus/feature-requests/2283 (which you probably haven't read yet).

    What I tried to state in that post is the need to have more control over keywords.
    I'd be especially grateful if regexes could be enabled/disabled at will (as I assume you're already using them to do your magic ;).

    tl;dr: what about a regex checkbox when defining keywords etc.?

     
  • Loreia2
    2013-05-22

    Hi azrafe7,

    Regex support is planned for the 3.0 release (currently being implemented), but it will be limited to the Keyword1-8 types.

    To answer your question: you cannot have '.' both as an operator and as part of a keyword. UDL is very strict about operators; they are unique.

    Take C++ code, for example: ';' can't be both an operator and part of a keyword.

    About your LESS example:

    h2.category {
        background-color : red;
    }
    

    One trick to get around this limitation is to define ".category" as an operator (and put it in front of '.' in the list of operators). This way you can color ".category" as one unit.

    The problem, though, is that you have just one operator1 type until I release UDL 3.0 (and that won't happen anytime soon).
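
    Why the order in the operator list matters can be illustrated with a toy tokenizer in Python. This is only a model of "the first matching operator in the list wins", which is an assumption about the behavior described here, not UDL's actual implementation; split_operators is a hypothetical helper.

```python
def split_operators(text, operators):
    """Split text into tokens, trying operators in list order at each position.

    Toy model only: the first operator in the list that matches wins.
    """
    tokens = []
    word = ""
    i = 0
    while i < len(text):
        for op in operators:
            if text.startswith(op, i):
                # Flush the word collected so far, then emit the operator
                if word:
                    tokens.append(word)
                    word = ""
                tokens.append(op)
                i += len(op)
                break
        else:
            # No operator matched here, so the character belongs to a word
            word += text[i]
            i += 1
    if word:
        tokens.append(word)
    return tokens


# ".category" listed before ".": it is matched as one unit
print(split_operators("h2.category", [".category", "."]))
# "." listed first: it matches before ".category" gets a chance
print(split_operators("h2.category", [".", ".category"]))
```

    In this model the first call keeps ".category" together as a single token, while the second splits it into "." and "category".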

    BR,
    Loreia