Help deleting duplicate words/lines and sorting

frmariam
2008-07-04
2012-11-13
  • frmariam

    frmariam - 2008-07-04

    Hi there. I've just started using Notepad++ and I have been having trouble finding some functions (that is, if they exist...).

    Can someone please tell me how I can delete all duplicate words and lines in my document? It's just an alphabetically sorted wordlist, but I don't want to do it manually...

    Also I noticed that "à", "á" and "â" aren't sorted near the other "a"... Is it a bug? I mean they are still "a" (just accented).
    Also I'd like to know if it's possible to sort the lines according to the number of chars in them (my lines only have one word each) but still alphabetically:

    a
    b
    ba
    bb
    dd
    zx
    aab
    abcd

    Thanks in advance (and sorry if the answer is obvious... I'm just having a hard time finding it... like when you fail to see something right in front of you)

     
    • frmariam

      frmariam - 2008-07-04

      Man you are awesome! Thanks a whole bunch (I can see you went through a lot of work to do it).

      This has got to be one of the kindest gestures I have ever received from any user in any forum I've ever been on.
      You surely made it easy for me (compensating for the noobness or language barrier).

      Once again, I really appreciate this.

       
      • Fool4UAnyway

        Fool4UAnyway - 2008-07-05

        > Thanks a whole bunch (I can see you went through a lot of work to do it).

        You're welcome. (I only had to re-write and adapt the previous instruction. It wasn't that hard.)

        I forgot to mention the Simple Script Plug-In also has two functions to remove duplicate lines: DeleteDupes() and DeleteDupesCase().

        >>DeleteDupes(): Deletes any line that is identical to a previous line.

        This results in a list of only distinct rows. Use this in combination with sort() to get a sorted list with no duplicate entries.

        The matching on this is case-insensitive. So for example, "bird" matches "BIRD".
        For case-sensitive matching, use DeleteDupesCase() instead.

        >>DeleteDupesCase(): Deletes any line that is identical to a previous line, with case-sensitive matching.

        This results in a list of only distinct rows. Use this in combination with sort() or sortcase() to get a sorted list with no duplicate entries.

        The matching on this is case-sensitive. So for example, "bird" is not the same as "BIRD".
        For case-insensitive matching, use DeleteDupes() instead.
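
For readers without the plug-in, the behaviour described above can be approximated in a few lines of Python. This is only a sketch of the described behaviour (keep the first occurrence of each line, optionally ignoring case), not the plug-in's actual code; the name `delete_dupes` and its `case_sensitive` parameter are made up for illustration.

```python
def delete_dupes(lines, case_sensitive=False):
    """Keep the first occurrence of each line, preserving the original order."""
    seen = set()
    result = []
    for line in lines:
        # Compare on a case-folded key unless case-sensitive matching is requested
        key = line if case_sensitive else line.casefold()
        if key not in seen:
            seen.add(key)
            result.append(line)
    return result


words = ["bird", "BIRD", "cat", "bird", "Cat"]
print(delete_dupes(words))                       # ['bird', 'cat']
print(delete_dupes(words, case_sensitive=True))  # ['bird', 'BIRD', 'cat', 'Cat']
```

Calling `sorted()` on the result gives a sorted list with no duplicate entries, which mirrors combining the function with sort().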

         
    • Fool4UAnyway

      Fool4UAnyway - 2008-07-04

      To get an idea about what you can do, read this thread:

      "Preparing numbers to be sorted by TextFX sort" (Plugin Development forum)
      http://sourceforge.net/forum/forum.php?thread_id=2011755&forum_id=482781

      Now, think about how you could do a similar thing for your words of different lengths...
      In the description, "replace" the 0 by a space character and treat your words as numbers.
      (I guess the maximum length of a word in the list really doesn't matter.)

      For simple sorting, you can also use TextFX.

      Menu_____ : TextFX
      Submenu__ : TextFX Tools
      Option___ : +Sort outputs only UNIQUE (at column) lines

      Check this option if you want to remove duplicate entries/lines, in your case: words.

      To actually perform the sorting:

      Menu_____ : TextFX
      Submenu__ : TextFX Tools
      Option___ : +Sort lines case (in)sensitive (at column)

      The "(at column)" part means that sorting will be performed from the starting position of your selection (on the first line).

      In your case, you can select complete lines (cursor at the beginning) and even the complete file or list of words (CTRL+A).

      If you also want the accents to be ignored, you may want to remove them (first). I can't think of a way right now to put them back after the sorting has been done.
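
Outside Notepad++, one way around the "put them back" problem is to leave the words untouched and strip the accents only in the sort key. The following is a rough Python sketch of that idea, not a TextFX feature; the helper names `strip_accents` and `accent_insensitive_sort` are hypothetical.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining accent marks removed, e.g. 'á' becomes 'a'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_insensitive_sort(words):
    # Sort on the unaccented, case-folded form; the original word breaks ties,
    # so the accents never have to be removed from the list itself.
    return sorted(words, key=lambda w: (strip_accents(w).casefold(), w))


words = ["b", "à", "a", "ábaco", "zx", "âmbar"]
print(accent_insensitive_sort(words))
```

Because only the key is modified, the accented words come out of the sort exactly as they went in.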

       
      • Fool4UAnyway

        Fool4UAnyway - 2008-07-04

        To allow "accentual correct" sorting, you might replace each single character uniquely with another character that would result in the correct sorting order. Then, after sorting, you could do a reverse replacing to get the original characters back. This would require you to create a 1-to-1 transformation list, which you could execute by using the Simple Script Plug-In's TransformChars function:

        TransformChars( string alphabet1, string alphabet2 )
        : replaces all characters in alphabet1 with the characters in the same position in alphabet2.

        Equivalent to running a series of replaces on single characters.

        This function operates on each line individually, so \r and \n won't work.  All other escape characters work fine.

        Each character in the line will be looked up in alphabet1. If it finds it, it replaces it with the character in the same position in alphabet2.

        Unlike in other functions, matching is case-sensitive.

        For obvious reasons, alphabet1 and alphabet2 should be the same length. If not, characters may not be replaced that should be. They do not, however, need to be the same characters.

        Alphabet1 and alphabet2 do not have to include all letters in the alphabet, only ones that need to be replaced.

        This function can be used for simple encryption, but because there's no guarantee the transformation will be reversible, you need to choose your alphabets carefully.

        Examples:

        Replacing asterisks with ampersands, ampersands with quotation marks, and quotation marks with asterisks:

        transformchars("*&\q","&\q*")

        Replacing vowels with numbers:

        transformchars("aeiouAEIOU","1234512345")

        Replacing tabs with spaces:

        transformchars("\t"," ")

        ROT13 cipher:

        transformchars("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ","nopqrstuvwxyzabcdefghijklmNOPQRSTUVWXYZABCDEFGHIJKLM")
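
The same 1-to-1 replacement can be tried out in Python with `str.maketrans` and `str.translate`. This is only a sketch of the behaviour described above, not the plug-in's implementation; `transform_chars` is a hypothetical name, and raising an error on alphabets of unequal length is a choice made here, not something the plug-in is known to do.

```python
def transform_chars(text, alphabet1, alphabet2):
    """Replace each character found in alphabet1 with the character
    at the same position in alphabet2 (case-sensitive)."""
    if len(alphabet1) != len(alphabet2):
        # Assumption of this sketch: refuse mismatched alphabets outright
        raise ValueError("alphabet1 and alphabet2 must have the same length")
    return text.translate(str.maketrans(alphabet1, alphabet2))


LOWER = "abcdefghijklmnopqrstuvwxyz"
ROT13_FROM = LOWER + LOWER.upper()
ROT13_TO = LOWER[13:] + LOWER[:13] + LOWER.upper()[13:] + LOWER.upper()[:13]

print(transform_chars("Notepad", "aeiouAEIOU", "1234512345"))  # N4t2p1d
print(transform_chars("Hello", ROT13_FROM, ROT13_TO))           # Uryyb
```

Applying the ROT13 mapping twice returns the original text, which is the kind of reversibility needed for a "replace, sort, replace back" approach.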

         
    • frmariam

      frmariam - 2008-07-04

      Thanks for your detailed help!

      As for sorting the words beginning with accented chars, I think it's still easier to do it manually (there are just a few).

      Actually, for this list the line length seems to matter (the app seems to complain otherwise)... The list is over 2500 lines... And based on that link I'd need a way to automatically prefix my lines with a number based on their length... I couldn't quite think of a way to do it...

       
      • Fool4UAnyway

        Fool4UAnyway - 2008-07-04

        > As for sorting the words beginning with accented chars, I think it's still easier
        > to do it manually (there are just a few).

        If there are only a few accents, I suppose you could also manually "remove" them, then do a correct sorting, and later manually re-place the accents. (My suggestion above would lead to a pseudo-order, just making sure an accented a would be sorted prior to a b, but would still keep an order between differently accented a's.)

        The number of lines doesn't matter.

        I meant to say that in the number example, the numbers were only short, but for numbers as well as for words, it really doesn't matter how long they really are.
        You just have to make sure that you prefix all (other) words with a number of spaces equalling the length of the longest line (minus the length of the shortest line).

        Then you may be able (if the longest word isn't tremendously long) to remove all _extra_ spaces by a regular expression that will only keep the last #X characters of the line. All words will be equally long then, with "short" words prefixed by space characters, making sure they will appear on top of the sorted list. After sorting, you can simply remove all prefix space characters.

         
    • frmariam

      frmariam - 2008-07-04

      Yeah, I kind of got that part about prefixing the words with blank spaces (or anything but letters) according to the longest word.

      My problem is how to prefix the words (I mean every single one at once... with simple operations) with the correct number of spaces... All I'd have to do next is replace the " " with nothing...

       
      • Fool4UAnyway

        Fool4UAnyway - 2008-07-04

        OK, I'll try to rewrite the complete instruction to suit your purpose.

        1. Prefix words with an excessive amount of space characters (0's would also do).
        2. Remove unwanted space characters using the regex mode of the replace dialog.
        3. Sort the words.
        4. Remove remaining prefix space characters.
        5. Voila, done, list sorted by length and alphabetically!

        1. Prefix words with an excessive amount of space characters (0's would also do).

        Each line contains a single word, right? There are several ways to insert text at the beginning of each line. Let's use the block selection mode to select a 0-character width column at the start of each line:

        Move the cursor to the beginning or end of the file: Ctrl+Home or Ctrl+End.
        Hold down the Shift key while pressing the other combination: Ctrl+End or Ctrl+Home.
        While holding the Shift key, move the cursor to the right twice to make sure you'll see the next effect.
        While still holding down the Shift key, also press the Alt key and move the cursor to the left once. You'll see the line selection change into a columnar block with a width of 1 column.
        While holding down Shift and Alt key, move the cursor to the left once more. You'll see no selection, which "indicates" a columnar block of width 0.

        Press ALT+C to open the Column Editor dialog.
        Select "Text to insert".
        Enter a number of space characters matching the longest line/word's length.
        Press OK.

        All words will now be prefixed by the necessary number of space characters, but most lines will be longer now than the previously longest line. We will remove this part of the prefix in the next step.

        2. Remove unwanted space characters using the regex mode of the replace dialog.

        This uses the regular expression mode of the Find/Replace dialog (CTRL+H or CTRL+R).
        Check the regular expression checkbox and uncheck any other undesired options (Selection, wrap). Move the cursor to the top of the file: Ctrl+Home.

        In the Find field, enter:
        ^_*(.....)$

        _ represents a space character: type a real space character in the dialog.
        Use as many dots (.) as the length of the longest word. (More doesn't hurt, but isn't necessary.)

        In the Replace field, enter:
        \1

        (Find the first and) replace all occurrences.

        This will search and find all occurrences, but telling the replace engine to keep only the last "\d" number of digits. The prefix of the remaining number of space characters won't be placed back, so in effect will be removed.

        _* means: any consecutive string of space characters with a minimum of zero (none)

        3. Sort the words.

        Select all lines/text: Ctrl+A.

        Menu_____ : TextFX
        Submenu__ : TextFX Tools
        Option___ : +Sort ascending (make sure it is checked, you may have to "check" this twice!)
        Option___ : Sort lines case insensitive (at column)

        4. Remove remaining prefix space characters.

        Now the list is sorted, but there are still prefix space characters.
        These can be easily removed by applying the following regex in the Find field and the replace string in the Replace field.

        Find field:
        ^_+(.)

        Replace field:
        \1

        Reminder: _ represents a space character, type a real space character in the dialog.

        This will search for any string of space characters from the beginning of a line, including the first non-space character, and re-place only that non-space character.

        _+ is used, because it would make no sense to search for zero space characters (no prefix to remove).
        + means: a string of one or more consecutive space characters

        5. Voila, done, list sorted by length and alphabetically!

        You're welcome. I'll gratefully accept any reward.
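
For anyone who wants to check the logic of the five steps outside Notepad++, they can be mimicked in Python with the same two regular expressions. This is only a sketch that assumes a non-empty list with one word per line and no spaces inside the words; the function name `sort_by_length_then_alpha` is made up, and Python's `re` module stands in for the Find/Replace dialog.

```python
import re


def sort_by_length_then_alpha(words):
    width = max(len(w) for w in words)  # length of the longest word

    # Step 1: prefix every word with an excessive number of spaces
    padded = [" " * width + w for w in words]

    # Step 2: keep only the last `width` characters, like ^_*(.....)$ -> \1
    keep_last = re.compile(r"^ *(.{%d})$" % width)
    fixed = [keep_last.sub(r"\1", line) for line in padded]

    # Step 3: case-insensitive sort; spaces sort before letters,
    # so the shorter (more padded) words end up on top
    fixed.sort(key=str.lower)

    # Step 4: remove the remaining prefix spaces, like ^_+(.) -> \1
    return [re.sub(r"^ +(.)", r"\1", line) for line in fixed]


words = ["abcd", "zx", "a", "bb", "aab", "b", "ba", "dd"]
print(sort_by_length_then_alpha(words))
# ['a', 'b', 'ba', 'bb', 'dd', 'zx', 'aab', 'abcd']
```

The output matches the order asked for in the first post: shortest words first, alphabetical within each length.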

         
        • Fool4UAnyway

          Fool4UAnyway - 2008-07-04

          Here's a somewhat simpler solution.

          It uses the Simple Script Plug-In. You may have to download it.

          The really simple script is as follows:

          PadLeft( " ", 55 )
          Sort()
          TrimLeft(" ")

          Unfortunately, when I add a space character between the ( and " on the last line, the script won't validate correctly. This prevents me from writing nice looking code, at least in my view.

          This simple script also does the trick. It really is simple.

          But the explanation above also gives you an idea about what Notepad++ is capable of, so what YOU are capable of.
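
The three-line script has a close analogue in Python. The sketch below assumes that PadLeft pads each line on the left up to the given total width and that TrimLeft removes the leading pad characters again; that reading of the plug-in's functions is an assumption, and `pad_sort_trim` is a hypothetical name.

```python
def pad_sort_trim(words, width):
    padded = [w.rjust(width) for w in words]  # like PadLeft( " ", width )
    padded.sort(key=str.lower)                # like Sort(), case-insensitive
    return [w.lstrip(" ") for w in padded]    # like TrimLeft(" ")


words = ["abcd", "zx", "a", "bb", "aab", "b", "ba", "dd"]
print(pad_sort_trim(words, 4))
# ['a', 'b', 'ba', 'bb', 'dd', 'zx', 'aab', 'abcd']

# In Python the padding is not actually needed: sorting on a
# (length, lowercase word) key gives the same order directly.
print(sorted(words, key=lambda w: (len(w), w.lower())))
```

As with the script, `width` must be at least the length of the longest word for the ordering to come out right.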

           
          • Fool4UAnyway

            Fool4UAnyway - 2008-07-04

            Of course, the number 55 represents the length of the longest word/line. Enter the correct number here.

            The preceding messages show that I rewrote the other instructions:

            > This will search and find all occurrences, but telling the replace engine to keep
            > only the last "\d" number of digits.

            Of course, this should have been: only the last number of "." characters.