Announcing Sort Lines Plugin

Heinz
2014-01-04
2014-01-09
  • Heinz
    2014-01-04

    Hello,

    I have created a plugin that sorts lines.

    A "Sort Lines" feature was recently added to Notepad++, but since I needed some more features, I created a plugin.

    Plugin Features:
    -sort by leading numbers (numerically)
    -sort by line length
    -sort by characters from the right side of lines (x last characters)
    -remove duplicates (unique lines in result)
    -remove empty lines
    -remove leading spaces
    -remove leading numbers
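To illustrate what some of the listed features mean in practice, here is a minimal Python sketch. It is not the plugin's code (the plugin's implementation is not shown in this thread); the helper names `leading_number` and `unique` and the sample lines are made up for illustration.

```python
import re

def leading_number(line: str) -> int:
    """Return the integer at the start of the line, or 0 if there is none."""
    match = re.match(r"\s*(\d+)", line)
    return int(match.group(1)) if match else 0

def unique(lines):
    """Drop duplicate lines, keeping the first occurrence of each."""
    return list(dict.fromkeys(lines))

# Hypothetical sample data
lines = ["10 pears", "2 apples", "", "1 fig", "2 apples"]

by_number = sorted(lines, key=leading_number)        # numeric: 1, 2, 10 (not 1, 10, 2)
by_length = sorted(lines, key=len)                   # shortest line first
by_last_3 = sorted(lines, key=lambda s: s[-3:])      # compare the last 3 characters
no_empty = [s for s in unique(lines) if s.strip()]   # unique, non-empty lines

print(by_number)
print(no_empty)
```

Python's `sorted` is stable, so lines with equal keys keep their original relative order.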

    My plugin seems to be faster than the N++ line sort feature (tested with large files of 1 million lines).

    The plugin is 32-bit and was tested with Unicode files and Notepad++ 6.5.2 (Unicode).

    Download, more information and screenshots here:
    http://www.scout-soft.com/linesort/

    Please let me know what you think

     
  • Fool4UAnyway
    2014-01-05

    See: http://sourceforge.net/p/notepad-plus/discussion/331753/thread/0ee373c9/#6ffe

    I think internally replacing tab characters by the corresponding number of space characters is of more use than removing parts of lines, which can easily be done manually anyway. It's more or less like sorting numerically, i.e. 1, 2, ..., 9, 10 instead of 1, 10, 2, ...

     
  • THEVENOT Guy
    2014-01-08

    Hello Heinz,

    Thank you very much for this new N++ sorting plugin :-)

    First general impressions and first bugs found ( Sorry ! ) :


    The features on offer are very, very interesting, especially the numeric sort, the removal of empty and/or duplicate lines, and the result in another window :-)

    But, without doubt, all the other goodies will be useful to everyone, from time to time !

    In addition, your plugin sorts characters according to their Unicode code point, in exactly the same way as the internal Quick sort of N++. Indeed, it's the best way to sort the characters of the UCS ( Universal Character Set )

    For example, the SIX characters : £ , ™ , € , ƒ , A and æ are sorted A , £ , æ , ƒ , € and ™ , because the different code points are :

    \x0041 (A) , \x00A3 (£) , \x00E6 (æ) , \x0192 (ƒ) , \x20AC (€) , \x2122 (™)
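The code point ordering described above can be checked with a few lines of Python, whose default string comparison also works by Unicode code point. This is only an illustration of the ordering, not a statement about how the plugin or N++ implements it.

```python
# The six characters from the example, in their original order
symbols = ["£", "™", "€", "ƒ", "A", "æ"]

# Python compares str values code point by code point
ordered = sorted(symbols)

for ch in ordered:
    print(f"U+{ord(ch):04X} ({ch})")
```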

    I noticed TWO small bugs which should be easy enough to fix :

    • If you start two instances of your plugin, the sort does not seem to work and, sometimes, an "Unknown exception error" message is displayed in a small window !

    • When you have a file in the secondary view of N++, with the focus on it, your plugin doesn't sort this file but, instead, sorts the corresponding unfocused file in the main view of N++ !


    Now, as you know, professional sort software always offers the possibility of defining several sort keys. I mean, for example :

    1st key : 10 characters, beginning at position 30
    2nd key : 8 characters, beginning at position 50
    3rd key : 20 characters, beginning at position 5 and so on...

    Would it be possible, if, of course, it's not too difficult to code, to get such a feature ? I think that a minimum of 3 sort keys should be OK, and it seems sensible not to exceed 5 sort keys. It's up to you to decide :-)
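A multi-key sort of this kind can be sketched in Python with a tuple key. This is only an illustration of the idea, assuming 1-based start positions and fixed-width keys; `make_key` and the sample data are hypothetical names and values, not part of any existing plugin.

```python
def make_key(fields):
    """Build a sort key from (start, length) pairs; start is 1-based."""
    def key(line: str):
        # Each field is a fixed-width slice; short lines simply yield shorter slices
        return tuple(line[start - 1:start - 1 + length] for start, length in fields)
    return key

# The three keys from the example above: (position, number of characters)
example_fields = [(30, 10), (50, 8), (5, 20)]
example_key = make_key(example_fields)

# A smaller demonstration: 1st key = 2 characters at position 5,
# 2nd key = 3 characters at position 1
lines = ["bbb 02", "aaa 02", "ccc 01"]
result = sorted(lines, key=make_key([(5, 2), (1, 3)]))
print(result)
```

Tuples compare element by element, so the second key is only consulted when the first keys are equal.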

    On that subject, refer to the link below, which offers another N++ sort plugin !

    http://william.famille-blum.org/blog/index.php

    But rest assured :-) This plugin ( NppColumnSort ), by William Blum, doesn't handle the accented characters of many European languages very well :(( Some months ago, I sent him a mail to report some bugs, but I never got an answer. He must have changed his mind about it ! Moreover, apart from the numeric sort, it doesn't have all your add-ons !


    As Fool4UAnyway said in his post above, we had a discussion about some problems with the use of the tabulation character ( \x09 ). For details, follow the link below and see the attached picture Sort_Problem.png

    http://sourceforge.net/p/notepad-plus/discussion/331753/thread/0ee373c9/#5900

    In fact, there are TWO distinct problems :

    By convention, let us suppose that a tabulation takes FOUR physical positions ( default N++ tab settings )

    1)

    Fool4UAnyway pointed out the fact that a leading tabulation, before some text, can take 4 forms when you add some spaces at the very beginning of the line :

    • a) A full tabulation, taking 4 physical positions, before some text
    • b) ONE space, followed by a tabulation, taking 3 physical positions, before some text
    • c) TWO spaces, followed by a tabulation, taking 2 physical positions, before some text
    • d) THREE spaces, followed by a tabulation, taking 1 physical position, before some text

    And, as both the N++ Quick sort and your plugin consider that a tabulation is ordered BEFORE the space character, the result, AFTER the sort, is that these four lines are sorted as above ( from a to d ), regardless of the fact that the text in these four lines, which starts at the same physical column, is probably NOT sorted at all !
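The effect can be reproduced with a small Python sketch. It assumes a tab size of 4 and uses made-up sample words; Python's built-in `str.expandtabs` stands in for the "internal replacement of tabs by spaces" idea and is not what N++ or the plugin does.

```python
TAB_SIZE = 4  # assumed tab width (the default N++ setting mentioned above)

# Forms a) to d): the text always starts at the same visual column
lines = ["\tdelta", " \tcharlie", "  \tbravo", "   \talpha"]

# Plain code point sort: tab (\x09) sorts before space (\x20),
# so the leading whitespace alone decides the order
raw_order = sorted(lines)

# Sorting on the visually expanded line compares the actual text instead
visual_order = sorted(lines, key=lambda s: s.expandtabs(TAB_SIZE))

print(raw_order)
print(visual_order)
```

Only the sort key is expanded here; the lines themselves keep their original tabs and spaces.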

    So, a workaround to improve the sort, in that case, would be to replace 1, 2 or 3 spaces + tabulation by a tabulation ONLY, taking the FOUR positions, internally, during the sort process. However, this search is not sufficient :

    Let us consider the string "ab" + 7 spaces + 1 tabulation + "defg", at the beginning of a line.

    Then, the simple search [ ]{1,3}\t and the replacement \t BREAK the original position of the text in the line.

    But the search of the alternative : ([^\t\r\n]{4})|\x20{1,3}\t with the replacement (?1\1:\t) keeps the position of the characters. See the end of my post at the address :

    http://sourceforge.net/p/notepad-plus/discussion/331753/thread/0ee373c9/#5900
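The difference between the two searches can be checked in Python on the example string. Python's `re` module has no conditional replacement syntax like (?1\1:\t), so this sketch uses a replacement function to get the same effect; the tab size of 4 and the use of `str.expandtabs` to compare visual layouts are assumptions of the sketch.

```python
import re

TAB_SIZE = 4  # assumed tab width
line = "ab" + " " * 7 + "\t" + "defg"

# Simple search: swallows up to 3 spaces before the tab, wherever they are
naive = re.sub(r"[ ]{1,3}\t", "\t", line)

# Alternative search: walk the line in blocks of 4 non-tab characters and
# only collapse "1 to 3 spaces + tab" when it fills the remainder of a block
pattern = re.compile(r"([^\t\r\n]{4})|\x20{1,3}\t")
aligned = pattern.sub(
    lambda m: m.group(1) if m.group(1) is not None else "\t", line
)

print(repr(naive), repr(aligned))
print(aligned.expandtabs(TAB_SIZE) == line.expandtabs(TAB_SIZE))
```

With the alternative search, "defg" stays at the same visual column as in the original line, while the simple search moves it one tab stop to the left.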

    So, if a tabulation takes n physical positions, a possible algorithm could be :

    A) From the cursor position, consider the next n characters of the file, without EOL characters

    B) If there is NO tabulation among these n characters, go back to point A)

    C) If there are 1, 2 or 3 spaces BEFORE that TABULATION, replace that set ( 1 to 3 space(s)+ 1 tab ) by ONE tabulation ONLY

    D) Go back to point A)

    2)

    In your plugin, you have an option to sort by the length of the lines of the file. Nice ! But, again, if some tabulations exist in the lines, this disrupts the result because we can choose between two interpretations :

    Either we consider that a tabulation is 1 character long ( the present and normal behaviour )

    Or we consider that a tabulation is n characters long. Then, after the sort, the text will be displayed in a nicer way.

    I think that the two solutions are equally relevant !?
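The two interpretations give different orders, as this short Python sketch shows. It assumes a tab size of 4 and made-up sample lines, and measures the visual length with `str.expandtabs`; it does not reflect how the plugin measures line length.

```python
TAB_SIZE = 4  # assumed tab width

lines = ["abcdefg", "a\tb", "\t\tx"]

# Interpretation 1: a tab counts as 1 character
by_chars = sorted(lines, key=len)

# Interpretation 2: a tab counts for the columns it fills on screen
by_columns = sorted(lines, key=lambda s: len(s.expandtabs(TAB_SIZE)))

print(by_chars)
print(by_columns)
```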

    Of course, Heinz, all these remarks are only suggestions and it's up to you to decide what is suitable to do :-)

    After all, it's YOUR plugin ! Even in its present form, I do think that it is worth installing :-)

    Best Regards

    guy038

     
    Last edit: THEVENOT Guy 2014-01-08
    • Heinz
      2014-01-08

      Hello,

      I noticed TWO small bugs which should be easy enough to fix :
      If you start two instances of your plugin, the sort does not seem to work and, sometimes, an "Unknown exception error" message is displayed in a small window !
      When you have a file in the secondary view of N++, with the focus on it, your plugin doesn't sort this file but, instead, sorts the corresponding unfocused file in the main view of N++ !

      Thanks for this information.
      I will try to find a solution

      Now, as you know, professional sort software always offers the possibility of defining several sort keys.

      Your idea sounds interesting, I will try that
      (By the way : If the text is "CSV" formatted then you can sort it by columns with my SQL plugin: http://www.scout-soft.com/sql/index.html )

      But rest assured :-) This plugin ( NppColumnSort ), by William Blum, doesn't handle the accented characters of many European languages very well

      Well, I am from Germany and therefore I'm used to working with texts containing non-ASCII characters (like umlauts), so Unicode/UTF-8 seemed to me a natural choice :-)

      As Fool4UAnyway said in his post above, we had a discussion about some problems with the use of the tabulation character ( \x09 ).

      Yes, while I was testing "sort by line length" I came across this problem.
      But to me this was not really a problem because the texts I work with are usually without tabs (like automatically generated logfiles etc.)
      But of course I understand that for example if people manually type text this is another situation and tabs may exist.

      So, if a tabulation takes n physical positions, a possible algorithm could be :
      A) From the cursor position, consider the next n characters of the file, without EOL characters
      B) If there is NO tabulation among these n characters, go back to point A)
      C) If there are 1, 2 or 3 spaces BEFORE that TABULATION, replace that set ( 1 to 3 space(s)+ 1 tab ) by ONE tabulation ONLY
      D) Go back to point A)

      So you suggest replacing space(s) with the corresponding tabs?
      Well, I was thinking in the opposite direction.
      Why? Because one could say that a tabulation character is just a replacement (a shortcut) for n (in case of N++ : 1-4) spaces...so why not replace that shortcut by its actual value (n spaces) ?

      So my algorithm could be:
      A)..
      C) If there are 1, 2 or 3 spaces BEFORE that TABULATION, replace that TABULATION by 3, 2 or 1 space(s)
      C1) If the TABULATION is the first or last character in a line, replace it by 4 spaces
      D)..

      2)
      In your plugin, you have an option to sort by the length of the lines of the file.
      Nice ! But, again, if some tabulations exist in the lines, this disrupts the result because we can choose between two interpretations :

      Yes, but maybe there is no one-size-fits-all solution to address the problem:
      Technically (in memory), a tabulation is 1 byte long (\x09)
      Visually (in the editor), a tabulation is n "characters" long (in the case of N++ it fills up to 4 spaces, other editors may vary ...)

      Thank you for commenting!

       
      • Fool4UAnyway
        2014-01-09

        So my algorithm could be:
        A)..
        C) If there are 1, 2 or 3 spaces BEFORE that TABULATION,
        replace that TABULATION by 3,2 or 1 space(s)
        C1)if TABULATION is the first or last character in a line
        - replace it by 4 spaces
        D)..

        You HAVE to take into account the "visual" position any tab character is on. Especially your C1) remark may not be correct: if the last character on a line is a tab, but it is not on the first tab visual position, it is not a full tab.

        Incrementally, you can replace tab characters only by the number of space characters up to the next visual tab position. A full tab can only be on position 1, ts + 1, 2 * ts + 1, n * ts + 1, where ts = tab size, and each next tab character can only be replaced if you take into account all visual positions that you have claimed by replacing each previous tab by the correct number of space characters.

        Put simply, if your file consists of a letter and then a tab character, which is the last character, you replace that tab character by (maximum/set) tab size minus 1 space characters.
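The position-aware replacement described here can be sketched in Python as follows. The function name `expand_tabs` is hypothetical, the default tab size of 4 is an assumption, and the sketch assumes every non-tab character occupies exactly one visual column (which is not true for all Unicode characters).

```python
def expand_tabs(line: str, tab_size: int = 4) -> str:
    """Replace each tab by the spaces needed to reach the next tab stop."""
    out = []
    column = 0  # 0-based visual column of the next character
    for ch in line:
        if ch == "\t":
            # A tab is only "full" when it starts exactly on a tab stop
            width = tab_size - (column % tab_size)
            out.append(" " * width)
            column += width
        else:
            out.append(ch)
            column += 1
    return "".join(out)

# One letter followed by a trailing tab: tab size minus 1 spaces
print(repr(expand_tabs("a\t")))
# A leading tab is a full tab
print(repr(expand_tabs("\tx")))
```

For single lines without line breaks, Python's built-in `str.expandtabs` gives the same result, so the loop above is mainly there to make the column bookkeeping visible.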

         
  • THEVENOT Guy
    2014-01-09

    Hi Heinz,

    Thank you very much for your rapid answer !

    But to me this was not really a problem because the texts I work with are usually without tabs (like automatically generated logfiles etc.)
    But of course I understand that for example if people manually type text this is another situation and tabs may exist.

    Finally, on reflection, I think you are perfectly right about it ! Indeed, tabulated text mainly exists in programming language source code and, of course, it's generally useless to sort such source code !!!

    So you suggest replacing space(s) with the corresponding tabs?
    Well, I was thinking in the opposite direction.
    Why? Because one could say that a tabulation character is just a replacement (a shortcut) for n (in case of N++ : 1-4) spaces...so why not replace that shortcut by its actual value (n spaces) ?

    Again, your solution, which replaces a tabulation by 3, 2 or 1 space(s), looks simpler. Because, contrary to source code, the texts that we have to sort certainly contain more space characters than tabulation characters.

    So your solution seems preferable to mine, as long as you can, of course, take into account the exact number of spaces of a tabulation character for a specific language !

    And naturally, with YOUR proposed tabulation replacement, your sort by line length will automatically be correct too !

    As the French proverb says : "The Best" is, often, the enemy of "The Good" !

    Of course, I can't presume to know Fool4UAnyway's opinion about these sort problems !

    But, as for me, with some S/R, in regex mode, I'll always be able to solve this small tabulation annoyance :-)

    Cheers,

    guy038

     
    Last edit: THEVENOT Guy 2014-01-09