How to simulate an improved sort with sort criteria !

2013-12-22
2014-01-05
  • THEVENOT Guy
    2013-12-22

    Hi All,

    Well, from now on, since the 6.5.2 version, we can use the internal Quick Sort of N++. Fine !

    Globally, Quick Sort and Binary Tree sort, with its specific data structure, are among the fastest sort algorithms, apart from some variations like IntroSort, which is based on Quick Sort, and a few others.

    But, generally, a good sort program uses sort keys. At present, N++'s Quick Sort uses each entire line as a pseudo sort key. Here is a simple way to simulate the use of sort keys :-)


    Just consider a file, with different sort keys, from 1 to n :

    • L1 characters, from the column C1, included
    • L2 characters, from the column C2, included
    • L3 characters, from the column C3, included

    and so on, till :

    • Ln characters, from the column Cn, included

    with the conditions :

    C1 < C2 < C3... < Cn

    Every line of this file must be, at least, Cn + Ln - 1 characters long

    And suppose that we want to sort according to, in decreasing order of priority :

    • the third sort criterion
    • the first sort criterion
    • the nth sort criterion

    and so on, till, for example :

    • the fifth sort criterion

    The main idea is to add, by juxtaposition, at the very beginning of each line, with the help of a search/replacement, in regular expression mode, the different sort keys, in the user order

    So, IF all the sort keys are identical for some lines, an implicit sort is, then, made on the remainder of the concerned lines


    To that purpose :

    1) Create the Search/Replacement below :

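    SEARCH : ^.{C1-1}(.{L1}).{C2-C1-L1}(.{L2}).{C3-C2-L2}(.{L3})............{Cn-C(n-1)-L(n-1)}(.{Ln})

    REPLACE : \3\1\n....\5$0 ( Criterion 3, then 1, then n .... and 5, followed by the entire matched text $0 )

    Notes :

    Each part .{Ci-C(i-1)-L(i-1)} skips the characters between the end of one sort key and the start of the next one. It cannot be negative, so the sort keys must not overlap. When such a part is equal to 0 ( for instance, the leading part .{C1-1}, if C1 = 1 ), it is useless and can be omitted !

    In the replacement, \n stands for the group of the nth sort key, not for a line break

    The different Ci and Li must be replaced with the appropriate non-zero numbers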

    => The different sort keys are added, in front of each line, according to the user's preferences !

    2) Execute the Quick Sort command

    3) Get rid of the supplementary stuff, added at the beginning of each line, with the simple Search/Replacement below :

    SEARCH : ^.{L1+L2+L3+....Ln}(.*)

    REPLACE : \1

    => The added temporary sort keys are removed from each line

    Note :

    Replace the sum L1+L2+L3+....Ln by the resulting number
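
    As a rough illustration of the three steps above, here is a minimal Python sketch that computes the three expressions from a list of sort keys. The function name build_sort_key_regex and its parameters are hypothetical, not part of N++. It assumes 1-based columns, keys listed in increasing column order without overlap, and at most 9 keys, so that \1 to \9 stay unambiguous. Each gap is taken as the distance between the end of one key and the start of the next.

```python
def build_sort_key_regex(keys, order):
    """Build the three expressions of the sort-key trick.

    keys  : list of (column, length) pairs, 1-based columns, increasing,
            non-overlapping (hypothetical input format).
    order : list of 1-based key numbers, highest priority first.
    Returns (search, replace, cleanup_search) as plain strings.
    """
    parts = ["^"]
    next_col = 1  # first column not yet consumed by the pattern
    for col, length in keys:
        gap = col - next_col
        if gap < 0:
            raise ValueError("sort keys must not overlap")
        if gap:  # a zero-width gap is simply omitted
            parts.append(".{%d}" % gap)
        parts.append("(.{%d})" % length)
        next_col = col + length
    search = "".join(parts)
    # Chosen keys in user order, then the whole match ($0)
    replace = "".join("\\%d" % k for k in order) + "$0"
    # The temporary prefix has the summed length of the chosen keys
    total = sum(keys[k - 1][1] for k in order)
    cleanup_search = "^.{%d}(.*)" % total
    return search, replace, cleanup_search


search, replace, cleanup = build_sort_key_regex(
    [(12, 8), (20, 10), (37, 1)], [3, 1, 2]
)
```

    The resulting strings are meant to be pasted into the N++ Replace dialog, in Regular expression mode. The replacement of the cleanup step stays \1.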


    A concrete example :

    Three sort keys : ( C1 = 12, L1 = 8 ) , ( C2 = 20, L2 = 10 ) and ( C3 = 37, L3 = 1 )

    User Sort order : Criterion 3, then criterion 1 and, finally, criterion 2

    Then :

    1)

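    SEARCH : ^.{11}(.{8})(.{10}).{7}(.{1})

    ( 11 characters before column 12, key 1 in columns 12 to 19, key 2 immediately after, in columns 20 to 29, 7 characters for columns 30 to 36 and, finally, key 3 in column 37 )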

    REPLACE : \3\1\2$0

    2)

    Execution of the Quick Sort

    3)

    SEARCH : ^.{19}(.*)

    REPLACE : \1
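
    For readers who want to check the idea outside N++, here is a small Python sketch of the same three steps, using the key columns stated in this example ( 12 to 19, 20 to 29 and 37 ). It is only a simulation : Python's re module writes the whole match as \g<0> instead of $0, and the helper make_line, which builds 37-character test lines, is hypothetical. It assumes that every line is, at least, 37 characters long.

```python
import re

# Key 1: columns 12-19, key 2: columns 20-29, key 3: column 37
SEARCH = re.compile(r"^.{11}(.{8})(.{10}).{7}(.{1})")
CLEANUP = re.compile(r"^.{19}(.*)")  # 19 = 1 + 8 + 10


def make_line(key1, key2, key3):
    """Hypothetical helper: build a 37-character line holding the three keys."""
    return "x" * 11 + key1.ljust(8) + key2.ljust(10) + "-" * 7 + key3


def sort_by_keys(lines):
    # Step 1: prefix each line with keys 3, 1, 2 followed by the whole match
    decorated = [SEARCH.sub(r"\3\1\2\g<0>", line, count=1) for line in lines]
    # Step 2: plain sort on the entire (decorated) lines
    decorated.sort()
    # Step 3: strip the 19 temporary characters
    return [CLEANUP.sub(r"\1", line, count=1) for line in decorated]


a = make_line("bbb", "zzz", "B")
b = make_line("aaa", "yyy", "B")
c = make_line("ccc", "xxx", "A")
d = make_line("aaa", "xxx", "B")
result = sort_by_keys([a, b, c, d])
```

    Here, the line c comes first ( key 3 = A ), then d and b ( same keys 3 and 1, ordered by key 2 ) and, finally, a.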

    Hope that it'll be useful to someone !

    Cheers,

    guy038

     
    Last edit: THEVENOT Guy 2013-12-22
  • Fool4UAnyway
    2013-12-28

    If I understand this correctly, this is only applicable to fixed column length, so fixed position files.

    I would really like any sort algorithm to take tab characters into account, internally representing them by the specific number of space characters they occupy as set by their maximum size.

     
  • THEVENOT Guy
    2013-12-28

    Hello Fool4Uanyway,

    You're perfectly right ! AFAIK, the N++ Quick Sort sorts characters according to their physical position in lines.

    And, IF the n-1 first characters, before the absolute column n, are identical between two lines, then Quick Sort compares the true UNICODE code-points of the characters, at column n, in these two lines.

    Then, the line with the smallest character's code-point, at column n, is re-written first, regardless of the remainder of these two lines, from column n+1

    For example, the SIX characters : £ , ™ , € , ƒ , A and æ are sorted A , £ , æ , ƒ , € and ™ , because the different code-points are :

    \x0041 (A) , \x00A3 (£) , \x00E6 (æ) , \x0192 (ƒ) , \x20AC (€) , \x2122 (™)
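
    This ordering can be reproduced with a few lines of Python, whose built-in string comparison also works by code-point. This only illustrates the ordering itself. That the N++ Quick Sort behaves the same way is the assumption stated above.

```python
chars = ["£", "™", "€", "ƒ", "A", "æ"]

# Python compares strings code-point by code-point
ordered = sorted(chars)

# Hexadecimal code-point of each character, in sorted order
code_points = ["U+%04X" % ord(c) for c in ordered]

for c, cp in zip(ordered, code_points):
    print(cp, c)
```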


    Now, let's suppose that a tabulation takes FOUR spaces ( the default N++ tab size ). Then, with the simple S/R, in Extended or Regular expression mode, ( \t --> FOUR spaces ), the sort should be correct :-)

    See, to that purpose, the two attached pictures Default_Sort.png and Fool4_Sort.png

    AFTER the Quick Sort operation, just perform the reverse S/R ( FOUR spaces --> \t )

    Of course, the second S/R may break the position of text in your file, because your file may mix spaces and tabulations, anywhere, in the same line :(

    Best Regards,

    guy038

     
    Last edit: THEVENOT Guy 2013-12-28
  • Fool4UAnyway
    2013-12-31

    Now, let's suppose that a tabulation
    takes FOUR spaces ( the default N++ tab size ).
    Then, with the simple S/R, in Extended or Regular
    expression mode, ( \t --> FOUR spaces ),
    the sort should be correct :-)

    Of course, the second S/R may break the position
    of text in your file, because your file may
    mix spaces and tabulations, anywhere, in the
    same line :(

    No, the big problem is that not all tabs represent the maximum number of 4 spaces or any other number.

    If a line starts with any number of space characters, up to 1 less than the tab size, and then has a tab character, the tab character should be replaced by between 1 and ( tab size minus 1 ) additional space characters.

    Yes, then, to undo this, you have an additional problem because you don't know the original order of tab and space characters, but hey, since you started to replace all tab characters with the maximum number of space characters, you made a mess of it anyway...

    You simply do not want users to have to do this in order to get a good position (...).
    Notepad++'s sorting algorithm should take the "width" or virtual positions of tab characters into account when sorting (by internally replacing them with space characters first). Then, as an additional sort criterion, it should order by the actual character at the matching position (tab x09 < space x20), skipping all positions from there up to the next tab stop, like the tab character does in the text view.
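
    A minimal Python sketch of this idea, as an illustration only and not as Notepad++ code : each line is compared through its tab-expanded form, and the raw line is used as a tie-breaker, so that a tab ( x09 ) still sorts before a space ( x20 ) when the expanded texts are equal. The function name sort_with_virtual_tabs and the default tab size of 4 are assumptions.

```python
def sort_with_virtual_tabs(lines, tab_size=4):
    """Sort lines by their on-screen text, tabs expanded to tab stops.

    The raw line is the second sort criterion, so that lines which look
    identical are ordered by their actual characters (tab before space).
    """
    return sorted(lines, key=lambda line: (line.expandtabs(tab_size), line))


lines = ["\tb", "    a", "  \tc"]

# All three lines show their letter at column 5 on screen
virtual = sort_with_virtual_tabs(lines)

# A plain sort compares the raw characters, so the tabs come first
plain = sorted(lines)
```

    The lines themselves are never modified, so there is nothing to undo after the sort.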

     
  • THEVENOT Guy
    2014-01-04

    Hi Fool4Uanyway,

    Sorry to be late, but I'm on holiday this week and I began to learn the Python language ! So, I wasn't on the N++ forums these past five days !

    First of all, I wish you an excellent year 2014 and efficient work with our beloved editor !

    Indeed, I didn't think about lines beginning with a few spaces, followed by a tab narrower than the default tab, which adjusts the text to the right position :(

    But, as you can see on my attached picture Sort_Problem.png, I found a workaround to have a better sort of the tabulation and space characters !

    This workaround necessarily implies a Search/Replacement and, of course, I agree with you that the N++ Quick Sort should manage it internally ! But, in the meantime...


    So, let's suppose that the tab size is 4 ( the default case )

    We'll have to perform the following S/R, BEFORE starting the Quick Sort ( Case B ) :

    SEARCH : ([^\t\r\n]{4})|\x20{1,3}\t

    REPLACE : (?1\1:\t)

    At the bottom of my attached picture, I did a test with a line containing mixed tabulations/spaces and some text.

    => The result of the S/R seems correct and, then, allows a better sort, afterwards :-)

    NOTES :

    • The search expression tries to match either FOUR characters, different from TAB and EOL, OR from 1 to 3 spaces, followed by a tabulation \t

    • If the left part of the alternative is matched, these four characters are just replaced by...themselves ! This trivial S/R helps us to keep the correct positions of the text tabulations ! The group 1 , represents these four characters, different from TAB, which are re-written (\1), as the THEN part of the conditional replacement (?1\1:\t)

    • If the right part of the alternative is matched ( one to three spaces followed by a tabulation ) the ELSE part of the conditional replacement is executed, as group 1 doesn't exist and a complete TAB character, standing for 4 physical positions, is re-written :-)

    It's important to notice that the two parts of the alternative CAN'T be matched at the same time ! ( No TAB in the left part and, at least, a SPACE + a TAB in the right part ). So, this S/R is very safe :-)

    In the general case, where a hit on the TABULATION key represents n characters, this S/R becomes ( with n and n-1 replaced by the actual numbers ) :

    SEARCH : ([^\t\r\n]{n})|\x20{1,n-1}\t

    REPLACE : (?1\1:\t) ( unchanged ! )
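
    The same S/R can be tried in Python, as a rough sketch. Python's re module has no (?1...:...) conditional replacement, so a small function plays that role : it re-writes group 1 when the left part matched and, otherwise, a single TAB. The name normalize_tabs is hypothetical, and tab_size must be, at least, 2, so that the quantifier {1,n-1} stays valid.

```python
import re


def normalize_tabs(text, tab_size=4):
    """Replace 'spaces + TAB' runs with a single TAB, keeping text positions."""
    if tab_size < 2:
        raise ValueError("tab_size must be at least 2")
    pattern = re.compile(
        r"([^\t\r\n]{%d})|\x20{1,%d}\t" % (tab_size, tab_size - 1)
    )

    def replacement(match):
        # THEN part: the characters of group 1 are re-written as they are
        if match.group(1) is not None:
            return match.group(1)
        # ELSE part: from 1 to n-1 spaces + TAB become a single TAB
        return "\t"

    return pattern.sub(replacement, text)


samples = ["  \tabc", "abcd  \tx", "a \tb", "a\t  \tb", "    x"]
results = [normalize_tabs(s) for s in samples]
```

    On these samples, the tab-expanded text is the same before and after the replacement, which is what the sort relies on.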

    I hope that this workaround meets your needs and I'm waiting for your feedback !

    Best Regards,

    guy038

     
    Last edit: THEVENOT Guy 2014-01-04
  • Fool4UAnyway
    2014-01-05

    It's not that I need to do this kind of sorting now and am looking for a solution (it seems to be working)...

    It's just that I should not be forced to do this myself.
    Notepad++ should handle this in code, so there is not even a need for a general regular expression, because Notepad++ should just change each line for internal sorting on the fly and knows what position the characters are on. It can simply count instead of using a trick that has to take line boundaries into account.

     
  • THEVENOT Guy
    2014-01-05

    Hi Fool4Uanyway,

    I do agree with you : this work could be done, on the fly, internally, by N++ Quick Sort.

    BTW, I also tried another N++ sort plugin, that you can download from this address :

    http://william.famille-blum.org/blog/index.php

    After the download of the archive, you just have to open the Unicode Release of this archive, unzip the NppColumnSort.dll file in your plugins directory and, then, restart Notepad++

    As you can see, this plugin has several features :

    • Definition of multiple sort keys ( position and length ) and their priorities

    • The options : case sensitive or not, ascending sort or not, Numeric sort or not

    So, at first sight, it seems more interesting than the internal Quick Sort of N++

    Unfortunately, it doesn't manage accented characters at all, leaving them in the same initial order as before the sort :(( What a pity !

    In addition, it has exactly the same behaviour as the internal Quick Sort of Notepad++, regarding the mixed tabulations/spaces problem :(.

    So, although it requires knowing how many spaces a tabulation takes, the specific S/R proposed in my previous post can be used, before the sort process.

    But, what else can we do ?


    We, all, should NOT forget that N++ is a FREE product. So, when we need a new feature, I think that the right and sensible attitude should be, in order of priority :

    • 1) Ask the development team for an improvement, as long as that feature seems easy enough to code

    • 2) If this feature is not considered as a priority, try to find a plugin which could achieve this particular feature

    • 3) If, both, Notepad++ and N++ plugins give up, you can, either, ask the N++ community for a workaround and/or find out, on the Web, a product which can achieve it.

    • 4) Finally, if there's still no solution found, just try to learn enough C++ stuff in order to develop your own plugin or a specific part of N++, as, for example, the excellent UDL 2.0, and soon, UDL 3.0, created by Loreia2 or, even, a complete external tool :-)

    As for me, I dream of being able to develop for N++, but I'll have to learn, first, C++ and plenty of things, I guess !!

    Of course, it's easy to understand that, for NON-free software, our requests could be taken into account in a better way, in proportion to our legitimate contribution !

    Cheers,

    guy038

     
    Last edit: THEVENOT Guy 2014-01-05