I'm trying to use a grep pattern to make LineSort ignore numbers and tabs when sorting. That way this...
23 Washington
601 Adams
5 Lincoln
will be sorted alphabetically, like this...
601 Adams
5 Lincoln
23 Washington
I've tried a number of patterns that seem to do what I want using BBEdit's Find feature, but they don't work when I try to use them with LineSort.
I'm sure this is a user error, not a bug, but for the record I'm using LineSort 5.0.7, BBEdit 6.5, and Mac OS X 10.2.3.
Your help will be much appreciated.
Your description "ignore numbers and tabs" is somewhat ambiguous, but I made a simple pattern that will sort the data as desired.
Skip Number
^[\d]*\t(.*)$
\1
This is based on the other "Skip" patterns. The general idea is to make a Find substitution that will match the first pattern and replace it with the second pattern, which is then sorted.
The above pattern is as follows:
"^" starts at the beginning of the line (so that it doesn't match birthdays, number of years in office, etc., that may occur after the name.)
"[\d]*"
The brackets aren't necessary, since there's only one element in the list, but since I just did a copy&paste of the Skip Space pattern, I left them. The "\d" means "any digit" (0-9). You could also use the old-style "[0-9]".
The "*" means "zero or more" of the previous pattern, thus matching any quantity of contiguous digits.
The "\t" matches one tab. If you have multiple tabs to skip, then use "\t+" (one or more), or "\t*" if there may not be any tabs at all.
The rest of the pattern, "(.*)$" is two things: ".*$", which matches the rest of the line, and "()" which stores that matched pattern.
The following line is where you put the pattern for the data you want to sort, as in a Find substitution. The "\1" is replaced by the first stored pattern.
Thus, the output is "the rest of the line", after ripping off the digits and the single tab.
The reason I said your description was ambiguous is that "numbers and tabs" could be read to be any quantity of columns of numbers, separated by one or more tabs. If that were the case, you'd want a pattern like "^[\d\t]*(.*)$", which will continue removing any quantity of digits and tabs until something non-(digit or tab) is encountered.
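The match-then-replace-then-sort idea above can be sketched outside LineSort. This is only an illustration in Python, whose `re` module accepts these particular patterns unchanged; the function name `sort_with_pattern` is made up for this sketch, and returning the non-matching lines as a separate list is an assumption, not a description of what LineSort does with them.

```python
import re

# The two search patterns discussed above.
SKIP_NUMBER = re.compile(r"^[\d]*\t(.*)$")
SKIP_NUMBERS_AND_TABS = re.compile(r"^[\d\t]*(.*)$")

def sort_with_pattern(lines, pattern, template=r"\1"):
    """Sort whole lines by the text the replacement template produces.

    Hypothetical helper: lines that do not match the pattern are
    returned separately as rejects (an assumption of this sketch).
    """
    keyed, rejects = [], []
    for line in lines:
        match = pattern.match(line)
        if match is None:
            rejects.append(line)
        else:
            # expand() fills in \1 the way a Find substitution would.
            keyed.append((match.expand(template), line))
    keyed.sort(key=lambda pair: pair[0])
    return [line for _, line in keyed], rejects

lines = ["23\tWashington", "601\tAdams", "5\tLincoln"]
ordered, rejects = sort_with_pattern(lines, SKIP_NUMBER)
print(ordered)   # Adams, Lincoln, Washington order; the numbers stay attached
```

With several leading number columns, such as `"12\t\t34\tAdams"`, only the second pattern reduces the sort text to the name; the first one leaves `"\t34\tAdams"`.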
Let's continue.
Take a look at the "domain@id" pattern. You'll see the use of two sets of parentheses; "\2" precedes "\1" in the output, which re-orders the result (the second stored pattern, followed by "@", then the first stored pattern).
So if your data isn't unique, i.e., you have multiple instances of the same name with different numbers attached to them, you might want to sort first by name, then by number.
You could use the pattern "^([\d]*)\t(.*)$" to store the number, ignore the tab, then store the rest of the line (which, for simplicity, I'll assume consists only of the name). The output pattern would be "\2\t\1", that is, name, then tab, then number. (The tab probably isn't necessary for the sorting process, and could be a space or some other separator, but I prefer to be precise.)
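As a rough illustration of the re-ordered output pattern, here is a Python sketch that builds the "\2\t\1" sort text and sorts by it. The helper name `name_number_key` is made up, and the comparison is Python's plain text comparison, which is an assumption and may differ from how LineSort compares its sort data.

```python
import re

# Store the number in \1 and the rest of the line in \2.
NUMBER_THEN_NAME = re.compile(r"^([\d]*)\t(.*)$")

def name_number_key(line):
    """Return "name<TAB>number" for a "number<TAB>name" line (hypothetical helper)."""
    match = NUMBER_THEN_NAME.match(line)
    if match is None:
        raise ValueError(f"line does not match the pattern: {line!r}")
    return match.expand(r"\2\t\1")

lines = ["9\tAdams", "5\tLincoln", "20\tAdams"]
ordered = sorted(lines, key=name_number_key)
print(ordered)
```

Because the sort text is compared as text in this sketch, "Adams\t20" comes before "Adams\t9": the character "2" sorts before "9". Sorting the numbers by value would need a numeric comparison, which is outside what the pattern itself can express.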
And just in case you do have other columns, let's complete the job by making a pattern that will sort by name and number, ignoring the rest of the fields: "^(\d*)\t(.*)\t.*$". First, we store the number (but not the tab), then we store the next field (anything up to the next tab), then we match but don't store the rest of the line, so that it doesn't get put into the sort data. (The ".*$" is probably superfluous, since it will be discarded anyway, but it doesn't hurt to show exactly what you want to have happen.) The sort data pattern would be "\2\t\1", that is, name followed by tab followed by the number, with the rest of the input line discarded.
Since you didn't provide any of the patterns you tried, I wasn't sure what part of the process was giving you trouble. I hope this helps.
Thank you, Douglas. I must say I didn't expect to receive such a prompt, detailed, and lucid answer. I understand this stuff better now, though admittedly I'm still a newbie.
What was tripping me up was the use of parentheses--actually, I was leaving them out. Having only read the BBEdit help files, I was entering something like [^\d\t].*$ in LineSort. (I think I'll order a copy of Mastering Regular Expressions soon.)
The pattern you provided works, and I see how it works, but there's still something that puzzles me: Why does it recognize "20\tAdams" and "Adams" as duplicates, but not "20\tAdams" and "9\tAdams"? Is there a way to recognize all of these as duplicates and remove them?
Thanks for your help,
Joshua
Glad I could help. In your original message, you mentioned entering patterns in the Find dialog. I recommend doing that with a copy of your data to view the result of your match and replace patterns, which would then be sorted if you had entered those patterns in LineSort. If the Find result doesn't look right, then it can help you figure out what's wrong with your patterns. I do recommend a short but intensive study of regular expressions; I did so twenty years ago, and it has repaid me well. (Things have changed since then; I still need to familiarize myself with the PCRE engine used in BBEdit!)
Regarding duplicates: I found something in CSortEngine.cp between lines 220-269 which appears to be where this is addressed. (I'm not actually familiar with the source code, so I could be wrong.) First the code rejects duplicates; then it applies the pattern, rejecting non-matches and processing matches for sorting. The "Adams" doesn't match the pattern (since the pattern requires a tab on the line), so it appears that the reason it is being rejected is not because it's a duplicate, but because it's not a pattern match. If my quick read of the code is correct (without really knowing what's going on), then you can't Keep One Duplicate based on LineSort pattern matching.
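The order of operations described above can be modelled in a few lines of Python. This is a simplified model based only on the description in this post, not a translation of CSortEngine.cp; the function name `original_order` is made up for the illustration.

```python
import re

PATTERN = re.compile(r"^[\d]*\t(.*)$")   # the Skip Number search pattern

def original_order(lines):
    """Model: remove duplicates by whole line first, then apply the pattern."""
    # Step 1: duplicates are found by comparing the complete lines.
    unique = list(dict.fromkeys(lines))
    # Step 2: lines that do not match the pattern are rejected.
    kept = [line for line in unique if PATTERN.match(line)]
    rejected = [line for line in unique if not PATTERN.match(line)]
    return kept, rejected

kept, rejected = original_order(["20\tAdams", "9\tAdams", "Adams"])
print(kept)       # both numbered Adams lines survive
print(rejected)   # the bare "Adams" is rejected as a non-match
```

In this model "20\tAdams" and "9\tAdams" are never compared by name, because the duplicate check runs before the pattern has reduced them to "Adams".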
Hey, wait! This is open source! I just took the code, moved lines 220-234 to after line 269 (moving "find and tag duplicate lines" to after "format lines using a pattern"), built it, installed it and re-ran your example. It kept "20\tAdams" and removed "9\tAdams" as a duplicate! Cooool! Thank you, Craig!
So you have two choices: hack the code as above, or pre-process the data. If you don't care about the numbers at all, and just want to remove duplicates, you could enter the pattern in the Find dialog, producing the list of names, then use LineSort (without a pattern) to sort and remove duplicates. (For example, in Find use "^\d*\t*(.*)$" as the Search For pattern and "\1" as the Replace With pattern.)
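The pre-processing route can be sketched in Python as well, using the Search For and Replace With patterns just given. The function name `names_only` is made up, and `sorted(set(...))` merely stands in for running LineSort without a pattern and removing duplicates.

```python
import re

def names_only(lines):
    """Strip leading digits and tabs from each line, as the Find replacement would."""
    return [re.sub(r"^\d*\t*(.*)$", r"\1", line, count=1) for line in lines]

lines = ["20\tAdams", "5\tLincoln", "9\tAdams", "Adams"]
names = names_only(lines)

# Stand-in for a plain sort with duplicates removed.
unique_sorted = sorted(set(names))
print(unique_sorted)
```

Note that this discards the numbers entirely, so it only fits the case where they are not needed afterwards.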
A further (embarrassing) note: I hadn't ever used patterns in LineSort, so when I chose to Edit the patterns to write the reply to your original question, I didn't really look at the examples supplied. The pattern "Column 2+" is very close to what I provided, except that it removes the first column whether or not it consists of digits. After I looked at these examples, I realized that I had made a common error in the example in the next-to-the-last paragraph in my message: "^(\d*)\t(.*)\t.*$". I said, "anything up to the next tab." But in fact, that will store everything up until the _last_ tab. The examples in the file are better. Rather than "(.*)\t", use "([^\t]*)\t". The reason is that "." is 'everything including tabs', and since grep is greedy, it will store all text and tabs until it hits the last tab. The "[^\t]" is 'everything except a tab', so it will store anything that isn't a tab until the next tab, which is the desired behavior.
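The difference between the two forms is easy to see on a line with more than two tabs. The sketch below uses Python's `re` module and a made-up sample line; the field names "field3" and "field4" are placeholders, not data from this thread.

```python
import re

GREEDY = re.compile(r"^(\d*)\t(.*)\t.*$")        # stores up to the LAST tab
NEGATED = re.compile(r"^(\d*)\t([^\t]*)\t.*$")   # stores up to the NEXT tab

line = "9\tAdams\tfield3\tfield4"   # hypothetical four-column line

greedy_key = GREEDY.match(line).expand(r"\2\t\1")
negated_key = NEGATED.match(line).expand(r"\2\t\1")

print(repr(greedy_key))    # the third column leaks into the sort text
print(repr(negated_key))   # only the name and the number remain
```

On a line with exactly three columns the two patterns give the same result, which is why the mistake is easy to miss when testing with short sample data.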
Your point about duplicates is interesting. I have no idea of the overall consequences of processing duplicates after patterns (it may be too confusing for unsophisticated users), but I'm curious what Craig and the other LineSort folks think of the idea of eliminating duplicates based on the pattern, rather than before.
A further note: I went back and did some more testing on the hack, and the set of lines "20\tAdams", "9\tAdams", "Adams" yields the result of all Adams being deleted. If "Adams" appears before the last "N\tAdams", then it leaves the last "N\tAdams". It does appear that there are some side effects to my hack involving pattern rejections and pattern matches.
That's an interesting hack--and a nice illustration of the flexibility of open source software. Since I have BBEdit 7, I was able to try another approach. I entered the pattern we've been working on into the Sort Lines dialog box, then used Process Duplicate Lines. Behold, 9\tAdams and 0\tAdams were recognized as duplicates and rejected.
How exactly PDL works differently from LineSort, and whether there are any tradeoffs to PDL's approach, are questions I'll leave to others.
I ran into a few other problems after solving that one, but I'm going to put this project aside for a while and come back to it after reading up on regular expressions.
Thanks again for your help, Douglas.
I wrote a document a few years ago that describes LineSort's sorting logic in great detail. Unfortunately, it fell out of date and now all I have to offer is the source code.
Doug's hack should have worked correctly. The fact that it didn't always work reveals a logic error in the FindDuplicates function.
FindDuplicates tags each line in a duplicate set as a dup, except the last line. (This is arbitrary -- it could just as well keep only the first line.)
Let's assume Doug chooses "Column 2+" in the pattern menu and checks Keep One Duplicate, and runs the hacked LineSort on the following lines:
9\tAdams
20\tAdams
Adams
The "Adams" line gets tagged as a reject because it doesn't match the pattern -- there is no column 2 in this line.
FindDuplicates doesn't notice this, and the first two lines get tagged as duplicates. Then the TransferLines function decides all three lines are unwanted, and removes them all. Ouch.
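The interaction described above can be reproduced with a small Python model. It is a sketch built only from the description in this thread, not from the LineSort source: the Python functions merely borrow the names FindDuplicates and TransferLines, the Skip Number pattern stands in for "Column 2+" (both reject a line without a tab), and treating a rejected line's full text as its comparison key is an assumption. The `skip_rejects` switch shows one possible fix, namely leaving rejected lines out of the duplicate check; it is a suggestion, not the fix the LineSort authors chose.

```python
import re

PATTERN = re.compile(r"^[\d]*\t(.*)$")   # stand-in for "Column 2+"

def tag_lines(lines):
    """Hacked order, step 1: format each line with the pattern, tagging rejects."""
    records = []
    for text in lines:
        match = PATTERN.match(text)
        records.append({
            "text": text,
            # Assumption: a rejected line is compared by its full text.
            "key": match.group(1) if match else text,
            "reject": match is None,
            "dup": False,
        })
    return records

def find_duplicates(records, skip_rejects):
    """Tag every line in a duplicate set as a dup, except the last one."""
    last_index = {}
    for i, rec in enumerate(records):
        if skip_rejects and rec["reject"]:
            continue
        last_index[rec["key"]] = i
    for i, rec in enumerate(records):
        if skip_rejects and rec["reject"]:
            continue
        rec["dup"] = last_index[rec["key"]] != i

def transfer_lines(records):
    """Keep only lines that are neither rejects nor dups."""
    return [r["text"] for r in records if not r["reject"] and not r["dup"]]

def run(lines, skip_rejects=False):
    records = tag_lines(lines)
    find_duplicates(records, skip_rejects)
    return transfer_lines(records)

print(run(["9\tAdams", "20\tAdams", "Adams"]))                     # everything removed
print(run(["Adams", "9\tAdams", "20\tAdams"]))                     # last numbered line kept
print(run(["9\tAdams", "20\tAdams", "Adams"], skip_rejects=True))  # possible fix
```

In the first call the rejected "Adams" line is the last member of the duplicate set, so it is the one spared by the duplicate check and then dropped as a reject, while both numbered lines are dropped as dups.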
We could certainly change LineSort to conform with the way BBEdit handles dups. If we use Doug's modification, we will need to fix the logic error as well.