#714 lsort -dictionary does not work properly with ` character

obsolete: 8.0.3
closed-fixed
nobody
2
2014-08-14
2000-10-26
Anonymous
No

OriginalBugID: 1357 Bug
Version: 8.0.3
SubmitDate: '1999-03-02'
LastModified: '2000-01-13'
Severity: MED
Status: Released
Submitter: pat
ChangedBy: ericm
OS: Solaris
OSVersion: 2.5.1
Machine: Other
FixedDate: '2000-01-13'
FixedInVersion: 8.3b2
ClosedDate: '2000-10-25'

Name: Jason Mechler

ReproducibleScript:
START_OF_SCRIPT
#!/home/jasonm/ashl/bin/tclsh8.0

set infilename [lindex $argv 0]
set outfilename [lindex $argv 1]

set infilep [open $infilename r]
set data [read -nonewline $infilep]
close $infilep

set newdata [lsort -dictionary [split $data \n]]

set outfilep [open $outfilename w]
foreach item $newdata {
puts $outfilep $item
}
close $outfilep
END_OF_SCRIPT

START_OF_INPUT_DATA
`
addrf
COMMAND:
!
acicheckin
>
$
d
%
!!
diagnose
deleterf
acicheckout
c
ACI
END_OF_INPUT_DATA

ObservedBehavior:
When the data in section 6 is used as input to the script in section 6,
the output is the data below, which is not in correct dictionary order.
(Not the positioning of addrf and the ` character)

The problem does not happen if there are no elements in the list that begin
with uppercase letters. I think the problem is that there are a few ASCII
characters between the uppercase and lowercase letters. This is not taken
into account in the DictionaryCompare function.

!
!!
$
%
>
ACI
acicheckin
acicheckout
c
COMMAND:
`
addrf
d
deleterf
diagnose

DesiredBehavior:
The alphanumeric portion of the list should be sorted in correct dictionary
order. I don't really care where the ` goes, but it should definitely be
with the other symbols instead of in the middle of the letters

This problem still exists in 8.2, and can cause differing results
when you move the ` around (yes - the output of the lsort depends
on the order of the input - very wrong). See results:

(tkcon) 64 % set b
` addrf COMMAND: ! acicheckin > {$} d % !! diagnose deleterf acicheckout c ACI
(tkcon) 65 % lsort -diction $b
! !! {$} % > ACI acicheckin acicheckout c COMMAND: ` addrf d deleterf diagnose
(tkcon) 66 % lsort -diction [lrange $b 1 end]
! !! {$} % > ACI acicheckin acicheckout addrf c COMMAND: d deleterf diagnose
(tkcon) 67 % lsort -diction [concat [lrange $b 1 end] [lindex $b 0]]
! !! {$} % > ACI ` acicheckin acicheckout addrf c COMMAND: d deleterf diagnose

-- 08/07/1999 hobbs

Minimal case:

AA ` c CC ==> AA ` c CC
AA c ` CC ==> AA c CC `

The problem is caused by the mergesort algorithm combined with the dictionary comparison function. The funcction basically relies on the ASCII values of the characters in the strings to determine the sorting order. Characters with an ASCII value > Z and < a cause problems, because in a dictionary compare, upper and lower case are considered equivalent.

Unfortunately, in the comparison function, the upper/lower equivalency logic is placed AFTER the straight ASCII comparison. If an inbetween char is compared to a lowercase letter, it will be sorted before that letter; if it is compared to an uppercase letter, it wil be sorted after that letter. Thus the differences between the outputs for the two inputs shown -- the inbetween char is being compared to a different char, basically, in each case.

A solution is to convert all alpha chars to upper or lower case, and then do the comparison. This incurs a minor speed penalty, as we now have an extra conversion step, but as others have noted, it doesn't matter how fast it is if it gives you the wrong answer.

I elected to convert to lower instead of upper; this has the effect that the inbetween characters ` ^ \ _ [ and ] will sort before alpha chars. This seems appropriate because most other punctuation occurs before A as well.

This is now fixed in 8.3b2.

- eric

-- 01/13/2000 ericm

Discussion

  • Brent B. Welch

    Brent B. Welch - 2000-10-26
    • priority: 5 --> 2
    • status: open --> closed-fixed
     
  • Don Porter

    Don Porter - 2001-04-05
    • labels: 104244 --> 17. Commands I-L