Consider this command:
```
expect -c 'spawn sh -c "echo -n -e \"abc\""; expect -re "(a?)(a)(bc)"; puts "\n"; for { set i 1 } { $i < 4 } { incr i } { puts -nonewline "($i): \""; puts -nonewline $expect_out($i,string); puts "\"" }'
```
With recent versions (8.6.17) of Tcl/Expect I see output like this:
```
spawn sh -c echo -n -e "abc"
abc
(1): "abc"
(2): "a"
(3): "bc"
```
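The first of these, marked (1), is, I think, wrong. Since `(a)` has to consume the only `a` in the input, `(a?)` can only match the empty string, so (1) should be the empty string.
Looking in `expect.c` we see this block of code in `expMatchProcess`: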
```
Tcl_RegExpGetInfo(re, &info);
buf = Tcl_NewUnicodeObj (buffer,esPtr->input.use);
for (i=0;i<=info.nsubs;i++) {
    int start, end;
    Tcl_Obj *val;
    start = info.matches[i].start;
    end = info.matches[i].end-1;
    if (start == -1) continue;
    if (e->indices) {
        /* start index */
        sprintf(name,"%d,start",i);
        sprintf(value,"%d",start);
        out(name,value);
        /* end index */
        sprintf(name,"%d,end",i);
        sprintf(value,"%d",end);
        out(name,value);
    }
    /* string itself */
    sprintf(name,"%d,string",i);
    val = Tcl_GetRange(buf, start, end);
```
This is where the substring matches are collected. From `man 3 Tcl_RegExpGetInfo` we learn:
```
The start and end values are Unicode character indices relative to the offset location within
objPtr where matching began. The start index identifies the first character of the matched
subexpression. The end index identifies the first character after the matched subexpression.
If the subexpression matched the empty string, then start and end will be equal. If the
subexpression did not participate in the match, then start and end will be set to -1.
```
So, within the above block, for an empty match, `end < start` due to the `-1` in the line `end = info.matches[i].end-1;`.
If the match was at the start of the buffer then `start == 0 && end == -1`.
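The index arithmetic can be checked in isolation. The following is only a minimal sketch: the spans are the ones I would expect the regexp engine to report for `(a?)(a)(bc)` against `abc` (inferred from the analysis here, not captured from a real run), and `span_t`, `inclusive_end` and `example_spans` are hypothetical names that do not exist in `expect.c`.

```c
#include <assert.h>

/* Hypothetical stand-in for one entry of info.matches[]:
 * start is the first matched character, end is one past the last. */
typedef struct {
    long start;
    long end;
} span_t;

/* Mirrors the line `end = info.matches[i].end-1;` from expMatchProcess. */
static long inclusive_end(span_t s)
{
    assert(s.end >= s.start);   /* holds for every span that participated */
    return s.end - 1;
}

/* Assumed spans for the example; index 0 is the whole match. */
static const span_t example_spans[4] = {
    { 0, 3 },   /* whole match "abc"           */
    { 0, 0 },   /* (a?) matched empty at index 0 */
    { 0, 1 },   /* (a)  matched "a"            */
    { 1, 3 },   /* (bc) matched "bc"           */
};
```

For `example_spans[1]` the inclusive end comes out as -1 while `start` is 0, which is the `end < start` case described above.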
Next, from `man 3 Tcl_GetRange` we learn:
```
Tcl_GetRange returns a newly created value comprised of the characters between first and last
(inclusive) in the value's Unicode representation. If the value's Unicode representation is
invalid, the Unicode representation is regenerated from the value's string representation. If
first < 0, then the returned string starts at the beginning of the value. If last < 0, then
the returned string ends at the end of the value.
```
This text doesn't say what happens in the general case when `end < start`, but in the specific case where `end == -1` the returned string runs to the end of the input buffer, which is exactly what we see.
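To make that concrete, here is a small model of the index rules as the man page words them. It is a sketch under assumptions, not the real implementation: `model_get_range` is a hypothetical helper that works on plain C strings rather than `Tcl_Obj` values, and clamping `last` to the string length is my own addition for memory safety, not something the quoted text states.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the documented index rules only; the real
 * Tcl_GetRange operates on Tcl_Obj values and Unicode characters. */
static size_t model_get_range(const char *src, long first, long last,
                              char *dst, size_t dstsize)
{
    long len;
    size_t n = 0;

    assert(src != NULL && dst != NULL && dstsize > 0);
    len = (long)strlen(src);

    if (first < 0)
        first = 0;              /* "starts at the beginning of the value" */
    if (last < 0 || last >= len)
        last = len - 1;         /* "ends at the end of the value"; the
                                 * upper clamp is an added safety check */

    /* Copy the inclusive range first..last, truncating to fit dst. */
    for (long i = first; i <= last && n + 1 < dstsize; i++)
        dst[n++] = src[i];
    dst[n] = '\0';
    return n;
}
```

Under this model, asking for the range 0 to -1 of `abc` yields the whole string `abc` rather than the empty string, matching the output shown for (1).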
The current calculation of `end` cannot change, otherwise the `$expect_out(1,end)` values would change, which could break existing code.
So I think what is needed is to avoid calling `Tcl_GetRange` when `end < start`, which is what the patch I propose does. There's also a new test included.
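The shape of such a guard can be sketched on plain C strings. This is an illustration of the idea only, not the patch itself: `submatch_string` is a hypothetical helper, and the real code would keep using `Tcl_GetRange` for the non-empty case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from the proposed patch: copies the
 * submatch [start, end_exclusive) of buf into dst, treating an empty
 * match as the empty string instead of requesting an inverted range. */
static size_t submatch_string(const char *buf, long start, long end_exclusive,
                              char *dst, size_t dstsize)
{
    long end = end_exclusive - 1;   /* same inclusive end as expect.c */
    size_t n = 0;

    assert(buf != NULL && dst != NULL && dstsize > 0);
    assert(start >= 0 && end_exclusive <= (long)strlen(buf));

    if (end >= start) {
        /* Non-empty match: this is where expect.c calls Tcl_GetRange. */
        n = (size_t)(end - start + 1);
        if (n >= dstsize)
            n = dstsize - 1;        /* truncate to fit the output buffer */
        memcpy(dst, buf + start, n);
    }
    /* Empty match (end < start): the range lookup is skipped entirely. */
    dst[n] = '\0';
    return n;
}
```

With the guard in place an empty match at offset 0 produces the empty string, while the value of `end` itself, and so what would be stored under `N,end`, is computed exactly as before.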
I agree that the behavior is wrong and your patch looks good.