Previously recorded at 3065455, we have found that there is a general issue newly introduced on head on Windows, I suspect related to the -DUNICODE work.
You can trigger this bug with install-tzdata, but only when you use a prefix path of sufficient length. This one in particular worked for me:
prefix = C:/Tcl-test/we-want-a-long-path/and-this-makes-it-longer/dont-you-know-longer-still/yes-this-will-be-long-enough
exec_prefix = C:/Tcl-test/we-want-a-long-path/and-this-makes-it-longer/dont-you-know-longer-still/yes-this-will-be-long-enough
bindir = ${exec_prefix}/bin
libdir = C:/Tcl-test/we-want-a-long-path/and-this-makes-it-longer/dont-you-know-longer-still/yes-this-will-be-long-enough/lib
aka --prefix=C:/Tcl-test/we-want-a-long-path/and-this-makes-it-longer/dont-you-know-longer-still/yes-this-will-be-long-enough when building.
So what's happening here? I believe something in the switch from regular to -DUNICODE mode is now overflowing some aspect of a dstring (they have static space of 200b, and must grow otherwise). Or perhaps a buffer overrun where 1b was appended where 1*sizeof(WCHAR) is needed now. In any case, the crash symptoms indicate dstring related smashing, which is of course used all over Windows file handling. The -DUNICODE will have distinctly affected this behavior as well.
One crash stack, which crashed in a mem debug build with "unable to alloc 306 bytes, z:/cvs/tcl/tcl8.6/win/tclWinFile.c line 3232". Of course I have enough mem, but some mem smashing must be occurring.
tcl86g.dll!Tcl_DbCkalloc(unsigned int size=306, const char * file=0x1018425c, int line=3232) Line 393 + 0x16 bytes C
> tcl86g.dll!TclNativeCreateNativeRep(Tcl_Obj * pathPtr=0x023d6310) Line 3232 + 0x13 bytes C
tcl86g.dll!Tcl_FSGetInternalRep(Tcl_Obj * pathPtr=0x023d6310, const Tcl_Filesystem * fsPtr=0x10132c38) Line 2160 + 0x7 bytes C
tcl86g.dll!Tcl_FSGetNativePath(Tcl_Obj * pathPtr=0x023d6310) Line 4571 + 0xe bytes C
tcl86g.dll!TclpObjStat(Tcl_Obj * pathPtr=0x023d6310, _stat64 * statPtr=0x0017ebd4) Line 1985 + 0xf bytes C
tcl86g.dll!Tcl_FSStat(Tcl_Obj * pathPtr=0x023d6310, _stat64 * buf=0x0017ebd4) Line 2034 + 0x10 bytes C
tcl86g.dll!FileCopyRename(Tcl_Interp * interp=0x023004d8, int objc=5, Tcl_Obj * const * objv=0x023015e8, int copyFlag=1) Line 147 + 0xd bytes C
BTW, this is heisenbug like, and does need multiple runs to confirm.
Trace of DString activity for a particular dstring adress.
After rewriting the DString functions to have Debug variants (like we have for Tcl_Bj machinery) I got a good trace of DString actions. The attached log is actually a reduced form which looks only actions on the DString which crashed with high guard error in DStringFree.
Investigating the locations recorded it seems that the crash is in
"TclpObjRemoveDirectory" lines 1007-1014, with "TraverseWinTree" (*)
the main suspect doing the mem smash. The DString is the native
representation of some path.
Call sequence
TclpObjRemoveDirectory
--> DoRemoveDirectory (1)
--> TraverseWinTree
The "DoRemoveJustDirectory" called by (1) seems to only read
this DString (sees it only as a TCHAR* pointer).
Regardless, this seems to be the area to look into.
I suspect the problematic line is:
nativeSource[oldSourceLen + 1] = '\0';
because if nativeSource is a TCHAR, then
this does something different than expected.
However, in order to quickly fix the HEAD
for now just compile this file without -DUNCODE
(there are more files like that, which still need
to be investigated, so that doesn' t hurt)
Therefore, keeping open until it is fully fixed.
proposed fix
I worked through the code, trying to find locations where UNICODE
strings were appended to a buffer, trying to let is end with a double
null-byte. In most places, this is done by using
Tcl_DStringSetLength(.... , len+1);
Tcl_DStringSetLength(.... , len);
The first call allocates enough space to accomodate for the extra
null byte, The second call sets the length correctly, and puts
the principal null-byte there.
However, there were two places where it was done differently.
In tclWinFCmd.c, line 1307:
nativeSource[oldSourceLen + 1] = '\0';
If oldSourceLen is the string length in bytes, and
nativeSource is a WCHAR array, this will set two
null-bytes in the string, but at a double index
of the intended location. The UNICODE string
will not be null-terminated, which explains
exactly Andreas' stack trace.
Another location was at tclWinPipe.c, line 3136:
*(WCHAR *)(namePtr + Tcl_DStringLength(&buf) + 1) = '\0';
Not that bad, but casting a char * to a WCHAR *, while
not even being sure that the pointer is aligned to
a two-byte boundry, is clearly not correct.
Here is the patch, which changed both locations to
use the double Tcl_DStringSetLength method. This
should - once and for all - fix this.
Andreas (or Jeff), could you please confirm that this
fixes the bug? Thanks!
Originally, I intended to modify tclUtil.c such that
on Windows two null-bytes are always guaranteed.
However, there is a lot of code which assumes
insertion of a single null-byte, so this lead to
test failures.
Fixed on HEAD. Thanks for the big help, Andreas!