From: SourceForge.net <no...@so...> - 2003-02-05 17:16:04
Bugs item #624919, was opened at 2002-10-17 15:07
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=624919&group_id=10894

Category: 10. Objects
Group: = 8.4.0
Status: Open
Resolution: Remind
>Priority: 4
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Don Porter (dgp)
Summary: Tcl_AppendToObj docs confusing

Initial Comment:
TCL 8.4.0 Windows XP

info exists array(name) fails when name is long.

Example (show both working and failing functions)

% set studyuid $a(0020 000d)
1.2.840.113619.2.43.16112.2141964.87.55.870876696.1
% set seriesuid a(0020 000e)
1.2.840.113619.2.43.16112.2141964.41.48.870879458.1 5
% set set study($studyuid) $seriesuid
1.2.840.113619.2.43.16112.2141964.41.48.870879458.1 5
% puts $study($studyuid)
1.2.840.113619.2.43.16112.2141964.41.48.870879458.1 5
% set study(test) 2
2
% puts [info exists study(test)]
1
% set b test
test
% puts [info exists study($b)]
1
% puts [info exists study($studyuid)]
0

This should be 1

----------------------------------------------------------------------

>Comment By: Jeffrey Hobbs (hobbs)
Date: 2003-02-05 09:22

Message:
Logged In: YES
user_id=72656

The way to convert between utf-8 and byte array is already in Tcl - you create a ByteArray obj and ask for a String obj, or vice versa.

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2003-02-05 05:40

Message:
Logged In: YES
user_id=79902

Hmm. Perhaps we need a way to convert between UTF-8 strings and byte-arrays? I would be keen on methods to convert arbitrary byte-arrays into "strings" (i.e. potentially non-UTF8 contents of the bytes member) being restricted to the testing extension, as such things are for testing of the core only and not for general use.
----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2003-01-28 09:46

Message:
Logged In: YES
user_id=80530

The [bytestring] command is used in several tests in the parse*.test and utf.test files. For example, the tests utf-1.* appear to be checking for correct UTF-8 encoding. Is there an alternative way to express these tests without allowing non-UTF-8 strings in objPtr->bytes?

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2003-01-28 05:23

Message:
Logged In: YES
user_id=79902

As far as I'm concerned, it's a fault for a Tcl_Obj string rep to contain non-UTF8 data, and the behaviour of a great fraction of the core is undefined in that case. If anyone needs this sort of thing, perhaps they should consider refactoring (perhaps with the aid of the [binary] command to indicate that it is the representation that is being examined...)

If I'm wrong in this, please explain exactly how and exactly what is being looked for in code that does this sort of trick.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2003-01-27 08:42

Message:
Logged In: YES
user_id=80530

Note that [tcltest::bytestring] is equivalent to (implemented as) [encoding convertfrom identity]. This is apparently useful to be able to test/compare strings that are not valid UTF-8.

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2002-11-16 04:33

Message:
Logged In: YES
user_id=79902

[encoding convertfrom identity] doesn't fill me with glee I must admit. In fact, I'd be ever so much happier if it didn't exist.

IMO, we should distinguish between the way that code-points are encoded as bytes, and the way that characters are mapped to code-points. In particular, [encoding] should be used to alter the mapping between characters and code-points (e.g. code-point 00A4 is the international currency symbol in UNICODE but is the euro symbol in ISO 8859-15). Where I get worried is when we start using it to muck around with the mapping to bytes; I'm of the opinion that the mapping of code-points to bytes should be (in the bytes-part of a Tcl_Obj, and consequently in Tcl_AppendToObj) based on that of UTF-8 (people can also use (possibly lossy) single-byte mappings via byte arrays and double-byte mappings via the string object type.)

So what does this all mean? Well, it means that [encoding convertfrom identity] Must Die. There should not be any way to produce non-UTF8 bytes content (using the term broadly as a way of encoding code-points as bytes, and not providing those CPs with an interpretation.)

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2002-10-22 07:08

Message:
Logged In: YES
user_id=80530

Note that this is a general problem: uncertainty about what encoding is required for the string pointed to by objPtr->bytes. It arose in Tcl Bug 584603 as well.

If we are requiring UTF-8, then there are probably additional places in the docs to note this. Also, do we know of any extensions/users of Tcl_Obj's that have used the documented freedom to have non-encoded embedded NULLs in the counted strings pointed to by objPtr->bytes that we are now belatedly declaring illegal?

Note in particular that Tcl's own command [encoding convertfrom identity] is now illegal by this documentation change, since it can return a Tcl_Obj with a non-UTF8 string rep.

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2002-10-22 05:21

Message:
Logged In: YES
user_id=79902

Hmm. The documentation was completely out of synch with the implementation; what was written and what was true were quite different things, and had been since Tcl8.1...

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2002-10-22 02:44

Message:
Logged In: YES
user_id=79902

Review of the code indicates that the manpage is wrong; 'bytes' may not contain NUL bytes (well, it won't cause a memory fault, but strange effects - as seen here - might happen) except as an end-of-string marker when 'length' is -1 (or not long enough to overlap the NUL). Indeed, it is not even documented that 'bytes' is UTF8!

Will fix...

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2002-10-22 01:30

Message:
Logged In: YES
user_id=79902

Hmm. I agree that the Tcl_AppendToObj documentation could be much clearer, and might possibly even be wrong; can we *really* take embedded NULs in the "bytes" argument correctly, or do we always need them two-byte encoded if they are not the end-of-string marker?

----------------------------------------------------------------------

Comment By: Mahlon Stacy (mahlonstacy)
Date: 2002-10-21 11:49

Message:
Logged In: YES
user_id=595029

OK, I converted the offending TclStringObj to TclByteArrayObj, without other changes, and it seems to work OK. (we'll do more testing).

FWIW, I interpreted the man page for Tcl_AppendToObj to suggest that the function would properly encode any string passed into it. Guess not. Thanks for clearing this up.

----------------------------------------------------------------------

Comment By: Jeffrey Hobbs (hobbs)
Date: 2002-10-21 10:52

Message:
Logged In: YES
user_id=72656

I don't think it does complicate things at all, you are just looking at the wrong kind of object. A "String" is a utf-8 string. You want "ByteArray"s, so string map {String ByteArray} in your code - the APIs are all there.

----------------------------------------------------------------------

Comment By: Mahlon Stacy (mahlonstacy)
Date: 2002-10-21 10:48

Message:
Logged In: YES
user_id=595029

OK, thanks Jeff. This complicates the coding for me, but I understand the issues.
I do need to keep the NULL, when it's present, but I'll have to manage the use of the value as a subscript in another way.

----------------------------------------------------------------------

Comment By: Jeffrey Hobbs (hobbs)
Date: 2002-10-21 10:40

Message:
Logged In: YES
user_id=72656

Do you really intend to have the NULL there, or do you just want to ensure that it's null terminated? If the latter, don't do anything extra - Tcl handles that. If the former, then you should either be using the Tcl_ByteArrayObj stuff, or you should use Tcl_ExternalToUtf and friends.

That said:

a) No, there are lots of other APIs to handle that, as noted above.

b) This may indicate exactly where the problem is. While you can print it just fine (and it may be holding the NULL in there, you just can't see it), the info exists may not include the null when it passes the value through a strlen or such (that's why NULL gets special encoding), which is the source of the problem you are seeing.

----------------------------------------------------------------------

Comment By: Mahlon Stacy (mahlonstacy)
Date: 2002-10-21 10:34

Message:
Logged In: YES
user_id=595029

Makes sense. But shouldn't either

a) Tcl_AppendToObj catch and fix an embedded NULL, and/or

b) whatever the subscript, if you can print the value of an object, shouldn't [info exists] on that same object always be true?

----------------------------------------------------------------------

Comment By: Jeffrey Hobbs (hobbs)
Date: 2002-10-21 10:30

Message:
Logged In: YES
user_id=72656

That's not the correct thing to do. Tcl_Obj's are supposed to be utf-8 correct as strings, with the minor exception that NULLs are represented as two bytes (\xC0\x80 IIRC) to allow them to be passed around safely. The violation of this *may* cause problems, which was the red flag that waved at me.
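
Editor's illustration: the two-byte NUL form Jeff mentions (\xC0\x80) can be sketched in plain C. This is a minimal sketch under stated assumptions, not Tcl's implementation: encode_bytes_modified_utf8 is a hypothetical helper name, it uses only the C standard library, and it assumes each input byte stands for a code point in the range 0..255.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, NOT part of the Tcl API.
 * Encodes `len` raw bytes (each treated as a code point 0..255) so that
 * the result contains no zero byte: NUL becomes the two-byte sequence
 * C0 80, and bytes >= 0x80 become ordinary two-byte UTF-8.
 * `dst` must have room for 2 * len + 1 bytes.
 * Returns the encoded length, not counting the terminator it appends.
 */
size_t encode_bytes_modified_utf8(const unsigned char *src, size_t len,
                                  char *dst)
{
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];

        if (c != 0 && c < 0x80) {
            dst[out++] = (char) c;              /* plain ASCII byte */
        } else {
            /* Two-byte form 110xxxxx 10xxxxxx; for c == 0 this is C0 80. */
            dst[out++] = (char) (0xC0 | (c >> 6));
            dst[out++] = (char) (0x80 | (c & 0x3F));
        }
    }
    dst[out] = '\0';                            /* end-of-string marker */
    return out;
}
```

Fed the 8 bytes of "21 test" plus its trailing NUL, the helper produces 9 bytes, and strlen() on the result sees all 9, whereas strlen() on the raw buffer stops after 7. In real code the byte-array object APIs or Tcl_ExternalToUtf, as suggested in this thread, are the way to go; the sketch only shows why the encoded form survives strlen-style handling.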
----------------------------------------------------------------------

Comment By: Mahlon Stacy (mahlonstacy)
Date: 2002-10-21 10:25

Message:
Logged In: YES
user_id=595029

No, I don't think so.

strlen(buffer) is 7; i = 7
range of buffer[] is 0 - 6
buffer[7] = NULL sets the 8th char to NULL
i++ increments the length to 8;

in this example, buffer[7] was already null because we used strcpy. But in my working program, the values are not null terminated, they are described by length. Using the construct above just guarantees a null at the end of the value. Also, buffer is declared as an array... there's no overrun.

----------------------------------------------------------------------

Comment By: Jeffrey Hobbs (hobbs)
Date: 2002-10-21 10:14

Message:
Logged In: YES
user_id=72656

Woah, bogosity filter hitting hard:

    strcpy(buffer,"21 test");
    i = strlen(buffer);
    buffer[i] = (char) NULL;
    i++;
    Tcl_AppendToObj(element,buffer,i);

What's with i++ here? That's telling AppendToObj to take more bytes than are valid out of buffer ...

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2002-10-21 09:55

Message:
Logged In: NO

Fair enough. Here's a sample that fails.
C Source code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#ifdef _MSC_VER
#include <io.h>
#include <fcntl.h>
#include <winsock.h>
#endif
#include "tcl.h"

int test_tcl(ClientData clientData, Tcl_Interp *interp,
        int argc, char **argv)
{
    int i;
    char buffer[512];
    Tcl_Obj *element, *theList, *theString;

    theString = Tcl_NewStringObj("ELEMENTLIST", -1);
    theList = Tcl_NewListObj(1, &theString);
    sprintf(buffer,"");
    element = Tcl_NewStringObj(buffer, strlen(buffer));
    strcpy(buffer,"21 test");
    i = strlen(buffer);
    buffer[i] = (char) NULL;
    i++;
    Tcl_AppendToObj(element,buffer,i);
    Tcl_ListObjAppendElement(NULL,theList,element);
    Tcl_SetObjResult(interp,theList);
    return(TCL_OK);
}

#ifdef WIN32
__declspec(dllexport)
#endif
int Testtcl_Init(Tcl_Interp *interp)
{
    Tcl_CreateCommand (interp, "testobject", test_tcl,
            (ClientData) NULL, (Tcl_CmdDeleteProc *)NULL);
    return(TCL_OK);
}

Compile the source into a shared library (I've done this on both PC and SGI, and both fail). Then start tclsh and execute this script:

% load testtcl.dll
% array set a [testobject]
% parray a
a(ELEMENTLIST) = 21 test
% set b $a(ELEMENTLIST)
21 test
% string length $b
8
% set r($b) 2
2
% puts $r($b)
2
% info exists r($b)
0

-Mahlon

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2002-10-21 09:08

Message:
Logged In: YES
user_id=80530

I think we have to assume this is a bug in your program, unless you provide a complete bit of code that supplies legal arguments to Tcl_AppendToObj(), but then produces results that are contrary to the documentation.

Your followup is an improvement over the original report, but still does not provide enough information for anyone else to reproduce your problem. (What is "elementItem" ? What values do *scratch and stringLength have when passed into Tcl_AppendToObj(). etc...)
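
Editor's illustration: the length arithmetic in the 09:55 sample (strlen gives 7, the i++ makes the counted length 8) can be checked without Tcl. The sketch below is standard C only and makes no claim about how Tcl compares array keys internally. CountedStr, counted_equal and cstring_equal are hypothetical names. It shows one plausible way the same value could match under one comparison and miss under another, depending on whether the code honours the counted length or stops at the first NUL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted string: a pointer plus an explicit byte count. */
typedef struct {
    const char *bytes;
    size_t length;
} CountedStr;

/* Compare using the stored lengths, so an embedded NUL is significant. */
int counted_equal(CountedStr a, CountedStr b)
{
    return a.length == b.length
        && memcmp(a.bytes, b.bytes, a.length) == 0;
}

/* Compare as C strings, so everything from the first NUL on is ignored. */
int cstring_equal(CountedStr a, CountedStr b)
{
    return strcmp(a.bytes, b.bytes) == 0;
}
```

With the sample's buffer and i == 8, the counted comparison against the 7-byte "21 test" reports a mismatch while the C-string comparison reports a match.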
----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2002-10-21 08:49

Message:
Logged In: NO

Yeah, worked for me on what I could type in, too, but the bug persists. I did some more digging. Here's the scenario. The offending array arguments have embedded nulls at the end. The objects are created in C:

dicomDumpObject( ...tcl args ...)
{
    char buffer[8192];
    Tcl_Obj *theSize, *element, *theTagd, *theTagl, *elementsize,
            *theList, *theString;

    sprintf(buffer,"");
    element = Tcl_NewStringObj(buffer, strlen(buffer));
    theString = Tcl_NewStringObj("ELEMENTLIST", -1);
    theList = Tcl_NewListObj(1, &theString);

    stringLength = elementItem->element.length;
    strncpy((char *)scratch, elementItem->element.d.string, stringLength);
    scratch[stringLength] = '\0';
    Tcl_AppendToObj(element,(char *)scratch,stringLength);
    Tcl_ListObjAppendElement(NULL,theList,element);
    Tcl_SetObjResult(interp,theList);
    return(TCL_OK);
}

There are other items in the list, such that the entire list is putarray format, so to read the objects into TCL, I use:

    array set a [dicomDumpObject $o]

This populates the array correctly, but the subscripts that contain appended nulls fail when using [info exists a($v)].

Yes, the string array names are ASN values, probably much like LDAP. This procedure uses TCL from end to end, after the values are copied in using Tcl_AppendToObj, which according to the man page, handles almost anything.

-Mahlon

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2002-10-18 02:42

Message:
Logged In: YES
user_id=79902

Beats me what's going on, though those long strings remind me of LDAP, so if there's a problem with an extension mutating objects when it shouldn't, that could be what's going on. In any case, it works for me going on the basis of what I can type in.
----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2002-10-17 20:55

Message:
Logged In: YES
user_id=80530

Can anyone else make sense of this? Can the submitter try again? A cut and paste of an actual interactive session, or a demo script would be an improvement.