From: SourceForge.net <no...@so...> - 2009-02-05 15:21:12
|
Bugs item #1665628, was opened at 2007-02-21 21:44 Message generated for change (Comment added) made by dkf You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1665628&group_id=10894 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: 12. ByteArray Object Group: obsolete: 8.4.14 Status: Open Resolution: None Priority: 5 Private: No Submitted By: Jeff Rogers (dvrsn) Assigned to: Donal K. Fellows (dkf) Summary: cannot efficiently modify a ByteArray Initial Comment: There is no way to efficiently modify a ByteArray, either by appending data to the end or by removing data from the front (as might be useful for a binary stream buffer). Consider one task to read from a binary socket and buffer the data and another to decode counted strings from the front of the binary string: variable buffer proc read_handler {sock} { variable buffer append buffer [read $sock] } proc process_buffer {} { variable buffer if {[string length $buffer] > 0} { binary scan $buffer "I" dlen binary scan $buffer "x4 a$dlen" data binary scan $buffer "x4 x$dlen a*" buffer } } The first problem is that "append" only operates on string reps, which requires the entire ByteArray to be scanned and converted to the utf-8 (or unicode) encoding. (append also appears to unset the length of the string, meaning that a subsequent "string length" must recompute it). The second problem is that scanning a few bytes from the front of the ByteArray and the modifying the original means that the entire ByteArray needs to be copied. (maintaining a separate cursor is efficient but clumsy). "append" should probably do the sensible thing and append ByteArrays efficiently (as well as setting the length correctly), but since this is specifically dealing with binary strings it may make sense to implement these operations entirely as subcommands of "binary". I suggest binary append varName ?arg arg ...? to append the ByteArray values of all args to the named binary value; and binary dscan varName formatString ?varName varName ...? dscan = "destructive scan", with functionality identical to "binary scan" except that it takes a varName instead of a string, and it changes the ByteArray object to have the start offset at the end of the format arguments This would need the ByteArray object to store an offset indicating where the start of the data is. The length computation of a ByteArray object would need to be changed to use the offset but this should be contained within tclBinary.c. The destructive operation could either just change the offset or actually modify the underlying bytes. If it works by changing the offset then there would need to be some point at which it would actually modify the bytes because the memory between the start of the bytes and the offset is wasted; this point could be some hardcoded maximum or a heuristic measure based on the overall size of the data. (i.e., if the wasted space is larger than the used space, copy the live data to the beginning of the allocated space, limiting the wasted space to 100% overhead and using a non-overlapping move operation). Another method for modifying a binary object could be binary set varName formatSting ?arg arg ...? which would be something like "binary format" working on an existing bytearray. ---------------------------------------------------------------------- >Comment By: Donal K. Fellows (dkf) Date: 2009-02-05 15:21 Message: You want (and have time) to go through and audit all the code to see what new error cases that introduces? ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2009-02-05 15:07 Message: Thanks Donal. Still, you didn't answer: Why don't we raise an error in the conversion to byte array in that case ? After all, conversions *to* internal reps are allowed to fail (eg ill-formed lists).These situations occur when the system encoding doesn't allow proper mapping of some of the input. Why hide that data loss ? ---------------------------------------------------------------------- Comment By: Donal K. Fellows (dkf) Date: 2009-02-05 14:41 Message: I've done that already (see 2568434). Still keeping this issue open as a master. Currently converted things: Tcl_AppendObjToObj Tcl_GetCharLength Tcl_GetRange Tcl_GetUniChar TEBC:INST_CONCAT1 TEBC:INST_STR_CMP TEBC:INST_STR_LEN TEBC:INST_STR_INDEX TEBC:INST_STR_MATCH This means, for example, that [append] now does the Right Thing whether or not it is compiled. ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2009-02-05 14:04 Message: Why don't we raise an error in the conversion to byte array in that case ? After all, conversions *to* internal reps are allowed to fail (eg ill-formed lists). These situations occur when the system encoding doesn't allow proper mapping of some of the input. Why hide that data loss ? Regarding the isCanonical (here isLossless) flag: that was my suggestion for all similar situations in 2564363... But I admit I'd rather avoid it by simply forbidding loss of sync. In the meantime, it seems I need to add the constraint to the CONCAT1 code, right ? ---------------------------------------------------------------------- Comment By: Donal K. Fellows (dkf) Date: 2009-02-05 11:00 Message: If you convert a string with characters in the \u0100-\uffff range to a byte array, the high byte of that character is dropped. If you modify the byte array then (e.g. by Tcl_SetByteArrayLength) the string rep will be thrown away. At that point, you'll have lost information. By requiring that there be no string rep in the first place, that means that there cannot be any information to lose and the operation is safe. (Otherwise, the fallback approach of working with strings is used, which produces correct results but more slowly.) Theoretically, we could be more sophisticated and record whether the string rep is generated from the bytearray or vice versa and use that as the safety constraint - we do the equivalent with lists - but the change is more intrusive and complex, and most practical bytearrays won't have string reps anyway. ---------------------------------------------------------------------- Comment By: Alexandre Ferrieux (ferrieux) Date: 2009-02-05 09:52 Message: This is very reminiscent of a similar optimization I made on INST_CONCAT1 for byte arrays last year. However, I don't see a trace of the "no string rep" constraint. So, either this constraint is needed and I must fix the CONCAT1 code, or it is not and you can remove it from your code. Can you elaborate on "the string-based formal semantics of the routine is circumventable" ? ---------------------------------------------------------------------- Comment By: Donal K. Fellows (dkf) Date: 2009-02-05 09:34 Message: I've enhanced Tcl_AppendObjToObj so that it does byte array concatenation safely and (reasonably) efficiently. It only does this optimization when both objects are bytearrays *and* the object being updated has no string rep (not a big problem; bytearrays don't usually have one unless people do things like printing them out for debugging purposes) - without that safety constraint, the string-based formal semantics of the routine is circumventable. This makes [append] of bytearrays to a variable containing a bytearray much faster according to my (informal) timing. Other basic operations still need work, so leaving this issue open. ---------------------------------------------------------------------- Comment By: Donal K. Fellows (dkf) Date: 2007-02-24 13:51 Message: Logged In: YES user_id=79902 Originator: NO I think it is reasonable to expect that [string index], [string range] and concatenation (but not [concat]) would work reasonably efficiently on bytearray objs. Moreover, making this happen can be done at this stage of the release cycle (unlike adding new subcommands to [binary], which would require TIPping and would formally constitute a new feature instead of an "optimization"). ---------------------------------------------------------------------- Comment By: Jeff Rogers (dvrsn) Date: 2007-02-23 02:09 Message: Logged In: YES user_id=24996 Originator: YES Improved patch, handles shared objects correctly. Also changes 'dscan' to modify the ByteArray only if all arguments were processed, so that for example binary dscan foo "a a1000" char data doesn't modify foo if it has less than 1001 bytes. File Added: tclBinary.patch ---------------------------------------------------------------------- Comment By: Jeff Rogers (dvrsn) Date: 2007-02-22 22:11 Message: Logged In: YES user_id=24996 Originator: YES Attached is a patch (against 8.4.14) implementing the "set" and "dscan" subcommands. I'm not sure is the variable handling is correct (particularly, should the reference count be incremented when creating a new object?) Rather than a separate "append" subcommand, the pattern would be binary set buffer "@[string length $buffer] a*" $dataToAppend Note too, that "set" will shrink the existing ByteArray if requested (meaning it will shrink if the format does not end with "@[string length $buffer]"). For an in-place/destructive modification, this is probably the right thing to do. File Added: tclBinary.patch ---------------------------------------------------------------------- Comment By: Jeff Rogers (dvrsn) Date: 2007-02-22 18:45 Message: Logged In: YES user_id=24996 Originator: YES I will try the suggestion in my application, but the general problem with binary scan [read $sock 4] I dlen binary scan [read $sock $dlen] a* data is that when $sock is non-blocking you are not guaranteed to actually read $dlen bytes; while if it is blocking then it can incur an unacceptably long pause in event handling. Beyond that, if the "data" itself contains additional counted strings then this technique is of no help since you are no longer reading from a stream. I wonder if the problem has not been noticed more because it only shows up (as do so many others) with scale -- I did not notice it until my buffer sizes were several megabytes. ---------------------------------------------------------------------- Comment By: Don Porter (dgp) Date: 2007-02-22 15:07 Message: Logged In: YES user_id=80530 Originator: NO While the observations appear to be correct, and we should consider addressing what we can... ...the way to avoid these problems is to not use a "buffer" in the first place, I think. Tcl is not C and C coding patterns often don't map well onto it. Instead of one huge [read] into a "buffer" from which you struggle to process things due to some limitations and inefficiencies of the base commands, why not only [read] as many bytes as you need at a time to do what you need to do? binary scan [read $sock 4] I dlen binary scan [read $sock $dlen] a* data I think the availability of this simple workaround is why we haven't noticed these issues much before. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1665628&group_id=10894 |