Menu

#461 destructive scan of a ByteArray

open
5
2009-02-21
2007-02-21
Jeff Rogers
No

There is no way to efficiently modify a ByteArray, either by appending data to the end or by removing data from the front (as might be useful for a binary stream buffer).

Consider one task to read from a binary socket and buffer the data and another to decode counted strings from the front of the binary string:

variable buffer
proc read_handler {sock} {
variable buffer
append buffer [read $sock]
}

proc process_buffer {} {
variable buffer
if {[string length $buffer] > 0} {
binary scan $buffer "I" dlen
binary scan $buffer "x4 a$dlen" data
binary scan $buffer "x4 x$dlen a*" buffer
}
}

The first problem is that "append" only operates on string reps, which requires the entire ByteArray to be scanned and converted to the utf-8 (or unicode) encoding. (append also appears to unset the length of the string, meaning that a subsequent "string length" must recompute it).

The second problem is that scanning a few bytes from the front of the ByteArray and the modifying the original means that the entire ByteArray needs to be copied. (maintaining a separate cursor is efficient but clumsy).

"append" should probably do the sensible thing and append ByteArrays efficiently (as well as setting the length correctly), but since this is specifically dealing with binary strings it may make sense to implement these operations entirely as subcommands of "binary".

I suggest
binary append varName ?arg arg ...?
to append the ByteArray values of all args to the named binary value; and
binary dscan varName formatString ?varName varName ...?
dscan = "destructive scan", with functionality identical to "binary scan" except that it takes a varName instead of a string, and it changes the ByteArray object to have the start offset at the end of the format arguments This would need the ByteArray object to store an offset indicating where the start of the data is. The length computation of a ByteArray object would need to be changed to use the offset but this should be contained within tclBinary.c. The destructive operation could either just change the offset or actually modify the underlying bytes. If it works by changing the offset then there would need to be some point at which it would actually modify the bytes because the memory between the start of the bytes and the offset is wasted; this point could be some hardcoded maximum or a heuristic measure based on the overall size of the data. (i.e., if the wasted space is larger than the used space, copy the live data to the beginning of the allocated space, limiting the wasted space to 100% overhead and using a non-overlapping move operation).

Another method for modifying a binary object could be
binary set varName formatSting ?arg arg ...?
which would be something like "binary format" working on an existing bytearray.

Discussion

  • Don Porter

    Don Porter - 2007-02-22

    Logged In: YES
    user_id=80530
    Originator: NO

    While the observations appear to
    be correct, and we should consider
    addressing what we can...

    ...the way to avoid these problems
    is to not use a "buffer" in the
    first place, I think. Tcl is not
    C and C coding patterns often don't
    map well onto it.

    Instead of one huge [read] into a
    "buffer" from which you struggle
    to process things due to some
    limitations and inefficiencies
    of the base commands, why not
    only [read] as many bytes as you
    need at a time to do what you need
    to do?

    binary scan [read $sock 4] I dlen
    binary scan [read $sock $dlen] a* data

    I think the availability of this
    simple workaround is why we haven't noticed
    these issues much before.

     
  • Jeff Rogers

    Jeff Rogers - 2007-02-22

    Logged In: YES
    user_id=24996
    Originator: YES

    I will try the suggestion in my application, but the general problem with

    binary scan [read $sock 4] I dlen
    binary scan [read $sock $dlen] a* data

    is that when $sock is non-blocking you are not guaranteed to actually read $dlen bytes; while if it is blocking then it can incur an unacceptably long pause in event handling. Beyond that, if the "data" itself contains additional counted strings then this technique is of no help since you are no longer reading from a stream.

    I wonder if the problem has not been noticed more because it only shows up (as do so many others) with scale -- I did not notice it until my buffer sizes were several megabytes.

     
  • Jeff Rogers

    Jeff Rogers - 2007-02-22

    Logged In: YES
    user_id=24996
    Originator: YES

    Attached is a patch (against 8.4.14) implementing the "set" and "dscan" subcommands. I'm not sure is the variable handling is correct (particularly, should the reference count be incremented when creating a new object?)

    Rather than a separate "append" subcommand, the pattern would be

    binary set buffer "@[string length $buffer] a*" $dataToAppend

    Note too, that "set" will shrink the existing ByteArray if requested (meaning it will shrink if the format does not end with "@[string length $buffer]"). For an in-place/destructive modification, this is probably the right thing to do.

    File Added: tclBinary.patch

     
  • Jeff Rogers

    Jeff Rogers - 2007-02-23

    Logged In: YES
    user_id=24996
    Originator: YES

    Improved patch, handles shared objects correctly.

    Also changes 'dscan' to modify the ByteArray only if all arguments were processed, so that for example

    binary dscan foo "a a1000" char data

    doesn't modify foo if it has less than 1001 bytes.
    File Added: tclBinary.patch

     
  • Jeff Rogers

    Jeff Rogers - 2007-02-23
     
  • Donal K. Fellows

    Logged In: YES
    user_id=79902
    Originator: NO

    I think it is reasonable to expect that [string index], [string range] and concatenation (but not [concat]) would work reasonably efficiently on bytearray objs. Moreover, making this happen can be done at this stage of the release cycle (unlike adding new subcommands to [binary], which would require TIPping and would formally constitute a new feature instead of an "optimization").

     
  • Donal K. Fellows

    I've enhanced Tcl_AppendObjToObj so that it does byte array concatenation safely and (reasonably) efficiently. It only does this optimization when both objects are bytearrays *and* the object being updated has no string rep (not a big problem; bytearrays don't usually have one unless people do things like printing them out for debugging purposes) - without that safety constraint, the string-based formal semantics of the routine is circumventable. This makes [append] of bytearrays to a variable containing a bytearray much faster according to my (informal) timing.

    Other basic operations still need work, so leaving this issue open.

     
  • Alexandre Ferrieux

    This is very reminiscent of a similar optimization I made on INST_CONCAT1 for byte arrays last year. However, I don't see a trace of the "no string rep" constraint. So, either this constraint is needed and I must fix the CONCAT1 code, or it is not and you can remove it from your code. Can you elaborate on "the string-based formal semantics of the routine is circumventable" ?

     
  • Donal K. Fellows

    If you convert a string with characters in the \u0100-\uffff range to a byte array, the high byte of that character is dropped. If you modify the byte array then (e.g. by Tcl_SetByteArrayLength) the string rep will be thrown away. At that point, you'll have lost information. By requiring that there be no string rep in the first place, that means that there cannot be any information to lose and the operation is safe. (Otherwise, the fallback approach of working with strings is used, which produces correct results but more slowly.) Theoretically, we could be more sophisticated and record whether the string rep is generated from the bytearray or vice versa and use that as the safety constraint - we do the equivalent with lists - but the change is more intrusive and complex, and most practical bytearrays won't have string reps anyway.

     
  • Alexandre Ferrieux

    Why don't we raise an error in the conversion to byte array in that case ? After all, conversions *to* internal reps are allowed to fail (eg ill-formed lists).
    These situations occur when the system encoding doesn't allow proper mapping of some of the input. Why hide that data loss ?

    Regarding the isCanonical (here isLossless) flag: that was my suggestion for all similar situations in 2564363...
    But I admit I'd rather avoid it by simply forbidding loss of sync.

    In the meantime, it seems I need to add the constraint to the CONCAT1 code, right ?

     
  • Donal K. Fellows

    I've done that already (see 2568434). Still keeping this issue open as a master.

    Currently converted things:
    Tcl_AppendObjToObj
    Tcl_GetCharLength
    Tcl_GetRange
    Tcl_GetUniChar
    TEBC:INST_CONCAT1
    TEBC:INST_STR_CMP
    TEBC:INST_STR_LEN
    TEBC:INST_STR_INDEX
    TEBC:INST_STR_MATCH
    This means, for example, that [append] now does the Right Thing whether or not it is compiled.

     
  • Alexandre Ferrieux

    Thanks Donal.
    Still, you didn't answer:

    Why don't we raise an error in the conversion to byte array in that case ?
    After all, conversions *to* internal reps are allowed to fail (eg
    ill-formed lists).These situations occur when the system encoding doesn't allow proper
    mapping of some of the input. Why hide that data loss ?

     
  • Donal K. Fellows

    You want (and have time) to go through and audit all the code to see what new error cases that introduces?

     
  • Alexandre Ferrieux

    Not in a hurry, but I'd happily TIP it for 8.7. Rotten branches must be cut frankly sometimes.
    Would you veto it ?

     
  • Donal K. Fellows

    • labels: 105654 --> 12. ByteArray Object
    • milestone: 595232 -->
    • summary: cannot efficiently modify a ByteArray --> destructive scan of a ByteArray
     
  • Donal K. Fellows

    Right now, Tcl_GetByteArrayFromObj is not an API that can fail. (To be exact, it can't report the nature of the failure since it does not take a Tcl_Interp argument.) Going through the API to change this would be substantially disruptive, and equally so for third-party code. And Tcl_Panic is not appropriate either; that shouldn't be used for cases that are not C-level programming errors (we do it for memory management only because we don't know a good way to recover in general...) Given all this, I'm unwilling to countenance what you're proposing; the adding of new failure modes is tricky.

    The original problem seems solved enough that this can be a FRQ.

     
  • Nobody/Anonymous

    Well, you were so quick to answer that one that I did TIP 346 in between ;-)
    Now may I ask why you, as the TIP editor, tidied it up it without a remark, if it was so clearly unacceptable ?

    (Also, looking back at it, I see the sentence about [encoding system] is inaccurate. Indeed T_GBAFO converts the UTF8 strep to ISO-8859-1 regardless of the surrounding encoding system. Sorry about that.)

    Last question: FRQ== Fédération de Rugby du Québec ?

     
  • Donal K. Fellows

    FRQ=Feature ReQuest

    I prefer to discuss TIP designs on the tcl-core mailing list. But in this case... the majority of the TIP discusses features at the Tcl level. There's only one sentence which is likely to be a problem, and that's the one which talks about allowing Tcl_GetByteArrayFromObj to return TCL_ERROR. That implies a total change of type signature for that function.

    But it is far better to discuss this on tcl-core, and in any case there is no hurry as it won't be in 8.6. It's also off-topic for *this* FRQ anyway, which is about efficient update APIs.