[Tcl-bugs] [ tcl-Bugs-1665628 ] cannot efficiently modify a ByteArray

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

Bugs item #1665628, was opened at 2007-02-21 21:44
Message generated for change (Comment added) made by dkf
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1665628&group_id=10894

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: 12. ByteArray Object
Group: obsolete: 8.4.14
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Jeff Rogers (dvrsn)
Assigned to: Donal K. Fellows (dkf)
Summary: cannot efficiently modify a ByteArray

Initial Comment:
There is no way to efficiently modify a ByteArray, either by appending data to the end or by removing data from the front (as might be useful for a binary stream buffer).  

Consider one task to read from a binary socket and buffer the data and another to decode counted strings from the front of the binary string:

  variable buffer
  proc read_handler {sock} {
    variable buffer
    append buffer [read $sock]
  }

  proc process_buffer {} {
    variable buffer
    if {[string length $buffer] > 0} {
      binary scan $buffer "I" dlen
      binary scan $buffer "x4 a$dlen" data
      binary scan $buffer "x4 x$dlen a*" buffer
    }
  }

The first problem is that "append" only operates on string reps, which requires the entire ByteArray to be scanned and converted to the utf-8 (or unicode) encoding.  (append also appears to unset the length of the string, meaning that a subsequent "string length" must recompute it).

The second problem is that scanning a few bytes from the front of the ByteArray and the modifying the original means that the entire ByteArray needs to be copied.  (maintaining a separate cursor is efficient but clumsy).

"append" should probably do the sensible thing and append ByteArrays efficiently (as well as setting the length correctly), but since this is specifically dealing with binary strings it may make sense to implement these operations entirely as subcommands of "binary".

I suggest
  binary append varName ?arg arg ...?
to append the ByteArray values of all args to the named binary value; and 
  binary dscan varName formatString ?varName varName ...?
dscan = "destructive scan", with functionality identical to "binary scan" except that it takes a varName instead of a string, and it changes the ByteArray object to have the start offset at the end of the format arguments   This would need the ByteArray object to store an offset indicating where the start of the data is.  The length computation of a ByteArray object would need to be changed to use the offset but this should be contained within tclBinary.c.  The destructive operation could either just change the offset or actually modify the underlying bytes.  If it works by changing the offset then there would need to be some point at which it would actually modify the bytes because the memory between the start of the bytes and the offset is wasted; this point could be some hardcoded maximum or a heuristic measure based on the overall size of the data.  (i.e., if the wasted space is larger than the used space, copy the live data to the beginning of the allocated space, limiting the wasted space to 100% overhead and using a non-overlapping move operation).

Another method for modifying a binary object could be
  binary set varName formatSting ?arg arg ...?
which would be something like "binary format" working on an existing bytearray.

----------------------------------------------------------------------

>Comment By: Donal K. Fellows (dkf)
Date: 2009-02-05 15:21

Message:
You want (and have time) to go through and audit all the code to see what
new error cases that introduces?

----------------------------------------------------------------------

Comment By: Alexandre Ferrieux (ferrieux)
Date: 2009-02-05 15:07

Message:
Thanks Donal.
Still, you didn't answer:

Why don't we raise an error in the conversion to byte array in that case
?
After all, conversions *to* internal reps are allowed to fail (eg
ill-formed lists).These situations occur when the system encoding doesn't
allow proper
mapping of some of the input. Why hide that data loss ?

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2009-02-05 14:41

Message:
I've done that already (see 2568434). Still keeping this issue open as a
master.

Currently converted things:
  Tcl_AppendObjToObj
  Tcl_GetCharLength
  Tcl_GetRange
  Tcl_GetUniChar
  TEBC:INST_CONCAT1
  TEBC:INST_STR_CMP
  TEBC:INST_STR_LEN
  TEBC:INST_STR_INDEX
  TEBC:INST_STR_MATCH
This means, for example, that [append] now does the Right Thing whether or
not it is compiled.

----------------------------------------------------------------------

Comment By: Alexandre Ferrieux (ferrieux)
Date: 2009-02-05 14:04

Message:
Why don't we raise an error in the conversion to byte array in that case ?
After all, conversions *to* internal reps are allowed to fail (eg
ill-formed lists).
These situations occur when the system encoding doesn't allow proper
mapping of some of the input. Why hide that data loss ?

Regarding the isCanonical (here isLossless) flag: that was my suggestion
for all similar situations in 2564363...
But I admit I'd rather avoid it by simply forbidding loss of sync.

In the meantime, it seems I need to add the constraint to the CONCAT1
code, right ?

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2009-02-05 11:00

Message:
If you convert a string with characters in the \u0100-\uffff range to a
byte array, the high byte of that character is dropped. If you modify the
byte array then (e.g. by Tcl_SetByteArrayLength) the string rep will be
thrown away. At that point, you'll have lost information. By requiring that
there be no string rep in the first place, that means that there cannot be
any information to lose and the operation is safe. (Otherwise, the fallback
approach of working with strings is used, which produces correct results
but more slowly.) Theoretically, we could be more sophisticated and record
whether the string rep is generated from the bytearray or vice versa and
use that as the safety constraint - we do the equivalent with lists - but
the change is more intrusive and complex, and most practical bytearrays
won't have string reps anyway.

----------------------------------------------------------------------

Comment By: Alexandre Ferrieux (ferrieux)
Date: 2009-02-05 09:52

Message:
This is very reminiscent of a similar optimization I made on INST_CONCAT1
for byte arrays last year. However, I don't see a trace of the "no string
rep" constraint. So, either this constraint is needed and I must fix the
CONCAT1 code, or it is not and you can remove it from your code. Can you
elaborate on "the string-based formal semantics of the routine is
circumventable" ?

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2009-02-05 09:34

Message:
I've enhanced Tcl_AppendObjToObj so that it does byte array concatenation
safely and (reasonably) efficiently. It only does this optimization when
both objects are bytearrays *and* the object being updated has no string
rep (not a big problem; bytearrays don't usually have one unless people do
things like printing them out for debugging purposes) - without that safety
constraint, the string-based formal semantics of the routine is
circumventable. This makes [append] of bytearrays to a variable containing
a bytearray much faster according to my (informal) timing.

Other basic operations still need work, so leaving this issue open.

----------------------------------------------------------------------

Comment By: Donal K. Fellows (dkf)
Date: 2007-02-24 13:51

Message:
Logged In: YES 
user_id=79902
Originator: NO

I think it is reasonable to expect that [string index], [string range] and
concatenation (but not [concat]) would work reasonably efficiently on
bytearray objs. Moreover, making this happen can be done at this stage of
the release cycle (unlike adding new subcommands to [binary], which would
require TIPping and would formally constitute a new feature instead of an
"optimization").

----------------------------------------------------------------------

Comment By: Jeff Rogers (dvrsn)
Date: 2007-02-23 02:09

Message:
Logged In: YES 
user_id=24996
Originator: YES

Improved patch, handles shared objects correctly.

Also changes 'dscan' to modify the ByteArray only if all arguments were
processed, so that for example 

 binary dscan foo "a a1000" char data

doesn't modify foo if it has less than 1001 bytes.
File Added: tclBinary.patch

----------------------------------------------------------------------

Comment By: Jeff Rogers (dvrsn)
Date: 2007-02-22 22:11

Message:
Logged In: YES 
user_id=24996
Originator: YES

Attached is a patch (against 8.4.14) implementing the "set" and "dscan"
subcommands.  I'm not sure is the variable handling is correct
(particularly, should the reference count be incremented when creating a
new object?)

Rather than a separate "append" subcommand, the pattern would be

binary set buffer "@[string length $buffer] a*" $dataToAppend

Note too, that "set" will shrink the existing ByteArray if requested
(meaning it will shrink if the format does not end with "@[string length
$buffer]").  For an in-place/destructive modification, this is probably the
right thing to do.

File Added: tclBinary.patch

----------------------------------------------------------------------

Comment By: Jeff Rogers (dvrsn)
Date: 2007-02-22 18:45

Message:
Logged In: YES 
user_id=24996
Originator: YES

I will try the suggestion in my application, but the general problem with

binary scan [read $sock 4] I dlen
binary scan [read $sock $dlen] a* data

is that when $sock is non-blocking you are not guaranteed to actually read
$dlen bytes; while if it is blocking then it can incur an unacceptably long
pause in event handling.  Beyond that, if the "data" itself contains
additional counted strings then this technique is of no help since you are
no longer reading from a stream.

I wonder if the problem has not been noticed more because it only shows up
(as do so many others) with scale -- I did not notice it until my buffer
sizes were several megabytes.

----------------------------------------------------------------------

Comment By: Don Porter (dgp)
Date: 2007-02-22 15:07

Message:
Logged In: YES 
user_id=80530
Originator: NO

While the observations appear to
be correct, and we should consider
addressing what we can...

...the way to avoid these problems
is to not use a "buffer" in the
first place, I think.  Tcl is not
C and C coding patterns often don't
map well onto it.

Instead of one huge [read] into a
"buffer" from which you struggle
to process things due to some
limitations and inefficiencies
of the base commands, why not
only [read] as many bytes as you
need at a time to do what you need 
to do?

binary scan [read $sock 4] I dlen
binary scan [read $sock $dlen] a* data

I think the availability of this
simple workaround is why we haven't noticed
these issues much before.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=110894&aid=1665628&group_id=10894

[Tcl-bugs] [ tcl-Bugs-1665628 ] cannot efficiently modify a ByteArray

The Tool Command Language implementation

[Tcl-bugs] [ tcl-Bugs-1665628 ] cannot efficiently modify a ByteArray