Tcl / Read-Only Feature Requests / #461 destructive scan of a ByteArray

Don Porter - 2007-02-22

Logged In: YES
user_id=80530
Originator: NO

While the observations appear to
be correct, and we should consider
addressing what we can...

...the way to avoid these problems
is to not use a "buffer" in the
first place, I think. Tcl is not
C and C coding patterns often don't
map well onto it.

Instead of one huge [read] into a
"buffer" from which you struggle
to process things due to some
limitations and inefficiencies
of the base commands, why not
only [read] as many bytes as you
need at a time to do what you need
to do?

binary scan [read $sock 4] I dlen
binary scan [read $sock $dlen] a* data

I think the availability of this
simple workaround is why we haven't noticed
these issues much before.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jeff Rogers - 2007-02-22

Logged In: YES
user_id=24996
Originator: YES

I will try the suggestion in my application, but the general problem with

binary scan [read $sock 4] I dlen
binary scan [read $sock $dlen] a* data

is that when $sock is non-blocking you are not guaranteed to actually read $dlen bytes; while if it is blocking then it can incur an unacceptably long pause in event handling. Beyond that, if the "data" itself contains additional counted strings then this technique is of no help since you are no longer reading from a stream.

I wonder if the problem has not been noticed more because it only shows up (as do so many others) with scale -- I did not notice it until my buffer sizes were several megabytes.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jeff Rogers - 2007-02-22

Logged In: YES
user_id=24996
Originator: YES

Attached is a patch (against 8.4.14) implementing the "set" and "dscan" subcommands. I'm not sure is the variable handling is correct (particularly, should the reference count be incremented when creating a new object?)

Rather than a separate "append" subcommand, the pattern would be

binary set buffer "@[string length $buffer] a*" $dataToAppend

Note too, that "set" will shrink the existing ByteArray if requested (meaning it will shrink if the format does not end with "@[string length $buffer]"). For an in-place/destructive modification, this is probably the right thing to do.

File Added: tclBinary.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jeff Rogers - 2007-02-23

Logged In: YES
user_id=24996
Originator: YES

Improved patch, handles shared objects correctly.

Also changes 'dscan' to modify the ByteArray only if all arguments were processed, so that for example

binary dscan foo "a a1000" char data

doesn't modify foo if it has less than 1001 bytes.
File Added: tclBinary.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jeff Rogers - 2007-02-23

tclBinary.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Donal K. Fellows - 2007-02-24

Logged In: YES
user_id=79902
Originator: NO

I think it is reasonable to expect that [string index], [string range] and concatenation (but not [concat]) would work reasonably efficiently on bytearray objs. Moreover, making this happen can be done at this stage of the release cycle (unlike adding new subcommands to [binary], which would require TIPping and would formally constitute a new feature instead of an "optimization").

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Donal K. Fellows - 2009-02-05

I've enhanced Tcl_AppendObjToObj so that it does byte array concatenation safely and (reasonably) efficiently. It only does this optimization when both objects are bytearrays *and* the object being updated has no string rep (not a big problem; bytearrays don't usually have one unless people do things like printing them out for debugging purposes) - without that safety constraint, the string-based formal semantics of the routine is circumventable. This makes [append] of bytearrays to a variable containing a bytearray much faster according to my (informal) timing.

Other basic operations still need work, so leaving this issue open.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Alexandre Ferrieux - 2009-02-05

This is very reminiscent of a similar optimization I made on INST_CONCAT1 for byte arrays last year. However, I don't see a trace of the "no string rep" constraint. So, either this constraint is needed and I must fix the CONCAT1 code, or it is not and you can remove it from your code. Can you elaborate on "the string-based formal semantics of the routine is circumventable" ?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Donal K. Fellows - 2009-02-05

If you convert a string with characters in the \u0100-\uffff range to a byte array, the high byte of that character is dropped. If you modify the byte array then (e.g. by Tcl_SetByteArrayLength) the string rep will be thrown away. At that point, you'll have lost information. By requiring that there be no string rep in the first place, that means that there cannot be any information to lose and the operation is safe. (Otherwise, the fallback approach of working with strings is used, which produces correct results but more slowly.) Theoretically, we could be more sophisticated and record whether the string rep is generated from the bytearray or vice versa and use that as the safety constraint - we do the equivalent with lists - but the change is more intrusive and complex, and most practical bytearrays won't have string reps anyway.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Alexandre Ferrieux - 2009-02-05

Why don't we raise an error in the conversion to byte array in that case ? After all, conversions *to* internal reps are allowed to fail (eg ill-formed lists).
These situations occur when the system encoding doesn't allow proper mapping of some of the input. Why hide that data loss ?

Regarding the isCanonical (here isLossless) flag: that was my suggestion for all similar situations in 2564363...
But I admit I'd rather avoid it by simply forbidding loss of sync.

In the meantime, it seems I need to add the constraint to the CONCAT1 code, right ?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Donal K. Fellows - 2009-02-05

I've done that already (see 2568434). Still keeping this issue open as a master.

Currently converted things:
Tcl_AppendObjToObj
Tcl_GetCharLength
Tcl_GetRange
Tcl_GetUniChar
TEBC:INST_CONCAT1
TEBC:INST_STR_CMP
TEBC:INST_STR_LEN
TEBC:INST_STR_INDEX
TEBC:INST_STR_MATCH
This means, for example, that [append] now does the Right Thing whether or not it is compiled.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Alexandre Ferrieux - 2009-02-05

Thanks Donal.
Still, you didn't answer:

Why don't we raise an error in the conversion to byte array in that case ?
After all, conversions *to* internal reps are allowed to fail (eg
ill-formed lists).These situations occur when the system encoding doesn't allow proper
mapping of some of the input. Why hide that data loss ?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Donal K. Fellows - 2009-02-05

You want (and have time) to go through and audit all the code to see what new error cases that introduces?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Alexandre Ferrieux - 2009-02-05

Not in a hurry, but I'd happily TIP it for 8.7. Rotten branches must be cut frankly sometimes.
Would you veto it ?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Donal K. Fellows - 2009-02-21

labels: 105654 --> 12. ByteArray Object

milestone: 595232 -->

summary: cannot efficiently modify a ByteArray --> destructive scan of a ByteArray
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Donal K. Fellows - 2009-02-21

Right now, Tcl_GetByteArrayFromObj is not an API that can fail. (To be exact, it can't report the nature of the failure since it does not take a Tcl_Interp argument.) Going through the API to change this would be substantially disruptive, and equally so for third-party code. And Tcl_Panic is not appropriate either; that shouldn't be used for cases that are not C-level programming errors (we do it for memory management only because we don't know a good way to recover in general...) Given all this, I'm unwilling to countenance what you're proposing; the adding of new failure modes is tricky.

The original problem seems solved enough that this can be a FRQ.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Nobody/Anonymous - 2009-02-21

Well, you were so quick to answer that one that I did TIP 346 in between ;-)
Now may I ask why you, as the TIP editor, tidied it up it without a remark, if it was so clearly unacceptable ?

(Also, looking back at it, I see the sentence about [encoding system] is inaccurate. Indeed T_GBAFO converts the UTF8 strep to ISO-8859-1 regardless of the surrounding encoding system. Sorry about that.)

Last question: FRQ== Fédération de Rugby du Québec ?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Donal K. Fellows - 2009-02-21

FRQ=Feature ReQuest

I prefer to discuss TIP designs on the tcl-core mailing list. But in this case... the majority of the TIP discusses features at the Tcl level. There's only one sentence which is likely to be a problem, and that's the one which talks about allowing Tcl_GetByteArrayFromObj to return TCL_ERROR. That implies a total change of type signature for that function.

But it is far better to discuss this on tcl-core, and in any case there is no hurry as it won't be in 8.6. It's also off-topic for *this* FRQ anyway, which is about efficient update APIs.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

destructive scan of a ByteArray

The Tool Command Language implementation

Group

Searches

Help

#461 destructive scan of a ByteArray

Discussion