Re: [TCLCORE] Agressive shimmering

On 2013-01-29, at 1:49 AM, Donal K. Fellows <don...@ma...> wrote:
> On 29/01/2013 06:10, Jeff Hobbs wrote:
>> A lot of baggage came in 8.1, and some of the solutions may have had
>> better alternatives than the direction chosen.  Let's start with
>> regexps.  8.1 introduced the updated Spencer regexp, at the time an
>> original piece of work that was innovative for REs in several ways.
>> It operated purely in UCS-2 space, which forced conversions any
>> time you did a regexp.  Meanwhile some other APIs did the same in an
>> aggressive push to unicode, while the interface to the system was
>> almost never in UCS-2.
> 
> Different OSes have different preferred interface encodings. Windows
> seems to like UCS-2 (or something very close to it). OSX likes UTF-8

Note that while Windows has UCS-2 affinity, it's rare that any data from the file system is already in that format; it is more likely cp1252 or similar.

>> The long and short is … I think you might actually succeed in making
>> a faster overall Tcl if you used utf-8 internal rep.  It would
>> require tossing the Spencer RE, and likely updates to some other
>> areas.
> 
> I suspect it depends on the mix of operations. Currently, the operations
> that are strongly boosted by having the system of UCS-2(-ish)
> representation are discovering the length and performing index
> resolution (and hence constructing substrings); these are changed from
> O(N) to O(1) in return for the extra cost of keeping around additional
> information about the representation. That's genuinely significant, even
> if not actually required for RE matching. (The UTF-8 representation has
> other advantages; in particular, it's great for interoperability with
> code that mostly doesn't want to care. That's a lot of Tcl!)
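To make the O(N) versus O(1) trade-off above concrete, here is a minimal sketch in Python rather than the core's C. The function names are hypothetical and this is not how Tcl implements its string representations. It only illustrates why length and index lookups need a scan over variable-width UTF-8 bytes but reduce to arithmetic on fixed-width 2-byte units (assuming, as UCS-2 does, that every character fits in one unit).

```python
def utf8_length(data: bytes) -> int:
    """Count characters by scanning: every byte that is not a
    continuation byte (10xxxxxx) starts a new character. O(N)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def utf8_index(data: bytes, i: int) -> str:
    """Return character i by walking from the start of the buffer. O(N)."""
    if i < 0:
        raise IndexError(i)
    count = -1
    start = None
    for pos, b in enumerate(data):
        if (b & 0xC0) != 0x80:          # lead byte of a character
            count += 1
            if count == i:
                start = pos             # character i begins here
            elif count == i + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError(i)
    return data[start:].decode("utf-8")  # character i was the last one


def ucs2_length(data: bytes) -> int:
    """Fixed 2-byte units: length is plain arithmetic. O(1)."""
    return len(data) // 2


def ucs2_index(data: bytes, i: int) -> str:
    """Fixed 2-byte units: character i sits at byte offset 2*i. O(1)."""
    if not 0 <= i < len(data) // 2:
        raise IndexError(i)
    return data[2 * i:2 * i + 2].decode("utf-16-le")
```

The fixed-width functions never look at the data to find a position, which is the boost described above. The cost is holding (and converting to) the second representation.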

And the parenthetical point is important.  You can fully parse Tcl code in a completely UTF-8 agnostic way.  It's still correct to parse UTF-8 by tokenizing on 7-bit ASCII chars without any regard for interspersed multi-byte UTF-8 chars, because every byte of a multi-byte sequence has its high bit set and so can never be mistaken for an ASCII char.  There are other tricks to optimize UTF-8 handling as you scan or jump through, some of which we employ already.
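As a rough illustration of that point (a sketch in Python, not the actual Tcl parser; `split_words` is a hypothetical name, and it only splits on space and tab), a tokenizer can work on the raw UTF-8 bytes and compare against ASCII delimiters alone, never decoding anything:

```python
def split_words(script: bytes) -> list:
    """Split raw UTF-8 bytes into words on ASCII space/tab only.

    No decoding happens here. Bytes >= 0x80 (lead and continuation
    bytes of multi-byte characters) are treated as opaque word content,
    and none of them can equal an ASCII delimiter.
    """
    words = []
    start = None                         # byte offset where the current word began
    for pos, b in enumerate(script):
        if b in (0x20, 0x09):            # ASCII space or tab ends a word
            if start is not None:
                words.append(script[start:pos])
                start = None
        elif start is None:
            start = pos                  # first byte of a new word
    if start is not None:
        words.append(script[start:])     # trailing word with no delimiter after it
    return words
```

Each returned word is still valid UTF-8, since a split can only happen at an ASCII byte and never in the middle of a multi-byte character.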

Jeff
