From: Jeff H. <je...@ac...> - 2013-01-29 09:59:43
On 2013-01-29, at 1:49 AM, Donal K. Fellows <don...@ma...> wrote:

> On 29/01/2013 06:10, Jeff Hobbs wrote:
>> A lot of baggage came in 8.1, and some of the solutions may have had
>> better alternatives than the direction chosen. Let's start with
>> regexps. 8.1 introduced the updated Spencer regexp. At the time, an
>> original piece of work that was innovative for REs in several ways.
>> This operated purely in UCS-16 space, which forced conversions any
>> time you did a regexp. Meanwhile some other APIs did the same in an
>> aggressive push to unicode, while the interface to the system was
>> almost never in UCS-2.
>
> Different OSes have different preferred interface encodings. Windows
> seems to like UCS-2 (or something very close to it). OSX likes UTF-8

Note that while Windows has UCS-2 affinity, it's rare that any data from the file system is already in that format - more likely cp1252 or similar.

>> The long and short is … I think you might actually succeed in making
>> a faster overall Tcl if you used utf-8 internal rep. It would
>> require tossing the Spencer RE, and likely updates to some other
>> areas.
>
> I suspect it depends on the mix of operations. Currently, the operations
> that are strongly boosted by having the system of UCS-2(-ish)
> representation are discovering the length and performing index
> resolution (and hence constructing substrings); these are changed from
> O(N) to O(1) in return for the extra cost of keeping around additional
> information about the representation. That's genuinely significant, even
> if not actually required for RE matching. (The UTF-8 representation has
> other advantages; in particular, it's great for interoperability with
> code that mostly doesn't want to care. That's a lot of Tcl!)

And the parenthetical point is important. You can fully parse Tcl code in a completely utf-8 agnostic way. It's still correct to parse utf-8 by tokenizing on 7-bit ASCII chars without any regard for the interspersed multi-byte utf-8 sequences, because every byte of a multi-byte sequence has its high bit set and so can never be mistaken for an ASCII delimiter.
There are other tricks to optimize utf-8 handling as you scan or jump through, some of which we employ already.

Jeff
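
For illustration only, here is a minimal C sketch of the two properties discussed in this message. It is not Tcl's implementation: the function names (`is_continuation`, `utf8_length`, `utf8_at`, `count_words`) are hypothetical, the input is assumed to be valid, NUL-terminated UTF-8, and splitting on whitespace stands in for a real tokenizer. It shows that length and index resolution on UTF-8 need an O(N) walk, and that a scanner can look only at ASCII bytes and still split UTF-8 text correctly.

```c
#include <assert.h>
#include <stddef.h>

/* True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx. */
int is_continuation(unsigned char c) {
    return (c & 0xC0) == 0x80;
}

/* Count code points: every byte that is not a continuation byte starts one.
 * This is O(N) in the byte length, unlike a fixed-width representation. */
size_t utf8_length(const char *s) {
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (!is_continuation((unsigned char)*s)) {
            n++;
        }
    }
    return n;
}

/* Return a pointer to the first byte of code point number idx (0-based),
 * or NULL if idx is past the end. Also an O(N) walk. */
const char *utf8_at(const char *s, size_t idx) {
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (is_continuation((unsigned char)*s)) {
            continue;               /* skip the inside of a multi-byte sequence */
        }
        if (idx == 0) {
            return s;
        }
        idx--;
    }
    return NULL;
}

/* Count whitespace-separated words by comparing bytes against ASCII
 * delimiters only. Bytes of multi-byte sequences are all >= 0x80, so they
 * never match a delimiter and simply count as part of the current word. */
size_t count_words(const char *s) {
    size_t words = 0;
    int in_word = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        int is_space = (*s == ' ' || *s == '\t' || *s == '\n');
        if (is_space) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            words++;
        }
    }
    return words;
}
```

The continuation-byte test is one example of the kind of trick that lets a scanner skip or resynchronize cheaply; whether it matches what Tcl actually does internally is not stated in this thread.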