Tom Emerson <tremerson@...> writes:
> On Tue, Apr 16, 2013 at 3:13 PM, Christophe Rhodes <csr21@...> wrote:
>> Great! Welcome. I have a work-in-progress local branch which
>> stores more of the necessary information about code points from
>> UnicodeData.txt and implements normalization, and might hand that branch
>> over to a student (or use it as introductory reading material) for the
>> SoC, if a sufficiently interested one shows up. (If not, well, I'll
>> keep on working on it in my own slow way -- but would also be open to
>> direct collaboration)
> I'd certainly be interested in working on it with you: having all the
> normalization forms supported efficiently has been on my personal task list
> for a while.
OK; I've got round to pushing a branch (volatile, could be rebased at
any instant, at least if I can work out how) to github. It supports
NFD and NFKD semi-efficiently; there is some low-hanging fruit for
improving them (by precomputing the recursive decomposition at build-time
rather than decomposing recursively at run-time, for example; also by
doing a first pass just checking codepoints).
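To make those two improvements concrete, here is a rough sketch in Python. It is not the code on the branch. It assumes Python's unicodedata module as a stand-in for UnicodeData.txt, and the names FULL_DECOMPOSITION, decompose, already_nfd and nfd are made up for illustration. The table of fully expanded canonical decompositions is built once, up front, and a cheap first pass returns the input untouched when no code point needs any work.

```python
import sys
import unicodedata

# Hangul syllables decompose algorithmically (constants from the Unicode Standard).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT, S_COUNT = 21, 28, 11172

def canonical_mapping(cp):
    """One-step canonical decomposition of cp as a tuple, or None."""
    fields = unicodedata.decomposition(chr(cp)).split()
    if not fields or fields[0].startswith("<"):
        return None  # no mapping, or a compatibility (<tag>) mapping
    return tuple(int(f, 16) for f in fields)

def expand(cp):
    """Recursively expand cp; only run while building the table."""
    mapping = canonical_mapping(cp)
    if mapping is None:
        return (cp,)
    return tuple(c for part in mapping for c in expand(part))

# "Build-time" step: every recursive decomposition is computed once here.
FULL_DECOMPOSITION = {}
for cp in range(sys.maxunicode + 1):
    if canonical_mapping(cp) is not None:
        FULL_DECOMPOSITION[cp] = expand(cp)

def decompose(cp):
    """Full canonical decomposition of one code point, with no recursion."""
    if S_BASE <= cp < S_BASE + S_COUNT:
        s = cp - S_BASE
        l = L_BASE + s // (V_COUNT * T_COUNT)
        v = V_BASE + (s % (V_COUNT * T_COUNT)) // T_COUNT
        t = T_BASE + s % T_COUNT
        return (l, v) if t == T_BASE else (l, v, t)
    return FULL_DECOMPOSITION.get(cp, (cp,))

def already_nfd(text):
    """Conservative first pass: true if no code point decomposes or combines."""
    return all(
        decompose(ord(ch)) == (ord(ch),) and unicodedata.combining(ch) == 0
        for ch in text
    )

def nfd(text):
    if already_nfd(text):
        return text  # fast path, nothing to do
    cps = [c for ch in text for c in decompose(ord(ch))]
    out, run = [], []
    for c in cps:
        if unicodedata.combining(chr(c)) == 0:
            # A starter ends the current run of combining marks.
            run.sort(key=lambda m: unicodedata.combining(chr(m)))
            out.extend(run)
            run = []
            out.append(c)
        else:
            run.append(c)
    # Stable sort keeps the order of marks with equal combining class.
    run.sort(key=lambda m: unicodedata.combining(chr(m)))
    out.extend(run)
    return "".join(map(chr, out))
```

The first pass here is deliberately conservative: it sends any string containing a combining mark down the slow path, even when the marks are already in canonical order.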
It does not yet support NFC or NFKC; I'm still contemplating coming up
with a viciously clever indexing scheme for the primary composition
lookup (hashing pairs of build-time allocated integers somehow to look up
compositions in a table). However, the NFD/NFKD support is tested using
the Unicode normalization test vectors, so improvements to it can be
made and tested with a reasonable amount of confidence.
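For the composition lookup, one possible shape of the "pairs of build-time allocated integers" idea is sketched below in Python. This is only a guess at a workable scheme, not a design the branch has settled on. It assumes Python's unicodedata module as the data source, derives the composition exclusions indirectly by asking whether NFC leaves a character unchanged, and leaves out Hangul, which composes algorithmically. All names (build_pairs, FIRST_INDEX, SECOND_INDEX, TABLE, compose_pair) are hypothetical.

```python
import sys
import unicodedata

def build_pairs():
    """Map (first, second) -> primary composite, from the canonical mappings."""
    pairs = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        fields = unicodedata.decomposition(ch).split()
        if len(fields) != 2 or fields[0].startswith("<"):
            continue  # singleton, compatibility mapping, or no mapping
        if unicodedata.normalize("NFC", ch) != ch:
            continue  # excluded from composition
        pairs[(int(fields[0], 16), int(fields[1], 16))] = cp
    return pairs

PAIRS = build_pairs()

# "Build-time allocated integers": each code point that ever appears as the
# first or second half of a composition gets a small dense index.
FIRST_INDEX = {cp: i for i, cp in enumerate(sorted({a for a, _ in PAIRS}))}
SECOND_INDEX = {cp: i for i, cp in enumerate(sorted({b for _, b in PAIRS}))}
N_SECOND = len(SECOND_INDEX)

# The pair of small indices collapses into a single integer key.
TABLE = {
    FIRST_INDEX[a] * N_SECOND + SECOND_INDEX[b]: composite
    for (a, b), composite in PAIRS.items()
}

def compose_pair(a, b):
    """Return the primary composite for (a, b), or None if there is none."""
    i = FIRST_INDEX.get(a)
    j = SECOND_INDEX.get(b)
    if i is None or j is None:
        return None  # most code points never take part in a composition
    return TABLE.get(i * N_SECOND + j)
```

Most code points fail one of the two index lookups straight away, so the combined key is only computed for characters that can actually take part in a composition.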
The tree's at
<https://github.com/csrhodes/sbcl/tree/unicode-improvements>. Let me
know how/whether it works for you.