From: Christophe R. <cs...@ca...> - 2013-04-19 09:45:15
Tom Emerson <tre...@gm...> writes:

> On Tue, Apr 16, 2013 at 3:13 PM, Christophe Rhodes <cs...@ca...> wrote:
>
>> Great!  Welcome.  I have a work-in-progress local branch which
>> stores more of the necessary information about code points from
>> UnicodeData.txt and implements normalization, and might hand that branch
>> over to a student (or use it as introductory reading material) for the
>> SoC, if a sufficiently interested one shows up.  (If not, well, I'll
>> keep on working on it in my own slow way -- but would also be open to
>> direct collaboration)
>
> I'd certainly be interested in working on it with you: having all the
> normalization forms supported efficiently has been on my personal task list
> for a while.

OK; I've got round to pushing a branch (volatile, could be rebased at
any instant, at least if I can work out how) to github.

It supports NFD and NFKD semi-efficiently; there is some low-hanging
fruit for improving them (by precomputing the recursive decomposition at
build-time rather than decomposing recursively at run-time, for example;
also by doing a first pass just checking codepoints).  It does not yet
support NFC or NFKC; I'm still contemplating coming up with a viciously
clever indexing scheme for the primary composition lookup (hashing pairs
of build-time allocated integers somehow to look up compositions in a
table).

However, the NFD/NFKD support is tested using the Unicode normalization
test vectors, so improvements to it can be made and tested with a
reasonable amount of confidence.

The tree's at <https://github.com/csrhodes/sbcl/tree/unicode-improvements>.
Let me know how/whether it works for you.

Best,

Christophe
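
As a rough illustration of the two NFD improvements mentioned above (expanding the recursive decomposition once, ahead of time, and doing a cheap first pass over the code points), here is a minimal sketch. It is written in Python rather than Lisp and is not the code on the branch. It assumes Python's standard `unicodedata` module as a stand-in for a parsed UnicodeData.txt, and every name in it (`FULL_DECOMPOSITION`, `is_nfd`, `nfd`, and so on) is hypothetical.

```python
import unicodedata

# Hangul syllables decompose arithmetically; they carry no decomposition
# field in UnicodeData.txt, so they are handled separately from the table.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT, S_COUNT = 21, 28, 11172


def canonical_mapping(cp):
    """One-level canonical decomposition of cp as a list of ints, or None."""
    field = unicodedata.decomposition(chr(cp))
    if not field or field.startswith("<"):  # "<tag>" marks a compatibility mapping
        return None
    return [int(h, 16) for h in field.split()]


def expand(cp):
    """Fully expand cp by following canonical mappings recursively."""
    mapping = canonical_mapping(cp)
    if mapping is None:
        return (cp,)
    out = []
    for c in mapping:
        out.extend(expand(c))
    return tuple(out)


# "Build-time" step: the recursion runs once per code point here, so the
# normalizer itself only ever does a flat table lookup.
FULL_DECOMPOSITION = {
    cp: expand(cp) for cp in range(0x110000) if canonical_mapping(cp) is not None
}


def is_hangul_syllable(cp):
    return S_BASE <= cp < S_BASE + S_COUNT


def decompose_hangul(cp):
    s_index = cp - S_BASE
    l = L_BASE + s_index // (V_COUNT * T_COUNT)
    v = V_BASE + (s_index % (V_COUNT * T_COUNT)) // T_COUNT
    t = T_BASE + s_index % T_COUNT
    return (l, v) if t == T_BASE else (l, v, t)


def is_nfd(cps):
    """First pass: True if no code point decomposes and marks are in order."""
    last_ccc = 0
    for cp in cps:
        if cp in FULL_DECOMPOSITION or is_hangul_syllable(cp):
            return False
        ccc = unicodedata.combining(chr(cp))
        if ccc != 0 and ccc < last_ccc:
            return False
        last_ccc = ccc
    return True


def nfd(s):
    cps = [ord(ch) for ch in s]
    if is_nfd(cps):
        return s  # nothing to do, skip decomposition and reordering
    out = []
    for cp in cps:
        if is_hangul_syllable(cp):
            out.extend(decompose_hangul(cp))
        else:
            out.extend(FULL_DECOMPOSITION.get(cp, (cp,)))
    # Canonical ordering: stable-sort each run of non-starters by combining class.
    i = 0
    while i < len(out):
        if unicodedata.combining(chr(out[i])) == 0:
            i += 1
            continue
        j = i
        while j < len(out) and unicodedata.combining(chr(out[j])) != 0:
            j += 1
        out[i:j] = sorted(out[i:j], key=lambda c: unicodedata.combining(chr(c)))
        i = j
    return "".join(map(chr, out))
```

A quick sanity check is to compare `nfd(s)` against `unicodedata.normalize("NFD", s)` for a handful of strings; the Unicode normalization test vectors mentioned above would be the more thorough check. NFKD would follow the same shape with the compatibility mappings included in the table.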