From: Robert J. M. <sf-...@ro...> - 2005-12-01 19:11:55
Here's a new version of FILE-STRING-LENGTH which takes encoding issues into account. If FSL is passed a (string containing a) character which can't be represented in the output stream's external format, FSL now returns NIL rather than, as it used to, making a sort of wild guess.

I'm a touch less happy with this version because, in order to keep it minimally intrusive, I'm somewhat abusing the current DEFINE-EXTERNAL-FORMAT definition. DEF/variable-width now takes an additional parameter specifying the maximum number of bytes to which a single character can encode. The sizer function which was added in the last patch now actually _does the encoding_ (this is the bit I'm not entirely happy with) into a dummy octet array and catches the STREAM-ENCODING-ERROR if it's signalled. The alternative would be to add a "character-encodable-p"-style clause to DEFINE-EXTERNAL-FORMAT and modify all uses of that macro to provide it, but since "check if encodable" and "encode" are such very similar things to do to a character, I didn't see a tremendously OAOO way to separate them.

I've added a couple of tests to external-format.impure.lisp. The first checks that FSL, given a latin-1 stream, returns 1 for character codes 0 through 255 and NIL for codes 256 through CHAR-CODE-LIMIT. This one is not fast, since it goes character by character through the entire set of Unicode code points; FSL on a whole string looks up the external format's sizer function and establishes the encoding-error handler only once, so (file-string-length utf8-stream "all 1114112 characters") is much faster. The other is a spot check for latin-9, asserting that #\euro-sign has a file-string-length of 1 and #\coptic-capital-letter-hori returns NIL.

-- 
Robert Macomber
sf-...@ro...
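For readers following along, the trial-encoding strategy the sizer uses can be sketched roughly as below. This is an illustration, not the actual patch: SKETCH-FILE-STRING-LENGTH and its ENCODE-CHAR functional argument are made-up names standing in for SBCL's internal sizer and per-format encoder, and a plain ERROR handler stands in for the real STREAM-ENCODING-ERROR.

```lisp
;; Sketch of the sizer strategy described above: encode each character
;; into a scratch octet buffer and treat an encoding error as "this
;; character is not representable", making the whole result NIL.
;; ENCODE-CHAR is a stand-in for the external format's real encoder;
;; it should write CHAR's encoding into BUFFER and return the number
;; of octets used, or signal an error for an unencodable character.
(defun sketch-file-string-length (string encode-char max-bytes-per-char)
  "Return the number of octets STRING would occupy when encoded via
ENCODE-CHAR, or NIL if any character cannot be represented."
  (let ((scratch (make-array max-bytes-per-char
                             :element-type '(unsigned-byte 8))))
    (handler-case
        (loop for char across string
              sum (funcall encode-char char scratch))
      (error () nil))))
```

With a latin-1-style encoder (one octet per character, codes below 256 only), this returns the string length for encodable strings and NIL as soon as any character falls outside the format.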
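The latin-9 spot check rests on a quirk worth spelling out: ISO 8859-15 (latin-9) replaces a handful of latin-1 code points, so the euro sign U+20AC encodes to the single octet #xA4, while most other non-latin-1 characters, such as U+2C82 (COPTIC CAPITAL LETTER HORI), have no latin-9 encoding at all. A minimal sketch, using a hand-written partial table rather than SBCL's actual latin-9 machinery:

```lisp
;; Sketch of the latin-9 spot check described above.  The code-point
;; table here covers only what this illustration needs; it is NOT
;; SBCL's latin-9 implementation.
(defun latin-9-file-string-length (string)
  "Octet count of STRING under a simplified latin-9, or NIL if any
character is not representable."
  (loop for char across string
        sum (let ((code (char-code char)))
              (cond ((eql code #x20AC) 1)   ; euro sign -> octet #xA4
                    ;; latin-1 code points survive in latin-9 except
                    ;; the eight positions 8859-15 reassigns:
                    ((and (< code 256)
                          (not (member code '(#xA4 #xA6 #xA8 #xB4
                                              #xB8 #xBC #xBD #xBE))))
                     1)
                    (t (return nil))))))    ; not representable
```

So (latin-9-file-string-length (string (code-char #x20AC))) gives 1, and the same call on U+2C82 gives NIL, mirroring the two assertions in the test.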