From: Krzysztof D. <krz...@gm...> - 2014-02-27 00:08:47
|
I started looking around the Unicode-related code, since I am planning on applying to GSoC to work on Unicode improvements, and I found a bug in digit-char-p. Specifically, there was a cond branch for non-ASCII digits, but it was being skipped for radixes <= 10. I've attached a patch that fixes the bug and adds test cases to catch it if it comes back. Could someone give me some more information about code freeze, and whether my patch is important enough to go through anyhow? Also, who is generally the right person to talk to about Unicode issues? kad |
From: Tom E. <tr...@tr...> - 2014-02-27 14:38:21
|
Nice patch: I tried CCL and LispWorks and neither supports non-ASCII digits in digit-char-p, and the HyperSpec is of course mum on the issue, so I think this is a good solution. It will make SBCL incompatible with other implementations, but when it comes to Unicode that's par for the course.

Christophe Rhodes is the primary committer for Unicode features. For SoC 2013 I was signed up to be a mentor for anyone working on the Unicode support, and will probably do so again this year. I'm glad that you're interested in improving SBCL's Unicode support.

-tree

-- Tom Emerson tr...@tr... http://www.treerex.net/ |
From: Tom E. <tr...@tr...> - 2014-02-27 16:18:19
|
By the way, there is an open bug for this issue (which relates to alphanumericp) that includes a discussion of possible fixes:

https://bugs.launchpad.net/sbcl/+bug/1177986

-- Tom Emerson tr...@tr... http://www.treerex.net/ |
From: Krzysztof D. <krz...@gm...> - 2014-02-27 20:32:29
Attachments:
signature.asc
|
Thanks. I've looked over that discussion, and I can probably take a stab at a more complete digit-char-p sometime in the future (maybe I should save it for the summer?).

Right now, digit-char-p has what looks like Unicode support, but it seems buggy: #x೨1 reads as 33 (this works in master), but ೨1 is read as a symbol (a variable reference), not as the number 21 that people would expect. |
From: Stas B. <sta...@gm...> - 2014-02-28 10:12:23
|
Krzysztof Drewniak <krz...@gm...> writes:

> I started looking around the Unicode-related code, since I am planning
> on applying to GSoC to work on Unicode improvements, and I found a bug
> in digit-char-p. Specifically, there was a cond branch for non-ASCII
> digits, but it was being skipped for radixes <= 10. I've attached a
> patch that fixes the bug and adds test cases to catch it if it comes
> back. Could someone give me some more information about code freeze, and
> whether my patch is important enough to go through anyhow?
>
> Also, who is generally the right person to talk to about Unicode issues?

I'm not really convinced that DIGIT-CHAR-P should work with Unicode-style numbers, since PARSE-INTEGER and READ should then interpret such code points as digits too. That is a lot of complication and a performance penalty for a feature that almost nobody will use and that may break existing code, and any code that actually depends on the feature becomes unportable.

It should probably be explicit, e.g. sb-unicode:digit-char-p or sb-unicode:alphanumericp, etc.

-- With best regards, Stas. |
From: Christophe R. <cs...@ca...> - 2014-02-28 12:06:30
|
Krzysztof Drewniak <krz...@gm...> writes:

> I started looking around the Unicode-related code, since I am planning
> on applying to GSoC to work on Unicode improvements, and I found a bug
> in digit-char-p. Specifically, there was a cond branch for non-ASCII
> digits, but it was being skipped for radixes <= 10. I've attached a
> patch that fixes the bug and adds test cases to catch it if it comes
> back. Could someone give me some more information about code freeze, and
> whether my patch is important enough to go through anyhow?

I hope to release SBCL later today, at which point master gets unfrozen. (I have intermittent connectivity, so it might not get done until technically tomorrow.)

I think I agree with Stas that using non-standard characters as digits and expecting them to work as numbers is pain waiting to happen. I think that the first step in improving the Unicode support that we have is providing easy access to character properties and similar. I think we need this for letters too, given cases such as (cl:lower-case-p #\ß), whose result is forced because CL lower-case characters must have a distinct single-character uppercase equivalent.

I'd suggest just providing a direct translation of the Unicode character attributes into functions in an sb-unicode package, and after that thinking about implementing Unicode collation algorithms, titlecasing, and similar.

> Also, who is generally the right person to talk to about Unicode issues?

Lots of people have opinions :) It is probably best to keep the discussion on-list. |
From: Tom E. <tre...@gm...> - 2014-02-28 15:24:41
|
On Fri, Feb 28, 2014 at 7:06 AM, Christophe Rhodes <cs...@ca...> wrote:
> I think I agree with Stas, that using non-standard characters as digits
> and expecting them to work as numbers is pain waiting to happen; I think
[snip]

As I've thought about it, I agree, but the behavior is inconsistent: it does support non-ASCII numerals when the radix > 10, which appears to be by design since it explicitly looks up the numeric value for the character:

CL-USER> (digit-char-p #\U0662 10)
NIL
CL-USER> (digit-char-p #\U0662 16)
2
CL-USER> (parse-integer "٢9" :radix 10)
; Evaluation aborted on #<SB-INT:SIMPLE-PARSE-ERROR "junk in string ~S" {1004EC4C13}>.
CL-USER> (parse-integer "٢9" :radix 16)
41
2

So IMHO digit-char-p needs to be changed either to allow any type of Unicode digit in any base, or to not allow anything other than #\0 -- #\9.

-- Tom Emerson tre...@gm... http://www.dreamersrealm.net/tree |
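For illustration, the stricter of the two options (ASCII only, in every radix) might look roughly like the sketch below. This is not the attached patch and not SBCL's implementation: `ascii-digit-char-p` is a hypothetical name, and the code assumes an implementation whose char codes are ASCII-compatible. Letters are accepted as digits for radixes above 10, as in standard DIGIT-CHAR-P.

```lisp
;; Hypothetical helper, not part of SBCL or of the patch under discussion.
;; Assumes CHAR-CODE is ASCII-compatible for the basic Latin range.
(defun ascii-digit-char-p (char &optional (radix 10))
  "Return the weight of CHAR in RADIX if CHAR is an ASCII digit or letter
that is a valid digit in RADIX; otherwise return NIL."
  (check-type radix (integer 2 36))
  (let* ((code (char-code char))
         (weight (cond ((<= 48 code 57) (- code 48))          ; #\0 .. #\9
                       ((<= 65 code 90) (+ 10 (- code 65)))   ; #\A .. #\Z
                       ((<= 97 code 122) (+ 10 (- code 97)))  ; #\a .. #\z
                       (t nil))))
    ;; Anything outside ASCII has no weight and is rejected in every radix.
    (and weight (< weight radix) weight)))
```

With a definition along these lines, U+0662 would be rejected for radix 10 and radix 16 alike, which removes the inconsistency at the cost of dropping the non-ASCII lookup altogether.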
From: Christophe R. <cs...@ca...> - 2014-02-28 16:21:41
|
Tom Emerson <tre...@gm...> writes:

> On Fri, Feb 28, 2014 at 7:06 AM, Christophe Rhodes <cs...@ca...> wrote:
>> I think I agree with Stas, that using non-standard characters as digits
>> and expecting them to work as numbers is pain waiting to happen; I think
> [snip]
>
> As I've thought about it, I agree, but the behavior is inconsistent:
> it does support non-ASCII numerals when the radix > 10, which appears
> to be by design since it explicitly looks up the numeric value for the
> character:

Oh, right, I didn't say explicitly: I think the current behaviour of digit-char-p (which is unchanged since the Unicode merge, or at least the great whitespace explosion) is wrong, in that it shouldn't actually consider non-ASCII characters to be digits even for radixes larger than 10.

Still working on releasing later today.

Cheers,

Christophe |
From: Krzysztof D. <krz...@gm...> - 2014-02-28 19:24:56
Attachments:
signature.asc
|
On 02/28/2014 10:21 AM, Christophe Rhodes wrote:
> Tom Emerson <tre...@gm...> writes:
>
>> On Fri, Feb 28, 2014 at 7:06 AM, Christophe Rhodes <cs...@ca...> wrote:
>>> I think I agree with Stas, that using non-standard characters as digits
>>> and expecting them to work as numbers is pain waiting to happen; I think
>> [snip]
>>
>> As I've thought about it, I agree, but the behavior is inconsistent:
>> it does support non-ASCII numerals when the radix > 10, which appears
>> to be by design since it explicitly looks up the numeric value for the
>> character:
>
> Oh, right, I didn't say explicitly: I think the current behaviour of
> digit-char-p (which is unchanged since the Unicode merge, or at least
> the great whitespace explosion) is wrong in that it shouldn't actually
> consider non-ascii to be digits even for radixes larger than 10.
>
> Still working on releasing later today.
>
> Cheers,
>
> Christophe
>

There is (as mentioned in #1177986) the CLHS invariant that (alphanumericp x) => (or (alpha-char-p x) (digit-char-p x)). With the current definitions of alphanumericp and alpha-char-p (in terms of Unicode categories), digit-char-p has to handle Unicode digits to be standards-compliant.

The other option is to restrict alphanumericp and alpha-char-p to be ASCII-only, which would, IMO, defeat the purpose of Unicode support.

kad |
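One way to see where an implementation departs from the relationship quoted above is to scan a range of character codes and collect the characters that are ALPHANUMERICP but neither ALPHA-CHAR-P nor DIGIT-CHAR-P (default radix 10). The following is only a sketch using standard CL functions; the function name `alphanumericp-mismatches` is made up for this example, and what it returns over the full code range depends on the implementation and version.

```lisp
;; Hypothetical diagnostic helper; uses only standard Common Lisp functions.
(defun alphanumericp-mismatches (&key (start 0) (end char-code-limit))
  "Collect characters with codes in [START, END) that satisfy ALPHANUMERICP
but satisfy neither ALPHA-CHAR-P nor DIGIT-CHAR-P in the default radix."
  (loop for code from start below end
        for char = (code-char code)   ; CODE-CHAR may return NIL for some codes
        when (and char
                  (alphanumericp char)
                  (not (or (alpha-char-p char)
                           (digit-char-p char))))
          collect char))
```

Calling `(alphanumericp-mismatches)` with no arguments scans every code below CHAR-CODE-LIMIT; restricting the range with `:start` and `:end` makes it easier to inspect one Unicode block at a time.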
From: Elias M. <lo...@gm...> - 2014-03-01 13:49:35
|
On 1 March 2014 00:21, Christophe Rhodes <cs...@ca...> wrote:

> Oh, right, I didn't say explicitly: I think the current behaviour of
> digit-char-p (which is unchanged since the Unicode merge, or at least
> the great whitespace explosion) is wrong in that it shouldn't actually
> consider non-ascii to be digits even for radixes larger than 10.
>

I'm sorry for breaking into this discussion, and not even knowing what the whitespace explosion actually refers to. But, this led me to note that the Unicode space characters are not actually space characters in SBCL. For example, the sequence U+0031 DIGIT ONE, U+2003 EM SPACE, U+0032 DIGIT TWO is interpreted as a single symbol name comprised of three characters, as opposed to a sequence of the two digits 1 and 2.

Is this correct behaviour? If you guys are discussing supporting all Unicode digits, wouldn't it make sense to support all Unicode spacing as well?

Regards,
Elias (Loke on #lisp, in case you don't recognise my name) |
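The reader behaviour described above can be checked with a small helper that reads every form from a string. This is a sketch only: `read-all-from-string` and `*em-space-sample*` are names invented for the example, it assumes the standard readtable with `*read-base*` 10, and it assumes CODE-CHAR returns a character for #x2003. READ should only be applied to trusted strings.

```lisp
;; Hypothetical helper names; only standard Common Lisp is used.
(defun read-all-from-string (string)
  "Return a list of all forms that READ finds in STRING."
  (with-input-from-string (in string)
    ;; The stream object itself serves as a unique end-of-file marker.
    (loop for form = (read in nil in)
          until (eq form in)
          collect form)))

;; "1", U+2003 EM SPACE, "2" built from code points so the source stays ASCII.
(defparameter *em-space-sample*
  (coerce (list #\1 (code-char #x2003) #\2) 'string))
```

With an ordinary ASCII space, `(read-all-from-string "1 2")` yields the two integers. Evaluating `(read-all-from-string *em-space-sample*)` shows how a given implementation's reader treats EM SPACE: one form if it is a constituent character, two if it is whitespace.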
From: Krzysztof D. <krz...@gm...> - 2014-02-28 22:20:36
|
Re: the list, it was a mistake. I've re-sent to the list.

On 02/28/2014 03:57 PM, Christophe Rhodes wrote:
> Krzysztof Drewniak <krz...@gm...> writes:
>
>> On 02/28/2014 10:21 AM, Christophe Rhodes wrote:
>>> Oh, right, I didn't say explicitly: I think the current behaviour of
>>> digit-char-p (which is unchanged since the Unicode merge, or at least
>>> the great whitespace explosion) is wrong in that it shouldn't actually
>>> consider non-ascii to be digits even for radixes larger than 10.
>>>
>>> Still working on releasing later today.
>>>
>>> Cheers,
>>>
>>> Christophe
>>>
>> There is (as mentioned in #1177986), the CLHS invariant that
>> (alphanumeric x) => (or (alpha-char-p x) (digit-char-p x)).
>
> That's not actually a CLHS invariant; that's in the "Notes" section,
> which (per CLHS 1.4.3) is not formally part of the standard. (I also
> think it's not implied by the text that is formally part of the
> standard).
>

Thanks. I was going off of something that someone said on Launchpad.

>> The other option is to restrict alphanumericp and alpha-char-p to be
>> ASCII-only, which would, IMO, defeat the purpose of Unicode support.
>
> I don't see why it would defeat the purpose of Unicode support. There's
> a lot more to handling unicode than being able to use it in program
> source text; supporting a useful set of operators that can work with
> Unicode data is, I think, more likely to be generally useful than
> deciding that programmers can represent numeric constants using
> Arabic-Indic numerals.
>
> Is there a reason you sent this privately to me, and not directly to the
> list?
>

True, providing a set of useful operators is more fundamental, and I plan on working on that. For example, an sb-unicode:length that handles decomposed characters correctly would be useful (and not too difficult). I sent in this patch because there was already support for Unicode digits (which no one probably used anyway), but it was broken.
It's possible that supporting non-ASCII digits is a bad idea. I think that it won't really hurt to have that code in, unless the performance penalty is prohibitive (I don't really know how frequently digit-char-p is generally called, or how much an extra comparison would hurt). kad |
From: Christophe R. <cs...@ca...> - 2014-03-01 14:05:32
|
Elias Mårtenson <lo...@gm...> writes:

> On 1 March 2014 00:21, Christophe Rhodes <cs...@ca...> wrote:
>
>> Oh, right, I didn't say explicitly: I think the current behaviour of
>> digit-char-p (which is unchanged since the Unicode merge, or at least
>> the great whitespace explosion) is wrong in that it shouldn't actually
>> consider non-ascii to be digits even for radixes larger than 10.
>
> I'm sorry for breaking into this discussion, and not even knowing what the
> whitespace explosion actually refers to. But, this led me to note that the
> Unicode space characters are not actually space characters in SBCL. For
> example, the sequence U+0031 DIGIT ONE, U+2003 EM SPACE, U+0032 DIGIT TWO
> is interpreted as a single symbol name comprised of three characters, as
> opposed to a sequence of the two digits 1 and 2.
>
> Is this correct behaviour? If you guys are discussing supporting all
> Unicode digits, wouldn't it make sense to support all Unicode spacing as
> well?

I think that doing exotic things to support Unicode in *program text* should be a much lower priority than supporting Unicode on string and stream *data*.

If after we support a set of Unicode operations on string and stream data, it then looks like there's a natural way for them to apply to program code, then we can certainly think about it -- or if there's a compelling use case for being able to use multiple different space characters in program text.

Cheers,

Christophe |
From: Krzysztof D. <krz...@gm...> - 2014-03-02 19:51:24
Attachments:
signature.asc
|
On 03/01/2014 08:05 AM, Christophe Rhodes wrote:
> Elias Mårtenson <lo...@gm...> writes:
>
>> On 1 March 2014 00:21, Christophe Rhodes <cs...@ca...> wrote:
>>
>>> Oh, right, I didn't say explicitly: I think the current behaviour of
>>> digit-char-p (which is unchanged since the Unicode merge, or at least
>>> the great whitespace explosion) is wrong in that it shouldn't actually
>>> consider non-ascii to be digits even for radixes larger than 10.
>>
>> I'm sorry for breaking into this discussion, and not even knowing what the
>> whitespace explosion actually refers to. But, this led me to note that the
>> Unicode space characters are not actually space characters in SBCL. For
>> example, the sequence U+0031 DIGIT ONE, U+2003 EM SPACE, U+0032 DIGIT TWO
>> is interpreted as a single symbol name comprised of three characters, as
>> opposed to a sequence of the two digits 1 and 2.
>>
>> Is this correct behaviour? If you guys are discussing supporting all
>> Unicode digits, wouldn't it make sense to support all Unicode spacing as
>> well?
>
> I think that doing exotic things to support Unicode in *program text*
> should be a much lower priority than supporting Unicode on string and
> stream *data*.
>

Whatever the ultimate priorities of the project are, the limited support for Unicode in program text (specifically, digit-char-p) has a bug in it, which should be fixed.

> If after we support a set of Unicode operations on string and
> stream data, it then looks like there's a natural way for them to apply
> to program code, then we can certainly think about it -- or if there's a
> compelling use case for being able to use multiple different space
> characters in program text.
>

I think that Unicode-in-source and Unicode-in-data are two relatively independent issues, and can be worked on separately.

kad |