From: Kjell R. <kje...@ma...> - 2019-09-25 08:11:13
On 2019-09-24 at 17:03, Dimitry Sibiryakov wrote:
> 24.09.2019 16:27, Kjell Rilbe wrote:
>> The built-in function ASCII_CHAR(n) seems to only accept integers 0..255
>> and not have any character set support whatsoever.
>
> ASCII (which this function has in name) define only 127 symbols.

Yes, obviously.

>> As a workaround, do string literals include some escape syntax to insert
>> an arbitrary code point, similar to for example in C#?
>> For example:
>> '|\u0066|' = 'f'
>>
>> Or are such escape mechanisms in the plans?
>
> Yes. Read README.hex_literals.txt in docs.

As far as I can see, that document concerns 1) the ability to specify integer values using hex notation, and 2) the ability to specify an arbitrary sequence of bytes as a string of character set octets.

While the latter would allow you to "manually" encode a Unicode character in, for example, UTF-8, it's not very practical. It would be a lot more useful to be able to specify the character's codepoint inside a string literal and have that codepoint automatically encoded into the string using that string's character set and encoding.

For example, the capital letter Ö, with Unicode codepoint U+00D6, would be written as, say, '\u00d6' inside a UTF-8 string literal and encoded as the sequence 0xC3 0x96. If '\u00d6' were written in a WIN1252 string literal, it would be encoded as a single 0xD6. If it were written inside an ISO8859_7 string literal, where that codepoint doesn't exist, it should throw a (transliteration) error.

The suggested <binary string literal> could be used to write characters using the literal's encoding directly. E.g. for a UTF-8 literal, the character Ö could be written as '\xC3\x96', and in a WIN1252 literal it could be written as '\xD6'.

Since these kinds of escapes would be a breaking change to how string literals are parsed, a solution would have to be found to determine whether a specific string literal is to be parsed with these kinds of escapes or not.
A prefix that could be combined with any character set prefix?

Another approach, that might suffice, would be to add a function that takes a codepoint and a character set and returns that codepoint encoded in that character set. For example:

UNICODE_CHAR(0xD6 as UTF8) would return a string in UTF8 character set containing bytes 0xC3 0x96.
UNICODE_CHAR(0xD6 as WIN1252) would return a string in WIN1252 character set containing byte 0xD6.
UNICODE_CHAR(0xD6 as ISO8859_7) would throw a transliteration error.

cast(x'C396' as varchar(10) character set UTF8) would return the UTF8 string 'Ö', so the suggested <binary string literal> does solve the case where you want to write the character code sequence for the specific character set that you're using. But it doesn't help if you want to specify the Unicode codepoint.

Regards,
Kjell
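
P.S. To make the intended behaviour concrete, here is a minimal Python sketch of what such a function might do. It is only an illustration: unicode_char and the CODECS table are hypothetical names of my own, the mapping from Firebird character set names to Python codec names is an assumption, and Python's UnicodeEncodeError merely stands in for a transliteration error.

```python
# Hypothetical model of the proposed UNICODE_CHAR(codepoint as charset).
# This is NOT an existing Firebird function; it only illustrates the idea.

# Assumed mapping from Firebird character set names to Python codec names.
CODECS = {
    "UTF8": "utf-8",
    "WIN1252": "cp1252",
    "ISO8859_7": "iso8859_7",
}


def unicode_char(codepoint: int, charset: str) -> bytes:
    """Return one Unicode codepoint encoded as bytes in the given charset.

    Raises UnicodeEncodeError if the charset cannot represent the
    character, which plays the role of a transliteration error here.
    """
    return chr(codepoint).encode(CODECS[charset])


if __name__ == "__main__":
    # Encode U+00D6 (capital O with diaeresis) in each character set.
    for name in CODECS:
        try:
            print(name, unicode_char(0xD6, name).hex())
        except UnicodeEncodeError:
            print(name, "transliteration error")
```

In this sketch U+00D6 comes out as the two bytes 0xC3 0x96 in UTF8 and as the single byte 0xD6 in WIN1252, while ISO8859_7 raises the error, since the Greek character set has no Ö. (The sequence 0xC3 0xB6 is the lowercase ö, U+00F6.)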