Anonymous - 2006-02-14
Hi everyone,
I have been looking around on the Internet trying to find a way to convert Unicode characters to ASCII characters.
I am able to convert ASCII to Unicode, but I'm not able to convert the other way.
How is that done?
The reason I need this is that I am trying to get a list of SAPI voices and I want to store the voice names as plain ASCII in a standard character string.
Please help,
Peter.
You might check out the size difference between the two and then think about your question again...
Hi,
I have checked out the size difference. I know that a Unicode character corresponds to an unsigned short and I know that an ASCII character is a char.
But I am still unable to convert the Unicode string to its ASCII equivalent. Am I missing something that's right in front of my nose?
Thanks in advance,
Peter.
If your Unicode text contains only 7-bit ASCII, then cast each character or take every other byte. If you have real multi-language text that needs to be converted, look at:
WideCharToMultiByte
wcstombs
Or you can encode it yourself. It's not hard:
http://en.wikipedia.org/wiki/UTF-8
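As a rough illustration of the "encode it yourself" route, here is a minimal C++ sketch of a UTF-8 encoder. It assumes the input is already a sequence of Unicode code points (std::u32string); 16-bit Windows text would first need its surrogate pairs combined. The names append_utf8 and to_utf8 are made up for this example, and substituting U+FFFD for invalid input is one possible policy, not the only one.

```cpp
#include <string>

// Append the UTF-8 encoding of one Unicode code point to `out`.
// Invalid input (a surrogate value or anything above U+10FFFF) is
// replaced with U+FFFD, the Unicode replacement character.
void append_utf8(std::string& out, char32_t cp) {
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
        cp = 0xFFFD;
    }
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx (identical to ASCII)
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Encode a whole string of code points as UTF-8.
std::string to_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) {
        append_utf8(out, cp);
    }
    return out;
}
```

Plain ASCII text comes out byte-for-byte unchanged, which is why UTF-8 is a convenient target when most of the strings are ASCII anyway.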
Searching on ANSI/Unicode rather than ASCII/Unicode may get you more useful information.
I imagine the comment about size was related to the size of the set rather than the size of the data type.
The ANSI character set (of which ASCII is a 7-bit subset) supports 256 character codes, whereas Unicode supports 65536 codes. The point is that, because Unicode is a superset, an accurate 'lossless' conversion might not be possible.
However, the ASCII codes 0x00 to 0x7f have the same value in Unicode, so apart from discarding the most significant byte (by casting), no conversion is necessary. I am not certain, but I presume the full ANSI set 0x00-0xff is a subset of Unicode. Any non-ANSI codes in the Unicode text will be converted to the wrong character; the right character may not even exist in the ANSI set.
I have found a product that claims to do what you ask, and it is available as a free trial: http://www.datamystic.com/textpipe/unicode_ansi.html
Searching on ANSI rather than ASCII may get you more useful information.
Clifford
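The casting approach described in this reply could look something like the C++ sketch below. It assumes the source text is held as 16-bit code units (std::u16string, matching a 16-bit wchar_t on Windows) and that anything outside 0x00-0x7f should be replaced rather than silently truncated. The function name narrow_to_ascii and the '?' placeholder are choices made for this example only.

```cpp
#include <string>

// Narrow 16-bit code units to 7-bit ASCII.
// Units in the range 0x00-0x7f have the same value in ASCII and in
// Unicode, so they are simply cast. Everything else has no ASCII
// equivalent and is replaced with `replacement` instead of being
// truncated to an unrelated character.
std::string narrow_to_ascii(const std::u16string& wide, char replacement = '?') {
    std::string out;
    out.reserve(wide.size());
    for (char16_t unit : wide) {
        out += (unit < 0x80) ? static_cast<char>(unit) : replacement;
    }
    return out;
}
```

Note that a character stored as a surrogate pair occupies two code units, so it would show up as two placeholders in the output.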
unicode supports uint32_t codes actually. only the common "characters" are representable below 64k. the reason utf16 is used over utf32 is a matter of utf16/ucs2 being the character set. a more robust standard.
WinXP's Notepad can do simple conversions. Check out the encoding options in the Save As dialog.
Technically, Unicode is just a series of numbered characters with no particular size or representation. Each system on which Unicode is implemented decides how many bits it wants to dedicate to each character. Windows wchar_t is 16 bits and IIRC on Linux wchar_t is 32 bits. Java's char is 16 bits.
>the reason utf16 is used over utf32 is a matter of utf16/ucs2 being the character set. a more robust standard.
Don't you mean UCS2 and UCS4? The selection of UTF-7/8/16/32 depends on what your system stores natively and how efficiently you want the higher characters to be stored. They can all represent characters beyond the 16-bit range.
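Since the width of wchar_t differs between platforms, portable code may want to check it rather than assume it. A small C++ sketch, with helper names invented for this example:

```cpp
#include <climits>
#include <cstddef>

// Width of wchar_t in bits on the platform this is compiled for.
constexpr std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// True when a single wchar_t is wide enough for any code point up to
// U+10FFFF, which needs 21 bits. With a 16-bit wchar_t this is false,
// and characters above 0xFFFF take two units (a surrogate pair).
constexpr bool wchar_holds_any_code_point() {
    return wchar_bits() >= 21;
}
```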
true. what i meant was that parts of unicode require something larger than uint16_t or unsigned short to represent all characters in a single type. not that unicode specifically requires any length.
"Don't you mean UCS2 and UCS4?"
no. i didn't. or yes i did. depends on which part you are asking about.
"The selection of UTF-7/8/16/32 depends on what your system stores natively and how efficiently you want the higher characters to be stored."
yea which is actually what i meant.
it should have been (granted my fault for not checking)
the reason utf16 is used over utf32 is a matter of utf16/ucs2 being the smaller character set. and utf16 the more robust standard.
that may be a matter of opinion, but utf16 can represent all unicode values and ucs2 can't.
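To illustrate the difference mentioned here: UTF-16 reaches values above 0xFFFF by combining two 16-bit units (a surrogate pair), whereas UCS-2 stops at single 16-bit units. Below is a minimal C++ sketch of decoding UTF-16 into code points. The function name utf16_to_code_points is invented for this example, and replacing unpaired surrogates with U+FFFD is an assumed error policy.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-16 code units into Unicode code points.
// A high surrogate (0xD800-0xDBFF) followed by a low surrogate
// (0xDC00-0xDFFF) is combined into one code point above 0xFFFF.
// An unpaired surrogate is replaced with U+FFFD.
std::u32string utf16_to_code_points(const std::u16string& units) {
    std::u32string out;
    for (std::size_t i = 0; i < units.size(); ++i) {
        char32_t cp = units[i];
        const bool is_high = cp >= 0xD800 && cp <= 0xDBFF;
        const bool is_low = cp >= 0xDC00 && cp <= 0xDFFF;
        if (is_high && i + 1 < units.size() &&
            units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            // Each surrogate carries 10 bits of the value minus 0x10000.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed
        } else if (is_high || is_low) {
            cp = 0xFFFD;
        }
        out += cp;
    }
    return out;
}
```

Text that stays within 0x0000-0xFFFF and contains no surrogates decodes one unit per character, which is the part UTF-16 and UCS-2 have in common.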
Hi everyone,
Thank you very much for those answers. They've helped me solve the problem at hand. I managed to get my GetVoiceList function to work and now have a comma-separated string of SAPI 5.1 voices installed on my own computer.
Thanks very much for all the helpful answers,
Peter.