From: George R. <gr...@us...> - 2003-09-26 18:45:07
<disclaimer>
I guess I should remind you that the icuio (ustdio) library is not a supported library. It's also not tested very well. That is why the API is marked as draft, and the source code is contained in the extra directory. You're using an unstable API that is changing in future releases of ICU. The other libraries are supported and are tested much more thoroughly.
</disclaimer>

I'll try to explain the rest of the problem with this code.

I don't think the ustdio library works very well with the * specifier. It's probably trying to retrieve too many arguments for writing.

I'm also not sure that you even meant to use the en_US locale. The en_US locale will try to parse the comma after the number, but the en_US_POSIX locale won't try to parse the comma.

If you would like to debug and suggest improvements, it would be appreciated. Code contributions are also appreciated.

George Rhoten
IBM Globalization Center of Competency/ICU
San José, CA, USA

dan...@as...
Sent by: icu...@os...
09/25/2003 05:34 PM

To: icu...@os...
cc:
Subject: RE: Reading UTF-16 encoded text file with u_fgets

Folks,

I will be reading a UTF8 encoded file since it resolves the "endian" issue, nulls, etc. u_fgets seems to work fine now with utf8.

However, using u_sscanf now is giving a memory violation when trying to read the info into variables. Anyone have insight into u_sscanf?

Here's a portion of the u_fopen, u_fgets, and u_sscanf:

    uint32_t dwTmpNum = 0;
    UChar tmpmsg [MAX_MESSAGE_TEXT];
    char* FORMATSTR = "%d%*2c%[^\n]%*c";
    UChar* pmsgline;

    fpMsgFile = u_fopen (wszFilename, "r", ICUDefLocale, CDC_MSGFILE_ENC);
    if (x_fgets ( fpMsgFile, sizeof(pmsgline)/sizeof(XCHAR), pmsgline ) == NULL)
    { ...error handling... }
    dwNumCols = u_sscanf (pmsgline, ICUDefLocale, FORMATSTR, &dwTmpNum, tmpmsg);

ICUDefLocale = "en_US" (I'm also on Windows 2000)

pmsgline points correctly to the input line returned by fgets.
It is something like:

    2000, This is a product message text file to be translated by users

The idea here is to put the number (2000) into the dwTmpNum variable and the rest of the line (after the comma and spaces) into the tmpmsg buffer.

Thanks for any insight,
Dan

-----Original Message-----
From: George Rhoten [mailto:gr...@us...]
Sent: Thursday, September 25, 2003 3:24 PM
To: dan...@as...
Cc: icu...@ww...
Subject: Re: FW: Reading UTF-16 encoded text file with u_fgets

I'm not exactly sure what the problem is, but I'm going to presume that the file is UTF-16LE (not UTF-16BE). If that is true, then you are running into the characters \u000d and \u000a. Maybe this is a Windows file that you're dealing with on a Unix machine.

The function u_fgets() doesn't work very well. The version of u_fgets that will be going into ICU 2.8 now follows the UAX #13 guidelines for newline handling. The original implementation just didn't work very well.

I recommend that you upgrade to ICU 2.8 when it comes out around December. If you really want the fix now, look at the fix that I put into CVS earlier this week.

http://oss.software.ibm.com/cvs/icu/icu/source/extra/ustdio/ustdio.c

George Rhoten
IBM Globalization Center of Competency/ICU
San José, CA, USA

dan...@as...
Sent by: icu...@os...
09/25/2003 12:38 PM

To: icu...@os...
cc:
Subject: FW: Reading UTF-16 encoded text file with u_fgets

Actually, here's some more info:

The first call to u_fgets returns the first line correctly. It skips the 0xFFFE at the beginning of the file. The returned buffer includes the trailing 0x0D00 as well as null termination 0x0000.

However, the subsequent call returns only 0x0A00 and the null termination 0x0000. WHY?

Dan

> -----Original Message-----
> From: Dan Morales
> Sent: Thursday, September 25, 2003 2:35 PM
> To: 'icu...@os...'
> Subject: Reading UTF-16 encoded text file with u_fgets
>
> I am reading lines from a text file encoded as utf-16. It seems the first
> buffer returned by u_fgets seems OK.
>
> Here's what I see in a hex dump of the utf-16 file:
>
>     1. Beginning of file has 0xFFFE
>     2. Each line seems to end with: 0x0D00
>     3. Each line begins with: 0x0A00
>
> Here's my question:
>
> u_fgets seems to include the 0x0A00 at the beginning of each subsequent
> line. What is this character? My code seems to mess up because I do some
> u_strcmp with the beginning of the buffer to see if the line is a comment
> or not (looking for "//"). Can I just ignore the first UChar of the
> buffer?
>
> Please, any help on this is very welcome!
>
> Thanks,
> Dan

_______________________________________________
icu...@os... - icu4c-support mailing list
To Un/Subscribe: http://oss.software.ibm.com/developerworks/oss/mailman/listinfo/icu4c-support
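
For anyone hitting the same two problems in this thread (a stray U+000A at the start of a line, and u_sscanf failing on a "number, text" line), here is a minimal sketch of a workaround that avoids both. It is not ICU code and not the fix that went into CVS. The type UChar16 is a stand-in for ICU's UChar so the sketch compiles without ICU headers, and the helpers trim_newlines, is_comment and parse_message_line are hypothetical names made up for illustration. It assumes the number is written in ASCII digits and is followed directly by a comma.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for ICU's UChar (one 16-bit UTF-16 code unit). This is an
   assumption so the sketch builds without ICU. */
typedef uint16_t UChar16;

/* Hypothetical helper: skips leading CR/LF code units and removes trailing
   ones from a NUL-terminated line, in place. Returns the new start. */
static UChar16 *trim_newlines(UChar16 *line)
{
    size_t len = 0;

    while (*line == 0x000A || *line == 0x000D) {
        ++line;                       /* LF left over from the previous read */
    }
    while (line[len] != 0) {
        ++len;
    }
    while (len > 0 && (line[len - 1] == 0x000A || line[len - 1] == 0x000D)) {
        line[--len] = 0;              /* drop trailing CR/LF */
    }
    return line;
}

/* Hypothetical helper: returns 1 if the line starts with "//". */
static int is_comment(const UChar16 *line)
{
    return line[0] == 0x002F && line[1] == 0x002F;
}

/* Hypothetical helper: parses "<decimal number>, <text>" by hand.
   On success stores the number in *number and returns a pointer to the text
   after the comma and any spaces. Returns NULL if the line does not match.
   No overflow check is done on the number. */
static const UChar16 *parse_message_line(const UChar16 *line, uint32_t *number)
{
    uint32_t value = 0;
    const UChar16 *p = line;

    if (*p < 0x0030 || *p > 0x0039) {
        return NULL;                  /* must start with an ASCII digit */
    }
    while (*p >= 0x0030 && *p <= 0x0039) {
        value = value * 10u + (uint32_t)(*p - 0x0030);
        ++p;
    }
    if (*p != 0x002C) {
        return NULL;                  /* expected a comma after the number */
    }
    ++p;
    while (*p == 0x0020) {
        ++p;                          /* skip spaces before the text */
    }
    *number = value;
    return p;
}
```

The idea would be to call trim_newlines on the buffer returned by u_fgets, skip the line if is_comment returns 1, and otherwise hand it to parse_message_line. The returned pointer refers into the same buffer, so the text would still need to be copied out if the buffer is reused for the next line.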