RE: Reading UTF-16 encoded text file with u_fgets


<disclaimer>
I guess I should remind you that the icuio (ustdio) library is not a 
supported library.  It's also not tested very well.  That is why the API 
is marked as draft, and the source code is contained in the extra 
directory.  You're using an unstable API that is changing in future 
releases of ICU.  The other libraries are supported and are tested much 
more thoroughly.
</disclaimer>

I'll try to explain the rest of the problem with this code.  I don't think 
the ustdio library handles the * (assignment suppression) specifier very 
well.  It's probably still trying to retrieve an argument to write into 
for each suppressed field, so it reads more arguments than you passed.
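
For comparison, here is a minimal sketch of how the same kind of format behaves with the standard C sscanf, where a field marked with * is read but takes no argument.  This is the C library and not ustdio, so it only illustrates the intended meaning of the specifier.  The helper name parse_with_sscanf and the 255 width limit are assumptions added for the example, and the trailing %*c from the original format is left out.

```c
#include <stdio.h>

/* Parse "<number>,<space><text>" with the standard C sscanf.
 * Hypothetical helper: it shows what '*' is meant to do in the C
 * library and says nothing about how ustdio implements it.
 * 'text' must have room for at least 256 chars. */
static int parse_with_sscanf(const char *line, int *number, char *text)
{
    /* %d        -> stored in *number
     * %*2c      -> reads two characters, stores nothing, uses no argument
     * %255[^\n] -> stored in text, at most 255 chars plus the terminator */
    return sscanf(line, "%d%*2c%255[^\n]", number, text);
}
```

With the sample line from the question, only two arguments are consumed and the call returns 2.  Because %*2c skips exactly two characters (the comma and one space), the text starts with the second space.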

I'm also not sure that you even meant to use the en_US locale.  The en_US 
locale will try to parse the comma after the number, but the en_US_POSIX 
locale won't try to parse the comma.

If you would like to debug and suggest improvements, it would be 
appreciated.  Code contributions are also appreciated.

George Rhoten
IBM Globalization Center of Competency/ICU  San José, CA, USA

dan...@as...
Sent by: icu...@os...
09/25/2003 05:34 PM

        To:     icu...@os...
        cc: 
        Subject:        RE: Reading UTF-16 encoded text file with u_fgets

Folks,

I will be reading a UTF-8 encoded file, since that avoids the endianness
issue, embedded nulls, etc.  u_fgets now seems to work fine with UTF-8.
However, u_sscanf is now giving a memory violation when it tries to read
the info into variables.  Does anyone have insight into u_sscanf?

Here's a portion of the u_fopen, u_fgets, and u_sscanf:

uint32_t    dwTmpNum   = 0;
UChar       tmpmsg  [MAX_MESSAGE_TEXT];
char*       FORMATSTR  = "%d%*2c%[^\n]%*c";
UChar*      pmsgline;

fpMsgFile = u_fopen (wszFilename, "r", ICUDefLocale, CDC_MSGFILE_ENC);
if (x_fgets (fpMsgFile,
             sizeof(pmsgline)/sizeof(XCHAR),
             pmsgline) == NULL)
{
    ...error handling...
}
dwNumCols = u_sscanf (pmsgline, ICUDefLocale, FORMATSTR, &dwTmpNum,
                      tmpmsg);

ICUDefLocale = "en_US" (I'm also on Windows 2000)
pmsgline points correctly to the input line returned by fgets.
It is something like:

2000,  This is a product message text file to be translated by users

The idea here is to put the number (2000) into the dwTmpNum variable and
the rest of the line (after the comma and spaces) into the tmpmsg buffer.
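
The split described above can also be done by walking the buffer by hand, which avoids the scanf-style format entirely.  The following is only a sketch under stated assumptions: Unit16 is a stand-in for ICU's UChar so the example compiles without ICU, split_message_line is a hypothetical helper, the digits are plain ASCII digits, and there is no overflow check on the number.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t Unit16;   /* stand-in for ICU's UChar (assumption) */

/* Split "<digits>,<spaces><text>" into a number and the remaining text.
 * Returns 1 on success, 0 if the line does not start with a digit. */
static int split_message_line(const Unit16 *line, uint32_t *number,
                              Unit16 *text, size_t text_cap)
{
    size_t i = 0, out = 0;
    uint32_t value = 0;

    if (text_cap == 0 || line[0] < '0' || line[0] > '9')
        return 0;

    /* Leading ASCII digits form the message number. */
    while (line[i] >= '0' && line[i] <= '9') {
        value = value * 10 + (uint32_t)(line[i] - '0');
        i++;
    }

    /* Skip the separator: one optional comma, then spaces or tabs. */
    if (line[i] == ',')
        i++;
    while (line[i] == ' ' || line[i] == '\t')
        i++;

    /* Copy the rest, stopping at CR, LF, the terminator, or a full buffer. */
    while (line[i] != 0 && line[i] != 0x000D && line[i] != 0x000A &&
           out + 1 < text_cap)
        text[out++] = line[i++];
    text[out] = 0;

    *number = value;
    return 1;
}
```

A line that starts with "//" is rejected by the first check, so the same helper can be used to skip comment lines.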

Thanks for any insight,

Dan

-----Original Message-----
From: George Rhoten [mailto:gr...@us...]
Sent: Thursday, September 25, 2003 3:24 PM
To: dan...@as...
Cc: icu...@ww...
Subject: Re: FW: Reading UTF-16 encoded text file with u_fgets

I'm not exactly sure what the problem is, but I'm going to presume that
the file is UTF-16LE (not UTF-16BE).  If that is true, then you are
running into the characters \u000d and \u000a (carriage return and line
feed).  Maybe this is a Windows file that you're dealing with on a Unix
machine.
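
To make the byte values from the hex dump concrete, here is a small sketch that decodes little-endian byte pairs into 16-bit code units.  It is not ICU code, and decode_utf16le is a hypothetical helper written for this illustration.  Under the UTF-16LE assumption, the bytes FF FE are the byte order mark, 0D 00 is U+000D, and 0A 00 is U+000A.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode little-endian byte pairs into 16-bit code units.
 * A leading FF FE byte order mark is skipped.  Returns the number of
 * units written to out (at most out_cap).  A trailing odd byte is
 * ignored. */
static size_t decode_utf16le(const unsigned char *bytes, size_t nbytes,
                             uint16_t *out, size_t out_cap)
{
    size_t i = 0, n = 0;

    if (nbytes >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
        i = 2;                       /* skip the byte order mark */

    for (; i + 1 < nbytes && n < out_cap; i += 2)
        out[n++] = (uint16_t)(bytes[i] | (bytes[i + 1] << 8));

    return n;
}
```

Seen this way, the 0D 00 at the end of one line and the 0A 00 at the start of the next are the two halves of a single CR LF line ending.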

The function u_fgets() in the current release doesn't handle newlines
very well.  The version of u_fgets that will be going into ICU 2.8 now
follows the UAX #13 guidelines for newline handling.
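
As a rough illustration of that kind of newline handling, the sketch below finds the end of the first line in a buffer of 16-bit code units and treats CR immediately followed by LF as one terminator.  This is not the ustdio implementation.  The function names are hypothetical, and the set of terminators checked here (LF, CR, NEL, LS, PS) is only a subset of what UAX #13 discusses.

```c
#include <stddef.h>
#include <stdint.h>

/* Code units treated as line terminators in this sketch (a subset). */
static int is_newline_unit(uint16_t c)
{
    return c == 0x000A || c == 0x000D || c == 0x0085 ||
           c == 0x2028 || c == 0x2029;
}

/* Returns the number of units in the first line of buf, excluding its
 * terminator, and stores in *next the index where the next line starts. */
static size_t first_line(const uint16_t *buf, size_t len, size_t *next)
{
    size_t end = 0;
    size_t after;

    while (end < len && !is_newline_unit(buf[end]))
        end++;

    after = end;
    if (after < len) {
        /* CR immediately followed by LF counts as a single terminator. */
        if (buf[after] == 0x000D && after + 1 < len &&
            buf[after + 1] == 0x000A)
            after += 2;
        else
            after += 1;
    }

    *next = after;
    return end;
}
```

With this approach the LF of a CR LF pair is consumed together with the CR, so it never shows up as the first unit of the following line.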

I recommend that you upgrade to ICU 2.8 when it comes out around December.
If you really want the fix now, look at the fix that I put into CVS
earlier this week.
http://oss.software.ibm.com/cvs/icu/icu/source/extra/ustdio/ustdio.c

George Rhoten
IBM Globalization Center of Competency/ICU  San José, CA, USA

dan...@as...
Sent by: icu...@os...
09/25/2003 12:38 PM

To:     icu...@os...
cc:
Subject:        FW: Reading UTF-16 encoded text file with u_fgets

Actually, here's some more info:

The first call to u_fgets returns the first line correctly.  It skips the
0xFFFE at the beginning of the file.  The returned buffer includes the
trailing 0x0D00 as well as null termination 0x0000.

However, the subsequent call returns only 0x0A00 and the null termination
0x0000.   WHY?

Dan

>  -----Original Message-----
> From:         Dan Morales
> Sent: Thursday, September 25, 2003 2:35 PM
> To:   'icu...@os...'
> Subject:      Reading UTF-16 encoded text file with u_fgets
>
> I am reading lines from a text file encoded as UTF-16.  The first
> buffer returned by u_fgets seems OK.
>
> Here's what I see in a hex dump of the utf-16 file:
>
> 1. Beginning of file has 0xFFFE
>
> 2. Each line seems to end with:  0x0D00
>
> 3. Each line begins with: 0x0A00
>
>
> Here's my question:
>
> u_fgets seems to include the 0x0A00 at the beginning of each subsequent
> line.  What is this character?  My code seems to mess up because I do
> some u_strcmp with the beginning of the buffer to see if the line is a
> comment or not (looking for "//").  Can I just ignore the first UChar
> of the buffer?
>
> Please, any help on this is very welcome!
>
> Thanks,
> Dan
_______________________________________________
icu...@os... - icu4c-support mailing list
To Un/Subscribe:
http://oss.software.ibm.com/developerworks/oss/mailman/listinfo/icu4c-support