Re: [gobo-eiffel-develop] UC_STRING is XML parser output

Franck Arnaud wrote:
> 
> Once Eric's UC_STRING (that inherits from STRING) is in, it will
> be trivially possible to have UC_STRINGs only for strings
> that actually contain >127 characters.

I finally committed the new implementation of UC_STRING in CVS.
I ran the bootstrap and the test cases in debug mode (i.e. with
all assertions on) with no problem. It was under Windows NT with:

  ISE 5.1.14
  HACT 4.0.1
  VE 4.0 (build 4001)
  SE -0.74b21

and:

  MSVC 6.0

I took this opportunity to add to the global test procedure
(i.e. when one executes 'geant test_*' in $GOBO/test/) the
test cases for the XML library using Oasis, and also the compilation
of the XML examples. I had some problems with the test cases
for the XML library using Oasis, but I got the same problems
with the old implementation of UC_STRING, so I will report
that to this mailing list in another message since it is not
related to the new UC_STRING.

As already mentioned many times in this mailing list, having
UC_STRING inherit from STRING is not a great design, to say
the least. What we want is to have a common class interface
for our XML library when using STRING and UC_STRING, and hence
to avoid duplicated code. A good design would probably be
to have a common ancestor for STRING and UC_STRING sharing
the common interface, but this class does not exist in ELKS.
So having UC_STRING inherit from STRING is a workaround.

If you were using the old implementation of UC_STRING, the
major changes are that UC_CHARACTER is not expanded anymore
(so make sure to explicitly create it now), and UC_STRING
is now deferred. It currently has one concrete descendant,
UC_UTF8_STRING, but implementations based on 16 and 32 bits
will follow. Unless you explicitly want to create a UTF-8
string, it is recommended to continue declaring unicode
strings with UC_STRING and use the factory routine
UC_UNICODE_FACTORY.new_unicode_string to create them
(other factory routines with different signatures could
be added in the future). This will make it easier for a
project to switch between one unicode encoding and another
by just modifying one routine (or a small set of routines).
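
As an illustration, a project could funnel all creations of
unicode strings through a single routine, along the lines of
the sketch below. The class MY_TEXT_FACTORY and the routine
`new_text' are hypothetical names, and the sketch assumes that
`new_unicode_string' takes a STRING argument and that
UC_UNICODE_FACTORY can be inherited from; check the class text
for the actual signature.

```eiffel
class MY_TEXT_FACTORY
        -- Hypothetical project-level class: the single place where
        -- the project decides how unicode strings are created.

inherit

    UC_UNICODE_FACTORY

feature -- Access

    new_text (a_string: STRING): UC_STRING is
            -- New unicode string made up of the characters of `a_string'
            -- (assumption: `new_unicode_string' accepts a STRING)
        require
            a_string_not_void: a_string /= Void
        do
            Result := new_unicode_string (a_string)
        end

end
```

Client code then declares its entities as UC_STRING and calls
`new_text', so switching to another encoding only means
changing the body of that one routine.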

The routines in the new UC_STRING try to follow those
in ELKS 2001 STRING. I also added routines from the
old UC_STRING and marked them as obsolete to make the
transition smoother. If there is a routine in the old
UC_STRING that you used to call and that is no longer
available in the new UC_STRING, just let me know and
I'll try to add it as obsolete in the new UC_STRING.

Now, for those who will want to take advantage of the
fact that UC_STRING inherits from STRING, please note
that the only routines that are guaranteed to be portable
and polymorphically available in STRING and UC_STRING
are those which are listed in KS_STRING (in $GOBO/library/
kernel/elks/). So when writing a routine accepting
STRINGs but where UC_STRINGs are expected, please use
only these routines. Note that routines in STRING
which assumed that the arguments were about characters
with code less than `Maximum_character_code' have been
renamed with 'latin1' in their names in class UC_STRING,
even though it is clear that there is no guarantee that
characters with code between 128 and 255 are encoded
using Latin-1. When STRING.item is called polymorphically
on a UC_STRING and the character has a code greater
than 255, then '%U' is returned. Therefore it is
recommended to handle character codes (i.e. INTEGER)
instead of CHARACTER (or even UC_CHARACTER, because
UC_CHARACTER is not expanded and a new object is created
each time, which is time and memory consuming). For
that there is STRING.item_code, and many other routines
in UC_STRING with names containing 'code' (e.g.
`append_code'). I started to do that in the Regexp
library to make it Unicode aware, and it seems
to work quite well, both in terms of correctness
and in terms of performance and memory usage
(when compiled with SE with no GC).
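
A routine written in that style might look like the sketch
below. It uses only `count' and `item_code' on a STRING
entity, so it should behave the same whether the argument is
attached to a STRING or to a UC_STRING. The class and routine
names are hypothetical, the sketch assumes that `item_code'
is among the routines available polymorphically, and it has
not been compiled with any of the compilers listed above.

```eiffel
class CODE_COUNTER
        -- Hypothetical helper, for illustration only.

feature -- Measurement

    non_ascii_count (a_string: STRING): INTEGER is
            -- Number of characters in `a_string' with code greater
            -- than 127; works on character codes (INTEGER), so no
            -- CHARACTER or UC_CHARACTER object is involved and codes
            -- above 255 are not turned into '%U'.
        require
            a_string_not_void: a_string /= Void
        local
            i, nb: INTEGER
        do
            nb := a_string.count
            from i := 1 until i > nb loop
                if a_string.item_code (i) > 127 then
                    Result := Result + 1
                end
                i := i + 1
            end
        ensure
            count_small_enough: Result <= a_string.count
        end

end
```

A result of zero means the string contains only characters
with code 127 or less.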

As already discussed with Franck, some of the routines
of STRING (even though listed in KS_STRING) will cause
problems (probably a run-time crash) when the target
is dynamically attached to a STRING and the argument
is dynamically attached to a UC_STRING. This is because
the implementation of STRING provided by the Eiffel
vendors is not aware of the unicode encoding in
UC_STRING. To work around this problem helper routines
will be provided when possible, such as a `concat'
routine instead of calling `append_string' as already
explained during the discussion on this topic with
Franck. These routines are not available yet.

PS: A test case for the new Unicode classes is
available in UC_TEST_UTF8_STRING in $GOBO/test/kernel/.
As a reminder, there is also a test case for testing
the routines listed in KS_STRING with the class STRING
provided by all Eiffel compilers. This test case is
in KS_TEST_STRING.

PPS: Because of a bug in VE 4.0, the new UC_STRING
does not work when compiling with the inlining
optimization. This bug does not allow polymorphic
calls of `put' and `item' in class STRING. So while
waiting for this bug to be fixed it is recommended
not to use VE's ESD inlining optimization option.

-- 
Eric Bezault
mailto:er...@go...
http://www.gobosoft.com
