Re: non-ascii handling


Jens Vagelpohl wrote:
> in handling these various kinds of strings (both UTF-8 encoded unicode 
> and latin-1 encoded unicode for web browser consumption) i always end up 
> running into trouble at some point because in some situations strings 
> get encoded more than once.

Especially in web applications you have to take great care to define the 
charset used. Use <form accept-charset="utf-8" ..> or similar to declare the 
charset of the form input data as well. (This does not prevent clients such 
as StarOffice from sending ISO-8859-1 data anyway.)

Also set the charset of the output in both the HTTP Content-Type header *and* 
the <head> section of the page.
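
A minimal sketch of keeping the two declarations in sync, written for a 
current Python 3. The function name render_page() and the (headers, payload) 
return shape are only illustrative assumptions, not part of any framework. 
The point is that the same charset value is used for the header, the <meta> 
tag and the actual encoding step, so they cannot drift apart.

```python
import html


def render_page(body_text, charset="utf-8"):
    """Return (headers, payload) with one charset used consistently.

    render_page is a hypothetical helper for illustration only.
    """
    page = (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=%s">'
        "</head><body>%s</body></html>"
    ) % (charset, html.escape(body_text))
    # Encode exactly once, at the output boundary.
    payload = page.encode(charset)
    headers = [
        ("Content-Type", "text/html; charset=%s" % charset),
        ("Content-Length", str(len(payload))),
    ]
    return headers, payload
```

Inside the application everything stays a text string; bytes only appear in 
the last step, which also avoids encoding the same data twice.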

> does anyone know of a quick and fast test to 
> determine whether a string is already encoded in a certain encoding? my 
> knowledge of regular expressions (which i assume it would take for that) 
> is extremely limited at best.

Hmm, in some situations wrapping the unicode() call in try: ... except 
UnicodeError: might help. But I do not recommend such a solution (although 
web2ldap sometimes uses it), and you still have to apply specific knowledge 
about the character sets/encodings involved.
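
Roughly, the try/except approach could look like the sketch below. It is 
written for a current Python 3, where bytes.decode() plays the role of 
unicode(); the name decode_guess() and the default candidate list are 
assumptions for illustration. The "specific knowledge" shows up in the order 
of the candidates: ISO-8859-1 accepts every byte sequence, so it can only be 
the last resort, and a match never proves that the guess is right.

```python
def decode_guess(raw, encodings=("utf-8", "iso-8859-1")):
    """Return (text, encoding) for the first candidate that decodes raw.

    decode_guess is a hypothetical helper. The order of encodings matters:
    strict ones like UTF-8 first, ISO-8859-1 (which never fails) last.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")
```

For example, decode_guess(b"\xc3\xa9") is taken as UTF-8, while 
decode_guess(b"\xe9") fails the UTF-8 attempt and falls back to ISO-8859-1. 
Text that is plain ASCII decodes identically under both, so the test says 
nothing about it.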

In my personal experience it is a lot of work to make an application 
Unicode-aware. But you should try to clean up your application design. It's 
worth the effort.

Ciao, Michael.