From: Zoran V. <zv...@ar...> - 2006-08-20 13:12:41
On 15.08.2006, at 21:09, Stephen Deasey wrote:

> In the documents you serve, do you specify the encoding *within* the
> document, at the top of the HTML file for example? Or are you serving
> XML, in which case the default for that is utf-8 anyway (I think, off
> the top of my head...).

We put it in the headers, as an extension of the content-type:
text/html; charset=utf-8

> I agree with Zoran. ns_conn encoding should be the way to change the
> encoding (input or output) at runtime.
>
> The mime-type header sent back to the client does need to reflect the
> encoding used, but ns_conn encoding should drive that, not the other
> way around.

Correct! At the moment we do not mingle with the ns_conn encoding
stuff; rather, we set the mime-type accordingly.

> We can check the mime-type header for a charset declaration, and if
> it's not there, add one for the current value of ns_conn encoding.

Right.

> One problem to be resolved here is that Tcl encoding names do not
> match up with HTTP charset names. HTTP talks about iso-8859-1, while
> Tcl talks about iso8859-1. There are lookup routines to convert HTTP
> charset names to Tcl encoding names, but not the other way around.
> Tcl_GetEncodingName() returns the Tcl name for an encoding, not the
> charset alias we used to get the encoding.

This is just another nuisance we'll have to swallow. We can write our
own name conversion between the known Tcl and HTTP encoding names.
You say there are lookup routines to convert HTTP charset names to Tcl
encoding names -- who wrote them? I cannot imagine the other way
around being a problem.

> We could store the charset, as well as the encoding, for the conn. But
> I was wondering: could we junk all the alias stuff and, in the
> Naviserver install process, create a directory for encoding files and
> fill it with symlinks to the real Tcl encoding files, using the
> charset name?
>
> You call ns_conn encoding with a charset. Naviserver converts the
> charset name to a Tcl encoding name. The return value is the name of
> the encoding, which is *not* the name of the charset you passed in! I
> don't know if that's intended, but it's really confusing.

Uh... I would not rely on that symlinking, if possible (think of
Windows). I believe the better way would be a HTTP->Tcl and Tcl->HTTP
encoding-name lookup function (a rough sketch follows further below).

> Another place this trips up: In the config for the tests Michael
> added:
>
>   ns_section "ns/mimetypes"
>   ns_param .utf2utf_adp "text/plain; charset=utf-8"
>   ns_param .iso2iso_adp "text/plain; charset=iso-8859-1"
>
>   ns_section "ns/encodings"
>   ns_param .utf2utf_adp "utf-8"
>   ns_param .iso2iso_adp "iso-8859-1"
>
> The ns/encodings are the encoding to use to read an ADP file from
> disk, according to extension. It solves the problem: the web
> designer's editor doesn't support utf-8. (I wonder if this is still
> valid any more?)
>
> But the code is actually expecting Tcl encoding names here, not a
> charset, so this config is busted. It doesn't show up in the tests
> because the only alternative encoding we're using is iso-8859-1,
> which also happens to be the default.
>
> This is probably just a bug. The code uses Ns_GetEncoding() when it
> should use Ns_GetCharsetEncoding(). But that highlights another bug:
> when would you ever want to call Ns_GetEncoding()? You always want to
> take into account the charset aliases we carefully set up. This
> probably shouldn't be a public function.
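Regarding the name-lookup function mentioned above: a rough, untested
sketch of what a two-way HTTP<->Tcl mapping could look like. The table
lists only a few illustrative pairs, not the full set of encodings we
would support:

    # Known HTTP charset -> Tcl encoding pairs (illustrative only).
    array set http2tcl {
        iso-8859-1  iso8859-1
        iso-8859-2  iso8859-2
        us-ascii    ascii
        shift_jis   shiftjis
        utf-8       utf-8
    }
    # Build the reverse map for Tcl -> HTTP lookups.
    foreach {charset enc} [array get http2tcl] {
        set tcl2http($enc) $charset
    }

    proc charset_to_encoding {charset} {
        global http2tcl
        set key [string tolower $charset]
        if {[info exists http2tcl($key)]} {
            return $http2tcl($key)
        }
        return $key   ;# fall through: maybe the names already agree
    }

    proc encoding_to_charset {enc} {
        global tcl2http
        if {[info exists tcl2http($enc)]} {
            return $tcl2http($enc)
        }
        return $enc
    }

The real thing would of course live in C next to the existing charset
alias table in nsd/encodings.c; this is just to show the shape of it.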
> The strategy of driving the encoding from the mime-type has some
> other problems. You have to create a whole bunch of fake mime-type /
> extension mappings just to support multiple encodings (the
> ns/mimetypes above).
>
> What if there is no extension? Or you want to keep the .adp (or
> whatever) extension, but serve content in different encodings from
> different parts of the URL tree? Currently you have to put code in
> each ADP to set the mime-type (which is always the same) explicitly,
> to set the charset as a side effect.
>
> AOLserver 4.5 has a ns_register_encoding command, which is perhaps an
> improvement on plain file extensions.
>
> Both AOLserver and our current code base have the bug where the
> character set is only set for mime-types of type text/*. This makes a
> certain amount of sense -- you don't want to be text-encoding a
> dynamically generated gif, for example.
>
> However, the correct mime-type for XHTML is application/xhtml+xml. So
> in this case, an ADP which generates XHTML will not have the correct
> encoding applied if you're relying on the mime-type.
>
> Here's another oddity: Ns_ConnSetWriteEncodedFlag() / ns_conn
> write_encoded. This is yet another way to sort-of set the encoding,
> which is essentially a single property.
>
> The only code which uses this is the ns_write command, and I think it
> has it backwards. By default, if Ns_ConnGetWriteEncodedFlag() returns
> false, ns_write assumes it's writing binary data. But nothing actually
> sets this flag, so ns_write doesn't encode text at all.
>
> We should remove the WRITE_ENCODED stuff.
>
> How do we handle binary data from Tcl anyway? There's a -binary
> switch to ns_return, and the write_encoded flag for ns_write. I was
> wondering if we could just check the type of the Tcl object passed in
> to any of the ns_return-like functions to see if it's of type
> "bytearray". A byte array *could* get shimmered to a string, and then
> back again without data loss, but that's probably unlikely in
> practice.
>
> There's also the problem of input encodings. If you're supporting
> multiple encodings, how do you know what encoding the query data is
> in? A couple of solutions suggested in Rob Mayoff's guide are to put
> this in a hidden form field, or to put it in a cookie.
>
> Here's an interesting bug: You need to get the character set a form
> was encoded in, so you call ns_queryget ... This first invokes the
> legacy ns_getform call which, among other things, pulls any file
> upload data out of the content stream and puts it into temp files.
>
> Now, you have to assume *some* encoding to get at the query data in
> the first place. So let's guess and say utf-8. Uh oh, our hidden form
> field says iso-8859-2. OK, so we call ns_conn encoding iso-8859-2 to
> reset the encoding, and this call flushes the query data which was
> previously decoded using utf-8. It also flushes our uploaded files.
> The kicker here is uploaded files aren't even decoded using a text
> encoding, so when the query data is decoded again, this time using
> iso-8859-2, the uploaded files will be exactly the same as they were
> before.
>
> I'm sure there's some more stuff I'm forgetting. Anyway, here's how I
> think it should be:
>
> * utf-8 by default

Yes.

> * mime-types are just mime-types

Yes.

> * always hack the mime-type for text data to add the charset

Yes (shouldn't this automatically be done by the server?).

> * text is anything sent via Ns_ConnReturnCharData()

Yes.

> * binary is a Tcl bytearray object
>
> * static files are served as-is, text or binary

Yes, as-is, i.e. no conversion taking place.

> * multiple encodings are handled via calling ns_conn encoding
>
> * folks need to do this manually. no more file extension magic

Yes.
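To make the "manual" step concrete, here is roughly what a page (or a
site-wide library proc) might do under the scheme above. This is only
a sketch: it assumes the server appends the charset to text/*
responses automatically as proposed, and the cookie name is made up:

    # Decide the charset *before* touching the form, so the query data
    # is only ever decoded once (see the flushing bug above).
    set charset utf-8
    set cookies [ns_set iget [ns_conn headers] Cookie]
    if {[regexp {request_charset=([^;]+)} $cookies -> c]} {
        set charset $c
    }
    # Per the discussion above, this also governs how the form data is
    # decoded, not just the output.
    ns_conn encoding $charset

    set form [ns_getform]           ;# decoded with the charset set above
    set html "<html><body>...</body></html>"
    ns_return 200 text/html $html   ;# charset appended by the server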
> I think a nice way for folks to handle multiple encodings is to
> register a filter, which you can of course use to simulate the file
> extension scheme in place now, the AOLserver 4.5 ns_register_encoding
> stuff, and more, because it's a filter. You can also do things like
> check query data or cookies for the charset to use.
>
> Questions that need answering:
>
> * can we junk charset aliases in nsd/encodings.c and use a dir of
>   symlinks?

No. I hate being dependent on the distribution directory.

> * can we junk ns/encodings in 2006?

I'm afraid it will not happen that fast!

> * is checking for bytearray type a good way to handle binary Tcl
>   objects?

Hm... how would you (generally) check for a byte array? You can get a
byte array out of an object, but you can't say which type of object
you are dealing with (without looking into the object type, which is
really not something that is portable).

I believe the main source of problems here is somebody slurping a
whole file and wanting to return it as-is, i.e. w/o any translations.
In that case, he/she could use [ns_conn encoding] to set the encoding
of the connection before calling ns_return. This way we can strip away
all the extra options from the content-returning commands and require
the writer to use ns_conn to set the correct encoding OR to skip
encoding altogether (for example: ns_conn encoding binary). Wouldn't
that make sense?

> * does the above scheme handle all the requirements?

No idea! I'd say that, as our requirements are pretty small, it
definitely does. But Bernd & co. have more to say, as they obviously
need much more than we do.

Cheers
Zoran
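For illustration, Stephen's filter idea above could look something
like this. It is a rough sketch only: the URL subtrees and charsets
are made up, and the exact arguments the filter proc receives should
be checked against the ns_register_filter documentation:

    # Set the connection encoding per URL subtree at preauth time,
    # before any query data is read.
    proc set_encoding_filter {charset args} {
        # args carries whatever else the server passes (e.g. the "why");
        # we only need the charset that was registered with the filter.
        ns_conn encoding $charset
        return filter_ok
    }

    foreach method {GET POST} {
        ns_register_filter preauth $method /cz/* set_encoding_filter iso-8859-2
        ns_register_filter preauth $method /jp/* set_encoding_filter shift_jis
    }

The same filter could just as easily look at a cookie or the file
extension instead of the path, which is what makes it more general
than the ns_register_encoding approach.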