From: Stephen D. <sd...@gm...> - 2006-08-15 19:09:17
On 8/14/06, Zoran Vasiljevic <zv...@ar...> wrote:
>
> On 14.08.2006, at 22:43, Stephen Deasey wrote:
>
> > * Your clients don't understand utf-8, and you serve content in
> > multiple languages which don't share a common character set. Sucks to
> > be you.
>
> I think the whole purpose of that encoding mess is the case above.
> With time this will be less and less important, so how much
> should we really care?
>
> From the technical perspective, it is nice to have a universal
> and general solution, but from the practical side it costs
> time and money to keep it around...

I agree. I was wondering if we should junk the whole mess, but I think we can minimise the impact without losing the ability to support multiple encodings, and in fact improve the support.

> > I've been working on some patches to fix the various bugs so don't
> > worry about it too much. But I'd appreciate feedback on how you
> > actually use the encoding support.
>
> I use it this way: leave everything as-is. I never had to
> tweak any of the existing encoding knobs. And I never had
> anybody complaining. And we do serve Japanese, Chinese and
> European languages. Alright, the client is always either IE
> or Mozilla or Safari (prerequisite), so mine is perhaps not a
> good example.

In the documents you serve, do you specify the encoding *within* the document, at the top of the HTML file for example? Or are you serving XML, in which case the default for that is utf-8 anyway (I think, off the top of my head...).

Another possibility is that you happen to be using browsers which are smart enough to reparse a document if it doesn't happen to be in the encoding they first expected. I think the big guys do this -- not sure your mobile phone will be so forgiving.

> Apropos chunked encoding: I still believe that the vectorized
> IO is OK and the way you transform UTF8 on the fly is also OK.
> So, if any content encoding has to take place, you can really
> only do it with the chunked encoding, OR by converting the whole
> content in memory prior to sending it and giving the correct content
> length, OR by just omitting the content length altogether.
> I do not think there are other options.
>
> I'm curious what you will come up with ;-)

I'll handle the IO stuff in a separate post. Here's something I wrote up re encodings and such:

(This applies to case 3: supporting multiple encodings.)

I agree with Zoran. ns_conn encoding should be the way to change the encoding (input or output) at runtime.

The mime-type header sent back to the client does need to reflect the encoding used, but ns_conn encoding should drive that, not the other way around. We can check the mime-type header for a charset declaration, and if it's not there, add one for the current value of ns_conn encoding.

One problem to be resolved here is that Tcl encoding names do not match up with HTTP charset names. HTTP talks about iso-8859-1, while Tcl talks about iso8859-1. There are lookup routines to convert HTTP charset names to Tcl encoding names, but not the other way around. Tcl_GetEncodingName() returns the Tcl name for an encoding, not the charset alias we used to get the encoding.

We could store the charset, as well as the encoding, for the conn. But I was wondering: could we junk all the alias stuff and, in the Naviserver install process, create a directory for encoding files and fill it with symlinks to the real Tcl encoding files, using the charset name?
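Alternatively, we could keep the alias table in both directions ourselves. A minimal sketch of the idea in plain Tcl -- the proc names are hypothetical and the alias list abbreviated:

    # Hypothetical two-way map between HTTP charset names and Tcl
    # encoding names. Today only the charset -> encoding direction
    # exists; the reverse table is the missing piece.
    array set charset2enc {
        iso-8859-1  iso8859-1
        iso-8859-2  iso8859-2
        shift_jis   shiftjis
        utf-8       utf-8
    }
    foreach {cs enc} [array get charset2enc] {
        set enc2charset($enc) $cs
    }

    proc CharsetToEncoding {cs} {
        global charset2enc
        set key [string tolower $cs]
        if {[info exists charset2enc($key)]} { return $charset2enc($key) }
        return $key   ;# maybe it is already a Tcl encoding name
    }

    proc EncodingToCharset {enc} {
        global enc2charset
        if {[info exists enc2charset($enc)]} { return $enc2charset($enc) }
        return $enc
    }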
You call ns_conn encoding with a charset. Naviserver converts the charset name to a Tcl encoding name. The return value is the name of the encoding, which is *not* the name of the charset you passed in! I don't know if that's intended, but it's really confusing.

Another place this trips up: in the config for the tests Michael added:

    ns_section "ns/mimetypes"
    ns_param .utf2utf_adp "text/plain; charset=utf-8"
    ns_param .iso2iso_adp "text/plain; charset=iso-8859-1"

    ns_section "ns/encodings"
    ns_param .utf2utf_adp "utf-8"
    ns_param .iso2iso_adp "iso-8859-1"

The ns/encodings are the encoding to use to read an ADP file from disk, according to extension. It solves the problem of the web designer's editor not supporting utf-8. (I wonder if this is still valid any more?)

But the code is actually expecting Tcl encoding names here, not a charset, so this config is busted. It doesn't show up in the tests because the only alternative encoding we're using is iso-8859-1, which also happens to be the default.

This is probably just a bug. The code uses Ns_GetEncoding() when it should use Ns_GetCharsetEncoding(). But that highlights another bug: when would you ever want to call Ns_GetEncoding()? You always want to take into account the charset aliases we carefully set up. This probably shouldn't be a public function.

The strategy of driving the encoding from the mime-type has some other problems. You have to create a whole bunch of fake mime-type / extension mappings just to support multiple encodings (the ns/mimetypes above).

What if there is no extension? Or you want to keep the .adp (or whatever) extension, but serve content in different encodings from different parts of the URL tree? Currently you have to put code in each ADP to set the mime-type (which is always the same) explicitly, just to set the charset as a side effect.

AOLserver 4.5 has a ns_register_encoding command, which is perhaps an improvement on plain file extensions.

Both AOLserver and our current code base have the bug where the character set is only set for mime-types of type text/*. This makes a certain amount of sense -- you don't want to be text-encoding a dynamically generated gif, for example. However, the correct mime-type for XHTML is application/xhtml+xml. So in this case, an ADP which generates XHTML will not have the correct encoding applied if you're relying on the mime-type.

Here's another oddity: Ns_ConnSetWriteEncodedFlag() / ns_conn write_encoded. This is yet another way to sort-of set the encoding, which is essentially a single property. The only code which uses this is the ns_write command, and I think it has it backwards. By default, if Ns_ConnGetWriteEncodedFlag() returns false, ns_write assumes it's writing binary data. But nothing actually sets this flag, so ns_write doesn't encode text at all. We should remove the WRITE_ENCODED stuff.

How do we handle binary data from Tcl anyway? There's a -binary switch to ns_return, and the write_encoded flag for ns_write. I was wondering if we could just check the type of the Tcl object passed in to any of the ns_return-like functions to see if it's of type "bytearray". A byte array *could* get shimmered to a string, and then back again without data loss, but that's probably unlikely in practice.

There's also the problem of input encodings. If you're supporting multiple encodings, how do you know what encoding the query data is in? A couple of solutions suggested in Rob Mayoff's guide are to put this in a hidden form field, or to put it in a cookie.
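In Tcl, the hidden-field variant would look roughly like this (the "charset" field name is invented for the example):

    # Rough sketch: learn the charset from a hidden form field,
    # then reset the connection encoding and re-read the data.
    set cs [ns_queryget charset]
    if {$cs ne ""} {
        ns_conn encoding $cs
    }
    set name [ns_queryget name]   ;# re-decoded with the new encoding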
Here's an interesting bug: you need to get the character set a form was encoded in, so you call ns_queryget ... This first invokes the legacy ns_getform call which, among other things, pulls any file upload data out of the content stream and puts it into temp files.

Now, you have to assume *some* encoding to get at the query data in the first place. So let's guess and say utf-8. Uh oh, our hidden form field says iso-8859-2. OK, so we call ns_conn encoding iso-8859-2 to reset the encoding, and this call flushes the query data which was previously decoded using utf-8. It also flushes our uploaded files. The kicker here is that uploaded files aren't even decoded using a text encoding, so when the query data is again decoded, this time using iso-8859-2, the uploaded files will be exactly the same as they were before.

I'm sure there's some more stuff I'm forgetting. Anyway, here's how I think it should be:

* utf-8 by default
* mime-types are just mime-types
* always hack the mime-type for text data to add the charset
* text is anything sent via Ns_ConnReturnCharData()
* binary is a Tcl bytearray object
* static files are served as-is, text or binary
* multiple encodings are handled via calling ns_conn encoding
* folks need to do this manually. no more file extension magic

I think a nice way for folks to handle multiple encodings is to register a filter (see the sketch at the end of this message), which you can of course use to simulate the file extension scheme in place now, the AOLserver 4.5 ns_register_encoding stuff, and more, because it's a filter. You can also do things like check query data or cookies for the charset to use.

Questions that need answering:

* can we junk charset aliases in nsd/encodings.c and use a dir of symlinks?
* can we junk ns/encodings in 2006?
* is checking for the bytearray type a good way to handle binary Tcl objects?
* does the above scheme handle all the requirements?

Bugs to fix:

* query data flushing is too extreme. don't flush files
* junk Ns_Conn*WriteEncodedFlag() / ns_conn write_encoded

There's also the content-length bug, but I think that's a separate problem; I'm going to look into that next, as I wrote it. If anyone else wants to tackle any of the above because they need it soon, go ahead. If not, I'll do that next.
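P.S. For concreteness, here is roughly what such an encoding filter could look like. ns_register_filter is the existing filter API; the URL-to-charset mapping (and the assumption that ns_conn encoding accepts charset names) is just for illustration:

    # Pick an output charset per URL tree -- or per extension --
    # emulating the old file-extension magic without fake mime-types.
    proc choose_charset {why} {
        switch -glob -- [ns_conn url] {
            /jp/*     { ns_conn encoding shift_jis }
            /cz/*     { ns_conn encoding iso-8859-2 }
            *.iso_adp { ns_conn encoding iso-8859-1 }
        }
        return filter_ok
    }
    ns_register_filter preauth GET /* choose_charset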
From: Bernd E. <eid...@we...> - 2006-08-21 09:26:43
Hi Stephen,

> In the documents you serve, do you specify the encoding *within* the
> document, at the top of the HTML file for example? Or are you serving
> XML, in which case the default for that is utf-8 anyway (I think, off
> the top of my head...).

Usually we specify it both ways: in the meta http-equiv part of the HTML header and in the Content-Type header of the HTTP response.

> Another possibility is that you happen to be using browsers which are
> smart enough to reparse a document if it doesn't happen to be in the
> encoding they first expected. I think the big guys do this -- not sure
> your mobile phone will be so forgiving.

I'd say we are perfectly happy with just setting up the config file via the ns/encodings + ns/mimetypes sections and letting the server handle the rest. The fewer knobs the better. We know (or can control) the encoding of files on disk, we set up the encoding of the database -- and then we simply want to return the specified encoding. We have different sites running with iso-8859-1, -15 and utf-8. Usually we have no need to do runtime changes, but if so, I would like to see ns_conn do the expected thing.

Relying only on (aka being forced to use) UTF-8 would not be optimal, as a potential NaviServer user might want to use another specific encoding or avoid a UTF/Unicode database setup for whatever reason, e.g. performance, storage, or to avoid collation issues (sorting orders). For us, using only web and HTTP, moving every installation to UTF-8 is nevertheless the way to go.

> (This applies to case 3: supporting multiple encodings.)
>
> I agree with Zoran. ns_conn encoding should be the way to change the
> encoding (input or output) at runtime.

Yes.

> Another place this trips up: in the config for the tests Michael added:
>
>     ns_section "ns/mimetypes"
>     ns_param .utf2utf_adp "text/plain; charset=utf-8"
>     ns_param .iso2iso_adp "text/plain; charset=iso-8859-1"
>
>     ns_section "ns/encodings"
>     ns_param .utf2utf_adp "utf-8"
>     ns_param .iso2iso_adp "iso-8859-1"
>
> The ns/encodings are the encoding to use to read an ADP file from
> disk, according to extension. It solves the problem of the web
> designer's editor not supporting utf-8.

You focus here only on web designers and adp files; it could be every other kind of usage as well (file exports etc.).

> But the code is actually expecting Tcl encoding names here, not a
> charset, so this config is busted. It doesn't show up in the tests
> because the only alternative encoding we're using is iso-8859-1, which
> also happens to be the default.

This is correct -- an annoying thing to be aware of.

> The strategy of driving the encoding from the mime-type has some other
> problems. You have to create a whole bunch of fake mime-type /
> extension mappings just to support multiple encodings (the
> ns/mimetypes above).
>
> What if there is no extension? Or you want to keep the .adp (or
> whatever) extension, but serve content in different encodings from
> different parts of the URL tree? Currently you have to put code in
> each ADP to set the mime-type (which is always the same) explicitly,
> just to set the charset as a side effect.

This is true. It does not affect our apps, as we commit to one encoding and then cache the HTML output to files on disk, but it is not nice if you have the need to change it.
> * utf-8 by default
> * mime-types are just mime-types
> * always hack the mime-type for text data to add the charset
> * text is anything sent via Ns_ConnReturnCharData()
> * binary is a Tcl bytearray object
> * static files are served as-is, text or binary
> * multiple encodings are handled via calling ns_conn encoding
> * folks need to do this manually. no more file extension magic
>
> I think a nice way for folks to handle multiple encodings is to
> register a filter, which you can of course use to simulate the file
> extension scheme in place now, the AOLserver 4.5 ns_register_encoding
> stuff, and more, because it's a filter. You can also do things like
> check query data or cookies for the charset to use.

As our app has one main filter that handles the file dispatching, we would simply place it there. But we should find a solution that is both flexible and compatible with respect to the "file extension magic", if possible!

> Questions that need answering:
>
> * can we junk charset aliases in nsd/encodings.c and use a dir of symlinks?

I would vote for a non-filesystem-based lookup function.

> * can we junk ns/encodings in 2006?

I would not recommend it, as the server would lose one of its purposes.

Bernd.
From: Zoran V. <zv...@ar...> - 2006-08-20 13:12:41
On 15.08.2006, at 21:09, Stephen Deasey wrote:

> In the documents you serve, do you specify the encoding *within* the
> document, at the top of the HTML file for example? Or are you serving
> XML, in which case the default for that is utf-8 anyway (I think, off
> the top of my head...).

We put it in the headers, as an extension of the content-type:

    text/html; charset=utf-8

> I agree with Zoran. ns_conn encoding should be the way to change the
> encoding (input or output) at runtime.
>
> The mime-type header sent back to the client does need to reflect the
> encoding used, but ns_conn encoding should drive that, not the other
> way around.

Correct! At the moment we do not mingle with the ns_conn encoding stuff; rather, we set the mime-type accordingly.

> We can check the mime-type header for a charset declaration, and if
> it's not there, add one for the current value of ns_conn encoding.

Right.

> One problem to be resolved here is that Tcl encoding names do not
> match up with HTTP charset names. HTTP talks about iso-8859-1, while
> Tcl talks about iso8859-1. There are lookup routines to convert HTTP
> charset names to Tcl encoding names, but not the other way around.
> Tcl_GetEncodingName() returns the Tcl name for an encoding, not the
> charset alias we used to get the encoding.

This is just-another-nuisance we'll have to swallow. We can write our own name conversion between the known Tcl and HTTP encoding names. You say there are lookup routines to convert HTTP charset names to Tcl encoding names -- who wrote them? I cannot imagine the other way around would be the problem.

> We could store the charset, as well as the encoding, for the conn. But
> I was wondering: could we junk all the alias stuff and, in the
> Naviserver install process, create a directory for encoding files and
> fill it with symlinks to the real Tcl encoding files, using the
> charset name?
>
> You call ns_conn encoding with a charset. Naviserver converts the
> charset name to a Tcl encoding name. The return value is the name of
> the encoding, which is *not* the name of the charset you passed in! I
> don't know if that's intended, but it's really confusing.

Uh... I would not rely on that symlinking, if possible (think of Windows). I believe the better way would be a HTTP->Tcl and Tcl->HTTP encoding-name lookup function.

> Another place this trips up: in the config for the tests Michael
> added:
>
>     ns_section "ns/mimetypes"
>     ns_param .utf2utf_adp "text/plain; charset=utf-8"
>     ns_param .iso2iso_adp "text/plain; charset=iso-8859-1"
>
>     ns_section "ns/encodings"
>     ns_param .utf2utf_adp "utf-8"
>     ns_param .iso2iso_adp "iso-8859-1"
>
> The ns/encodings are the encoding to use to read an ADP file from
> disk, according to extension. It solves the problem of the web
> designer's editor not supporting utf-8. (I wonder if this is still
> valid any more?)
>
> But the code is actually expecting Tcl encoding names here, not a
> charset, so this config is busted. It doesn't show up in the tests
> because the only alternative encoding we're using is iso-8859-1, which
> also happens to be the default.
>
> This is probably just a bug. The code uses Ns_GetEncoding() when it
> should use Ns_GetCharsetEncoding(). But that highlights another bug:
> when would you ever want to call Ns_GetEncoding()? You always want to
> take into account the charset aliases we carefully set up. This
> probably shouldn't be a public function.
>
> The strategy of driving the encoding from the mime-type has some other
> problems.
> You have to create a whole bunch of fake mime-type /
> extension mappings just to support multiple encodings (the
> ns/mimetypes above).
>
> What if there is no extension? Or you want to keep the .adp (or
> whatever) extension, but serve content in different encodings from
> different parts of the URL tree? Currently you have to put code in
> each ADP to set the mime-type (which is always the same) explicitly,
> just to set the charset as a side effect.
>
> AOLserver 4.5 has a ns_register_encoding command, which is perhaps an
> improvement on plain file extensions.
>
> Both AOLserver and our current code base have the bug where the
> character set is only set for mime-types of type text/*. This makes a
> certain amount of sense -- you don't want to be text-encoding a
> dynamically generated gif, for example.
>
> However, the correct mime-type for XHTML is application/xhtml+xml. So
> in this case, an ADP which generates XHTML will not have the correct
> encoding applied if you're relying on the mime-type.
>
> Here's another oddity: Ns_ConnSetWriteEncodedFlag() / ns_conn
> write_encoded. This is yet another way to sort-of set the encoding,
> which is essentially a single property.
>
> The only code which uses this is the ns_write command, and I think it
> has it backwards. By default, if Ns_ConnGetWriteEncodedFlag() returns
> false, ns_write assumes it's writing binary data. But nothing actually
> sets this flag, so ns_write doesn't encode text at all.
>
> We should remove the WRITE_ENCODED stuff.
>
> How do we handle binary data from Tcl anyway? There's a -binary switch
> to ns_return, and the write_encoded flag for ns_write. I was
> wondering if we could just check the type of the Tcl object passed in
> to any of the ns_return-like functions to see if it's of type
> "bytearray". A byte array *could* get shimmered to a string, and then
> back again without data loss, but that's probably unlikely in
> practice.
>
> There's also the problem of input encodings. If you're supporting
> multiple encodings, how do you know what encoding the query data is
> in? A couple of solutions suggested in Rob Mayoff's guide are to put
> this in a hidden form field, or to put it in a cookie.
>
> Here's an interesting bug: you need to get the character set a form
> was encoded in, so you call ns_queryget ... This first invokes the
> legacy ns_getform call which, among other things, pulls any file upload
> data out of the content stream and puts it into temp files.
>
> Now, you have to assume *some* encoding to get at the query data in
> the first place. So let's guess and say utf-8. Uh oh, our hidden form
> field says iso-8859-2. OK, so we call ns_conn encoding iso-8859-2 to
> reset the encoding, and this call flushes the query data which was
> previously decoded using utf-8.
> It also flushes our uploaded files. The kicker here is that uploaded
> files aren't even decoded using a text encoding, so when the query data
> is again decoded, this time using iso-8859-2, the uploaded files will
> be exactly the same as they were before.
>
> I'm sure there's some more stuff I'm forgetting. Anyway, here's how I
> think it should be:
>
> * utf-8 by default

Yes.

> * mime-types are just mime-types

Yes.

> * always hack the mime-type for text data to add the charset

Yes (shouldn't this automatically be done by the server?)

> * text is anything sent via Ns_ConnReturnCharData()

Yes.

> * binary is a Tcl bytearray object
> * static files are served as-is, text or binary

Yes, as-is, i.e. no conversion taking place.
> * multiple encodings are handled via calling ns_conn encoding
> * folks need to do this manually. no more file extension magic

Yes.

> I think a nice way for folks to handle multiple encodings is to
> register a filter, which you can of course use to simulate the file
> extension scheme in place now, the AOLserver 4.5 ns_register_encoding
> stuff, and more, because it's a filter. You can also do things like
> check query data or cookies for the charset to use.
>
> Questions that need answering:
>
> * can we junk charset aliases in nsd/encodings.c and use a dir of
> symlinks?

No. I hate being dependent on the distribution directory.

> * can we junk ns/encodings in 2006?

I'm afraid it will not be that fast!

> * is checking for the bytearray type a good way to handle binary Tcl
> objects?

Hm... how would you (generally) check for a byte array? You can get a byte array from an object, but you can't (without looking into the object type, which is really not something that is portable) say which type of object you are dealing with.

I believe the main source of the problem here is somebody slurping a whole file and wanting to return that file as-is, i.e. w/o any translations. In that case, he/she could use [ns_conn encoding] to set the encoding of the connection before calling ns_return. This way we can strip away all the extra options from the content-returning commands and request the writer to use ns_conn to set the correct encoding OR to skip encoding altogether (for example: ns_conn encoding binary). Wouldn't that make sense? (A sketch follows below my signature.)

> * does the above scheme handle all the requirements?

No idea! I'd say, as our requirements are pretty small, it definitely does. But Bernd & co. have more to say, as they obviously need much more than we do.

Cheers,
Zoran
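P.S. A minimal sketch of the slurp-and-return case under that scheme. Note "binary" as an argument to ns_conn encoding is the proposal, not necessarily something the server accepts today:

    # Return a file verbatim, skipping the text conversion by
    # selecting the proposed no-op "binary" pseudo-encoding.
    set fd [open $file r]              ;# $file: some path
    fconfigure $fd -translation binary
    set data [read $fd]
    close $fd
    ns_conn encoding binary
    ns_return 200 image/png $data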
From: Stephen D. <sd...@gm...> - 2006-08-24 20:08:24
On 8/20/06, Zoran Vasiljevic <zv...@ar...> wrote:
>
> On 15.08.2006, at 21:09, Stephen Deasey wrote:
>
> > * is checking for the bytearray type a good way to handle binary Tcl
> > objects?
>
> Hm... how would you (generally) check for a byte array?
> You can get a byte array from an object, but you can't
> (without looking into the object type, which is really not
> something that is portable) say which type of object you
> are dealing with.

Sure you can:

    Tcl_ObjType *byteArrayTypePtr = Tcl_GetObjType("bytearray");

    if (objPtr->typePtr == byteArrayTypePtr) {
        /* It's a byte array... */
    }

> I believe the main source of the problem here is somebody
> slurping a whole file and wanting to return that file
> as-is, i.e. w/o any translations. In that case, he/she
> could use [ns_conn encoding] to set the encoding of the
> connection before calling ns_return. This way we can
> strip away all the extra options from the content-returning
> commands and request the writer to use ns_conn to set
> the correct encoding OR to skip encoding altogether
> (for example: ns_conn encoding binary).
> Wouldn't that make sense?

I was thinking more of the case where you dynamically create a binary object, like a 'captcha' image. You want to ns_return it and have the server do the right thing without having to fiddle with a -binary switch.

Another place where checking for a byte array might be good is the caching code. When you cache an object, it first gets converted to a valid utf-8 rep. When you get the object out of the cache, if you treat it as a byte array, everything still works -- the conversion from byte array to utf-8 string and back again is not lossy, just not optimal (demo in the P.S. below).

As we're starting to see things like the -binary switch spread, I was wondering if it is, in general, a good idea to check for byte arrays and have things work transparently. Are there any gotchas?
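P.S. A quick demonstration of the lossless round trip, in plain Tcl (nothing NaviServer-specific):

    # Byte array -> utf-8 string rep -> byte array; the bytes
    # survive ("binary scan c" reports them back, signed).
    set bytes [binary format c* {1 127 128 255}]
    set str   [format %s $bytes]   ;# forces a string conversion
    binary scan $str c* values     ;# converts back to bytes
    puts $values                   ;# prints: 1 127 -128 -1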
From: Zoran V. <zv...@ar...> - 2006-08-24 20:23:34
On 24.08.2006, at 22:08, Stephen Deasey wrote:

> > Hm... how would you (generally) check for a byte array?
> > You can get a byte array from an object, but you can't
> > (without looking into the object type, which is really not
> > something that is portable) say which type of object you
> > are dealing with.
>
> Sure you can:
>
>     Tcl_ObjType *byteArrayTypePtr = Tcl_GetObjType("bytearray");
>
>     if (objPtr->typePtr == byteArrayTypePtr) {
>         /* It's a byte array... */
>     }

Yup, I know that. I'm just not sure if you are "allowed" to peek at the type from the outside. Normally, Tcl would provide you with such an API, like Tcl_IsByteArrayObj(objPtr) or some such. The fact that they don't obviously means something, I believe.

> > I believe the main source of the problem here is somebody
> > slurping a whole file and wanting to return that file
> > as-is, i.e. w/o any translations. In that case, he/she
> > could use [ns_conn encoding] to set the encoding of the
> > connection before calling ns_return. This way we can
> > strip away all the extra options from the content-returning
> > commands and request the writer to use ns_conn to set
> > the correct encoding OR to skip encoding altogether
> > (for example: ns_conn encoding binary).
> > Wouldn't that make sense?
>
> I was thinking more of the case where you dynamically create a binary
> object, like a 'captcha' image. You want to ns_return it and have the
> server do the right thing without having to fiddle with a -binary
> switch.
>
> Another place where checking for a byte array might be good is the
> caching code. When you cache an object, it first gets converted to a
> valid utf-8 rep. When you get the object out of the cache, if you
> treat it as a byte array, everything still works -- the conversion from
> byte array to utf-8 string and back again is not lossy, just not
> optimal.

That's true.

> As we're starting to see things like the -binary switch spread, I was
> wondering if it is, in general, a good idea to check for byte arrays
> and have things work transparently. Are there any gotchas?

I will think about that. I would like to see fewer of those "-binary" things all around, as we all see that they just confuse people. I just have a bad feeling about relying on the object type, as it can easily be changed (dropped) by Tcl because of the "everything is a string" paradigm that Tcl still enforces. There must be some better way to do that, although if you ask me how, I can't give an answer.

Cheers,
Zoran