From: Zoran V. <zv...@ar...> - 2006-08-20 13:12:41
On 15.08.2006, at 21:09, Stephen Deasey wrote:

> In the documents you serve, do you specify the encoding *within* the
> document, at the top of the HTML file for example? Or are you serving
> XML, in which case the default for that is utf-8 anyway (I think, off
> the top of my head...).

We put it in the headers, as an extension of the content-type:
text/html; charset=utf-8

> I agree with Zoran. ns_conn encoding should be the way to change the
> encoding (input or output) at runtime.
>
> The mime-type header sent back to the client does need to reflect the
> encoding used, but ns_conn encoding should drive that, not the other
> way around.

Correct! At the moment we do not meddle with the ns_conn encoding
stuff; rather, we set the mime-type accordingly.

> We can check the mime-type header for a charset declaration, and if
> it's not there, add one for the current value of ns_conn encoding.

Right.

> One problem to be resolved here is that Tcl encoding names do not
> match up with HTTP charset names. HTTP talks about iso-8859-1, while
> Tcl talks about iso8859-1. There are lookup routines to convert HTTP
> charset names to Tcl encoding names, but not the other way around.
> Tcl_GetEncodingName() returns the Tcl name for an encoding, not the
> charset alias we used to get the encoding.

This is just another nuisance we'll have to swallow. We can write our
own name conversion between the known Tcl and HTTP encoding names.
You say there are lookup routines to convert HTTP charset names to Tcl
encoding names. Who wrote them? I cannot imagine the other way around
would be a problem.

> We could store the charset, as well as the encoding, for the conn.
> But I was wondering: could we junk all the alias stuff and, in the
> Naviserver install process, create a directory for encoding files and
> fill it with symlinks to the real Tcl encoding files, using the
> charset name?
>
> You call ns_conn encoding with a charset. Naviserver converts the
> charset name to a Tcl encoding name. The return value is the name of
> the encoding, which is *not* the name of the charset you passed in!
> I don't know if that's intended, but it's really confusing.

Uh... I would not rely on that symlinking, if possible (think of
Windows). I believe the better way would be an HTTP->Tcl and
Tcl->HTTP encoding-name lookup function.

> Another place this trips up: in the config for the tests Michael
> added:
>
>     ns_section "ns/mimetypes"
>     ns_param .utf2utf_adp "text/plain; charset=utf-8"
>     ns_param .iso2iso_adp "text/plain; charset=iso-8859-1"
>
>     ns_section "ns/encodings"
>     ns_param .utf2utf_adp "utf-8"
>     ns_param .iso2iso_adp "iso-8859-1"
>
> The ns/encodings are the encoding to use to read an ADP file from
> disk, according to extension. It solves the problem: the web
> designer's editor doesn't support utf-8. (I wonder if this is still
> valid any more?)
>
> But the code is actually expecting Tcl encoding names here, not a
> charset, so this config is busted. It doesn't show up in the tests
> because the only alternative encoding we're using is iso-8859-1,
> which also happens to be the default.
>
> This is probably just a bug. The code uses Ns_GetEncoding() when it
> should use Ns_GetCharsetEncoding(). But that highlights another bug:
> when would you ever want to call Ns_GetEncoding()? You always want to
> take into account the charset aliases we carefully set up. This
> probably shouldn't be a public function.
>
> The strategy of driving the encoding from the mime-type has some
> other problems. You have to create a whole bunch of fake mime-types /
> extension mappings just to support multiple encodings (the
> ns/mimetypes above).
>
> What if there is no extension? Or you want to keep the .adp (or
> whatever) extension, but serve content in different encodings from
> different parts of the URL tree? Currently you have to put code in
> each ADP to set the mime-type (which is always the same) explicitly,
> to set the charset as a side effect.
>
> AOLserver 4.5 has a ns_register_encoding command, which is perhaps an
> improvement on plain file extensions.
>
> Both AOLserver and our current code base have the bug where the
> character set is only set for mime-types of type text/*. This makes a
> certain amount of sense -- you don't want to be text-encoding a
> dynamically generated gif, for example.
>
> However, the correct mime-type for XHTML is application/xhtml+xml. So
> in this case, an ADP which generates XHTML will not have the correct
> encoding applied if you're relying on the mime-type.
>
> Here's another oddity: Ns_ConnSetWriteEncodedFlag() / ns_conn
> write_encoded. This is yet another way to sort-of set the encoding,
> which is essentially a single property.
>
> The only code which uses this is the ns_write command, and I think it
> has it backwards. By default, if Ns_ConnGetWriteEncodedFlag() returns
> false, ns_write assumes it's writing binary data. But nothing
> actually sets this flag, so ns_write doesn't encode text at all.
>
> We should remove the WRITE_ENCODED stuff.
>
> How do we handle binary data from Tcl anyway? There's a -binary
> switch to ns_return, and the write_encoded flag for ns_write. I was
> wondering if we could just check the type of the Tcl object passed in
> to any of the ns_return-like functions to see if it's of type
> "bytearray". A byte array *could* get shimmered to a string, and then
> back again without data loss, but that's probably unlikely in
> practice.
>
> There's also the problem of input encodings. If you're supporting
> multiple encodings, how do you know what encoding the query data is
> in? A couple of solutions suggested in Rob Mayoff's guide are to put
> this in a hidden form field, or to put it in a cookie.
>
> Here's an interesting bug: you need to get the character set a form
> was encoded in, so you call ns_queryget ... This first invokes the
> legacy ns_getform call which, among other things, pulls any file
> upload data out of the content stream and puts it into temp files.
>
> Now, you have to assume *some* encoding to get at the query data in
> the first place. So let's guess and say utf-8. Uh oh, our hidden form
> field says iso-8859-2. OK, so we call ns_conn encoding iso-8859-2 to
> reset the encoding, and this call flushes the query data which was
> previously decoded using utf-8. It also flushes our uploaded files.
> The kicker here is that uploaded files aren't even decoded using a
> text encoding, so when the query data is decoded again, this time
> using iso-8859-2, the uploaded files will be exactly the same as they
> were before.
>
> I'm sure there's some more stuff I'm forgetting. Anyway, here's how I
> think it should be:
>
> * utf-8 by default

Yes.

> * mime-types are just mime-types

Yes.

> * always hack the mime-type for text data to add the charset

Yes (shouldn't this automatically be done by the server?)

> * text is anything sent via Ns_ConnReturnCharData()

Yes.

> * binary is a Tcl bytearray object
> * static files are served as-is, text or binary

Yes, as-is, i.e. no conversion taking place.

> * multiple encodings are handled via calling ns_conn encoding
> * folks need to do this manually. no more file extension magic

Yes.

> I think a nice way for folks to handle multiple encodings is to
> register a filter, which you can of course use to simulate the file
> extension scheme in place now, the AOLserver 4.5 ns_register_encoding
> stuff, and more, because it's a filter. You can also do things like
> check query data or cookies for the charset to use.
>
> Questions that need answering:
>
> * can we junk charset aliases in nsd/encodings.c and use a dir of
>   symlinks?

No. I hate being dependent on the distribution directory.

> * can we junk ns/encodings in 2006?

I'm afraid it will not be that fast!

> * is checking for bytearray type a good way to handle binary Tcl
>   objects?

Hm... how would you (generally) check for a byte array? You can get a
byte array from an object, but you can't (without looking into the
object type, which is really not something that is portable) say which
type of object you are dealing with. I believe the main source of the
problem here is somebody slurping a whole file and wanting to return
it as-is, i.e. w/o any translations. In that case, he/she could use
[ns_conn encoding] to set the encoding of the connection before
calling ns_return. This way we can strip away all the extra options
from the content-returning commands and require the writer to use
ns_conn to set the correct encoding OR to skip encoding altogether
(for example: ns_conn encoding binary). Wouldn't that make sense?

> * does the above scheme handle all the requirements?

No idea! I'd say, as our requirements are pretty small, it definitely
does. But Bernd & co. have more to say, as they obviously need much
more than we do.

Cheers
Zoran
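The name conversion Zoran asks for above is easy to sketch in Tcl. A
minimal illustration, assuming a hand-maintained (and here deliberately
incomplete) table; the proc names charset_to_tcl and tcl_to_charset are
hypothetical, not existing NaviServer API:

    # Two-way lookup between HTTP charset names and Tcl encoding names.
    # The table is illustrative only; extend as needed.
    array set charset2tcl {
        iso-8859-1 iso8859-1
        iso-8859-2 iso8859-2
        utf-8      utf-8
        shift_jis  shiftjis
    }

    proc charset_to_tcl {charset} {
        global charset2tcl
        set key [string tolower $charset]
        if {[info exists charset2tcl($key)]} {
            return $charset2tcl($key)
        }
        return $key   ;# many names are identical in both worlds
    }

    proc tcl_to_charset {encoding} {
        global charset2tcl
        foreach {charset tclname} [array get charset2tcl] {
            if {$tclname eq $encoding} {
                return $charset
            }
        }
        return $encoding
    }

The reverse direction is a linear scan here, which is fine for a table
of a few dozen entries; a second, inverted array would do if it ever
mattered.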
From: Stephen D. <sd...@gm...> - 2006-08-15 19:09:17
On 8/14/06, Zoran Vasiljevic <zv...@ar...> wrote:
> On 14.08.2006, at 22:43, Stephen Deasey wrote:
>>
>> * Your clients don't understand utf-8, and you serve content in
>> multiple languages which don't share a common character set. Sucks
>> to be you.
>
> I think the whole purpose of that encoding mess is this above.
> With time this will be less and less important, so how much
> should we really care?
> From the technical perspective, it is nice to have a universal
> and general solution, but from the practical side: it costs
> time and money to keep it around...

I agree. I was wondering if we should junk the whole mess, but I think
we can minimise the impact without losing the ability to support
multiple encodings, and in fact improve the support.

>> I've been working on some patches to fix the various bugs so don't
>> worry about it too much. But I'd appreciate feedback on how you
>> actually use the encoding support.
>
> I use it this way: leave everything as-is. I never had to tweak any
> of the existing encoding knobs. And I never had anybody complaining.
> And we do serve Japanese, Chinese and European languages. All right,
> the client is always either IE or Mozilla or Safari (prerequisite),
> so mine is perhaps not a good example.

In the documents you serve, do you specify the encoding *within* the
document, at the top of the HTML file for example? Or are you serving
XML, in which case the default for that is utf-8 anyway (I think, off
the top of my head...).

Another possibility is that you happen to be using browsers which are
smart enough to reparse a document if it doesn't happen to be in the
encoding they first expected. I think the big guys do this -- not sure
your mobile phone will be so forgiving.

> Apropos chunked encoding: I still believe that the vectorized
> IO is OK and the way you transform UTF8 on the fly is also OK.
> So, if any content encoding has to take place, you can really
> only do it with chunked encoding OR by converting the whole
> content in memory prior to sending it, and giving the correct
> content length, OR by just omitting the content length altogether.
> I do not think there are other options.
>
> I'm curious what you will come up with ;-)

I'll handle the IO stuff in a separate post. Here's something I wrote
up re encodings and such. (This applies to case 3: supporting multiple
encodings.)

I agree with Zoran. ns_conn encoding should be the way to change the
encoding (input or output) at runtime.

The mime-type header sent back to the client does need to reflect the
encoding used, but ns_conn encoding should drive that, not the other
way around.

We can check the mime-type header for a charset declaration, and if
it's not there, add one for the current value of ns_conn encoding.

One problem to be resolved here is that Tcl encoding names do not
match up with HTTP charset names. HTTP talks about iso-8859-1, while
Tcl talks about iso8859-1. There are lookup routines to convert HTTP
charset names to Tcl encoding names, but not the other way around.
Tcl_GetEncodingName() returns the Tcl name for an encoding, not the
charset alias we used to get the encoding.

We could store the charset, as well as the encoding, for the conn. But
I was wondering: could we junk all the alias stuff and, in the
Naviserver install process, create a directory for encoding files and
fill it with symlinks to the real Tcl encoding files, using the
charset name?

You call ns_conn encoding with a charset. Naviserver converts the
charset name to a Tcl encoding name. The return value is the name of
the encoding, which is *not* the name of the charset you passed in!
I don't know if that's intended, but it's really confusing.

Another place this trips up: in the config for the tests Michael
added:

    ns_section "ns/mimetypes"
    ns_param .utf2utf_adp "text/plain; charset=utf-8"
    ns_param .iso2iso_adp "text/plain; charset=iso-8859-1"

    ns_section "ns/encodings"
    ns_param .utf2utf_adp "utf-8"
    ns_param .iso2iso_adp "iso-8859-1"

The ns/encodings are the encoding to use to read an ADP file from
disk, according to extension. It solves the problem: the web
designer's editor doesn't support utf-8. (I wonder if this is still
valid any more?)

But the code is actually expecting Tcl encoding names here, not a
charset, so this config is busted. It doesn't show up in the tests
because the only alternative encoding we're using is iso-8859-1, which
also happens to be the default.

This is probably just a bug. The code uses Ns_GetEncoding() when it
should use Ns_GetCharsetEncoding(). But that highlights another bug:
when would you ever want to call Ns_GetEncoding()? You always want to
take into account the charset aliases we carefully set up. This
probably shouldn't be a public function.

The strategy of driving the encoding from the mime-type has some other
problems. You have to create a whole bunch of fake mime-types /
extension mappings just to support multiple encodings (the
ns/mimetypes above).

What if there is no extension? Or you want to keep the .adp (or
whatever) extension, but serve content in different encodings from
different parts of the URL tree? Currently you have to put code in
each ADP to set the mime-type (which is always the same) explicitly,
to set the charset as a side effect.

AOLserver 4.5 has a ns_register_encoding command, which is perhaps an
improvement on plain file extensions.

Both AOLserver and our current code base have the bug where the
character set is only set for mime-types of type text/*. This makes a
certain amount of sense -- you don't want to be text-encoding a
dynamically generated gif, for example.

However, the correct mime-type for XHTML is application/xhtml+xml. So
in this case, an ADP which generates XHTML will not have the correct
encoding applied if you're relying on the mime-type.

Here's another oddity: Ns_ConnSetWriteEncodedFlag() / ns_conn
write_encoded. This is yet another way to sort-of set the encoding,
which is essentially a single property.

The only code which uses this is the ns_write command, and I think it
has it backwards. By default, if Ns_ConnGetWriteEncodedFlag() returns
false, ns_write assumes it's writing binary data. But nothing actually
sets this flag, so ns_write doesn't encode text at all.

We should remove the WRITE_ENCODED stuff.

How do we handle binary data from Tcl anyway? There's a -binary switch
to ns_return, and the write_encoded flag for ns_write. I was wondering
if we could just check the type of the Tcl object passed in to any of
the ns_return-like functions to see if it's of type "bytearray". A
byte array *could* get shimmered to a string, and then back again
without data loss, but that's probably unlikely in practice.

There's also the problem of input encodings. If you're supporting
multiple encodings, how do you know what encoding the query data is
in? A couple of solutions suggested in Rob Mayoff's guide are to put
this in a hidden form field, or to put it in a cookie.

Here's an interesting bug: you need to get the character set a form
was encoded in, so you call ns_queryget ... This first invokes the
legacy ns_getform call which, among other things, pulls any file
upload data out of the content stream and puts it into temp files.

Now, you have to assume *some* encoding to get at the query data in
the first place. So let's guess and say utf-8. Uh oh, our hidden form
field says iso-8859-2. OK, so we call ns_conn encoding iso-8859-2 to
reset the encoding, and this call flushes the query data which was
previously decoded using utf-8. It also flushes our uploaded files.
The kicker here is that uploaded files aren't even decoded using a
text encoding, so when the query data is decoded again, this time
using iso-8859-2, the uploaded files will be exactly the same as they
were before.

I'm sure there's some more stuff I'm forgetting. Anyway, here's how I
think it should be:

* utf-8 by default
* mime-types are just mime-types
* always hack the mime-type for text data to add the charset
* text is anything sent via Ns_ConnReturnCharData()
* binary is a Tcl bytearray object
* static files are served as-is, text or binary
* multiple encodings are handled via calling ns_conn encoding
* folks need to do this manually. no more file extension magic

I think a nice way for folks to handle multiple encodings is to
register a filter, which you can of course use to simulate the file
extension scheme in place now, the AOLserver 4.5 ns_register_encoding
stuff, and more, because it's a filter. You can also do things like
check query data or cookies for the charset to use.

Questions that need answering:

* can we junk charset aliases in nsd/encodings.c and use a dir of
  symlinks?
* can we junk ns/encodings in 2006?
* is checking for bytearray type a good way to handle binary Tcl
  objects?
* does the above scheme handle all the requirements?

Bugs to fix:

* query data flushing is too extreme. don't flush files
* junk Ns_Conn*WriteEncodedFlag() / ns_conn write_encoded

There's also the content-length bug, but I think that's a separate
problem. I'm going to look more into that next as I wrote it, so if
anyone else wants to tackle any of the above because they need it
soon, go ahead. If not, I'll do that next.
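The filter idea in Stephen's post might look like the following. A
sketch only: the URL patterns are invented, and it assumes (per the
discussion above) that ns_conn encoding accepts an HTTP charset name:

    # Choose the output charset per URL subtree instead of relying on
    # file-extension magic. Runs as a postauth filter, before the page
    # code; the registered extra argument arrives before "why".
    proc set_encoding {charset why} {
        ns_conn encoding $charset
        return filter_ok
    }

    ns_register_filter postauth GET /de/* set_encoding iso-8859-1
    ns_register_filter postauth GET /pl/* set_encoding iso-8859-2

Because it is an ordinary filter, the same hook could instead inspect a
cookie or a hidden form field to pick the charset, which a plain
file-extension mapping cannot do.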
From: Zoran V. <zv...@ar...> - 2006-08-14 21:16:50
On 14.08.2006, at 22:43, Stephen Deasey wrote:

> * Your clients don't understand utf-8, and you serve content in
> multiple languages which don't share a common character set. Sucks to
> be you.

I think the whole purpose of that encoding mess is this above. With
time this will be less and less important, so how much should we
really care? From the technical perspective, it is nice to have a
universal and general solution, but from the practical side: it costs
time and money to keep it around...

> I've been working on some patches to fix the various bugs so don't
> worry about it too much. But I'd appreciate feedback on how you
> actually use the encoding support.

I use it this way: leave everything as-is. I never had to tweak any of
the existing encoding knobs. And I never had anybody complaining. And
we do serve Japanese, Chinese and European languages. All right, the
client is always either IE or Mozilla or Safari (prerequisite), so
mine is perhaps not a good example.

Apropos chunked encoding: I still believe that the vectorized IO is OK
and the way you transform UTF8 on the fly is also OK. So, if any
content encoding has to take place, you can really only do it with
chunked encoding OR by converting the whole content in memory prior to
sending it, and giving the correct content length, OR by just omitting
the content length altogether. I do not think there are other options.

I'm curious what you will come up with ;-)

Cheers
Zoran
From: Stephen D. <sd...@gm...> - 2006-08-14 20:43:10
On 8/14/06, Zoran Vasiljevic <zv...@ar...> wrote:
> On 14.08.2006, at 15:58, Gustaf Neumann wrote:
>
> > no rocket science (but work)
>
> This is precisely what I was afraid of: work.
> Not that I don't like it, I just do not have time for that now.
>
> Moreover, ns_http is a C-level implementation, which means some more
> work and much more instability, as when you screw something up at
> that level, everything goes down the drain.
>
> At first glance, the nstest_http from the test suite needs to be
> rewritten, but as this is entirely written in Tcl and affects only
> the test suite, this is less of a problem.
>
> More problematic is ns_http, as this is a public call and is in the
> core server. This might affect everybody, so stability is important.
> OTOH, this need not be done to get the test suite to pass.
>
> As I know Bernd is very fluent in Tcl and as he's now on holiday
> (meaning he has a chance to relax and recover), I vote for him to
> fix the test suite and rewrite nstest_http :-)

Hello all..

Thanks Michael and Bernd for writing the new encoding tests! I
actually picked them up on Friday and have been looking through the
code this weekend. So with it all fresh in my mind, it was painful to
catch up with the mailing list and see y'all struggle through it!

I learnt a new phrase: Yak Shaving. You know when you start something,
and then you realise it depends on something else so you need to do
that first, and then you discover some other thing that *that* depends
on so you start to work on that, and then another thing, and so on...?
I've been Yak Shaving.

The content-length bug is my fault. I busted it with the vectored IO
changes, and it didn't come to light because we weren't testing
character set conversion. Vlad suggested the chunked-encoding patch,
and Bernd applied it (forgot the ChangeLog, oops), but the tests
didn't start failing until Michael added the encoding tests, which
changed the default output character set to iso-8859-1.

I'd like to roll back that change. IIRC, the vectored IO changes are
new since the last release, so the existing tarball should be OK. It's
also a one-line patch, so it's easy to apply if for some reason you're
dependent on HEAD. It does mess up a whole bunch of other tests, and
it's not the right way to fix this problem.

I don't think nstest_http needs to be chunked-encoding aware. There's
no way a 5 byte response needs to be chunked! Besides, it's just a
simple testing harness that needs to be somewhat non-conformant. I
wrote it originally, rather than use an existing client, because I
needed to inject faults and see the server's response.

Anyway. Encoding -- it's a bitch. As far as I can tell there are 3
situations you can find yourself in:

* All your clients understand utf-8. You can serve them in whatever
  language you want using the one encoding. This should (be made to)
  work by default.

* Your clients do not all understand utf-8, but you only support one
  other language. In this case, you can change the server default
  output encoding, and all should again work by default (ignoring the
  current bugs).

* Your clients don't understand utf-8, and you serve content in
  multiple languages which don't share a common character set. Sucks
  to be you.

Rob Mayoff wrote the original encoding patches and wrote this helpful
document to describe the situation:

    http://dqd.com/~mayoff/encoding-doc.html

This was turn of the century, and I'm wondering how much of this still
applies today. How many clients do not support utf-8?

I've been working on some patches to fix the various bugs so don't
worry about it too much. But I'd appreciate feedback on how you
actually use the encoding support.

I'd explain more but I'm getting kicked off the wi-fi... Tomorrow.
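For the middle case (clients stuck on one non-utf-8 character set),
the default lives in the config. A sketch of the OutputCharset
parameter Michael added for the tests; which section it belongs in may
differ in your tree, so treat the placement as an assumption:

    ns_section "ns/parameters"
    ns_param OutputCharset iso-8859-1  ;# default charset for text responses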
From: Zoran V. <zv...@ar...> - 2006-08-14 17:22:02
On 14.08.2006, at 19:11, Michael Lex wrote:

> Encoding set via [ns_conn] is only respected by [ns_return] if no
> OutputCharset is defined and no charset is defined with the
> type-argument of [ns_return].

Strange! I see that there are too many knobs to tweak! We must reduce
that, absolutely. I mean, this whole thing is complicated enough per
se w/o us allowing people to turn just about every place upside down,
leading to absolute confusion (the state that I'm in now).

I will have to sit for a while and check this in detail, as I do not
think that it must be that complicated and versatile. The only thing
you need is to set the default encodings (per config file) and
*eventually* be able to override that at runtime, preferably *only*
using [ns_conn]. This would make sense to me. Somebody has (yet) to
persuade me that this is not enough!

BTW, thank you very much for poking into this can of worms... I
believe we will have to remove all those worms or throw away the can
and get us a new one!

Cheers
Zoran
From: Michael L. <mic...@gm...> - 2006-08-14 17:11:25
> Is this really so? I mean, I would expect [ns_return] to ignore any
> optional encoding stuff and delegate all of it to [ns_conn], in a
> similar way to how you use channels in Tcl.

Encoding set via [ns_conn] is only respected by [ns_return] if no
OutputCharset is defined and no charset is defined with the
type-argument of [ns_return].

> After all, you do not [puts] with an encoding. You just [puts], and
> [fconfigure $channel -encoding] sets the channel to the desired
> encoding.
> I mean, if I wrote that, I'd do it so.

I'd do it like this, too. And in fact this is why I was so confused
about [ns_conn encoding].

> You will have to give me some hints on how you use all
> those commands.

We don't use them at the moment. We set the encoding(s) in the config
and with [ns_return]. Bernd asked me to write encoding tests, and
that's why I stumbled over the [ns_conn] command.
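Michael's precedence rule, restated as code for clarity. Hedged: this
is exactly the behavior under dispute in the thread, not documented
API:

    # Per Michael's reading, the runtime setting (1) is honored only
    # when neither (2) a charset in the type argument nor (3) a config
    # OutputCharset is present.
    ns_conn encoding iso-8859-2                           ;# (1) runtime
    ns_return 200 "text/html; charset=iso-8859-1" $html  ;# (2) wins over (1)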
From: Zoran V. <zv...@ar...> - 2006-08-14 16:44:20
On 14.08.2006, at 18:30, Michael Lex wrote:

> ns_conn encoding and that
> ns_return is completely independent of anything you set with ns_conn
> encoding (if there's a default OutputCharset)

Hmmm??? Is this really so? I mean, I would expect [ns_return] to
ignore any optional encoding stuff and delegate all of it to
[ns_conn], in a similar way to how you use channels in Tcl.

After all, you do not [puts] with an encoding. You just [puts], and
[fconfigure $channel -encoding] sets the channel to the desired
encoding. I mean, if I wrote that, I'd do it so.

You will have to give me some hints on how you use all those commands.
What we do is just [ns_return], and we never fiddle with that -binary
switch (hey, I didn't even know it existed). Also, we never use
[ns_conn] to manipulate any encoding, so everything is "default" and
just works. So we never had to mingle with encodings (thankfully).

Cheers
Zoran
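Zoran's channel analogy, side by side. The second half shows the
behavior he would *expect* from ns_conn/ns_return, not necessarily
what the current code does:

    # Plain Tcl: the channel carries the encoding; [puts] doesn't care.
    set chan [open /tmp/result.txt w]
    fconfigure $chan -encoding iso8859-1
    puts $chan $html
    close $chan

    # The expected NaviServer equivalent: the conn carries the
    # encoding, and ns_return just writes through it.
    ns_conn encoding iso-8859-1
    ns_return 200 text/html $html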
From: Gustaf N. <ne...@wu...> - 2006-08-14 16:44:14
Zoran Vasiljevic schrieb:
> On 08.08.2006, at 11:48, Michael Lex wrote:
>
>> When no OutputCharset is defined, naviserver will send the content
>> without transformation, that means utf-8. But according to RFC 2616
>> (HTTP 1.1) all content without an explicit "charset: ..." in the
>> Content-Type header should be treated as iso-8859-1 by the clients.
>> This causes problems when you have an incomplete configuration (w/o
>> OutputCharset).
>
> Heh... should we simply add the UTF8 charset declaration
> in absence of the output encoding? I believe this would
> be the simplest "fix"?

At least for the tests, when no OutputCharset other than UTF-8 is
specified. ...and none of its derivations, such as the decomposed
(diacritical) encoding on Macs. I would think that the added charset
declaration on a Mac must be different. Recoding utf-8 can affect the
number of bytes as well.

-gustaf

PS: yes, the added signature is strange.
From: Michael L. <mic...@gm...> - 2006-08-14 16:30:11
What I would do is remove ns_conn encoding or make it read-only. It is
not really necessary, as it can be replaced by ns_startcontent or
ns_adp_mimetype, and it is (largely) ignored by ns_return. And it
confuses programmers (like me). The problem is: backwards
compatibility.

Sorry ... I forgot a possible use: when you want to change the
encoding while sending streamed content (an adp-file or ns_write), you
have to use ns_conn encoding.

So if it is not possible (or sensible) to remove the function, I would
be grateful if there were a hint in the (future) documentation that
one should be VERY careful with the use of ns_conn encoding, and that
ns_return is completely independent of anything you set with ns_conn
encoding (if there's a default OutputCharset).
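The streaming case Michael concedes might look like this. A sketch;
whether switching encodings mid-stream behaves sanely is part of what
is being debated here:

    # Streamed output: there is no single ns_return call whose type
    # argument could carry a charset, so the connection itself is the
    # only place to hang the encoding.
    ns_conn encoding iso-8859-1
    ns_write $firstPart
    ns_conn encoding iso-8859-2
    ns_write $secondPart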
From: Mike <nee...@gm...> - 2006-08-14 16:06:41
OT: anyone else find the signature strange? ;)

On 8/14/06, Michael Lex <mic...@gm...> wrote:
> Perhaps it is even better to include a default OutputCharset (utf-8)
> when reading the configuration. I think the right place would be
> NsUpdateEncodings, but I can be wrong ;-)
>
> Michael
From: Michael L. <mic...@gm...> - 2006-08-14 16:01:22
Perhaps it is even better to include a default OutputCharset (utf-8)
when reading the configuration. I think the right place would be
NsUpdateEncodings, but I can be wrong ;-)

Michael
From: Zoran V. <zv...@ar...> - 2006-08-14 14:48:44
On 08.08.2006, at 11:40, Michael Lex wrote:

> For me it looks like the use cases for ns_conn encoding are rare,
> e.g. if you use it in a context where the client knows exactly what
> encoding to expect. The documentation should mention the above
> alternatives and warn not to use ns_conn encoding if you "don't know
> all side effects."

Let's put it this way: if you were to write all that from scratch,
what would you do? Or, if you were allowed to revamp the existing
interface(s), what would you remove/add?

To be honest, we always serve utf-8 and never had any need to change
the encodings, hence I (up to today) largely avoided looking at that
code... But I see it deserves some cleanup.

Cheers
Zoran
From: Zoran V. <zv...@ar...> - 2006-08-14 14:41:18
On 08.08.2006, at 11:48, Michael Lex wrote:

> When no OutputCharset is defined, naviserver will send the content
> without transformation, that means utf-8. But according to RFC 2616
> (HTTP 1.1) all content without an explicit "charset: ..." in the
> Content-Type header should be treated as iso-8859-1 by the clients.
> This causes problems when you have an incomplete configuration (w/o
> OutputCharset).

Heh... should we simply add the UTF8 charset declaration in absence of
the output encoding? I believe this would be the simplest "fix"?

Cheers
Zoran
From: Zoran V. <zv...@ar...> - 2006-08-14 14:13:34
On 14.08.2006, at 15:58, Gustaf Neumann wrote:

> no rocket science (but work)

This is precisely what I was afraid of: work. Not that I don't like
it, I just do not have time for that now.

Moreover, ns_http is a C-level implementation, which means some more
work and much more instability, as when you screw something up at that
level, everything goes down the drain.

At first glance, the nstest_http from the test suite needs to be
rewritten, but as this is entirely written in Tcl and affects only the
test suite, this is less of a problem.

More problematic is ns_http, as this is a public call and is in the
core server. This might affect everybody, so stability is important.
OTOH, this need not be done to get the test suite to pass.

As I know Bernd is very fluent in Tcl and as he's now on holiday
(meaning he has a chance to relax and recover), I vote for him to fix
the test suite and rewrite nstest_http :-)

Cheers,
zoran
From: Gustaf N. <ne...@wu...> - 2006-08-14 13:57:33
Zoran Vasiljevic schrieb:
> The only trouble is the ton of error messages in the
> test suite, as we alone cannot handle chunked encoding
> as consumers! I have yet to see if [ns_http] can do that
> (I don't think so).

Well, implementing chunked encoding in Tcl is no rocket science (but
work). However, this might be useful for other purposes as well. The
XOTcl HTTP client library has chunked encoding implemented (but has a
different interface than ns_http). One could use this, or snarf the
code. I am sure there are other implementations as well.

-gustaf
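For the test suite, the consumer side really is small. A minimal
chunked-body decoder in Tcl of the kind nstest_http would need -- a
sketch that ignores trailers and chunk extensions; the proc name is
made up:

    # Decode an HTTP/1.1 chunked message body already slurped into
    # $data (read as binary). Each chunk is: hex-size CRLF payload
    # CRLF; a zero-size chunk terminates the body.
    proc decode_chunked {data} {
        set body ""
        while {1} {
            set eol [string first "\r\n" $data]
            if {$eol < 0} { error "truncated chunk header" }
            scan [string range $data 0 [expr {$eol - 1}]] %x size
            if {$size == 0} { break }
            set start [expr {$eol + 2}]
            append body [string range $data $start [expr {$start + $size - 1}]]
            # drop the chunk header, the payload, and its trailing CRLF
            set data [string range $data [expr {$start + $size + 2}] end]
        }
        return $body
    }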
From: Zoran V. <zv...@ar...> - 2006-08-14 12:47:26
On 14.08.2006, at 14:24, Gustaf Neumann wrote:

> Without looking into the code, i would assume that the simplest case
> would be to distinguish between cases, where the content-length is
> unknown and known.
>
> unknown means: transformation or dynamic content.
>
> for unknown cases i would omit in HTTP/1.0 the content-length and use
> in HTTP/1.1 the chunked encoding

This also makes sense, as it is most trivial to implement (less work,
more fun), and we basically already have it that way (more or less).
The only trouble is the ton of error messages in the test suite, as we
alone cannot handle chunked encoding as consumers! I have yet to see
if [ns_http] can do that (I don't think so).

Cheers,
Zoran
From: Gustaf N. <ne...@wu...> - 2006-08-14 12:24:40
Zoran Vasiljevic schrieb:
> On the fly I mean that the message is not encoded in its *entirety*
> beforehand, rather it is converted piece-by-piece (hence on-the-fly)
> in Ns_ConnWriteVChars().

There are not many options when the output encoding changes the
length.

> So, what do we have now?
>
> A. For HTTP 1.0 clients only, we could/should/must either:
>
>    a. omit content-length and turn keepalive off leaving
>       the browser to drain the connection until EOF.

HTTP/1.0 does not say anything about keepalive
(http://www.ietf.org/rfc/rfc1945.txt), but sending a
"Connection: close" does not hurt, since some nonstandard clients use
it.

>    b. calculate the content-length in advance by
>       performing the conversion of the message
>       in its entirety in the memory using the given
>       output encoding

The calculation is only needed in cases where the output is not raw
(when delivering images, for example, I would not call the content
UTF8-encoded (see below), rather "raw").

> B. For HTTP 1.1 clients we can turn on chunked encoding
>    if the output encoding is specified, and is not UTF8
>    (basically, this is what Bernd's workaround does).

... or there is for some other reason no translation going on (see
above).

Without looking into the code, I would assume that the simplest
approach would be to distinguish between the cases where the
content-length is unknown and where it is known.

Unknown means: transformation or dynamic content.

For the unknown cases I would omit the content-length in HTTP/1.0 and
use chunked encoding in HTTP/1.1.

The known case is a no-brainer.

-gustaf
From: Zoran V. <zv...@ar...> - 2006-08-14 11:52:00
On 14.08.2006, at 12:53, Michael Lex wrote:

> I think you get Bernd wrong: The problem was that Bernd wanted
> naviserver to return the content in iso-8859-1 encoding. So the
> number of bytes and the number of characters should be equal.
> The Content-Length has to be the number of bytes returned, but
> naviserver computed the value with string bytelength of an utf-8
> string, which was, in Bernd's case, greater than the bytelength of
> the iso-8859-1 string.

I believe the best way is to peek at the standard (RFC 2616):

    14.13 Content-Length

    The Content-Length entity-header field indicates the size of the
    entity-body, in decimal number of OCTETs, sent to the recipient
    or, in the case of the HEAD method, the size of the entity-body
    that would have been sent had the request been a GET.

        Content-Length = "Content-Length" ":" 1*DIGIT

    An example is

        Content-Length: 3495

    Applications SHOULD use this field to indicate the transfer-length
    of the message-body, unless this is prohibited by the rules in
    section 4.4.

This all means that content-length gives the total number of *bytes*
in the response, regardless of any encoding applied. This also means
that in the case of the UTF8-encoded string "mü" it will be 3 and not
2. If the "mü" is sent as ISO-8859-1, then the content length would be
2.

All right. I think I get it now. If this is so, then this means that
we cannot possibly give the correct content-length UNLESS we apply the
encoding BEFORE sending any headers and body, as we would have to
either give the correct value in the content-length header OR omit the
content-length and turn off keepalive for that response.

> So it seems that chunked encoding is the best possible solution. But
> as Gustaf said, chunked transfer-encoding is only part of HTTP/1.1,
> and some clients don't understand it.

Yes, chunked encoding seems feasible there. For clients not supporting
chunked responses, we could convert the entire message beforehand,
burning some memory and cycles. As there are quite a few of them out
there, this may not be of much importance anyway. OK, this makes
sense.

> Btw: Aolserver doesn't encode "on-the-fly", but in memory. So they
> know the content-length before the content is sent to the recipient.

On the fly I mean that the message is not encoded in its *entirety*
beforehand, rather it is converted piece-by-piece (hence on-the-fly)
in Ns_ConnWriteVChars().

So, what do we have now?

A. For HTTP 1.0 clients only, we could/should/must either:

   a. omit content-length and turn keepalive off, leaving
      the browser to drain the connection until EOF.

   b. calculate the content-length in advance by
      performing the conversion of the message
      in its entirety in memory, using the given
      output encoding

B. For HTTP 1.1 clients we can turn on chunked encoding
   if the output encoding is specified, and is not UTF8
   (basically, this is what Bernd's workaround does).

Is this right? Are there any other options we may have?

Zoran
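Zoran's "mü" example, worked through in Tcl -- the wrong counter and
the right one. (A sketch; string bytelength reports the length of
Tcl's internal utf-8 representation, which is only coincidentally the
wire length.)

    set s "mü"
    string length $s                                 ;# 2  characters
    string bytelength $s                             ;# 3  bytes of internal utf-8
    string length [encoding convertto utf-8 $s]      ;# 3  correct for charset=utf-8
    string length [encoding convertto iso8859-1 $s]  ;# 2  correct for charset=iso-8859-1

In other words: Content-Length must be computed from the byte string
that [encoding convertto] (or its C equivalent) actually produces for
the negotiated charset.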
From: Michael L. <mic...@gm...> - 2006-08-14 10:53:26
I think you get Bernd wrong: the problem was that Bernd wanted
naviserver to return the content in iso-8859-1 encoding. So the number
of bytes and the number of characters should be equal.

The Content-Length has to be the number of bytes returned, but
naviserver computed the value with string bytelength of an utf-8
string, which was, in Bernd's case, greater than the bytelength of the
iso-8859-1 string.

So it seems that chunked encoding is the best possible solution. But
as Gustaf said, chunked transfer-encoding is only part of HTTP/1.1,
and some clients don't understand it.

Btw: Aolserver doesn't encode "on-the-fly", but in memory. So they
know the content-length before the content is sent to the recipient.

Michael
From: Zoran V. <zv...@ar...> - 2006-08-14 10:19:10
On 14.08.2006, at 12:00, Zoran Vasiljevic wrote:

> I must say: the problem is that the *correct* content-length
> header is/would-be difficult (or costly in terms of memory
> and time) to compute for dynamic content.

Hm... I was too fast, as usual. If I read Bernd's email *carefully*,
he says:

    <quote>
    In my test case, 'string length' on the parsed adp string gives me
    7109 bytes, 'string bytelength' 7147 bytes, in the header
    'Content-Length' is 7147 and wget stops after byte 7109 (e.g.
    Opera requests the page twice, haha, I lost one day to figure out
    why):

    string length: 7109 bytes = bytes returned
    string bytelength: 7147 = Content-Length header
    </quote>

This would mean that the Content-Length is actually defined as the
number of characters in the given content encoding and NOT the number
of bytes! This is of course something completely different....

Putting the streaming mode aside, the caller will usually know how
many characters he's returning. He may not know how many bytes this
will yield in the given encoding, all right, but this is not important
then, as we will do this correctly during the send.

So the "bug" you are referring to is us setting the content-length on
the basis of [string bytelength] instead of on the basis of the
[string length] equivalent? If this is so, then the bug should of
course be fixed and the "workaround" should be removed.

That would leave us to turn on chunked encoding ONLY when we serve
streaming content. Do I see this right?

Cheers
Zoran
From: Zoran V. <zv...@ar...> - 2006-08-14 10:00:26
On 13.08.2006, at 20:03, Michael Lex wrote:

> This chunked-transfer-encoding thing is only a workaround until the
> Content-Length bug is fixed.

I've been reading various posts from you, Bernd, Vlad and Stephen
about that... I must say: the problem is that the *correct*
content-length header is/would-be difficult (or costly in terms of
memory and time) to compute for dynamic content.

The way the things are handled now, the caller passes the equivalent
of [string length] bytes and an encoding to the underlying routine(s).
They encode the content *on-the-fly*, so the caller cannot possibly
know in advance what to put in the Content-Length header, as the
*real* output length may/will change depending on the selected
encoding.

After all, chunked encoding IS defined exactly for this reason: to
allow the recipient to verify the number of bytes received when it is
impossible or not feasible for the generator to specify the exact
number of bytes sent.

If you look at it from that perspective, the "workaround" Bernd has
made is actually pretty much OK. I wonder how you can set a correct
Content-Length if you use ADP streaming...

The above case (using an explicit output encoding) is (admittedly) not
exactly the same as streaming, as you COULD encode the string before
sending and then set the correct content-length, but this would mean
memory bloat, as you would have to transform the entire content
beforehand... Therefore, if you ask me, the "workaround" should be
left there.

Another thing... I guess if we do not return the Content-Length at
all, and we do not use chunked transfer encoding, the browser will
slurp everything until EOF, right? But this can only be done for
non-keepalive connections... Hm... Wrong turn....

If you ask me, I think we must make sure our test routines also know
how to decode chunked encodings, and we should leave the "workaround"
in place. If we all agree that this is the way to go, we can check
other places in the code and make sure they do the same: if an output
encoding is set, turn on chunked transfer encoding.

What do others think?

Cheers
Zoran
From: Gustaf N. <ne...@wu...> - 2006-08-13 19:44:52
Michael Lex schrieb:
> It's only going to hit you if you use some other encoding than utf-8,
> and then only if you use an HTTP client that does not understand
> chunked encoding. But every web browser and virtually every other
> client software should understand this (it's a MUST part of the RFC).

... to be more precise, of HTTP/1.1, not HTTP/1.0.

Do you have any statistics on what your clients use? We still see a
lot of HTTP/1.0 requests (around 5%). We would not use this currently
on our production server.

-gustaf
From: Michael L. <mic...@gm...> - 2006-08-13 18:03:48
> I do not want to remove anything if this breaks anybody's code.

This chunked-transfer-encoding thing is only a workaround until the
Content-Length bug is fixed.

> I just want to understand what is going on and how this is
> going to hit me (or not) if I start to use this code?
> Do I have to know something special? We do not set any special
> encodings: we always serve utf8. Should I care?

It's only going to hit you if you use some other encoding than utf-8,
and then only if you use an HTTP client that does not understand
chunked encoding. But every web browser and virtually every other
client software should understand this (it's a MUST part of the RFC).

Michael
From: Zoran V. <zv...@ar...> - 2006-08-13 17:41:52
On 13.08.2006, at 19:38, Michael Lex wrote:

>> As I see from the CVS, there were no changes to the C-code there?
>> Did you have to change the C-code??
>> I guess, Bernd will have to explain what he meant to do with that...
>
> Bernd is on holiday right now. But I think I can explain the problem.
> Bernd submitted a patch (or better, workaround) to prevent Naviserver
> from sending wrong content-length headers. He simply made naviserver
> send all data that had to be converted to a different encoding in
> chunked transfer-encoding. Unfortunately nstest_http doesn't
> understand chunked content. This patch was committed on July 13th.
> When I worked on the encoding tests, I had to add an OutputCharset
> configuration parameter. So now Naviserver has to convert the content
> and sends it in chunked mode, which nstest_http doesn't understand.
> And the tests fail. If you remove this workaround, some of my new
> encoding tests still fail because of a wrong content-length. But this
> is really a bug.

Michael,

I do not want to remove anything if this breaks anybody's code. I just
want to understand what is going on and how this is going to hit me
(or not) if I start to use this code? Do I have to know something
special? We do not set any special encodings: we always serve utf8.
Should I care?

Cheers,
Zoran
From: Michael L. <mic...@gm...> - 2006-08-13 17:38:28
> As I see from the CVS, there were no changes to the C-code there?
> Did you have to change the C-code??
> I guess, Bernd will have to explain what he meant to do with that...

Bernd is on holiday right now. But I think I can explain the problem.
Bernd submitted a patch (or better, workaround) to prevent Naviserver
from sending wrong content-length headers. He simply made naviserver
send all data that had to be converted to a different encoding in
chunked transfer-encoding. Unfortunately nstest_http doesn't
understand chunked content. This patch was committed on July 13th.

When I worked on the encoding tests, I had to add an OutputCharset
configuration parameter. So now Naviserver has to convert the content
and sends it in chunked mode, which nstest_http doesn't understand.
And the tests fail. If you remove this workaround, some of my new
encoding tests still fail because of a wrong content-length. But this
is really a bug.

Michael