From: Coleman, M. <MK...@St...> - 2008-03-10 16:18:42
Matthew Chambers:
> >>> "run1,2",3,4
> >> %22run1%2C2%22,3,4
> >> """foo""" (the implicit rule being
> >> that a pair of "
>
> A point of confusion here is that the quoting within
> "run1,2" is operating at a different level than the XML
> quoting. Perhaps the simplest regime would be to simply
> specify '\"' for '"' and '\\' for '\'. Ugly, but everyone
> would be familiar with it (from C, sh, etc), and I think it
> won't collide with XML escaping.

> " and ' cannot appear in an XML file except as quotes for
> attributes. So it would have to be "\&quot;". "\\" would still
> work to escape backslash, but I think sticking with the XML way
> of escaping things is more consistent. I doubt everyone will be
> familiar with the C-style escape convention, even if it's shared
> by some *nix shells. I think Windows does use "" to escape
> quotes, I'm not sure though.

I was referring, with my examples, to the inner level of quoting. There's so much going on here that it's difficult to even talk about, I think. Working from the inside out:

1. I can have a particular "run name" within the comma-separated list. For example:

    run"with"bizarre"name

2. Within the list, this might look like

    run"with"biz,arre"name,another"biz,arre"name&quot;&lt;

except that that won't work as is, because we can't tell the commas in the names from the commas separating the names. We need to escape the commas within names somehow, together with escaping for the escaping, so that we will still be able to form names that ultimately contain any character sequence.

It seems like there are two basic approaches here: (a) use an XML-ish escape mechanism, or (b) use something completely different. For (b) I'll use the C-ish backslash idea. I'm in favor of (b) because (a) will make everyone's head explode. (Note that (a) is *not* straight XML escaping--it can't be. Rather, it'll have to be a matter of running the string in question through XML (or XML-ish) escape interpretation a second time.)
Assuming (b), we might have

    "run\"with\"biz,arre\"name","another\"biz,arre\"name&quot;&lt;"

or

    'run"with"biz,arre"name','another"biz,arre"name&quot;&lt;'

if we decide that this comma-separated format can also use single quotes instead of double quotes (as XML and Python do). Note carefully, however, that none of this is yet XML! We're still "inside".

3. Moving outward with the first of those two, now we will XML-escape it, so that it is a valid XML attribute:

    <yyy zzz="run\&quot;with\&quot;biz,arre\&quot;name&quot;,&quot;another\&quot;biz,arre\&quot;name&quot;&lt;">

That's not enough, though, because we "captured" that string '&quot;&lt;' that looks like XML, but is actually part of the name. We have to be sure to escape the ampersands, too:

    <yyy zzz="run\&quot;with\&quot;biz,arre\&quot;name&quot;,&quot;another\&quot;biz,arre\&quot;name&amp;quot;&amp;lt;">

This is indeed pretty awful, but it's difficult to see what would be better. If you want to try (a) above (which I think is what you mean when you say "stick with the XML way of quoting"), I'd be curious to see that worked out in the same way. If it's going into the standard, I definitely think that an example like this should be worked through to make sure that everyone understands how things are supposed to work.

What leaps out at me is how ugly and complex this is. (It also reminds me of why I don't like XML.) Hopefully most mzML producers will not generate stuff like this, but every consumer will need to correctly interpret it. The chances that everyone will be able to implement this stuff correctly seem very low.

I think that an XML person would look at this and say that all of this is a sign that the whole inner structure of this comma-separated list is too complex and needs to be broken out with something like

    <yyy>
      <zzzlist>
        <zzz> run"with"biz,arre"name </zzz>
        <zzz> another"biz,arre"name&amp;quot;&amp;lt; </zzz>
      </zzzlist>
      ...

I don't necessarily agree with that, but I do think that we're kind of torturing XML by trying to squeeze all of that information into one attribute value.
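
To make the two levels concrete, here is a minimal Python sketch of how approach (b) could be worked end to end. It is only an illustration under assumptions: the helper names `escape_name`, `encode_list` and `decode_list` are hypothetical (nothing here comes from mzML or any proposed schema), every name is assumed to be wrapped in double quotes, and the second run name is taken to end in the literal text `&quot;&lt;`. The inner level uses backslash escaping; the outer level is ordinary XML attribute escaping, done with the standard library.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def escape_name(name):
    # Inner level: escape backslashes first, then double quotes.
    return name.replace("\\", "\\\\").replace('"', '\\"')


def encode_list(names):
    # Wrap each escaped name in double quotes and join with commas.
    return ",".join('"' + escape_name(n) + '"' for n in names)


def decode_list(text):
    # Reverse encode_list with a small state machine. This sketch is
    # lenient: it does not insist on exactly one comma between names.
    names, current, in_quotes, i = [], [], False, 0
    while i < len(text):
        ch = text[i]
        if in_quotes:
            if ch == "\\":
                if i + 1 >= len(text):
                    raise ValueError("dangling backslash")
                current.append(text[i + 1])  # take next char literally
                i += 1
            elif ch == '"':
                in_quotes = False
                names.append("".join(current))
                current = []
            else:
                current.append(ch)  # commas inside quotes are data
        elif ch == '"':
            in_quotes = True
        elif ch != ",":
            raise ValueError("unexpected character outside quotes: %r" % ch)
        i += 1
    if in_quotes:
        raise ValueError("unterminated quoted name")
    return names


names = ['run"with"biz,arre"name', 'another"biz,arre"name&quot;&lt;']

# Inner level: still not XML.
inner = encode_list(names)

# Outer level: escape &, <, > and then " so the value fits in a
# double-quoted attribute. The yyy/zzz names follow the example above.
xml_text = '<yyy zzz="%s"/>' % escape(inner, {'"': "&quot;"})

# A consumer undoes the layers in the opposite order: the XML parser
# removes the outer escaping, then decode_list removes the inner one.
recovered = decode_list(ET.fromstring(xml_text).get("zzz"))
print(xml_text)
print(recovered == names)
```

The point of the sketch is the ordering: the producer applies the inner escaping and then the XML escaping, and the consumer must undo them in reverse. Skipping the ampersand escaping on the way out is exactly the "captured" string problem described above.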
Another alternative that I think should be seriously considered is to just give up and restrict run names to a small set of characters like

    letters (upper and lower)
    digits
    these four characters: .-_:

or perhaps a subset of these--that is, roughly the characters used in identifiers in typical programming languages. Would this be a terrible hardship?

Mike
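
For comparison, a minimal Python sketch of what a consumer might need under the restricted-character alternative. The names `RUN_NAME`, `is_valid_run_name` and `parse_run_list` are hypothetical, and "letters" is assumed to mean ASCII letters only; neither point is settled by the message above.

```python
import re

# ASCII letters, digits, and the four characters . - _ :
# (the hyphen is placed last so it is taken literally).
RUN_NAME = re.compile(r"[A-Za-z0-9._:-]+")


def is_valid_run_name(name):
    # fullmatch requires the whole string to consist of allowed characters.
    return RUN_NAME.fullmatch(name) is not None


def parse_run_list(text):
    # With commas and quotes excluded from names, a plain split is enough.
    names = text.split(",")
    for name in names:
        if not is_valid_run_name(name):
            raise ValueError("invalid run name: %r" % name)
    return names


print(parse_run_list("run1,run2:a,sample_3.raw"))
```

None of the allowed characters is special in XML or in the list syntax, so no escaping is needed at either level, which is the attraction of this option.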