> -----Original Message-----
> From: Osamu TAKEUCHI [mailto:osamu@...]
> Sent: Thursday, September 24, 2009 7:29 AM
> Subject: Re: [Yaml-core] utf8u tag proposal
> > It doesn't take long to see that there are multiple problems (and I
> > didn't even get into more complex types). How to map to and from
> > native types is a problem that the YAML processor and application
> > have to cooperate on. It is beyond the scope of the YAML
> > specification or the definition of YAML tags.
> > Oren has mentioned the idea of a schema language several times, and
> > that is a way to go about it for more general purpose applications.
>
>
> I do not agree with this table if the "YAML processor" is for
> general purpose. I wish that a general purpose YAML library
> is transparent through serialization and deserialization,
> namely it preserves the data and the data type as far as
> possible. Otherwise, it will not be widely used in real
> applications, especially when we are talking about a library for .NET.
>
> See, my library always distinguishes byte from ulong.
>
> // Instantiates a serializer with the default configuration.
> YamlSerializer serializer = new YamlSerializer();
>
> // Put byte and ulong values into an untyped array.
> object[] objs = new object[] { (byte)1, (ulong)1 };
>
> // Serialize them.
> string yaml = serializer.Serialize(objs);
>
> // %YAML 1.2
> // ---
> // - !System.Byte 1
> // - !System.UInt64 1
> // ...
>
> // Deserialize them.
> object[] restored = (object[])( serializer.Deserialize(yaml)[0] );
>
> // Confirm that the types are preserved.
> Assert.IsFalse( restored[0].Equals(restored[1]) );
> // ~~~~~~~ (byte)1 does not equal (ulong)1.
To always have an exact match using tags like that would result in every
programming language having its own set of tags. This is fine for
serialization and deserialization using the same language and library (and,
therefore, useful to solitary applications), but it causes problems when we
need to exchange documents between different languages or libraries (when we
have systems of applications, such as in an enterprise, or because we are
using utilities created by others that may be written in other languages).
<Begin long example>
Say, for example, that we want to create a system for posting product
information to an Intranet for reference by our sales department. So let's
say that we create an application in C# that will create the content, and we
save the created content to a YAML document. Of course, we want to be able
to edit it again within the C# program as well, so we serialize it using
tags that are specific to each type (local tags such as in your example).
Now let's say that on the server we have a PHP page that actually displays
the content. At first, it's still not much of an issue. Since the PHP program
is just displaying data, it generally doesn't need to understand what the
tags mean; it can treat them all as strings. Of course, where it does need
to understand the actual meaning, the PHP programmer now must create a
mapping from each of the tags used for C# to a type that PHP can manipulate.
All goes well in our initial release and our users are happy. So happy, they
ask for an improvement! They want to be able to modify the content on the
PHP page. Uh, oh... we just hit our first snag. Where before we only had to
translate the data into PHP's types, now we also have to map them back
again. And now we have to do it for ~every~ type that is exposed using a
custom tag. Still, if our data is simple enough (if it doesn't have too many
type specific tags), then this might not be too big a deal. But let's
remember that we are having to write our own code to do this. If we were
using a YAML library for PHP, it wouldn't know about all those tags we
created for our C# types, and so we would have to instruct it what to do with
every one of them rather than letting it handle the details (which it could
have done, if it's well written, to and from YAML types).
Still, all in all, it doesn't seem like too much of an issue yet. Then, one
day, the sales department comes to us and tells us that the pricing
information listed on all of their pages needs to be updated. The process is
somewhat complicated because we need to, first of all, read in each product
in a given document, then match it to the data that sales provided, display
it to a user, let them verify it, and then update the original document. We
are then told that a utility exists to do this already: a utility written
in Python that only needs to be told how to recognize the fields in
question. Our boss, therefore, thinks it is no big deal and tells our sales
manager that it can be done today. Then, he comes to us and asks us to do
it. And we can't. Why? Because the utility doesn't recognize our data types
and so it refuses to load the document. Oh, we can rewrite it... but it
won't be done by the end of the day (we're either staying really late or
else our boss has to go back to the sales manager and tell him it will take
a few days... guess which he is going to choose?).
<End long example>
As you might guess from this and other posts, I'm a big fan of
interoperability. Even if my (still way too short) example doesn't convince
you (real cases get much more complex), it just makes
life easier when you have a common data format that applications agree upon.
It allows different groups of developers to use different platforms that
better match their unique needs (the C# vs. PHP part of the example) or even
utilities developed by 3rd parties (the Python part of the example). This
breaks down the less agreement there is about the format, types, and
encoding used.
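To make the Python part of the example a little more concrete, here is a rough sketch of what a consumer runs into. It is only an illustration: it uses the standard library, models nodes as (tag, text) pairs instead of parsing real YAML, and the names STANDARD_CONSTRUCTORS, CSHARP_TAG_ALIASES, and construct are made up for this sketch rather than taken from any library.

```python
# Hypothetical sketch: a consumer that only knows a few standard YAML tags.
# Nodes are modeled as (tag, text) pairs; no real YAML parsing is done.

STANDARD_CONSTRUCTORS = {
    "!!int": int,
    "!!float": float,
    "!!str": str,
}

# The mapping each consumer would have to write by hand for the C# tags.
CSHARP_TAG_ALIASES = {
    "!System.Byte": "!!int",
    "!System.UInt64": "!!int",
}


def construct(tag, text, aliases=None):
    """Build a native value from a tagged scalar, or fail on an unknown tag."""
    resolved = (aliases or {}).get(tag, tag)
    try:
        constructor = STANDARD_CONSTRUCTORS[resolved]
    except KeyError:
        raise ValueError(f"unknown tag: {tag}") from None
    return constructor(text)


nodes = [("!System.Byte", "1"), ("!System.UInt64", "1")]

# Without the hand-written mapping, the document cannot be loaded.
try:
    [construct(tag, text) for tag, text in nodes]
    loaded_without_aliases = True
except ValueError:
    loaded_without_aliases = False

# With the mapping, both values load, but both become plain Python ints.
values = [construct(tag, text, CSHARP_TAG_ALIASES) for tag, text in nodes]
```

Note that once the mapping exists, the byte and the ulong collapse into the same native type, so writing the document back out in a form the C# program accepts needs a second mapping in the other direction.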
As an aside too, in C#, it really isn't necessary to explicitly state what
type each member of a serialized class is. When deserializing, that is
already known from the type of the class being deserialized (and,
consequently, the type of each member as it is deserialized). The only thing
that would need to be known is the type of the root object. This can be
accomplished by the application stating what that type is when it calls
the deserializer, by storing it as a scalar at the beginning of the
document, or by using a local tag to specify the type. This is one way of
defining a schema (in this case, the schema is implied by the type system)
and is why I mentioned schemas as a solution to the problem of mapping data
types. Schemas can be implicit or explicit. Using a class hierarchy would be
implicit whereas an external schema, such as I mentioned, or something
similar would be explicit.
This would be more of an issue with a dynamic language, but they generally
don't care as much about the type in the first place, and generally don't
have as many basic types to begin with, so it is usually moot or, once again,
is implicitly defined within the application.
> For me, this is natural because when the users of the library
> serialize a byte value into YAML, their main interest is not
> which data type is of the YAML's standard, but is whether or
> not their data is stored and restored without any modification.
> Otherwise, they will give an int value instead of a byte value.
> Many-to-one type mapping by a library in order to use only
> YAML's standard types seems to force the unwanted portability
> to an application with unwanted pain to realize it. Umm, I
> have to agree that I can not have one-to-one map to !!null
> for .NET, as I have written the other day. I now think the
> one-to-many mapping does not hurt the library users.
If you are defining local tags anyway, you can always define them in such a
way that lets you give them a null value when you need to. You are, in
effect, defining the data types that will be in the YAML document and so you
can specify the format of their values in the YAML document (a fact that you
can also use to overcome your line feed issue, again if you are defining
your own local tags).
> I give another example to make the issue clearer.
>
> public class Test
> {
>     public byte a = 3; // initial values are specified
>     public ulong b = 2;
> }
>
> string yaml = serializer.Serialize(new Test());
> // %YAML 1.2
> // ---
> // !Test
> // a: 3
> // b: 2
> // ...
>
> Here, the Tags !System.Byte and !System.UInt64 are not
> specified explicitly to the fields a and b but implicitly
> resolved from the tag !Test to the parent node. They are not
> converted to !!int for serialization.
Right, the schema is implied by the Test type. As an aside, most
applications are going to know what type the root object is and so can
provide that type to the deserializer. For this reason, it usually isn't
necessary to specify the type of the root node either.
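As a minimal sketch of that idea (in Python, standard library only), the application hands the root type to a deserializer, and the field types of that class act as the schema for the untagged scalars. The deserialize function here is hypothetical, and since Python has no byte or ulong, int stands in for both.

```python
from dataclasses import dataclass, fields


@dataclass
class Test:
    # Python has no byte or ulong, so int stands in for both here.
    a: int = 3
    b: int = 2


def deserialize(root_type, mapping):
    """Convert untagged scalar text using the field types of root_type."""
    kwargs = {}
    for field in fields(root_type):
        if field.name in mapping:
            # field.type is the annotation (int here); call it to convert.
            kwargs[field.name] = field.type(mapping[field.name])
    return root_type(**kwargs)


# The untagged mapping a parser might hand over for "a: 3" and "b: 2".
restored = deserialize(Test, {"a": "3", "b": "2"})
```

No tag appears anywhere in the input; the only type information the caller supplies is the root type, Test.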
> So, regarding to the preservation of the original data and
> data type, I understand William's desire. If I determine to
> strictly preserve the content of string variables, for the
> users of the library, I will have to serialize all the string
> variables as !!utf8u.
> At this moment, I'm assuming the users do not require it, though.
Worst case, .NET string variables can always be stored as double quoted
scalars in YAML, unless they contain an unpaired surrogate. That is not a
common use case, but it is why I said a string could end up being mapped to
!!utf8u (when that tag exists) instead of !!str.
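The surrogate case is easy to demonstrate. The following is a small Python sketch (Python strings, like .NET strings, can hold a lone surrogate). It only shows that such a string is rejected by a strict UTF-8 encoder; it makes no claim about how !!utf8u itself would be defined, and is_valid_unicode_text is a name made up for the illustration.

```python
# A well-formed string: U+1F600 is a single code point outside the BMP.
paired = "\U0001F600"

# A lone high surrogate, which a .NET string can also contain.
unpaired = "\ud83d"


def is_valid_unicode_text(s):
    """Return True if s can be encoded by a strict UTF-8 encoder."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


paired_ok = is_valid_unicode_text(paired)
unpaired_ok = is_valid_unicode_text(unpaired)

# The raw code unit can still be carried as bytes if the encoder is told
# to let surrogates through; the result is not well-formed UTF-8.
raw = unpaired.encode("utf-8", "surrogatepass")
```

Here paired_ok is True and unpaired_ok is False, which is the distinction that would push such a string out of !!str.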
> With reading this, I'm very much afraid that no general
> purpose library will give a support to !!utf8u, because it is
> too uncertain for me what is expected to a library. Thus, I
> do not think it is a good idea to have !!utf8u in the tag
> repository unless we have quite a lot of use cases where an
> independent class similar to my example is beneficially used.
>
> I do not deny that YAML can have its own data type. But when
> no other language have such a data type, we have to think
> carefully how it will be implemented in a library and used in
> a real application. Note that it is almost hopeless to build
> an application without a library, because of the complexity
> of the YAML's syntax.
There is value to the tag even if it isn't commonly implemented by
libraries. Most libraries allow users to define their own tags and so having
a standard type that users can define when they need it still aids
interoperability (rather than each user defining their own tag). Being a
part of the standard repository also allows users to look up how to
interpret the tag if their library doesn't support it.
> On the other hand, if we adopt the utf8u merely as a
> variation of encoding, as the next,
>
> - !!str @utf8u "..."
> - !!binary @utf8u "..."
> - !<!byte[]> @utf8u "..." # !< > is to have non ns-tag-char
> # "[" and "]" in the tag
>
> it will be more commonly implemented in libraries because the
> role of a library is clear enough. I understand utf8u should
> theoretically be independent of the real data type, to which
> an utf8u data is stored (!!str, !!binary or !byte[]).
> But I still think this will work much better in practice.
>
> In this sense, I would like to have YAML's specification
> practical than ideal. I wish YAML makes the real problems
> easily solved than it is theoretically nicely defined.
>
> Well, here, I assumed we had already had the encoding
> property. Since I'm not sure if the encoding property is
> really beneficial enough to modify the current YAML's syntax,
> I have not reached the conclusion how the utf8u should be
> defined in the YAML specification.
>
>
> > As an aside, I'm still interested in the idea of encoding tags.
> > Although !!utf8u might not be a valid candidate for an encoding tag
> > (because the data it represents doesn't satisfy the constraints of
> > the str type), I can see other cases where it might be useful, such
> > as Osamu's earlier line feed problem. In general, however, I think
> > some method for providing hints about the original data type would
> > be more useful (was that !!null node an object, a string, a byte
> > array, or what?). Schemas, whether defined implicitly or explicitly,
> > can solve both of these issues, however. (I guess I've joined the
> > schema bandwagon =p).
>
> Yes, I was also thinking the same for the encoding property,
> and reached almost the same conclusion at this point.
>
> At first, I studied doing the next to solve the line break
> preservation problem with !!utf8u,
>
> string s = "abc\r\n def\r\nghi\r\n";
> string yaml = serializer.Serialize(s);
>
> // %YAML 1.2
> // ---
> // !!utf8u |2+
> // abc%0d
> // def%0d
> // ghi%0d
>
> instead of doing the next.
>
> // %YAML 1.2
> // ---
> // "abc\r\n\
> // \ def\r\n\
> // ghi\r\n"
>
> I saw the former is better but still less readable, and it
> does not work if the line break is "\r". I would like the
> next better if I had such an encoding property.
>
> // %YAML 1.2
> // ---
> // @crlf |2+
> // abc
> // def
> // ghi
>
> But then, I would like to have a @crlf in front of every
> multi-line string, similarly as William wants to have his
> string values all in !!utf8u. So, I thought of specifying the
> default encoding for !!str as the next.
>
> %ENCODING !!str @crlf
>
> At this point, I realized this directive only theoretically
> makes the document portable. I can find almost no use cases,
> where a different application reads the document without the
> directive and causes any problem because of the difference in
> line feed. The only one case I found was that such an
> application loads the file with line break normalization to
> "\n" and saves the file in a double quoted text where the
> line feeds are explicitly escaped as "abc\n def\nghi\n".
> Except for this impractical anxiety, it is not too bad to
> handle the document without the directive, with exchanging
> the knowledge of the real linefeed format for the unescaped
> multi-line strings in the YAML document, as the form of
> implicit or explicit schema for the data.
>
> So, I just added an option to my library to normalize all
> unescaped line feed in the multi line text in YAML to any
> form of line break. I now think this is the best way to solve
> the problem, though the document will not be strictly
> portable. Well, honestly speaking, I also had another point,
> I didn't want to think of "!!str @crlf @utf8u". :(
I've been thinking it would be nice to have decorators in addition to the
tags for a given node. The idea would be that one or more decorators can be
added to a given node to provide additional information that the application
may use. In fact, I would see two different facets to the definition of
decorators. They can be local (application specific) or global (standard)
and they can be required (a library should abort if it doesn't know what to
do with it) or optional (a library can safely ignore the decorator if it
doesn't know what to do with it, but should try to preserve it). I say
"decorators" not "encoding tags" because they are distinct from tags (which
specify a type) and because there is more potential to them than just
specifying an encoding. You could, for example, use a decorator to indicate
the native type of an object without requiring other YAML libraries to
understand it and without changing the YAML type.
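To sketch the required/optional rule, here is how a loader might treat the decorators on one node. This is purely hypothetical, since decorators do not exist in YAML: the Decorator class, the apply_decorators function, and the decorator names are all invented for the illustration (Python, standard library only).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decorator:
    name: str
    required: bool  # True: a loader must abort if it does not know the name


class UnknownRequiredDecorator(Exception):
    pass


def apply_decorators(decorators, handlers):
    """Split decorators into handled and preserved, aborting when required.

    handlers is the set of decorator names this loader understands.
    """
    handled, preserved = [], []
    for decorator in decorators:
        if decorator.name in handlers:
            handled.append(decorator)
        elif decorator.required:
            raise UnknownRequiredDecorator(decorator.name)
        else:
            # Optional and unknown: ignore it, but keep it for round trips.
            preserved.append(decorator)
    return handled, preserved


node_decorators = [
    Decorator("native-type", required=False),
    Decorator("editor-hint", required=False),
]
handled, preserved = apply_decorators(node_decorators, handlers={"native-type"})
```

The local/global distinction is not modeled here; it would only affect how the names are looked up, not the abort-or-preserve decision.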
In fact, decorators aren't a good way to specify encodings (including your
idea of an encoding tag). The problem with that is that the YAML data type
specifies the encoding. Consequently, alternate encodings need to be a part
of the data type itself and, therefore, their specifiers need to be a part
of the tag. Your earlier idea of using a "#" to attach an extra part is
closer to this than the idea of a detached tag.
In any case, utf8u does not fit in this category. It must have its own type
because it does not obey the constraints on the type defined by the !!str
tag. Nor can it currently be the type defined by the !!binary tag because
the presentation of that type is unacceptable. It could be used in
conjunction with the type specified by !!binary (and this makes sense since
both must generally be returned as byte arrays), but to do so requires that
we be able to modify its presentation (which is why I said the idea of using
"#" to attach an encoding specifier and modify the type makes more sense
than having a separate encoding tag).
For now, I agree with you that the best option for scenarios like your line
feed scenario is for libraries to provide options to the application on how
to interpret the data. Having a way to specify schemas explicitly would go a
long way to solving issues like that one and possibly eliminate any use case
for other forms of decorators (as well as looking nicer since it wouldn't
require decoration on every node). Interestingly, it probably would not
eliminate the need for encodings, but would move it out of the main file
where it clutters everything.
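A library option of that kind can be quite small. The following is a sketch in Python (standard library only) of a line break normalization step applied after parsing; normalize_line_breaks and its line_break parameter are assumptions for the illustration, not the API of any existing library.

```python
import re

# Matches any of the three common line break forms; "\r\n" is listed first
# so that it is treated as one break rather than two.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def normalize_line_breaks(text, line_break="\r\n"):
    """Replace every line break in text with the configured form."""
    return _LINE_BREAK.sub(line_break, text)


# What a parser would hand back for an unescaped multi-line scalar,
# since YAML normalizes line breaks in scalar content to "\n".
parsed = "abc\n def\nghi\n"
restored = normalize_line_breaks(parsed, "\r\n")
```

The application, not the document, decides which form it wants, which is exactly why the document by itself is not strictly portable.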
> So, I currently have no need for @binary in my YAML library
> because the parser always knows that the node is encoded with
> base64 without an explicit node property.
My comments about how I would do things in C# were not necessarily the
"right" way or "best" way to do things. They were how I think I would do it,
but with only a little thought about the best way for each type. The main
point of my example was to show that there is a many to many relationship
between native types and YAML types, so please don't take it as a critique
of your method, which I have yet to examine in depth. By using local tags,
you effectively side stepped many of the issues involved in this many to
many relationship.
> Fmm, maybe there are very few use cases where the encoding
> property should be explicitly specified.
>
> However, we should realize that we have implicit encoding
> properties anyway for these nodes. In addition, unless we have
> somehow define the variety of encoding in the YAML specification,
> library implementers can not determine whether or not it should
> use base64 or any other format to encode some specific data.
> Now, the definition of !!binary and other tags gives a regulation
> (or just hints?) for the encoding format, even though their
> original purpose was to define their own data type.
>
> As another example, I would like to have @int encoding for
> all native integer types of .NET and @float encoding for all
> floating point types. I mean, accepting "10_000_000" as
> !!int with rejecting "!System.UInt64 10_000_000" for ulong
> (because .NET by default does not convert "10_000_000" into
> ulong) will make the users confused. So I would like to
> accept every YAML !!int style for every .NET integer type.
> This can not be done unless !!int gives me a hint for the
> standard encoding format for integer type values in YAML.
I'm not sure what you meant by that, so I'll have to think it through...
> Umm, I myself might be a little confused about the
> relationship between a YAML's data type and the accompanying
> encoding style. I'm also afraid to be going to mess up
> the YAML specification, despite that I'm trying hard to
> organize the relationship. :(
I don't think Oren and the other spec writers would actually introduce
anything that is going to mess up the specification. There are too many
people hashing over each idea presented and giving feedback, which gives all
of us a chance to learn without fear of messing something up.