Thread: [Yaml-core] Re: YAML Implementations a Plenty

Hello Brian. 

Summary...

  1. This is good news, nice to see progress.

  2. I'll be back on the C impl by the 22nd,
     the interfaces are solid, coding won't
     be too tough.

  3. See the new whitespace folding proposal below.
     In this proposal, new lines and leading/trailing
     whitespace are not significant.

  4. I'd like to add binary values and leaves
     back into the core model since this problem
     is solved (see earlier posts)

Details...

| While we wait for your C engine...

Very cool.

| I really want to start playing with this stuff *now*. 

Absolutely...

| - Interface consists of 2 functions: serialize() and deserialize().
|   - serialize() takes a list of hashes and returns a string
|   - deserialize() takes a string and returns a list of hashes

Nice.  The bulk of the "C" work is really on the
parser/printer interface.  Comments on the interface by
implementers in other languages would be very cool.

| - Support Hashes, Lists, Scalars, Objects and Undef(Null).
|   - BTW, why isn't "Null" in the data model currently?

It should be added.  It is in the "C" API.  Also, due to
the re-formulation of the "C" API and the discovery that
binary values can be round-tripped reliably, keys should
be allowed to be binary as well!

| - Can you explain why folding is needed again. Nobody I'm
|   working with seems to get it.

0.   See a compromise proposal below.

1.   Folding is simple.  Consecutive whitespace in the 
     YAML text is condensed into a single space 0x20.
     Other whitespace must be escaped, just like $@#
     and other significant characters must be escaped.

2.   For most of the business use cases, whitespace formatting 
     is not significant.  An extra space or two or a new 
     line or a tab character are not important.   If the 
     distinctions *are* important, then they should be 
     modeled, for example, like the HTML <P>aragraph marker.
     In my opinion, HTML's success was in no small part
     due to its whitespace folding properties.

3.   For readability, it is often important to keep the YAML
     file word-wrapped by column 76.  If whitespace is significant
     then this becomes problematic or hard to define (understand).
     This is important when restructuring texts (cut/paste).

4.   I strongly object to "layering" this.  If it is layered,
     then there can be different interpretations, etc.
     Confusing, YUCK.  Ask any XML guru how xml:space works.
     You will be greeted with a sigh, or more likely
     a blank stare.  Further, YAML is different from XML 
     in that whitespace has markup significance, so it is 
     helpful if it does not also have application significance.

| Why would you ever want "lossy" data serialization? 

It is not lossy.  Not by any stretch of the imagination.
Lossy means that what you put in, you don't get back out. 
Not true here.  If you use the API to send "A  B", then the 
extra space will have to be escaped.  This isn't lossy, it
is a simple rule.  Much like :@#$ need to be escaped.
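
A minimal Python sketch of why folding plus escaping round-trips.
The escape spellings used here (\x20, \n, \t, \r, \\) and the
names emit() and parse() are assumptions made for illustration;
the thread does not fix an escape syntax.

```python
import re

# Hypothetical escape table; the thread does not define the spellings.
_ESCAPES = {"\\": "\\\\", "\n": "\\n", "\t": "\\t", "\r": "\\r"}
_UNESCAPES = {"\\": "\\", "n": "\n", "t": "\t", "r": "\r", "x20": " "}


def emit(value):
    """Write a scalar so that folding on load cannot change it."""
    out = []
    last = len(value) - 1
    for i, ch in enumerate(value):
        if ch == " ":
            # A space survives folding only when it is a single interior
            # space; leading, trailing, and repeated spaces are escaped.
            exposed = i == 0 or i == last or value[i - 1] == " "
            out.append("\\x20" if exposed else " ")
        else:
            out.append(_ESCAPES.get(ch, ch))
    return "".join(out)


def parse(text):
    """Fold literal whitespace runs to one space, then undo the escapes."""
    folded = re.sub(r"[ \t\r\n]+", " ", text).strip(" ")
    return re.sub(r"\\(x20|[ntr\\])", lambda m: _UNESCAPES[m.group(1)], folded)
```

Under these assumptions the emitted text can be re-wrapped at any
literal space without changing the value that parse() returns.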

| Maybe folding doesn't belong in the data model.
| it could just be a standard YAML tool, like MIME
| encoding. It would certainly simplify our syntax 
| considerably.

I think I'd like to build the BASE64 encoding
into the core as well, using the [BASE64] mechanism
per the proposal before coloring.  Given the ability
to strongly support this at the API level, I'm 
in favor.  

How the base64 encoding affects denter: (a) if you
have a unicode string, then the input is encoded
as unicode, (b) if you have a regular, 8-bit clean
string, then...

  1. If the binary value is valid UTF-8, (fitting
     the char production) then the value can be
     saved as unicode.

  2. Otherwise, the value is Base64 encoded.

When loading, the non-base64 strings can be 
read into the unicode scalar value.  Otherwise,
they should be loaded into the 8-bit clean string.
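
A rough Python sketch of the save/load rule for case (b), an 8-bit
clean string.  The marker placement (a "[BASE64]" prefix on the
text), the function names, and the simplified stand-in for the
"char production" test are all assumptions for illustration.  Case
(a), a value that is already a unicode string, needs no decision
and is left out.

```python
import base64

MARKER = "[BASE64]"  # hypothetical spelling and placement of the marker


def _fits_char_production(text):
    # Simplified stand-in for the char production (an assumption):
    # reject C0 control characters other than tab, LF and CR.
    return all(ch in "\t\n\r" or ord(ch) >= 0x20 for ch in text)


def dump_scalar(raw):
    """Save bytes as text when they are valid UTF-8, else as Base64."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = None
    if (text is not None and _fits_char_production(text)
            and not text.startswith(MARKER)):
        return text
    return MARKER + base64.b64encode(raw).decode("ascii")


def load_scalar(text):
    """Return bytes for Base64-marked values, a unicode string otherwise."""
    if text.startswith(MARKER):
        return base64.b64decode(text[len(MARKER):])
    return text


def as_binary(value):
    """Re-cast a unicode string as its UTF-8 bytes; pass bytes through."""
    return value.encode("utf-8") if isinstance(value, str) else value
```

Applying as_binary() to whatever load_scalar() returns gives back
the original bytes in both branches of this sketch.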

Regardless, from what I understand about the perl
binding, the unicode string is in UTF-8, so this
can be treated as a binary if required without
data corruption.

...

On the other platforms, there will be a function
asBinary() that when applied to a unicode string
will re-cast the string as a UTF-8 encoding
so that the binary value isn't corrupted.  It
will work just fine... and I'd like to have binary
values built into the core.

| - Changed the syntax for block mode. See below.

Very cool.

| - Don't support references yet. 
|   - Just to make it simple for everyone to implement.
| 
| - Detect and terminate on circular (but not duplicate) refs.
|   - Duplicate refs are merely serialized multiple times right now.

Ok.

| - Support "#classname" syntax for marking data as objects.
| 
| - Use doublequote to remove ambiguity from certain "special" single
|   line values.

Ok. 

|   - Doublequotes do not fold whitespace in this impl.

Ok.  Let us compromise here (one I've been thinking about
for a while now).  Within a double quoted string value, 
the whitespace around each line break (trailing whitespace,
the new line, and the next line's leading whitespace) is 
folded into a single space unless the line ends with a 
trailing \

   one: "This is a multi line scalar.  Where these two spaces 
         are preserved, but new lines are not preserved."
   two: "If you want a new line.\n It must be explicitly given.
         Further if the line ends with a slash, it is contin\
         ued without the intermediate space."

Good compromise?  The primary thing that this
mechanism gives you is the ability to re-format as...

                    one: "This is a multi line scalar.  \
                          Where these two spaces are preserved, 
                          but new lines are not preserved."
                    two: "If you want a new line,\n It must be
                          explicitly given. Further if the line
                          ends with a slash, it is continued
                          without the intermediate space."
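
A small Python sketch of the double-quote rule proposed above,
applied to the text between the quotes.  The function name and the
handling of escapes other than \n and \t are assumptions for
illustration, not part of the proposal.

```python
import re


def _continues(line):
    # An odd number of trailing backslashes means the last one is a
    # line continuation rather than an escaped backslash.
    return (len(line) - len(line.rstrip("\\"))) % 2 == 1


def fold_quoted(body):
    """Fold the raw text between double quotes into its scalar value."""
    lines = body.split("\n")
    out = lines[0]
    for nxt in lines[1:]:
        nxt = nxt.lstrip(" \t")
        if _continues(out):
            # Trailing backslash: join with no intermediate space and
            # keep any whitespace that came before the backslash.
            out = out[:-1] + nxt
        else:
            # Otherwise the whole line break folds to one space.
            out = out.rstrip(" \t") + " " + nxt
    # Assumed escape handling: \n and \t are special, any other
    # escaped character stands for itself.
    return re.sub(r"\\(.)",
                  lambda m: {"n": "\n", "t": "\t"}.get(m.group(1), m.group(1)),
                  out)
```

With this reading, both layouts of "one" above fold to the same
value, which is the re-formatting property being asked for.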

| foo : %
|     bar : @
|         <<4
|         The "4" above is the number of lines in the block.
|         It is basically an emitter comment,
|         not part of the
|         data model.
|         >>

Ok.  In this case is there a significant
carriage return before The?  And, as we
discussed, this block does not have any lines
with leading whitespace? 

|         <<2-
|         The minus sign above indicates that
|           Line #2 doesn't have trailing newline
|         >>

This line does have leading whitespace?

|     baz : <<B5
|     This text contains lines that
|               might otherwise confuse the parser
|     >>
|     A>>
|         See what I mean?
|     B>>

Ok.  But if you do this, I'd make it consistent
and have it end with B5>>, this would mean 
changing the above examples.

|     <<2 : something
|     This syntax could easily be used 
|          for multi-line keys
|     >>
|     <<2 : <<2
|       a
|         key
|     >>
|      and a
|     value
|     >>

Syntax error detected at the "and a", right?

|     Clark : <<3
|     You should note that 
|        this syntax does not suffer from the indentation
|       ambiguity that you worried about on the phone :)
|     >>

If we used the proposed whitespace handling rule above 
then this is equivalent to...

     Clark : "You should note that\n   this syntax does
              not suffer from the indentation\n  ambiguity
              that you worried about on the phone ;)\n"

Is the compromise... ok?  At least it allows me to 
re-format the block so that it can be properly 
word-wrapped again if the map entry above has to 
be pushed in another 20 columns, so that a 76 
column margin can be maintained (with exceptions
of course).
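
A tentative Python sketch of how a reader for the quoted counted
block syntax might work, to make the equivalence above concrete.
Everything here is an assumption inferred from the examples: the
header shape (optional letter label, line count, optional "-"),
the closer being the label followed by ">>", and the indentation
of the closer being the amount stripped from each line.

```python
import re

# Assumed header shape: "<<", optional letter label, count, optional "-".
HEADER = re.compile(r"<<([A-Za-z]*)(\d+)(-?)\s*$")


def read_block(lines, start):
    """Read the block whose header is on lines[start].

    Returns (text, index of the first line after the closer).
    """
    match = HEADER.search(lines[start])
    if match is None:
        raise ValueError("no block header on line %d" % start)
    label, count, chomp = match.group(1), int(match.group(2)), match.group(3)
    end = start + 1 + count
    if end >= len(lines):
        raise ValueError("block is shorter than its declared line count")
    closer = lines[end]
    if closer.strip() != label + ">>":
        raise ValueError("expected %r after %d lines" % (label + ">>", count))
    # Assumption: the closer's indentation is the block's baseline.
    indent = len(closer) - len(closer.lstrip(" "))
    body = []
    for line in lines[start + 1:end]:
        lead = len(line) - len(line.lstrip(" "))
        body.append(line[min(lead, indent):])
    text = "\n".join(body)
    if chomp != "-":
        text += "\n"  # "-" means the last line has no trailing newline
    return text, end + 1
```

Because the line count decides where the block ends, lines that
look like a closer inside the body are treated as plain text.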

|     Oren: <<3
|        One huge reason for using this syntax
|      Is that it makes it easy to quote a YAML document
|        as a single string in another YAML doc
|     >>

1. I think this is nice, although I think whatever 
   characters immediately follow the << 
   should be used to match the EOS marker...

   <<EOS  .....  EOS>> or something like that.

2. How do you handle a significant carriage return
   at the beginning of the block?

3. I very much like whitespace folding rules not
   being in effect for this mechanism.  However, 
   it is important for there to be a mechanism
   where whitespace folding isn't a problem.

So glad to see YAML moving forward by other
parties!  

Best,

Clark

----- End forwarded message -----
