Thread: [Yaml-core] Official YAML Subset?

As I mentioned in my earlier post, I found it useful to implement a small
parser for a subset of YAML geared towards configuration files and
relatively small amounts of meta-data.  I'm proposing that YAML have an
official subset language to guide people who want to write small YAML
parsers.  The working title is POY (Piece Of YAML).

I see two broad usages for YAML.

1.  Data serialization
2.  Metadata

Data serialization can be things like representing any language's arbitrary
data structures in YAML, complete with round-tripping.  It's also things like
dumping out your store's inventory database into YAML for easier
transmission and interoperability.  Such documents can be extremely large.

For these things you need anchors, aliases, transfer methods, complex keys,
etc., along with all the other YAML stuff.  It also means YAML parsers must
be very efficient, taking care not to require that the whole YAML document
be loaded into memory, in case you're reading from an IO stream or a very
large document.

Metadata can be things like config files, build files (like Ant uses XML
for), change logs or extra data about a thing (like RPM spec header files).
They're often used in systems which are not centered around data.  They tend
to be relatively small.  These are typically written by a human or by very
simple code.

They don't require round-tripping.  Since they're small you can be sloppy
about how you use memory, often making parsing easier.  You don't need
anchors, aliases, transfer methods or complex keys.

I see a very broad potential for YAML in metadata, since it means you get,
for free, a data format which is both machine and human readable rather than
inventing Yet Another Ad Hoc format.  In those cases, YAML is merely a
convenience, not an essential part of the program.  Depending on a full
YAML parsing library might seem like overkill to an author, or might not be
possible at all.  Using a small parser may be easier.

A small parser is certainly easier to write.  My proof-of-concept state
parser (not finished) can be found here:
http://www.pobox.com/~schwern/tmp/YAML-POY/

The state transitions here:
http://www.pobox.com/~schwern/tmp/YAML-POY/States

And a rough draft of the reduced spec (basically, the unneeded productions
were removed) here:
http://www.pobox.com/~schwern/tmp/POY.spec
(BTW  That's what your spec looks like when run through an html2text filter.
More on that in another post)

Here is a short list of what was removed.  Unless otherwise noted they were
removed because they're either not needed or difficult to parse or both.
- Collection flows
- Complex, multi-line or quoted keys.  All that's left is "foo: "
- Scalar only documents
- Folded scalars (ie. >)
- Document end (ie. ...)
- Directives (ie. 3.3.2)
- Node Properties (ie. 3.3.4)
- Transfer Methods (ie. 3.3.5)
- Anchors (ie. 3.3.6)
- Aliases (ie. 3.4)

These things are under consideration for removal.  They're left in because
they're either needed or trivial to parse.
- Single and double quotes (I've figured out an easy way to parse them)
- Unicode                  (a necessary evil)
- Explicit indentation (ie. |2)  (very easy to handle)
- Chomping (ie. |-)              (ditto)

And the only actual addition to the spec is extending the reserved character
set to include things for forward compatibility with the full YAML spec.  For
example, flow scalars in the small spec can't start with {, [, *, etc.,
because that would mean something different in YAML.
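
A sketch of how a small parser might enforce this, in Python.  Only {, [ and
* are named above.  The other characters in the set are my guess at the
"etc.", picked from the removed features (anchors, transfer methods, folded
scalars), so treat them as assumptions.  The function name is hypothetical.

```python
# '{', '[' and '*' are named in the proposal; '&', '!' and '>' are
# assumed here because they introduce features the subset removes.
RESERVED_START = {"{", "[", "*", "&", "!", ">"}


def check_plain_scalar(value):
    """Reject an unquoted scalar that full YAML would read differently."""
    if value and value[0] in RESERVED_START:
        raise ValueError(
            f"scalar {value!r} starts with reserved character {value[0]!r}; "
            "quote it to keep the same meaning in full YAML"
        )
    return value


print(check_plain_scalar("hello"))
```

Rejecting these outright, rather than treating them as ordinary text, is
what keeps the same document meaning the same thing to a full YAML parser.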

The spec is intended as a complete subset of YAML.  Any POY document will
mean the same thing in YAML.

Real world use cases:
- MakeMaker
- Module::Build
  - Perl is moving to a metadata format for modules (ie. a file to stick
    things like Name, Author and Version into).
  - Neither of these can depend on YAML.pm because of circular dependencies
    (Module::Build gets away with it because YAML is currently built with
    MakeMaker)
  - They would have to provide their own YAML parsers.
- Perl core
  - There are lots and lots of applications for YAML in the Perl core.
    Change logs, test data, config files, etc...
  - Getting YAML.pm into the core is politically hard.  It always degenerates
    into a YAML vs XML argument.  A small POY library can be snuck
    in without a lot of brouhaha.

Why:
- The full spec is rather difficult to implement completely.
- POY is relatively easy.
- POY covers the 80% of YAML that humans use.
- POY removes all the really confusing parts
  - Easier to document
  - Easier to convince people of YAML's utility
- People are going to implement subset parsers anyway, might as well have
  one official subset rather than N ad-hoc ones.

Why Not:
- Some parts of POY make no sense once others are removed, except to
  preserve forward compatibility
  - reserved characters
  - space after the comment
- It "waters down" the official spec.

Problems with POY itself:
- Collection flows are removed because they're difficult to parse,
  yet they're still useful for humans.  I'd like to keep them if an easy
  way can be found to parse them.
- Whither Unicode?
  - Unicode makes things complicated
  - Half the POY users won't care about Unicode
  - The other half need it.
  - Do we...
    - Require Unicode?
    - Make it optional?
    - Make a Unicode POY superset of POY (which remains a subset of YAML)?
    - Use a simple subset of the YAML Unicode productions?

Other Solutions:
- Include in the YAML spec a recommended order of implementation so users
  at least have a guideline of what features are likely to be implemented
  and what are not.

-- 

Michael G. Schwern   <sc...@po...>    http://www.pobox.com/~schwern/
Perl Quality Assurance      <pe...@pe...>         Kwalitee Is Job One
I know you get this a lot, but you're breathtaking, like a vision of
simplicity
