From: Clark C. E. <cc...@cl...> - 2003-09-08 00:09:31
|
Howdy. Been thinking about a YAML bytecode specification, mostly as a way to commonly express APIs in a manner which is more independent of language, independent of push vs pull semantics, and in a way such that various tools (such as a schema validator or formatter) can be introduced as components within a YAML processing framework. The core of this idea is to represent YAML information as "pre-parsed" instructions, indicating document boundaries, comments, scalars, sequences, mappings, etc. To allow for incremental delivery, the stream is delivered in a buffer which can be refilled as many times as required. In a simple case, the entire YAML stream may fit in a single buffer; in other cases, the buffer could be filled with exactly one document at a time, or, more granularly, large scalars can be split into chunks. In this way the actual call API is reduced to a very small, limited set of functions: one for push, one for pull, and an optional one to expand a buffer.

A surprising result of this method emerged almost immediately: style instructions are thought of more as a mechanism to describe the "paintbrush" used, so that setting a literal block style for a given transfer method is one instruction which then applies to the remainder of the stream. This allows styles, comments, and other presentation requirements to be entirely distinct instructions which can be ignored by a processor, or injected into an existing stream by a YAML 'painter', or interpreted by the printer.

Anyway, I'd love all of your feedback; so included below is a "C" header file which attempts to describe this yaml bytecode layer. I was thinking that I could "implement" this by hooking up Syck to produce the bytecode stream from its events. Following this I could write a Python wrapper for the bytecode layer, and then perhaps a pretty printer, etc. More advanced components would be a YPath processor, Schema validator, etc. Although, perhaps this is the wrong direction. I'm not sure. 
Following is the header file... Best, Clark

/*
 * yamlbyte.h
 *
 * This defines the layout of a YAML bytecode stream, aka "binary YAML",
 * which has the following goals:
 *
 * parsed: >
 *   Tools written to use this bytecode stream (loaders, filters, etc.)
 *   need not concern themselves with the complexity of styles and the
 *   various details of the YAML syntax.  While the bytecodes will
 *   themselves offer complexity, the complexity can be better organized
 *   and would reflect the complexity of any other form of API.
 *
 * push or pull: >
 *   By sticking with a 'data' based API, the differences between push
 *   and pull parsers can be limited to a single API call distinction.
 *   A push parser would call an event handler with a buffer holding
 *   one or perhaps more bytecode instructions, where a pull parser
 *   would be called with a buffer and could fill the buffer with as
 *   many bytecode instructions as would fit.
 *
 * full or streaming: >
 *   With a bytecode instruction layout, a parser could return an
 *   entire YAML document as a single result; or it could return
 *   a YAML document in chunks.  In essence, a language could
 *   provide a SAX-like or DOM-like API on top of these bytecodes.
 *   For a streaming API, the buffer could be 'moving', and for a
 *   DOM-like interface, the buffer could be static.
 *
 * language independent: >
 *   By specifying a YAML "preparsed" structure as a sequential memory
 *   block, each language could wrap the structure or layer the APIs
 *   on the structure as would seem appropriate to that language rather
 *   than having a single API.  In particular, a single YAML bytecode
 *   stream could even be streamed between different applications
 *   via shared memory or a pipe.
 *
 * tool support: >
 *   As a least-common denominator, several tools, such as a schema
 *   validator, path expression evaluator, or even transformation
 *   tools could be written to operate directly on the bytecodes.
 *   In this way, each language could wrap a "C" function to perform
 *   these mutations on the bytecodes without having to write the
 *   code themselves.
 *
 * efficiency: >
 *   While it is not the goal of YAML bytecodes, some people could
 *   argue that a "binary" format is a more efficient wire format.  In
 *   any case, the bytecode format can offer some optimizations to
 *   make it smaller for the most prominent situations.
 *
 * compatibility: >
 *   It is possible to provide for standardized mappings of other
 *   syntaxes (such as XML, RDF, SOAP, BER, etc.) into YAML bytecodes,
 *   in essence, allowing for different parsers and emitters.
 *
 * TODO: produce an extensive set of before/after samples with
 * differing buffer sizes to demonstrate how chunking works.
 */
#include <stdint.h>

typedef uint16_t yaml_bytecode_t;
/*
 * A YAML bytecode is a 16 bit value defined as a bitmask field.
 */

typedef uint16_t yaml_argone_t;
/*
 * Immediately following a bytecode is an optional argument to the
 * bytecode.  For an error message bytecode, this would be the
 * specific error message number.  For a typed node, this is a
 * handle to a transfer method.
 *
 * Note: if the value for argone (or argtwo or length) is greater
 * than 65533 (sixteen bits minus two), then the value is 0xFFFF
 * and a 64 bit value is stored at the beginning of the variable
 * length content.
 */

typedef uint16_t yaml_argtwo_t;
/*
 * Immediately following the first argument is a second optional
 * argument, the meaning of which depends upon the bytecode.  For
 * scalar and branch content nodes, this is a handle which indicates
 * that the given node may be referenced later.  For an alias node,
 * this is the previously referenced handle.
 */

typedef uint16_t yaml_length_t;
/*
 * To support variable length information, the last 16 bit piece
 * of information following the bytecode is a length, in 64 bit
 * chunks, till the next bytecode.  For scalar values or error
 * messages, this is the length of a UTF-16 encoded textual value.
 * All text values are null terminated, and thus, the worst case
 * terminator is a 64 bit zero.  Following the null terminated
 * string value may be the overflow of argone or argtwo.
 *
 * Note: Even if the length is zero, argone or argtwo or both
 * may overflow, and in this case, the actual size of the
 * bytecode may be longer than 64 bits.  The overflows
 * are stored immediately following the bytecode word
 * as length, argone, argtwo respectively.
 */
struct yaml_word {
    /*
     * A YAML bytecode stream is then a sequence of the above four
     * items packed into a 64 bit value which, when streamed, uses
     * little-endian byte order.
     *
     * There are several design decisions at work with this structure:
     * first, many computers use 64 bits for struct alignment.  Second,
     * 64 bit computers are on their way and thus having 64 bit lengths
     * and values is a real possibility.  Third, in YAML data, most
     * content nodes are short and most documents are relatively small;
     * thus, it would be ideal if most non-scalar nodes fit in 64 bits.
     */
    yaml_bytecode_t bytecode;
    yaml_argone_t   argone;
    yaml_argtwo_t   argtwo;
    yaml_length_t   length;
};

typedef uint64_t yaml_word_t;
typedef yaml_word_t *yaml_word_p;
typedef uint64_t yaml_size_t;   /* length in sizeof(yaml_word_t) units */

/*
 * YAML bytecodes start with a section specifying branching
 * information and how a given YAML node may be spread across
 * two or more buffers.
 *
 * In particular, when a scalar node will not fit within the
 * remaining part of the current buffer, it must be marked with
 * the SPLIT bit flag, and then in the next buffer, the same node
 * will continue and will be marked as RESUME.  If a branch node
 * has a child which is split, then it is also split, with a
 * pairing FINISH+SPLIT bytecode at the end of the current buffer,
 * and a START+RESUME bytecode at the beginning of the next buffer.
 */
#define YAML_START  ((yaml_bytecode_t)(0x1000))
        /* given bytecode is a branch and may contain other bytecodes;
         * the length signifies distance to a paired 'FINISH' bytecode */
#define YAML_FINISH ((yaml_bytecode_t)(0x2000))
        /* signifies the finish of a branch, length is always zero */
#define YAML_EMPTY  (YAML_START | YAML_FINISH)
        /* an empty branch, length is always zero */
#define YAML_SPLIT  ((yaml_bytecode_t)(0x4000))
        /* node did not fit in the buffer, it spills over into
         * the next buffer */
#define YAML_RESUME ((yaml_bytecode_t)(0x8000))
        /* a continuation of an incomplete node that was broken
         * because it did not fit within the buffer */

/*
 * YAML bytecodes are partitioned into four general categories,
 * each addressing a particular aspect of a YAML stream.  It
 * is invalid to combine any of these flags; this allows these
 * flags to be checked by the bitwise-and operator.
 */
#define YAML_CONTROL ((yaml_bytecode_t)(0x0100))
        /* this is your general flow control instruction, which
         * is used to signal document boundaries, registration of
         * type families and other non-content but important signals */
#define YAML_CONTENT ((yaml_bytecode_t)(0x0200))
        /* primary content of the yaml stream including scalars and
         * collections; also including alias handling */
#define YAML_STYLE   ((yaml_bytecode_t)(0x0400))
        /* comments, insignificant whitespace, and styling instructions
         * which can all be safely stripped or added by a formatting
         * layer interested in human presentation */
#define YAML_MESSAGE ((yaml_bytecode_t)(0x0800))
        /* error messages, warnings, debug traces, and other
         * informational notices which are not style, content, or
         * control */

/*
 * YAML_CONTROL
 */
#define YAML_BUFFER (YAML_CONTROL | 0x0010)
        /* A YAML bytecode stream can be delivered in a series of
         * convenient buffer-sized chunks; this is always the first and
         * last bytecode in the stream.  The leading bytecode must
         * have a non-zero length which jumps to the trailing bytecode.
         * Between these two bytecodes is the content of the chunk,
         * formatted as a series of bytecodes. */
#define YAML_DOCUMENT (YAML_CONTROL | 0x0020)
        /* Immediately following a buffer is the document bytecode which
         * signifies the begin/end of a document.  This is also a branch. */
#define YAML_INTERN (YAML_CONTROL | 0x0040)
        /* instructs the interpreter to allocate memory for a string value
         * and place it into memory for later reference; strings pinned
         * with this instruction must stay in memory for the entire length
         * of the YAML bytecode stream */
#define YAML_TRANSFER (YAML_INTERN | 0x0001)
        /* indicates that the interned value is a transfer method
         *
         * For this instruction, argone is the number of the storage
         * location where the transfer method will go.  Values of
         * 0x3FFF and below are reserved for built-in YAML approved
         * transfer methods, and across a stream transfer method
         * handles may not be recycled. */
#define YAML_CONSTANT (YAML_INTERN | 0x0002)
        /* indicates that the interned value is a 'prealias', a well
         * known constant used by a schema or transformation tool
         *
         * Many schemas use special key names to key on, and this allows
         * a YAML processor to be 'initialized' with those scalars which
         * will be frequently used.  In a manner similar to YAML_TRANSFER,
         * argtwo is used to name the alias handle for the constant.
         * It is an error to reassign alias handles set with this
         * instruction (although they can be reset with other aliases) */
#define YAML_OPTIMIZATION (YAML_CONTROL | 0x0080)
        /* indicates an instruction which represents some sort of
         * optimization in the structure for better lookups, etc.;
         * these items are allowed to have machine specific byte
         * orderings and should be ignored in general
         * TODO: this is just musing...
         */
#define YAML_BINARYEQUIV (YAML_OPTIMIZATION | 0x0001)
        /* for built-in well known transfer methods, like integer,
         * the payload of this instruction could be a 'binary'
         * version instead of the textual representation
         * TODO: this is just musing */
#define YAML_HASHTABLE (YAML_OPTIMIZATION | 0x0002)
        /* this can follow a mapping node to provide a hashtable...
         * TODO: this is just musing */

/*
 * YAML_CONTENT
 */
#define YAML_ALIAS (YAML_CONTENT | 0x0010)
        /* indicates an alias to a previously marked node; argtwo
         * contains the alias handle used to look up the replacement */
#define YAML_SCALAR (YAML_CONTENT | 0x0020)
        /* indicates a scalar node; argone contains the transfer method,
         * and argtwo can be used to mark the node with an alias handle
         * so that the scalar can be referenced later */
#define YAML_BRANCH (YAML_CONTENT | 0x0040)
        /* indicates a branch node; just like scalar, argone contains the
         * transfer method and argtwo can be used to mark the node for
         * use as an alias later. */
#define YAML_SEQUENCE (YAML_BRANCH | 0x0001)
        /* indicates the sequence structure, as an ordered list of
         * nodes; obviously only content nodes count as content. */
#define YAML_MAPPING (YAML_BRANCH | 0x0002)
        /* indicates a mapping structure, as a list of pairs
         * TODO: can this be done any differently? */

/*
 * YAML_STYLE
 */
#define YAML_IGNORABLE (YAML_STYLE | 0x0010)
        /* comments, whitespace, etc.
         */
#define YAML_COMMENT (YAML_IGNORABLE | 0x0001)
        /* a single line comment */
#define YAML_WHITESPACE (YAML_IGNORABLE | 0x0008)
        /* insignificant whitespace; note that a given scalar
         * can be split over several instructions with whitespace
         * instructions placed strategically in the middle,
         * argone contains how many of those spaces */
#define YAML_BREAK (YAML_WHITESPACE | 0x0002)
        /* a line break, argone says how many line breaks */
#define YAML_INDENT (YAML_WHITESPACE | 0x0004)
        /* sets the number of spaces to use for future indentation
         * with argone */
#define YAML_BLOCK (YAML_STYLE | 0x0020)
        /* instruction that changes the 'paint brush' to use the block
         * style when printing nodes
         *
         * argone: if provided, limits the paint brush to the particular
         *         transfer method specified
         * argtwo: if provided, this is filled with the bytecode mask
         *         to apply against, for example YAML_BLOCK or YAML_SCALAR. */
#define YAML_BLOCK_FOLDED (YAML_BLOCK | 0x0001)
        /* specialization of block style that specifies folded scalar */
#define YAML_BLOCK_LITERAL (YAML_BLOCK | 0x0002)
        /* specialization of block style that specifies literal scalar */
#define YAML_FLOW (YAML_STYLE | 0x0040)
        /* similar to YAML_BLOCK only it sets the paint brush to
         * use the flow style */
#define YAML_FLOW_PLAIN (YAML_FLOW | 0x0001)
        /* specialization of flow scalar that indicates the plain style;
         * of course an error could be created if this isn't possible.
         */
#define YAML_FLOW_SINGLE_QUOTE (YAML_FLOW | 0x0002)
        /* specialization of flow scalar style to indicate single quoted */
#define YAML_FLOW_DOUBLE_QUOTE (YAML_FLOW | 0x0004)
        /* specialization of flow scalar style to indicate double quoted */

/*
 * YAML_MESSAGE
 */
#define YAML_NOTICE (YAML_MESSAGE | 0x0010)
        /* specifies an informational message which should be sent to
         * the user if it is not understood
         *
         * argone: holds an error number
         * argtwo: holds a line number
         * value:  holds the message text */
#define YAML_WARNING (YAML_NOTICE | 0x0001)
        /* an unexpected event happened, but processing will continue */
#define YAML_ERROR (YAML_NOTICE | 0x0002)
        /* the producer will stop producing further instructions for
         * the current document and will move on to the next document */
#define YAML_FATAL (YAML_ERROR | 0x0004)
        /* previous nodes may have been invalid and the producer
         * will stop producing further instructions */
#define YAML_APPNOTICE (YAML_NOTICE | 0x0008)
        /* a mixin flag to indicate that the given notice was not
         * produced by the parser, but rather by an application;
         * in this case, argone should be > 0x3FFF */

/*
 * producer/consumer API
 *
 * The cool part of YAML bytecodes is the simple "C" call API, since
 * most of the complexity is moved into the bytecodes.  There are two
 * forms of the API, a push and a pull interface.  The yaml_push_t is
 * a callback function called again and again by the producer, while
 * yaml_pull_t is a pull function called again and again by the
 * consumer.  The push function simply sends the consumer's structure
 * back to them along with the next buffer; it responds with a buffer
 * containing YAML_MESSAGE instructions, or NULL if all is OK.
 * The pull function is passed an empty buffer and fills it; if a
 * resize function is passed, it can use this to resize the buffer
 * as required.  The pull function simply returns the same buffer
 * it was passed (or the resized one); any messages to the consumer
 * will be in the buffer.
 */
typedef yaml_word_p yaml_buffer_t;   /* first instruction is YAML_BUFFER */
typedef void * yaml_producer_t;      /* someone producing YAML buffers */
typedef void * yaml_consumer_t;      /* someone consuming YAML buffers */
typedef yaml_buffer_t (*yaml_realloc_t)(yaml_buffer_t buff,
                                        yaml_size_t newsize);
typedef yaml_buffer_t (*yaml_push_t)(yaml_consumer_t sink,
                                     yaml_buffer_t buff);
typedef yaml_buffer_t (*yaml_pull_t)(yaml_producer_t source,
                                     yaml_buffer_t buff,
                                     yaml_size_t size,
                                     yaml_realloc_t *realloc);

/*
 * Various helper macros/functions for operating on these data
 * structures.
 */
typedef int yaml_bool_t;
#define YAML_OVERFLOW ((uint16_t)(0xFFFF))

#ifdef YAML_SAFE
/* safe versions of various helper macros; these are
 * actual functions which must be called so that they
 * can check arguments and appear in the stack trace;
 * also, ideally these can call an error function
 * to invoke an exception handler (long jump)
 */
extern yaml_word_t YAML_MAKE_WORD(
    yaml_bytecode_t bytecode,
    yaml_argone_t argone,
    yaml_argtwo_t argtwo,
    yaml_length_t length
);
extern yaml_length_t   YAML_LENGTH(yaml_word_t word);
extern yaml_argone_t   YAML_ARGONE(yaml_word_t word);
extern yaml_argtwo_t   YAML_ARGTWO(yaml_word_t word);
extern yaml_bytecode_t YAML_BYTECODE(yaml_word_t word);
/* helper items for dereferencing and checking for overflow */
extern yaml_word_t YAML_DEREF_WORD(yaml_word_p word);
extern yaml_bool_t YAML_OVERFLOW_LENGTH(yaml_word_p pword);
extern yaml_bool_t YAML_OVERFLOW_ARGONE(yaml_word_p pword);
extern yaml_bool_t YAML_OVERFLOW_ARGTWO(yaml_word_p pword);
/* methods to handle dereferencing from a pointer and overflows */
extern yaml_bytecode_t YAML_GET_BYTECODE(yaml_word_p pword);
extern yaml_word_t YAML_GET_LENGTH(yaml_word_p pword);
extern yaml_word_t YAML_GET_ARGONE(yaml_word_p pword);
extern yaml_word_t YAML_GET_ARGTWO(yaml_word_p pword);
extern yaml_word_t YAML_GET_SIZE(yaml_word_p pword);
#else
/* the macro equivalents of the items above, without error checking
 * (note: the fields are widened to yaml_word_t before shifting, since
 * shifting a 16 bit value left by 48 is undefined) */
#define YAML_MAKE_WORD(bytecode, argone, argtwo, length) \
    ((yaml_word_t) \
     ((((yaml_word_t)(bytecode)) << 48) + \
      (((yaml_word_t)(argone))   << 32) + \
      (((yaml_word_t)(argtwo))   << 16) + \
      ((yaml_word_t)(length))))
#define YAML_LENGTH(word) \
    ((yaml_length_t)((((yaml_word_t)(word)) << 48) >> 48))
#define YAML_ARGTWO(word) \
    ((yaml_argtwo_t)((((yaml_word_t)(word)) << 32) >> 48))
#define YAML_ARGONE(word) \
    ((yaml_argone_t)((((yaml_word_t)(word)) << 16) >> 48))
#define YAML_BYTECODE(word) \
    ((yaml_bytecode_t)(((yaml_word_t)(word)) >> 48))
#define YAML_DEREF_WORD(pword) (*((yaml_word_p)(pword)))
#define YAML_OVERFLOW_LENGTH(pword) \
    ((yaml_bool_t)(YAML_OVERFLOW == YAML_LENGTH(YAML_DEREF_WORD(pword))))
#define YAML_OVERFLOW_ARGONE(pword) \
    ((yaml_bool_t)(YAML_OVERFLOW == YAML_ARGONE(YAML_DEREF_WORD(pword))))
#define YAML_OVERFLOW_ARGTWO(pword) \
    ((yaml_bool_t)(YAML_OVERFLOW == YAML_ARGTWO(YAML_DEREF_WORD(pword))))
#define YAML_GET_BYTECODE(pword) YAML_BYTECODE(YAML_DEREF_WORD(pword))
#define YAML_GET_LENGTH(pword) \
    (YAML_OVERFLOW_LENGTH(pword) ? \
     ((yaml_word_t)(*(((yaml_word_p)(pword)) + 1))) : \
     ((yaml_word_t)YAML_LENGTH(YAML_DEREF_WORD(pword))))
#define YAML_GET_ARGONE(pword) \
    (YAML_OVERFLOW_ARGONE(pword) ? \
     ((yaml_word_t)(*(((yaml_word_p)(pword)) + 1 \
                     + YAML_OVERFLOW_LENGTH(pword)))) : \
     ((yaml_word_t)YAML_ARGONE(YAML_DEREF_WORD(pword))))
#define YAML_GET_ARGTWO(pword) \
    (YAML_OVERFLOW_ARGTWO(pword) ? \
     ((yaml_word_t)(*(((yaml_word_p)(pword)) + 1 \
                     + YAML_OVERFLOW_LENGTH(pword) \
                     + YAML_OVERFLOW_ARGONE(pword)))) : \
     ((yaml_word_t)YAML_ARGTWO(YAML_DEREF_WORD(pword))))
#define YAML_GET_SIZE(pword) \
    (YAML_GET_LENGTH(pword) \
     + YAML_OVERFLOW_LENGTH(pword) \
     + YAML_OVERFLOW_ARGONE(pword) \
     + YAML_OVERFLOW_ARGTWO(pword))
#endif |
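To make the 64-bit word layout above concrete, here is a minimal sketch of the pack/unpack operations as plain C functions rather than macros. This is an illustrative re-statement of the proposed YAML_MAKE_WORD and accessor macros; the `yb_` names are hypothetical and not part of the header.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the four 16-bit fields into one 64-bit word, bytecode in the
 * most significant bits, mirroring the proposed layout: each field is
 * widened to 64 bits before shifting. */
static uint64_t yb_make_word(uint16_t bytecode, uint16_t argone,
                             uint16_t argtwo, uint16_t length)
{
    return ((uint64_t)bytecode << 48) | ((uint64_t)argone << 32) |
           ((uint64_t)argtwo  << 16) |  (uint64_t)length;
}

/* Unpack the fields again; truncation to uint16_t masks the rest. */
static uint16_t yb_bytecode(uint64_t w) { return (uint16_t)(w >> 48); }
static uint16_t yb_argone(uint64_t w)   { return (uint16_t)(w >> 32); }
static uint16_t yb_argtwo(uint64_t w)   { return (uint16_t)(w >> 16); }
static uint16_t yb_length(uint64_t w)   { return (uint16_t)w; }
```

A round trip through these functions recovers each field, e.g. packing a YAML_SCALAR (0x0220) with argone 7, argtwo 42, length 3 yields the word 0x02200007002A0003.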
From: why t. l. s. <yam...@wh...> - 2003-09-09 02:40:34
|
On Sunday 07 September 2003 06:13 pm, Clark C. Evans wrote:
> Been thinking about a YAML bytecode specification ...

heck of a good idea. i've always liked that Python's pickle has both binary and human-readable representations. i wonder if pickle could be leveraged? perhaps your vision of it is greater than i see on my simple scan of the header file...

part of me thinks you invented this idea because you're a COMPLETE control freak. and the only way to ensure proper parsing/loading/emitting is to verify exact bytecode at a given stage of the process. :D

well, good work. i'm going to look through the header file in detail when i get more time.

_why |
From: Clark C. E. <cc...@cl...> - 2003-09-09 09:01:45
|
On Mon, Sep 08, 2003 at 06:08:44PM -0600, why the lucky stiff wrote:
| i've always liked that Python's pickle has both binary and human-readable
| representations. i wonder if pickle could be leveraged? perhaps your vision
| of it is greater than i see on my simple scan of the header file...

Well, I would not credit me with too much vision... your quote below is probably closer to the truth.

| part of me thinks you invented this idea because you're a COMPLETE control
| freak. and the only way to ensure proper parsing/loading/emitting is to
| verify exact bytecode at a given stage of the process. :D

Anyway. I've gotten feedback from two people thus far. Nathan (adiabatic) made a few great points on IRC: (a) the number of bytecodes seems a bit up there, (b) he thinks it should be big endian for the network byte order, (c) he would rather see UTF-8 (or a UTF-8/16 switch). The whole encoding issue and how to store 'instruction length' in a platform independent way is tough. There are probably only a few 'unnecessary' bytecodes in the .h file; I was just musing on a few (esp the intern stuff). In general, I think that the apparent complexity comes from using masks...

Oren happened to call me yesterday (to ensure that I got the updated spec and post it...) and he provided feedback as well. As it turns out, Oren has something similar for the tool he is using, only that it is a bit lower-level: rather than having a bytecode for a 'scalar', he has a bytecode for a line of text as it appears in the stream. Also, Oren was using a single character for each bytecode (see below) and encoding the length of the bytecode as a variable length integer (without nulls). We kinda bounced ideas back and forth a bit and the following emerged (almost entirely Oren's insights):

1. Use a single ascii character for each bytecode; in this way it is easy to remember, and possibly even human debuggable.

   Content Codes:
     D  Document (new)
     M  Mapping
     S  Sequence
     V  Value
     *  Alias
     .
        Document, Mapping, Sequence end marker
     #  Comment
     ,  Continuation
     P  Pause (...) the stream

   Specifiers (which immediately precede the content code):
     &  An anchor, the anchor text immediately follows
     !  A transfer method, which immediately follows.

   Optional Formatting Codes:
     >  Let subsequent V nodes be "flow" style
     |  Let subsequent V nodes use "literal" style
     "  Let subsequent V nodes use "double quoted" style
     '  Let subsequent V nodes use 'single quoted' style
     ~  Let subsequent V nodes use 'plain' style
     [  Let subsequent S nodes use the 'inline' style
     {  Let subsequent M nodes use the 'inline' style
     ?  Unset the formatting flags

   For example:

     ---
     - plain
     - >
       this is a flow scalar
     - >
       another flow scalar which is
       continued on a second line
     - &001 !str |
       this is a block scalar,
       typed and anchored 001
     - "This is \"double quoted\""

   Would be written (where \n ends each variable length bytecode) as
   the following stream:

     DS~Vplain
     >Vthis is a flow scalar
     Vanother flow scalar which is continued
     ,on a second line
     &1 !str |Vthis is a block scalar, typed and anchored 1
     *1
     "VThis is "double quoted"
     ..

   Since bytecodes will be all ASCII printable, one will have an array
   255 long with binary values specifying if the bytecode is variable
   length or not, so is_varlen['D'] is 0, while is_varlen['V'] is 1.
   In a similar way, is_paired['D'] is 1, while is_paired['|'] is 0.
   While this may not be as efficient as the bit masks I was
   proposing... it is significantly more "debuggable". The also cool
   part about this is that the formatting bytecodes could be stripped
   or injected as needed.

2. While I like the above format, after the conversation with Oren, I had several concerns. First, in the above method, there didn't seem to be a way to specify the length of a scalar (in particular it could suck to have a continuation for each line). Oren suggested that we could make the stream UTF8, UTF16BE or UTF16LE and simply encode the length as a UTF code point.
This would give a maximum length of 0x10FFFF. Thus "LA" (0x4C41) specifies a length of 65 units (8 bits for UTF8, 16 bits for UTF16) and 0x4C07 would specify a length of 7... and ring your terminal's bell.

Second, I was concerned about being able to use the buffer as-is by pointing into it. This has two aspects: (a) on some platforms pointers are often expensive if they point to an address which is not aligned (to 32 or 64 bits, etc) -- this can be solved by a ' ' bytecode which is just padding, as needed; (b) and then many libraries expect strings to be null terminated -- this can be provided by specifying that the L instruction makes the next scalar terminate with \0\n instead of just \n. These two things make the L instruction rather ugly, as it really is a 'special' instruction which impacts the core mechanism. Hmm.

Further, if the entire YAML document was in a single buffer, then another instruction "J" for jump would also be useful; it would be similar to L only that it specifies how big a branch is. "DJz" (0x444A7A) would specify that the document's entire span in memory is 122 bytes. In this way one could 'skip' a document, or a subtree (as this instruction could come immediately after a branch).

3. In the previous proposal, you wouldn't provide the user with the raw transfer method, but would rather provide a normalized one. This can be done with two additional bytecodes, which could be used instead (or in addition to %).

     R  Register a type family; the first unicode character which
        follows is the 'handle' (giving 0x10FFFF handles) and is then
        followed by the normalized URL.
     F  Specify a registered family for an upcoming scalar using a
        registered bytecode.

Further, we could specify that several 'handles' are pre-loaded...
     RStaguri:yaml.org,2002:str
     FS
     Vthis is a string

There are several other higher order items that I had done, and Oren convinced me that we could keep the low level bytecodes as close to the YAML spec as possible (line oriented), and then have a few higher level codes as needed.

Anyway, this is a completely _different_ approach than what I was using. It has several advantages: (a) it is more debuggable, (b) it is probably quite a bit more compact, and (c) lengths and handles nicely become characters, delegating byte order handling to the same mechanism as used by the underlying character set.

Seeing the level of complications here, it was making me think that perhaps the previous proposal I had should just use a full 128 bit 'node' for every possible node (32 bit code, 32 bit length, 32 bit type family, and a 32 bit anchor). It would be much less space efficient... but probably much easier to understand and use from a variety of programs. In any case, I'm leaning more towards this approach now...

Best, Clark |
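The `is_varlen`/`is_paired` classification tables from point 1 above can be sketched in a few lines of C. The code sets used here follow the draft lists in that message and are assumptions, not a final assignment:

```c
#include <assert.h>
#include <string.h>

/* Per-bytecode classification tables, indexed directly by the ASCII
 * code character.  Variable-length codes are terminated by '\n';
 * paired codes open a branch that is closed by '.'. */
static unsigned char is_varlen[256];
static unsigned char is_paired[256];

static void yb_init_tables(void)
{
    const char *varlen = "V,#&!*";  /* value, continuation, comment,
                                       anchor, transfer, alias */
    const char *paired = "DMS";     /* document, mapping, sequence */
    memset(is_varlen, 0, sizeof is_varlen);
    memset(is_paired, 0, sizeof is_paired);
    for (const char *p = varlen; *p; ++p)
        is_varlen[(unsigned char)*p] = 1;
    for (const char *p = paired; *p; ++p)
        is_paired[(unsigned char)*p] = 1;
}
```

As the message says, after initialization `is_varlen['D']` is 0 while `is_varlen['V']` is 1, and `is_paired['D']` is 1 while `is_paired['|']` is 0.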
From: Clark C. E. <cc...@cl...> - 2003-09-09 19:23:40
|
subject: Alternative Bytecode Idea... (adopting Oren's approach)
summary: >
  This proposal uses a single ASCII character for each bytecode;
  each bytecode has one of three lengths: (a) atomic, (b) it has
  one 'argument' character which always follows, (c) it is variable
  length and ends with a unix linefeed '\n' character (or, see L)
codes:
  #
  # Content/Flow
  #
  'D': Document
  '%': Directive
  'M': Mapping
  'S': Sequence
  'V': Value
  '*': Alias
  '.': Branch End
  ',': Value Continuation
  'N': Normalized New Line
  'P': Pause the stream (...)
  #
  # Filler (non-content)
  #
  ' ': Padding
  '#': Comment
  #
  # Specifiers
  #
  '&': Anchor
  '!': Transfer method string
  #
  # Formatting
  #
  '>': Let subsequent V nodes be "flow" style
  '|': Let subsequent V nodes use "literal" style
  '"': Let subsequent V nodes use "double quoted" style
  "'": Let subsequent V nodes use 'single quoted' style
  '_': Let subsequent V nodes use 'plain' style
  '[': Let subsequent S nodes use the 'inline' style
  '{': Let subsequent M nodes use the 'inline' style
  '0': Subsequent blocks do not indent children
  '1': Subsequent blocks indent children 1 space
  '2': Subsequent blocks indent children 2 spaces
  '3': Subsequent blocks indent children 3 spaces
  '4': Subsequent blocks indent children 4 spaces
  '5': Subsequent blocks indent children 5 spaces
  '6': Subsequent blocks indent children 6 spaces
  '7': Subsequent blocks indent children 7 spaces
  '8': Subsequent blocks indent children 8 spaces
  '9': Subsequent blocks indent children 9 spaces
  '~': Unset the formatting flags
  #
  # Advanced
  #
  'E': Error message
  'W': Warning message
  'I': Informational message
  'R': Register a transfer method; immediately following this code
       is a character to 'index', followed by the transfer.
  'T': Normalized Transfer; immediately following this code is a
       character referencing the 'index' previously registered.
  ';': Branch Continuation
  'J': Jump, specifies the length to end of branch
  'L': Length, specifies length of a V or , or any other leaf
categories:
  atomic: [ ' ', '>', '|', '"', "'", '_', '[', '{', '~',
            '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
            'D', 'M', 'S', '.', 'N', 'P', ';' ]
  varlen: [ '%', 'V', '*', ',', '&', '!', '=' ]  # ends with '\n'
  index:  [ '=', 'J', 'V', 'L' ]  # followed by exactly one utf char
  indvar: [ 'E', 'W', 'I', 'R' ]  # index, followed by varlen
examples:
  - yaml: |
      ---
      - plain
      - >
        this is a flow scalar
      - >
        another flow scalar which is
          continued on a second line
          and indented 2 spaces
      - &001 !str |
        This is a block scalar,
        both typed and anchored
      - *001 # this was an alias
      - "This is a\n double quoted scalar"
    bytecode: |
      D0S_Vplain
      2>Vthis is a flow scalar
      Vanother flow scalar which is continued
      ,on a second line and indented 2 spaces
      &001
      !str
      |VThis is a block scalar, both typed
      N,and anchored
      *001
      # this was an alias
      "VThis is a
      N, double quoted scalar
todo:
  - show how buffer boundaries are handled
  - show how the advanced codes work
|
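The atomic/varlen categories above are enough to walk a bytecode stream instruction by instruction. Below is a minimal C sketch of such a walker; it uses the varlen set exactly as listed in the proposal and, for simplicity, ignores the extra index character consumed by 'J', 'L', and friends (the `yb_` name is hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the instructions in a bytecode stream: an atomic code is a
 * single character, while a variable-length code runs up to (and
 * including) the next '\n'.  Index arguments are not handled here. */
static size_t yb_count_instructions(const char *stream)
{
    const char *varlen = "%V*,&!=";  /* codes terminated by '\n' */
    size_t count = 0;
    for (const char *p = stream; *p; ) {
        if (strchr(varlen, *p)) {
            const char *nl = strchr(p, '\n');
            p = nl ? nl + 1 : p + strlen(p);  /* skip to after '\n' */
        } else {
            ++p;                              /* atomic code */
        }
        ++count;
    }
    return count;
}
```

For instance, the stream `"D0S_Vplain\n."` splits into six instructions: D, 0, S, _, the V scalar, and the closing '.'.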
From: why t. l. s. <yam...@wh...> - 2003-09-09 21:03:06
|
On Tuesday 09 September 2003 01:28 pm, Clark C. Evans wrote:
> > This proposal uses a single ASCII character for each bytecode.
>
> This is moving right along! Tremendous work!

HeiL! I can see so many advantages to the ASCII solution. One of these advantages was demonstrated in your last message: the ability to store YAML bytecode within a YAML document.

Absolutely, the bytecode will be a great way to ensure precision in the testing suite. And now we merely need to add a new bytecode representation for each test in the suite. We could have stored Clark's bytecode in base64, but this would slow development and I'm afraid the testing suite would end up falling behind. It's tough enough to maintain such tests without having to encode data.

In fact, I'm sure you could gain a lot by adding to the tests right away. The suite represents a large cross-section of cases.

_why |
From: Clark C. E. <cc...@cl...> - 2003-09-10 04:12:30
|
On Tue, Sep 09, 2003 at 03:02:43PM -0600, why the lucky stiff wrote:
| I can see so many advantages to the ASCII solution. One of these advantages
| was demonstrated in your last message: the ability to store YAML bytecode
| within a YAML document.

Another advantage is that once you introduce a noop bytecode you can even convert a unicode string containing a YAML document into a unicode string containing the bytecodes *without* moving any of the characters. Thus, the parser could just be a simple call....

    yaml_parse2bytecode( yaml_utf8 * buff, len );

And then, it would be a cake walk for a processor to rummage through the bytecodes.

| Absolutely, the bytecode will be a great way to ensure precision in the
| testing suite. And now we merely need to add a new bytecode representation
| for each test in the suite. We could have stored Clark's bytecode in base64,
| but this would slow development and I'm afraid the testing suite would end up
| falling behind. It's tough enough to maintain such tests without having to
| encode data.

Well, there is only one open issue on the table: for the Length bytecode L, do we store the length as an ASCII number or as a single unicode character (where the codepoint is the length)?

| In fact, I'm sure you could gain a lot by adding to the tests right away. The
| suite represents a large cross-section of cases.

Right. A good way to validate that we indeed have both a sufficient and necessary set of bytecodes.

Best, Clark |
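The "single unicode character" side of the open Length question can be sketched concretely: the length value is treated as a code point and written out with the standard UTF-8 encoding (RFC 3629), which caps it at 0x10FFFF as discussed earlier in the thread. The `yb_` functions are illustrative only:

```c
#include <assert.h>
#include <stddef.h>

/* Encode a length as one UTF-8 "character" whose code point IS the
 * length.  Returns bytes written, or 0 if v exceeds 0x10FFFF. */
static size_t yb_put_len(unsigned long v, unsigned char *out)
{
    if (v < 0x80)  { out[0] = (unsigned char)v; return 1; }
    if (v < 0x800) {
        out[0] = (unsigned char)(0xC0 | (v >> 6));
        out[1] = (unsigned char)(0x80 | (v & 0x3F));
        return 2;
    }
    if (v < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (v >> 12));
        out[1] = (unsigned char)(0x80 | ((v >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (v & 0x3F));
        return 3;
    }
    if (v <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (v >> 18));
        out[1] = (unsigned char)(0x80 | ((v >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((v >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (v & 0x3F));
        return 4;
    }
    return 0;
}

/* Decode the length back; returns bytes consumed.  Assumes the
 * input is well formed (no validation, this is just a sketch). */
static size_t yb_get_len(const unsigned char *in, unsigned long *v)
{
    if (in[0] < 0x80) { *v = in[0]; return 1; }
    if ((in[0] & 0xE0) == 0xC0) {
        *v = ((unsigned long)(in[0] & 0x1F) << 6) | (in[1] & 0x3F);
        return 2;
    }
    if ((in[0] & 0xF0) == 0xE0) {
        *v = ((unsigned long)(in[0] & 0x0F) << 12) |
             ((unsigned long)(in[1] & 0x3F) << 6) | (in[2] & 0x3F);
        return 3;
    }
    *v = ((unsigned long)(in[0] & 0x07) << 18) |
         ((unsigned long)(in[1] & 0x3F) << 12) |
         ((unsigned long)(in[2] & 0x3F) << 6) | (in[3] & 0x3F);
    return 4;
}
```

Note this reproduces the earlier "LA" example: a length of 65 encodes as the single byte 0x41, i.e. the letter 'A'.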
From: Clark C. E. <cc...@cl...> - 2003-09-10 04:13:11
|
subject: Revision #2 of YAML Bytecodes
summary: >
  This proposal defines a 'preparsed' format where YAML syntax is
  converted into a series of events, as bytecodes.  Each bytecode is
  either atomic (stands alone) or is variable length, ending in a line
  feed character '\n'.  It is thought that this preparsed format could
  be a functional equivalent to a parser API, in that each callback
  (or set of optional arguments on each callback) almost perfectly
  corresponds to each bytecode.  Since the bytecode format is actually
  in the same Unicode encoding as the YAML source, one could even
  imagine a parser "C" function which takes a buffer and simply
  rewrites the data in place (without expanding or contracting the
  buffer) from the YAML syntax to the equivalent bytecodes.
codes:
  #
  # Primary Bytecodes
  #
  # These bytecodes form the minimum needed to represent YAML
  # information from the serial model (ie, without format and comments)
  #
  'D':
    name: Document
    size: Atomic, paired with '.'
    desc: >
      Indicates that a document has begun; either it is the beginning
      of a YAML stream, or a --- has been found.  The bytecode '.' is
      used to signal the end of branch instructions, thus an empty
      document is expressed as "D."
  '$':
    name: Scalar
    size: Variable (uses \n to end)
    desc: >
      This indicates the start of a scalar value, which can be
      continued by the 'N' and ',' bytecodes.  This bytecode is used
      for sequence entries, keys, values, etc.
  ',':
    name: Scalar Continuation
    size: Variable (uses \n to end)
    desc: >
      Since a scalar may not fit within a buffer, and since it may not
      contain a \n character, it may have to be broken into several
      chunks.  This is an additional chunk.
  'n':
    name: Normalized New Line
    size: Atomic
    desc: >
      Since scalar values may not contain a newline (\n), this
      bytecode is used in its place.  Thus, the bytecodes for
      "Hello\nWorld" would be "$Hello\nn,World\n".
  'S':
    name: Sequence
    size: Atomic, paired with '.'
    desc: >
      Indicates the start of a sequence; children are provided
      following, till a '.' bytecode is encountered.  So, the
      bytecodes for "[ one, two ]" would be "S$one\n$two\n."
  'M':
    name: Mapping
    size: Atomic, paired with '.'
    desc: >
      Indicates the start of a mapping; children of the mapping are
      provided as a series of K1,V1,K2,V2 pairs as they are found in
      the input stream.  For example, the bytecodes for
      "{ a: b, c: d }" would be "M$a\n$b\n$c\n$d\n."
  '.':
    name: Close Branch
    size: Atomic
    desc: >
      This closes the outermost branch (Document, Mapping, Sequence);
      see earlier examples.
  '%':
    name: Directive
    size: Variable
    desc: >
      Indicates that a directive was encountered; note that the only
      non-error directive is "YAML:1.0"
  '&':
    name: Anchor
    size: Variable
    desc: >
      This bytecode associates an anchor with the very next content
      node; see the '*' alias bytecode.
  '*':
    name: Alias
    size: Variable
    desc: >
      This is used whenever there is an alias node; for example,
      "[ &X one, *X ]" would be normalized to "S&X\n$one\n*X\n." --
      in this example, the anchor bytecode applies to the very next
      content bytecode.
  '!':
    name: Raw Transfer
    size: Variable
    desc: >
      This is the raw transfer string as provided by the incoming YAML
      stream.  Note that validation is not provided in this case --
      although 'R' and 'T' are better bytecodes when possible.
  'P':
    name: Pause
    size: Atomic
    desc: >
      This is the instruction when a document is terminated, but
      another document has not yet begun.  Thus, it is optional, and
      typically used to pause parsing...
  #
  # Advanced bytecodes (not absolutely needed, but very nice)
  #
  '~':
    name: Noop
    size: Atomic
    desc: >
      This bytecode does nothing.  In some cases (especially with
      4 space indentation), it is possible to rewrite an incoming YAML
      stream as a bytecode stream without moving strings around.  For
      example, assume that '@' represents a newline (for illustration
      only); the first string, in YAML syntax, can be rewritten in
      place as the second string, in YAML bytecodes, without moving
      the strings.

      "- plain@- >@ This is@ folded@- |@ a block@ scalar@"
      "S$plain@~~~~~$This is@, folded@~~~~~$a block@n$scalar@n."
  ';':
    name: literal scalar continuation
    size: Variable (uses \n to end)
    desc: >
      This is simply a shorthand for the "N," bytecode sequence, only
      taking one character instead of two.  It is needed to support
      in-place bytecode conversion of literal scalars which only use
      one character of indenting; "--- |@ literal@ scalar" could be
      converted in-place to "D~~~~~$literal@;scalar@n".  I can't think
      of any other use case for this bytecode.
  'R':
    name: Register Normalized Transfer
    size: Variable
    desc: >
      The variable length payload has two parts: the first is a
      unicode string which is used as a key to identify the transfer,
      then the '=' sign, followed by the transfer as expressed as a
      full URI.  For example: "Rs=taguri:yaml.org,2002:str\n"
  'T':
    name: Normalized Transfer
    size: Variable
    desc: >
      This is an alternative to the "!" bytecode, but using the
      abbreviation given in 'R' above.  Note that the unicode string
      could be just one character, giving 0x10FFFF possibilities
      before one even gets to two characters.  "Ts\n"
  'L':
    name: Length
    size: Variable
    desc: >
      This bytecode is purely optional, and gives the span of the very
      next bytecode (in 8 bit words for UTF8 and 16 bit words for
      UTF16).  When used in front of a mapping or sequence, it gives
      the length of the branch so one may skip all of the children.
      While one could specify the length as a single unicode
      character, Brian asserted that it would be better just to use
      the ASCII version.

      "--- |\n literal\n scalar.\n"
      "DL22\n$literal\nr;scalar.\nn"

      In the above example, one would have to add 22 to the address of
      the $ character to get to the next bytecode beyond the scalar's
      scope.
  '?':
    name: Notice
    size: Variable
    desc: >
      This is a packed string; it has an error level (W - Warning,
      E - Error, F - Fatal, T - Trace, I - Info), followed by an error
      number, a line number, and then a textual message, for example:
      "?W394,#22,Missing Such and such on line 22\n"
  #
  # The following bytecodes are purely at the syntax level and
  # useful for pretty printers and emitters
  #
  '#':
    name: Comment
    size: Variable
    desc: >
      This is a single line comment.  It is terminated like all of the
      other variable length items, with a '\n'.
  '>':
    name: Folded Style
    size: Atom
    desc: >
      Subsequent scalar nodes '$' use the folded style unless another
      style instruction changes the 'paintbrush'.  Of course, if a
      scalar cannot be emitted using flow (it has items which require
      double quote escaping), then this is a warning (or error)
      condition in the emitter or painter, etc.
  '|':
    name: Literal Style
    size: Atom
  '"':
    name: Double Quoted Style
    size: Atom
  "'":
    name: Single Quoted Style
    size: Atom
  'F':
    name: Flow Collections
    desc: Use flow/inline style for subsequent mappings and sequences
  'B':
    name: Block Collections
    desc: Use block/indented style for subsequent mappings and sequences
  'I':
    name: Indent
    size: Variable
    desc: >
      Specifies the number of additional spaces to indent for various
      block styles, ie "I4\n" specifies a 4 char indent.
  '?':
    name: Autoselect Format
    size: Atom
    desc: Unset the formatting flags
examples:
  - yaml: |
      ---
      - plain
      - >
        this is a flow scalar
      - >
        another flow scalar which is continued
        on a second line and indented 2 spaces
      - &001 !str |
        This is a block scalar, both typed
        and anchored
      - *001  # this was an alias
      - "This is a \"double quoted\" scalar"
    bytecode: |
      D0S_$plain
      I2
      >$this is a flow scalar
      $another flow scalar which is continued
      ,on a second line and indented 2 spaces
      &001
      !str
      |$This is a block scalar, both typed
      N,and anchored
      *001
      # this was an alias
      "$This is a "double quoted" scalar
      ..
|
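To make the proposal concrete, here is a minimal reader for the primary bytecodes, sketched in Python. Only the codes from the table above are assumed; the function names and the event-list representation are my own, and the style atoms ('>', '|', etc.) are deliberately omitted for brevity.

```python
# Sketch of a reader for the primary YAML bytecodes described above.
# Assumption: only primary/advanced atomic codes listed here; all other
# codes are treated as variable length, running up to the next '\n'.

def read_bytecodes(stream):
    """Turn a bytecode string into a list of (code, argument) events."""
    events = []
    i = 0
    while i < len(stream):
        code = stream[i]
        if code in 'DSM.nP~':                   # atomic bytecodes
            events.append((code, None))
            i += 1
        else:                                    # variable length
            end = stream.index('\n', i + 1)
            events.append((code, stream[i + 1:end]))
            i = end + 1
    return events

def load_scalars(events):
    """Rebuild whole scalars: '$' starts one, ',' continues it, and
    the atomic 'n' stands in for an embedded newline."""
    out = []
    for code, arg in events:
        if code == '$':
            out.append(arg)
        elif code == ',':
            out[-1] += arg
        elif code == 'n':
            out[-1] += '\n'
    return out
```

With this sketch, "S$one\n$two\n." yields a sequence event, two scalar events, and a close, and "$Hello\nn,World\n" rebuilds to "Hello\nWorld", matching the examples in the 'S' and 'n' entries.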
From: Oren Ben-K. <or...@be...> - 2003-09-10 17:12:52
|
I like the idea of printable compact byte code for YAML. I'm focusing on the parser generator, so I didn't put a lot of thought into the exact set of byte codes. Also, there are some subtle differences between the byte codes as I see them and what Clark presented.

My motivation was the return format from a "pull" parser. I settled on a "get next" method that returns a pair - a code and an associated set of characters. This has an obvious mapping to byte codes. The exact set of codes depends on the level of abstraction. We have several information model levels. My intention is that the parser will return codes that fully describe the syntax model - every space, comment, line break and escape character. Clark has started from the other end (or maybe from the middle), and his original codes were originally meant to describe the tree (or maybe even the graph) model. Hence there are incompatibilities.

I think this can be resolved by having a common set of codes with additional level-specific codes. For example, the code (think of it as a token) describing the white space between the key's ':' and the '!' of the transfer method of the value belongs only to the syntax level.

There are also issues of processing; the parser returns the transfer method with an indication that a prefix is used (at the syntax model level), while at a higher level it would be expected that the transfer method would be the complete URI. I suspect that two separate codes are needed - one for the URI and another for the exact form it takes in the syntax (using shorthands, prefixes, etc.).

On the other hand, some codes (most importantly, the simple scalar value code/token) are common for multiple levels. Therefore, defining the exact set of codes will require careful design work. I'd much rather delay this work to the point where I have a working parser generator so that the design can be tested in practice, rather than working on the design now in the abstract. 
A second difference is that, in my scheme, every token/code is exactly one line. The format is _always_ one character specifying the token type, then the characters associated with it, then a '\n'. This holds even when a specific token type never has any associated data. Also, the format does not specify the amount of associated characters in advance; one has to scan for the '\n'.

That said, as Clark pointed out, it is possible to define a "length" code that would precede a token and provide advance warning about the amount of data that it carries, in case this is important for efficiency. Personally I'd rather encode the length as a normal ASCII decimal number (e.g., "L153\n" to signal that the following token line is 155 characters long - 1 for the code, one for the final \n, and 153 characters of data). As you can see, I'm not very worried about compactness; I'm more concerned about being able to grok a byte codes stream for debugging.

At any rate, given the above, a document like this:

--- !!bloop
foo : >
  bar
    baz

# comment
...

Going through my parser might end up looking something like the following. I'm assuming the '#' code means a no-op code whose characters are ignored (that is, a comment token - not to be confused with a token representing a comment in the YAML document itself!).

# Document header.
h---
# Begin collection node (I'm following the productions here).
c
# White space (. <=> ' ').
w.
# Transfer method.
t!bloop
# Line break (nl = LF; nc = CR; nP = PS; nL = LS; nN = NEL).
nl
# Start of mapping value.
m
# Key scalar node.
s
# Value.
vfoo
# End of scalar key.
e
# White space again.
w.
# Indicator separating key and value.
i:
# More spaces
w.
# Value scalar node.
s
# Style (folded).
S>
# Line break.
nl
# Indentation.
i..
# More of the value.
v baz
# Line break.
nl
# Indentation.
i..
# Piece of the value.
vbar
# End of scalar value.
e
# Line breaks - note outside node since is chomped.
nl
# Line breaks - note outside node since is chomped.
nl
# Indicator (for comment).
i#
# The comment text (I'm trying to stick to letter codes).
C comment
# Line breaks.
nl
# End of mapping value (document node).
e

Without the comments it looks like this:

h---
c
w.
t!bloop
nl
m
i
s
vfoo
e
w.
i:
w.
s
S>
nl
i..
v baz
nl
i..
vbar
e
nl
nl
i#
C comment
nl
e

There are many subtleties (exactly what are the node boundaries with regard to white space and empty lines; I'm using 'e' as a catch-all end directive, maybe we want specific end directives; what to do with zero indentation; and so on).

Note the above is geared towards the output from a tokenizing parser; it is trivial to construct an exact copy of the original YAML file from the byte codes (in other words, the above codes are designed to express the syntax model). It is easy to "cull" the codes and only leave the "meat" of the document:

c
t!bloop
m
s
vfoo
e
s
v baz
nl
vbar
e
e

This is rather compact. If anyone is worried about compactness beyond this level, zipping this would do wonders.

Note that there are very few traces of the specific syntax used, and with minimal work (and no changes to the code) such traces can be completely eliminated (e.g. normalizing newlines, expanding escape sequences, etc.). This demonstrates the notion of having a common set of tokens that are shared between the different model levels. However, some things change between levels - e.g., expanding transfer methods to URIs probably requires using a different code ('T' instead of 't'?).

Also note it is inherently impossible to represent a multi-line scalar as a single 'v' token in the "higher" model level, because \n always terminates the token, and in the "high" levels of abstraction escape sequences shouldn't be used (or should they?). This isn't a big deal, I think.

OK, enough rambling... My point is that this is a promising direction, but there are many details to work out - and I'm going to table the whole notion until *after* I get my Pull-Parser-Generator to work. 
Have fun, Oren Ben-Kiki |
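Oren's one-token-per-line format above is easy to work with mechanically. Here is a rough Python sketch; the token letters follow his examples, and `cull` is only a naive, stateless approximation of the culling he shows (as he notes himself, a real cull needs context, e.g. line breaks inside vs. outside scalar nodes).

```python
# Sketch of Oren's line-oriented byte codes: each physical line is one
# token; the first character is the code, the rest is the token's data.

def tokens(stream):
    """Split a byte-code stream into (code, chars) pairs."""
    return [(line[0], line[1:]) for line in stream.split('\n') if line]

# Codes that only describe presentation in his examples: header,
# whitespace, indicators/indentation, styles, and comment text.
# This set is my reading of his examples, not a fixed specification.
PRESENTATION = set('hwiSC#')

def cull(pairs):
    """Drop presentation-only tokens, keeping the 'meat'.  Stateless,
    so unlike Oren's hand-culled example it keeps all 'n' line-break
    tokens, even ones outside scalar nodes."""
    return [(c, s) for (c, s) in pairs if c not in PRESENTATION]
```

For instance, feeding the culled stream from his message through `tokens` gives pairs like `('t', '!bloop')` and `('v', 'foo')`, which is the shape a "get next" pull API would return.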
From: Clark C. E. <cc...@cl...> - 2003-09-11 02:26:07
|
On Wed, Sep 10, 2003 at 07:11:03PM +0200, Oren Ben-Kiki wrote:
| I like the idea of printable compact byte code for YAML. I'm focusing on
| the parser generator so I didn't put a lot of thought into the exact set
| of byte codes. Also, there are some subtle differences between the byte
| codes as I see them and what Clark presented.

I was being 'elaborate' so as to explore the issues involved. I really like -- as _why suggested -- the idea of converting the examples in the test suite to bytecodes. This would definitely allow us to think things through.

| My motivation was the return format from a "pull" parser. I settled on a
| "get next" method that returns a pair - a code and an associated set of
| characters. This has an obvious mapping to byte codes.

Although it will not be direct, it will be easy to convert the output of the Syck parser into bytecodes; but it is more at the 'serial model'.

| Clark has started from the other end (or maybe from the middle), and his
| original codes were originally meant to describe the tree (or maybe even
| the graph) model. Hence there are incompatibilities.

I actually *started* from the graph model and worked backwards, as it would be nice to be able to (perhaps with some mind warping) use the bytecodes in a read-only random access sort of way.

| I think this can be resolved by having a common set of codes with
| additional level-specific codes. For example, the code (think of it as a
| token) describing the white space between the key's ':' and the '!' of
| the transfer method of the value belongs only to the syntax level.

Exactly. I like the idea of even using the bytecode's case to be the clearest form of distinction.

| There are also issues of processing; the parser returns the transfer
| method with an indication that a prefix is used (at the syntax model
| level), while at a higher level it would be expected that the transfer
| method would be the complete URI. 
| I suspect that there should be two separate codes needed - one for the
| URI and another for the exact form it takes in the syntax (using
| shorthands, prefixes, etc.).

I really like your example below where 't' is the bytecode for the transfer as it appears in the syntax model, while 'T' is the normalized version appearing in the serial model.

| On the other hand, some codes (most importantly, the simple scalar value
| code/token) are common for multiple levels. Therefore, defining the
| exact set of codes will require careful design work. I'd much rather
| delay this work to the point where I have a working parser generator so
| that the design can be tested in practice, rather than working on the
| design now in the abstract.

*nods* Hence me starting from the graph model and working backwards.

| A second difference is that, in my scheme, every token/code is exactly
| one line. The format is _always_ one character specifying the token
| type, then the characters associated with it, then a '\n'. It doesn't
| matter whether the specific token type always has no associated data.
| Also the format does not specify the amount of associated characters in
| advance; one has to scan for the '\n'.

While working out the bytecode idea, what occurred to me is that it could be possible to generate an "in-place" parser which took an input buffer and tokenized it. I'd like to know what you think of this approach. Specifically:

- You end up ignoring a lot of the stuff in the syntax model so that you have room for the bytecodes. Most of the bytecodes end up going in the 'indent' which is tossed.

- It requires a few extra instructions which do not appear in any model, and it requires redundancy (ie, instructions A and B may have to be both separate, plus a third instruction X which represents A followed by B) for the necessary compactness. 
- Typefamily handling is somewhat interesting

In short, it requires a bit more complexity, but the advantage is that the parser could be:

    parse(buffer*, bufflen);

Which, IMHO, is kinda neat. However, it could be _too_ cute.

... Hmm. It seems that your primary goal is to provide a tool which better helps visualize the parser output. My original goal with the bytecodes was two fold:

1. To enable parser / loader / serializer / painter / emitter interoperability by providing an extendable push/pull API as data rather than as function calls.

2. To provide an intermediate 'pre-parsed' language which would be easier to operate on programmatically than the syntax directly. In this case, you really want to use binary numbers -- although not pointers, because it would be nice to move this structure between processes.

I never had the goal of visualization, as I think this is what the current YAML syntax already provides. Also, I never had 'binary', ie compactness, as a goal. Unless your data is primarily big numbers, the idea that a binary format is compact is flawed; you often store one or two digit numbers in 32 bits, and frequently space is wasted on byte alignment, etc. What a binary format gives you is data that is 'ready to use'.

| That said, as Clark pointed out it is possible to define a "length" code
| that would precede a token and provide an advance warning about
| the amount of data that it carries, in case this is important for
| efficiency. Personally I'd rather encode the length as a normal ASCII
| decimal number (e.g., "L153\n" to signal that the following token line
| is 155 characters long - 1 for the code, one for the final \n, and 153
| characters of data). As you can see I'm not very worried about
| compactness; I'm more concerned about being able to grok a byte codes
| stream for debugging.

Hmm. A string cannot be stored directly as a "blob", because \n must be escaped by ending the chunk, specifying a new line bytecode, and then starting the next chunk. Ick. 
That's hardly useful for an intermediate format.

Thank you _so_ much for humoring me and writing the remainder of this message... it was very useful. Although I think that it has me thinking my original path was better; at least for the goals I had... programmability.

Although, I do like _why's feedback about providing the bytecode representation directly in the test suite. Would hex be too awful? Imagine the 'test' buffer size were, say, 40 characters; then the file below would be 'compared' to something like:

yaml: |
  --- !!bloop
  foo : >
    bar
      baz

  # comment
code: |
  # output from bmore (http://bvi.sf.net)
  00000000  2D 2D 2D 20 21 21 62 6C 6F 6F 70 0A  --- !!bloop.
  0000000C  66 6F 6F 20 3A 20 3E 0A 20 20 62 61  foo : >.  ba
  00000018  72 0A 20 20 20 20 62 61 7A 0A 0A 23  r.    baz..#
  00000024  20 63 6F 6D 6D 65 6E 74 0A            comment.

Where more of the stuff on the right would be periods (.) representing unprintable characters (though not every printable character on the right would be an actual content character, since some characters would appear, say, inside a number). Thus, one could easily 'test' to see if the right binary stream is being generated. The primary issue is that the code would have to have a specific byte order...

Thoughts?

Best, Clark

| At any rate, given the above, a document like this:
|
| --- !!bloop
| foo : >
|   bar
|     baz
|
| # comment
| ...
|
| Going through my parser might end up looking something like the
| following. I'm assuming the '#' code means a no-op code whose characters
| are ignored (that is, a comment token - not to be confused with a token
| representing a comment in the YAML document itself!).
|
| # Document header.
| h---
| # Begin collection node (I'm following the productions here).
| c
| # White space (. <=> ' ').
| w.
| # Transfer method.
| t!bloop
| # Line break (nl = LF; nc = CR; nP = PS; nL = LS; nN = NEL).
| nl
| # Start of mapping value.
| m
| # Key scalar node.
| s
| # Value.
| vfoo
| # End of scalar key.
| e
| # White space again.
| w. 
| # Indicator separating key and value.
| i:
| # More spaces
| w.
| # Value scalar node.
| s
| # Style (folded).
| S>
| # Line break.
| nl
| # Indentation.
| i..
| # More of the value.
| v baz
| # Line break.
| nl
| # Indentation.
| i..
| # Piece of the value.
| vbar
| # End of scalar value.
| e
| # Line breaks - note outside node since is chomped.
| nl
| # Line breaks - note outside node since is chomped.
| nl
| # Indicator (for comment).
| i#
| # The comment text (I'm trying to stick to letter codes).
| C comment
| # Line breaks.
| nl
| # End of mapping value (document node).
| e
|
| Without the comments it looks like this:
|
| h---
| c
| w.
| t!bloop
| nl
| m
| i
| s
| vfoo
| e
| w.
| i:
| w.
| s
| S>
| nl
| i..
| v baz
| nl
| i..
| vbar
| e
| nl
| nl
| i#
| C comment
| nl
| e
|
| There are many subtleties (exactly what are the node boundaries with
| regard to white space and empty lines; I'm using 'e' as a catch-all end
| directive, maybe we want specific end directives; what to do with zero
| indentation; and so on).
|
| Note the above is geared towards an output from a tokenizing parser; it
| is trivial to construct an exact copy of the original YAML file from the
| byte codes (in other words, the above codes are designed to express the
| syntax model). It is easy to "cull" the codes and only leave the "meat"
| of the document:
|
| c
| t!bloop
| m
| s
| vfoo
| e
| s
| v baz
| nl
| vbar
| e
| e
|
| This is rather compact. If anyone is worried about compactness beyond
| this level, zipping this would do wonders.
|
| Note that there are very few traces of the specific syntax used, and
| with minimal work (and no changes to the code) such traces can be
| completely eliminated (e.g. normalizing newlines, expanding escape
| sequences etc.). This demonstrates the notion of having a common set of
| tokens that are shared between the different model levels. 
| However some things change between levels - e.g., expanding transfer
| methods to URIs probably requires using a different code ('T' instead
| of 't'?).
|
| Also note it is inherently impossible to represent a multi-line scalar
| as a single 'v' token in the "higher" model level because \n always
| terminates the token, and in the "high" levels of abstraction escape
| sequences shouldn't be used (or should they?). This isn't a big deal, I
| think.
|
| OK, enough rambling... My point is that this is a promising direction,
| but there are many details to work out - and I'm going to table the
| whole notion until *after* I get my Pull-Parser-Generator to work.
|
| Have fun,
|
| Oren Ben-Kiki
|
| -------------------------------------------------------
| This sf.net email is sponsored by: ThinkGeek
| Welcome to geek heaven.
| http://thinkgeek.com/sf
| _______________________________________________
| Yaml-core mailing list
| Yam...@li...
| https://lists.sourceforge.net/lists/listinfo/yaml-core
|
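The length-code arithmetic quoted above ("L153\n" announcing a 155-character token line) is easy to check mechanically. Here is an illustrative Python sketch; the function names are mine, and only Oren's proposed ASCII-decimal form is shown, not Clark's single-codepoint alternative.

```python
# Sketch of Oren's optional 'L' length code: an ASCII decimal count of
# the DATA characters in the very next token line.  The token line
# itself is 1 (code) + n (data) + 1 (final '\n') characters long.

def with_length(code, data):
    """Serialize one token line, prefixed with its length code."""
    return "L%d\n%c%s\n" % (len(data), code, data)

def read_with_length(stream):
    """Consume an L-prefixed token without scanning the data for '\n'."""
    assert stream[0] == 'L'
    nl = stream.index('\n')          # only the L line itself is scanned
    n = int(stream[1:nl])
    start = nl + 1
    line = stream[start:start + n + 2]   # code + n data chars + '\n'
    return line[0], line[1:-1]
```

The point of the sketch: a consumer reading the 'L' line can slice out the following token directly, without searching the (possibly large) data for the terminating '\n'.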
From: Oren Ben-K. <or...@be...> - 2003-09-11 06:06:09
|
> | My motivation was the return format from a "pull" parser...
> | Clark has started from the other end...
>
> I actually *started* from the graph model and worked
> backwards,

Right. Different goals, subtle differences in result. I think we can work it out, though.

> | I think this can be resolved by having a common set of codes with
> | additional level-specific codes...
>
> Exactly. I like the idea of even using the bytecode's case
> to be the clearest form of distinction.

Well, there are 3 different models and only two letter cases :-) We'll just have to wait and see.

> While working out the bytecode idea, what occurred to me is
> that it could be possible to generate an "in-place" parser
> which took an input buffer and tokenized it. I'd like to
> know what you think of this approach? Specifically:

I don't like it. First, I want to be able to report the syntax model, for tools such as pretty-printers and editors. Using your approach, this is inherently impossible - there's just no place to add the codes for the syntax tokens. Certainly we could have two parsers, one for the syntax and one for the higher levels, but I think this defeats the purpose.

Second, it doesn't work anyway. There are cases where you want to add a token and there are no "useless" syntax characters you can overwrite with the token code. For example, a YAML document without a header:

foo: bar

There's no place you can insert the start-of-document, start-of-mapping, and start-of-key-node indicators.

> - Typefamily handling is somewhat interesting

This is another example where what you propose is impossible; if I use a prefix, and you want to report the full type family, there's just no way you can make it fit:

--- !some-very-long-prefix^
foo: !^some-suffix bar
...

> In short, it requires a bit more complexity, but the
> advantage is that the parser could be:
>
> parse(buffer*, bufflen);
>
> Which, IMHO, is kinda neat. However, it could be _too_ cute.

Much too cute :-)

> Hmm. 
> It seems that your primary goal is to provide a tool
> which better helps visualize the parser output.
>
> My original goal with the bytecodes was two fold:
>
> 1. To enable parser / loader / serializer / painter / emitter
>    interoperability by providing an extendable push/pull API as
>    data rather than as function calls.

Nice notion, I'm all for it. Though you'd still want an API. You want to be able to invoke each module as functions rather than by reading from a pipe/socket/etc. "get_next_token" still makes perfect sense.

> 2. To provide an intermediate 'pre-parsed' language which
>    would be easier to operate on programmatically than
>    the syntax directly.

Again, good goal.

> In this case, you really want
> to use binary numbers -- although not pointers, because
> it would be nice to move this structure between processes.

I don't see why this follows. Programmatically, you work on token structs. Between processes you send the trivial encoding of these tokens into text. These are two different representations; converting between them is as trivial as it gets.

> I never had the goal of visualization, as I think this is
> what the current YAML syntax already provides.

Not if one wants to write a test suite for the parser itself...

> | That said, as Clark pointed out it is possible to define a "length"
> | code that would precede a token and provide an advance
> | warning about
> | the amount of data that it carries, in case this is important for
> | efficiency.

Or we could similarly pre-specify the length required for the complete content of a scalar token, or the length of a sequence - using different codes of course. Just another of these details we'll need to work on :-)

> | Personally I'd rather encode the length as a
> | normal ASCII
> | decimal number (e.g., "L153\n" ...
>
> Hmm. A string cannot be stored directly as a "blob", because \n
> must be escaped by ending the chunk, specifying a new line bytecode,
> and then starting the next chunk. Ick. 
> That's hardly useful for an intermediate format.

It is possible to get around this problem, though, using your notion of a '.' code:

vsome multi-line
.scalar value

Note that this is only one token, whose code is 'v' and whose value is "some multi-line\nscalar value". It is just that its representation in the byte-codes file is two physical lines. This allows the higher-level modules to report the value of a scalar in a single token, which is what you are after.

Don't mix up the byte codes format as written to disk as a file with the return value from "get_next_token". I don't think it is very useful to have the return value from some processing module be a long chunk of concatenated byte codes in memory. The only place that such a format will be used would be when reading a byte codes file from the disk - and it would be immediately converted to "get_next_token" calls.

> ... I think that it has
> me thinking my original path was better; at least for the goals I
> had... programmability.

Programmability is done with API calls. It is (too) cute to consider "in-place" tokenization to byte codes; but what happens later when a layer of processing wants to remove tokens or insert additional tokens? This is trivial using "get_next_token" but next to impossible using an internal buffer holding a sequence of "byte codes".

> Although, I do like _why's feedback about providing the bytecode
> representation directly in the test suite.

Yes, that was my main use case for the codes (initially).

> Would hex be too awful?

Yes :-) I want to be able to meaningfully diff the expected and actual result files and stand a fighting chance of understanding the result - e.g., a token being inserted in the wrong place. Being able to write an input/output file by hand is also useful when testing/debugging. Making the byte-codes be simple editable files is very useful there.

> Thoughts?

To recap:

> | OK, enough rambling... 
> | My point is that this is a promising direction,
> | but there are many details to work out - and I'm going to table the
> | whole notion until *after* I get my Pull-Parser-Generator to work.

Have fun, Oren Ben-Kiki |
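Oren's '.' continuation idea can be sketched as a small reader; this is a hypothetical illustration, and note that his '.' continuation marker is unrelated to (and would collide with) Clark's '.' Close Branch bytecode if the two schemes were ever mixed.

```python
# Sketch of the '.' continuation code: a physical line beginning with
# '.' extends the previous token's value with an embedded newline, so
# one logical token can span several physical lines of the file.

def tokens_with_continuations(stream):
    """Parse one-token-per-line byte codes, joining '.' continuations."""
    out = []
    for line in stream.split('\n'):
        if not line:
            continue
        if line[0] == '.':
            code, value = out.pop()
            out.append((code, value + '\n' + line[1:]))
        else:
            out.append((line[0], line[1:]))
    return out
```

This is what lets a higher-level module report a multi-line scalar as a single 'v' token while the on-disk format stays strictly line oriented.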
From: Brian I. <in...@tt...> - 2003-09-11 07:53:14
|
On 10/09/03 19:11 +0200, Oren Ben-Kiki wrote:
> OK, enough rambling... My point is that this is a promising direction,
> but there are many details to work out - and I'm going to table the
> whole notion until *after* I get my Pull-Parser-Generator to work.

I actually have a language agnostic parser/loader implementation (in pure Perl right now) passing tests. I was using the old Parser/Emitter API. I'd like to start using the bytecode method. Perhaps I'll just use Clark's layout for now. I'll bring up issues as they come about.

Cheers, Brian |
From: Clark C. E. <cc...@cl...> - 2003-09-21 08:06:53
|
Howdy Why!

Anyway, it only took me about 4h to write a very first pass of YAML -> YAML Bytecodes via Syck. It certainly is one _hack_ of a job (complete with in-line main and a single hard coded test). However, it does work... nite!

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/yaml4r/syck/ext/yamlbyte/

--- # YAML
test: 1
and: "with new\nline\n"
also: &3 three
more: *3
...

M
Ttaguri:yaml.org,2002:str
Stest
Ttaguri:yaml.org,2002:int
S1
Ttaguri:yaml.org,2002:str
Sand
Tstr
Swith new
N
Cline
N
Ttaguri:yaml.org,2002:str
Salso
A3
Ttaguri:yaml.org,2002:str
Sthree
Ttaguri:yaml.org,2002:str
Smore
R3
E

On Mon, Sep 08, 2003 at 06:08:44PM -0600, why the lucky stiff wrote:
| On Sunday 07 September 2003 06:13 pm, Clark C. Evans wrote:
| > Been thinking about a YAML bytecode specification ...
|
| heck of a good idea.
|
| i've always liked that Python's pickle has both binary and human-readable
| representations. i wonder if pickle could be leveraged? perhaps your vision
| of it is greater than i see on my simple scan of the header file...
|
| part of me thinks you invented this idea because you're a COMPLETE control
| freak. and the only way to ensure proper parsing/loading/emitting is to
| verify exact bytecode at a given stage of the process. :D
|
| well, good work. i'm going to look through the header file in detail when i
| get more time.
|
| _why
|
From: Clark C. E. <cc...@cl...> - 2003-09-22 02:30:05
|
whole scalar:
  summary: >
    The most painful aspect of converting the Syck interface to
    generate YAML bytecodes was that each scalar is reported as a
    whole, with the possibility of embedded '\n' and '\z' characters.
    If the consumer of the bytecodes needs the entire 'whole' scalar,
    then it seems especially dumb (not to mention burdensome) to
    break up each scalar into tiny bytecodes so that the consumer can
    rebuild it...
  proposal: >
    I propose adding a YAMLBYTE_WHOLESCALAR = '<' which can contain
    '\n' and '\z' items (and thus requires a length or an end pointer
    to specify how big it is).

    When serialized as text, this bytecode would have the following
    structure: '<<' word '\n' raw '\n' here '\n', where 'raw' is the
    raw scalar value, and 'here' is a printable ASCII string which
    marks the end of the scalar when encountered on a line all by
    itself.

      D
      Q
      Ssingle line scalar
      <<.
      This whole multi-line scalar has
      two new lines in its content and is
      not terminated with a new line.
      .
      Ssecond single line scalar
      E

  concerns:
    - >
      since this bytecode can have embedded new lines (\n) it could
      mess with readability... however, here (<<) documents are very
      well known, and in this context fit the bill. Note: this does
      not support the <<- variant... if you are after readability,
      use YAML!
    - >
      since this bytecode can have embedded nulls (\z) it could cause
      problems with programs that use \z as a signal that the stream
      has ended...
|
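[Editor's note: to make the here-document form concrete, here is a hypothetical sketch of serializing and recovering a whole scalar in the proposed '<<' textual form. The function names, and the default delimiter word, are invented for illustration; nothing here is normative.]

```python
# Sketch of the proposed '<<' whole-scalar bytecode in textual form:
#   '<<' word '\n' raw '\n' word '\n'
# The delimiter word must not occur in the raw value on a line by itself.

def encode_whole_scalar(raw: str, word: str = "EOS") -> str:
    assert word not in raw.split("\n"), "delimiter must not appear alone"
    return "<<" + word + "\n" + raw + "\n" + word + "\n"

def decode_whole_scalar(text: str) -> str:
    """Recover the raw value from a serialized whole scalar."""
    first, rest = text.split("\n", 1)
    assert first.startswith("<<")
    word = first[2:]
    raw, _terminator = rest.rsplit("\n" + word + "\n", 1)
    return raw

value = "line one\n\nline three, no trailing newline"
assert decode_whole_scalar(encode_whole_scalar(value)) == value
```

Note how the round trip preserves a value with embedded newlines and no trailing newline, which is exactly the case that S/N/C chunking makes burdensome.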
From: Brian I. <in...@tt...> - 2003-09-22 05:47:37
|
On 22/09/03 02:33 +0000, Clark C. Evans wrote:
> whole scalar:
>   summary: >
>     The most painful aspect of converting the Syck interface
>     to generate YAML bytecodes was that each scalar is reported
>     as a whole with the possibility of embedded '\n' and '\z'
>     characters. If the consumer of the bytecodes needs the
>     entire 'whole' scalar, then it seems especially dumb
>     (not to mention burdensome) to break up each scalar into
>     tiny bytecodes so that the consumer can rebuild it...
>   proposal: >
>     I propose adding a YAMLBYTE_WHOLESCALAR = '<' which
>     can contain '\n' and '\z' items (and thus requires
>     a length or an end pointer to specify how big it is).
>
>     When serialized as text, this bytecode would have the
>     following structure: '<<' word '\n' raw '\n' here '\n',
>     where 'raw' is the raw scalar value, and 'here' is a
>     printable ASCII string which marks the end of the scalar
>     when encountered on a line all by itself.
>
>       D
>       Q
>       Ssingle line scalar
>       <<.
>       This whole multi-line scalar has
>       two new lines in its content and is
>       not terminated with a new line.
>       .
>       Ssecond single line scalar
>       E
>
>   concerns:
>     - >
>       since this bytecode can have embedded new lines (\n)
>       it could mess with readability... however, here (<<)
>       documents are very well known, and in this context
>       fit the bill. Note: this does not support the <<-
>       variant... if you are after readability, use YAML!
>     - >
>       since this bytecode can have embedded nulls (\z) it
>       could cause problems with programs that use \z as
>       a signal that the stream has ended...

Your YAML, better written:

whole scalar:
  summary:
    The most painful aspect of converting the Syck interface to
    generate YAML bytecodes was that each scalar is reported as a
    whole with the possibility of embedded '\n' and '\z' characters.
    If the consumer of the bytecodes needs the entire 'whole' scalar,
    then it seems especially dumb (not to mention burdensome) to
    break up each scalar into tiny bytecodes so that the consumer
    can rebuild it...
  proposal: >
    I propose adding a YAMLBYTE_WHOLESCALAR = '<' which can contain
    '\n' and '\z' items (and thus requires a length or an end pointer
    to specify how big it is).

    When serialized as text, this bytecode would have the following
    structure: '<<' word '\n' raw '\n' here '\n', where 'raw' is the
    raw scalar value, and 'here' is a printable ASCII string which
    marks the end of the scalar when encountered on a line all by
    itself.

      D
      Q
      Ssingle line scalar
      <<.
      This whole multi-line scalar has
      two new lines in its content and is
      not terminated with a new line.
      .
      Ssecond single line scalar
      E

  concerns:
    - since this bytecode can have embedded new lines (\n) it could
      mess with readability... however, here (<<) documents are very
      well known, and in this context fit the bill. Note: this does
      not support the <<- variant... if you are after readability,
      use YAML!
    - since this bytecode can have embedded nulls (\z) it could
      cause problems with programs that use \z as a signal that the
      stream has ended...

This means almost the same thing. The folding indicator is usually
unneeded. It only makes sense on the 'proposal' entry, because the
indented lines need newline endings. (The only difference between
this and the original is that folded scalars end with a newline.)

Cheers,
The YAML Fashion Police
|
From: Clark C. E. <cc...@cl...> - 2003-09-20 19:09:15
|
# Changes since last pass:
#  - all bytecodes are \n terminated
#  - content bytecodes are upper case
#  - formatting bytecodes are lower case
#  - other bytecodes (for error messages, node length) are
#    using symbols, like #, !, etc.
#
subject: Revision #3 of YAML Bytecodes
summary: >
  This proposal defines a 'preparsed' format where YAML syntax is
  converted into a series of events, as bytecodes. Each bytecode
  appears on its own line, starting with a single character and
  ending with a line feed character, '\n'.
codes:
  #
  # Primary Bytecodes (Capital Letters)
  #
  # These bytecodes form the minimum needed to represent YAML
  # information from the serial model (ie, without format and
  # comments)
  #
  'D':
    name: Document
    desc: >
      Indicates that a document has begun: either it is the
      beginning of a YAML stream, or a --- has been found. Thus, an
      empty document is expressed as "D\n"
  'V':
    name: Directive
    desc: >
      This represents any YAML directives, immediately following a
      'D' bytecode. For example '--- %YAML:1.0' produces the
      bytecodes "D\nVYAML:1.0\n".
  'P':
    name: Pause Stream
    desc: >
      This instruction is used when a document is terminated, but
      another document has not yet begun. Thus, it is optional, and
      typically used to pause parsing. For example, a stream
      starting with an empty document, but then in a hold state for
      the next document, would be: "D\nP\n"
  0:
    name: End Stream (optional)
    desc: >
      YAML bytecodes are meant to be passable as a single "C"
      string, and thus the null terminator can optionally be used to
      signal the end of a stream. When writing bytecodes out to a
      flat file, the file need not contain a null terminator;
      however, when read into memory it should always have a null
      terminator.
  'M':
    name: Mapping
    desc: >
      Indicates the beginning of a mapping; children of the mapping
      are provided as a series of K1,V1,K2,V2 pairs as they are
      found in the input stream. For example, the bytecodes for
      "{ a: b, c: d }" would be "M\nSa\nSb\nSc\nSd\nE\n"
  'Q':
    name: Sequence
    desc: >
      Indicates the beginning of a sequence; children follow until
      an 'E' bytecode is encountered. So, the bytecodes for
      "[ one, two ]" would be "Q\nSone\nStwo\nE\n"
  'E':
    name: End Collection
    desc: >
      This closes the innermost open collection (mapping or
      sequence). Note that the document has one and only one node
      following it, therefore it is not a branch.
  'S':
    name: Scalar
    desc: >
      This indicates the start of a scalar value, which can be
      continued by the 'N' and 'C' bytecodes. This bytecode is used
      for sequence entries, keys, values, etc.
  'C':
    name: Scalar Continuation
    desc: >
      Since a scalar may not fit within a buffer, and since it may
      not contain a \n character, it may have to be broken into
      several chunks.
  'N':
    name: Normalized New Line (in a scalar value)
    desc: >
      Scalar values must be chunked so that new lines and null
      characters do not occur within an 'S' or 'C' bytecode (in the
      bytecodes, all other C0 characters need not be escaped). This
      bytecode is then used to represent one or more newlines, with
      the number of newlines optionally following. For example,
      "Hello\nWorld" would be "SHello\nN\nCWorld\n", and
      "Hello\n\n\nWorld" is "SHello\nN3\nCWorld\n"
  'Z':
    name: Null Character (in a scalar value)
    desc: >
      As with normalized new lines above, since the null character
      cannot be used in the bytecodes, it must be escaped; ie,
      "Hello\zWorld" would be "SHello\nZ\nCWorld\n".
  'A':
    name: Anchor
    desc: >
      This is used whenever a node carries an anchor; for example,
      "[ &X one, *X ]" would be normalized to "Q\nAX\nSone\nRX\nE\n"
      -- in this example, the anchor bytecode applies to the very
      next content bytecode.
  'R':
    name: Reference (Alias)
    desc: >
      This bytecode stands in for an alias node, referring back to
      the anchor established by a previous 'A' bytecode.
  'T':
    name: Transfer
    desc: >
      This is the transfer method. If the value begins with a '!',
      then it is not normalized. Otherwise, the value is a fully
      qualified URI, with a colon. The transfer method applies only
      to the node immediately following, and thus it can be seen as
      a modifier like the anchor. For example,
      "Ttaguri:yaml.org,2002:str\nSstring\n" is normalized,
      "T!str\nSstring\n" is not.
  #
  # Formatting bytecodes (lower case)
  #
  # The following bytecodes are purely at the syntax level and
  # useful for pretty printers and emitters. Since the range of
  # lower case letters is contiguous, it could be easy for a
  # processor to simply ignore all bytecodes in this range.
  #
  'c':
    name: Comment
    desc: >
      This is a single line comment. It is terminated like all of
      the other variable length items, with a '\n'.
  'i':
    name: Indent
    desc: >
      Specifies the number of additional spaces to indent for
      subsequent block style nodes; "i4\n" specifies a 4 character
      indent.
  'f':
    name: Flow Collections
    desc: >
      Use flow (inline bracketed) style for subsequent mappings {}
      and sequences []. If this bytecode is followed by a string
      value, it is treated as a transfer method to which the style
      applies; for example, "[ one, two ]" is expressed as
      "f\nQ\nSone\nStwo\nE\n". If one wants to limit this to a
      private transfer "bing", for example, then the flow
      instruction would appear as "f!!bing\n". Only subsequent
      collections marked with "!!bing" would be affected. Transfers
      used in this way are matched by exact string copies; thus,
      normalization equivalencies are not checked.
  'b':
    name: Block Collections
    desc: >
      Use block/indented style for subsequent mappings and
      sequences; as with Flow Collections above, this can be
      followed with an optional transfer string (see T above).
  's':
    name: Single Quoted Style
    desc: >
      Subsequent scalar nodes 'S' should be expressed using single
      quoted style, until another style instruction changes the
      current 'paintbrush'. Of course, if the scalar cannot be
      emitted using the single quoted style (it has items which
      require double quoted escaping), then this is an error.
      As above, if the bytecode is followed by characters, this is
      treated as a transfer method to be matched, for example:
        "s\nSsingle quoted\n"
        "s!!bing\nSplain style\nT!!bing\nSsingle quoted\n"
  'd':
    name: Double Quoted Style
    desc: >
      Same as above, only specifying double quoted. Note that, in
      the default output printer, scalar values over a given buffer
      size or beyond one line may always be double quoted, as this
      is the only style which can represent the entire Unicode
      character set within YAML.
  'l':
    name: Literal Style
    desc: >
      Same as above, only specifying block literal style for scalar
      values.
  'o':
    name: Folded Style
    desc: >
      Same as above, only specifying folded literal style for scalar
      values.
  'p':
    name: Plain Style
    desc: >
      Specifies use of plain style for scalar values.
  #
  # Advanced bytecodes (not alphabetic)
  #
  # These are optional goodies which one could find useful.
  #
  '#':
    name: Line Number
    desc: >
      This bytecode allows the line number of the very next node to
      be reported.
  '!':
    name: Notice
    desc: >
      This signifies the end of the current document (and possibly
      the entire stream) due to an error condition. This signal has
      a packed format, with an error number, a comma, and a textual
      error message:
        "#22\n!73,Indentation mismatch\n"
        "#132\n!84,Tabs are illegal for indentation\n"
  '?':
    name: Length
    desc: >
      This bytecode gives the span of the very next 'S', 'M', or
      'Q' bytecode -- including its subordinates. For scalars, it
      includes the span of all subordinate 'N' and 'C' codes. For
      mappings or sequences, this gives the length all the way to
      the corresponding 'E' bytecode, so that the entire branch can
      be skipped. The length is given starting at the corresponding
      'S', 'M' or 'Q' bytecode and extends to the first character
      following subordinate nodes. Since this length instruction is
      meant to 'speed' things up, and since calculating the length
      by hand is not really ideal, the length is expressed in hex.
      This will allow programs to easily convert the length to an
      actual value (converting from hex to integers is easier than
      decimal). Furthermore, all leading x's are ignored (so that
      they can be filled in later), and if the bytecode value is
      all x's, then the length is unknown. Lastly, this length is
      expressed in 8 bit units for UTF-8, and 16 bit units for
      UTF-16. For example,

        --- [[one, two], three]

      is expressed as

        "?25\nD\n?x1E\nQ\n?xxE\nQ\nSone\nStwo\nE\nSthree\nE\n"

      Thus it is seen that the address of D plus 37 is the null
      terminator for the string, the first 'Q' plus 30 also gives
      the null terminator, and the second 'Q' plus 14 jumps to the
      opening 'S' for the third scalar.
design:
  - name: streaming support
    problem: >
      The interface should ideally allow a YAML document to be moved
      incrementally as a stream through a process. In particular,
      YAML is inherently line oriented; thus the interface should
      probably reflect this fundamental character.
    solution: >
      The bytecodes deliver scalars as chunks, each chunk limited to
      at most one line. While this is not ideal for passing large
      binary objects, it is simple and easy to understand.
  - name: push
    problem: >
      The most common 'parsers' out there for YAML are push style,
      where the producer owns the 'C' program stack and the consumer
      keeps its state as a heap object. Ideal use of a push
      interface is an emitter, since this allows the sender (the
      application program) to use the program stack and thus keep
      its state on the call stack in local, automatic variables.
    solution: >
      A push interface can simply call a single event handler with a
      (bytecode, payload) tuple. Since the core complexity is in the
      bytecodes, the actual function signature is straight-forward,
      allowing for relative language independence. Since the
      bytecode is always one character, the event handler could just
      receive a string where the tuple is implicit.
  - name: pull
    problem: >
      The other alternative for a streaming interface is a 'pull'
      mechanism, or iterator model, where the consumer owns the C
      stack and the producer keeps any state needed as a heap
      object. Ideal use of a pull interface is a parser, since this
      allows the receiver (the application program) to use the
      program stack, keeping its state on the call stack in local
      variables.
    solution: >
      A pull interface would also be a simple function that, when
      called, fills a buffer with binary node(s). Or, in a language
      with garbage collection, it could be implemented as an
      iterator returning a string containing the bytecode line
      (bytecode followed immediately by the bytecode argument as a
      single string) or as a tuple.
  - name: pull2push
    problem: >
      An iterator-style producer must be connected to an
      event-handler-style consumer.
    solution: >
      This is done easily via a small loop which pulls from the
      iterator and pushes to the event handler. For Python,
      assuming the parser is implemented as an iterator from which
      one can 'pull' (bytecode, args) tuples, and assuming the
      emitter has an event callback taking a (bytecode, args)
      tuple, we have:

        def pull2push(parser, emitter):
            for (bytecode, args) in parser:
                emitter.push(bytecode, args)

  - name: push2pull
    problem: >
      This direction requires that the entire YAML stream be cached
      in memory, or that each of the two stages run in its own
      thread or continuation, with shared memory or a pipe between
      them.
    solution: >
      This use case seems much easier with a binary stream; that is,
      one need not convert the style of functions between the push
      and pull patterns. And, for languages supporting continuations
      (ruby), perhaps push vs pull is not even an issue... For a
      language like Python, one would use the threaded Queue object:
      one thread pushes (bytecode, args) tuples into the Queue,
      while the other thread pulls the tuples out. Simple.
  - name: neutrality
    problem: >
      It would be ideal if the C program interface was simple enough
      to be independent of programming language. In an ideal case,
      imagine a flow of YAML structured data through various
      processing stages on a server, where each processing stage is
      written in a different programming language.
    solution: >
      While it may be hard for each language to write a syntax
      parser filled with all of the little details, it would be
      much, much easier to write a parser for these bytecodes, as it
      involves simple string handling, dispatching on the first
      character in each string.
  - name: tools
    problem: >
      A goal of mine is to have a YPATH expression language, a
      schema language, and a transformation language. I would like
      these items to be reusable by a great number of
      platforms/languages, and in particular as their own callable
      processing stages.
    solution: >
      If such an expression language was written on top of a
      bytecode format like this, via a simple pull function (with
      adapters for push2pull and pull2push), quite a bit of
      reusability could emerge. Imagine a schema validator which is
      injected into the bytecode stream: it is an identity operation
      unless an exception occurs, in which case it terminates the
      document and makes the next document a description of the
      validation error.
  - name: encoding
    problem: >
      Text within the bytecode format must be given an encoding.
      There are several considerations at hand.
    solution: >
      The YAML bytecode format uses the same encodings as YAML
      itself, and thus is independent of actual encoding. A parser
      library should have several functions to convert between the
      encodings.
examples:
  - yaml: |
      ---
      - plain
      - >
        this is a flow scalar
      - >
        another flow scalar which is continued
        on a second line and indented 2 spaces
      - &001 !str |
        This is a block scalar, both typed
        and anchored
      - *001 # this was an alias
      - "This is a \"double quoted\" scalar"
    bytecode: |
      D
      Q
      Splain
      f
      Sthis is a flow scalar
      Sanother flow scalar which is continued
      Con a second line and indented 2 spaces
      b
      A001
      T!str
      SThis is a block scalar, both typed
      N
      Cand anchored
      R001
      cthis was an alias
      d
      SThis is a "double quoted" scalar
      E
|
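[Editor's note: the 'S'/'N'/'C' chunking rules above can be sketched mechanically. This is an illustrative, non-normative Python fragment (the function name is invented), assuming LF-only newlines and scalars small enough to ignore buffer limits:]

```python
import re

# Sketch of the 'S' / 'N' / 'C' chunking rules for scalar content:
# the first text chunk is emitted with 'S', runs of newlines become
# 'N' (with a count when more than one), and later chunks use 'C'.

def scalar_to_bytecodes(value: str) -> str:
    out = []
    first = True
    # capturing group keeps the newline runs in the split result
    for part in re.split(r"(\n+)", value):
        if not part:
            continue
        if part.startswith("\n"):
            if first:               # scalar starting with a newline
                out.append("S")
                first = False
            count = len(part)
            out.append("N" + (str(count) if count > 1 else ""))
        else:
            out.append(("S" if first else "C") + part)
            first = False
    if first:                       # empty scalar still gets an 'S'
        out.append("S")
    return "\n".join(out) + "\n"

assert scalar_to_bytecodes("Hello\nWorld") == "SHello\nN\nCWorld\n"
assert scalar_to_bytecodes("Hello\n\n\nWorld") == "SHello\nN3\nCWorld\n"
```

Both assertions reproduce the worked examples given under the 'N' bytecode.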
From: Oren Ben-K. <or...@be...> - 2003-09-20 19:42:05
|
Some points:

> 'S':
>   name: Scalar
>   desc: >
>     This indicates the start of a scalar value, which can
>     be continued by the 'N' and 'C' bytecodes. This bytecode
>     is used for sequence entries, keys, values, etc.

Maybe it is better to have 'S' end with an 'E' and have multiple
'V'-s and 'N'-s in it (you'd need some other char for directives -
maybe 'C'). It is cleaner than using 'C' (e.g. a node always ends
with an 'E'). It isn't a big deal, though.

> 'N':
>   name: Normalized New Line (in a scalar value)
>   desc: >
>     Scalar values must be chunked so that new lines and
>     null values do not occur within a 'S' or 'C' bytecode
>     (in the bytecodes, all other C0 need not be escaped).
>     This bytecode is then used to represent one or more
>     newlines, with the number of newlines optionally
>     following. For example,
>     "Hello\nWorld" would be "SHello\nN\nCWorld\n", and
>     "Hello\n\n\nWorld" is "SHello\nN3\nCWorld\n"

This doesn't handle LS and PS... I suggest that 'N' be followed by
character(s) specifying the new-line type.

> 'f':
>   name: Flow Collections
>   desc: >
>     Use flow (inline bracketed) style for subsequent mappings
>     {} and sequences []. If this bytecode is followed by
>     string value, it is treated as a transfer method to which
>     the style applies, for example, "[ one, two ]" is expressed
>     as "f\nQ\nSone\nStwo\nE\n". If one wants to limit this to
>     a private transfer "bing" for example, then the flow
>     instruction would appear as "f!!bing\n". Only subsequent
>     collections marked with "!!bing" would be affected.
>     Transfers used in this way are matched by exact string
>     copies, thus, normalization equivalencies are not checked.

-1 on that:

-1 on the "applied to subsequent nodes of the specified transfer
method". The byte codes are no place for this level of instruction.
Not that there isn't a need for that sort of stuff - but it is a
separate concern, and should be addressed properly rather than
slapping one very special restricted case into the byte codes.

I also tend towards having a single 's' code followed by a style
marker: s>, s|, s', s", sf (flow), s (for plain - or maybe sp). I'd
also have a code for explicit indent and for chomping, allowing the
byte codes to express the exact modifiers given in the document for
a node.

> '!':
>   name: Notice
>   desc: >
>     This signifies the end of the current document (and
>     possibly the entire stream) due to an error condition.
>     This signal has a packed format, with an error number,
>     a comma, and textual error message:
>     "#22\n!73,Indentation mismatch\n"
>     "#132\n!84,Tabs are illegal for indentation\n"

I think this shouldn't necessarily indicate the end of the document.
If byte codes follow, they should either be 'D' for the next
document, 0 if everything is done, or anything else if the parser
has some recovery algorithm - e.g., multiple 'E's to close the
affected nodes and then continuation at the next valid node.

> '?':
>   name: Length
>   desc: >
>     This bytecode gives the span of the very next 'S', 'M',
>     or 'Q' bytecode -- including its subordinates. For scalars,
>     it includes the span of all subordinate 'N' and 'C' codes.
>     For mappings or sequences, this gives the length all the
>     way to the corresponding 'E' bytecode so that the entire
>     branch can be skipped. The length is given starting at
>     the corresponding 'S', 'M' or 'Q' bytecode and extends
>     to the first character following subordinate nodes.

This expresses the length of the byte codes of the node, right? To
allow a quick "seek" to skip the node's byte codes? Well and good,
but it is no less useful - perhaps more useful - to know the length
of the _content_ of a node in advance (the number of entries in a
sequence, the number of chars in a scalar, or even pairs in a map).
This allows a loader to pre-allocate the necessary structures and
thus greatly improve efficiency. We need such a code in addition to
this one.

Overall, I like the approach. Like I said, I'm not willing to "sign"
on the codes until I get to the point where I can actually generate
them, which will take a few months. In the meanwhile, how is the
release candidate draft coming along? :-)

Have fun,

Oren Ben-Kiki
|
From: Clark C. E. <cc...@cl...> - 2003-09-20 20:26:56
|
On Sat, Sep 20, 2003 at 09:39:44PM +0200, Oren Ben-Kiki wrote:
| > 'S':
| >   name: Scalar
| >   desc: >
| >     This indicates the start of a scalar value, which can
| >     be continued by the 'N' and 'C' bytecodes. This bytecode
| >     is used for sequence entries, keys, values, etc.
|
| Maybe it is better to have 'S' end with an 'E' and have multiple
| 'V'-s and 'N'-s in it (you'd need some other char for directives -
| maybe 'C'). It is cleaner than using 'C' (e.g. a node always ends
| with an 'E'). It isn't a big deal, though.

In this approach, the scalar ends when you encounter a byte code
which is not 'N', 'C', or 'Z'. A tiny state machine is needed to
handle this case, but I think the bulk of scalars will not contain
new lines or nulls, and will not need to be continued; so I'd
rather make it easy for that case. Also, branch (mappings and
sequences) tracking will require the handler to maintain a stack;
'E' pops the stack. I don't really see the need to put scalars on
that stack as well, so I was thinking more from this use case. I
guess if you want to think of 'C' as chunk, that would be fine...
but I still don't see the reason to end a sequence of chunks, as
any other bytecode will do this for you.

| > 'N':
| >   name: Normalized New Line (in a scalar value)
| >   desc: >
| >     Scalar values must be chunked so that new lines and
| >     null values do not occur within a 'S' or 'C' bytecode
| >     (in the bytecodes, all other C0 need not be escaped).
| >     This bytecode is then used to represent one or more
| >     newlines, with the number of newlines optionally
| >     following. For example,
| >     "Hello\nWorld" would be "SHello\nN\nCWorld\n", and
| >     "Hello\n\n\nWorld" is "SHello\nN3\nCWorld\n"
|
| This doesn't handle LS and PS... I suggest that 'N' be followed by
| character(s) specifying the new-line type.

Ok. So, NCRLF, NLF, NCR, NLS, NPS would be the sequences, where N
all by itself is NLF?

| > 'f':
| >   name: Flow Collections
| >   desc: >
| >     Use flow (inline bracketed) style for subsequent mappings
| >     {} and sequences []. If this bytecode is followed by
| >     string value, it is treated as a transfer method to which
| >     the style applies, for example, "[ one, two ]" is expressed
| >     as "f\nQ\nSone\nStwo\nE\n". If one wants to limit this to
| >     a private transfer "bing" for example, then the flow
| >     instruction would appear as "f!!bing\n". Only subsequent
| >     collections marked with "!!bing" would be affected.
| >     Transfers used in this way are matched by exact string
| >     copies, thus, normalization equivalencies are not checked.
|
| -1 on that:
|
| -1 on the "applied to subsequent nodes of the specified transfer
| method". The byte codes are no place for this level of instruction.
| Not that there isn't a need for that sort of stuff - but it is a
| separate concern, and should be addressed properly rather than
| slapping one very special restricted case into the byte codes.
|
| I also tend towards having a single 's' code followed by a style
| marker: s>, s|, s', s", sf (flow), s (for plain - or maybe sp).
| I'd also have a code for explicit indent and for chomping, allowing
| the byte codes to express the exact modifiers given in the document
| for a node.

Ok. I'll make these changes.

| > '!':
| >   name: Notice
| >   desc: >
| >     This signifies the end of the current document (and
| >     possibly the entire stream) due to an error condition.
| >     This signal has a packed format, with an error number,
| >     a comma, and textual error message:
| >     "#22\n!73,Indentation mismatch\n"
| >     "#132\n!84,Tabs are illegal for indentation\n"
|
| I think this shouldn't necessarily indicate the end of the
| document. If byte codes follow, they should either be 'D' for the
| next document, 0 if everything is done, or anything else if the
| parser has some recovery algorithm - e.g., multiple 'E's to close
| the affected nodes and then continuation of the next valid node.

Right. \z ends the document. I will fix.

| > '?':
| >   name: Length
| >   desc: >
| >     This bytecode gives the span of the very next 'S', 'M',
| >     or 'Q' bytecode -- including its subordinates. For scalars,
| >     it includes the span of all subordinate 'N' and 'C' codes.
| >     For mappings or sequences, this gives the length all the
| >     way to the corresponding 'E' bytecode so that the entire
| >     branch can be skipped. The length is given starting at
| >     the corresponding 'S', 'M' or 'Q' bytecode and extends
| >     to the first character following subordinate nodes.
|
| This expresses the length of the byte codes of the node, right? To
| allow a quick "seek" to skip the node's byte codes? Well and good,
| but it is no less useful - perhaps more useful - to know the length
| of the _content_ of a node in advance (the number of entries in a
| sequence, the number of chars in a scalar, or even pairs in a map).
| This allows a loader to pre-allocate the necessary structures and
| thus greatly improve efficiency. We need such a code in addition to
| this one.

Right. This one is about 'seek', not about 'allocation'. ;)

| Overall, I like the approach. Like I said, I'm not willing to
| "sign" on the codes until I get to the point I can actually
| generate them, which would take a few months.

That's fine. I just wanted to put a front end on Syck so that I can
start using the bytecodes for various items (such as writing a
generic Python loader via the bytecodes).

| In the meanwhile, how is the release candidate
| draft coming along? :-)

Yes.... Hmm... *me acts like a nerd* Ok. I'll get back to that very
shortly. I've got one (or two, even) days a week on YAML! ;)

Best,

Clark
|
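[Editor's note: the newline-type encoding floated here (NCRLF, NLF, NCR, NLS, NPS, with bare N meaning LF) is easy to table-drive. A hypothetical illustration, not part of any spec; note that Revision #4 below ultimately shortens the suffixes to single characters:]

```python
# Illustrative mapping from normalized newline kinds to the suffix
# that would follow an 'N' bytecode, per the NCRLF/NLF/NCR/NLS/NPS
# idea. Bare 'N' is taken to mean LF; nothing here is normative.

NEWLINE_SUFFIX = {
    "\r\n": "CRLF",
    "\n": "",        # bare N means LF
    "\r": "CR",
    "\u2028": "LS",  # Unicode line separator
    "\u2029": "PS",  # Unicode paragraph separator
}

def newline_bytecode(nl: str) -> str:
    """Return the 'N' bytecode line for one normalized newline."""
    return "N" + NEWLINE_SUFFIX[nl] + "\n"

assert newline_bytecode("\n") == "N\n"
assert newline_bytecode("\r\n") == "NCRLF\n"
```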
From: Clark C. E. <cc...@cl...> - 2003-09-20 21:06:07
|
# # Reflects Oren's comments, adds yamlbyte.h at the bottom # subject: Revision #4 of YAML Bytecodes summary: > This proposal defines a 'preparsed' format where a YAML syntax is converted into a series of events, as bytecodes. Each bytecode appears on its own line, starting with a single character and ending with a line feed character, '\n'. codes: # # Primary Bytecodes (Capital Letters) # # These bytecodes form the minimum needed to represent YAML information # from the serial model (ie, without format and comments) # 'D': name: Document desc: > Indicates that a document has begun, either it is the beginning of a YAML stream, or a --- has been found. Thus, an empty document is expressed as "D\n" 'V': name: Directive desc: > This represents any YAML directives immediately following a 'D' bytecode. For example '--- %YAML:1.0' produces the bytecode "D\nVYAML:1.0\n". 'P': name: Pause Stream desc: > This is the instruction when a document is terminated, but another document has not yet begun. Thus, it is optional, and typically used to pause parsing. For example, a stream starting with an empty document, but then in a hold state for the next document would be: "D\nP\n" '\z': name: Finish (end stream) desc: > YAML bytecodes are meant to be passable as a single "C" string, and thus the null terminator can optionally be used to signal the end of a stream. When writing bytecodes out to a flat file, the file need not contain a null terminator; however, when read into memory it should always have a null terminator. 'M': name: Mapping desc: > Indicates the begin of a mapping, children of the mapping are provided as a series of K1,V1,K2,V2 pairs as they are found in the input stream. For example, the bytecodes for "{ a: b, c: d }" would be "M\nSa\nSb\nSc\nSd\nE\n" 'Q': name: Sequence desc: > Indicates the begin of a sequence, children are provided following till a '.' bytecode is encountered. 
So, the bytecodes for "[ one, two ]" would be "Q\nSone\nStwo\nE\n" 'E': name: End Collection desc: > This closes the outermost Collection (Mapping, Sequence), note that the document has one and only one node following it, therefore it is not a branch. 'S': name: Scalar desc: > This indicates the start of a scalar value, which can be continued by the 'N' and 'C' bytecodes. This bytecode is used for sequence entries, keys, values, etc. 'C': name: Scalar Continuation desc: > Since a scalar may not fit within a buffer, and since it may not contain a \n character, it may have to be broken into several chunks. 'N': name: Normalized New Line (in a scalar value) desc: > Scalar values must be chunked so that new lines and null values do not occur within a 'S' or 'C' bytecode (in the bytecodes, all other C0 need not be escaped). This bytecode is then used to represent one or more newlines, with the number of newlines optionally following. For example, "Hello\nWorld" would be "SHello\nN\nCWorld\n", and "Hello\n\n\nWorld" is "SHello\nN3\nCWorld\n" If the new line is an LS or a PS, the N bytecode can be followed with a L or P. Thus, "Hello\PWorld\L" is reported "SHello\nNP\nWorld\NL\n" 'Z': name: Null Character (in a scalar value) desc: > As in normalized new lines above, since the null character cannot be used in the bytecodes, is must be escaped, ie, "Hello\zWorld" would be "SHello\nZ\nCWorld\n". 'A': name: Alias desc: > This is used when ever there is an alias node, for example, "[ &X one, *X ]" would be normalized to "S\nAX\nSone\nRX\nE\n" -- in this example, the anchor bytecode applies to the very next content bytecode. 'R': name: Reference (Anchor) desc: > This bytecode associates an anchor with the very next content node, see the 'A' alias bytecode. 'T': name: Transfer desc: > This is the transfer method. If the value begins with a '!', then it is not normalized. Otherwise, the value is a fully qualified URL, with a semicolon. 
      The transfer method applies only to the node immediately
      following, and thus it can be seen as a modifier like the
      anchor.  For example, "Ttaguri:yaml.org,2002:str\nSstring\n"
      is normalized; "T!str\nSstring\n" is not.
  #
  # Formatting bytecodes (lower case)
  #
  # The following bytecodes are purely at the syntax level and
  # useful for pretty printers and emitters.  Since the range of
  # lower case letters is contiguous, it could be easy for a
  # processor to simply ignore all bytecodes in this range.
  #
  'c':
    name: Comment
    desc: >
      This is a single line comment.  It is terminated like all of
      the other variable length items, with a '\n'.
  'i':
    name: Indent
    desc: >
      Specifies the number of additional spaces to indent for
      subsequent block style nodes; "i4\n" specifies a 4 character
      indent.
  's':
    name: Scalar styling
    desc: >
      This bytecode is followed by one of the following characters to
      indicate the style to be used for the very next content node.
      It is an error to specify a style other than double quoted for
      a scalar which must be escaped.  Furthermore, there must be
      agreement between the style and the very next content node; in
      other words, a scalar style requires that the next content node
      be an 'S'.

        >  flow scalar
        "  double quoted scalar
        '  single quoted scalar
        |  literal scalar
        p  plain scalar
        {  inline mapping
        [  inline sequence
        b  block style (for mappings and sequences)
  #
  # Advanced bytecodes (not alphabetic)
  #
  # These are optional goodies which one could find useful.
  #
  '#':
    name: Line Number
    desc: >
      This bytecode allows the line number of the very next node to
      be reported.
  '!':
    name: Notice
    desc: >
      This is a message sent from the producer to the consumer
      regarding the state of the stream or document.  It does not
      necessarily end a stream, as the 'finish' bytecode can be used
      for this purpose.  This signal has a packed format, with the
      error number, a comma, and a textual message:

        "#22\n!73,Indentation mismatch\n"
        "#132\n!84,Tabs are illegal for indentation\n"
  ',':
    name: Span
    desc: >
      This bytecode gives the span of the very next 'S', 'M', or 'Q'
      bytecode -- including its subordinates.  For scalars, it
      includes the span of all subordinate 'N' and 'C' codes.  For
      mappings or sequences, this gives the length all the way to the
      corresponding 'E' bytecode, so that the entire branch can be
      skipped.  The length is given starting at the corresponding
      'S', 'M' or 'Q' bytecode and extends to the first character
      following the subordinate nodes.  Since this length instruction
      is meant to 'speed' things up, and since calculating the length
      by hand is not really ideal, the length is expressed in hex.
      This allows programs to easily convert the length to an actual
      value (converting from hex to integers is easier than from
      decimal).  Furthermore, all leading x's are ignored (so that
      they can be filled in later), and if the bytecode value is all
      x's, then the length is unknown.  Lastly, this length is
      expressed in 8 bit units for UTF-8, and 16 bit units for
      UTF-16.  For example,

        --- [[one, two], three]

      is expressed as

        ",25\nD\n,x1E\nQ\n,xxE\nQ\nSone\nStwo\nE\nSthree\nE\n"

      Thus the address of 'D' plus 37 is the null terminator for the
      string, the first 'Q' plus 30 also gives the null terminator,
      and the second 'Q' plus 14 jumps to the opening 'S' for the
      third scalar.
  '@':
    name: Allocate
    desc: >
      This is a hint telling the processor how many items are in the
      following collection (mapping pairs, or sequence values), or
      how many character units need to be allocated to hold the next
      value.  Clearly this is an encoding specific value.  The length
      which follows is in hex (not decimal).
      For example, "one" could be "@x3\nSone".
design:
  - name: streaming support
    problem: >
      The interface should ideally allow a YAML document to be moved
      incrementally as a stream through a process.  In particular,
      YAML is inherently line oriented, thus the interface should
      probably reflect this fundamental character.
    solution: >
      The bytecodes deliver scalars as chunks, each chunk limited to
      at most one line.  While this is not ideal for passing large
      binary objects, it is simple and easy to understand.
  - name: push
    problem: >
      The most common 'parsers' out there for YAML are push style,
      where the producer owns the "C" program stack, and the consumer
      keeps its state as a heap object.  Ideal use of a push
      interface is an emitter, since this allows the sender (the
      application program) to use the program stack and thus keep its
      state on the call stack in local, automatic variables.
    solution: >
      A push interface can simply call a single event handler with a
      (bytecode, payload) tuple.  Since the core complexity is in the
      bytecodes, the actual function signature is straightforward,
      allowing for relative language independence.  Since the
      bytecode is always one character, the event handler could just
      receive a string where the tuple is implicit.
  - name: pull
    problem: >
      The other alternative for a streaming interface is a 'pull'
      mechanism, or iterator model, where the consumer owns the C
      stack and the producer keeps any state needed as a heap object.
      Ideal use of a pull interface is a parser, since this allows
      the receiver (the application program) to use the program
      stack, keeping its state on the call stack in local variables.
    solution: >
      A pull interface would also be a simple function that, when
      called, fills a buffer with binary node(s).  Or, in a language
      with garbage collection, it could be implemented as an iterator
      returning a string containing the bytecode line (bytecode
      followed immediately by the bytecode argument as a single
      string) or as a tuple.
  - name: pull2push
    problem: >
      A pull-style producer must often be connected to a push-style
      consumer.
    solution: >
      This is done easily via a small loop which pulls from the
      iterator and pushes to the event handler.  For Python, assuming
      the parser is implemented as an iterator from which one can
      'pull' (bytecode, args) tuples, and assuming the emitter has an
      event callback taking a (bytecode, args) tuple, we have:

        def pull2push(parser, emitter):
            for (bytecode, args) in parser:
                emitter.push(bytecode, args)
  - name: push2pull
    problem: >
      Going the other way requires that the entire YAML stream be
      cached in memory, or that each of the two stages run in a
      thread or different continuation, with shared memory or a pipe
      between them.
    solution: >
      This use case seems much easier with a binary stream; that is,
      one need not convert the style of functions between the push vs
      pull pattern.  And, for languages supporting continuations
      (ruby), perhaps push vs pull is not even an issue...  For a
      language like python, one would use the threaded Queue object:
      one thread pushes (bytecode, args) tuples into the Queue, while
      the other thread pulls the tuples out.  Simple.
  - name: neutrality
    problem: >
      It would be ideal if the C program interface was simple enough
      to be independent of programming language.  In an ideal case,
      imagine a flow of YAML structured data through various
      processing stages on a server, where each processing stage is
      written in a different programming language.
    solution: >
      While it may be hard for each language to write a syntax parser
      filled with all of the little details, it would be much easier
      to write a parser for these bytecodes, as it involves simple
      string handling, dispatching on the first character in each
      string.
  - name: tools
    problem: >
      A goal of mine is to have a YPATH expression language, a schema
      language, and a transformation language.  I would like these
      items to be reusable by a great number of platforms/languages,
      and in particular as their own callable processing stage.
    solution: >
      If such an expression language were written on top of a
      bytecode format like this, via a simple pull function (/w
      adapters for push2pull and pull2push), quite a bit of
      reusability could emerge.  Imagine a schema validator which is
      injected into the bytecode stream: it is an identity operation
      unless an exception occurs, in which case it terminates the
      document and makes the next document be a description of the
      validation error.
  - name: encoding
    problem: >
      Text within the bytecode format must be given an encoding.
      There are several considerations at hand.
    solution: >
      The YAML bytecode format uses the same encodings as YAML
      itself, and thus is independent of actual encoding.  A parser
      library should have several functions to convert between the
      encodings.
examples:
  - yaml: |
      ---
      - plain
      - >
        this is a flow scalar
      - >
        another flow scalar which is continued
        on a second line and indented 2 spaces
      - &001 !str |
        This is a block scalar, both typed
        and anchored
      - *001 # this was an alias
      - "This is a \"double quoted\" scalar"
    bytecode: |
      D
      Q
      Splain
      f
      Sthis is a flow scalar
      Sanother flow scalar which is continued
      Con a second line and indented 2 spaces
      b
      a001
      t!str
      SThis is a block scalar, both typed
      N
      Cand anchored
      R001
      cthis was an alias
      d
      SThis is a "double quoted" scalar
      E
cheader: |
  /* yamlbyte.h
   *
   * The YAML bytecode "C" interface header file.  See the YAML
   * bytecode reference for bytecode sequence rules and for the
   * meaning of each bytecode.
   */
  #ifndef YAMLBYTE_H
  #define YAMLBYTE_H
  #include <stddef.h>
  #include <string.h>  /* memset, for YAML_PULL2PUSH */
  #include <assert.h>  /* assert, for YAML_PULL2PUSH */

  /* list out the various YAML bytecodes */
  typedef enum {
      /* content bytecodes */
      YAML_FINISH     = 0,
      YAML_DOCUMENT   = 'D',
      YAML_DIRECTIVE  = 'V',
      YAML_PAUSE      = 'P',
      YAML_MAPPING    = 'M',
      YAML_SEQUENCE   = 'Q',
      YAML_ENDMAPSEQ  = 'E',
      YAML_SCALAR     = 'S',
      YAML_CONTINUE   = 'C',
      YAML_NEWLINE    = 'N',
      YAML_NULLCHAR   = 'Z',
      YAML_ALIAS      = 'A',
      YAML_ANCHOR     = 'R',
      YAML_TRANSFER   = 'T',
      /* formatting bytecodes */
      YAML_COMMENT    = 'c',
      YAML_INDENT     = 'i',
      YAML_STYLE      = 's',
      /* other bytecodes */
      YAML_LINENUMBER = '#',
      YAML_NOTICE     = '!',
      YAML_SPAN       = ',',
      YAML_ALLOC      = '@'
  } yaml_code_t;

  /* additional modifiers for the YAML_STYLE bytecode */
  typedef enum {
      YAML_FLOW            = '>',
      YAML_LITERAL         = '|',
      YAML_BLOCK           = 'b',
      YAML_PLAIN           = 'p',
      YAML_INLINE_MAPPING  = '{',
      YAML_INLINE_SEQUENCE = '[',
      YAML_SINGLE_QUOTED   = 39,
      YAML_DOUBLE_QUOTED   = '"'
  } yaml_style_t;

  typedef unsigned char  yaml_utf8_t;
  typedef unsigned short yaml_utf16_t;
  #ifdef YAML_UTF8
  #ifdef YAML_UTF16
  #error Must only define YAML_UTF8 or YAML_UTF16
  #endif
  typedef yaml_utf8_t yaml_char_t;
  #else
  #ifdef YAML_UTF16
  typedef yaml_utf16_t yaml_char_t;
  #else
  #error Must define YAML_UTF8 or YAML_UTF16
  #endif
  #endif

  /* return value for the push function; tells the producer whether
   * to stop */
  typedef enum {
      YAML_MORE = 1,  /* producer should continue to fire events */
      YAML_STOP = 0   /* producer should stop firing events */
  } yaml_more_t;

  /* push bytecodes from a producer to a consumer,
   * where arg is null terminated /w a length */
  typedef void * yaml_consumer_t;
  typedef
  yaml_more_t
  (*yaml_push_t)(
      yaml_consumer_t    self,
      yaml_code_t        code,
      const yaml_char_t *arg,
      size_t             arglen
  );

  /* pull bytecodes from the producer, where the producer must null
   * terminate buff and return the number of sizeof(yaml_char_t)
   * units used in the buffer */
  typedef void * yaml_producer_t;
  typedef
  size_t
  (*yaml_pull_t)(
      yaml_producer_t self,
      yaml_code_t    *code,
      yaml_char_t    *buff,  /* at least 1K buffer */
      size_t          buffsize
  );

  /* canonical helper to show how to hook up a parser (as a pull
   * producer) to an emitter (as a push consumer) */
  #define YAML_PULL2PUSH(pull, producer, push, consumer)    \
      do {                                                  \
          yaml_code_t code = YAML_NOTICE;                   \
          yaml_more_t more = YAML_MORE;                     \
          yaml_char_t buff[1024];                           \
          size_t size = 0;                                  \
          memset(buff, 0, 1024 * sizeof(yaml_char_t));      \
          while (code && more) {                            \
              size = (pull)((producer), &code, buff, 1024); \
              assert(size < 1024 && !buff[size]);           \
              more = (push)((consumer), code, buff, size);  \
          }                                                 \
          buff[0] = 0;                                      \
          (push)((consumer), YAML_FINISH, buff, 0);         \
      } while(0)

  #endif
|
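The S/C/N/Z chunking rules above are easy to exercise in a higher-level language. Below is a minimal Python sketch of both directions; the function names (`chunk_scalar`, `unchunk_scalar`) are my own, not part of the proposal, and the LS/PS ("NL"/"NP") variants are deliberately left out.

```python
def chunk_scalar(text):
    """Split scalar text into bytecode lines per the rules above:
    'S' opens the scalar, 'C' continues it, 'N' stands for a run of
    newlines (count attached when more than one), 'Z' for a null."""
    def segment_end(i):
        # advance to the next character that must be escaped
        while i < len(text) and text[i] not in ('\n', '\x00'):
            i += 1
        return i

    end = segment_end(0)
    out = ['S' + text[:end]]      # 'S' always opens the scalar
    i = end
    while i < len(text):
        if text[i] == '\n':
            run = 0
            while i < len(text) and text[i] == '\n':
                run += 1
                i += 1
            out.append('N' if run == 1 else 'N%d' % run)
        else:                     # a '\x00' character
            out.append('Z')
            i += 1
        end = segment_end(i)
        if end > i:               # only emit 'C' for non-empty chunks
            out.append('C' + text[i:end])
            i = end
    return out

def unchunk_scalar(lines):
    """Reassemble scalar text from S/C/N/Z bytecode lines."""
    parts = []
    for line in lines:
        code, arg = line[0], line[1:]
        if code in ('S', 'C'):
            parts.append(arg)
        elif code == 'N':
            parts.append('\n' * (int(arg) if arg else 1))
        elif code == 'Z':
            parts.append('\x00')
    return ''.join(parts)
```

For example, `chunk_scalar("Hello\n\n\nWorld")` yields `["SHello", "N3", "CWorld"]`, matching the "Hello\n\n\nWorld" example in the 'N' bytecode description.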
From: Clark C. E. <cc...@cl...> - 2003-09-20 20:28:36
|
/* yamlbyte.h
 *
 * The YAML bytecode "C" interface header file.  See the YAML bytecode
 * reference for bytecode sequence rules and for the meaning of each
 * bytecode.
 */
#ifndef YAMLBYTE_H
#define YAMLBYTE_H
#include <stddef.h>
#include <string.h>  /* memset, for YAML_PULL2PUSH */
#include <assert.h>  /* assert, for YAML_PULL2PUSH */

/* list out the various YAML bytecodes */
typedef enum {
    /* content bytecodes */
    YAML_FINISH     = 0,
    YAML_DOCUMENT   = 'D',
    YAML_DIRECTIVE  = 'V',
    YAML_PAUSE      = 'P',
    YAML_MAPPING    = 'M',
    YAML_SEQUENCE   = 'Q',
    YAML_ENDMAPSEQ  = 'E',
    YAML_SCALAR     = 'S',
    YAML_CONTINUE   = 'C',
    YAML_NEWLINE    = 'N',
    YAML_NULLCHAR   = 'Z',
    YAML_ALIAS      = 'A',
    YAML_ANCHOR     = 'R',
    YAML_TRANSFER   = 'T',
    /* formatting bytecodes */
    YAML_COMMENT    = 'c',
    YAML_INDENT     = 'i',
    YAML_FLOW       = 'f',
    YAML_BLOCK      = 'b',
    YAML_SINGLE     = 's',
    YAML_DOUBLE     = 'd',
    YAML_LITERAL    = 'l',
    YAML_FOLDED     = 'o',
    YAML_PLAIN      = 'p',
    /* other bytecodes */
    YAML_LINENUMBER = '#',
    YAML_NOTICE     = '!',
    YAML_LENGTH     = '?'
} yaml_code_t;

typedef unsigned char  yaml_utf8_t;
typedef unsigned short yaml_utf16_t;
#ifdef YAML_UTF8
#ifdef YAML_UTF16
#error Must only define YAML_UTF8 or YAML_UTF16
#endif
typedef yaml_utf8_t yaml_char_t;
#else
#ifdef YAML_UTF16
typedef yaml_utf16_t yaml_char_t;
#else
#error Must define YAML_UTF8 or YAML_UTF16
#endif
#endif

/* return value for the push function; tells the producer whether
 * to stop */
typedef enum {
    YAML_MORE = 1,  /* producer should continue to fire events */
    YAML_STOP = 0   /* producer should stop firing events */
} yaml_more_t;

/* push bytecodes from a producer to a consumer,
 * where arg is null terminated /w a length */
typedef void * yaml_consumer_t;
typedef
yaml_more_t
(*yaml_push_t)(
    yaml_consumer_t    self,
    yaml_code_t        code,
    const yaml_char_t *arg,
    size_t             arglen
);

/* pull bytecodes from the producer, where the producer must null
 * terminate buff and return the number of sizeof(yaml_char_t)
 * units used in the buffer */
typedef void * yaml_producer_t;
typedef
size_t
(*yaml_pull_t)(
    yaml_producer_t self,
    yaml_code_t    *code,
    yaml_char_t    *buff,  /* at least 1K buffer */
    size_t          buffsize
);

/* canonical helper to show how to hook up a parser (as a pull
 * producer) to an emitter (as a push consumer) */
#define YAML_PULL2PUSH(pull, producer, push, consumer)    \
    do {                                                  \
        yaml_code_t code = YAML_NOTICE;                   \
        yaml_more_t more = YAML_MORE;                     \
        yaml_char_t buff[1024];                           \
        size_t size = 0;                                  \
        memset(buff, 0, 1024 * sizeof(yaml_char_t));      \
        while (code && more) {                            \
            size = (pull)((producer), &code, buff, 1024); \
            assert(size < 1024 && !buff[size]);           \
            more = (push)((consumer), code, buff, size);  \
        }                                                 \
        buff[0] = 0;                                      \
        (push)((consumer), YAML_FINISH, buff, 0);         \
    } while(0)

#endif
|
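The hex length argument used by the length/span bytecode ('?' in this revision, ',' in the earlier one) has an unusual convention: leading 'x' characters are placeholders to be ignored, and an all-'x' argument means the length is unknown. A possible decoder, as a Python sketch (the function name is mine):

```python
def parse_length(arg):
    """Decode a span/length argument: hex digits, possibly padded
    with leading 'x' placeholders (to be filled in later); an
    all-'x' argument means the length is unknown (returns None)."""
    digits = arg.lstrip('x')
    if not digits:
        return None          # all placeholders: length unknown
    return int(digits, 16)
```

Against the span example from the first revision, `parse_length("25")` gives 37, `parse_length("x1E")` gives 30, and `parse_length("xxE")` gives 14.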
From: Clark C. E. <cc...@cl...> - 2003-09-21 04:20:22
|
Howdy.  Well, this is it for this weekend; my play day is over.
Anyway, here is what I think a "C" interface for YAML bytecodes could
look like.  In short, the pull/push functions just take a void
pointer and a character buffer.  All further 'API' specification is
delegated to the bytecode definitions.  The buffer can contain one or
more bytecodes and their corresponding data.  There is not much
difference between push and pull, as seen by the macro PULL2PUSH.

I pulled out Syck, created an ext directory called bytecode and
started to hack there for a while (thus motivating the changes
below).  However, Syck uses postorder tree traversal, while the
bytecode specification really assumes preorder.  I think this can be
solved by converting each 'event' into a bytecode buffer which
contains the instructions for that particular node (anchor, alias,
etc.).  And then, when I encounter a collection, I build its buffer
by concatenating.  While this won't be efficient, it should convert
Syck event handling into bytecode buffers....

After converting this, I was thinking of writing a native 'C' level
emitter which used the bytecodes (including the format codes, if they
are provided).  Of course, next weekend I'll focus on the spec...

Best,

Clark

P.S. Thanks _Why for the wonderful parser.

-----

/* yamlbyte.h
 *
 * The YAML bytecode "C" interface header file.  See the YAML bytecode
 * reference for bytecode sequence rules and for the meaning of each
 * bytecode.
 */
#ifndef YAMLBYTE_H
#define YAMLBYTE_H
#include <stddef.h>

/* define what a character is */
typedef unsigned char  yaml_utf8_t;
typedef unsigned short yaml_utf16_t;
#ifdef YAML_UTF8
#ifdef YAML_UTF16
#error Must only define YAML_UTF8 or YAML_UTF16
#endif
typedef yaml_utf8_t yaml_char_t;
#else
#ifdef YAML_UTF16
typedef yaml_utf16_t yaml_char_t;
#else
#error Must define YAML_UTF8 or YAML_UTF16
#endif
#endif

/* specify list of bytecodes */
#define YAML_FINISH      ((yaml_char_t) 0)
#define YAML_DOCUMENT    ((yaml_char_t)'D')
#define YAML_DIRECTIVE   ((yaml_char_t)'V')
#define YAML_PAUSE       ((yaml_char_t)'P')
#define YAML_MAPPING     ((yaml_char_t)'M')
#define YAML_SEQUENCE    ((yaml_char_t)'Q')
#define YAML_END_BRANCH  ((yaml_char_t)'E')
#define YAML_SCALAR      ((yaml_char_t)'S')
#define YAML_CONTINUE    ((yaml_char_t)'C')
#define YAML_NEWLINE     ((yaml_char_t)'N')
#define YAML_NULLCHAR    ((yaml_char_t)'Z')
#define YAML_ALIAS       ((yaml_char_t)'A')
#define YAML_ANCHOR      ((yaml_char_t)'R')
#define YAML_TRANSFER    ((yaml_char_t)'T')
/* formatting bytecodes */
#define YAML_COMMENT     ((yaml_char_t)'c')
#define YAML_INDENT      ((yaml_char_t)'i')
#define YAML_STYLE       ((yaml_char_t)'s')
/* other bytecodes */
#define YAML_LINE_NUMBER ((yaml_char_t)'#')
#define YAML_NOTICE      ((yaml_char_t)'!')
#define YAML_SPAN        ((yaml_char_t)',')
#define YAML_ALLOC       ((yaml_char_t)'@')

/* second level style bytecodes, ie "s>" */
#define YAML_FLOW            ((yaml_char_t)'>')
#define YAML_LITERAL         ((yaml_char_t)'|')
#define YAML_BLOCK           ((yaml_char_t)'b')
#define YAML_PLAIN           ((yaml_char_t)'p')
#define YAML_INLINE_MAPPING  ((yaml_char_t)'{')
#define YAML_INLINE_SEQUENCE ((yaml_char_t)'[')
#define YAML_SINGLE_QUOTED   ((yaml_char_t)39)
#define YAML_DOUBLE_QUOTED   ((yaml_char_t)'"')

typedef const yaml_char_t *yaml_buffer_t;  /* argument to a code */

typedef enum {
    YAML_OK       = 0,   /* proceed */
    YAML_E_MEMORY = 'M', /* could not allocate memory */
    YAML_E_READ   = 'R', /* input stream read error */
    YAML_E_WRITE  = 'W', /* output stream write error */
    YAML_E_OTHER  = '?', /* some other error condition */
    YAML_E_PARSE  = 'P'  /* parse error, check bytecodes */
} yaml_result_t;

/* producer pushes a null terminated buffer filled with one or more
 * bytecode events to the consumer; if the consumer's result is not
 * YAML_OK, then the producer should stop */
typedef void * yaml_consumer_t;
typedef
yaml_result_t
(*yaml_push_t)(
    yaml_consumer_t self,
    yaml_buffer_t   buff
);

/* consumer pulls bytecode events from the producer; in this case
 * the buffer is owned by the producer, and will remain valid till
 * the pull function is called once again; if the buffer pointer
 * is set to NULL, then there are no more results; it is important
 * to call the pull function till it returns NULL so that the
 * producer can clean up its memory allocations */
typedef void * yaml_producer_t;
typedef
yaml_result_t
(*yaml_pull_t)(
    yaml_producer_t self,
    yaml_buffer_t  *buff  /* to be filled in by the producer */
);

/* convert a pull interface to a push interface; the reverse process
 * requires threads and thus is language dependent;
 *
 * NOTE: this has a memory leak because it does not finish
 *       calling the producer when the consumer has a bad
 *       result.  Hmm.
 */
#define YAML_PULL2PUSH(pull,producer,push,consumer,result) \
    do {                                                   \
        yaml_pull_t _pull = (pull);                        \
        yaml_push_t _push = (push);                        \
        yaml_result_t _result = YAML_OK;                   \
        yaml_producer_t _producer = (producer);            \
        yaml_consumer_t _consumer = (consumer);            \
        while(1) {                                         \
            yaml_buffer_t buff = NULL;                     \
            _result = _pull(_producer, &buff);             \
            if (YAML_OK != _result || NULL == buff)        \
                break;                                     \
            _result = _push(_consumer, buff);              \
            if (YAML_OK != _result)                        \
                break;                                     \
        }                                                  \
        (result) = _result;                                \
    } while(0)

#endif
|
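The YAML_PULL2PUSH macro translates almost line for line into a higher-level language. A Python sketch follows; the `(result, buffer)` return shape and the helper names are my own assumptions, mirroring the C signatures where a NULL buffer means the producer is exhausted.

```python
YAML_OK = 0  # mirrors the yaml_result_t success value

def pull2push(pull, push):
    """Drive a pull-style producer into a push-style consumer,
    stopping on the first error or when the producer returns an
    exhausted (None) buffer, just like YAML_PULL2PUSH."""
    while True:
        result, buff = pull()
        if result != YAML_OK or buff is None:
            return result
        result = push(buff)
        if result != YAML_OK:
            return result

# usage sketch: a producer over canned buffers, and a consumer
# that simply collects what it is pushed
def make_pull(buffers):
    it = iter(buffers)
    def pull():
        return (YAML_OK, next(it, None))
    return pull

collected = []
def push(buff):
    collected.append(buff)
    return YAML_OK
```

Running `pull2push(make_pull(["D\nQ\n", "Sone\nE\n"]), push)` returns `YAML_OK` and leaves both buffers in `collected`, in order.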
From: Clark C. E. <cc...@cl...> - 2003-09-22 02:54:36
|
Sorry about being so prolific...  I'm pondering "C" APIs.  Given that
most of the API complexity is wrapped up in the actual bytecodes and
their definitions, sending bytecodes between components is quite
easy.  There are two methods for sending bytecodes: one instruction
at a time, or as a bytecode buffer, which is a chunk of the textual
format.  Each one of these methods has a push and a pull variant.
The method I outlined earlier was the buffer method:

| typedef enum {
|     YAML_OK       = 0,   /* proceed */
|     YAML_E_MEMORY = 'M', /* could not allocate memory */
..
|     YAML_E_PARSE  = 'P'  /* parse error, check bytecodes */
| } yaml_result_t;
|
| typedef const yaml_char_t *yaml_buffer_t;  /* argument to a code */
|
| /* producer pushes a null terminated buffer filled with one or more
|  * bytecode events to the consumer; if the consumer's result is not
|  * YAML_OK, then the producer should stop */
| typedef void * yaml_consumer_t;
| typedef
| yaml_result_t
| (*yaml_push_t)(
|     yaml_consumer_t self,
|     yaml_buffer_t buff
| );
|
| /* consumer pulls bytecode events from the producer; in this case
|  * the buffer is owned by the producer, and will remain valid till
|  * the pull function is called once again; if the buffer pointer
|  * is set to NULL, then there are no more results; it is important
|  * to call the pull function till it returns NULL so that the
|  * producer can clean up its memory allocations */
| typedef void * yaml_producer_t;
| typedef
| yaml_result_t
| (*yaml_pull_t)(
|     yaml_producer_t self,
|     yaml_buffer_t *buff  /* to be filled in by the producer */
| );

In the above case, the buffer contains bytecodes together with the
various scalar values, etc.  The other option is the one which Oren
was proposing to me over the phone (only that he did not use a
structure...  I think the structure is probably useful for building
processing chains rather than using args).

typedef struct yaml_instruction_s {
    yaml_char_t code;
    const yaml_char_t *start;   /* NULL unless bytecode has an argument */
    const yaml_char_t *finish;  /* length of argument is finish - start */
} *yaml_instruction_t;

typedef
yaml_result_t
(*yaml_pullinst_t)(
    yaml_producer_t self,
    yaml_instruction_t *inst
);

typedef
yaml_result_t
(*yaml_pushinst_t)(
    yaml_consumer_t self,
    yaml_instruction_t inst
);

Note it is easy to go from pushbuff_t to pushinst_t because the
intermediate converter need only keep a current instruction as its
data (and the previous producer in the chain); going the other way
probably requires variable length memory allocation.  Is there
another approach?  The goal is to keep the API "RISC" style, generic
across all instructions.  In the pushinst, the only deviation would
be that '<' WHOLE_SCALAR would not have the <<HERE stuff.

Best,

Clark
|
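As the email notes, going from the buffer form to the per-instruction form is the easy direction. A Python sketch of that converter (the function name is mine); it relies on the guarantee that scalar chunks never contain a raw newline, so a simple line split is sufficient:

```python
def split_buffer(buff):
    """Split a bytecode buffer into (code, argument) instructions.
    Each instruction is one line: a single code character followed
    by an optional argument, terminated by '\n'.  Safe because the
    N/Z escaping rules keep raw newlines out of scalar chunks."""
    return [(line[0], line[1:]) for line in buff.split('\n') if line]
```

For example, `split_buffer("D\nSone\nE\n")` yields `[("D", ""), ("S", "one"), ("E", "")]`.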
From: Brian I. <in...@tt...> - 2003-09-22 06:09:28
|
I was discussing the merits of a bytecode API with my friend Colin
Meyer, who is helping with the Perl and Python YAML implementations.

Colin was actually opposed to making a bytecode API our preferred
published API.  Why?  Because a SAX-like API would be more favorable
to the programming community at large, especially for a streaming
interface.

I have to admit that I agree with him.  It would be very nice to have
a "standard" SAX-like streaming API that was implemented across the
various languages.  The one I am currently coding to looks something
like this:

    start_stream()
    end_stream()
    start_document(directives)
    end_document()
    start_mapping(type_uri, anchor, style)
    end_mapping()
    start_sequence(type_uri, anchor, style)
    end_sequence()
    append_scalar(string, buffer_complete, type_uri, anchor, style)
    anchor_alias(anchor)

This is much simpler and probably much more generally useful as a
public streaming interface.  Of course, a defined bytecode is still
very useful.  And it doesn't seem all that difficult to parse a
bytecode stream and report it with an API like the one above.

Cheers, Brian

On 21/09/03 04:23 +0000, Clark C. Evans wrote:
> Howdy.  Well, this is it for this weekend; my play day is over.
> Anyway, here is what I think a "C" interface for YAML bytecodes
> could look like.  In short, the pull/push functions just take a
> void pointer and a character buffer.  All further 'API'
> specification is delegated to the bytecode definitions.  The
> buffer can contain one or more bytecodes and their corresponding
> data.  There is not much difference between push and pull, as
> seen by the macro PULL2PUSH.
>
> I pulled out Syck, created an ext directory called bytecode
> and started to hack there for a while (thus motivating the
> changes below).  However, Syck uses postorder tree traversal,
> while the bytecode specification really assumes preorder.
> I think this can be solved by converting each 'event' into
> a bytecode buffer which contains the instructions for
> that particular node (anchor, alias, etc.).  And then,
> when I encounter a collection, I build its buffer by
> concatenating.  While this won't be efficient, it should
> convert Syck event handling into bytecode buffers....
>
> After converting this, I was thinking of writing a native
> 'C' level emitter which used the bytecodes (including
> the format codes, if they are provided).  Of course, next
> weekend I'll focus on the spec...
>
> Best,
>
> Clark
>
> P.S. Thanks _Why for the wonderful parser.
|
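Driving a SAX-like handler from a bytecode stream is, as Brian suggests, not difficult. A Python sketch follows; the handler method names follow Brian's list, but the dispatch covers only a few core bytecodes (D, M, Q, E, S) and passes None for the attributes a real dispatcher would accumulate from 'T', 'A', and style bytecodes.

```python
class Dispatcher:
    """Translate (code, argument) bytecode instructions into
    SAX-like handler calls.  A stack of open collections decides
    whether 'E' means end_mapping() or end_sequence()."""
    def __init__(self, handler):
        self.handler = handler
        self.stack = []          # open collections: 'M' or 'Q'

    def feed(self, instructions):
        h = self.handler
        for code, arg in instructions:
            if code == 'D':
                h.start_document(arg)
            elif code == 'M':
                self.stack.append('M')
                h.start_mapping(None, None, None)
            elif code == 'Q':
                self.stack.append('Q')
                h.start_sequence(None, None, None)
            elif code == 'E':
                if self.stack.pop() == 'M':
                    h.end_mapping()
                else:
                    h.end_sequence()
            elif code == 'S':
                h.append_scalar(arg, True, None, None, None)
            # formatting and advanced bytecodes are simply ignored

# usage sketch: a handler that just records which methods were called
class Recorder:
    def __init__(self):
        self.calls = []
    def __getattr__(self, name):
        return lambda *args: self.calls.append(name)
```

Feeding `[("D", ""), ("M", ""), ("S", "a"), ("S", "b"), ("E", "")]` to a `Dispatcher(Recorder())` produces the call sequence start_document, start_mapping, append_scalar, append_scalar, end_mapping.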
From: Oren Ben-K. <or...@be...> - 2003-09-22 06:30:50
|
Brian Ingerson wrote:
> I was discussing the merits of a bytecode API with my friend
> Colin Meyer who is helping with the Perl and Python YAML
> implementations.
>
> Colin was actually opposed to making a bytecode API be our
> preferred published API.  Why?  Because a SAX-like API would be
> more favorable to the programming community at large.
> Especially for a streaming interface.

Two points:

- I'd like to reserve judgment on the details of the bytecodes API
  until I get to the point where I'm implementing it.

- I don't think there's a "preferred" API.  There's a "pull" API,
  which is something along the lines proposed by Clark, and there's
  a "push" API, which is something along the lines of your post.
  Both are necessary; neither is preferred.

As a side note, it is trivial to implement a module that takes the
pull API and emits the push API.  If this is the way that the push
(SAX) API is implemented, then the resulting system provides both
APIs "for free".  The other direction is harder (it requires threads
or co-routines).  This is the only reason the "pull" API is
emphasized at this point in time.  It is an internal implementation
issue, and it should be completely irrelevant to coders when deciding
which API to use.  Their choice of API should only be determined by
what's easier/better suited to their application.

FWIW, I think that given a decent pull API, people will quickly
discover that for a great many use cases it results in simpler code.
Maintaining state on a stack is so much easier than simulating one
yourself in the data members of a tree-walking object...  That said,
there are cases where using a push API is simpler.  It all depends on
the application.  I think YAML should be neutral here.

> ...  It would be very nice
> to have a "standard" SAX-like streaming API, that was
> implemented across the various languages.

+1, as long as it is clear that the two approaches complement, rather
than contradict, each other.

Have fun,

Oren Ben-Kiki
|
From: Clark C. E. <cc...@cl...> - 2003-09-22 14:27:58
|
On Sun, Sep 21, 2003 at 11:09:23PM -0700, Brian Ingerson wrote:
| I was discussing the merits of a bytecode API with my friend Colin
| Meyer who is helping with the Perl and Python YAML implementations.
| Colin was actually opposed to making a bytecode API be our
| preferred published API.  Why?  Because a SAX-like API would be
| more favorable to the programming community at large.  Especially
| for a streaming interface.

Well, I am not in disagreement -- the bytecode API breaks a very nice
(transfer, anchor, style, data) reporting unit into a bunch of parts,
and this puts an extra burden on the sender and receiver.  However,
for an API that keeps all of these pieces together, I think we can do
better than a SAX-like API; see below.

| start_stream()
| end_stream()
| start_document(directives)
| end_document()
| start_mapping(type_uri, anchor, style)
| end_mapping()
| start_sequence(type_uri, anchor, style)
| end_sequence()
| append_scalar(string, buffer_complete, type_uri, anchor, style)
| anchor_alias(anchor)

typedef struct yaml_event {
    enum {
        BEGIN_STREAM, BEGIN_DOCUMENT, END_DOCUMENT, PAUSE_STREAM,
        BEGIN_MAPPING, END_MAPPING, BEGIN_SEQUENCE, END_SEQUENCE,
        APPEND_SCALAR, ALIAS
    } code;
    char *anchor;    /* used by SCALAR, MAPPING, SEQUENCE, ALIAS */
    char *transfer;  /* used by SCALAR, MAPPING, SEQUENCE */
    char *style;     /* used by SCALAR, MAPPING, SEQUENCE */
    char *chunk;     /* used by SCALAR */
    bool continued;  /* used by SCALAR */
} * yaml_event_t;

typedef enum {
    YAML_OK, YAML_E_MEMORY, YAML_E_WRITE, YAML_E_READ
} yaml_error_t;

typedef void * yaml_producer_t;
typedef void * yaml_consumer_t;

typedef yaml_error_t (*yaml_push_t)(yaml_consumer_t self,
                                    yaml_event_t event);
typedef yaml_error_t (*yaml_pull_t)(yaml_producer_t self,
                                    yaml_event_t *event);

/*
 * 1. In both cases, the producer owns the event's memory and buffer.
 * 2. For the push case, the buffer is only valid for the length
 *    of the event; it should be copied if you desire to keep it
 *    beyond this.
 * 3. For the pull case, the buffer is valid till the next call
 *    to the producer.
 * 4. In either case, if the buffer is NULL then the stream has
 *    finished.
 * 5. In the push case, if the consumer returns a value which is
 *    not YAML_OK, then the producer should abort and not continue
 *    to push events.
 * 6. For the pull case, if the producer returns a value which
 *    is not YAML_OK, then the consumer should not continue to
 *    pull events.
 * 7. The actual event structure could be variable length, ie,
 *    no reason why an ALIAS has to have space for the transfer,
 *    etc.  Although it probably does not hurt.
 */

This sort of API, while keeping the "chunking" property of SAX, has
several advantages over a SAX-like API:

 1. The difference between push and pull is very small, so that
    people can become quickly familiar with both and use the one
    that best fits their needs.

 2. There are only two APIs and one data structure to "wrap" when
    linking this with a higher level language; further events can
    be added by the library (new event codes) without requiring
    additional functions to be wrapped.

 3. One can add a few more "fields" to the event_t structure to get
    a DOM-like structure; thus this (sequential) API becomes a
    subset of a more general random access API.

 4. If you add a 'length' attribute to the data structure, you have
    a "binary" stream.

Anyway, this is very similar to my original proposal, before I tried
to make the whole thing more "character" friendly.

| Of course, a defined bytecode is still very useful.  And it doesn't
| seem all that difficult to parse a bytecode stream and report it
| with an API like the one above.

Yes.  I think that they satisfy two entirely different constraints.
However, I would rather have an "event structure" based API instead
of a bunch of SAX-like events.  Does this make sense to you?  Do you
agree?  Or what does SAX-like event handling give you that the above
doesn't?

Thanks,

Clark
|
From: Clark C. E. <cc...@cl...> - 2003-09-27 00:03:06
|
#
# My notes and thoughts regarding an IRC chat and phone
# followup with Brian; note this email sets direction, so
# I could use some feedback to make sure that I got our
# overall agreements correct. Also, some of these items
# were done in a Brian/Clark phone chat, so Oren did not
# have any feedback on them. Lastly, I added tasks for
# Why... namely adding a 'begin' notification for each
# branch to his API so that a preorder tree traversal
# can be done if possible. In any case, this is probably
# not totally correct, as I have a very selective memory
# (especially when pushing my agenda)...
#
# I'll be putting out another API proposal in the next
# day or so, so I'd rather have feedback on that thread
# if your comments are directed towards the API.
#
---
time: 2003-09-26 18:00:00Z # (2PM EST, 11AM PST, 11PM IST)
subject: Specification and APIs
where: IRC
who: Oren, Clark, Brian
topics:
- name: YAML Conference
  desc: >
    Brian suggested that we have a YAML conference in February,
    and he is talking to coordinators. By this conference we
    would like (a) a formal, done-deal spec, and (b) parsers,
    APIs, and other tools. Clark is now putting 1-2d of work
    per week into YAML till further notice (until things get
    tight).
- name: specification
  desc: >
    We discussed further work to do on the specification.
    Everyone agreed that we want a Release Candidate very soon
    (by the end of this weekend would be ideal). Clark
    expressed that he has a problem with the model section --
    echoing discomfort Brian was expressing several months
    ago. In particular, Clark is not happy with the merger of
    two important but orthogonal issues: (a) the syntax,
    serial, and graph models; and (b) the generic vs. native
    binding of those models. In particular, he would like to
    separate out the "native" talk into a separate paragraph
    (under a different numbered title).
    Clark plans to keep the diagram which shows:

      syntax -> (parser)  -> serial -> (loader) -> graph
      syntax <- (emitter) <- serial <- (dumper) <- graph

    and also add a diagram which is a "stack" of attributes
    added at each level: when you move from the graph to the
    serial model, key ordering and aliases are added; and
    moving from the serial to the syntax model adds style,
    etc. The generic vs. native binding section will talk
    primarily about the type family and hopefully will be
    quite brief. Part of the "insight" gained recently is that
    the syntax, serial, and graph models may all have native
    bindings of one sort or another. Clark expects this to be
    a 4-6h update and hopes to finish it this weekend. Brian
    and Oren both agreed that we need a quick run-through of
    the current spec before it goes out as a release
    candidate. Once it is at release candidate, the only
    changes to the spec will be bug fixes, spelling mistakes,
    and clarifications.
  task:
  - Clark to make the above discussed changes (4-6h)
  - Oren/Brian to review changes before we go to RC.
  - Clark to update spec for release candidate status.
- name: parsers
  desc: >
    Oren and Brian talked quite a bit about parsers, in
    particular the idea that both of them are building new
    parsers. Our current "recommended" parser is Why's Syck.
    The biggest part of this conversation was resolving the
    concern that there may be overlap between the projects;
    and while this is open source, it would be best to
    coordinate our efforts. As it turns out, Brian is
    attempting to build a regex-based parser, with state
    tables and regular expressions. His ideal parser is a tiny
    bit of code for each language, plus a large state table in
    YAML (which can be shipped with a parser as a native
    language structure). In this way Brian's parser is 'pure'
    within the given language (assuming the language has a
    PCRE regular expression engine).
    Oren, on the other hand, is developing a native "C"
    parser, but as part of his project he is developing a pull
    version of lex/yacc. Therefore, Oren's parser may take
    some time to develop. Luckily, Why has a working push
    parser that satisfies most constraints, with Ruby and
    Python bindings. Brian and I later decided in a phone chat
    afterward to add a Perl binding to Syck to cover the
    'short-term' needs of our constituency. We still lack
    emitters...
  task:
  - Oren to continue to work on his pull-based parser
  - Brian to continue on his regex parser
  - Brian and Clark to work on a Perl port of Syck this
    weekend, and perhaps the following weekend
- name: APIs
  desc: >
    We have a resource allocation problem: namely, we have
    several YAML-related products but no way to merge them.
    For example, we already have 2 Python loaders which do not
    share the same code base. We also need to foster a
    situation where others can join in and build tools that
    work with YAML. So, we need an API. In this way Clark and
    others can work on emitters and validators, while Brian
    and Oren continue on parsers. Also, with an API, we can
    separate the task of building parsers from that of
    building loaders, so that language bindings can work with
    many different parsers. It is clear that this API should
    be "C" for the greatest amount of interoperability. There
    are actually three APIs which could emerge: (a) a
    syntax-level API for the lexer/low-level parser, (b) a
    serial API for communication between the parser, loader,
    and emitter, and (c) a graph API for writing YPath and
    other random-access tools. For now, (c) is a very low
    priority and (a) can be specified by Oren with his parser.
    The key API to specify is the serial API. When developing
    the serial "C" API there are two general types of APIs:
    push and pull.
    They can be unified by having a single "event" data
    structure: for the pull (iterator) API, the next_event()
    function returns a pointer to this event structure; for
    the push (notify) API, the event_handler() function takes
    a pointer to this event structure. In this way, a
    pull2push converter is:

      while (1) { consumer.event_handler(producer.next_event()); }

    For the serial "C" API, this event structure would contain
    just about everything in a single event, thus:

      struct event {
          enum { BEGIN_MAP, SCALAR, ... } code;
          const char *value;   /* used for scalars */
          const char *anchor;
          const char *transfer;
      };

    This is in contrast to the API for the syntax model, which
    would "pivot" anchor and transfer into separate events.
    With some careful design we could have a unified model
    that goes both ways (using a "C" union construct). When
    talking with Oren, we concluded that in addition to the
    next() function, the pull API requires a close() function
    so that the consumer can let the producer "clean up" its
    memory allocations. In the same way, the push API could
    either have a stop() function (or it could use the return
    value of the event_handler). In both of these interfaces,
    there can be an event type for parsing errors; thus a
    separate function is not required. In later musings
    (following the producer/consumer model in Twisted) the
    push interface could have an additional resume() function,
    and stop() would be split into pause() and close(), where
    close() stops and cleans up memory and pause() only stops.
    So, with a serial API, we can move forward with building
    separate tools. A good migration plan is to come up with
    some preliminary agreement on the API (Clark's chore) and
    then modify Syck to use this API. Then, Clark could write
    a Python binding to the API and Brian could write a Perl
    binding, etc. Also, at this point, a separate emitter and
    a yamltools.lib could be made which provides pull2push and
    push2pull (using pthreads) converters.
  task:
  - Clark to attempt another round of API specifications, now
    with feedback from Brian and Oren.
  - Why to help Clark bolt the API onto Syck
  - Clark to convert Syck's Python binding to use the API
  - Clark/Brian to convert the soon-to-be-done Perl binding
    to Syck
  - Oren to provide feedback for what is necessary to
    extend/limit the API for a serial model API
|