From: Clark C. E. <cc...@cl...> - 2003-09-08 00:09:31
|
Howdy. Been thinking about a YAML bytecode specification, mostly as a way to commonly express APIs in a manner which is more independent of language, independent of push vs pull semantics, and in a way such that various tools (such as a schema validator or formatter) can be introduced as components within a YAML processing framework. The core of this idea is to represent YAML information as "pre-parsed" instructions, indicating document boundaries, comments, scalars, sequences, mappings, etc. To allow for incremental delivery, the stream is delivered in a buffer which can be refilled as many times as required. In a simple case, the entire YAML stream may fit in a single buffer; in other cases, the buffer could be filled with exactly one document at a time, or, more granularly, large scalars can be split into chunks. In this way the actual call API is reduced to a very small, limited set of functions: one for push, one for pull, and an optional one to expand a buffer.

A surprising result of this method emerged almost immediately: style instructions are thought of more as a mechanism to describe the "paintbrush" used, so that setting a literal block style for a given transfer method is one instruction which then applies to the remainder of the stream. This allows styles, comments, and other presentation requirements to be entirely distinct instructions which can be ignored by a processor, or injected into an existing stream by a YAML 'painter', or interpreted by the printer.

Anyway, I'd love all of your feedback; so included below is a "C" header file which attempts to describe this yaml bytecode layer. I was thinking that I could "implement" this by hooking up Syck to produce the bytecode stream from its events. Following this I could write a Python wrapper for the bytecode layer, and then perhaps a pretty printer, etc. More advanced components would be a YPath processor, Schema validator, etc. Although, perhaps this is the wrong direction. I'm not sure. 
Following is the header file... Best, Clark

/*
 * yamlbyte.h
 *
 * This defines the layout of a YAML bytecode stream, aka "binary YAML",
 * which has the following goals:
 *
 * parsed: >
 *   Tools written to use this bytecode stream (loaders, filters, etc.)
 *   need not concern themselves with the complexity of styles and the
 *   various details of the YAML syntax.  While the bytecodes will
 *   themselves offer complexity, the complexity can be better organized
 *   and would reflect the complexity of any other form of API.
 *
 * push or pull: >
 *   By sticking with a 'data' based API, the differences between push
 *   and pull parsers can be limited to a single API call distinction.
 *   A push parser would call an event handler with a buffer holding
 *   one or perhaps more bytecode instructions, where a pull parser
 *   would be called with a buffer and could fill the buffer with as
 *   many bytecode instructions as would fit.
 *
 * full or streaming: >
 *   With a bytecode instruction layout, a parser could return an
 *   entire YAML document as a single result; or it could return
 *   a YAML document in chunks.  In essence, a language could
 *   provide a SAX-like or DOM-like API on top of these bytecodes.
 *   For a streaming API, the buffer could be 'moving', and for a
 *   DOM-like interface, the buffer could be static.
 *
 * language independent: >
 *   By specifying a YAML "preparsed" structure as a sequential memory
 *   block, each language could wrap the structure or layer the APIs
 *   on the structure as would seem appropriate to that language rather
 *   than having a single API.  In particular, a single YAML bytecode
 *   stream could even be streamed between different applications
 *   via shared memory or a pipe.
 *
 * tool support: >
 *   As a least-common denominator, several tools, such as a schema
 *   validator, path expression evaluator, or even transformation
 *   tools could be written to operate directly on the bytecodes.
 *   In this way, each language could wrap a "C" function to perform
 *   these mutations on the bytecodes without having to write the
 *   code themselves.
 *
 * efficiency: >
 *   While it is not the goal of YAML bytecodes, some people could
 *   argue that a "binary" format is a more efficient wire format.  In
 *   any case, the bytecode format can offer some optimizations to
 *   make it smaller for the most prominent situations.
 *
 * compatibility: >
 *   It is possible to provide for standardized mappings of other
 *   syntaxes (such as XML, RDF, SOAP, BER, etc.) into YAML bytecodes,
 *   in essence, allowing for different parsers and emitters.
 *
 * TODO: produce an extensive set of before/after samples with
 * differing buffer sizes to demonstrate how chunking works.
 */
#include <stdint.h>

typedef uint16_t yaml_bytecode_t;
/*
 * A YAML bytecode is a 16 bit value defined as a bitmask field.
 */

typedef uint16_t yaml_argone_t;
/*
 * Immediately following a bytecode is an optional argument to the
 * bytecode.  For an error message bytecode, this would be the
 * specific error message number.  For a typed node, this is a
 * handle to a transfer method.
 *
 * Note: if the value for argone (or argtwo or length) is greater
 * than 65533 (sixteen bits minus two), then the value is 0xFFFF
 * and a 64 bit value is stored at the beginning of the variable
 * length content.
 */

typedef uint16_t yaml_argtwo_t;
/*
 * Immediately following the first argument is a second optional
 * argument, the meaning of which depends upon the bytecode.  For
 * scalar and branch content nodes, this is a handle which indicates
 * that the given node may be referenced later.  For an alias node,
 * this is the previously referenced handle.
 */

typedef uint16_t yaml_length_t;
/*
 * To support variable length information, the last 16 bit piece
 * of information following the bytecode is a length, in 64 bit
 * chunks, till the next bytecode.  For scalar values or error
 * messages, this is the length of a UTF-16 encoded textual value.
 * All text values are null terminated, and thus, the worst case
 * terminator is a 64 bit zero.  Following the null terminated
 * string value may be the overflow of argone or argtwo.
 *
 * Note: Even if the length is zero, argone or argtwo or both
 * may overflow, and in this case, the actual size of the
 * bytecode may be longer than 64 bits.  The overflows
 * are stored immediately following the bytecode word
 * as length, argone, argtwo respectively.
 */
struct yaml_word {
    /*
     * A YAML bytecode stream is then a sequence of the above four
     * items packed into a 64 bit value which, when streamed, uses
     * little-endian byte order.
     *
     * There are several design decisions at work with this structure:
     * first, many computers use 64 bits for struct alignment.  Second,
     * 64 bit computers are on their way and thus having 64 bit lengths
     * and values is a real possibility.  Third, in YAML data, most
     * content nodes are short and most documents are relatively small;
     * thus, it would be ideal if most non-scalar nodes fit in 64 bits.
     */
    yaml_bytecode_t bytecode;
    yaml_argone_t   argone;
    yaml_argtwo_t   argtwo;
    yaml_length_t   length;
};

typedef uint64_t yaml_word_t;
typedef yaml_word_t *yaml_word_p;
typedef uint64_t yaml_size_t;   /* length in sizeof(yaml_word_t) units */

/*
 * YAML bytecodes start with a section specifying branching
 * information and how a given YAML node may be spread across
 * two or more buffers.
 *
 * In particular, when a scalar node will not fit within the
 * remaining part of the current buffer, it must be marked with
 * the SPLIT bit flag, and then in the next buffer, the same node
 * will continue and will be marked as RESUME.  If a branch node
 * has a child which is split, then it is also split, with a
 * pairing FINISH+SPLIT bytecode at the end of the current buffer,
 * and a START+RESUME bytecode at the beginning of the next buffer.
 */
#define YAML_START  ((yaml_bytecode_t)(0x1000))
        /* given bytecode is a branch and may contain other bytecodes;
         * the length signifies distance to a paired 'FINISH' bytecode */
#define YAML_FINISH ((yaml_bytecode_t)(0x2000))
        /* signifies the finish of a branch, length is always zero */
#define YAML_EMPTY  (YAML_START | YAML_FINISH)
        /* an empty branch, length is always zero */
#define YAML_SPLIT  ((yaml_bytecode_t)(0x4000))
        /* node did not fit in the buffer, it spills over into
         * the next buffer */
#define YAML_RESUME ((yaml_bytecode_t)(0x8000))
        /* a continuation of an incomplete node that was broken
         * because it did not fit within the buffer */

/*
 * YAML bytecodes are partitioned into four general categories,
 * each addressing a particular aspect of a YAML stream.  It
 * is invalid to combine any of these flags; this allows these
 * flags to be checked by the bitwise-and operator.
 */
#define YAML_CONTROL ((yaml_bytecode_t)(0x0100))
        /* this is your general flow control instruction, which
         * is used to signal document boundaries, registration of
         * type families and other non-content but important signals */
#define YAML_CONTENT ((yaml_bytecode_t)(0x0200))
        /* primary content of the yaml stream including scalars and
         * collections; also including alias handling */
#define YAML_STYLE   ((yaml_bytecode_t)(0x0400))
        /* comments, insignificant whitespace, and styling instructions
         * which can all be safely stripped or added by a formatting
         * layer interested in human presentation */
#define YAML_MESSAGE ((yaml_bytecode_t)(0x0800))
        /* error messages, warnings, debug traces, and other
         * informational notices which are not style, content, or
         * control */

/*
 * YAML_CONTROL
 */
#define YAML_BUFFER (YAML_CONTROL | 0x0010)
        /* A YAML bytecode stream can be delivered in a series of
         * convenient buffer-sized chunks; this is always the first and
         * last bytecode in the stream.  The leading bytecode must
         * have a non-zero length which jumps to the trailing bytecode.
         * Between these two bytecodes is the content of the chunk,
         * formatted as a series of bytecodes. */
#define YAML_DOCUMENT (YAML_CONTROL | 0x0020)
        /* Immediately following a buffer is the document bytecode which
         * signifies the begin/end of a document.  This is also a branch. */
#define YAML_INTERN (YAML_CONTROL | 0x0040)
        /* instructs the interpreter to allocate memory for a string value
         * and place it into memory for later reference; strings pinned
         * with this instruction must stay in memory for the entire length
         * of the YAML bytecode stream */
#define YAML_TRANSFER (YAML_INTERN | 0x0001)
        /* indicates that the interned value is a transfer method
         *
         * For this instruction, argone is the number of the storage
         * location where the transfer method will go.  Values of
         * 0x3FFF and below are reserved for built-in YAML approved
         * transfer methods, and across a stream transfer method
         * handles may not be recycled. */
#define YAML_CONSTANT (YAML_INTERN | 0x0002)
        /* indicates that the interned value is a 'prealias', a well
         * known constant used by a schema or transformation tool
         *
         * Many schemas use special key names to key on, and this allows
         * a YAML processor to be 'initialized' with those scalars which
         * will be frequently used.  In a manner similar to YAML_TRANSFER,
         * argtwo is used to name the alias handle for the constant.
         * It is an error to reassign alias handles set with this
         * instruction (although they can be reset with other aliases) */
#define YAML_OPTIMIZATION (YAML_CONTROL | 0x0080)
        /* indicates an instruction which represents some sort of
         * optimization in the structure for better lookups, etc.;
         * these items are allowed to have machine specific byte
         * orderings and should be ignored in general
         * TODO: this is just musing...
         */
#define YAML_BINARYEQUIV (YAML_OPTIMIZATION | 0x0001)
        /* for built-in well known transfer methods, like integer,
         * the payload of this instruction could be a 'binary'
         * version instead of the textual representation
         * TODO: this is just musing */
#define YAML_HASHTABLE (YAML_OPTIMIZATION | 0x0002)
        /* this can follow a mapping node to provide a hashtable...
         * TODO: this is just musing */

/*
 * YAML_CONTENT
 */
#define YAML_ALIAS (YAML_CONTENT | 0x0010)
        /* indicates an alias to a previously marked node; argtwo
         * contains the alias handle used to look up the replacement */
#define YAML_SCALAR (YAML_CONTENT | 0x0020)
        /* indicates a scalar node; argone contains the transfer method,
         * and argtwo can be used to mark the node with an alias handle
         * so that the scalar can be referenced later */
#define YAML_BRANCH (YAML_CONTENT | 0x0040)
        /* indicates a branch node; just like scalar, argone contains the
         * transfer method and argtwo can be used to mark the node for
         * use as an alias later. */
#define YAML_SEQUENCE (YAML_BRANCH | 0x0001)
        /* indicates the sequence structure, as an ordered list of
         * nodes; obviously only content nodes count as content. */
#define YAML_MAPPING (YAML_BRANCH | 0x0002)
        /* indicates a mapping structure, as a list of pairs
         * TODO: can this be done any differently? */

/*
 * YAML_STYLE
 */
#define YAML_IGNORABLE (YAML_STYLE | 0x0010)
        /* comments, whitespace, etc.
         */
#define YAML_COMMENT (YAML_IGNORABLE | 0x0001)
        /* a single line comment */
#define YAML_WHITESPACE (YAML_IGNORABLE | 0x0008)
        /* insignificant whitespace; note that a given scalar
         * can be split over several instructions with whitespace
         * instructions placed strategically in the middle,
         * argone contains how many of those spaces */
#define YAML_BREAK (YAML_WHITESPACE | 0x0002)
        /* a line break, argone says how many line breaks */
#define YAML_INDENT (YAML_WHITESPACE | 0x0004)
        /* sets the number of spaces to use for future indentation
         * with argone */
#define YAML_BLOCK (YAML_STYLE | 0x0020)
        /* instruction that changes the 'paint brush' to use the block
         * style when printing nodes
         *
         * argone: if provided, limits the paint brush to the particular
         *         transfer method specified
         * argtwo: if provided, this is filled with the bytecode mask
         *         to apply against, for example YAML_BLOCK or YAML_SCALAR. */
#define YAML_BLOCK_FOLDED (YAML_BLOCK | 0x0001)
        /* specialization of block style that specifies folded scalar */
#define YAML_BLOCK_LITERAL (YAML_BLOCK | 0x0002)
        /* specialization of block style that specifies literal scalar */
#define YAML_FLOW (YAML_STYLE | 0x0040)
        /* similar to YAML_BLOCK only it sets the paint brush to
         * use the flow style */
#define YAML_FLOW_PLAIN (YAML_FLOW | 0x0001)
        /* specialization of flow scalar that indicates the plain style;
         * of course an error could be created if this isn't possible.
         */
#define YAML_FLOW_SINGLE_QUOTE (YAML_FLOW | 0x0002)
        /* specialization of flow scalar style to indicate single quoted */
#define YAML_FLOW_DOUBLE_QUOTE (YAML_FLOW | 0x0004)
        /* specialization of flow scalar style to indicate double quoted */

/*
 * YAML_MESSAGE
 */
#define YAML_NOTICE (YAML_MESSAGE | 0x0010)
        /* specifies an informational message which should be sent to
         * the user if it is not understood
         *
         * argone: holds an error number
         * argtwo: holds a line number
         * value:  holds the message text */
#define YAML_WARNING (YAML_NOTICE | 0x0001)
        /* an unexpected event happened, but processing will continue */
#define YAML_ERROR (YAML_NOTICE | 0x0002)
        /* the producer will stop producing further instructions for
         * the current document and will move on to the next document */
#define YAML_FATAL (YAML_ERROR | 0x0004)
        /* previous nodes may have been invalid and the producer
         * will stop producing further instructions */
#define YAML_APPNOTICE (YAML_NOTICE | 0x0008)
        /* a mixin flag to indicate that the given notice was not
         * produced by the parser, but rather by an application;
         * in this case, argone should be > 0x3FFF */

/*
 * producer/consumer API
 *
 * The cool part of YAML bytecodes is the simple "C" call API, since
 * most of the complexity is moved into the bytecodes.  There are two
 * forms of the API, a push and a pull interface.  The yaml_push_t is
 * a callback function called again and again by the producer, while
 * yaml_pull_t is a pull function called again and again by the
 * consumer.  The push function simply sends the consumer's structure
 * back to them along with the next buffer; it responds with a buffer
 * containing YAML_MESSAGE instructions, or NULL if all is OK.
 * The pull function is passed an empty buffer and fills it; if a
 * resize function is passed, it can use this to resize the buffer
 * as required.  The pull function simply returns the same buffer
 * it was passed (or the resized one); any messages to the consumer
 * will be in the buffer.
 */
typedef yaml_word_p yaml_buffer_t;   /* first instruction is YAML_BUFFER */
typedef void * yaml_producer_t;      /* someone producing YAML buffers */
typedef void * yaml_consumer_t;      /* someone consuming YAML buffers */
typedef yaml_buffer_t (*yaml_realloc_t)(yaml_buffer_t buff,
                                        yaml_size_t newsize);
typedef yaml_buffer_t (*yaml_push_t)(yaml_consumer_t sink,
                                     yaml_buffer_t buff);
typedef yaml_buffer_t (*yaml_pull_t)(yaml_producer_t source,
                                     yaml_buffer_t buff,
                                     yaml_size_t size,
                                     yaml_realloc_t *realloc);

/*
 * Various helper macros/functions for operating on these data
 * structures.
 */
typedef int yaml_bool_t;
#define YAML_OVERFLOW ((uint16_t)(0xFFFF))

#ifdef YAML_SAFE
/* safe versions of various helper macros; these are
 * actual functions which must be called so that they
 * can check arguments and appear in the stack trace;
 * also, ideally these can call an error function
 * to invoke an exception handler (long jump)
 */
extern yaml_word_t YAML_MAKE_WORD(
    yaml_bytecode_t bytecode,
    yaml_argone_t argone,
    yaml_argtwo_t argtwo,
    yaml_length_t length
);
extern yaml_length_t   YAML_LENGTH(yaml_word_t word);
extern yaml_argone_t   YAML_ARGONE(yaml_word_t word);
extern yaml_argtwo_t   YAML_ARGTWO(yaml_word_t word);
extern yaml_bytecode_t YAML_BYTECODE(yaml_word_t word);
/* helper items for dereferencing and checking for overflow */
extern yaml_word_t YAML_DEREF_WORD(yaml_word_p word);
extern yaml_bool_t YAML_OVERFLOW_LENGTH(yaml_word_p pword);
extern yaml_bool_t YAML_OVERFLOW_ARGONE(yaml_word_p pword);
extern yaml_bool_t YAML_OVERFLOW_ARGTWO(yaml_word_p pword);
/* methods to handle dereferencing from a pointer and overflows */
extern yaml_bytecode_t YAML_GET_BYTECODE(yaml_word_p pword);
extern yaml_word_t YAML_GET_LENGTH(yaml_word_p pword);
extern yaml_word_t YAML_GET_ARGONE(yaml_word_p pword);
extern yaml_word_t YAML_GET_ARGTWO(yaml_word_p pword);
extern yaml_word_t YAML_GET_SIZE(yaml_word_p pword);
#else
/* the macro equivalents of the items above, without error checking
 * (note: the fields are widened to yaml_word_t before shifting, since
 * shifting a 16 bit value left by 48 is undefined) */
#define YAML_MAKE_WORD(bytecode, argone, argtwo, length) \
    ((yaml_word_t) \
     ((((yaml_word_t)(bytecode)) << 48) + \
      (((yaml_word_t)(argone))   << 32) + \
      (((yaml_word_t)(argtwo))   << 16) + \
      ((yaml_word_t)(length))))
#define YAML_LENGTH(word) \
    ((yaml_length_t)((((yaml_word_t)(word)) << 48) >> 48))
#define YAML_ARGTWO(word) \
    ((yaml_argtwo_t)((((yaml_word_t)(word)) << 32) >> 48))
#define YAML_ARGONE(word) \
    ((yaml_argone_t)((((yaml_word_t)(word)) << 16) >> 48))
#define YAML_BYTECODE(word) \
    ((yaml_bytecode_t)(((yaml_word_t)(word)) >> 48))
#define YAML_DEREF_WORD(pword) (*((yaml_word_p)(pword)))
#define YAML_OVERFLOW_LENGTH(pword) \
    ((yaml_bool_t)(YAML_OVERFLOW == YAML_LENGTH(YAML_DEREF_WORD(pword))))
#define YAML_OVERFLOW_ARGONE(pword) \
    ((yaml_bool_t)(YAML_OVERFLOW == YAML_ARGONE(YAML_DEREF_WORD(pword))))
#define YAML_OVERFLOW_ARGTWO(pword) \
    ((yaml_bool_t)(YAML_OVERFLOW == YAML_ARGTWO(YAML_DEREF_WORD(pword))))
#define YAML_GET_BYTECODE(pword) YAML_BYTECODE(YAML_DEREF_WORD(pword))
#define YAML_GET_LENGTH(pword) \
    (YAML_OVERFLOW_LENGTH(pword) ? \
     ((yaml_word_t)(*(((yaml_word_p)(pword)) + 1))) : \
     ((yaml_word_t)YAML_LENGTH(YAML_DEREF_WORD(pword))))
#define YAML_GET_ARGONE(pword) \
    (YAML_OVERFLOW_ARGONE(pword) ? \
     ((yaml_word_t)(*(((yaml_word_p)(pword)) + 1 \
                     + YAML_OVERFLOW_LENGTH(pword)))) : \
     ((yaml_word_t)YAML_ARGONE(YAML_DEREF_WORD(pword))))
#define YAML_GET_ARGTWO(pword) \
    (YAML_OVERFLOW_ARGTWO(pword) ? \
     ((yaml_word_t)(*(((yaml_word_p)(pword)) + 1 \
                     + YAML_OVERFLOW_LENGTH(pword) \
                     + YAML_OVERFLOW_ARGONE(pword)))) : \
     ((yaml_word_t)YAML_ARGTWO(YAML_DEREF_WORD(pword))))
#define YAML_GET_SIZE(pword) \
    (YAML_GET_LENGTH(pword) \
     + YAML_OVERFLOW_LENGTH(pword) \
     + YAML_OVERFLOW_ARGONE(pword) \
     + YAML_OVERFLOW_ARGTWO(pword))
#endif |
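To make the 64-bit word layout above concrete, here is a minimal sketch of the pack/unpack operations as plain C functions rather than macros. This is an illustrative re-statement of the proposed YAML_MAKE_WORD and accessor macros; the `yb_` names are hypothetical and not part of the header.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the four 16-bit fields into one 64-bit word, bytecode in the
 * most significant bits, mirroring the proposed layout: each field is
 * widened to 64 bits before shifting. */
static uint64_t yb_make_word(uint16_t bytecode, uint16_t argone,
                             uint16_t argtwo, uint16_t length)
{
    return ((uint64_t)bytecode << 48) | ((uint64_t)argone << 32) |
           ((uint64_t)argtwo  << 16) |  (uint64_t)length;
}

/* Unpack the fields again; truncation to uint16_t masks the rest. */
static uint16_t yb_bytecode(uint64_t w) { return (uint16_t)(w >> 48); }
static uint16_t yb_argone(uint64_t w)   { return (uint16_t)(w >> 32); }
static uint16_t yb_argtwo(uint64_t w)   { return (uint16_t)(w >> 16); }
static uint16_t yb_length(uint64_t w)   { return (uint16_t)w; }
```

A round trip through these functions recovers each field, e.g. packing a YAML_SCALAR (0x0220) with argone 7, argtwo 42, length 3 yields the word 0x02200007002A0003.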
From: why t. l. s. <yam...@wh...> - 2003-09-09 02:40:34
|
On Sunday 07 September 2003 06:13 pm, Clark C. Evans wrote:
> Been thinking about a YAML bytecode specification ...

heck of a good idea. i've always liked that Python's pickle has both binary and human-readable representations. i wonder if pickle could be leveraged? perhaps your vision of it is greater than i see on my simple scan of the header file...

part of me thinks you invented this idea because you're a COMPLETE control freak. and the only way to ensure proper parsing/loading/emitting is to verify exact bytecode at a given stage of the process. :D

well, good work. i'm going to look through the header file in detail when i get more time.

_why |
From: Clark C. E. <cc...@cl...> - 2003-09-09 09:01:45
|
On Mon, Sep 08, 2003 at 06:08:44PM -0600, why the lucky stiff wrote:
| i've always liked that Python's pickle has both binary and human-readable
| representations. i wonder if pickle could be leveraged? perhaps your vision
| of it is greater than i see on my simple scan of the header file...

Well, I would not credit me with too much vision... your quote below is probably closer to the truth.

| part of me thinks you invented this idea because you're a COMPLETE control
| freak. and the only way to ensure proper parsing/loading/emitting is to
| verify exact bytecode at a given stage of the process. :D

Anyway. I've gotten feedback from two people thus far. Nathan (adiabatic) made a few great points on IRC: (a) the number of bytecodes seems a bit up there, (b) he thinks it should be big endian for the network byte order, (c) he would rather see UTF-8 (or a UTF-8/16 switch). The whole encoding issue and how to store 'instruction length' in a platform independent way is tough. There are probably only a few 'unnecessary' bytecodes in the .h file; I was just musing on a few (esp the intern stuff). In general, I think that the apparent complexity comes from using masks...

Oren happened to call me yesterday (to ensure that I got the updated spec and post it...) and he provided feedback as well. As it turns out, Oren has something similar for the tool he is using, only that it is a bit lower-level: rather than having a bytecode for a 'scalar', he has a bytecode for a line of text as it appears in the stream. Also, Oren was using a single character for each bytecode (see below) and encoding the length of the bytecode as a variable length integer (without nulls). We kinda bounced ideas back and forth a bit and the following emerged (almost entirely Oren's insights):

1. Use a single ascii character for each bytecode; in this way it is easy to remember, and possibly even human debuggable.

   Content Codes:
     D  Document (new)
     M  Mapping
     S  Sequence
     V  Value
     *  Alias
     .
        Document, Mapping, Sequence end marker
     #  Comment
     ,  Continuation
     P  Pause (...) the stream

   Specifiers (which immediately precede the content code):
     &  An anchor, the anchor text immediately follows
     !  A transfer method, which immediately follows.

   Optional Formatting Codes:
     >  Let subsequent V nodes be "flow" style
     |  Let subsequent V nodes use "literal" style
     "  Let subsequent V nodes use "double quoted" style
     '  Let subsequent V nodes use 'single quoted' style
     ~  Let subsequent V nodes use 'plain' style
     [  Let subsequent S nodes use the 'inline' style
     {  Let subsequent M nodes use the 'inline' style
     ?  Unset the formatting flags

   For example:

     ---
     - plain
     - >
       this is a flow scalar
     - >
       another flow scalar which is
       continued on a second line
     - &001 !str |
       this is a block scalar,
       typed and anchored 001
     - "This is \"double quoted\""

   Would be written (where \n ends each variable length bytecode) as
   the following stream:

     DS~Vplain
     >Vthis is a flow scalar
     Vanother flow scalar which is continued
     ,on a second line
     &1 !str |Vthis is a block scalar, typed and anchored 1
     *1
     "VThis is "double quoted"
     ..

   Since bytecodes will be all ASCII printable, one will have an array
   255 long with binary values specifying if the bytecode is variable
   length or not, so is_varlen['D'] is 0, while is_varlen['V'] is 1.
   In a similar way, is_paired['D'] is 1, while is_paired['|'] is 0.
   While this may not be as efficient as the bit masks I was
   proposing... it is significantly more "debuggable". The also cool
   part about this is that the formatting bytecodes could be stripped
   or injected as needed.

2. While I like the above format, after the conversation with Oren, I had several concerns. First, in the above method, there didn't seem to be a way to specify the length of a scalar (in particular it could suck to have a continuation for each line). Oren suggested that we could make the stream UTF8, UTF16BE or UTF16LE and simply encode the length as a UTF code point.
This would give a maximum length of 0x10FFFF. Thus "LA" (0x4C41) specifies a length of 65 units (8 bits for UTF8, 16 bits for UTF16) and 0x4C07 would specify a length of 7... and ring your terminal's bell.

Second, I was concerned about being able to use the buffer as-is by pointing into it. This has two aspects: (a) on some platforms pointers are often expensive if they point to an address which is not aligned (to 32 or 64 bits, etc) -- this can be solved by a ' ' bytecode which is just padding, as needed; (b) and then many libraries expect strings to be null terminated -- this can be provided by specifying that the L instruction makes the next scalar terminate with \0\n instead of just \n. These two things make the L instruction rather ugly, as it really is a 'special' instruction which impacts the core mechanism. Hmm.

Further, if the entire YAML document was in a single buffer, then another instruction "J" for jump would also be useful; it would be similar to L only that it specifies how big a branch is. "DJz" (0x444A7A) would specify that the document's entire span in memory is 122 bytes. In this way one could 'skip' a document, or a subtree (as this instruction could come immediately after a branch).

3. In the previous proposal, you wouldn't provide the user with the raw transfer method, but would rather provide a normalized one. This can be done with two additional bytecodes, which could be used instead (or in addition to %).

     R  Register a type family; the first unicode character which
        follows is the 'handle' (giving 0x10FFFF handles) and is then
        followed by the normalized URL.
     F  Specify a registered family for an upcoming scalar using a
        registered bytecode.

Further, we could specify that several 'handles' are pre-loaded...
     RStaguri:yaml.org,2002:str
     FS
     Vthis is a string

There are several other higher order items that I had done, and Oren convinced me that we could keep the low level bytecodes as close to the YAML spec as possible (line oriented), and then have a few higher level codes as needed.

Anyway, this is a completely _different_ approach than what I was using. It has several advantages: (a) it is more debuggable, (b) it is probably quite a bit more compact, and (c) lengths and handles nicely become characters, delegating byte order handling to the same mechanism as used by the underlying character set.

Seeing the level of complications here, it was making me think that perhaps the previous proposal I had should just use a full 128 bit 'node' for every possible node (32 bit code, 32 bit length, 32 bit type family, and a 32 bit anchor). It would be much less space efficient... but probably much easier to understand and use from a variety of programs. In any case, I'm leaning more towards this approach now...

Best, Clark |
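The `is_varlen`/`is_paired` classification tables from point 1 above can be sketched in a few lines of C. The code sets used here follow the draft lists in that message and are assumptions, not a final assignment:

```c
#include <assert.h>
#include <string.h>

/* Per-bytecode classification tables, indexed directly by the ASCII
 * code character.  Variable-length codes are terminated by '\n';
 * paired codes open a branch that is closed by '.'. */
static unsigned char is_varlen[256];
static unsigned char is_paired[256];

static void yb_init_tables(void)
{
    const char *varlen = "V,#&!*";  /* value, continuation, comment,
                                       anchor, transfer, alias */
    const char *paired = "DMS";     /* document, mapping, sequence */
    memset(is_varlen, 0, sizeof is_varlen);
    memset(is_paired, 0, sizeof is_paired);
    for (const char *p = varlen; *p; ++p)
        is_varlen[(unsigned char)*p] = 1;
    for (const char *p = paired; *p; ++p)
        is_paired[(unsigned char)*p] = 1;
}
```

As the message says, after initialization `is_varlen['D']` is 0 while `is_varlen['V']` is 1, and `is_paired['D']` is 1 while `is_paired['|']` is 0.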
From: Clark C. E. <cc...@cl...> - 2003-09-09 19:23:40
|
subject: Alternative Bytecode Idea... (adopting Oren's approach)
summary: >
  This proposal uses a single ASCII character for each bytecode;
  each bytecode has one of three lengths: (a) atomic, (b) it has
  one 'argument' character which always follows, (c) it is variable
  length and ends with a unix linefeed '\n' character (or, see L)
codes:
  #
  # Content/Flow
  #
  'D': Document
  '%': Directive
  'M': Mapping
  'S': Sequence
  'V': Value
  '*': Alias
  '.': Branch End
  ',': Value Continuation
  'N': Normalized New Line
  'P': Pause the stream (...)
  #
  # Filler (non-content)
  #
  ' ': Padding
  '#': Comment
  #
  # Specifiers
  #
  '&': Anchor
  '!': Transfer method string
  #
  # Formatting
  #
  '>': Let subsequent V nodes be "flow" style
  '|': Let subsequent V nodes use "literal" style
  '"': Let subsequent V nodes use "double quoted" style
  "'": Let subsequent V nodes use 'single quoted' style
  '_': Let subsequent V nodes use 'plain' style
  '[': Let subsequent S nodes use the 'inline' style
  '{': Let subsequent M nodes use the 'inline' style
  '0': Subsequent blocks do not indent children
  '1': Subsequent blocks indent children 1 space
  '2': Subsequent blocks indent children 2 spaces
  '3': Subsequent blocks indent children 3 spaces
  '4': Subsequent blocks indent children 4 spaces
  '5': Subsequent blocks indent children 5 spaces
  '6': Subsequent blocks indent children 6 spaces
  '7': Subsequent blocks indent children 7 spaces
  '8': Subsequent blocks indent children 8 spaces
  '9': Subsequent blocks indent children 9 spaces
  '~': Unset the formatting flags
  #
  # Advanced
  #
  'E': Error message
  'W': Warning message
  'I': Informational message
  'R': Register a transfer method; immediately following this code
       is a character to 'index', followed by the transfer.
  'T': Normalized Transfer; immediately following this code is a
       character referencing the 'index' previously registered.
  ';': Branch Continuation
  'J': Jump, specifies the length to end of branch
  'L': Length, specifies length of a V or , or any other leaf
categories:
  atomic: [ ' ', '>', '|', '"', "'", '_', '[', '{', '~',
            '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
            'D', 'M', 'S', '.', 'N', 'P', ';' ]
  varlen: [ '%', 'V', '*', ',', '&', '!', '=' ]  # ends with '\n'
  index:  [ '=', 'J', 'V', 'L' ]  # followed by exactly one utf char
  indvar: [ 'E', 'W', 'I', 'R' ]  # index, followed by varlen
examples:
  - yaml: |
      ---
      - plain
      - >
        this is a flow scalar
      - >
        another flow scalar which is
          continued on a second line
          and indented 2 spaces
      - &001 !str |
        This is a block scalar,
        both typed and anchored
      - *001 # this was an alias
      - "This is a\n double quoted scalar"
    bytecode: |
      D0S_Vplain
      2>Vthis is a flow scalar
      Vanother flow scalar which is continued
      ,on a second line and indented 2 spaces
      &001
      !str
      |VThis is a block scalar, both typed
      N,and anchored
      *001
      # this was an alias
      "VThis is a
      N, double quoted scalar
todo:
  - show how buffer boundaries are handled
  - show how the advanced codes work
|
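The atomic/varlen categories above are enough to walk a bytecode stream instruction by instruction. Below is a minimal C sketch of such a walker; it uses the varlen set exactly as listed in the proposal and, for simplicity, ignores the extra index character consumed by 'J', 'L', and friends (the `yb_` name is hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count the instructions in a bytecode stream: an atomic code is a
 * single character, while a variable-length code runs up to (and
 * including) the next '\n'.  Index arguments are not handled here. */
static size_t yb_count_instructions(const char *stream)
{
    const char *varlen = "%V*,&!=";  /* codes terminated by '\n' */
    size_t count = 0;
    for (const char *p = stream; *p; ) {
        if (strchr(varlen, *p)) {
            const char *nl = strchr(p, '\n');
            p = nl ? nl + 1 : p + strlen(p);  /* skip to after '\n' */
        } else {
            ++p;                              /* atomic code */
        }
        ++count;
    }
    return count;
}
```

For instance, the stream `"D0S_Vplain\n."` splits into six instructions: D, 0, S, _, the V scalar, and the closing '.'.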
From: why t. l. s. <yam...@wh...> - 2003-09-09 21:03:06
|
On Tuesday 09 September 2003 01:28 pm, Clark C. Evans wrote:
> > This proposal uses a single ASCII character for each bytecode.
>
> This is moving right along! Tremendous work!

HeiL! I can see so many advantages to the ASCII solution. One of these advantages was demonstrated in your last message: the ability to store YAML bytecode within a YAML document.

Absolutely, the bytecode will be a great way to ensure precision in the testing suite. And now we merely need to add a new bytecode representation for each test in the suite. We could have stored Clark's bytecode in base64, but this would slow development and I'm afraid the testing suite would end up falling behind. It's tough enough to maintain such tests without having to encode data.

In fact, I'm sure you could gain a lot by adding to the tests right away. The suite represents a large cross-section of cases.

_why |
From: Clark C. E. <cc...@cl...> - 2003-09-10 04:12:30
|
On Tue, Sep 09, 2003 at 03:02:43PM -0600, why the lucky stiff wrote:
| I can see so many advantages to the ASCII solution. One of these advantages
| was demonstrated in your last message: the ability to store YAML bytecode
| within a YAML document.

Another advantage is that once you introduce a noop bytecode you can even convert a unicode string containing a YAML document into a unicode string containing the bytecodes *without* moving any of the characters. Thus, the parser could just be a simple call....

    yaml_parse2bytecode( yaml_utf8 * buff, len );

And then, it would be a cake walk for a processor to rummage through the bytecodes.

| Absolutely, the bytecode will be a great way to ensure precision in the
| testing suite. And now we merely need to add a new bytecode representation
| for each test in the suite. We could have stored Clark's bytecode in base64,
| but this would slow development and I'm afraid the testing suite would end up
| falling behind. It's tough enough to maintain such tests without having to
| encode data.

Well, there is only one open issue on the table: for the Length bytecode L, do we store the length as an ASCII number or as a single unicode character (where the codepoint is the length)?

| In fact, I'm sure you could gain a lot by adding to the tests right away. The
| suite represents a large cross-section of cases.

Right. A good way to validate that we indeed have both a sufficient and necessary set of bytecodes.

Best, Clark |
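The "single unicode character" side of the open Length question can be sketched concretely: the length value is treated as a code point and written out with the standard UTF-8 encoding (RFC 3629), which caps it at 0x10FFFF as discussed earlier in the thread. The `yb_` functions are illustrative only:

```c
#include <assert.h>
#include <stddef.h>

/* Encode a length as one UTF-8 "character" whose code point IS the
 * length.  Returns bytes written, or 0 if v exceeds 0x10FFFF. */
static size_t yb_put_len(unsigned long v, unsigned char *out)
{
    if (v < 0x80)  { out[0] = (unsigned char)v; return 1; }
    if (v < 0x800) {
        out[0] = (unsigned char)(0xC0 | (v >> 6));
        out[1] = (unsigned char)(0x80 | (v & 0x3F));
        return 2;
    }
    if (v < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (v >> 12));
        out[1] = (unsigned char)(0x80 | ((v >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (v & 0x3F));
        return 3;
    }
    if (v <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (v >> 18));
        out[1] = (unsigned char)(0x80 | ((v >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((v >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (v & 0x3F));
        return 4;
    }
    return 0;
}

/* Decode the length back; returns bytes consumed.  Assumes the
 * input is well formed (no validation, this is just a sketch). */
static size_t yb_get_len(const unsigned char *in, unsigned long *v)
{
    if (in[0] < 0x80) { *v = in[0]; return 1; }
    if ((in[0] & 0xE0) == 0xC0) {
        *v = ((unsigned long)(in[0] & 0x1F) << 6) | (in[1] & 0x3F);
        return 2;
    }
    if ((in[0] & 0xF0) == 0xE0) {
        *v = ((unsigned long)(in[0] & 0x0F) << 12) |
             ((unsigned long)(in[1] & 0x3F) << 6) | (in[2] & 0x3F);
        return 3;
    }
    *v = ((unsigned long)(in[0] & 0x07) << 18) |
         ((unsigned long)(in[1] & 0x3F) << 12) |
         ((unsigned long)(in[2] & 0x3F) << 6) | (in[3] & 0x3F);
    return 4;
}
```

Note this reproduces the earlier "LA" example: a length of 65 encodes as the single byte 0x41, i.e. the letter 'A'.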
From: Clark C. E. <cc...@cl...> - 2003-09-10 04:13:11
|
subject: Revision #2 of YAML Bytecodes
summary: >
  This proposal defines a 'preparsed' format where YAML syntax is
  converted into a series of events, as bytecodes.  Each bytecode is
  either atomic (stands alone) or is variable length, ending in a line
  feed character '\n'.  It is thought that this preparsed format could
  be a functional equivalent to a parser API, in that each callback
  (or set of optional arguments on each callback) almost perfectly
  corresponds to each bytecode.  Since the bytecode format is actually
  in the same Unicode encoding as the YAML source, one could even
  imagine a parser "C" function which takes a buffer and simply
  rewrites the data in place (without expanding or contracting the
  buffer) from the YAML syntax to the equivalent bytecodes.
codes:
  #
  # Primary Bytecodes
  #
  # These bytecodes form the minimum needed to represent YAML
  # information from the serial model (ie, without format and comments)
  #
  'D':
    name: Document
    size: Atomic, paired with '.'
    desc: >
      Indicates that a document has begun; either it is the beginning
      of a YAML stream, or a --- has been found.  The bytecode '.' is
      used to signal the end of branch instructions, thus an empty
      document is expressed as "D."
  '$':
    name: Scalar
    size: Variable (uses \n to end)
    desc: >
      This indicates the start of a scalar value, which can be
      continued by the 'N' and ',' bytecodes.  This bytecode is used
      for sequence entries, keys, values, etc.
  ',':
    name: Scalar Continuation
    size: Variable (uses \n to end)
    desc: >
      Since a scalar may not fit within a buffer, and since it may not
      contain a \n character, it may have to be broken into several
      chunks.  This is an additional chunk.
  'n':
    name: Normalized New Line
    size: Atomic
    desc: >
      Since scalar values may not contain a newline (\n), this
      bytecode is used in its place.  Thus, the bytecodes for
      "Hello\nWorld" would be "$Hello\nn,World\n".
  'S':
    name: Sequence
    size: Atomic, paired with '.'
    desc: >
      Indicates the start of a sequence; children are provided
      following, till a '.' bytecode is encountered.  So, the
      bytecodes for "[ one, two ]" would be "S$one\n$two\n."
  'M':
    name: Mapping
    size: Atomic, paired with '.'
    desc: >
      Indicates the start of a mapping; children of the mapping are
      provided as a series of K1,V1,K2,V2 pairs as they are found in
      the input stream.  For example, the bytecodes for
      "{ a: b, c: d }" would be "M$a\n$b\n$c\n$d\n."
  '.':
    name: Close Branch
    size: Atomic
    desc: >
      This closes the outermost branch (Document, Mapping, Sequence);
      see earlier examples.
  '%':
    name: Directive
    size: Variable
    desc: >
      Indicates that a directive was encountered; note that the only
      non-error directive is "YAML:1.0"
  '&':
    name: Anchor
    size: Variable
    desc: >
      This bytecode associates an anchor with the very next content
      node; see the '*' alias bytecode.
  '*':
    name: Alias
    size: Variable
    desc: >
      This is used whenever there is an alias node; for example,
      "[ &X one, *X ]" would be normalized to "S&X\n$one\n*X\n." --
      in this example, the anchor bytecode applies to the very next
      content bytecode.
  '!':
    name: Raw Transfer
    size: Variable
    desc: >
      This is the raw transfer string as provided by the incoming YAML
      stream.  Note that validation is not provided in this case --
      although 'R' and 'T' are better bytecodes when possible.
  'P':
    name: Pause
    size: Atomic
    desc: >
      This is the instruction when a document is terminated, but
      another document has not yet begun.  Thus, it is optional, and
      typically used to pause parsing...
  #
  # Advanced bytecodes (not absolutely needed, but very nice)
  #
  '~':
    name: Noop
    size: Atomic
    desc: >
      This bytecode does nothing.  In some cases (especially with
      4 space indentation), it is possible to rewrite an incoming YAML
      stream as a bytecode stream without moving strings around.  For
      example, assume that '@' represents a newline (for illustration
      only); the first string, in YAML syntax, can be rewritten in
      place as the second string, in YAML bytecodes, without moving
      the strings.

      "- plain@- >@ This is@ folded@- |@ a block@ scalar@"
      "S$plain@~~~~~$This is@, folded@~~~~~$a block@n$scalar@n."
  ';':
    name: literal scalar continuation
    size: Variable (uses \n to end)
    desc: >
      This is simply a shorthand for the "N," bytecode sequence, only
      taking one character instead of two.  It is needed to support
      in-place bytecode conversion of literal scalars which only use
      one character of indenting; "--- |@ literal@ scalar" could be
      converted in-place to "D~~~~~$literal@;scalar@n".  I can't think
      of any other use case for this bytecode.
  'R':
    name: Register Normalized Transfer
    size: Variable
    desc: >
      The variable length payload has two parts: the first is a
      unicode string which is used as a key to identify the transfer,
      then the '=' sign, followed by the transfer as expressed as a
      full URI.  For example: "Rs=taguri:yaml.org,2002:str\n"
  'T':
    name: Normalized Transfer
    size: Variable
    desc: >
      This is an alternative to the "!" bytecode, but using the
      abbreviation given in 'R' above.  Note that the unicode string
      could be just one character, giving 0x10FFFF possibilities
      before one even gets to two characters.  "Ts\n"
  'L':
    name: Length
    size: Variable
    desc: >
      This bytecode is purely optional, and gives the span of the very
      next bytecode (in 8 bit words for UTF8 and 16 bit words for
      UTF16).  When used in front of a mapping or sequence, it gives
      the length of the branch so one may skip all of the children.
      While one could specify the length as a single unicode
      character, Brian asserted that it would be better just to use
      the ASCII version.

      "--- |\n literal\n scalar.\n"
      "DL22\n$literal\nr;scalar.\nn"

      In the above example, one would have to add 22 to the address of
      the $ character to get to the next bytecode beyond the scalar's
      scope.
  '?':
    name: Notice
    size: Variable
    desc: >
      This is a packed string; it has an error level (W - Warning,
      E - Error, F - Fatal, T - Trace, I - Info), followed by an error
      number, a line number, and then a textual message, for example:
      "?W394,#22,Missing Such and such on line 22\n"
  #
  # The following bytecodes are purely at the syntax level and
  # useful for pretty printers and emitters
  #
  '#':
    name: Comment
    size: Variable
    desc: >
      This is a single line comment.  It is terminated like all of the
      other variable length items, with a '\n'.
  '>':
    name: Folded Style
    size: Atom
    desc: >
      Subsequent scalar nodes '$' use the folded style unless another
      style instruction changes the 'paintbrush'.  Of course, if a
      scalar cannot be emitted using flow (it has items which require
      double quote escaping), then this is a warning (or error)
      condition in the emitter or painter, etc.
  '|':
    name: Literal Style
    size: Atom
  '"':
    name: Double Quoted Style
    size: Atom
  "'":
    name: Single Quoted Style
    size: Atom
  'F':
    name: Flow Collections
    desc: Use flow/inline style for subsequent mappings and sequences
  'B':
    name: Block Collections
    desc: Use block/indented style for subsequent mappings and sequences
  'I':
    name: Indent
    size: Variable
    desc: >
      Specifies the number of additional spaces to indent for various
      block styles, ie "I4\n" specifies a 4 char indent.
  '?':
    name: Autoselect Format
    size: Atom
    desc: Unset the formatting flags
examples:
  - yaml: |
      ---
      - plain
      - >
        this is a flow scalar
      - >
        another flow scalar which is continued
        on a second line and indented 2 spaces
      - &001 !str |
        This is a block scalar, both typed
        and anchored
      - *001  # this was an alias
      - "This is a \"double quoted\" scalar"
    bytecode: |
      D0S_$plain
      I2
      >$this is a flow scalar
      $another flow scalar which is continued
      ,on a second line and indented 2 spaces
      &001
      !str
      |$This is a block scalar, both typed
      N,and anchored
      *001
      # this was an alias
      "$This is a "double quoted" scalar
      ..
|
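To make the proposal concrete, here is a minimal reader for the primary bytecodes, sketched in Python. Only the codes from the table above are assumed; the function names and the event-list representation are my own, and the style atoms ('>', '|', etc.) are deliberately omitted for brevity.

```python
# Sketch of a reader for the primary YAML bytecodes described above.
# Assumption: only primary/advanced atomic codes listed here; all other
# codes are treated as variable length, running up to the next '\n'.

def read_bytecodes(stream):
    """Turn a bytecode string into a list of (code, argument) events."""
    events = []
    i = 0
    while i < len(stream):
        code = stream[i]
        if code in 'DSM.nP~':                   # atomic bytecodes
            events.append((code, None))
            i += 1
        else:                                    # variable length
            end = stream.index('\n', i + 1)
            events.append((code, stream[i + 1:end]))
            i = end + 1
    return events

def load_scalars(events):
    """Rebuild whole scalars: '$' starts one, ',' continues it, and
    the atomic 'n' stands in for an embedded newline."""
    out = []
    for code, arg in events:
        if code == '$':
            out.append(arg)
        elif code == ',':
            out[-1] += arg
        elif code == 'n':
            out[-1] += '\n'
    return out
```

With this sketch, "S$one\n$two\n." yields a sequence event, two scalar events, and a close, and "$Hello\nn,World\n" rebuilds to "Hello\nWorld", matching the examples in the 'S' and 'n' entries.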
From: Oren Ben-K. <or...@be...> - 2003-09-10 17:12:52
|
I like the idea of printable compact byte code for YAML. I'm focusing on the parser generator, so I didn't put a lot of thought into the exact set of byte codes. Also, there are some subtle differences between the byte codes as I see them and what Clark presented.

My motivation was the return format from a "pull" parser. I settled on a "get next" method that returns a pair - a code and an associated set of characters. This has an obvious mapping to byte codes. The exact set of codes depends on the level of abstraction. We have several information model levels. My intention is that the parser will return codes that fully describe the syntax model - every space, comment, line break and escape character. Clark has started from the other end (or maybe from the middle), and his original codes were originally meant to describe the tree (or maybe even the graph) model. Hence there are incompatibilities.

I think this can be resolved by having a common set of codes with additional level-specific codes. For example, the code (think of it as a token) describing the white space between the key's ':' and the '!' of the transfer method of the value belongs only to the syntax level.

There are also issues of processing; the parser returns the transfer method with an indication that a prefix is used (at the syntax model level), while at a higher level it would be expected that the transfer method would be the complete URI. I suspect that two separate codes are needed - one for the URI and another for the exact form it takes in the syntax (using shorthands, prefixes, etc.).

On the other hand, some codes (most importantly, the simple scalar value code/token) are common for multiple levels. Therefore, defining the exact set of codes will require careful design work. I'd much rather delay this work to the point where I have a working parser generator so that the design can be tested in practice, rather than working on the design now in the abstract. 
A second difference is that, in my scheme, every token/code is exactly one line. The format is _always_ one character specifying the token type, then the characters associated with it, then a '\n'. This holds even when a specific token type never has any associated data. Also, the format does not specify the amount of associated characters in advance; one has to scan for the '\n'.

That said, as Clark pointed out, it is possible to define a "length" code that would precede a token and provide advance warning about the amount of data that it carries, in case this is important for efficiency. Personally I'd rather encode the length as a normal ASCII decimal number (e.g., "L153\n" to signal that the following token line is 155 characters long - 1 for the code, one for the final \n, and 153 characters of data). As you can see, I'm not very worried about compactness; I'm more concerned about being able to grok a byte codes stream for debugging.

At any rate, given the above, a document like this:

--- !!bloop
foo : >
  bar
    baz

# comment
...

Going through my parser might end up looking something like the following. I'm assuming the '#' code means a no-op code whose characters are ignored (that is, a comment token - not to be confused with a token representing a comment in the YAML document itself!).

# Document header.
h---
# Begin collection node (I'm following the productions here).
c
# White space (. <=> ' ').
w.
# Transfer method.
t!bloop
# Line break (nl = LF; nc = CR; nP = PS; nL = LS; nN = NEL).
nl
# Start of mapping value.
m
# Key scalar node.
s
# Value.
vfoo
# End of scalar key.
e
# White space again.
w.
# Indicator separating key and value.
i:
# More spaces
w.
# Value scalar node.
s
# Style (folded).
S>
# Line break.
nl
# Indentation.
i..
# More of the value.
v baz
# Line break.
nl
# Indentation.
i..
# Piece of the value.
vbar
# End of scalar value.
e
# Line breaks - note outside node since is chomped.
nl
# Line breaks - note outside node since is chomped.
nl
# Indicator (for comment).
i#
# The comment text (I'm trying to stick to letter codes).
C comment
# Line breaks.
nl
# End of mapping value (document node).
e

Without the comments it looks like this:

h---
c
w.
t!bloop
nl
m
i
s
vfoo
e
w.
i:
w.
s
S>
nl
i..
v baz
nl
i..
vbar
e
nl
nl
i#
C comment
nl
e

There are many subtleties (exactly what are the node boundaries with regard to white space and empty lines; I'm using 'e' as a catch-all end directive, maybe we want specific end directives; what to do with zero indentation; and so on).

Note the above is geared towards the output from a tokenizing parser; it is trivial to construct an exact copy of the original YAML file from the byte codes (in other words, the above codes are designed to express the syntax model). It is easy to "cull" the codes and only leave the "meat" of the document:

c
t!bloop
m
s
vfoo
e
s
v baz
nl
vbar
e
e

This is rather compact. If anyone is worried about compactness beyond this level, zipping this would do wonders.

Note that there are very few traces of the specific syntax used, and with minimal work (and no changes to the code) such traces can be completely eliminated (e.g. normalizing newlines, expanding escape sequences, etc.). This demonstrates the notion of having a common set of tokens that are shared between the different model levels. However, some things change between levels - e.g., expanding transfer methods to URIs probably requires using a different code ('T' instead of 't'?).

Also note it is inherently impossible to represent a multi-line scalar as a single 'v' token in the "higher" model level, because \n always terminates the token, and in the "high" levels of abstraction escape sequences shouldn't be used (or should they?). This isn't a big deal, I think.

OK, enough rambling... My point is that this is a promising direction, but there are many details to work out - and I'm going to table the whole notion until *after* I get my Pull-Parser-Generator to work. 
Have fun, Oren Ben-Kiki |
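Oren's one-token-per-line format above is easy to work with mechanically. Here is a rough Python sketch; the token letters follow his examples, and `cull` is only a naive, stateless approximation of the culling he shows (as he notes himself, a real cull needs context, e.g. line breaks inside vs. outside scalar nodes).

```python
# Sketch of Oren's line-oriented byte codes: each physical line is one
# token; the first character is the code, the rest is the token's data.

def tokens(stream):
    """Split a byte-code stream into (code, chars) pairs."""
    return [(line[0], line[1:]) for line in stream.split('\n') if line]

# Codes that only describe presentation in his examples: header,
# whitespace, indicators/indentation, styles, and comment text.
# This set is my reading of his examples, not a fixed specification.
PRESENTATION = set('hwiSC#')

def cull(pairs):
    """Drop presentation-only tokens, keeping the 'meat'.  Stateless,
    so unlike Oren's hand-culled example it keeps all 'n' line-break
    tokens, even ones outside scalar nodes."""
    return [(c, s) for (c, s) in pairs if c not in PRESENTATION]
```

For instance, feeding the culled stream from his message through `tokens` gives pairs like `('t', '!bloop')` and `('v', 'foo')`, which is the shape a "get next" pull API would return.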
From: Clark C. E. <cc...@cl...> - 2003-09-11 02:26:07
|
On Wed, Sep 10, 2003 at 07:11:03PM +0200, Oren Ben-Kiki wrote:
| I like the idea of printable compact byte code for YAML. I'm focusing on
| the parser generator so I didn't put a lot of thought into the exact set
| of byte codes. Also, there are some subtle differences between the byte
| codes as I see them and what Clark presented.

I was being 'elaborate' so as to explore the issues involved. I really like -- as _why suggested -- the idea of converting the examples in the test suite to bytecodes. This would definitely allow us to think things through.

| My motivation was the return format from a "pull" parser. I settled on a
| "get next" method that returns a pair - a code and an associated set of
| characters. This has an obvious mapping to byte codes.

Although it will not be direct, it will be easy to convert the output of the Syck parser into bytecodes; but it is more at the 'serial model'.

| Clark has started from the other end (or maybe from the middle), and his
| original codes were originally meant to describe the tree (or maybe even
| the graph) model. Hence there are incompatibilities.

I actually *started* from the graph model and worked backwards, as it would be nice to be able to (perhaps with some mind warping) use the bytecodes in a read-only random access sort of way.

| I think this can be resolved by having a common set of codes with
| additional level-specific codes. For example, the code (think of it as a
| token) describing the white space between the key's ':' and the '!' of
| the transfer method of the value belongs only to the syntax level.

Exactly. I like the idea of even using the bytecode's case to be the clearest form of distinction.

| There are also issues of processing; the parser returns the transfer
| method with an indication that a prefix is used (at the syntax model
| level), while at a higher level it would be expected that the transfer
| method would be the complete URI. 
| I suspect that there should be two separate codes needed - one for the
| URI and another for the exact form it takes in the syntax (using
| shorthands, prefixes, etc.).

I really like your example below where 't' is the bytecode for the transfer as it appears in the syntax model, while 'T' is the normalized version appearing in the serial model.

| On the other hand, some codes (most importantly, the simple scalar value
| code/token) are common for multiple levels. Therefore, defining the
| exact set of codes will require careful design work. I'd much rather
| delay this work to the point where I have a working parser generator so
| that the design can be tested in practice, rather than working on the
| design now in the abstract.

*nods* Hence me starting from the graph model and working backwards.

| A second difference is that, in my scheme, every token/code is exactly
| one line. The format is _always_ one character specifying the token
| type, then the characters associated with it, then a '\n'. It doesn't
| matter whether the specific token type always has no associated data.
| Also the format does not specify the amount of associated characters in
| advance; one has to scan for the '\n'.

While working out the bytecode idea, what occurred to me is that it could be possible to generate an "in-place" parser which took an input buffer and tokenized it. I'd like to know what you think of this approach. Specifically:

- You end up ignoring a lot of the stuff in the syntax model so that you have room for the bytecodes. Most of the bytecodes end up going in the 'indent' which is tossed.

- It requires a few extra instructions which do not appear in any model, and it requires redundancy (ie, instructions A and B may have to be both separate, plus a third instruction X which represents A followed by B) for the necessary compactness. 
- Typefamily handling is somewhat interesting

In short, it requires a bit more complexity, but the advantage is that the parser could be:

    parse(buffer*, bufflen);

Which, IMHO, is kinda neat. However, it could be _too_ cute.

... Hmm. It seems that your primary goal is to provide a tool which better helps visualize the parser output. My original goal with the bytecodes was two fold:

1. To enable parser / loader / serializer / painter / emitter interoperability by providing an extendable push/pull API as data rather than as function calls.

2. To provide an intermediate 'pre-parsed' language which would be easier to operate on programmatically than the syntax directly. In this case, you really want to use binary numbers -- although not pointers, because it would be nice to move this structure between processes.

I never had the goal of visualization, as I think this is what the current YAML syntax already provides. Also, I never had 'binary', ie compactness, as a goal. Unless your data is primarily big numbers, the idea that a binary format is compact is flawed; you often store one or two digit numbers in 32 bits, and frequently space is wasted on byte alignment, etc. What a binary format gives you is data that is 'ready to use'.

| That said, as Clark pointed out it is possible to define a "length" code
| that would precede a token and provide an advance warning about
| the amount of data that it carries, in case this is important for
| efficiency. Personally I'd rather encode the length as a normal ASCII
| decimal number (e.g., "L153\n" to signal that the following token line
| is 155 characters long - 1 for the code, one for the final \n, and 153
| characters of data). As you can see I'm not very worried about
| compactness; I'm more concerned about being able to grok a byte codes
| stream for debugging.

Hmm. A string cannot be stored directly as a "blob", because \n must be escaped by ending the chunk, specifying a new line bytecode, and then starting the next chunk. Ick. 
That's hardly useful for an intermediate format.

Thank you _so_ much for humoring me and writing the remainder of this message... it was very useful. Although I think that it has me thinking my original path was better; at least for the goals I had... programmability.

Although, I do like _why's feedback about providing the bytecode representation directly in the test suite. Would hex be too awful? Imagine the 'test' buffer size were, say, 40 characters; then the file below would be 'compared' to something like:

yaml: |
  --- !!bloop
  foo : >
    bar
      baz

  # comment
code: |
  # output from bmore (http://bvi.sf.net)
  00000000  2D 2D 2D 20 21 21 62 6C 6F 6F 70 0A  --- !!bloop.
  0000000C  66 6F 6F 20 3A 20 3E 0A 20 20 62 61  foo : >.  ba
  00000018  72 0A 20 20 20 20 62 61 7A 0A 0A 23  r.    baz..#
  00000024  20 63 6F 6D 6D 65 6E 74 0A            comment.

Where more of the stuff on the right would be periods (.) representing unprintable characters (though not every printable character on the right would be an actual content character, since some characters would appear, say, inside a number). Thus, one could easily 'test' to see if the right binary stream is being generated. The primary issue is that the code would have to have a specific byte order...

Thoughts?

Best, Clark

| At any rate, given the above, a document like this:
|
| --- !!bloop
| foo : >
|   bar
|     baz
|
| # comment
| ...
|
| Going through my parser might end up looking something like the
| following. I'm assuming the '#' code means a no-op code whose characters
| are ignored (that is, a comment token - not to be confused with a token
| representing a comment in the YAML document itself!).
|
| # Document header.
| h---
| # Begin collection node (I'm following the productions here).
| c
| # White space (. <=> ' ').
| w.
| # Transfer method.
| t!bloop
| # Line break (nl = LF; nc = CR; nP = PS; nL = LS; nN = NEL).
| nl
| # Start of mapping value.
| m
| # Key scalar node.
| s
| # Value.
| vfoo
| # End of scalar key.
| e
| # White space again.
| w. 
| # Indicator separating key and value.
| i:
| # More spaces
| w.
| # Value scalar node.
| s
| # Style (folded).
| S>
| # Line break.
| nl
| # Indentation.
| i..
| # More of the value.
| v baz
| # Line break.
| nl
| # Indentation.
| i..
| # Piece of the value.
| vbar
| # End of scalar value.
| e
| # Line breaks - note outside node since is chomped.
| nl
| # Line breaks - note outside node since is chomped.
| nl
| # Indicator (for comment).
| i#
| # The comment text (I'm trying to stick to letter codes).
| C comment
| # Line breaks.
| nl
| # End of mapping value (document node).
| e
|
| Without the comments it looks like this:
|
| h---
| c
| w.
| t!bloop
| nl
| m
| i
| s
| vfoo
| e
| w.
| i:
| w.
| s
| S>
| nl
| i..
| v baz
| nl
| i..
| vbar
| e
| nl
| nl
| i#
| C comment
| nl
| e
|
| There are many subtleties (exactly what are the node boundaries with
| regard to white space and empty lines; I'm using 'e' as a catch-all end
| directive, maybe we want specific end directives; what to do with zero
| indentation; and so on).
|
| Note the above is geared towards an output from a tokenizing parser; it
| is trivial to construct an exact copy of the original YAML file from the
| byte codes (in other words, the above codes are designed to express the
| syntax model). It is easy to "cull" the codes and only leave the "meat"
| of the document:
|
| c
| t!bloop
| m
| s
| vfoo
| e
| s
| v baz
| nl
| vbar
| e
| e
|
| This is rather compact. If anyone is worried about compactness beyond
| this level, zipping this would do wonders.
|
| Note that there are very few traces of the specific syntax used, and
| with minimal work (and no changes to the code) such traces can be
| completely eliminated (e.g. normalizing newlines, expanding escape
| sequences etc.). This demonstrates the notion of having a common set of
| tokens that are shared between the different model levels. 
| However some things change between levels - e.g., expanding transfer
| methods to URIs probably requires using a different code ('T' instead
| of 't'?).
|
| Also note it is inherently impossible to represent a multi-line scalar
| as a single 'v' token in the "higher" model level because \n always
| terminates the token, and in the "high" levels of abstraction escape
| sequences shouldn't be used (or should they?). This isn't a big deal, I
| think.
|
| OK, enough rambling... My point is that this is a promising direction,
| but there are many details to work out - and I'm going to table the
| whole notion until *after* I get my Pull-Parser-Generator to work.
|
| Have fun,
|
| Oren Ben-Kiki
|
| -------------------------------------------------------
| This sf.net email is sponsored by: ThinkGeek
| Welcome to geek heaven.
| http://thinkgeek.com/sf
| _______________________________________________
| Yaml-core mailing list
| Yam...@li...
| https://lists.sourceforge.net/lists/listinfo/yaml-core
|
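The length-code arithmetic quoted above ("L153\n" announcing a 155-character token line) is easy to check mechanically. Here is an illustrative Python sketch; the function names are mine, and only Oren's proposed ASCII-decimal form is shown, not Clark's single-codepoint alternative.

```python
# Sketch of Oren's optional 'L' length code: an ASCII decimal count of
# the DATA characters in the very next token line.  The token line
# itself is 1 (code) + n (data) + 1 (final '\n') characters long.

def with_length(code, data):
    """Serialize one token line, prefixed with its length code."""
    return "L%d\n%c%s\n" % (len(data), code, data)

def read_with_length(stream):
    """Consume an L-prefixed token without scanning the data for '\n'."""
    assert stream[0] == 'L'
    nl = stream.index('\n')          # only the L line itself is scanned
    n = int(stream[1:nl])
    start = nl + 1
    line = stream[start:start + n + 2]   # code + n data chars + '\n'
    return line[0], line[1:-1]
```

The point of the sketch: a consumer reading the 'L' line can slice out the following token directly, without searching the (possibly large) data for the terminating '\n'.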
From: Oren Ben-K. <or...@be...> - 2003-09-11 06:06:09
|
> | My motivation was the return format from a "pull" parser...
> | Clark has started from the other end...
>
> I actually *started* from the graph model and worked
> backwards,

Right. Different goals, subtle differences in result. I think we can work it out, though.

> | I think this can be resolved by having a common set of codes with
> | additional level-specific codes...
>
> Exactly. I like the idea of even using the bytecode's case
> to be the clearest form of distinction.

Well, there are 3 different models and only two letter cases :-) We'll just have to wait and see.

> While working out the bytecode idea, what occurred to me is
> that it could be possible to generate an "in-place" parser
> which took an input buffer and tokenized it. I'd like to
> know what you think of this approach? Specifically:

I don't like it. First, I want to be able to report the syntax model, for tools such as pretty-printers and editors. Using your approach, this is inherently impossible - there's just no place to add the codes for the syntax tokens. Certainly we could have two parsers, one for the syntax and one for the higher levels, but I think this defeats the purpose.

Second, it doesn't work anyway. There are cases where you want to add a token and there are no "useless" syntax characters you can overwrite with the token code. For example, a YAML document without a header:

foo: bar

There's no place you can insert the start-of-document, start-of-mapping, and start-of-key-node indicators.

> - Typefamily handling is somewhat interesting

This is another example where what you propose is impossible; if I use a prefix, and you want to report the full type family, there's just no way you can make it fit:

--- !some-very-long-prefix^
foo: !^some-suffix bar
...

> In short, it requires a bit more complexity, but the
> advantage is that the parser could be:
>
> parse(buffer*, bufflen);
>
> Which, IMHO, is kinda neat. However, it could be _too_ cute.

Much too cute :-)

> Hmm. 
> It seems that your primary goal is to provide a tool
> which better helps visualize the parser output.
>
> My original goal with the bytecodes was two fold:
>
> 1. To enable parser / loader / serializer / painter / emitter
>    interoperability by providing an extendable push/pull API as
>    data rather than as function calls.

Nice notion, I'm all for it. Though you'd still want an API. You want to be able to invoke each module as functions rather than by reading from a pipe/socket/etc. "get_next_token" still makes perfect sense.

> 2. To provide an intermediate 'pre-parsed' language which
>    would be easier to operate on programmatically than
>    the syntax directly.

Again, good goal.

> In this case, you really want
> to use binary numbers -- although not pointers, because
> it would be nice to move this structure between processes.

I don't see why this follows. Programmatically, you work on token structs. Between processes you send the trivial encoding of these tokens into text. These are two different representations; converting between them is as trivial as it gets.

> I never had the goal of visualization, as I think this is
> what the current YAML syntax already provides.

Not if one wants to write a test suite for the parser itself...

> | That said, as Clark pointed out it is possible to define a "length"
> | code that would precede a token and provide an advance
> | warning about
> | the amount of data that it carries, in case this is important for
> | efficiency.

Or we could similarly pre-specify the length required for the complete content of a scalar token, or the length of a sequence - using different codes of course. Just another of these details we'll need to work on :-)

> | Personally I'd rather encode the length as a
> | normal ASCII
> | decimal number (e.g., "L153\n" ...
>
> Hmm. A string cannot be stored directly as a "blob", because \n
> must be escaped by ending the chunk, specifying a new line bytecode,
> and then starting the next chunk. Ick. 
> That's hardly useful for an intermediate format.

It is possible to get around this problem, though, using your notion of a '.' code:

vsome multi-line
.scalar value

Note that this is only one token, whose code is 'v' and whose value is "some multi-line\nscalar value". It is just that its representation in the byte-codes file is two physical lines. This allows the higher-level modules to report the value of a scalar in a single token, which is what you are after.

Don't mix up the byte codes format as written to disk as a file with the return value from "get_next_token". I don't think it is very useful to have the return value from some processing module be a long chunk of concatenated byte codes in memory. The only place that such a format will be used would be when reading a byte codes file from the disk - and it would be immediately converted to "get_next_token" calls.

> ... I think that it has
> me thinking my original path was better; at least for the goals I
> had... programmability.

Programmability is done with API calls. It is (too) cute to consider "in-place" tokenization to byte codes; but what happens later when a layer of processing wants to remove tokens or insert additional tokens? This is trivial using "get_next_token" but next to impossible using an internal buffer holding a sequence of "byte codes".

> Although, I do like _why's feedback about providing the bytecode
> representation directly in the test suite.

Yes, that was my main use case for the codes (initially).

> Would hex be too awful?

Yes :-) I want to be able to meaningfully diff the expected and actual result files and stand a fighting chance of understanding the result - e.g., a token being inserted in the wrong place. Being able to write an input/output file by hand is also useful when testing/debugging. Making the byte-codes be simple editable files is very useful there.

> Thoughts?

To recap:

> | OK, enough rambling... 
> | My point is that this is a promising direction,
> | but there are many details to work out - and I'm going to table the
> | whole notion until *after* I get my Pull-Parser-Generator to work.

Have fun, Oren Ben-Kiki |
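Oren's '.' continuation idea can be sketched as a small reader; this is a hypothetical illustration, and note that his '.' continuation marker is unrelated to (and would collide with) Clark's '.' Close Branch bytecode if the two schemes were ever mixed.

```python
# Sketch of the '.' continuation code: a physical line beginning with
# '.' extends the previous token's value with an embedded newline, so
# one logical token can span several physical lines of the file.

def tokens_with_continuations(stream):
    """Parse one-token-per-line byte codes, joining '.' continuations."""
    out = []
    for line in stream.split('\n'):
        if not line:
            continue
        if line[0] == '.':
            code, value = out.pop()
            out.append((code, value + '\n' + line[1:]))
        else:
            out.append((line[0], line[1:]))
    return out
```

This is what lets a higher-level module report a multi-line scalar as a single 'v' token while the on-disk format stays strictly line oriented.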
From: Brian I. <in...@tt...> - 2003-09-11 07:53:14
|
On 10/09/03 19:11 +0200, Oren Ben-Kiki wrote:
> OK, enough rambling... My point is that this is a promising direction,
> but there are many details to work out - and I'm going to table the
> whole notion until *after* I get my Pull-Parser-Generator to work.

I actually have a language agnostic parser/loader implementation (in pure Perl right now) passing tests. I was using the old Parser/Emitter API. I'd like to start using the bytecode method. Perhaps I'll just use Clark's layout for now. I'll bring up issues as they come about.

Cheers, Brian |
From: Clark C. E. <cc...@cl...> - 2003-09-21 08:06:53
|
Howdy Why!

Anyway, it only took me about 4h to write a very first pass of YAML -> YAML Bytecodes via Syck. It certainly is one _hack_ of a job (complete with in-line main and a single hard coded test). However, it does work... nite!

http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/yaml4r/syck/ext/yamlbyte/

--- # YAML
test: 1
and: "with new\nline\n"
also: &3 three
more: *3
...

M
Ttaguri:yaml.org,2002:str
Stest
Ttaguri:yaml.org,2002:int
S1
Ttaguri:yaml.org,2002:str
Sand
Tstr
Swith new
N
Cline
N
Ttaguri:yaml.org,2002:str
Salso
A3
Ttaguri:yaml.org,2002:str
Sthree
Ttaguri:yaml.org,2002:str
Smore
R3
E

On Mon, Sep 08, 2003 at 06:08:44PM -0600, why the lucky stiff wrote:
| On Sunday 07 September 2003 06:13 pm, Clark C. Evans wrote:
| > Been thinking about a YAML bytecode specification ...
|
| heck of a good idea.
|
| i've always liked that Python's pickle has both binary and human-readable
| representations. i wonder if pickle could be leveraged? perhaps your vision
| of it is greater than i see on my simple scan of the header file...
|
| part of me thinks you invented this idea because you're a COMPLETE control
| freak. and the only way to ensure proper parsing/loading/emitting is to
| verify exact bytecode at a given stage of the process. :D
|
| well, good work. i'm going to look through the header file in detail when i
| get more time.
|
| _why
|
From: Clark C. E. <cc...@cl...> - 2003-09-22 02:30:05
|
whole scalar:
  summary: >
    The most painful aspect of converting the Syck interface to
    generate YAML bytecodes was that each scalar is reported as a
    whole, with the possibility of embedded '\n' and '\z' characters.
    If the consumer of the bytecodes needs the entire 'whole' scalar,
    then it seems especially dumb (not to mention burdensome) to
    break up each scalar into tiny bytecodes so that the consumer can
    rebuild it...
  proposal: >
    I propose adding a YAMLBYTE_WHOLESCALAR = '<' which can contain
    '\n' and '\z' items (and thus requires a length or an end pointer
    to specify how big it is).

    When serialized as text, this bytecode would have the following
    structure: '<<' word '\n' raw '\n' here '\n', where 'raw' is the
    raw scalar value, and 'here' is a printable ASCII string which
    marks the end of the scalar when encountered on a line all by
    itself.

      D
      Q
      Ssingle line scalar
      <<.
      This whole multi-line scalar has
      two new lines in its content and is
      not terminated with a new line.
      .
      Ssecond single line scalar
      E

  concerns:
    - >
      since this bytecode can have embedded new lines (\n) it could
      mess with readability... however, here (<<) documents are very
      well known, and in this context fit the bill. Note: this does
      not support the <<- variant... if you are after readability,
      use YAML!
    - >
      since this bytecode can have embedded nulls (\z) it could cause
      problems with programs that use \z as a signal that the stream
      has ended...
|
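[Editor's note: to make the here-document form concrete, here is a hypothetical sketch of serializing and recovering a whole scalar in the proposed '<<' textual form. The function names, and the default delimiter word, are invented for illustration; nothing here is normative.]

```python
# Sketch of the proposed '<<' whole-scalar bytecode in textual form:
#   '<<' word '\n' raw '\n' word '\n'
# The delimiter word must not occur in the raw value on a line by itself.

def encode_whole_scalar(raw: str, word: str = "EOS") -> str:
    assert word not in raw.split("\n"), "delimiter must not appear alone"
    return "<<" + word + "\n" + raw + "\n" + word + "\n"

def decode_whole_scalar(text: str) -> str:
    """Recover the raw value from a serialized whole scalar."""
    first, rest = text.split("\n", 1)
    assert first.startswith("<<")
    word = first[2:]
    raw, _terminator = rest.rsplit("\n" + word + "\n", 1)
    return raw

value = "line one\n\nline three, no trailing newline"
assert decode_whole_scalar(encode_whole_scalar(value)) == value
```

Note how the round trip preserves a value with embedded newlines and no trailing newline, which is exactly the case that S/N/C chunking makes burdensome.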
From: Brian I. <in...@tt...> - 2003-09-22 05:47:37
|
On 22/09/03 02:33 +0000, Clark C. Evans wrote:
> whole scalar:
>   summary: >
>     The most painful aspect of converting the Syck interface
>     to generate YAML bytecodes was that each scalar is reported
>     as a whole with the possibility of embedded '\n' and '\z'
>     characters. If the consumer of the bytecodes needs the
>     entire 'whole' scalar, then it seems especially dumb
>     (not to mention burdensome) to break up each scalar into
>     tiny bytecodes so that the consumer can rebuild it...
>   proposal: >
>     I propose adding a YAMLBYTE_WHOLESCALAR = '<' which
>     can contain '\n' and '\z' items (and thus requires
>     a length or an end pointer to specify how big it is).
>
>     When serialized as text, this bytecode would have the
>     following structure: '<<' word '\n' raw '\n' here '\n',
>     where 'raw' is the raw scalar value, and 'here' is a
>     printable ASCII string which marks the end of the scalar
>     when encountered on a line all by itself.
>
>       D
>       Q
>       Ssingle line scalar
>       <<.
>       This whole multi-line scalar has
>       two new lines in its content and is
>       not terminated with a new line.
>       .
>       Ssecond single line scalar
>       E
>
>   concerns:
>     - >
>       since this bytecode can have embedded new lines (\n)
>       it could mess with readability... however, here (<<)
>       documents are very well known, and in this context
>       fit the bill. Note: this does not support the <<-
>       variant... if you are after readability, use YAML!
>     - >
>       since this bytecode can have embedded nulls (\z) it
>       could cause problems with programs that use \z as
>       a signal that the stream has ended...

Your YAML, better written:

whole scalar:
  summary:
    The most painful aspect of converting the Syck interface to
    generate YAML bytecodes was that each scalar is reported as a
    whole with the possibility of embedded '\n' and '\z' characters.
    If the consumer of the bytecodes needs the entire 'whole' scalar,
    then it seems especially dumb (not to mention burdensome) to
    break up each scalar into tiny bytecodes so that the consumer
    can rebuild it...
  proposal: >
    I propose adding a YAMLBYTE_WHOLESCALAR = '<' which can contain
    '\n' and '\z' items (and thus requires a length or an end pointer
    to specify how big it is).

    When serialized as text, this bytecode would have the following
    structure: '<<' word '\n' raw '\n' here '\n', where 'raw' is the
    raw scalar value, and 'here' is a printable ASCII string which
    marks the end of the scalar when encountered on a line all by
    itself.

      D
      Q
      Ssingle line scalar
      <<.
      This whole multi-line scalar has
      two new lines in its content and is
      not terminated with a new line.
      .
      Ssecond single line scalar
      E

  concerns:
    - since this bytecode can have embedded new lines (\n) it could
      mess with readability... however, here (<<) documents are very
      well known, and in this context fit the bill. Note: this does
      not support the <<- variant... if you are after readability,
      use YAML!
    - since this bytecode can have embedded nulls (\z) it could
      cause problems with programs that use \z as a signal that the
      stream has ended...

This means almost the same thing. The folding indicator is usually
unneeded. It only makes sense on the 'proposal' entry, because the
indented lines need newline endings. (The only difference between
this and the original is that folded scalars end with a newline.)

Cheers,
The YAML Fashion Police
|
From: Clark C. E. <cc...@cl...> - 2003-09-20 19:09:15
|
# Changes since last pass:
#  - all bytecodes are \n terminated
#  - content bytecodes are upper case
#  - formatting bytecodes are lower case
#  - other bytecodes (for error messages, node length) are
#    using symbols, like #, !, etc.
#
subject: Revision #3 of YAML Bytecodes
summary: >
  This proposal defines a 'preparsed' format where YAML syntax is
  converted into a series of events, as bytecodes. Each bytecode
  appears on its own line, starting with a single character and
  ending with a line feed character, '\n'.
codes:
  #
  # Primary Bytecodes (Capital Letters)
  #
  # These bytecodes form the minimum needed to represent YAML
  # information from the serial model (ie, without format and
  # comments)
  #
  'D':
    name: Document
    desc: >
      Indicates that a document has begun: either it is the
      beginning of a YAML stream, or a --- has been found. Thus, an
      empty document is expressed as "D\n"
  'V':
    name: Directive
    desc: >
      This represents any YAML directives, immediately following a
      'D' bytecode. For example '--- %YAML:1.0' produces the
      bytecodes "D\nVYAML:1.0\n".
  'P':
    name: Pause Stream
    desc: >
      This instruction is used when a document is terminated, but
      another document has not yet begun. Thus, it is optional, and
      typically used to pause parsing. For example, a stream
      starting with an empty document, but then in a hold state for
      the next document, would be: "D\nP\n"
  0:
    name: End Stream (optional)
    desc: >
      YAML bytecodes are meant to be passable as a single "C"
      string, and thus the null terminator can optionally be used to
      signal the end of a stream. When writing bytecodes out to a
      flat file, the file need not contain a null terminator;
      however, when read into memory it should always have a null
      terminator.
  'M':
    name: Mapping
    desc: >
      Indicates the beginning of a mapping; children of the mapping
      are provided as a series of K1,V1,K2,V2 pairs as they are
      found in the input stream. For example, the bytecodes for
      "{ a: b, c: d }" would be "M\nSa\nSb\nSc\nSd\nE\n"
  'Q':
    name: Sequence
    desc: >
      Indicates the beginning of a sequence; children follow until
      an 'E' bytecode is encountered. So, the bytecodes for
      "[ one, two ]" would be "Q\nSone\nStwo\nE\n"
  'E':
    name: End Collection
    desc: >
      This closes the innermost open collection (mapping or
      sequence). Note that the document has one and only one node
      following it, therefore it is not a branch.
  'S':
    name: Scalar
    desc: >
      This indicates the start of a scalar value, which can be
      continued by the 'N' and 'C' bytecodes. This bytecode is used
      for sequence entries, keys, values, etc.
  'C':
    name: Scalar Continuation
    desc: >
      Since a scalar may not fit within a buffer, and since it may
      not contain a \n character, it may have to be broken into
      several chunks.
  'N':
    name: Normalized New Line (in a scalar value)
    desc: >
      Scalar values must be chunked so that new lines and null
      characters do not occur within an 'S' or 'C' bytecode (in the
      bytecodes, all other C0 characters need not be escaped). This
      bytecode is then used to represent one or more newlines, with
      the number of newlines optionally following. For example,
      "Hello\nWorld" would be "SHello\nN\nCWorld\n", and
      "Hello\n\n\nWorld" is "SHello\nN3\nCWorld\n"
  'Z':
    name: Null Character (in a scalar value)
    desc: >
      As with normalized new lines above, since the null character
      cannot be used in the bytecodes, it must be escaped; ie,
      "Hello\zWorld" would be "SHello\nZ\nCWorld\n".
  'A':
    name: Anchor
    desc: >
      This is used whenever a node carries an anchor; for example,
      "[ &X one, *X ]" would be normalized to "Q\nAX\nSone\nRX\nE\n"
      -- in this example, the anchor bytecode applies to the very
      next content bytecode.
  'R':
    name: Reference (Alias)
    desc: >
      This bytecode stands in for an alias node, referring back to
      the anchor established by a previous 'A' bytecode.
  'T':
    name: Transfer
    desc: >
      This is the transfer method. If the value begins with a '!',
      then it is not normalized. Otherwise, the value is a fully
      qualified URI, with a colon. The transfer method applies only
      to the node immediately following, and thus it can be seen as
      a modifier like the anchor. For example,
      "Ttaguri:yaml.org,2002:str\nSstring\n" is normalized,
      "T!str\nSstring\n" is not.
  #
  # Formatting bytecodes (lower case)
  #
  # The following bytecodes are purely at the syntax level and
  # useful for pretty printers and emitters. Since the range of
  # lower case letters is contiguous, it could be easy for a
  # processor to simply ignore all bytecodes in this range.
  #
  'c':
    name: Comment
    desc: >
      This is a single line comment. It is terminated like all of
      the other variable length items, with a '\n'.
  'i':
    name: Indent
    desc: >
      Specifies the number of additional spaces to indent for
      subsequent block style nodes; "i4\n" specifies a 4 character
      indent.
  'f':
    name: Flow Collections
    desc: >
      Use flow (inline bracketed) style for subsequent mappings {}
      and sequences []. If this bytecode is followed by a string
      value, it is treated as a transfer method to which the style
      applies; for example, "[ one, two ]" is expressed as
      "f\nQ\nSone\nStwo\nE\n". If one wants to limit this to a
      private transfer "bing", for example, then the flow
      instruction would appear as "f!!bing\n". Only subsequent
      collections marked with "!!bing" would be affected. Transfers
      used in this way are matched by exact string copies; thus,
      normalization equivalencies are not checked.
  'b':
    name: Block Collections
    desc: >
      Use block/indented style for subsequent mappings and
      sequences; as with Flow Collections above, this can be
      followed with an optional transfer string (see T above).
  's':
    name: Single Quoted Style
    desc: >
      Subsequent scalar nodes 'S' should be expressed using single
      quoted style, until another style instruction changes the
      current 'paintbrush'. Of course, if the scalar cannot be
      emitted using the single quoted style (it has items which
      require double quoted escaping), then this is an error.
      As above, if the bytecode is followed by characters, this is
      treated as a transfer method to be matched, for example:
        "s\nSsingle quoted\n"
        "s!!bing\nSplain style\nT!!bing\nSsingle quoted\n"
  'd':
    name: Double Quoted Style
    desc: >
      Same as above, only specifying double quoted. Note that, in
      the default output printer, scalar values over a given buffer
      size or beyond one line may always be double quoted, as this
      is the only style which can represent the entire Unicode
      character set within YAML.
  'l':
    name: Literal Style
    desc: >
      Same as above, only specifying block literal style for scalar
      values.
  'o':
    name: Folded Style
    desc: >
      Same as above, only specifying folded literal style for scalar
      values.
  'p':
    name: Plain Style
    desc: >
      Specifies use of plain style for scalar values.
  #
  # Advanced bytecodes (not alphabetic)
  #
  # These are optional goodies which one could find useful.
  #
  '#':
    name: Line Number
    desc: >
      This bytecode allows the line number of the very next node to
      be reported.
  '!':
    name: Notice
    desc: >
      This signifies the end of the current document (and possibly
      the entire stream) due to an error condition. This signal has
      a packed format, with an error number, a comma, and a textual
      error message:
        "#22\n!73,Indentation mismatch\n"
        "#132\n!84,Tabs are illegal for indentation\n"
  '?':
    name: Length
    desc: >
      This bytecode gives the span of the very next 'S', 'M', or
      'Q' bytecode -- including its subordinates. For scalars, it
      includes the span of all subordinate 'N' and 'C' codes. For
      mappings or sequences, this gives the length all the way to
      the corresponding 'E' bytecode, so that the entire branch can
      be skipped. The length is given starting at the corresponding
      'S', 'M' or 'Q' bytecode and extends to the first character
      following subordinate nodes. Since this length instruction is
      meant to 'speed' things up, and since calculating the length
      by hand is not really ideal, the length is expressed in hex.
      This will allow programs to easily convert the length to an
      actual value (converting from hex to integers is easier than
      decimal). Furthermore, all leading x's are ignored (so that
      they can be filled in later), and if the bytecode value is
      all x's, then the length is unknown. Lastly, this length is
      expressed in 8 bit units for UTF-8, and 16 bit units for
      UTF-16. For example,

        --- [[one, two], three]

      is expressed as

        "?25\nD\n?x1E\nQ\n?xxE\nQ\nSone\nStwo\nE\nSthree\nE\n"

      Thus it is seen that the address of D plus 37 is the null
      terminator for the string, the first 'Q' plus 30 also gives
      the null terminator, and the second 'Q' plus 14 jumps to the
      opening 'S' for the third scalar.
design:
  - name: streaming support
    problem: >
      The interface should ideally allow a YAML document to be moved
      incrementally as a stream through a process. In particular,
      YAML is inherently line oriented; thus the interface should
      probably reflect this fundamental character.
    solution: >
      The bytecodes deliver scalars as chunks, each chunk limited to
      at most one line. While this is not ideal for passing large
      binary objects, it is simple and easy to understand.
  - name: push
    problem: >
      The most common 'parsers' out there for YAML are push style,
      where the producer owns the 'C' program stack and the consumer
      keeps its state as a heap object. Ideal use of a push
      interface is an emitter, since this allows the sender (the
      application program) to use the program stack and thus keep
      its state on the call stack in local, automatic variables.
    solution: >
      A push interface can simply call a single event handler with a
      (bytecode, payload) tuple. Since the core complexity is in the
      bytecodes, the actual function signature is straight-forward,
      allowing for relative language independence. Since the
      bytecode is always one character, the event handler could just
      receive a string where the tuple is implicit.
  - name: pull
    problem: >
      The other alternative for a streaming interface is a 'pull'
      mechanism, or iterator model, where the consumer owns the C
      stack and the producer keeps any state needed as a heap
      object. Ideal use of a pull interface is a parser, since this
      allows the receiver (the application program) to use the
      program stack, keeping its state on the call stack in local
      variables.
    solution: >
      A pull interface would also be a simple function that, when
      called, fills a buffer with binary node(s). Or, in a language
      with garbage collection, it could be implemented as an
      iterator returning a string containing the bytecode line
      (bytecode followed immediately by the bytecode argument as a
      single string) or as a tuple.
  - name: pull2push
    problem: >
      An iterator-style producer must be connected to an
      event-handler-style consumer.
    solution: >
      This is done easily via a small loop which pulls from the
      iterator and pushes to the event handler. For Python,
      assuming the parser is implemented as an iterator from which
      one can 'pull' (bytecode, args) tuples, and assuming the
      emitter has an event callback taking a (bytecode, args)
      tuple, we have:

        def pull2push(parser, emitter):
            for (bytecode, args) in parser:
                emitter.push(bytecode, args)

  - name: push2pull
    problem: >
      This direction requires that the entire YAML stream be cached
      in memory, or that each of the two stages run in its own
      thread or continuation, with shared memory or a pipe between
      them.
    solution: >
      This use case seems much easier with a binary stream; that is,
      one need not convert the style of functions between the push
      and pull patterns. And, for languages supporting continuations
      (ruby), perhaps push vs pull is not even an issue... For a
      language like Python, one would use the threaded Queue object:
      one thread pushes (bytecode, args) tuples into the Queue,
      while the other thread pulls the tuples out. Simple.
  - name: neutrality
    problem: >
      It would be ideal if the C program interface was simple enough
      to be independent of programming language. In an ideal case,
      imagine a flow of YAML structured data through various
      processing stages on a server, where each processing stage is
      written in a different programming language.
    solution: >
      While it may be hard for each language to write a syntax
      parser filled with all of the little details, it would be
      much, much easier to write a parser for these bytecodes, as it
      involves simple string handling, dispatching on the first
      character in each string.
  - name: tools
    problem: >
      A goal of mine is to have a YPATH expression language, a
      schema language, and a transformation language. I would like
      these items to be reusable by a great number of
      platforms/languages, and in particular as their own callable
      processing stages.
    solution: >
      If such an expression language was written on top of a
      bytecode format like this, via a simple pull function (with
      adapters for push2pull and pull2push), quite a bit of
      reusability could emerge. Imagine a schema validator which is
      injected into the bytecode stream: it is an identity operation
      unless an exception occurs, in which case it terminates the
      document and makes the next document a description of the
      validation error.
  - name: encoding
    problem: >
      Text within the bytecode format must be given an encoding.
      There are several considerations at hand.
    solution: >
      The YAML bytecode format uses the same encodings as YAML
      itself, and thus is independent of actual encoding. A parser
      library should have several functions to convert between the
      encodings.
examples:
  - yaml: |
      ---
      - plain
      - >
        this is a flow scalar
      - >
        another flow scalar which is continued
        on a second line and indented 2 spaces
      - &001 !str |
        This is a block scalar, both typed
        and anchored
      - *001 # this was an alias
      - "This is a \"double quoted\" scalar"
    bytecode: |
      D
      Q
      Splain
      f
      Sthis is a flow scalar
      Sanother flow scalar which is continued
      Con a second line and indented 2 spaces
      b
      A001
      T!str
      SThis is a block scalar, both typed
      N
      Cand anchored
      R001
      cthis was an alias
      d
      SThis is a "double quoted" scalar
      E
|
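[Editor's note: the 'S'/'N'/'C' chunking rules above can be sketched mechanically. This is an illustrative, non-normative Python fragment (the function name is invented), assuming LF-only newlines and scalars small enough to ignore buffer limits:]

```python
import re

# Sketch of the 'S' / 'N' / 'C' chunking rules for scalar content:
# the first text chunk is emitted with 'S', runs of newlines become
# 'N' (with a count when more than one), and later chunks use 'C'.

def scalar_to_bytecodes(value: str) -> str:
    out = []
    first = True
    # capturing group keeps the newline runs in the split result
    for part in re.split(r"(\n+)", value):
        if not part:
            continue
        if part.startswith("\n"):
            if first:               # scalar starting with a newline
                out.append("S")
                first = False
            count = len(part)
            out.append("N" + (str(count) if count > 1 else ""))
        else:
            out.append(("S" if first else "C") + part)
            first = False
    if first:                       # empty scalar still gets an 'S'
        out.append("S")
    return "\n".join(out) + "\n"

assert scalar_to_bytecodes("Hello\nWorld") == "SHello\nN\nCWorld\n"
assert scalar_to_bytecodes("Hello\n\n\nWorld") == "SHello\nN3\nCWorld\n"
```

Both assertions reproduce the worked examples given under the 'N' bytecode.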
From: Oren Ben-K. <or...@be...> - 2003-09-20 19:42:05
|
Some points:

> 'S':
>   name: Scalar
>   desc: >
>     This indicates the start of a scalar value, which can
>     be continued by the 'N' and 'C' bytecodes. This bytecode
>     is used for sequence entries, keys, values, etc.

Maybe it is better to have 'S' end with an 'E' and have multiple
'V'-s and 'N'-s in it (you'd need some other char for directives -
maybe 'C'). It is cleaner than using 'C' (e.g. a node always ends
with an 'E'). It isn't a big deal, though.

> 'N':
>   name: Normalized New Line (in a scalar value)
>   desc: >
>     Scalar values must be chunked so that new lines and
>     null values do not occur within a 'S' or 'C' bytecode
>     (in the bytecodes, all other C0 need not be escaped).
>     This bytecode is then used to represent one or more
>     newlines, with the number of newlines optionally
>     following. For example,
>     "Hello\nWorld" would be "SHello\nN\nCWorld\n", and
>     "Hello\n\n\nWorld" is "SHello\nN3\nCWorld\n"

This doesn't handle LS and PS... I suggest that 'N' be followed by
character(s) specifying the new-line type.

> 'f':
>   name: Flow Collections
>   desc: >
>     Use flow (inline bracketed) style for subsequent mappings
>     {} and sequences []. If this bytecode is followed by
>     string value, it is treated as a transfer method to which
>     the style applies, for example, "[ one, two ]" is expressed
>     as "f\nQ\nSone\nStwo\nE\n". If one wants to limit this to
>     a private transfer "bing" for example, then the flow
>     instruction would appear as "f!!bing\n". Only subsequent
>     collections marked with "!!bing" would be affected.
>     Transfers used in this way are matched by exact string
>     copies, thus, normalization equivalencies are not checked.

-1 on that:

-1 on the "applied to subsequent nodes of the specified transfer
method". The byte codes are no place for this level of instruction.
Not that there isn't a need for that sort of stuff - but it is a
separate concern, and should be addressed properly rather than
slapping one very special restricted case into the byte codes.

I also tend towards having a single 's' code followed by a style
marker: s>, s|, s', s", sf (flow), s (for plain - or maybe sp). I'd
also have a code for explicit indent and for chomping, allowing the
byte codes to express the exact modifiers given in the document for
a node.

> '!':
>   name: Notice
>   desc: >
>     This signifies the end of the current document (and
>     possibly the entire stream) due to an error condition.
>     This signal has a packed format, with an error number,
>     a comma, and textual error message:
>     "#22\n!73,Indentation mismatch\n"
>     "#132\n!84,Tabs are illegal for indentation\n"

I think this shouldn't necessarily indicate the end of the document.
If byte codes follow, they should either be 'D' for the next
document, 0 if everything is done, or anything else if the parser
has some recovery algorithm - e.g., multiple 'E's to close the
affected nodes and then continuation at the next valid node.

> '?':
>   name: Length
>   desc: >
>     This bytecode gives the span of the very next 'S', 'M',
>     or 'Q' bytecode -- including its subordinates. For scalars,
>     it includes the span of all subordinate 'N' and 'C' codes.
>     For mappings or sequences, this gives the length all the
>     way to the corresponding 'E' bytecode so that the entire
>     branch can be skipped. The length is given starting at
>     the corresponding 'S', 'M' or 'Q' bytecode and extends
>     to the first character following subordinate nodes.

This expresses the length of the byte codes of the node, right? To
allow a quick "seek" to skip the node's byte codes? Well and good,
but it is no less useful - perhaps more useful - to know the length
of the _content_ of a node in advance (the number of entries in a
sequence, the number of chars in a scalar, or even pairs in a map).
This allows a loader to pre-allocate the necessary structures and
thus greatly improve efficiency. We need such a code in addition to
this one.

Overall, I like the approach. Like I said, I'm not willing to "sign"
on the codes until I get to the point where I can actually generate
them, which will take a few months. In the meanwhile, how is the
release candidate draft coming along? :-)

Have fun,

Oren Ben-Kiki
|
From: Clark C. E. <cc...@cl...> - 2003-09-20 20:26:56
|
On Sat, Sep 20, 2003 at 09:39:44PM +0200, Oren Ben-Kiki wrote:
| > 'S':
| >   name: Scalar
| >   desc: >
| >     This indicates the start of a scalar value, which can
| >     be continued by the 'N' and 'C' bytecodes. This bytecode
| >     is used for sequence entries, keys, values, etc.
|
| Maybe it is better to have 'S' end with an 'E' and have multiple
| 'V'-s and 'N'-s in it (you'd need some other char for directives -
| maybe 'C'). It is cleaner than using 'C' (e.g. a node always ends
| with an 'E'). It isn't a big deal, though.

In this approach, the scalar ends when you encounter a byte code
which is not 'N', 'C', or 'Z'. A tiny state machine is needed to
handle this case, but I think the bulk of scalars will not contain
new lines or nulls, and will not need to be continued; so I'd
rather make it easy for that case. Also, branch (mappings and
sequences) tracking will require the handler to maintain a stack;
'E' pops the stack. I don't really see the need to put scalars on
that stack as well, so I was thinking more from this use case. I
guess if you want to think of 'C' as chunk, that would be fine...
but I still don't see the reason to end a sequence of chunks, as
any other bytecode will do this for you.

| > 'N':
| >   name: Normalized New Line (in a scalar value)
| >   desc: >
| >     Scalar values must be chunked so that new lines and
| >     null values do not occur within a 'S' or 'C' bytecode
| >     (in the bytecodes, all other C0 need not be escaped).
| >     This bytecode is then used to represent one or more
| >     newlines, with the number of newlines optionally
| >     following. For example,
| >     "Hello\nWorld" would be "SHello\nN\nCWorld\n", and
| >     "Hello\n\n\nWorld" is "SHello\nN3\nCWorld\n"
|
| This doesn't handle LS and PS... I suggest that 'N' be followed by
| character(s) specifying the new-line type.

Ok. So, NCRLF, NLF, NCR, NLS, NPS would be the sequences, where N
all by itself is NLF?

| > 'f':
| >   name: Flow Collections
| >   desc: >
| >     Use flow (inline bracketed) style for subsequent mappings
| >     {} and sequences []. If this bytecode is followed by
| >     string value, it is treated as a transfer method to which
| >     the style applies, for example, "[ one, two ]" is expressed
| >     as "f\nQ\nSone\nStwo\nE\n". If one wants to limit this to
| >     a private transfer "bing" for example, then the flow
| >     instruction would appear as "f!!bing\n". Only subsequent
| >     collections marked with "!!bing" would be affected.
| >     Transfers used in this way are matched by exact string
| >     copies, thus, normalization equivalencies are not checked.
|
| -1 on that:
|
| -1 on the "applied to subsequent nodes of the specified transfer
| method". The byte codes are no place for this level of instruction.
| Not that there isn't a need for that sort of stuff - but it is a
| separate concern, and should be addressed properly rather than
| slapping one very special restricted case into the byte codes.
|
| I also tend towards having a single 's' code followed by a style
| marker: s>, s|, s', s", sf (flow), s (for plain - or maybe sp).
| I'd also have a code for explicit indent and for chomping, allowing
| the byte codes to express the exact modifiers given in the document
| for a node.

Ok. I'll make these changes.

| > '!':
| >   name: Notice
| >   desc: >
| >     This signifies the end of the current document (and
| >     possibly the entire stream) due to an error condition.
| >     This signal has a packed format, with an error number,
| >     a comma, and textual error message:
| >     "#22\n!73,Indentation mismatch\n"
| >     "#132\n!84,Tabs are illegal for indentation\n"
|
| I think this shouldn't necessarily indicate the end of the
| document. If byte codes follow, they should either be 'D' for the
| next document, 0 if everything is done, or anything else if the
| parser has some recovery algorithm - e.g., multiple 'E's to close
| the affected nodes and then continuation of the next valid node.

Right. \z ends the document. I will fix.

| > '?':
| >   name: Length
| >   desc: >
| >     This bytecode gives the span of the very next 'S', 'M',
| >     or 'Q' bytecode -- including its subordinates. For scalars,
| >     it includes the span of all subordinate 'N' and 'C' codes.
| >     For mappings or sequences, this gives the length all the
| >     way to the corresponding 'E' bytecode so that the entire
| >     branch can be skipped. The length is given starting at
| >     the corresponding 'S', 'M' or 'Q' bytecode and extends
| >     to the first character following subordinate nodes.
|
| This expresses the length of the byte codes of the node, right? To
| allow a quick "seek" to skip the node's byte codes? Well and good,
| but it is no less useful - perhaps more useful - to know the length
| of the _content_ of a node in advance (the number of entries in a
| sequence, the number of chars in a scalar, or even pairs in a map).
| This allows a loader to pre-allocate the necessary structures and
| thus greatly improve efficiency. We need such a code in addition to
| this one.

Right. This one is about 'seek', not about 'allocation'. ;)

| Overall, I like the approach. Like I said, I'm not willing to
| "sign" on the codes until I get to the point I can actually
| generate them, which would take a few months.

That's fine. I just wanted to put a front end on Syck so that I can
start using the bytecodes for various items (such as writing a
generic Python loader via the bytecodes).

| In the meanwhile, how is the release candidate
| draft coming along? :-)

Yes.... Hmm... *me acts like a nerd* Ok. I'll get back to that very
shortly. I've got one (or two, even) days a week on YAML! ;)

Best,

Clark
|
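[Editor's note: the newline-type encoding floated here (NCRLF, NLF, NCR, NLS, NPS, with bare N meaning LF) is easy to table-drive. A hypothetical illustration, not part of any spec; note that Revision #4 below ultimately shortens the suffixes to single characters:]

```python
# Illustrative mapping from normalized newline kinds to the suffix
# that would follow an 'N' bytecode, per the NCRLF/NLF/NCR/NLS/NPS
# idea. Bare 'N' is taken to mean LF; nothing here is normative.

NEWLINE_SUFFIX = {
    "\r\n": "CRLF",
    "\n": "",        # bare N means LF
    "\r": "CR",
    "\u2028": "LS",  # Unicode line separator
    "\u2029": "PS",  # Unicode paragraph separator
}

def newline_bytecode(nl: str) -> str:
    """Return the 'N' bytecode line for one normalized newline."""
    return "N" + NEWLINE_SUFFIX[nl] + "\n"

assert newline_bytecode("\n") == "N\n"
assert newline_bytecode("\r\n") == "NCRLF\n"
```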
From: Clark C. E. <cc...@cl...> - 2003-09-20 21:06:07
|
# # Reflects Oren's comments, adds yamlbyte.h at the bottom # subject: Revision #4 of YAML Bytecodes summary: > This proposal defines a 'preparsed' format where a YAML syntax is converted into a series of events, as bytecodes. Each bytecode appears on its own line, starting with a single character and ending with a line feed character, '\n'. codes: # # Primary Bytecodes (Capital Letters) # # These bytecodes form the minimum needed to represent YAML information # from the serial model (ie, without format and comments) # 'D': name: Document desc: > Indicates that a document has begun, either it is the beginning of a YAML stream, or a --- has been found. Thus, an empty document is expressed as "D\n" 'V': name: Directive desc: > This represents any YAML directives immediately following a 'D' bytecode. For example '--- %YAML:1.0' produces the bytecode "D\nVYAML:1.0\n". 'P': name: Pause Stream desc: > This is the instruction when a document is terminated, but another document has not yet begun. Thus, it is optional, and typically used to pause parsing. For example, a stream starting with an empty document, but then in a hold state for the next document would be: "D\nP\n" '\z': name: Finish (end stream) desc: > YAML bytecodes are meant to be passable as a single "C" string, and thus the null terminator can optionally be used to signal the end of a stream. When writing bytecodes out to a flat file, the file need not contain a null terminator; however, when read into memory it should always have a null terminator. 'M': name: Mapping desc: > Indicates the begin of a mapping, children of the mapping are provided as a series of K1,V1,K2,V2 pairs as they are found in the input stream. For example, the bytecodes for "{ a: b, c: d }" would be "M\nSa\nSb\nSc\nSd\nE\n" 'Q': name: Sequence desc: > Indicates the begin of a sequence, children are provided following till a '.' bytecode is encountered. 
So, the bytecodes for "[ one, two ]" would be "Q\nSone\nStwo\nE\n" 'E': name: End Collection desc: > This closes the outermost Collection (Mapping, Sequence), note that the document has one and only one node following it, therefore it is not a branch. 'S': name: Scalar desc: > This indicates the start of a scalar value, which can be continued by the 'N' and 'C' bytecodes. This bytecode is used for sequence entries, keys, values, etc. 'C': name: Scalar Continuation desc: > Since a scalar may not fit within a buffer, and since it may not contain a \n character, it may have to be broken into several chunks. 'N': name: Normalized New Line (in a scalar value) desc: > Scalar values must be chunked so that new lines and null values do not occur within a 'S' or 'C' bytecode (in the bytecodes, all other C0 need not be escaped). This bytecode is then used to represent one or more newlines, with the number of newlines optionally following. For example, "Hello\nWorld" would be "SHello\nN\nCWorld\n", and "Hello\n\n\nWorld" is "SHello\nN3\nCWorld\n" If the new line is an LS or a PS, the N bytecode can be followed with a L or P. Thus, "Hello\PWorld\L" is reported "SHello\nNP\nWorld\NL\n" 'Z': name: Null Character (in a scalar value) desc: > As in normalized new lines above, since the null character cannot be used in the bytecodes, is must be escaped, ie, "Hello\zWorld" would be "SHello\nZ\nCWorld\n". 'A': name: Alias desc: > This is used when ever there is an alias node, for example, "[ &X one, *X ]" would be normalized to "S\nAX\nSone\nRX\nE\n" -- in this example, the anchor bytecode applies to the very next content bytecode. 'R': name: Reference (Anchor) desc: > This bytecode associates an anchor with the very next content node, see the 'A' alias bytecode. 'T': name: Transfer desc: > This is the transfer method. If the value begins with a '!', then it is not normalized. Otherwise, the value is a fully qualified URL, with a semicolon. 
      The transfer method applies only to the node immediately
      following, and thus it can be seen as a modifier like the
      anchor.  For example, "Ttaguri:yaml.org,2002:str\nSstring\n"
      is normalized; "T!str\nSstring\n" is not.
  #
  # Formatting bytecodes (lower case)
  #
  # The following bytecodes are purely at the syntax level and
  # useful for pretty printers and emitters.  Since the range of
  # lower case letters is contiguous, it could be easy for a
  # processor to simply ignore all bytecodes in this range.
  #
  'c':
    name: Comment
    desc: >
      This is a single line comment.  It is terminated like all of
      the other variable length items, with a '\n'.
  'i':
    name: Indent
    desc: >
      Specifies the number of additional spaces to indent for
      subsequent block style nodes; "i4\n" specifies a 4 character
      indent.
  's':
    name: Scalar styling
    desc: >
      This bytecode is followed by one of the following characters to
      indicate the style to be used for the very next content node.
      It is an error to specify a style other than double quoted for
      a scalar which must be escaped.  Furthermore, there must be
      agreement between the style and the very next content node; in
      other words, a scalar style requires that the next content node
      be an 'S'.

        >  flow scalar
        "  double quoted scalar
        '  single quoted scalar
        |  literal scalar
        p  plain scalar
        {  inline mapping
        [  inline sequence
        b  block style (for mappings and sequences)
  #
  # Advanced bytecodes (not alphabetic)
  #
  # These are optional goodies which one could find useful.
  #
  '#':
    name: Line Number
    desc: >
      This bytecode allows the line number of the very next node to
      be reported.
  '!':
    name: Notice
    desc: >
      This is a message sent from the producer to the consumer
      regarding the state of the stream or document.  It does not
      necessarily end a stream, as the 'finish' bytecode can be used
      for this purpose.  This signal has a packed format, with the
      error number, a comma, and a textual message:

        "#22\n!73,Indentation mismatch\n"
        "#132\n!84,Tabs are illegal for indentation\n"
  ',':
    name: Span
    desc: >
      This bytecode gives the span of the very next 'S', 'M', or 'Q'
      bytecode -- including its subordinates.  For scalars, it
      includes the span of all subordinate 'N' and 'C' codes.  For
      mappings or sequences, this gives the length all the way to the
      corresponding 'E' bytecode, so that the entire branch can be
      skipped.  The length is given starting at the corresponding
      'S', 'M' or 'Q' bytecode and extends to the first character
      following the subordinate nodes.  Since this length instruction
      is meant to 'speed' things up, and since calculating the length
      by hand is not really ideal, the length is expressed in hex.
      This allows programs to easily convert the length to an actual
      value (converting from hex to integers is easier than from
      decimal).  Furthermore, all leading x's are ignored (so that
      they can be filled in later), and if the bytecode value is all
      x's, then the length is unknown.  Lastly, this length is
      expressed in 8 bit units for UTF-8, and 16 bit units for
      UTF-16.  For example,

        --- [[one, two], three]

      is expressed as

        ",25\nD\n,x1E\nQ\n,xxE\nQ\nSone\nStwo\nE\nSthree\nE\n"

      Thus the address of 'D' plus 37 is the null terminator for the
      string, the first 'Q' plus 30 also gives the null terminator,
      and the second 'Q' plus 14 jumps to the opening 'S' for the
      third scalar.
  '@':
    name: Allocate
    desc: >
      This is a hint telling the processor how many items are in the
      following collection (mapping pairs, or sequence values), or
      how many character units need to be allocated to hold the next
      value.  Clearly this is an encoding specific value.  The length
      which follows is in hex (not decimal).
      For example, "one" could be "@x3\nSone".
design:
  - name: streaming support
    problem: >
      The interface should ideally allow a YAML document to be moved
      incrementally as a stream through a process.  In particular,
      YAML is inherently line oriented, thus the interface should
      probably reflect this fundamental character.
    solution: >
      The bytecodes deliver scalars as chunks, each chunk limited to
      at most one line.  While this is not ideal for passing large
      binary objects, it is simple and easy to understand.
  - name: push
    problem: >
      The most common 'parsers' out there for YAML are push style,
      where the producer owns the "C" program stack, and the consumer
      keeps its state as a heap object.  Ideal use of a push
      interface is an emitter, since this allows the sender (the
      application program) to use the program stack and thus keep its
      state on the call stack in local, automatic variables.
    solution: >
      A push interface can simply call a single event handler with a
      (bytecode, payload) tuple.  Since the core complexity is in the
      bytecodes, the actual function signature is straightforward,
      allowing for relative language independence.  Since the
      bytecode is always one character, the event handler could just
      receive a string where the tuple is implicit.
  - name: pull
    problem: >
      The other alternative for a streaming interface is a 'pull'
      mechanism, or iterator model, where the consumer owns the C
      stack and the producer keeps any state needed as a heap object.
      Ideal use of a pull interface is a parser, since this allows
      the receiver (the application program) to use the program
      stack, keeping its state on the call stack in local variables.
    solution: >
      A pull interface would also be a simple function that, when
      called, fills a buffer with binary node(s).  Or, in a language
      with garbage collection, it could be implemented as an iterator
      returning a string containing the bytecode line (bytecode
      followed immediately by the bytecode argument as a single
      string) or as a tuple.
  - name: pull2push
    problem: >
      A pull-style producer must often be connected to a push-style
      consumer.
    solution: >
      This is done easily via a small loop which pulls from the
      iterator and pushes to the event handler.  For Python, assuming
      the parser is implemented as an iterator from which one can
      'pull' (bytecode, args) tuples, and assuming the emitter has an
      event callback taking a (bytecode, args) tuple, we have:

        def pull2push(parser, emitter):
            for (bytecode, args) in parser:
                emitter.push(bytecode, args)
  - name: push2pull
    problem: >
      Going the other way requires that the entire YAML stream be
      cached in memory, or that each of the two stages run in a
      thread or different continuation, with shared memory or a pipe
      between them.
    solution: >
      This use case seems much easier with a binary stream; that is,
      one need not convert the style of functions between the push vs
      pull pattern.  And, for languages supporting continuations
      (ruby), perhaps push vs pull is not even an issue...  For a
      language like python, one would use the threaded Queue object:
      one thread pushes (bytecode, args) tuples into the Queue, while
      the other thread pulls the tuples out.  Simple.
  - name: neutrality
    problem: >
      It would be ideal if the C program interface was simple enough
      to be independent of programming language.  In an ideal case,
      imagine a flow of YAML structured data through various
      processing stages on a server, where each processing stage is
      written in a different programming language.
    solution: >
      While it may be hard for each language to write a syntax parser
      filled with all of the little details, it would be much easier
      to write a parser for these bytecodes, as it involves simple
      string handling, dispatching on the first character in each
      string.
  - name: tools
    problem: >
      A goal of mine is to have a YPATH expression language, a schema
      language, and a transformation language.  I would like these
      items to be reusable by a great number of platforms/languages,
      and in particular as their own callable processing stage.
    solution: >
      If such an expression language were written on top of a
      bytecode format like this, via a simple pull function (/w
      adapters for push2pull and pull2push), quite a bit of
      reusability could emerge.  Imagine a schema validator which is
      injected into the bytecode stream: it is an identity operation
      unless an exception occurs, in which case it terminates the
      document and makes the next document be a description of the
      validation error.
  - name: encoding
    problem: >
      Text within the bytecode format must be given an encoding.
      There are several considerations at hand.
    solution: >
      The YAML bytecode format uses the same encodings as YAML
      itself, and thus is independent of actual encoding.  A parser
      library should have several functions to convert between the
      encodings.
examples:
  - yaml: |
      ---
      - plain
      - >
        this is a flow scalar
      - >
        another flow scalar which is continued
        on a second line and indented 2 spaces
      - &001 !str |
        This is a block scalar, both typed
        and anchored
      - *001 # this was an alias
      - "This is a \"double quoted\" scalar"
    bytecode: |
      D
      Q
      Splain
      f
      Sthis is a flow scalar
      Sanother flow scalar which is continued
      Con a second line and indented 2 spaces
      b
      a001
      t!str
      SThis is a block scalar, both typed
      N
      Cand anchored
      R001
      cthis was an alias
      d
      SThis is a "double quoted" scalar
      E
cheader: |
  /* yamlbyte.h
   *
   * The YAML bytecode "C" interface header file.  See the YAML
   * bytecode reference for bytecode sequence rules and for the
   * meaning of each bytecode.
   */
  #ifndef YAMLBYTE_H
  #define YAMLBYTE_H
  #include <stddef.h>
  #include <string.h>  /* memset, for YAML_PULL2PUSH */
  #include <assert.h>  /* assert, for YAML_PULL2PUSH */

  /* list out the various YAML bytecodes */
  typedef enum {
      /* content bytecodes */
      YAML_FINISH     = 0,
      YAML_DOCUMENT   = 'D',
      YAML_DIRECTIVE  = 'V',
      YAML_PAUSE      = 'P',
      YAML_MAPPING    = 'M',
      YAML_SEQUENCE   = 'Q',
      YAML_ENDMAPSEQ  = 'E',
      YAML_SCALAR     = 'S',
      YAML_CONTINUE   = 'C',
      YAML_NEWLINE    = 'N',
      YAML_NULLCHAR   = 'Z',
      YAML_ALIAS      = 'A',
      YAML_ANCHOR     = 'R',
      YAML_TRANSFER   = 'T',
      /* formatting bytecodes */
      YAML_COMMENT    = 'c',
      YAML_INDENT     = 'i',
      YAML_STYLE      = 's',
      /* other bytecodes */
      YAML_LINENUMBER = '#',
      YAML_NOTICE     = '!',
      YAML_SPAN       = ',',
      YAML_ALLOC      = '@'
  } yaml_code_t;

  /* additional modifiers for the YAML_STYLE bytecode */
  typedef enum {
      YAML_FLOW            = '>',
      YAML_LITERAL         = '|',
      YAML_BLOCK           = 'b',
      YAML_PLAIN           = 'p',
      YAML_INLINE_MAPPING  = '{',
      YAML_INLINE_SEQUENCE = '[',
      YAML_SINGLE_QUOTED   = 39,
      YAML_DOUBLE_QUOTED   = '"'
  } yaml_style_t;

  typedef unsigned char  yaml_utf8_t;
  typedef unsigned short yaml_utf16_t;
  #ifdef YAML_UTF8
  #ifdef YAML_UTF16
  #error Must only define YAML_UTF8 or YAML_UTF16
  #endif
  typedef yaml_utf8_t yaml_char_t;
  #else
  #ifdef YAML_UTF16
  typedef yaml_utf16_t yaml_char_t;
  #else
  #error Must define YAML_UTF8 or YAML_UTF16
  #endif
  #endif

  /* return value for the push function; tells the producer whether
   * to stop */
  typedef enum {
      YAML_MORE = 1,  /* producer should continue to fire events */
      YAML_STOP = 0   /* producer should stop firing events */
  } yaml_more_t;

  /* push bytecodes from a producer to a consumer,
   * where arg is null terminated /w a length */
  typedef void * yaml_consumer_t;
  typedef
  yaml_more_t
  (*yaml_push_t)(
      yaml_consumer_t    self,
      yaml_code_t        code,
      const yaml_char_t *arg,
      size_t             arglen
  );

  /* pull bytecodes from the producer, where the producer must null
   * terminate buff and return the number of sizeof(yaml_char_t)
   * units used in the buffer */
  typedef void * yaml_producer_t;
  typedef
  size_t
  (*yaml_pull_t)(
      yaml_producer_t self,
      yaml_code_t    *code,
      yaml_char_t    *buff,  /* at least 1K buffer */
      size_t          buffsize
  );

  /* canonical helper to show how to hook up a parser (as a pull
   * producer) to an emitter (as a push consumer) */
  #define YAML_PULL2PUSH(pull, producer, push, consumer)    \
      do {                                                  \
          yaml_code_t code = YAML_NOTICE;                   \
          yaml_more_t more = YAML_MORE;                     \
          yaml_char_t buff[1024];                           \
          size_t size = 0;                                  \
          memset(buff, 0, 1024 * sizeof(yaml_char_t));      \
          while (code && more) {                            \
              size = (pull)((producer), &code, buff, 1024); \
              assert(size < 1024 && !buff[size]);           \
              more = (push)((consumer), code, buff, size);  \
          }                                                 \
          buff[0] = 0;                                      \
          (push)((consumer), YAML_FINISH, buff, 0);         \
      } while(0)

  #endif
|
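The S/C/N/Z chunking rules above are easy to exercise in a higher-level language. Below is a minimal Python sketch of both directions; the function names (`chunk_scalar`, `unchunk_scalar`) are my own, not part of the proposal, and the LS/PS ("NL"/"NP") variants are deliberately left out.

```python
def chunk_scalar(text):
    """Split scalar text into bytecode lines per the rules above:
    'S' opens the scalar, 'C' continues it, 'N' stands for a run of
    newlines (count attached when more than one), 'Z' for a null."""
    def segment_end(i):
        # advance to the next character that must be escaped
        while i < len(text) and text[i] not in ('\n', '\x00'):
            i += 1
        return i

    end = segment_end(0)
    out = ['S' + text[:end]]      # 'S' always opens the scalar
    i = end
    while i < len(text):
        if text[i] == '\n':
            run = 0
            while i < len(text) and text[i] == '\n':
                run += 1
                i += 1
            out.append('N' if run == 1 else 'N%d' % run)
        else:                     # a '\x00' character
            out.append('Z')
            i += 1
        end = segment_end(i)
        if end > i:               # only emit 'C' for non-empty chunks
            out.append('C' + text[i:end])
            i = end
    return out

def unchunk_scalar(lines):
    """Reassemble scalar text from S/C/N/Z bytecode lines."""
    parts = []
    for line in lines:
        code, arg = line[0], line[1:]
        if code in ('S', 'C'):
            parts.append(arg)
        elif code == 'N':
            parts.append('\n' * (int(arg) if arg else 1))
        elif code == 'Z':
            parts.append('\x00')
    return ''.join(parts)
```

For example, `chunk_scalar("Hello\n\n\nWorld")` yields `["SHello", "N3", "CWorld"]`, matching the "Hello\n\n\nWorld" example in the 'N' bytecode description.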
From: Clark C. E. <cc...@cl...> - 2003-09-20 20:28:36
|
/* yamlbyte.h
 *
 * The YAML bytecode "C" interface header file.  See the YAML bytecode
 * reference for bytecode sequence rules and for the meaning of each
 * bytecode.
 */
#ifndef YAMLBYTE_H
#define YAMLBYTE_H
#include <stddef.h>
#include <string.h>  /* memset, for YAML_PULL2PUSH */
#include <assert.h>  /* assert, for YAML_PULL2PUSH */

/* list out the various YAML bytecodes */
typedef enum {
    /* content bytecodes */
    YAML_FINISH     = 0,
    YAML_DOCUMENT   = 'D',
    YAML_DIRECTIVE  = 'V',
    YAML_PAUSE      = 'P',
    YAML_MAPPING    = 'M',
    YAML_SEQUENCE   = 'Q',
    YAML_ENDMAPSEQ  = 'E',
    YAML_SCALAR     = 'S',
    YAML_CONTINUE   = 'C',
    YAML_NEWLINE    = 'N',
    YAML_NULLCHAR   = 'Z',
    YAML_ALIAS      = 'A',
    YAML_ANCHOR     = 'R',
    YAML_TRANSFER   = 'T',
    /* formatting bytecodes */
    YAML_COMMENT    = 'c',
    YAML_INDENT     = 'i',
    YAML_FLOW       = 'f',
    YAML_BLOCK      = 'b',
    YAML_SINGLE     = 's',
    YAML_DOUBLE     = 'd',
    YAML_LITERAL    = 'l',
    YAML_FOLDED     = 'o',
    YAML_PLAIN      = 'p',
    /* other bytecodes */
    YAML_LINENUMBER = '#',
    YAML_NOTICE     = '!',
    YAML_LENGTH     = '?'
} yaml_code_t;

typedef unsigned char  yaml_utf8_t;
typedef unsigned short yaml_utf16_t;
#ifdef YAML_UTF8
#ifdef YAML_UTF16
#error Must only define YAML_UTF8 or YAML_UTF16
#endif
typedef yaml_utf8_t yaml_char_t;
#else
#ifdef YAML_UTF16
typedef yaml_utf16_t yaml_char_t;
#else
#error Must define YAML_UTF8 or YAML_UTF16
#endif
#endif

/* return value for the push function; tells the producer whether
 * to stop */
typedef enum {
    YAML_MORE = 1,  /* producer should continue to fire events */
    YAML_STOP = 0   /* producer should stop firing events */
} yaml_more_t;

/* push bytecodes from a producer to a consumer,
 * where arg is null terminated /w a length */
typedef void * yaml_consumer_t;
typedef
yaml_more_t
(*yaml_push_t)(
    yaml_consumer_t    self,
    yaml_code_t        code,
    const yaml_char_t *arg,
    size_t             arglen
);

/* pull bytecodes from the producer, where the producer must null
 * terminate buff and return the number of sizeof(yaml_char_t)
 * units used in the buffer */
typedef void * yaml_producer_t;
typedef
size_t
(*yaml_pull_t)(
    yaml_producer_t self,
    yaml_code_t    *code,
    yaml_char_t    *buff,  /* at least 1K buffer */
    size_t          buffsize
);

/* canonical helper to show how to hook up a parser (as a pull
 * producer) to an emitter (as a push consumer) */
#define YAML_PULL2PUSH(pull, producer, push, consumer)    \
    do {                                                  \
        yaml_code_t code = YAML_NOTICE;                   \
        yaml_more_t more = YAML_MORE;                     \
        yaml_char_t buff[1024];                           \
        size_t size = 0;                                  \
        memset(buff, 0, 1024 * sizeof(yaml_char_t));      \
        while (code && more) {                            \
            size = (pull)((producer), &code, buff, 1024); \
            assert(size < 1024 && !buff[size]);           \
            more = (push)((consumer), code, buff, size);  \
        }                                                 \
        buff[0] = 0;                                      \
        (push)((consumer), YAML_FINISH, buff, 0);         \
    } while(0)

#endif
|
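The hex length argument used by the length/span bytecode ('?' in this revision, ',' in the earlier one) has an unusual convention: leading 'x' characters are placeholders to be ignored, and an all-'x' argument means the length is unknown. A possible decoder, as a Python sketch (the function name is mine):

```python
def parse_length(arg):
    """Decode a span/length argument: hex digits, possibly padded
    with leading 'x' placeholders (to be filled in later); an
    all-'x' argument means the length is unknown (returns None)."""
    digits = arg.lstrip('x')
    if not digits:
        return None          # all placeholders: length unknown
    return int(digits, 16)
```

Against the span example from the first revision, `parse_length("25")` gives 37, `parse_length("x1E")` gives 30, and `parse_length("xxE")` gives 14.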
From: Clark C. E. <cc...@cl...> - 2003-09-21 04:20:22
|
Howdy.  Well, this is it for this weekend; my play day is over.
Anyway, here is what I think a "C" interface for YAML bytecodes could
look like.  In short, the pull/push functions just take a void
pointer and a character buffer.  All further 'API' specification is
delegated to the bytecode definitions.  The buffer can contain one or
more bytecodes and their corresponding data.  There is not much
difference between push and pull, as seen by the macro PULL2PUSH.

I pulled out Syck, created an ext directory called bytecode and
started to hack there for a while (thus motivating the changes
below).  However, Syck uses postorder tree traversal, while the
bytecode specification really assumes preorder.  I think this can be
solved by converting each 'event' into a bytecode buffer which
contains the instructions for that particular node (anchor, alias,
etc.).  And then, when I encounter a collection, I build its buffer
by concatenating.  While this won't be efficient, it should convert
Syck event handling into bytecode buffers....

After converting this, I was thinking of writing a native 'C' level
emitter which used the bytecodes (including the format codes, if they
are provided).  Of course, next weekend I'll focus on the spec...

Best,

Clark

P.S. Thanks _Why for the wonderful parser.

-----

/* yamlbyte.h
 *
 * The YAML bytecode "C" interface header file.  See the YAML bytecode
 * reference for bytecode sequence rules and for the meaning of each
 * bytecode.
 */
#ifndef YAMLBYTE_H
#define YAMLBYTE_H
#include <stddef.h>

/* define what a character is */
typedef unsigned char  yaml_utf8_t;
typedef unsigned short yaml_utf16_t;
#ifdef YAML_UTF8
#ifdef YAML_UTF16
#error Must only define YAML_UTF8 or YAML_UTF16
#endif
typedef yaml_utf8_t yaml_char_t;
#else
#ifdef YAML_UTF16
typedef yaml_utf16_t yaml_char_t;
#else
#error Must define YAML_UTF8 or YAML_UTF16
#endif
#endif

/* specify list of bytecodes */
#define YAML_FINISH      ((yaml_char_t) 0)
#define YAML_DOCUMENT    ((yaml_char_t)'D')
#define YAML_DIRECTIVE   ((yaml_char_t)'V')
#define YAML_PAUSE       ((yaml_char_t)'P')
#define YAML_MAPPING     ((yaml_char_t)'M')
#define YAML_SEQUENCE    ((yaml_char_t)'Q')
#define YAML_END_BRANCH  ((yaml_char_t)'E')
#define YAML_SCALAR      ((yaml_char_t)'S')
#define YAML_CONTINUE    ((yaml_char_t)'C')
#define YAML_NEWLINE     ((yaml_char_t)'N')
#define YAML_NULLCHAR    ((yaml_char_t)'Z')
#define YAML_ALIAS       ((yaml_char_t)'A')
#define YAML_ANCHOR      ((yaml_char_t)'R')
#define YAML_TRANSFER    ((yaml_char_t)'T')
/* formatting bytecodes */
#define YAML_COMMENT     ((yaml_char_t)'c')
#define YAML_INDENT      ((yaml_char_t)'i')
#define YAML_STYLE       ((yaml_char_t)'s')
/* other bytecodes */
#define YAML_LINE_NUMBER ((yaml_char_t)'#')
#define YAML_NOTICE      ((yaml_char_t)'!')
#define YAML_SPAN        ((yaml_char_t)',')
#define YAML_ALLOC       ((yaml_char_t)'@')

/* second level style bytecodes, ie "s>" */
#define YAML_FLOW            ((yaml_char_t)'>')
#define YAML_LITERAL         ((yaml_char_t)'|')
#define YAML_BLOCK           ((yaml_char_t)'b')
#define YAML_PLAIN           ((yaml_char_t)'p')
#define YAML_INLINE_MAPPING  ((yaml_char_t)'{')
#define YAML_INLINE_SEQUENCE ((yaml_char_t)'[')
#define YAML_SINGLE_QUOTED   ((yaml_char_t)39)
#define YAML_DOUBLE_QUOTED   ((yaml_char_t)'"')

typedef const yaml_char_t *yaml_buffer_t;  /* argument to a code */

typedef enum {
    YAML_OK       = 0,   /* proceed */
    YAML_E_MEMORY = 'M', /* could not allocate memory */
    YAML_E_READ   = 'R', /* input stream read error */
    YAML_E_WRITE  = 'W', /* output stream write error */
    YAML_E_OTHER  = '?', /* some other error condition */
    YAML_E_PARSE  = 'P'  /* parse error, check bytecodes */
} yaml_result_t;

/* producer pushes a null terminated buffer filled with one or more
 * bytecode events to the consumer; if the consumer's result is not
 * YAML_OK, then the producer should stop */
typedef void * yaml_consumer_t;
typedef
yaml_result_t
(*yaml_push_t)(
    yaml_consumer_t self,
    yaml_buffer_t   buff
);

/* consumer pulls bytecode events from the producer; in this case
 * the buffer is owned by the producer, and will remain valid till
 * the pull function is called once again; if the buffer pointer
 * is set to NULL, then there are no more results; it is important
 * to call the pull function till it returns NULL so that the
 * producer can clean up its memory allocations */
typedef void * yaml_producer_t;
typedef
yaml_result_t
(*yaml_pull_t)(
    yaml_producer_t self,
    yaml_buffer_t  *buff  /* to be filled in by the producer */
);

/* convert a pull interface to a push interface; the reverse process
 * requires threads and thus is language dependent;
 *
 * NOTE: this has a memory leak because it does not finish
 *       calling the producer when the consumer has a bad
 *       result.  Hmm.
 */
#define YAML_PULL2PUSH(pull,producer,push,consumer,result) \
    do {                                                   \
        yaml_pull_t _pull = (pull);                        \
        yaml_push_t _push = (push);                        \
        yaml_result_t _result = YAML_OK;                   \
        yaml_producer_t _producer = (producer);            \
        yaml_consumer_t _consumer = (consumer);            \
        while(1) {                                         \
            yaml_buffer_t buff = NULL;                     \
            _result = _pull(_producer, &buff);             \
            if (YAML_OK != _result || NULL == buff)        \
                break;                                     \
            _result = _push(_consumer, buff);              \
            if (YAML_OK != _result)                        \
                break;                                     \
        }                                                  \
        (result) = _result;                                \
    } while(0)

#endif
|
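The YAML_PULL2PUSH macro translates almost line for line into a higher-level language. A Python sketch follows; the `(result, buffer)` return shape and the helper names are my own assumptions, mirroring the C signatures where a NULL buffer means the producer is exhausted.

```python
YAML_OK = 0  # mirrors the yaml_result_t success value

def pull2push(pull, push):
    """Drive a pull-style producer into a push-style consumer,
    stopping on the first error or when the producer returns an
    exhausted (None) buffer, just like YAML_PULL2PUSH."""
    while True:
        result, buff = pull()
        if result != YAML_OK or buff is None:
            return result
        result = push(buff)
        if result != YAML_OK:
            return result

# usage sketch: a producer over canned buffers, and a consumer
# that simply collects what it is pushed
def make_pull(buffers):
    it = iter(buffers)
    def pull():
        return (YAML_OK, next(it, None))
    return pull

collected = []
def push(buff):
    collected.append(buff)
    return YAML_OK
```

Running `pull2push(make_pull(["D\nQ\n", "Sone\nE\n"]), push)` returns `YAML_OK` and leaves both buffers in `collected`, in order.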
From: Clark C. E. <cc...@cl...> - 2003-09-22 02:54:36
|
Sorry about being so prolific...  I'm pondering "C" APIs.  Given that
most of the API complexity is wrapped up in the actual bytecodes and
their definitions, sending bytecodes between components is quite
easy.  There are two methods for sending bytecodes: one instruction
at a time, or as a bytecode buffer, which is a chunk of the textual
format.  Each one of these methods has a push and a pull variant.
The method I outlined earlier was the buffer method:

| typedef enum {
|     YAML_OK       = 0,   /* proceed */
|     YAML_E_MEMORY = 'M', /* could not allocate memory */
..
|     YAML_E_PARSE  = 'P'  /* parse error, check bytecodes */
| } yaml_result_t;
|
| typedef const yaml_char_t *yaml_buffer_t;  /* argument to a code */
|
| /* producer pushes a null terminated buffer filled with one or more
|  * bytecode events to the consumer; if the consumer's result is not
|  * YAML_OK, then the producer should stop */
| typedef void * yaml_consumer_t;
| typedef
| yaml_result_t
| (*yaml_push_t)(
|     yaml_consumer_t self,
|     yaml_buffer_t buff
| );
|
| /* consumer pulls bytecode events from the producer; in this case
|  * the buffer is owned by the producer, and will remain valid till
|  * the pull function is called once again; if the buffer pointer
|  * is set to NULL, then there are no more results; it is important
|  * to call the pull function till it returns NULL so that the
|  * producer can clean up its memory allocations */
| typedef void * yaml_producer_t;
| typedef
| yaml_result_t
| (*yaml_pull_t)(
|     yaml_producer_t self,
|     yaml_buffer_t *buff  /* to be filled in by the producer */
| );

In the above case, the buffer contains bytecodes together with the
various scalar values, etc.  The other option is the one which Oren
was proposing to me over the phone (only that he did not use a
structure...  I think the structure is probably useful for building
processing chains rather than using args).

typedef struct yaml_instruction_s {
    yaml_char_t code;
    const yaml_char_t *start;   /* NULL unless bytecode has an argument */
    const yaml_char_t *finish;  /* length of argument is finish - start */
} *yaml_instruction_t;

typedef
yaml_result_t
(*yaml_pullinst_t)(
    yaml_producer_t self,
    yaml_instruction_t *inst
);

typedef
yaml_result_t
(*yaml_pushinst_t)(
    yaml_consumer_t self,
    yaml_instruction_t inst
);

Note it is easy to go from pushbuff_t to pushinst_t because the
intermediate converter need only keep a current instruction as its
data (and the previous producer in the chain); going the other way
probably requires variable length memory allocation.  Is there
another approach?  The goal is to keep the API "RISC" style, generic
across all instructions.  In the pushinst, the only deviation would
be that '<' WHOLE_SCALAR would not have the <<HERE stuff.

Best,

Clark
|
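As the email notes, going from the buffer form to the per-instruction form is the easy direction. A Python sketch of that converter (the function name is mine); it relies on the guarantee that scalar chunks never contain a raw newline, so a simple line split is sufficient:

```python
def split_buffer(buff):
    """Split a bytecode buffer into (code, argument) instructions.
    Each instruction is one line: a single code character followed
    by an optional argument, terminated by '\n'.  Safe because the
    N/Z escaping rules keep raw newlines out of scalar chunks."""
    return [(line[0], line[1:]) for line in buff.split('\n') if line]
```

For example, `split_buffer("D\nSone\nE\n")` yields `[("D", ""), ("S", "one"), ("E", "")]`.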
From: Brian I. <in...@tt...> - 2003-09-22 06:09:28
|
I was discussing the merits of a bytecode API with my friend Colin
Meyer, who is helping with the Perl and Python YAML implementations.

Colin was actually opposed to making a bytecode API our preferred
published API.  Why?  Because a SAX-like API would be more favorable
to the programming community at large, especially for a streaming
interface.

I have to admit that I agree with him.  It would be very nice to have
a "standard" SAX-like streaming API that was implemented across the
various languages.  The one I am currently coding to looks something
like this:

    start_stream()
    end_stream()
    start_document(directives)
    end_document()
    start_mapping(type_uri, anchor, style)
    end_mapping()
    start_sequence(type_uri, anchor, style)
    end_sequence()
    append_scalar(string, buffer_complete, type_uri, anchor, style)
    anchor_alias(anchor)

This is much simpler and probably much more generally useful as a
public streaming interface.  Of course, a defined bytecode is still
very useful.  And it doesn't seem all that difficult to parse a
bytecode stream and report it with an API like the one above.

Cheers, Brian

On 21/09/03 04:23 +0000, Clark C. Evans wrote:
> Howdy.  Well, this is it for this weekend; my play day is over.
> Anyway, here is what I think a "C" interface for YAML bytecodes
> could look like.  In short, the pull/push functions just take a
> void pointer and a character buffer.  All further 'API'
> specification is delegated to the bytecode definitions.  The
> buffer can contain one or more bytecodes and their corresponding
> data.  There is not much difference between push and pull, as
> seen by the macro PULL2PUSH.
>
> I pulled out Syck, created an ext directory called bytecode
> and started to hack there for a while (thus motivating the
> changes below).  However, Syck uses postorder tree traversal,
> while the bytecode specification really assumes preorder.
> I think this can be solved by converting each 'event' into
> a bytecode buffer which contains the instructions for
> that particular node (anchor, alias, etc.).  And then,
> when I encounter a collection, I build its buffer by
> concatenating.  While this won't be efficient, it should
> convert Syck event handling into bytecode buffers....
>
> After converting this, I was thinking of writing a native
> 'C' level emitter which used the bytecodes (including
> the format codes, if they are provided).  Of course, next
> weekend I'll focus on the spec...
>
> Best,
>
> Clark
>
> P.S. Thanks _Why for the wonderful parser.
|
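Driving a SAX-like handler from a bytecode stream is, as Brian suggests, not difficult. A Python sketch follows; the handler method names follow Brian's list, but the dispatch covers only a few core bytecodes (D, M, Q, E, S) and passes None for the attributes a real dispatcher would accumulate from 'T', 'A', and style bytecodes.

```python
class Dispatcher:
    """Translate (code, argument) bytecode instructions into
    SAX-like handler calls.  A stack of open collections decides
    whether 'E' means end_mapping() or end_sequence()."""
    def __init__(self, handler):
        self.handler = handler
        self.stack = []          # open collections: 'M' or 'Q'

    def feed(self, instructions):
        h = self.handler
        for code, arg in instructions:
            if code == 'D':
                h.start_document(arg)
            elif code == 'M':
                self.stack.append('M')
                h.start_mapping(None, None, None)
            elif code == 'Q':
                self.stack.append('Q')
                h.start_sequence(None, None, None)
            elif code == 'E':
                if self.stack.pop() == 'M':
                    h.end_mapping()
                else:
                    h.end_sequence()
            elif code == 'S':
                h.append_scalar(arg, True, None, None, None)
            # formatting and advanced bytecodes are simply ignored

# usage sketch: a handler that just records which methods were called
class Recorder:
    def __init__(self):
        self.calls = []
    def __getattr__(self, name):
        return lambda *args: self.calls.append(name)
```

Feeding `[("D", ""), ("M", ""), ("S", "a"), ("S", "b"), ("E", "")]` to a `Dispatcher(Recorder())` produces the call sequence start_document, start_mapping, append_scalar, append_scalar, end_mapping.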
From: Oren Ben-K. <or...@be...> - 2003-09-22 06:30:50
|
Brian Ingerson wrote:
> I was discussing the merits of a bytecode API with my friend
> Colin Meyer who is helping with the Perl and Python YAML
> implementations.
>
> Colin was actually opposed to making a bytecode API be our
> preferred published API.  Why?  Because a SAX-like API would be
> more favorable to the programming community at large.
> Especially for a streaming interface.

Two points:

- I'd like to reserve judgment on the details of the bytecodes API
  until I get to the point where I'm implementing it.

- I don't think there's a "preferred" API.  There's a "pull" API,
  which is something along the lines proposed by Clark, and there's
  a "push" API, which is something along the lines of your post.
  Both are necessary; neither is preferred.

As a side note, it is trivial to implement a module that takes the
pull API and emits the push API.  If this is the way that the push
(SAX) API is implemented, then the resulting system provides both
APIs "for free".  The other direction is harder (it requires threads
or co-routines).  This is the only reason the "pull" API is
emphasized at this point in time.  It is an internal implementation
issue, and it should be completely irrelevant to coders when deciding
which API to use.  Their choice of API should only be determined by
what's easier/better suited to their application.

FWIW, I think that given a decent pull API, people will quickly
discover that for a great many use cases it results in simpler code.
Maintaining state on a stack is so much easier than simulating one
yourself in the data members of a tree-walking object...  That said,
there are cases where using a push API is simpler.  It all depends on
the application.  I think YAML should be neutral here.

> ...  It would be very nice
> to have a "standard" SAX-like streaming API, that was
> implemented across the various languages.

+1, as long as it is clear that the two approaches complement, rather
than contradict, each other.

Have fun,

Oren Ben-Kiki
|
From: Clark C. E. <cc...@cl...> - 2003-09-22 14:27:58
|
On Sun, Sep 21, 2003 at 11:09:23PM -0700, Brian Ingerson wrote:
| I was discussing the merits of a bytecode API with my friend Colin
| Meyer who is helping with the Perl and Python YAML implementations.
| Colin was actually opposed to making a bytecode API be our
| preferred published API.  Why?  Because a SAX-like API would be
| more favorable to the programming community at large.  Especially
| for a streaming interface.

Well, I am not in disagreement -- the bytecode API breaks a very nice
(transfer, anchor, style, data) reporting unit into a bunch of parts,
and this puts an extra burden on the sender and receiver.  However,
for an API that keeps all of these pieces together, I think we can do
better than a SAX-like API; see below.

| start_stream()
| end_stream()
| start_document(directives)
| end_document()
| start_mapping(type_uri, anchor, style)
| end_mapping()
| start_sequence(type_uri, anchor, style)
| end_sequence()
| append_scalar(string, buffer_complete, type_uri, anchor, style)
| anchor_alias(anchor)

typedef struct yaml_event {
    enum {
        BEGIN_STREAM, BEGIN_DOCUMENT, END_DOCUMENT, PAUSE_STREAM,
        BEGIN_MAPPING, END_MAPPING, BEGIN_SEQUENCE, END_SEQUENCE,
        APPEND_SCALAR, ALIAS
    } code;
    char *anchor;    /* used by SCALAR, MAPPING, SEQUENCE, ALIAS */
    char *transfer;  /* used by SCALAR, MAPPING, SEQUENCE */
    char *style;     /* used by SCALAR, MAPPING, SEQUENCE */
    char *chunk;     /* used by SCALAR */
    bool continued;  /* used by SCALAR */
} * yaml_event_t;

typedef enum {
    YAML_OK, YAML_E_MEMORY, YAML_E_WRITE, YAML_E_READ
} yaml_error_t;

typedef void * yaml_producer_t;
typedef void * yaml_consumer_t;

typedef yaml_error_t (*yaml_push_t)(yaml_consumer_t self,
                                    yaml_event_t event);
typedef yaml_error_t (*yaml_pull_t)(yaml_producer_t self,
                                    yaml_event_t *event);

/*
 * 1. In both cases, the producer owns the event's memory and buffer.
 * 2. For the push case, the buffer is only valid for the length
 *    of the event; it should be copied if you desire to keep it
 *    beyond this.
 * 3. For the pull case, the buffer is valid till the next call
 *    to the producer.
 * 4. In either case, if the buffer is NULL then the stream has
 *    finished.
 * 5. In the push case, if the consumer returns a value which is
 *    not YAML_OK, then the producer should abort and not continue
 *    to push events.
 * 6. For the pull case, if the producer returns a value which
 *    is not YAML_OK, then the consumer should not continue to
 *    pull events.
 * 7. The actual event structure could be variable length, ie,
 *    no reason why an ALIAS has to have space for the transfer,
 *    etc.  Although it probably does not hurt.
 */

This sort of API, while keeping the "chunking" property of SAX, has
several advantages over a SAX-like API:

 1. The difference between push and pull is very small, so that
    people can become quickly familiar with both and use the one
    that best fits their needs.

 2. There are only two APIs and one data structure to "wrap" when
    linking this with a higher level language; further events can
    be added by the library (new event codes) without requiring
    additional functions to be wrapped.

 3. One can add a few more "fields" to the event_t structure to get
    a DOM-like structure; thus this (sequential) API becomes a
    subset of a more general random access API.

 4. If you add a 'length' attribute to the data structure, you have
    a "binary" stream.

Anyway, this is very similar to my original proposal, before I tried
to make the whole thing more "character" friendly.

| Of course, a defined bytecode is still very useful.  And it doesn't
| seem all that difficult to parse a bytecode stream and report it
| with an API like the one above.

Yes.  I think that they satisfy two entirely different constraints.
However, I would rather have an "event structure" based API instead
of a bunch of SAX-like events.  Does this make sense to you?  Do you
agree?  Or what does SAX-like event handling give you that the above
doesn't?

Thanks,

Clark
|
From: Clark C. E. <cc...@cl...> - 2003-09-27 00:03:06
|
#
# My notes and thoughts regarding an IRC chat and phone
# followup with Brian; note this email sets direction, so
# I could use some feedback to make sure that I got our
# overall agreements correct. Also, some of these items
# were done in a Brian/Clark phone chat, so Oren did not
# have any feedback on them. Lastly, I added tasks for
# Why... namely adding a 'begin' notification for each
# branch to his API so that a preorder tree traversal
# can be done if possible. In any case, this is probably
# not totally correct, as I have a very selective memory
# (especially when pushing my agenda)...
#
# I'll be putting out another API proposal in the next
# day or so, so I'd rather have feedback on that thread
# if your comments are directed towards the API.
#
---
time: 2003-09-26 18:00:00Z # (2PM EST, 11AM PST, 11PM IST)
subject: Specification and APIs
where: IRC
who: Oren, Clark, Brian
topics:
- name: YAML Conference
  desc: >
    Brian suggested that we have a YAML conference in February,
    and he is talking to coordinators. By this conference we
    would like (a) a formal, done-deal spec, and (b) parsers,
    APIs, and other tools. Clark is now putting 1-2d of work
    per week into YAML till further notice (until things get
    tight).
- name: specification
  desc: >
    We discussed further work to do on the specification.
    Everyone agreed that we want a Release Candidate very soon
    (by the end of this weekend would be ideal). Clark
    expressed that he has a problem with the model section --
    echoing discomfort Brian was expressing several months
    ago. In particular, Clark is not happy with the merger of
    two important but orthogonal issues: (a) the syntax,
    serial, and graph models; and (b) the generic vs. native
    binding of those models. In particular, he would like to
    separate out the "native" talk into a separate paragraph
    (under a different numbered title).
    Clark plans to keep the diagram which shows:

      syntax -> (parser)  -> serial -> (loader) -> graph
      syntax <- (emitter) <- serial <- (dumper) <- graph

    and also add a diagram which is a "stack" of attributes
    added at each level: when you move from the graph to the
    serial model, key ordering and aliases are added; and
    moving from the serial to the syntax model adds style,
    etc. The generic vs. native binding section will talk
    primarily about the type family and hopefully will be
    quite brief. Part of the "insight" gained recently is that
    the syntax, serial, and graph models may all have native
    bindings of one sort or another. Clark expects this to be
    a 4-6h update and hopes to finish it this weekend. Brian
    and Oren both agreed that we need a quick run-through of
    the current spec before it goes out as a release
    candidate. Once it is at release candidate, the only
    changes to the spec will be bug fixes, spelling mistakes,
    and clarifications.
  task:
  - Clark to make the above discussed changes (4-6h)
  - Oren/Brian to review changes before we go to RC.
  - Clark to update spec for release candidate status.
- name: parsers
  desc: >
    Oren and Brian talked quite a bit about parsers, in
    particular the idea that both of them are building new
    parsers. Our current "recommended" parser is Why's Syck.
    The biggest part of this conversation was resolving the
    concern that there may be overlap between the projects;
    and while this is open source, it would be best to
    coordinate our efforts. As it turns out, Brian is
    attempting to build a regex-based parser, with state
    tables and regular expressions. His ideal parser is a tiny
    bit of code for each language, plus a large state table in
    YAML (which can be shipped with a parser as a native
    language structure). In this way Brian's parser is 'pure'
    within the given language (assuming the language has a
    PCRE regular expression engine).
    Oren, on the other hand, is developing a native "C"
    parser, but as part of his project he is developing a pull
    version of lex/yacc. Therefore, Oren's parser may take
    some time to develop. Luckily, Why has a working push
    parser that satisfies most constraints, with Ruby and
    Python bindings. Brian and I later decided in a phone chat
    afterward to add a Perl binding to Syck to cover the
    'short-term' needs of our constituency. We still lack
    emitters...
  task:
  - Oren to continue to work on his pull-based parser
  - Brian to continue on his regex parser
  - Brian and Clark to work on a Perl port of Syck this
    weekend, and perhaps the following weekend
- name: APIs
  desc: >
    We have a resource allocation problem: namely, we have
    several YAML-related products but no way to merge them.
    For example, we already have 2 Python loaders which do not
    share the same code base. We also need to foster a
    situation where others can join in and build tools that
    work with YAML. So, we need an API. In this way Clark and
    others can work on emitters and validators, while Brian
    and Oren continue on parsers. Also, with an API, we can
    separate the task of building parsers from that of
    building loaders, so that language bindings can work with
    many different parsers. It is clear that this API should
    be "C" for the greatest amount of interoperability. There
    are actually three APIs which could emerge: (a) a
    syntax-level API for the lexer/low-level parser, (b) a
    serial API for communication between the parser, loader,
    and emitter, and (c) a graph API for writing YPath and
    other random-access tools. For now, (c) is a very low
    priority and (a) can be specified by Oren with his parser.
    The key API to specify is the serial API. When developing
    the serial "C" API there are two general types of APIs:
    push and pull.
    They can be unified by having a single "event" data
    structure: for the pull (iterator) API, the next_event()
    function returns a pointer to this event structure; for
    the push (notify) API, the event_handler() function takes
    a pointer to this event structure. In this way, a
    pull2push converter is:

      while (1) { consumer.event_handler(producer.next_event()); }

    For the serial "C" API, this event structure would contain
    just about everything in a single event, thus:

      struct event {
          enum { BEGIN_MAP, SCALAR, ... } code;
          const char *value;   /* used for scalars */
          const char *anchor;
          const char *transfer;
      };

    This is in contrast to the API for the syntax model, which
    would "pivot" anchor and transfer into separate events.
    With some careful design we could have a unified model
    that goes both ways (using a "C" union construct). When
    talking with Oren, we concluded that in addition to the
    next() function, the pull API requires a close() function
    so that the consumer can let the producer "clean up" its
    memory allocations. In the same way, the push API could
    either have a stop() function (or it could use the return
    value of the event_handler). In both of these interfaces,
    there can be an event type for parsing errors; thus a
    separate function is not required. In later musings
    (following the producer/consumer model in Twisted) the
    push interface could have an additional resume() function,
    and stop() would be split into pause() and close(), where
    close() stops and cleans up memory and pause() only stops.
    So, with a serial API, we can move forward with building
    separate tools. A good migration plan is to come up with
    some preliminary agreement on the API (Clark's chore) and
    then modify Syck to use this API. Then, Clark could write
    a Python binding to the API and Brian could write a Perl
    binding, etc. Also, at this point, a separate emitter and
    a yamltools.lib could be made which provides pull2push and
    push2pull (using pthreads) converters.
  task:
  - Clark to attempt another round of API specifications, now
    with feedback from Brian and Oren.
  - Why to help Clark bolt the API onto Syck
  - Clark to convert Syck's Python binding to use the API
  - Clark/Brian to convert the soon-to-be-done Perl binding
    to Syck
  - Oren to provide feedback for what is necessary to
    extend/limit the API for a serial model API
|