Default Project Icon

shproto

Home Code Downloads Wiki Tickets Discussion

Home

mikerez

Briefly

The SHProto language describes the State Machine for parsing protocol or data format. This state machine is based on State Hierarchy (aka State Tree) but some loops are allowed. The description is used by SHProto Compiler to produce the Main Parser Function which will realize state machine and use it to parse Input Char Stream. SHProto supports only to-down parsing almost without looking ahead.
See also [FAQ].

Main benefits

  • State Machine event-driven runtime!
  • Easy and fast to write new protocol/data parser (if you have experience).
  • Easy and fast to enhance or correct existing parser (even if you are newbie).
  • The FSM realized in RAM should be VERY fast!
  • FSM could process input char by char (saves states between chars).

Main conditions

  • The input data flow is char (=1 byte=8 bit) based stream (input char stream).
  • The input stream could be limited or unlimited.
  • At any moment previous data and knowledges say how to parse next data.
  • Suppose you are waiting for some data or events to be extracted as a result.
  • Easily parses text protocols like HTTP, SMTP, POP3, IMAP, FTP, SNTP, SIP but binary capability should be done too.
  • Compiler will produce <proto_name>.cc and <proto_name>.h (and may be other source files) containing parser class with state machine building method bool build_tree(shproto_node), main parser's method bool parse_input(char * input, unsigned len), <proto_name>_state class, <proto_name>_output* class and base classes.

Very simple example (description vs state machine)

Simple example image

Will match any input lines like
"Give me a pie.\n"
"Give me bigmac.\n"
"Give me an icecream.\n"
"Give me an icecream with chocolate.\n"
"Give me an icecream with chocolate and banana.\n"
and will call corresponding C++ function.

Language basics

  • The description consists of Tokens divided with space symbols (" ","\t","\n","\r").
  • Each token is a Node of state hierarchy which is used to build result state machine.
  • The branches are described on different Lines.
  • Each line has own Hierarchy Level which is ruled by its indent.
  • Text from # till end of line is a comment.

Language tokens

Token Description
"dse\x18\n" fixed char sequence node
'asf\x21\t' fixed case unsensitive char sequence node
[^jhg,a-z,\t\x18\\\-] alternative states node (just inlines several branches in one line), inherits syntax from regexp, (real symbol '-' should be always back-quoted for now)
var:type(parameters) Variable node (handles any apropriate symbols for variable type) - will be saved in variable in parser's state; please see [SHProto variable types]
{ printf("text"); return false; } c++ code node handles any symbol and passes it through to the next node; returning false will stop switching to this state and parser will search for other case if exists; the following local variables are available in C++ expression: char chr (current input stream char), shproto_state * state (contains ALL variables declared in the tree)
<@ALIAS> aliasing any branching point in hierarchy, always should be at the end of line, or standalone (NOT A NODE)
<ALIAS> call to any aliased branching point (NOT A NODE)
$ call to the hierarchy top (NOT A NODE)
$$ exit from the external branch to the place it was called from (NOT A NODE)
\ at the end of branch logically disarms indent of following sub-branches
default([len]): magic token in the beginning of the last sub-branch - catches all unexpected symbols on it's indent level; if len is passed nonzero parser will cache all symbols from branching point and will process them all with default branch in a case of exception

Modificators at the end of node tokens (except variable, c++ and default nodes), only one modificator is allowed after each token:

Modificator Description
* node could be repeated zero or many times
+ node could be repeated at least one time
? node will be optionally present

Branching

  1. The branching begins from creating new line and increasing indent.
  2. You can describe several branches on the same indent and they all will be cheked when state machine will reach branching point.
  3. Any branch could have sub-branches.
  4. By decreasing indent level you mean there are no more sub-branches linked to the previous branch.
  5. Being once entered the branch changes the parser's state and no state rollback will be possible (except handling by default token).
  6. Parser will try appropriate branches from top to bottom, depending on expected symbol or behavior of the first branch's C++ node.
  7. The top branches are with zero indent and at least one such branch should present in the beginning of the file (first non empty line).
  8. The line cant have indent less than previous if such indent was not present earlier.
  9. The external branch could be described standalone after the main tree and empty line. This branch should start from <@ALIAS> and should be called so.
  10. C++ code node will match any input symbol and will be entered if will not return false.
  11. Parser must use char-based indexes embedded in nodes to seek for appropriate node.
  12. The default node should get all input symbols that was passed to sub-branch before unexpected symbol was found (adding default will slow-down the selection).
  13. The '\' token will logically place branches to the parent's level.

For example lets describe fast index-based parsing of some HTTP headers:

'Content-' \
    'Length:' [ \t]* content_len:u32 "\r\n"
    'Type:' [ \t]* content_type:string(64) "\r\n"
    'Encoding:' [ \t]* content_encoding:string(128) "\r\n"
[\t ] header_continuation:string(128) "\r\n"
default: header_name:substring(32,":") ":" [ \t]* header:string(128) "\r\n"

Aliasing

  1. Any branching point could be aliased by adding <@alias name> to the end of line before branches begins.
  2. Alias could be used to declare external branch by adding empty line and <@alias name> after main tree or previous external branch. The sub-branches of external branch (alias) should go with increased indent.
  3. External branch could be called from main tree or other external branch and after $$ token state machine will be returned to the place the call was done from.
  4. All other cases of calling aliases (including calling branch from itselve, calling main tree from branch) just do jumping and the calling place will not be saved.
  5. The token $ in external branch will return state machine to the main tree root.

Here is a good example for reading HTTP chunks with SHProto:

# parse HTTP body
... "\r\n" <@DECHUNK>
    chunklen:h32
        { if( state->chunklen != 0 ) return false; } { printf("body finished\n"); } $
        chunk:data(chunklen) <DECHUNK>

Variables

  1. All variables used in description are automaticaly declared and visible from every branch.
  2. Description can use one variable many times.
  3. Each variable name must be used with identical type in different places.
  4. Each variable is stored in class <proto_name>_state and accessible from c++ blocks thrue pointer state.
  5. Standalone local variable could be declared after empty line after any branch.
  6. All variables are initialized zeros when starting.

Getting result

  1. Class <proto_name>_state will contain all described variables.
  2. Class <proto_name>_output will get all calls with extracted data - data variables will be passed to methods with name bool put_<varname>_data( const char * data, unsigned length ); you should provide this class.
  3. Class <proto_name>_output has pointer to <proto_name>_state inside which is accesible thrue method set_state(const shproto_state * state).
  4. C++ code nodes have access to state thrue local variable <proto_name>_state * state.
  5. <Proto_name> is gotten from description file name by removing .shp from the end
  6. Class <proto_name>_state is inherited from base empty class shproto_state.
  7. The void eof() method of the parser should be called at the end of input stream.

More examples

mikerez