Menu

State Hierarchy PROTOcol Language

mikerez

Briefly

The SHProto language describes a State Machine for parsing protocol or data format. The State Machine is based on State Hierarchy. A description is used by SHProto Compiler to produce code for State Machine and use it to parse Input Char Stream. SHProto supports top-down parsing almost without looking ahead.

See also [FAQ].

Main benefits

  • State Machine event-driven runtime.
  • Easy and fast to write new protocol/data parser.
  • Easy and fast to enhance or correct existing parser.
  • Possibility to generate FSM in RAM or code-based FSM!

Main conditions

  • Input data flow is chars (=1 byte=8 bit) based stream (input char stream).
  • Input stream can be limited or unlimited.
  • Easily parses text protocols like HTTP, SMTP, POP3, IMAP, FTP, SNTP, SIP etc.
  • Compiler produces \<proto_name>.cc and \<proto_name>.h .</proto_name></proto_name>

Language basics

  • The description consists of Tokens divided by space symbols (" ","\t","\n","\r").
  • Each token is a Node of state hierarchy which is used to build result state machine.
  • The branches are described on different Lines.
  • Each line has own Hierarchy Level which is ruled by its indent.
  • Text from % till end of line is a comment.
  • Text from # till end of line is C++ macros.

Language tokens

Token Description
"dse\x18\n" fixed char sequence node
'asf\x21\t' fixed case unsensitive char sequence node
[\^jhg,a-z,\t\x18\\\-] alternative states node (just inlines several branches in one line), inherits syntax from regexp, (real symbol '-' should be always back-quoted for now)
var:type(parameters) Variable node (handles any apropriate symbols for variable type) - will be saved in variable in parser's state; please see [SHProto variable types]
c++ code node handles any symbol and passes it through to the next node; returning false will stop switching to this state and parser will search for other case if exists; the following local variables are available in C++ expression: char chr (current input stream char), shproto_state * state (contains ALL variables declared in the tree)
@ALIAS aliasing any branching point in hierarchy (NOT A NODE)
<alias></alias> call to any aliased branching point (NOT A NODE)
$ call to the hierarchy top (NOT A NODE)
$$ exit from the external branch to the place it was called from (NOT A NODE)
\ at the end of branch logically disarms indent of following sub-branches
default([len]): magic token in the beginning of the last sub-branch - catches all unexpected symbols on it's indent level; if len is passed nonzero parser will cache all symbols from branching point and will process them all with default branch in a case of exception

Modificators at the end of node tokens (except variable, c++ and default nodes), only one modificator is allowed after each token:

Modificator Description
* node may be repeated zero or many times
+ node may be repeated at least one time
? node will be optionally present

Branching

  1. The branching begins from creating new line and increasing indent.
  2. You can describe several branches on the same indent and they all will be cheked when state machine will reach branching point.
  3. Any branch may have sub-branches.
  4. By decreasing indent level you mean there are no more sub-branches linked to the previous branch.
  5. Being once entered the branch changes the parser's state and no state rollback will be possible (except handling by default token).
  6. Parser will try appropriate branches from top to bottom, depending on expected symbol or behavior of the first branch's C++ node.
  7. The top branches are with zero indent and at least one such branch should present in the beginning of the file (first non empty line).
  8. The line cant have indent less than previous if such indent was not present earlier.
  9. The external branch can be described standalone after the main tree and empty line. This branch should start from \@ALIAS and should be called so.
  10. C++ code node matches any input symbol and will be entered if will not return false.
  11. Parser must use char-based indexes embedded in nodes to seek for appropriate node.
  12. The default node should get all input symbols that was passed to sub-branch before unexpected symbol was found (adding default will slow-down the selection).
  13. The '\' token will logically place branches to the parent's level.

This example describes fast index-based parsing of some HTTP headers:

'Content-' \
    'Length:' [ \t]* content_len:u32 "\r\n"
    'Type:' [ \t]* content_type:string(64) "\r\n"
    'Encoding:' [ \t]* content_encoding:string(128) "\r\n"
[\t ] header_continuation:string(128) "\r\n"
default: header_name:substring(32,":") ":" [ \t]* header:string(128) "\r\n"


Aliasing

  1. Any branching point may be aliased by adding \<\@alias name> to the end of line before branches begins.
  2. Alias may be used to declare external branch by adding empty line and \<\@alias name> after main tree or previous external branch. The sub-branches of external branch (alias) should go with increased indent.
  3. External branch may be called from main tree or other external branch and after $$ token state machine will be returned to the place the call was done from.
  4. All other cases of calling aliases (including calling branch from itselve, calling main tree from branch) just do jumping and the calling place will not be saved.
  5. The token $ in external branch will return state machine to the main tree root.

Here is a good example for reading HTTP chunks with SHProto:

# parse HTTP body
... "\r\n" <@DECHUNK>
    chunklen:h32
        { if( state->chunklen != 0 ) return false; } { printf("body finished\n"); } $
        chunk:data(chunklen) <DECHUNK>


Variables

  1. All variables used in description are automaticaly declared and visible from every branch.
  2. Description can use one variable many times.
  3. Each variable name must be used with identical type in different places.
  4. Each variable is stored in class \<proto_name>_state</proto_name> and accessible from c++ blocks thrue pointer state.
  5. Standalone local variable could be declared after empty line after any branch.
  6. All variables are initialized zeros when starting.


Getting result

  1. Class \<proto_name>_state</proto_name> will contain all described variables.
  2. Class \<proto_name>_output</proto_name> will get all calls with extracted data - data variables will be passed to methods with name bool put_\<varname>_data( const char * data, unsigned length )</varname>; you should provide this class.
  3. Class \<proto_name>_output</proto_name> has pointer to \<proto_name>_state</proto_name> inside which is accesible thrue method set_state(const shproto_state * state).
  4. C++ code nodes have access to state thrue local variable \<proto_name>_state</proto_name> * state.
  5. \<proto_name></proto_name> is gotten from description file name by removing .shp from the end
  6. Class \<proto_name>_state</proto_name> is inherited from base empty class shproto_state.
  7. The void eof() method of the parser should be called at the end of input stream.

More examples

State Hierarchy


Related

Home: SQL
Wiki: BINARY1
Wiki: FAQ
Wiki: HTTP
Wiki: MIME
Wiki: POP3
Wiki: SHProto variable types
Wiki: SMTP