Briefly
The SHProto language describes a State Machine for parsing a protocol or data format. The State Machine is built from a State Hierarchy. The SHProto Compiler turns a description into State Machine code, which is then used to parse the Input Char Stream. SHProto supports top-down parsing with almost no lookahead.
See also [FAQ].
Main benefits
- Event-driven State Machine runtime.
- New protocol or data parsers are quick and easy to write.
- Existing parsers are quick and easy to extend or correct.
- The FSM can be generated either in RAM or as code.
Main conditions
- The input is a stream of chars (1 char = 1 byte = 8 bits), called the input char stream.
- The input stream can be limited or unlimited.
- Easily parses text protocols like HTTP, SMTP, POP3, IMAP, FTP, SNTP, SIP etc.
- The compiler produces \<proto_name>.cc and \<proto_name>.h.
Language basics
- The description consists of Tokens separated by whitespace characters (" ", "\t", "\n", "\r").
- Each token is a Node of the state hierarchy, which is used to build the resulting state machine.
- Branches are described on separate Lines.
- Each line has its own Hierarchy Level, determined by its indent.
- Text from % to the end of the line is a comment.
- Text from # to the end of the line is a C++ macro.
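
The indent rule can be pictured with a small sketch. It is not taken from the SHProto Compiler: the names `indent_of` and `hierarchy_levels` are hypothetical, and counting every space or tab as one indent column is an assumption. It maps each non-blank line to a hierarchy level and rejects a line that drops back to an indent no enclosing line used.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: counts leading spaces and tabs.
// Assumption: every space or tab is one indent column.
std::size_t indent_of(const std::string& line) {
    std::size_t n = 0;
    while (n < line.size() && (line[n] == ' ' || line[n] == '\t')) ++n;
    return n;
}

// Fills `levels` with one hierarchy level per non-blank line (0 = top branch).
// Returns false when a line falls back to an indent that no enclosing line used.
bool hierarchy_levels(const std::vector<std::string>& lines,
                      std::vector<std::size_t>& levels) {
    std::vector<std::size_t> open;  // indents of the currently open branches
    levels.clear();
    for (const std::string& line : lines) {
        const std::size_t indent = indent_of(line);
        if (indent == line.size()) continue;  // blank line: no token, no level
        bool closed_any = false;
        while (!open.empty() && open.back() > indent) {
            open.pop_back();                  // that branch has no more sub-branches
            closed_any = true;
        }
        if (open.empty() || open.back() < indent) {
            if (closed_any) return false;     // this indent was never used before
            open.push_back(indent);           // a deeper indent opens a new level
        }
        levels.push_back(open.size() - 1);
    }
    return true;
}
```

Calling `hierarchy_levels` on lines indented by 0, 2, 4, 2 and 0 columns yields the levels 0, 1, 2, 1, 0, while the indents 0, 4, 2 are rejected because the indent 2 was not used before.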
Language tokens
| Token | Description |
| --- | --- |
| "dse\x18\n" | fixed char sequence node |
| 'asf\x21\t' | fixed case-insensitive char sequence node |
| [\^jhg,a-z,\t\x18\\\-] | alternative states node (inlines several branches in one line); inherits its syntax from regexp (for now, a literal '-' must always be escaped with a backslash) |
| var:type(parameters) | variable node (handles any symbols appropriate for the variable type); the value is saved in a variable in the parser's state; see [SHProto variable types] |
| { C++ code } | C++ code node; handles any symbol and passes it through to the next node; returning false stops the switch to this state, and the parser searches for another case if one exists; the following local variables are available in the C++ expression: char chr (current input stream char), shproto_state * state (contains ALL variables declared in the tree) |
| @ALIAS | aliases any branching point in the hierarchy (NOT A NODE) |
| \<ALIAS> | call to any aliased branching point (NOT A NODE) |
| $ | call to the hierarchy top (NOT A NODE) |
| $$ | exit from the external branch to the place it was called from (NOT A NODE) |
| \ | at the end of a branch, logically cancels the indent of the following sub-branches |
| default([len]): | magic token at the beginning of the last sub-branch; catches all unexpected symbols at its indent level; if a nonzero len is passed, the parser caches all symbols from the branching point and, in case of an exception, processes them all with the default branch |
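
As an illustration of how a variable node can consume "any appropriate symbols" one char at a time, here is a minimal sketch for a variable such as chunklen:h32. The assumptions are loud ones: `Hex32Node` is a hypothetical name, h32 is taken to mean a 32-bit hexadecimal integer (inferred from its name and its use for HTTP chunk lengths), and overflow is not handled. The real generated code may look quite different.

```cpp
#include <cstdint>

// Hypothetical char-by-char accumulator for a variable like chunklen:h32.
// Assumption: h32 holds a 32-bit value written in hexadecimal.
struct Hex32Node {
    std::uint32_t value = 0;  // variables start at zero

    // Returns true if `chr` is a hex digit and was consumed.
    // Returns false otherwise, leaving `value` untouched so that the
    // parser can hand the char to the next node.
    bool feed(char chr) {
        unsigned digit;
        if (chr >= '0' && chr <= '9')      digit = chr - '0';
        else if (chr >= 'a' && chr <= 'f') digit = chr - 'a' + 10;
        else if (chr >= 'A' && chr <= 'F') digit = chr - 'A' + 10;
        else return false;
        value = (value << 4) | digit;      // append one hex digit
        return true;
    }
};
```

Feeding the chars '1', 'a', 'F' leaves `value` at 0x1AF; a following '\r' is refused and would fall through to the next node in the branch.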
Modifiers at the end of node tokens (except variable, C++ and default nodes); only one modifier is allowed after each token:

| Modifier | Description |
| --- | --- |
| \* | node may be repeated zero or more times |
| + | node must appear one or more times |
| ? | node is optional |
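
The repetition counts behind the three modifiers can be sketched as follows. This is a simplified model, not the SHProto runtime: it works on a whole buffer rather than char by char, it only covers a single-char node such as [ \t], and the names `Repeat` and `match_repeated` are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical model of "no modifier" plus the three modifiers *, + and ?.
enum class Repeat { Once, ZeroOrMore, OneOrMore, Optional };

// Consumes chars accepted by `accepts` starting at `pos`.
// Returns the position after the match, or std::string::npos when the
// minimum repetition count is not reached.
template <typename Pred>
std::size_t match_repeated(const std::string& in, std::size_t pos,
                           Pred accepts, Repeat r) {
    const bool unbounded = (r == Repeat::ZeroOrMore || r == Repeat::OneOrMore);
    const std::size_t min_count =
        (r == Repeat::Once || r == Repeat::OneOrMore) ? 1 : 0;
    std::size_t count = 0;
    while (pos + count < in.size() && accepts(in[pos + count]) &&
           (unbounded || count < 1)) {
        ++count;  // bounded forms stop after a single match
    }
    return count >= min_count ? pos + count : std::string::npos;
}
```

With a predicate that accepts space and tab, `Repeat::ZeroOrMore` models the `[ \t]*` token used in the HTTP header example: it succeeds even when no whitespace follows the colon.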
Branching
- Branching begins by starting a new line and increasing the indent.
- Several branches can be described at the same indent; all of them are checked when the state machine reaches the branching point.
- Any branch may have sub-branches.
- Decreasing the indent level means there are no more sub-branches linked to the previous branch.
- Once a branch is entered, the parser's state changes and no rollback is possible (except for handling by the default token).
- The parser tries the appropriate branches from top to bottom, depending on the expected symbol or on the behavior of the branch's first C++ node.
- Top branches have zero indent, and at least one such branch must be present at the beginning of the file (the first non-empty line).
- A line cannot have a smaller indent than the previous line unless that indent was already used earlier.
- An external branch can be described standalone after the main tree and an empty line. It must start with \@ALIAS and is called by that name.
- A C++ code node matches any input symbol and is entered unless it returns false.
- The parser must use char-based indexes embedded in the nodes to find the appropriate node.
- The default node receives all input symbols that were passed to the sub-branch before the unexpected symbol was found (adding default slows down branch selection).
- The '\' token logically places branches at the parent's level.
This example describes fast index-based parsing of some HTTP headers:
    'Content-' \
        'Length:' [ \t]* content_len:u32 "\r\n"
        'Type:' [ \t]* content_type:string(64) "\r\n"
        'Encoding:' [ \t]* content_encoding:string(128) "\r\n"
    [\t ] header_continuation:string(128) "\r\n"
    default: header_name:substring(32,":") ":" [ \t]* header:string(128) "\r\n"
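
One way to picture index-based branch selection is a table with one slot per possible input char, so that picking a branch costs a single lookup. The sketch below is an assumption about how such an index could look; the layout of the tables that the SHProto Compiler really emits is not described here, and `BranchPoint`, `kNoBranch` and the branch numbers are hypothetical.

```cpp
#include <array>
#include <cctype>

constexpr int kNoBranch = -1;  // hypothetical marker for "no branch starts here"

// Hypothetical branching point: one table slot per possible input char.
struct BranchPoint {
    std::array<int, 256> by_first_char;
    int default_branch = kNoBranch;  // branch that starts with default:, if any

    BranchPoint() { by_first_char.fill(kNoBranch); }

    // Registers a branch whose first char is matched case-sensitively ("...").
    void add(char first, int branch) {
        by_first_char[static_cast<unsigned char>(first)] = branch;
    }

    // Registers a branch whose first char is matched case-insensitively ('...').
    void add_nocase(char first, int branch) {
        const unsigned char u = static_cast<unsigned char>(first);
        by_first_char[std::tolower(u)] = branch;
        by_first_char[std::toupper(u)] = branch;
    }

    // Picks the branch for the current input char, falling back to default.
    int select(char chr) const {
        const int b = by_first_char[static_cast<unsigned char>(chr)];
        return b != kNoBranch ? b : default_branch;
    }
};
```

For the three sub-branches after 'Content-', registering 'L', 'T' and 'E' with `add_nocase` means an input char 'l' or 'L' selects the 'Length:' branch directly, and any other char goes to the default branch when one is set.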
Aliasing
- Any branching point may be aliased by adding \<\@alias name> to the end of the line before its branches begin.
- An alias may also be used to declare an external branch: add an empty line and \<\@alias name> after the main tree or the previous external branch. The sub-branches of the external branch (alias) must have an increased indent.
- An external branch may be called from the main tree or from another external branch; after the $$ token, the state machine returns to the place the call was made from.
- All other alias calls (including a branch calling itself, or a branch calling the main tree) are plain jumps, and the calling place is not saved.
- The $ token in an external branch returns the state machine to the main tree root.
Here is a good example for reading HTTP chunks with SHProto:
    % parse HTTP body
    ... "\r\n" <@DECHUNK>
        chunklen:h32
            { if( state->chunklen != 0 ) return false; } { printf("body finished\n"); } $
            chunk:data(chunklen) <DECHUNK>
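
The difference between a saved call and a plain jump can be modelled with a return stack. The sketch below is only an illustration: `AliasControl`, the integer ids for branching points and `kRoot` are hypothetical, and two behaviors are assumptions that the rules above do not spell out, namely that `$$` with nothing saved goes to the root and that `$` drops all pending returns.

```cpp
#include <cstddef>
#include <vector>

constexpr int kRoot = 0;  // hypothetical id of the main tree root

// Hypothetical model of alias calls; branching points are plain int ids.
class AliasControl {
public:
    // <ALIAS> call. `saves_return` is true only when an external branch is
    // called from the main tree or from another external branch; every
    // other call is a plain jump.
    int call(int target, int resume_at, bool saves_return) {
        if (saves_return) return_stack_.push_back(resume_at);
        return target;
    }

    // $$ : resume at the place the last saved call was made from.
    // Assumption: with nothing saved, go to the root.
    int exit_branch() {
        if (return_stack_.empty()) return kRoot;
        const int resume_at = return_stack_.back();
        return_stack_.pop_back();
        return resume_at;
    }

    // $ : back to the hierarchy top.
    // Assumption: pending return places are dropped.
    int to_top() {
        return_stack_.clear();
        return kRoot;
    }

    std::size_t depth() const { return return_stack_.size(); }

private:
    std::vector<int> return_stack_;
};
```

In the chunk example, the `<DECHUNK>` at the end of the last line is a branch calling itself, so it would be modelled as `call(target, resume_at, false)`: the stack does not grow, however many chunks arrive.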
Variables
- All variables used in the description are declared automatically and are visible from every branch.
- A description can use one variable many times.
- Each variable name must have the same type everywhere it is used.
- Each variable is stored in class \<proto_name>_state and is accessible from C++ blocks through the pointer state.
- A standalone local variable can be declared after an empty line following any branch.
- All variables are initialized to zero at start.
Getting result
- Class \<proto_name>_state contains all described variables.
- Class \<proto_name>_output receives all calls with extracted data: data variables are passed to methods named bool put_\<varname>_data( const char * data, unsigned length ); you must provide this class.
- Class \<proto_name>_output holds a pointer to \<proto_name>_state, which is accessible through the method set_state(const shproto_state * state).
- C++ code nodes have access to the state through the local variable \<proto_name>_state * state.
- \<proto_name> is derived from the description file name by removing .shp from the end.
- Class \<proto_name>_state inherits from the empty base class shproto_state.
- The parser's void eof() method should be called at the end of the input stream.
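
A user-provided output class could look roughly like the sketch below, written for a hypothetical description file http.shp with the variables content_len:u32 and content_type:string(64). Since the generated http.h is not shown here, `shproto_state` and `http_state` are stand-ins defined by hand. Two further assumptions: data may arrive in several pieces, so the method appends, and the meaning of the bool result is not stated above, so the sketch always returns true.

```cpp
#include <string>

// Stand-ins for what the generated http.h is assumed to provide.
class shproto_state {};                       // empty base class
class http_state : public shproto_state {
public:
    unsigned content_len = 0;                 // from content_len:u32, starts at zero
};

// User-provided output class for the hypothetical http.shp.
class http_output {
public:
    // Receives the pointer to the parser's state.
    void set_state(const shproto_state* state) {
        state_ = static_cast<const http_state*>(state);
    }

    // Called with extracted data of content_type:string(64).
    // Assumption: data may come in pieces, so it is appended.
    bool put_content_type_data(const char* data, unsigned length) {
        content_type_.append(data, length);
        return true;
    }

    const std::string& content_type() const { return content_type_; }
    const http_state* state() const { return state_; }

private:
    const http_state* state_ = nullptr;
    std::string content_type_;
};
```

One method of this shape would be needed per data variable, each named after its variable according to the put_\<varname>_data pattern.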
More examples