ABOUT

	bnf2xml is a BNF parser that can be used easily
	by shell applications because it has xml output
	(also other optional reports).

	bnf2xml just reads input and prints xml output, that
	simple.  (marked up by matched bnf grammar of course)

	Most bnf tools are complex: they are meant to be used within
	compiled code running inside programs, and they output complex
	trees.  The bnf suites stating "XML" that I've seen take XML
	input and do not generate XML output (but could do so by
	coding, I'm sure).

To build this C++ program, do:

	$ tar -xzf bnf2xml-7.0.2.tar.gz
	$ cd bnf2xml-7.0.2/

	$ ./configure
	$ make
	$ make install

	$ bnf2xml -h
	$ man bnf2xml

-------------------------------------------------------------

DOCUMENTATION

  README covers bnf2xml specifics and has little bnf instruction.
  For education on BNF in general:

	search for BNF, or try  http://www.wikipedia.com

	see also: examples in tarball

This is a bnf parser that:

	takes a file as input
		reads the bnf definition file
		identifies patterns in file using bnf definitions
	outputs XML markup of the definitions matched

BETA
	this is a beta release not ready for heavier use
	testing is on-going to make it stable for
	target "old school EBNF" input and bnf files.

	changelog is far below

OUTPUT
	outputs the highest bnf line matched,
	which will be the topmost definition
	(the bnf line to match can be picked, though)

	Report C option outputs a text table of all matches
	(found or failed) (it's a trace of this table that makes
	the xml output)

	see OUTPUT EXAMPLE far below.

BNF
	bnf2xml has some enhancements, but it does not follow, or
	cannot do, all "new bnf" syntaxes, see (1).  bnf2xml is
	well featured for its size, but far less so than the ANTLR
	compiler parser product.

SYMBOL  DESCRIPTION

 "a"    anything inside quotes is a terminal to match against input
 <a>    is non-terminal, meaning it is a rule to look-up and apply
  |     logical OR, ie. <letter> | <digit> -> "a" or "1"
        logical AND is otherwise assumed, ie <alph> <digit> -> "a1"
 []     optional expressions, absorb input only if all expr are true,
        the result is always true (matches 0 or 1 times)
 {}     same as ] except it matches 0 or more times
 ()     group expressions.  absorb input only if all are true
  *     is }
  +     <a>+ == <a> { <a> }
  ?     is ]

 notes:
  bnf2xml runtime only does unary postfix ops, ie <a>?  if you looked
    in debug [ <a> <b> ] would be re-written equivalently by listing
    <a>], <b>] under new <c>.  see below about expressions collected
    under one symbol, which is a different kind of OR, often easier.
    (different?  | is a short circuit OR, see -o below)

BNF FEATURES

	OPS <> | () {} [] "" + * ?
	fully reflexive, recursive

NON-BNF  (features added):

	operators  . - ^ ! # @ = ~
        syntax     <>== "" "" "" ...

	tokens: <BNF_QUOTE> <BNF_8bit> <BNF_ZERO>
		<MATCH_LIST> <MATCH_SEP> <RECORD_SEPARATOR>

	*   and ? are quiet on 0 matches
	{}  and [] are noisy on 0 matches ; <a></a>
	{}  and [] can appear as <a> } which avoids an extra rule
	<>  == special, match against lists of | "string"s using binary search
	=   assign <a>= saves text found by last token into <a>
	~   congruent <a>~ can do a <repl> in a pair context (ie like typedef)
	@   quiet, shallow, preceding item is xml quiet
	@@  quiet, deep, any matches it made also quiet
	^   reflex OP is just alternate syntax for compounded *, see below
	.   skip the input 1 forward
	-   skip the input 1 back (dash)
	$   "n"$ set next skip to n forward ("5"$ . ; skip 5)
	$`  "n"$` set next skip to n forward, emit data in output
	#   test previous token truth throw out result
	!   reverses prev token truth throw out result
	`   emit last terminal or token if rule-line is true
	   `` emit one deep , lookup token then emit (left quote)
	   .` emit current input char, does not skip input
	   -` emit current input position, converted to decimal (dash)
	   %` emit incremental counter in decimal
	&   quit, stops, and prints if --streaming didn't already
            see LIMITS
	&&  forgets past input blocks, prints if streaming, continues
 
	Use only one postfix per object (excepting @, @ must be last)
		postfixes: ~ = ] } * + ? ` $
		independents: | # ! & . - %

	* ? + ! # ^ ` @ ~ = & _ $  must be attached (no <white>)
	example: <a> ? is <a> "?" and matches "?" as a character; <a>? is ok
	Attachment behavior can be avoided by compiling "NO_UNQUOTED"

	(the orig. idea was that these cryptic opts (easier to work with
	 once learned) could be replaced from <full_name> to character ops
	 using sed(1) before use as bnf file, avoiding extra bnf parse coding)

	See "Extra OPS" and see LIMITS for more details.

	Remaining simple, old EBNF is a goal, but with pretty xml.
	With ~= now added bnf2xml is considered complete, it is
	thought to do all it needs to for the intended output goal.
	Please comment if you think otherwise.

ABNF EBNF support:

	No.  However ::= can be subst to = by sed(1) before the bnf
	is parsed.  And, while I have no plans for it, the bnf file could
	define some EBNF in terms of BNF (and then have EBNF appended to it).

BNF ENTRY and EXIT

	BNF needs a "starting left token". <bnf> is used below.
	bnf2xml handles lists everywhere even at top level.  The top
	level "runs" slightly different (ie -l, see options).

	<bnf>	::= <prog1>		; a single token top level
	<bnf>	::= <prog2>		; or list of entries at top
	<prog1>	::= hi 			; bnf rules
	<prog1>	::= hello " " world	; bnf rules
	...

	Exit.  The search must match all of the input (including the last
	EOL) or it fails, unless using the special symbol designed to quit
	early.

	Questionable exit.  It's unknown if all of <a> should be all
	input until " "? is checked.  If input is "hi ()" the 1st
	<a> matches, " "?, then fails: which missed trying 2nd <a>.

	<bnf>	::= <a> " "?
	<a>	::= hi
	<a>	::= hi " ()"
				; a sure exit below
	<bnf>	::= <a>
	<a>	::= hi " ()" " "?
	<a>	::= hi " "?

	(allowing two <a>, tried for truth in turn, may be a feature specific
	 to bnf2xml.  it is convenient and avoids complications, see OR)
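The "questionable exit" can be illustrated with a toy matcher.  This is
only a sketch in Python, not bnf2xml's code: it assumes that a run of
same-named rules is tried in order, that the first one to match is kept
(no later retry), and that EOL handling is ignored.  The names
match_seq, match_rule and parse are made up for the illustration.

```python
def match_seq(items, text, pos, rules):
    """Match items left to right; return the new position or None."""
    for item in items:
        optional = isinstance(item, tuple)      # ("?", x) models x?
        sym = item[1] if optional else item
        if sym in rules:                        # non-terminal: look up rule
            new = match_rule(sym, text, pos, rules)
        elif text.startswith(sym, pos):         # terminal: compare to input
            new = pos + len(sym)
        else:
            new = None
        if new is None:
            if not optional:
                return None
        else:
            pos = new
    return pos


def match_rule(name, text, pos, rules):
    """Try each same-named rule in order; keep the first that matches."""
    for alternative in rules[name]:
        new = match_seq(alternative, text, pos, rules)
        if new is not None:
            return new
    return None


def parse(text, rules, start="<bnf>"):
    """True only if the start rule absorbs all of the input."""
    return match_rule(start, text, 0, rules) == len(text)


questionable = {
    "<bnf>": [["<a>", ("?", " ")]],
    "<a>": [["hi"], ["hi", " ()"]],
}
sure = {
    "<bnf>": [["<a>"]],
    "<a>": [["hi", " ()", ("?", " ")], ["hi", ("?", " ")]],
}

print(parse("hi ()", questionable))
print(parse("hi ()", sure))
```

With the first grammar the 1st <a> absorbs only "hi" and the 2nd <a> is
never tried, so "hi ()" fails; with the second grammar the longer rule
is listed first and the same input matches.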

QUOTING BNF syntax

	default is ", use -b "'" to set BNF_QUOTE to '

	bnf quotation is used to indicate what is NOT bnf syntax

	bnf syntax uses ::= and <>(){}[] space, tab, EOL, and
	also +*? when not preceded by any space.  ex:

		"<a>"		; is not a bnf symbol
		"hi there"	; is one string, and quotes are removed

	About matching the BNF_QUOTE char itself...
	There are NO BNF QUOTING RULES: must be pairs.
	(ie, sh(1) has rules to subvert quote rules, is error prone)

	Quoting is absolute so BNF_QUOTE is "built-in".
		<BNF_QUOTE>hi there<BNF_QUOTE>
	matches "hi there" including the quotes

	BNF allows line splitting (last char \), absorbs spaces and EOL
	BNF only sees a new rule start when < is at beginning of a line
	    (and is followed by ::= and is not inside BNF_QUOTE)
<a> ::= <b> ...

	<BNF_ZERO> is also needed if a "string" contains a 0, because
	strcpy(3) is used in reading the bnf and it stops at 0's.

	(input file has binary 0 being matched, not the tag, of course)

EOL

	How used or not used:
	$ echo "ask" | ./bnf2xml
	$ echo "take 1the 0door" | ./bnf2xml --loop
	--loop doesn't match <EOL> because it's designed to take
	input a line at a time; it matches bnf lines without it.
	bnf lines that don't end in <EOL> are matched
		(note two <EOL> matches echo -e "\n")

NOTES

	see tarball for further README and examples

""
	<a> ::= "" 	; matches no input or fully absorbed input

@
	<a>@	; quiet the last token or terminal.  shallow means
                ; <a></a> will not appear but what is found by <a>
                ; does.  "and"@ also works and nothing prints.
		; @ must be last, ie, <a>+@ is ok.

^
	reflex is similar to *  ; avoid reflexes to be more compatible
	<a> ::= <b>^		; simple reflex
	<a> ::= <b>^ <c>^ <d>	; fully reflexive

<repl>
	simple replacing		; is better done by input prefilter
	<b> ::= <repl> = "a" "b"	; if <b> a is matched <b> c is
	<b> ::= <foo> == "c" "d"	; answer, always next line used

=
	= only sets (repl can be achieved with more bnf rules)
	<b> <a>=	; <- saves text found by <b> into <a>
	<a> ::= "foo"	; <- make sure <a> is a defined terminal
			; = does not pair symbol <b> to symbol <a>
	= does not reset if in failed rule, it is sticky

~
	~ is for replacing items (or paired lists) only if found
	in a certain context.  like =, <b> <a>~ saves <b> into <a>

		ex. CPL, "typedef int I" means replace all I with int
		if I is found and is an identifier previously found
		in typedef, repl, else I is left alone by ~

	<b>  ::= <c>+ <tfe>	; for full example of ~ see junk2/bnf.8
	<c>  ::= typedef " " <ident> <tie>~ " " <ident> <tfe>~ " ; "
	<tfe> ::= <repl> ==	; see <repl> above and
	<tie> ::= <foo> ==	; dont forget to put these two lines in bnf

	<b> matches the typedef statement itself in input, and if that happens
	tie and tfe are set up by ~ while matching the statement, so that tfe
	becomes tie when tfe is matched in its context <ident> (typedef is
	a string which 8.b

	finally == is blank?  the elements of == are set = while scanning, they
	are filled in by the (lets say typedef) found in input

	~ the above pairs replaces <tfe> to <tie> automatically
          because <tfe> uses <repl>
	~ lists can result, uses <MATCH_LIST> in xml
	~ uses <> == "" "" "", can be used in ways not shown above
	~ resets if within failing rule, but see LIMITATIONS
	see c.bnf (or bnf.c.txt)

{{}}
	a run of same lhs are tried in order as if ORed (ex shows
	recursion too) for example, input: "{{inside1}}" or "{{}}"
	<a> :: "inside1"
	<a> :: "{" <a>* "}"	; note <a> is not automatically true
	<a> :: "inside2"	; <a>* allows "{{}}" to match

	(example is a simple brace matching in input "{" which would in xml
	 show what is captured by braces, .bnf examples have more on that)

	However the below is different: <a> is both first and inside OR
	<a> :: <b> | <a><d> | <c>
		if no other <a> are defined this needs -o because
		| uses "short circuit logic".  <a><d> only after <b> fails
		then <a><d> fails, an inf loop results.  arg -o converts
		the above to three <a> (a different kind of OR)
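The rewrite that -o is described as doing can be sketched as a text
transformation.  This Python sketch is an assumption about the idea
only, not bnf2xml's code: split_or is a made-up name, and it is assumed
that only a | outside quotes and outside ( ) starts a new rule.

```python
import re


def split_or(rule, quote='"'):
    """Split one rule line at top-level | into one line per alternative."""
    m = re.match(r"\s*(<[^>]*>)\s*(::=?)\s*(.*)", rule)
    lhs, sep, rhs = m.group(1), m.group(2), m.group(3)
    alts, cur, in_quote, depth = [], "", False, 0
    for ch in rhs:
        if ch == quote:
            in_quote = not in_quote         # quoted text is never syntax
        elif not in_quote and ch == "(":
            depth += 1
        elif not in_quote and ch == ")":
            depth -= 1
        if ch == "|" and not in_quote and depth == 0:
            alts.append(cur.strip())        # close the current alternative
            cur = ""
        else:
            cur += ch
    alts.append(cur.strip())
    return [f"{lhs} {sep} {alt}" for alt in alts]


for line in split_or("<a> :: <b> | <a><d> | <c>"):
    print(line)
```

The three lines produced form a run of <a> that can be tried in order,
which is the "different kind of OR" described above.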

diverting

	bnf2xml's front-recursion

	its use is diminished because it's unnecessary and may change

	bnf2xml does not assume to tail recurse the tail of any front
	recursion.  but if it did (right from K&R C book):
		<a> ::= <a> "[]" ; <a> ::= <a> "()" ; <a> ::= <ident>
		x[][][]...  ; would match and be ok, interpretively
		x()()()...  ; is not intended to allow
		x()[]()...  ; also not intended to allow
	note that defining <c> ::= <a> does not hide that <a> is
	front recursive, and <c> still needs a termination rule.

	front-recursing <a>'s 1st rule (see above) shuts off the 1st
	rule until the 1st completes (otherwise an inf loop would occur).
	bnf2xml does not retry <a> or parts of it automatically; use + for that.
	If bnf2xml did these things the recursion results might be
	ambiguous and bnf2xml has no syntax to handle ambiguity.
	This may be improved in future if ambiguity is not a result.

	NEW: I believe the above bnf <a> is "flawed", as it should be
	assumed to be bison syntax (bottom up parser), not necessarily bnf.
	Bison may not front recurse: given bottom up rules, it decides
	itself what to match.  The above depends on the states of such;
	the appearance of front recursion in the syntax isn't what it seems.
	For ex. it may include lower matches - I'd have to try it.

	Be warned that operators - ! # could cause side-effects when
	used with front-recursion, true w/ no input progress may loop.

undivert

about bnf autorules...

	auto added rules:
	<a> :: ( <c> ... ) <d>

	gets rewritten to:
	<a> :: <) 1> <d>
	<) 1> :: <c> ...

	This is because unary ops are used; there is no scanning to
	"search for )" (compiled regex does this).  The contents "<c> ..."
	are under one Left token <) so the truth of "<c> and ..." is tested
	(known).  See -d, it shows the table after rewrite (if any).
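The auto rule rewrite shown above can be sketched in a few lines of
Python.  This is an illustration under assumptions, not bnf2xml's code:
hoist_groups is a made-up name, quoted terminals containing parentheses
are not handled, and the numbering of nested groups is a guess (the
README only shows a single group).

```python
import re

INNERMOST = re.compile(r"\(([^()]*)\)")     # a group with no group inside


def hoist_groups(rule):
    """Move each ( ... ) group of one rule line under its own new rule."""
    lhs, rhs = [part.strip() for part in rule.split("::", 1)]
    bodies = []
    while True:
        m = INNERMOST.search(rhs)
        if m is None:
            break
        bodies.append(m.group(1).strip())
        # A NUL-delimited placeholder keeps the ")" of the final name
        # out of the way while the remaining groups are searched.
        rhs = rhs[:m.start()] + f"\x00{len(bodies)}\x00" + rhs[m.end():]

    def names(text):
        return re.sub(r"\x00(\d+)\x00", r"<) \1>", text)

    lines = [f"{lhs} :: {names(rhs)}"]
    for number, body in enumerate(bodies, 1):
        lines.append(f"<) {number}> :: {names(body)}")
    return lines


for line in hoist_groups("<a> :: ( <c> ... ) <d>"):
    print(line)
```

For the example rule this prints the same two lines as the rewrite
shown above.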

	shortcut:
		use <a> } in bnf.txt avoids the above since <a> can
		be composite (<a> can be a list of rules to try)

	no -o or auto rules are needed or added if one uses repeated
	<a> lines instead of (multiple items){mi}[mi]

rule lists

	as seen above, mult. <a> are each tried in order (if recursing
	on one, that one is turned off to avoid inf recursion until return)

	this is NOT supported (<a> must all appear together in bnf file)
	<a> ::= ...
	<b> ::= ...
	<a> ::= ...
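
A bnf file can be checked for this before use.  The following Python
sketch is a hypothetical helper, not part of bnf2xml; it assumes, as
described under QUOTING, that a rule starts only where < is at the
beginning of a line and the line contains ::=.

```python
def split_up_rules(bnf_text):
    """Return left tokens whose rule lines do not all appear together."""
    seen, offenders, previous = set(), [], None
    for line in bnf_text.splitlines():
        if not line.startswith("<") or "::=" not in line:
            continue                        # comment, blank or continuation
        lhs = line.split("::=", 1)[0].strip()
        if lhs != previous and lhs in seen and lhs not in offenders:
            offenders.append(lhs)           # lhs came back after another rule
        seen.add(lhs)
        previous = lhs
    return offenders


bad = "<a> ::= x\n<b> ::= y\n<a> ::= z\n"
good = "<a> ::= x\n<a> ::= z\n<b> ::= y\n"
print(split_up_rules(bad))
print(split_up_rules(good))
```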

; TERMINALS
	is now only for ops like -g to say where to stop, preferential

  . and -  Dot and Minus skip the input's string position forward or back
           1 character, or n characters if preceded by "n"$.  This is not
           the same as matching, and it is reset on failure of the rule.
  !        Not changes the return value and undoes any matching.  Note
           that <a>+! may not be supported.
  #        Test keeps the return value but undoes any matching.
  `        Emit emits the previous token or terminal when the rule line
           is true.  `` looks up the symbol one deep first, .` emits the
           current input char, -` the ipos, %` a counter, and "n"$` what
           was skipped.
  @        Quiet quiets the last token or terminal, and @@ quiets deep.
  =        Equal assigns the text of the previous match to a token.
  ~        Congruent is like equal, with lists as described above.
  &        Quit prints, then exits with the truth of <a>, as if <a>?&.
  &&       forgets past input blocks, prints if streaming, and continues.

<MATCH_LIST> only occurs due to ~ if the bnf allowed R to be non-unique.

  QUIT <a>& prints what is currently known and exits with the truth of <a>.
  Failure of <a> does not cause the rule to fail or block the & action;
  think of it as <a>?& and write rules accordingly.

  ~ can cause <MATCH_LIST> <MATCH_SEP> in XML if your BNF ended up
  allowing a list of R that are the same (see notes on ~).
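
The difference between matching and the # and ! ops can be sketched with
tiny matcher functions.  This Python sketch is only an illustration of
"keep or reverse the truth, undo the matching", not bnf2xml's code; the
names literal, peek and negate are made up.

```python
def literal(s):
    """Matcher for a terminal: new position on success, None on failure."""
    def matcher(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return matcher


def peek(inner):
    """Models #: keep the truth of inner, but absorb no input."""
    def matcher(text, pos):
        return pos if inner(text, pos) is not None else None
    return matcher


def negate(inner):
    """Models !: reverse the truth of inner, and absorb no input."""
    def matcher(text, pos):
        return pos if inner(text, pos) is None else None
    return matcher


ab = literal("ab")
print(ab("abc", 0))            # matching moves the input position
print(peek(ab)("abc", 0))      # true, position unchanged
print(negate(ab)("abc", 0))    # false, because "ab" is there
```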

EXAMPLES

junk2/ contains examples and regression tests, which are meant to be used like:

$ sh junk2/tst

AN OUTPUT EXAMPLE

c.bnf is an almost full K&R C bnf for use with bnf2xml.
It has not been tested on much beyond the regression tests yet.

# see README.bnf.c.txt
$ echo -e "int main(){;}" | ./bnf2xml ./junk2/c.bnf --loop | junk2/g4

<bnf><program><external-definition><function-definition><decl-specifier-dd><type-specifier><type>int</type></type-specifier></decl-specifier-dd><space> </space><function-declarator><declarator-function><declarator-simple><identifier>main</identifier></declarator-simple><fun>(<parameter-list></parameter-list>)</fun></declarator-function></function-declarator><function-body><type-decl-list></type-decl-list><function-statement>{<declaration-list></declaration-list><statement-list><statement>;</statement></statement-list>}</function-statement></function-body></function-definition><data-definition-list></data-definition-list></external-definition></program></bnf>

Some of the tags emitted above were marked in c.bnf to be emitted even
if empty.  The tags could be tree formatted, or be used as colors for
html (or, ie, an xml parser can pull all <identifier></identifier> as if
from a database).  bnf2xml isn't a compiler, and c.bnf is not complete K&R.
c.bnf was "ok" for use in testing bnf2xml for bugs (actually, c.bnf is
not good at showing bugs - bugs were more tenacious than that).
c.bnf may or may not be completed in future.
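
As a sketch of the "xml parser as a database" idea, Python's standard
xml.etree.ElementTree can pull the identifiers out of the output.  The
sample string below is a hand-shortened fragment of the output shown
above, not a real run of bnf2xml.

```python
import xml.etree.ElementTree as ET

# Hand-shortened from the example output above (an assumption: real
# output has more nesting, but the tag names are the same).
sample = (
    "<bnf><program><function-definition>"
    "<type>int</type><space> </space>"
    "<identifier>main</identifier>"
    "</function-definition></program></bnf>"
)

root = ET.fromstring(sample)
identifiers = [node.text for node in root.iter("identifier")]
types = [node.text for node in root.iter("type")]
print(identifiers, types)
```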

junk2/ includes basic test scripts to check many features of bnf2xml
for regression


DEBUGGING

	shows terminals and symbols table.  and can often show
	relative realtime progress while matching.
	(with -C, shows full truth table, maybe very long though)

	$ bnf2xml -d -d -k -C | less
	$ bnf2xml -l 123 # may be useful
        # & operator should be useful
        # or put a printf in code

LIMITATIONS

	it's possible to define an infinite loop that does nothing

	<a> | <b>^  ; ^ is not designed to stay within |, goes to <a>
	<a>+^+      ; ^ is not designed for post-fix ops and may be wrong
		    ;   ie. 1st + clobbers 2nd, + isn't a saved state
	[<a>]+	    ; senseless or double logic is not checked
	<a>~ <a>=   ; always save under first of list of <a>
	            ; = does not reset if rule fails , saves only
	<a>~        ; if <repl> pairing is used it must be on R
		    ; ex, typedef <tdt> <Lt>~ <ident> <Ri>~
	"100"$` .   ; lapping input len just shows warning
	            ; start>len of skip (due to maybe $$) exits

	there is only ONE unary postfix OP, excepting @, so organize
	your work by nesting of definitions
		postfixes: ~ = ] } * + ? ` $
		independents: | # ! & . - %
		ex: <a>+@ ok <a>+= wrong <a>+` wrong <b>` ok
		& is indep so <a>+& works
		+ & must be attached so <a> + & is <a> "+" "&"

	<a>& without --streaming can't know how unfinished searches
	above could end up.  results are limited.  searches below
	current that succeeded print, unfinished upper labels are shown

	<MATCH_LIST> is only generated by ~
	<BNF_RECORD> is only generated if options select mult lines

	the bnf file itself cannot contain binary 0's or "'s
		See: <BNF_ZERO> <BNF_QUOTE>
	the bnf file itself cannot have strings greater than
		#define MAXN 8096

	independents must be quoted if literal; this is not optional

EFFICIENCY

	<a> ::= <> == "ab" "abe" "abel" ...

	Is more suitable than | for long string lists.  Binary Search
	into list is done, note entries must be pre-sorted.
	Special keyword <repl> uses next line for pairing.

	BNF efficiency.  Common sense.  If you match something many
	times it makes sense to make it a symbol which is already
	matched.
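
Why the list must be pre-sorted can be seen in a short Python sketch
using the standard bisect module.  It only shows exact lookup in a
sorted list; how bnf2xml compares the input against the entries is not
described here, and in_sorted_list is a made-up name.

```python
from bisect import bisect_left


def in_sorted_list(words, candidate):
    """Binary search: True if candidate is in the pre-sorted list words."""
    i = bisect_left(words, candidate)
    return i < len(words) and words[i] == candidate


words = ["ab", "abe", "abel"]               # already in sorted order
print(in_sorted_list(words, "abe"))
print(in_sorted_list(words, "abc"))

# An unsorted list breaks the search even though the entry is present.
print(in_sorted_list(["abel", "ab", "abe"], "abel"))
```

Each lookup takes about log2(n) comparisons instead of up to n for a
chain of |, which is the reason for the pre-sorting requirement.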

SCAN SPEED

	bnf2xml can be "slow" but progress should stay even for any size file

TODOS

	finishing the K&R C c.bnf, maybe C ver 1, maybe a few speed improvements,
		though it will never be "hand written scanner fast" at all

	remove the VEC lib for efficiency; it can have "c++'s need 3 copy syndrome"

COMPARISONS

	sed(1) or grep(1) use regular expression definitions but aren't
	set up well to parse a language: but are better for small tasks.
	regex can be compiled in your C app to match and get return(s).
	sed(1) can be used to parse, ie, a very simple language which
	is used in a program's user editable startup .cfg file

	awk(1) is wonderful: its matching abilities are formidable, and
	its language plain.  awk has limitations (such as memory, lag
	if resetting RS).  Awk isn't well suited for all things
	(or is it?)

	cpp can be used to pre-process an input file which adds macro
	ability to any input file (usu fed to a parser, see also m4).

	yacc/flex can be used to make a .c file to parse a set
	pattern which calls set functions when called to read input
	(may be ambiguous and require C coded fixups, per pattern).
	Complex but speedy once done.  "Compiler flexors" read source
	code; the funs output what the linker wants to know, w/fixups.
	flex is straightforward for easy languages, so many C progs use
	these C flexor functions to parse data / user input files.

	XSLT parsers are (new) and some use BNF.  They tend to expect
	(a particular) XML environment and do HTML output.  ie, CDuce

<snip>
  there is a little removed from the README viewable on "files download page"
</snip>

LICENSE

     gpl2

-------------------------------------------------------------
ChangeLog last two

bnf2xml (7.0.2) beta; urgency=low
  * yes same version, just README is revised a little more readable i hope
  * future: believe c.bnf typedef issue mentioned was fixed in c.bnf, but it is
    not released yet as still working other mentioned issues for a 7.0.3

 -- John Hendrickson <debguy@sourceforge.net>  Thu, 07 Apr 2016 15:48:15 -0400

bnf2xml (7.0.1) beta; urgency=low
  * added autoconf support, rename files, etc, tidy a little

 -- John Hendrickson <debguy@sourceforge.net>  Mon, 22 Dec 2014 12:39:14 -0500

Source: README.txt, updated 2016-04-08