Name                    Modified    Size      Downloads / Week
README.bnf2xml-7.0.1    2015-02-12  22.4 kB   11
bnf2xml-7.0.2.tar.gz    2015-02-12  315.7 kB  11
bnf.c.prettier.txt      2013-06-26  24.9 kB   11
bnf2xml-5.tar.gz        2013-06-05  150.9 kB  11
bnf2xml-1.tar.gz        2013-05-04  102.2 kB  11
Totals: 5 Items                     616.1 kB  5

ABOUT

    bnf2xml is a BNF parser that shell applications can use easily
    because it has xml output (and other optional reports).

    bnf2xml just reads input and prints xml output, that simple
    (marked up by the matched bnf grammar, of course).

    Most bnf tools are complex: they are meant to be used within
    another language, running inside programs that use complex trees.
    The bnf suites stating "XML" that I've seen take XML input and do
    not generate XML output (but could do so by coding, I'm sure).

bnf2xml (7.1b) beta

    To build this c++ prog do:

      $ tar -xzf bnf2xml-7.0.1.tar.gz
      $ cd bnf2xml-7.0.1/
      $ ./configure
      $ make
      $ make install
      $ bnf2xml -h
      $ man bnf2xml

-------------------------------------------------------------

DOCUMENTATION

    This README covers bnf2xml specifics and has little bnf
    instruction. For education on BNF in general, google BNF and
    try http://www.wikipedia.com

    see also: examples in tarball

    This is a bnf parser that:

      takes a file as input
      reads the bnf definition file
      identifies patterns in the file using the bnf definitions
      outputs XML markup of the definitions matched

BETA

    This is a beta release, not ready for heavier use. Testing is
    on-going to make it stable for the target: "old school EBNF"
    input and bnf files.

    changelog is far below

OUTPUT

    Outputs the highest bnf line matched, which will be the top most
    definition (the bnf line to match can be picked though).

    Report

      The C option outputs a text table of all matches (found or
      failed). It's a trace of this table that makes the xml output.

    see OUTPUT EXAMPLE far below.

BNF

    bnf2xml has some enhancements but does not follow, or cannot do,
    all "new bnf" syntaxes, see (1). bnf2xml is well featured for its
    size, but has far less than the ANTLR compiler parser product.

    SYMBOL  DESCRIPTION

    "a"     anything inside quotes is a terminal to match against input
    <a>     is a non-terminal, meaning it is a rule to look up and apply
    |       logical OR, ie.
              <letter> | <digit>  ->  "a" or "1"
            logical AND is otherwise assumed, ie
              <alph> <digit>  ->  "a1"
    []      optional expressions. absorb input only if all expr are
            true; the result is always true (matches 0 or 1 times)
    {}      same as ] except it matches 0 or more times
    ()      group expressions. absorb input only if all are true
    *       is }
    +       <a>+ == <a> { <a> }
    ?       is ]

    notes: bnf2xml runtime only does unary postfix ops, ie <a>?
    if you looked in debug, [ <a> <b> ] would be re-written
    equivalently.

    Let's say input is "hello world" and "hello " was already true.
    Then terminal "e" (or any <a> which asks for "e") fails to match
    because input is on "w". <alph> would match; it'll take the "w".

BNF FEATURES

    OPS   <> | () {} [] "" + * ?
          fully reflexive, recursive

    NON-BNF (features added):

      operators   . - ^ ! # @ = ~
      syntax      <>== "" "" "" ...
      tokens:     <BNF_QUOTE> <BNF_8bit> <BNF_ZERO> <MATCH_LIST>
                  <MATCH_SEP> <RECORD_SEPARATOR>

      * and ? are quiet on 0 matches
      {} and [] are noisy on 0 matches     ; <a></a>
      {} and [] can appear as <a> } which avoids an extra rule

      <> ==  for matching against longer lists of strings. able to
             binary search such lists. see below.
      =   assign     <a>= saves text found by last token into <a>
      ~   congruent  <a>~ remembers lists and pairs for <repl>
      @   quiet, shallow, preceding item is xml quiet
      @@  quiet, deep, any matches it made are also quiet
      ^   reflex OP is a different way to do *, see below
      .   skip the input 1 forward
      -   skip the input 1 back (dash)
      $   "n"$ set next skip to n forward  ("5"$ . ; skip 5)
      $`  "n"$` set next skip to n forward, emit data in output
      #   test previous token truth, throw out result
      !   reverses prev token truth, throw out result
      `   emit last terminal or token if rule-line is true
      ``  emit one deep, lookup token then emit (left quote)
      .`  emit current input char, does not skip input
      -`  emit current input position, converted to decimal (dash)
      %`  emit incremental counter in decimal
      &   quit, stops, and prints if --streaming didn't already
          see LIMITS
      &&  forgets past input blocks, prints if streaming, continues

    Use only one postfix per object (excepting @, @ must be last)

      postfixes:     ~ = ] } * + ? ` $
      independents:  | # ! & . - %

    * ? + ! # ^ ` @ ~ = & _ $ must be attached (no <white>)

      example:  <a> ?   is <a> "?", matches "?" as a character
                <a>?    ok

    Attachment behavior can be avoided by compiling "NO_UNQUOTED".
    See "Extra OPS" and see LIMITS for more details.

    Remaining simply old EBNF is a goal, but with pretty xml. With
    ~ and = now added bnf2xml is considered complete: it is thought
    to do all it needs to for the intended output goal. Please
    comment if you think otherwise.

    ABNF, EBNF support: No. However ::= can be subst to = by sed(1).
    And while I have no plan for it, the bnf file could define EBNF
    in terms of BNF and then have EBNF appended to it, I'd hope.

BNF ENTRY and EXIT

    BNF needs a "starting left token". <bnf> is used below. bnf2xml
    handles lists everywhere, even at top level. The top level "runs"
    slightly differently (ie -l, see options).

      <bnf> ::= <prog1>               ; a single token top level
      <bnf> ::= <prog2>               ; or list of entries at top
      <prog1> ::= hi                  ; bnf rules
      <prog1> ::= hello " " world     ; bnf rules
      ...

    Exit. The search must match all of the input (including the last
    EOL) or it fails, unless using the special symbol designed to
    quit early.

    Questionable exit. It's unknown if all of <a> should be all input
    until " "? is checked. If input is "hi ()" the 1st <a> matches,
    then " "?, then it fails: which missed trying the 2nd <a>.

      <bnf> ::= <a> " "?
      <a> ::= hi
      <a> ::= hi " ()"

      ; a sure exit below
      <bnf> ::= <a>
      <a> ::= hi " ()" " "?
      <a> ::= hi " "?
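
    The "questionable exit" above can be sketched outside bnf2xml.
    What follows is a minimal Python model, not bnf2xml code: it
    assumes that the rules for one symbol are tried in order, that
    the first one to match is kept, and that nothing is retried
    after the top-level rule fails. The names match_seq, match_first
    and parse are hypothetical, made up for this illustration.

```python
# Hypothetical model of ordered, first-match rule selection.
# Each rule is a list of (literal, optional) parts.

def match_seq(parts, text, pos):
    """Match parts in sequence from pos; return the new position or None."""
    for literal, optional in parts:
        if text.startswith(literal, pos):
            pos += len(literal)
        elif not optional:
            return None
    return pos

def match_first(alternatives, text, pos):
    """Return the end position of the first alternative that matches."""
    for parts in alternatives:
        end = match_seq(parts, text, pos)
        if end is not None:
            return end
    return None

def parse(top_tail, alternatives, text):
    """Model <bnf> ::= <a> top_tail; succeed only if all input is consumed."""
    end = match_first(alternatives, text, 0)
    if end is None:
        return False
    end = match_seq(top_tail, text, end)
    return end is not None and end == len(text)

SP_OPT = (" ", True)  # models " "?

# <bnf> ::= <a> " "?    <a> ::= hi    <a> ::= hi " ()"
questionable = [[("hi", False)], [("hi", False), (" ()", False)]]

# <bnf> ::= <a>    <a> ::= hi " ()" " "?    <a> ::= hi " "?
sure = [[("hi", False), (" ()", False), SP_OPT], [("hi", False), SP_OPT]]

print(parse([SP_OPT], questionable, "hi ()"))  # False: 1st <a> wins, "()" is left over
print(parse([], sure, "hi ()"))                # True: the longer rule is tried first
print(parse([], sure, "hi"))                   # True: falls through to the 2nd rule
```

    Under those assumptions the fix is only a matter of ordering:
    put the longer rule first and move " "? into each <a> rule so
    each alternative can consume to the end of input by itself.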

QUOTING

    BNF syntax default is ", use -b "'" to set BNF_QUOTE to '

    bnf quotation is used to indicate what is NOT bnf syntax.
    bnf syntax uses ::= and <>(){}[] space, tab, EOL, and also +*?
    when not preceded by any space.

      ex: "<a>"         ; is not a bnf symbol
          "hi there"    ; is one string, and quotes are removed

    About matching the BNF_QUOTE char itself...

    There are NO BNF QUOTING RULES: quotes must be pairs.
    (ie, sh(1) has rules to subvert quote rules, which is error prone)
    Quoting is absolute so BNF_QUOTE is "built-in".

      <BNF_QUOTE>hi there<BNF_QUOTE>

    matches "hi there" including the quotes

    BNF allows line splitting (last char \), absorbs spaces and EOL

    BNF only sees a new rule start when < is at the beginning of a
    line (and is followed by ::= and is not inside BNF_QUOTE)

      <a> ::= <b> ...

    <BNF_ZERO> is also needed if "string" contains 0, as strcpy(3)
    is used in reading bnf, which stops at 0's.

EOL

    How used or not used:

      $ echo "ask" | ./bnf2xml
      $ echo "take 1the 0door" | ./bnf2xml --loop

    --loop doesn't match <EOL> because it's designed to take input a
    line at a time; it matches bnf lines w/out. bnf lines that don't
    end in <EOL> are matched (note two <EOL> matches echo -e "\n")

NOTES

    see tarball for further write-ups and examples

  ""   <a> ::= ""   ; matches no input or fully absorbed input

  @    <a>@         ; quiet the last token or terminal. shallow means
                    ; <a></a> will not appear but what is found by <a>
                    ; does. "and"@ also works and nothing prints.
                    ; @ must be last, ie, <a>+@ is ok.
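
    The effect of shallow quiet on the xml can be pictured with a
    small Python sketch. This is an illustration only, not how
    bnf2xml builds its output: Match, Text and render are
    hypothetical names, and the model covers just the two cases
    described above (a quiet symbol loses its own tags but keeps
    what it found; a quiet terminal prints nothing).

```python
# Hypothetical model of shallow quiet (@) when rendering matches as xml.
from dataclasses import dataclass, field
from typing import List, Union
from xml.sax.saxutils import escape

@dataclass
class Text:
    value: str           # input text matched by a terminal
    quiet: bool = False  # models "and"@ : nothing prints

@dataclass
class Match:
    tag: str                                   # symbol name, ie <a>
    children: List[Union["Match", Text]] = field(default_factory=list)
    quiet: bool = False                        # models <a>@

def render(node):
    """Render a match tree as xml text, honouring shallow quiet flags."""
    if isinstance(node, Text):
        return "" if node.quiet else escape(node.value)
    inner = "".join(render(child) for child in node.children)
    if node.quiet:
        return inner  # the tags are dropped, what was found still prints
    return f"<{node.tag}>{inner}</{node.tag}>"

tree = Match("pair", [
    Match("word", [Text("salt")], quiet=True),  # <word>@
    Text(" and ", quiet=True),                  # " and "@
    Match("word", [Text("pepper")]),            # <word>
])

print(render(tree))  # <pair>salt<word>pepper</word></pair>
```

    Deep quiet (@@) is left out of the sketch on purpose, since
    this README does not spell out exactly what still prints.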

  ^    reflex is similar to *     ; avoid reflexes to be more compatible

         <a> ::= <b>^             ; simple reflex
         <a> ::= <b>^ <c>^ <d>    ; fully reflexive

  <repl>  simple replacing        ; is better done by prefilter

         <b> ::= <repl> = "a" "b"    ; if <b> a is matched <b> c is
         <b> ::= <foo> == "c" "d"    ; answer, always next line used

  =    = only sets (repl can be achieved with more bnf rules)

         <b> <a>=       ; <- saves text found by <b> into <a>
         <a> ::= "foo"  ; <- make sure <a> is a defined terminal
                        ; = does not pair symbol <b> to symbol <a>

       = does not reset if in a failed rule, it is sticky

  ~    ~ is for replacing items (or paired lists) only if found in a
       certain context. like =, <b><a>~ saves <b> into <a>

       ex. CPL, "typedef int I" means replace all I with int if I is
       found and is an identifier previously found in typedef, repl,
       else I is left alone by ~

         <b> ::= <c>+ <tfe>      ; for full example of ~ see junk2/bnf.8
         <c> ::= typedef " " <ident> <tie>~ " " <ident> <tfe>~ " ; "
                                 ; DON'T FORGET <repl> and to define
                                 ; see junk2/bnf.c.txt
         <tfe> ::= <repl> ==     ; no rule <repl> must be used so it'd
         <tie> ::= <foo> ==      ; be easy to forget: it's likely wanted!

       ~ the above pairs replaces <tfe> to <tie> automatically because
         <tfe> uses <repl>
       ~ lists can result, uses <MATCH_LIST> in xml
       ~ uses <> == "" "" "", can be used in ways not shown above
       ~ resets if within failing rule, but see LIMITATIONS

  {{}} a run of same lhs are tried in order as if ORed
       (ex shows recursion too)

       for example, input: "{{inside1}}" or "{{}}"

         <a> :: "inside1"
         <a> :: "{" <a>* "}"   ; note <a> is not automatically true
         <a> :: "inside2"      ; <a>* allows "{{}}" to match

       However this is different: <a> is both first and inside OR

         <a> :: <b> | <a><d> | <c>

       if no other <a> are defined this needs -o because | uses
       "short circuit logic". re-trying <b>, the start of <a>, is
       moot if it already failed. see -o below.
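
       The run of three <a> rules in the {{}} note ("inside1", then
       "{" <a>* "}", then "inside2") can be hand-written as a small
       recursive matcher. The Python below is a sketch under the
       assumption that the rules are simply tried top to bottom; it
       is not taken from bnf2xml, and match_a and matches are
       hypothetical names.

```python
# Hypothetical hand-written matcher for the three <a> rules, tried in order.

def match_a(text, pos):
    """Return the position after one <a> starting at pos, or None."""
    # <a> :: "inside1"
    if text.startswith("inside1", pos):
        return pos + len("inside1")
    # <a> :: "{" <a>* "}"
    if text.startswith("{", pos):
        p = pos + 1
        while True:
            nxt = match_a(text, p)
            if nxt is None or nxt == p:  # stop on failure or no progress
                break
            p = nxt
        if text.startswith("}", p):
            return p + 1
    # <a> :: "inside2"
    if text.startswith("inside2", pos):
        return pos + len("inside2")
    return None

def matches(text):
    """True when a single <a> consumes all of text."""
    return match_a(text, 0) == len(text)

print(matches("{{inside1}}"))  # True
print(matches("{{}}"))         # True: <a>* may match zero times
print(matches("{{inside1}"))   # False: the outer "}" is missing
```

       The recursion is safe here because every recursive call sits
       behind a consumed "{". A rule that starts with <a> itself, as
       in the front-recursion case, has no such guard.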

diverting

  bnf2xml's front-recursion

       its use is diminished because it's unnecessary and may change

       bnf2xml does not assume to tail recurse the tail of any front
       recursion. but if it did (right from the K&R C book):

         <a> ::= <a> "[]"    ;
         <a> ::= <a> "()"    ;
         <a> ::= <ident>

         x[][][]...    ; would match and be ok, interpretively
         x()()()...    ; is not intended to allow
         x()[]()...    ; also not intended to allow

       note that defining <c> ::= <a> does not hide that <a> is front
       recursive, and <c> still needs a termination rule.

       front-recursing <a>'s 1st rule (see above) shuts the 1st rule
       until the 1st completes (otherwise an inf loop would occur).
       bnf2xml does not retry <a> or parts of it automatically, use +
       for that.

       If bnf2xml did these things the recursion results might be
       ambiguous and bnf2xml has no syntax to handle ambiguity. This
       may be improved in future if ambiguity is not a result.

       NEW: I believe the above bnf <a> is "flawed", as it should be
       assumed to be bison syntax (bottom up parser), not bnf
       necessarily. Bison may not front recurse: it decides itself,
       given bottom up rules, what to match; the above depends on
       states of such, and the appearance of front recursion in the
       syntax isn't what it seems. For ex. it may include lower
       matches, I'd have to try it.

       Be warned that operators - ! # could cause side-effects when
       used with front-recursion: true w/ no input progress may loop.

undivert

  about bnf autorules...

       auto added rules:

         <a> :: ( <c> ... ) <d>

       gets rewritten to:

         <a> :: <) 1> <d>
         <) 1> :: <c> ...

       why is because internally we don't search for () at all. unary
       ops are used and the table is written to take a unary param to
       such ops (compiled regex does this). see -d.

       shortcut: using <a> } in bnf.txt avoids the above, since <a>
       can be composite (more than one rule)

       no -o or auto rules are needed or added if one uses repeated
       <a> lines instead of (){}[] (and avoids the -o thing)

  rule lists

       orderless multiple rules for <a> is NOT supported

         <a> ::= ...
         <b> ::= ...
         <a> ::= ...
                        ; TERMINALS is now only for ops like -g to say
                        ; where to stop, preferential

    Dot . and Minus - skip the input's string position forward or
    back 1 character, or n chars if preceded by "n"$, which is not the
    same as matching and is reset on failure of the rule. Not ! changes
    the return value and un-does any matching; note that <a> +! is not
    necessarily supported. # Test keeps the return value but undoes
    any matching. ` Emit emits the previous token or terminal when the
    rule line is true, `` looks up the symbol one deep first, .` dot
    emits the current input char, -` the ipos, %` a counter, "n"$`
    what was skipped. @ Quiet quiets the last token or terminal and
    @@ quiets deep. = Equal assigns the text of the previous match to
    a token and ~ Congruent is like equal with lists as described
    above. & Quit prints then exits with the truth of <a>, as if
    <a>?&. && forgets past input blocks, prints if streaming, and
    continues. <MATCH_LIST> only occurs due to ~ if the bnf allowed R
    to be non-uniq.

QUIT

    <a>& prints what is currently known and exits with the truth of <a>

    failure of <a> does not cause the rule to fail or block the &
    action; think of it as <a>?& and write rules accordingly

    ~ can cause <MATCH_LIST> <MATCH_SEP> in XML if your BNF ended
    up allowing a list of R that are the same (see notes on ~).

EXAMPLES

    junk2/ contains examples and regression tests, which are meant
    to be used like:

      $ sh junk2/tst

AN OUTPUT EXAMPLE

    c.bnf is an almost full K&R C bnf for use with bnf2xml. It is
    not tested on much but the regression tests yet, though.

    # see README.bnf.c.txt

      $ echo -e "int main(){;}" | ./bnf2xml ./junk2/c.bnf --loop | junk2/g4

<bnf><program><external-definition><function-definition><decl-specifier-dd><type-specifier><type>int</type></type-specifier></decl-specifier-dd><space> </space><function-declarator><declarator-function><declarator-simple><identifier>main</identifier></declarator-simple><fun>(<parameter-list></parameter-list>)</fun></declarator-function></function-declarator><function-body><type-decl-list></type-decl-list><function-statement>{<declaration-list></declaration-list><statement-list><statement>;</statement></statement-list>}</function-statement></function-body></function-definition><data-definition-list></data-definition-list></external-definition></program></bnf>

    Some of the empty tags emitted above were marked in c.bnf to be
    emitted even if empty.

    The tags could be tree formatted, or be used as colors for html
    (or, ie, an xml parser can pull all <identifiers></identifiers>
    as if from a database).

    bnf2xml isn't a compiler, but c.bnf is good at showing what is
    broken or needed further.

    junk2/ includes basic test scripts to check many features of
    bnf2xml for regression

DEBUGGING

    Shows the terminals and symbols table, and can often show
    relative realtime progress while matching.
    (with -C, shows the full truth table, maybe very long though)

      $ bnf2xml -d -d -k -C | less
      $ bnf2xml -l 123         # may be useful
                               # & operator should be useful
                               # or put a printf in code

LIMITATIONS

    it's possible to define an infinite loop that does nothing

      <a> | <b>^    ; ^ is not designed to stay within |, goes to <a>
      <a>+^+        ; ^ is not designed for post-fix ops and may be wrong
                    ; ie. 1st + clobbers 2nd, + isn't a saved state
      [<a>]+        ; senseless or double logic is not checked
      <a>~ <a>=     ; always save under first of list of <a>
                    ; = does not reset if rule fails, saves only
      <a>~          ; if <repl> pairing is used it must be on R
                    ; ex, typedef <tdt> <Lt>~ <ident> <Ri>~
      "100"$` .
                    ; lapping input len just shows warning
                    ; start>len of skip (due to maybe $$) exits

    there is only ONE unary postfix OP, excepting @, so organize your
    work by nesting of definitions

      postfixes:     ~ = ] } * + ? ` $
      independents:  | # ! & . - %

      ex:  <a>+@    ok
           <a>+=    wrong
           <a>+`    wrong
           <b>`     ok

      & is indep so <a>+& works
      + & must be attached so <a> + & is <a> "+" "&"

    <a>& without --streaming can't know how unfinished searches above
    could end up, so results are limited. searches below current that
    succeeded print; unfinished upper labels are shown

    <MATCH_LIST> is only generated by ~
    <BNF_RECORD> is only generated if options select mult lines

    the bnf file itself cannot contain binary 0's or "'s
    See: <BNF_ZERO> <BNF_QUOTE>

    the bnf file itself cannot have strings greater than
    #define MAXN 8096

    independents must be quoted if literal, not optional

EFFICIENCY

      <a> ::= <> == "ab" "abe" "abel" ...

    Is more suitable than | for long string lists. Binary Search into
    the list is done; note entries must be pre-sorted. Special keyword
    <repl> uses the next line for pairing.

    BNF efficiency. Common sense. If you match something many times
    it makes sense to make it a symbol which is already matched.

SCAN SPEED

    can be "slow" but progress should stay even for any size file

    TODO forgets to say something.

    there may be room for speed improvement. however, to be
    unambiguous bnf2xml must push the tag of each symbol and a
    reference of each unk input char. if it didn't then matching
    could not fail back to the prev context.

    (the nature of top down parsing means a longer loop to match
    chars, but any amount of complexity is resolved)

      ie, matching long <alph>* strings is a weakness
      ie, <> == matches long known strings quickly

    problem with unk string accelerator: should incl digits?
    too many flavors

    skip is a new feature to skip what doesn't need to be parsed.
    it's fast to skip or include long data

    another strategy is to pre-post process blocks outside bnf2xml.
    a new feature does what is needed for filter processing:

      "lbl" %` -` "n"`     ; emit unique tag, place, len

      awk tokenize blocks | bnf2xml | awk expand tokens

    with the 2 new features, small bnf2xml should be workable to do
    "spot need" complex matches in large "known" files without
    writing wares or using more complex wares (beyond that is a
    lexer or sed, which have caveats as well)

COMPARISONS

    sed(1) or grep(1) use regular expression definitions but aren't
    set up well to parse a language; they are better for small tasks.
    regex can be compiled in your C app to match and get return(s).
    sed(1) can be used to parse, ie, a very simple language which is
    used in a program's user editable startup .cfg file

    awk(1) is wonderful: its matching abilities are formidable, and
    its language plain. awk has limitations (such as memory, lag if
    resetting RS). Awk isn't well suited for all things (or is it?)

    cpp can be used to pre-process an input file, which adds macro
    ability to any input file (usu fed to a parser, see also m4).

    yacc/flex can be used to make a .c file to parse a set pattern
    which calls set functions when called to read input (may be
    ambiguous and require C coded fixups, per pattern). Complex but
    speedy once done: "compiler flexors" read source code, the funs
    output what the linker wants to know, w/fixups. flex is straight
    forward for easy languages, so many C progs use these C flexor
    functions to parse data / user input files.

    XSLT parsers are (new) and some use BNF. They tend to expect
    (a particular) XML environment and do HTML output. ie, CDuce

DIMINISHED, ADDED

    // #define CHECK_SUBST_REPL in bnf.cpp

    Experimental ability to make substitutions in input text or
    output text and to print or do a system cmd upon match.
    This may be changed or dropped in the future.

    see <sys> <repl> <subst> in bnfparse.txt

    it's a performance hit to do search and replace hints while
    searching for what may or may not be found / when the whole
    context is revealed. m4(1) and cpp are far better tools for
    pre/post search and replace. no point in hacking bnf2xml to do
    what is mastered already.

    However I was wrong a little. = ~, if seldom really required, has
    low overhead. typedef x should be typedef uniq_id and thus
    handled by cpp. But in math with no uniq_id, a bnf to match id
    and using xml to pair up contexts would be impossible to filter
    before, and depending on the xml output: tedious or impossible to
    do after.

HISTORY

    The parser itself began in a game I started but canceled decades
    ago. I wanted the parser out of it. But a parser's output is
    usually specific to the program it runs in, so it would be
    useless. But then I thought: that's what XML is useful for!

    I fixed up this old 90's bnf parser and released it because:

      no bnf parsers I know of give text output
      they are all rigged to talk to the application they run in

---------------------

(1) What is new bnf? No one agrees, is what I've seen. Mostly the
    differences are about string handling of the bnf file definition
    strings themselves. As far as numeric range tests: BNF, written
    in C, is not a good way to make a good calculator or even to
    compare numeric values, try bc(1).

    I'm almost afraid to add new "core rules" (see internet
    definition) due to varying tty types and i18n and wchar: it might
    make baking these worse not better, then change. I'd rather leave
    some "hackability" if possible, to make code small and more
    readable so any user's idea or need, ad infinitum, is not hard to
    hack in.

LICENSE

    see manpage

VEC LIBS

    The Vec libs can be easy to use arrays and are pretty well
    tested. Any failures I get are my own usage errors.

    However I myself am "canceling use" in the future because:

    * cannot make syntax / use easier as planned: C++ cancels C's
      "with" at link-time and due to that the easiest syntax I had
      originally planned is not available for (DVec)
      (so C++ could release non-copyable libs for their "security")

    * exactly what you need help with in C++, templating, syntax,
      gets FAR worse: increases syntax and incompat wished to be
      reduced

    * know what? use macros and malloc, it's more powerful than
      C++ :) and namespace? so what, that's easy.

    * today's c++ compilers differ in ways that break things

    * realloc is more efficient than recopy + new, and they don't mix
      (program must use one or other, unless very careful)

-------------------------------------------------------------

ChangeLog  last two

bnf2xml ( beta; urgency=low

  * added autoconf support, rename files, etc, tidy a little

 -- John Hendrickson <debguy@sourceforge.net>  Mon, 22 Dec 2014 12:39:14 -0500

bnf2xml (7.1b) beta; urgency=low

  * issue: was slow for huge files that have small areas to parse
    and previously (quit) was only way to skip (skip large tail)
  * added features to skip or include unparsed data or to tag
    skips for pre-post filter inclusion
  * added --quoted and forgotten --first-line
  * revised documentation and manpage
  * QUIT had checked symb_next which might not be set, fixed
  * found c++ bug in new c++, refused to compile, fixed
    declaring a simple counter var had changed
    also found new changes in C compiler (same gcc) prevented
    compiling of other software that had compiled / worked w/o error
  * impact none: noted foreign changes to new gcc compiler prevented
    compiling of code despite being a non-issue: a non-warning
    was changed to error neither of which would change type.
    suspect breakage was goal. found many similar across other
    gcc c related programs.

 -- John Hendrickson <debguy@sourceforge.net>  Fri, 24 Jan 2014 19:24:01 -0500

Source: README.bnf2xml-7.0.1, updated 2015-02-12