From: Alexey N. <sn...@gm...> - 2014-06-13 10:03:13
|
To wrap up on this old thread: The go parser I wrote is available here: https://code.google.com/p/opensmiles/ Also a couple suggestions re: BNF spec: 1) it would be helpful to have syntax clearly split into lexer and parser parts to simplify work of future implementers. To make this more obvious - in my syntax file I used lowercase names for lexer elements and uppercase - for parser products. 2) Also the following change would be welcome: to make sure that sets of symbols do not cross. For instance, all 'aliphatic_organic' symbols are also defined as 'element_symbols' and 'aromatic_organic' is a subset of aromatic_symbols. For lexer purposes it would be nice to have unique, non-crossing sets of symbols and then combine them to get final desired sets. Sorry that it took so long. Best regards, Alexey Nezhdanov. 2014-03-27 16:11 GMT+01:00 Alexey Nezhdanov <sn...@gm...>: > Hi Andrew. > > That is definitely way bigger answer that I was expecting to get :) > > 5) Alas, I never studied parsers theory - neither in university, nor on my > own, so my experience operating with terms like LR(1) counts from yesterday. > > I am glad that we are on the same page now. I will definitely contribute > my results if I come up with something that would have value outside of my > project - like adapting the RDKit's .yy file to be usable both for C++ > and Go. > > Resolving the conflict by lexing bond and ringbond into different tokens > makes sense, though I agree that it is not very elegant given that they are > largely the same concept. > > 6) well, if you want my opinion on how to name it, then it's something > like: > chain :== chain_link | chain [bond] chain_link | chain dot chain > I thought of just 'link' at first - but that would be confusable with > 'bond'. > > 1) I always thought that atom symbols belong to lexer, not parser. Same > applies to bonds (though these are easy as they are always one character). > Not sure where the opening and closing braces/brackets should be, but it > doesn't look like it makes much difference. > > The next couple of paragraphs I have to skip and not comment on, sorry. > Your Kung-fu is definitely stronger than mine. > > Side note - I also fully understand the idea of not having enough time. > That happened to me as well long ago and back then my last 'gift' to the > project was to put it onto a github in the hope that other interested > people will fork it and move it on. That worked to *some* degree. > > 2) Thanks for the pointer! Knowing that there are already some test cases > waiting to be used is definitely helpful. > BTW, "[C+1+2]" is an illegal notation according to current spec. I don't > see why would human write +3 as +2+1 and if some code out there generates > it - it looks like a bug to me. So it's fine if such string would be > expected to fail to parse. > > Thank you for all your help! Hopefully I'll have something to give back. > > Best regards, > Alexey Nezhdanov > > P.S. I found out that there is built-in Go tool 'yacc' - I'll give it a > try > > > 2014-03-26 19:43 GMT+01:00 Andrew Dalke <da...@da...>: > > Hi Alexey, >> >> I am able to reproduce what you observed, and have figured out why my >> earlier parser didn't identify it. >> >> If I implement (nearly) everything as parser then PLY reports 14 >> shift/reduce errors in the section that you point out. The resulting parser >> accepts neither "C=O" nor "C(O)" >> >> Which is odd, since I used PLY to do the original grammar tests. >> >> I was able to find some of that early test code. Turns out I implemented >> the bond and ringbond terms in the lexer, not the parser, and the lexer's >> "first definition wins" match resolved the ambiguity in my favor. >> >> My "'human' POV" also uses strong lexer. It looks like when I write the >> OpenSMILES grammar that I back-translated the lexer as if it were a >> grammar, without considering the lookahead-1 interactions. >> >> I tried evaluating the current grammar, to see how best to resolve it for >> an LALR(1) grammar without influence of the lexer. I wasn't successful. >> It's been over 6 years since I last dealt with shift/reduce errors, and >> even longer since I really studied parsers, so I'm very rusty in how to >> think about those details. >> >> Can you contribute an improved grammar which doesn't depend so much on >> the lexer? >> >> > 6) The name branched_atom is confusing - that's actually one of the >> reasons of my original mail. It sounds like "atom that has branches >> attached to it", where as in spec it stays for "atom with 0 or more >> branches". >> >> I agree. I was unable to come up with a better name. I still can't. Can >> you suggest an alternative? >> >> > Parseable by scripts means, essentially - a pointer to an example >> software that is known to correctly parse this particular definition. >> >> That would be nice. It would mean a few additions to the grammar, like >> DIGIT and NUMBER, which are trivial to add. >> >> How would you handle atom symbol tokenization? That is, if the grammar is >> completely in the parser then there's only a single character lookahead, so >> 'H' resolves as hydrogen even if it's supposed to be 'He'. Would the >> left-factoring be explicit in the grammar? If so, does it make things more >> difficult to understand? >> >> This is of course easily solved in the lexer. That solution means the >> grammar definition would have to express a preference for what's in the >> lexer and what's in the grammar. >> >> On the other hand, that's what it is now, since it uses my assumption of >> what will be in the lexer. I just used a different boundary than you did. >> (And your boundary is more reasonable.) This assumption would just need to >> be made more explicit. >> >> ANTLR, by the way, combines lexing and parsing into a single set of >> rules, and uses LL(k) instead of DFA for the lexer, so this choice of >> grammar representation is biased by the choice of implementation technology. >> >> > As for proposals - I'm not ready to propose anything but counting in >> the additional knowledge I gained over these 22 hours a bison/flex solution >> seems to be very appealing to me. >> >> I can see the appeal to bison/flex, because it would test that the result >> is LALR(1), and provide an unambiguous way to convert from the grammar to >> LALR(1). >> >> I don't like how the choice of the lexer/parser boundary affects the >> ambiguity. That is, my converter from the current grammar to an LALR(1) >> parser would give a different result than yours, since we each resolved the >> shift/reduce ambiguity differently. >> >> There are alternatives to LALR(1). The most common ones I've seen are >> PEGs/packrat parsers and ANTLR's LL(*). And my own code uses state machines >> built on top of Ragel. >> >> I can also see the appeal of ANTLR, because v3 supports many language >> back-ends. I like the idea that the same grammar can be used to provide >> reference implementations for a lot of different language. >> >> I wonder if focusing on a specific LALR(1) decomposition might make it >> more difficult to use those alternative forms. I believe that a fully >> left-factored grammar should be fine for all of them. But I think a fully >> left-factored grammar is harder to understand. >> >> I wonder if a less left-factored grammar might still be easily portable >> across the range of likely back-ends. >> >> Hopefully you can provide a tweak to the grammar to prevent this >> ambiguity? >> >> Otherwise, making a large set of changes, like for a fully left-factored >> form, requires more work than I have time for right now. >> >> >> > 2) Yes, test cases of course, sorry for confusion. Ideally, every token >> should be tested in a few different ways, so the isotope number should be >> tested as well. >> >> I once started coming up with an extensive test suite, and found the >> combinatorial complexity made the problem rather tedious. Just checking >> that the bracket atoms were correct, with the various combinations of >> isotope, charge, etc., along with failure modes, was more than I wanted to >> do. >> >> You can see some of those tests in the *.txt files of my OpenSMILES-ragel >> project at https://bitbucket.org/dalke/opensmiles-ragel . >> >> In essence, I was coming up with tests which ensure that the parser >> generator was working correctly. In the general case, the tests can't >> assume that an implementor used a parser generator. For example, in some >> toolkits the illegal "[C+1+2]" assigns a charge of +3 to the atom. >> >> The idea tests were more tedious than I wanted to do, so I didn't. >> >> Feel free to work on it if it's something which interests you. It's >> something which will likely be of use to other OpenSMILES developers. >> >> Cheers, >> >> Andrew >> da...@da... >> >> >> >> >> ------------------------------------------------------------------------------ >> Learn Graph Databases - Download FREE O'Reilly Book >> "Graph Databases" is the definitive new guide to graph databases and their >> applications. Written by three acclaimed leaders in the field, >> this first edition is now available. Download your free book today! >> http://p.sf.net/sfu/13534_NeoTech >> _______________________________________________ >> Blueobelisk-SMILES mailing list >> Blu...@li... >> https://lists.sourceforge.net/lists/listinfo/blueobelisk-smiles >> > > |