Re: [BlueObelisk-SMILES] Possible error in BNF spec

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

To wrap up on this old thread:

The go parser I wrote is available here:
https://code.google.com/p/opensmiles/

Also a couple suggestions re: BNF spec:

1) it would be helpful to have syntax clearly split into lexer and parser
parts to simplify work of future implementers. To make this more obvious -
in my syntax file I used lowercase names for lexer elements and uppercase -
for parser products.

2) Also the following change would be welcome: to make sure that sets of
symbols do not cross. For instance, all 'aliphatic_organic' symbols are
also defined as 'element_symbols' and 'aromatic_organic' is a subset of
aromatic_symbols. For lexer purposes it would be nice to have unique,
non-crossing sets of symbols and then combine them to get final desired
sets.

Sorry that it took so long.

Best regards,
Alexey Nezhdanov.

2014-03-27 16:11 GMT+01:00 Alexey Nezhdanov <sn...@gm...>:

> Hi Andrew.
>
> That is definitely way bigger answer that I was expecting to get :)
>
> 5) Alas, I never studied parsers theory - neither in university, nor on my
> own, so my experience operating with terms like LR(1) counts from yesterday.
>
> I am glad that we are on the same page now. I will definitely contribute
> my results if I come up with something that would have value outside of my
> project - like adapting the RDKit's .yy file to be usable both for C++
> and Go.
>
> Resolving the conflict by lexing bond and ringbond into different tokens
> makes sense, though I agree that it is not very elegant given that they are
> largely the same concept.
>
> 6) well, if you want my opinion on how to name it, then it's something
> like:
> chain :== chain_link | chain [bond] chain_link | chain dot chain
> I thought of just 'link' at first - but that would be confusable with
> 'bond'.
>
> 1) I always thought that atom symbols belong to lexer, not parser. Same
> applies to bonds (though these are easy as they are always one character).
> Not sure where the opening and closing braces/brackets should be, but it
> doesn't look like it makes much difference.
>
> The next couple of paragraphs I have to skip and not comment on, sorry.
> Your Kung-fu is definitely stronger than mine.
>
> Side note - I also fully understand the idea of not having enough time.
> That happened to me as well long ago and back then my last 'gift' to the
> project was to put it onto a github in the hope that other interested
> people will fork it and move it on. That worked to *some* degree.
>
> 2) Thanks for the pointer! Knowing that there are already some test cases
> waiting to be used is definitely helpful.
> BTW, "[C+1+2]" is an illegal notation according to current spec. I don't
> see why would human write +3 as +2+1 and if some code out there generates
> it - it looks like a bug to me. So it's fine if such string would be
> expected to fail to parse.
>
> Thank you for all your help! Hopefully I'll have something to give back.
>
> Best regards,
> Alexey Nezhdanov
>
> P.S. I found out that there is built-in Go tool 'yacc'  - I'll give it a
> try
>
>
> 2014-03-26 19:43 GMT+01:00 Andrew Dalke <da...@da...>:
>
> Hi Alexey,
>>
>>   I am able to reproduce what you observed, and have figured out why my
>> earlier parser didn't identify it.
>>
>> If I implement (nearly) everything as parser then PLY reports 14
>> shift/reduce errors in the section that you point out. The resulting parser
>> accepts neither "C=O" nor "C(O)"
>>
>> Which is odd, since I used PLY to do the original grammar tests.
>>
>> I was able to find some of that early test code. Turns out I implemented
>> the bond and ringbond terms in the lexer, not the parser, and the lexer's
>> "first definition wins" match resolved the ambiguity in my favor.
>>
>> My "'human' POV" also uses strong lexer. It looks like when I write the
>> OpenSMILES grammar that I back-translated the lexer as if it were a
>> grammar, without considering the lookahead-1 interactions.
>>
>> I tried evaluating the current grammar, to see how best to resolve it for
>> an LALR(1) grammar without influence of the lexer. I wasn't successful.
>> It's been over 6 years since I last dealt with shift/reduce errors, and
>> even longer since I really studied parsers, so I'm very rusty in how to
>> think about those details.
>>
>> Can you contribute an improved grammar which doesn't depend so much on
>> the lexer?
>>
>> > 6) The name branched_atom is confusing - that's actually one of the
>> reasons of my original mail. It sounds like "atom that has branches
>> attached to it", where as in spec it stays for "atom with 0 or more
>> branches".
>>
>> I agree. I was unable to come up with a better name. I still can't. Can
>> you suggest an alternative?
>>
>> > Parseable by scripts means, essentially - a pointer to an example
>> software that is known to correctly parse this particular definition.
>>
>> That would be nice. It would mean a few additions to the grammar, like
>> DIGIT and NUMBER, which are trivial to add.
>>
>> How would you handle atom symbol tokenization? That is, if the grammar is
>> completely in the parser then there's only a single character lookahead, so
>> 'H' resolves as hydrogen even if it's supposed to be 'He'. Would the
>> left-factoring be explicit in the grammar? If so, does it make things more
>> difficult to understand?
>>
>> This is of course easily solved in the lexer. That solution means the
>> grammar definition would have to express a preference for what's in the
>> lexer and what's in the grammar.
>>
>> On the other hand, that's what it is now, since it uses my assumption of
>> what will be in the lexer. I just used a different boundary than you did.
>> (And your boundary is more reasonable.) This assumption would just need to
>> be made more explicit.
>>
>> ANTLR, by the way, combines lexing and parsing into a single set of
>> rules, and uses LL(k) instead of DFA for the lexer, so this choice of
>> grammar representation is biased by the choice of implementation technology.
>>
>> > As for proposals - I'm not ready to propose anything but counting in
>> the additional knowledge I gained over these 22 hours a bison/flex solution
>> seems to be very appealing to me.
>>
>> I can see the appeal to bison/flex, because it would test that the result
>> is LALR(1), and provide an unambiguous way to convert from the grammar to
>> LALR(1).
>>
>> I don't like how the choice of the lexer/parser boundary affects the
>> ambiguity. That is, my converter from the current grammar to an LALR(1)
>> parser would give a different result than yours, since we each resolved the
>> shift/reduce ambiguity differently.
>>
>> There are alternatives to LALR(1). The most common ones I've seen are
>> PEGs/packrat parsers and ANTLR's LL(*). And my own code uses state machines
>> built on top of Ragel.
>>
>> I can also see the appeal of ANTLR, because v3 supports many language
>> back-ends. I like the idea that the same grammar can be used to provide
>> reference implementations for a lot of different language.
>>
>> I wonder if focusing on a specific LALR(1) decomposition might make it
>> more difficult to use those alternative forms. I believe that a fully
>> left-factored grammar should be fine for all of them. But I think a fully
>> left-factored grammar is harder to understand.
>>
>> I wonder if a less left-factored grammar might still be easily portable
>> across the range of likely back-ends.
>>
>> Hopefully you can provide a tweak to the grammar to prevent this
>> ambiguity?
>>
>> Otherwise, making a large set of changes, like for a fully left-factored
>> form, requires more work than I have time for right now.
>>
>>
>> > 2) Yes, test cases of course, sorry for confusion. Ideally, every token
>> should be tested in a few different ways, so the isotope number should be
>> tested as well.
>>
>> I once started coming up with an extensive test suite, and found the
>> combinatorial complexity made the problem rather tedious. Just checking
>> that the bracket atoms were correct, with the various combinations of
>> isotope, charge, etc., along with failure modes, was more than I wanted to
>> do.
>>
>> You can see some of those tests in the *.txt files of my OpenSMILES-ragel
>> project at https://bitbucket.org/dalke/opensmiles-ragel .
>>
>> In essence, I was coming up with tests which ensure that the parser
>> generator was working correctly. In the general case, the tests can't
>> assume that an implementor used a parser generator. For example, in some
>> toolkits the illegal "[C+1+2]" assigns a charge of +3 to the atom.
>>
>> The idea tests were more tedious than I wanted to do, so I didn't.
>>
>> Feel free to work on it if it's something which interests you. It's
>> something which will likely be of use to other OpenSMILES developers.
>>
>> Cheers,
>>
>>                                 Andrew
>>                                 da...@da...
>>
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Learn Graph Databases - Download FREE O'Reilly Book
>> "Graph Databases" is the definitive new guide to graph databases and their
>> applications. Written by three acclaimed leaders in the field,
>> this first edition is now available. Download your free book today!
>> http://p.sf.net/sfu/13534_NeoTech
>> _______________________________________________
>> Blueobelisk-SMILES mailing list
>> Blu...@li...
>> https://lists.sourceforge.net/lists/listinfo/blueobelisk-smiles
>>
>
>