#21 Support to configure how re2c code interfaced with the symbol buffer?

closed
None
5
2015-10-13
2014-12-05
No

It would be great if re2c could be configured to interface with the parser buffer through interfaces other than dereferencing pointers to an array.

Currently, re2c assumes that the buffer is a C array that is accessed directly through pointers into it. Consequently, YYCURSOR is assumed to be an l-value of type YYCTYPE *. Therefore, re2c generates code to access the current symbol and the next symbol, respectively, as:

yych = *YYCURSOR;
yych = *++YYCURSOR;

If re2c is used to help write a lexer in C++, the symbol buffer can be managed directly by an std::ifstream object, which also provides the relevant interfaces. The same operations mentioned above could be performed through the following calls:

yych = ifstream_object.peek();
yych = ifstream_object.get();

Would it be possible to offer a way to define how re2c generates the code that handles access to the current and next symbol in the buffer? Something similar to the following code:

re2c:define:define:ACCESS_CURRENT_SYMBOL = yych = *YYCURSOR;
re2c:define:define:ACCESS_NEXT_SYMBOL = yych = *++YYCURSOR;
re2c:define:define:MOVE_TO_NEXT_SYMBOL = ++YYCURSOR;

Discussion

  • Ulya Trofimovich

    It seems to be possible. There's also YYMARKER, which will need
    'tellg/seekg', and maybe something else is missing.
    Anyway, the generalized API won't break the current model.

    I'll experiment with this idea and try to get it done.

    Thanks for an interesting suggestion. :)

    Ulya

     
  • Ulya Trofimovich

    • assigned_to: Ulya Trofimovich
     
  • Dan Nuffer

    Dan Nuffer - 2014-12-05

    c++ iterators (anything besides an input iterator) provide the abstraction you need, as they behave like pointers, and could be used directly with re2c. You could use something like http://www.boost.org/doc/libs/1_56_0/libs/spirit/doc/html/spirit/support/multi_pass.html#spirit.support.multi_pass.reading_from_standard_input_streams
    or write your own iterator without much trouble.

     
    • Ulya Trofimovich

      I think the iterator class needs the following operations defined
      to be used with re2c:

      *a (dereference, e.g. "*YYCURSOR")
      ++a (increment, e.g. "++YYCURSOR")
      a = b (assignment, e.g. "YYMARKER = YYCURSOR")
      a <= b (less or equal, e.g. "YYLIMIT <= YYCURSOR")
      a - b (subtraction, e.g. "(YYLIMIT - YYCURSOR) < n")

      Multiple iterators exist at the same time, and they all must remain valid
      (hence not 'input iterator', if I guess correctly).
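
A minimal sketch of what that list means in practice. The helper name count_leading_spaces is hypothetical and this is not re2c output; it only uses the five operations above on an arbitrary iterator type, in the way generated code would use YYCURSOR, YYMARKER and YYLIMIT. The last two operations ("<=" and "-") are the ones that random-access iterators provide.

```cpp
#include <cassert>
#include <iterator>
#include <string>

// Hypothetical helper: touches only the five operations from the list.
template <typename Iter>
typename std::iterator_traits<Iter>::difference_type
count_leading_spaces(Iter cursor, Iter limit) {
    Iter marker = cursor;                          // copy, as in YYMARKER = YYCURSOR
    while (!(limit <= cursor) && *cursor == ' ') { // a <= b, then *a
        ++cursor;                                  // ++a
    }
    return cursor - marker;                        // a - b
}

// Small usage check: a raw pointer and a std::string iterator both qualify.
inline void check_iterators() {
    const char *p = "  y";
    assert(count_leading_spaces(p, p + 3) == 2);

    std::string s = "   x";
    assert(count_leading_spaces(s.begin(), s.end()) == 3);
}
```

A type such as std::istreambuf_iterator<char> would not compile here, since it has neither "<=" nor "-".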

      So should I close the ticket?

       
    • Anonymous

      Anonymous - 2014-12-05

      Given the choice, it's undoubtedly better that a code generator generates code in a convenient manner. I don't believe that adding dependencies to circumvent this issue, and battling this limitation with complexity, is acceptable.

      Please bear in mind that generating a lexer is pretty much a "run once" kind of thing. Between your suggestion and re-editing the output of re2c by hand, editing the source file is a better way to circumvent the issue.

       
      • Ulya Trofimovich

        My analysis of the re2c code generator revealed the following
        cases in which re2c deals with the input stream:

        1. ++YYCURSOR;
        2. yych = *YYCURSOR;
        3. yych = *++YYCURSOR;
        4. yych = *(YYMARKER = YYCURSOR);
        5. yych = *(YYMARKER = ++YYCURSOR);
        6. YYCURSOR = YYMARKER;
        7. YYCURSOR = YYCTXMARKER;
        8. YYMARKER = YYCURSOR;
        9. YYMARKER = ++YYCURSOR;
        10. YYCTXMARKER = YYCURSOR + 1;
        11. if (YYLIMIT <= YYCURSOR) ...
        12. if ((YYLIMIT - YYCURSOR) < n) ...
        13. YYDEBUG (label, *YYCURSOR);

        All these cases can be generalized using four primitives:

        YYGETC () ---- resolves to "*YYCURSOR" or "is.peek ()"
        YYMOVE () ---- resolves to "++YYCURSOR" or "is.get ()"
        YYTELL () ---- resolves to "YYCURSOR" or "is.tellg ()"
        YYSEEK (p) ---- resolves to "YYCURSOR = p" or "is.seekg (p, is.beg)"

        However, we'll have to decompose cases 3, 4 and 5 above,
        which will affect the generated C/C++ code. It shouldn't
        break the code, but all the tests will have to be updated.
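
To make the four primitives concrete, here is a rough sketch, assuming they are modelled as member functions of two small backend structs. PointerInput, StreamInput and scan_number are hypothetical names, and the scanner is written by hand rather than generated by re2c. It matches digits with an optional fraction and uses tell/seek the way YYMARKER is used for backtracking.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>

// Backend 1 (hypothetical): NUL-terminated in-memory buffer.
struct PointerInput {
    const char *cursor;
    int getc() const { return static_cast<unsigned char>(*cursor); } // *YYCURSOR
    void move() { ++cursor; }                                        // ++YYCURSOR
    const char *tell() const { return cursor; }                      // YYCURSOR
    void seek(const char *p) { cursor = p; }                         // YYCURSOR = p
};

// Backend 2 (hypothetical): any seekable std::istream.
struct StreamInput {
    std::istream &is;
    int getc() { return is.peek(); }
    void move() { is.get(); }
    std::istream::pos_type tell() { return is.tellg(); }
    // clear() resets eofbit/failbit that a peek at end of input may have set.
    void seek(std::istream::pos_type p) { is.clear(); is.seekg(p); }
};

inline bool is_digit(int c) { return c >= '0' && c <= '9'; }

// Matches [0-9]+ ("." [0-9]+)? and returns the number of symbols consumed.
template <typename Input>
std::size_t scan_number(Input &in) {
    std::size_t n = 0;
    while (is_digit(in.getc())) { in.move(); ++n; }
    if (n == 0 || in.getc() != '.') return n;
    auto marker = in.tell();          // remember the position of '.'
    in.move();
    if (!is_digit(in.getc())) {       // no fraction: backtrack to the marker
        in.seek(marker);
        return n;
    }
    std::size_t frac = 0;
    while (is_digit(in.getc())) { in.move(); ++frac; }
    return n + 1 + frac;
}

// Small usage check: both backends give the same answers.
inline void check_backends() {
    PointerInput p{"12.5x"};
    assert(scan_number(p) == 4u);

    std::istringstream ss("12.x");
    StreamInput s{ss};
    assert(scan_number(s) == 2u);
    assert(s.getc() == '.');          // the cursor was restored to the '.'
}
```

Note that every step here is a separate statement, which is the decomposition of cases 3, 4 and 5 mentioned above.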

        So it's not that hard to generalize the API for std::ifstream.
        The question is, whether the new API is good enough.

        I can implement it (in fact by the time Dan replied I've gone
        half way through it), but I think Dan has much more experience
        in C++ and re2c in particular, so I shouldn't add new API without
        his approval.

        Ulya

         
  • Dan Nuffer

    Dan Nuffer - 2014-12-05

    Here's my 2 cents:

    To me the main priorities of re2c are:
    1. Speed
    2. Backward compatibility
    3. Programmer friendliness
    4. Flexibility
    5. Maintainability

    Sometimes it is hard to strike a balance between them :-)
    The main use case for re2c is fast parsing of in-memory buffers, not ifstreams, so any change which compromises the majority of users for an edge case isn't something I'd like.

    It is hard to predict how the generated code will end up being optimized by different compilers, so it is necessary to test it: creating a simpler "api" that requires more statements might speed things up, slow them down, or make no difference.

    The other thing to keep in mind is that scanning from an fstream will be much slower than reading the data to a buffer and then scanning it.

    Ulya, as you pointed out, re2c basically has an API now which consists of four operations. Those all work out great for in-memory scanning. Changing them to the four operations that are applicable to streams isn't clearly a benefit to me. Currently, to build a basic scanner, the developer just has to worry about a few concepts based on a buffering model. It is fairly easy to build an adapter around any sort of other input model in order to use it (e.g. FILE*, iostreams, iterators, sockets, etc. see https://sourceforge.net/p/re2c/code-git/ci/master/tree/libre2c/libre2c/)

    Adding a new option that could create code that uses a different set of macros has its own costs: re2c would be more difficult to learn (it's bad enough already), and the code gets more complex and harder to maintain.

    It might be worth experimenting with to see if it improves things overall, but I'm a bit skeptical...

     
    • Ulya Trofimovich

      Dan, as you pointed out, we'll need to check if complex expressions like
      "yych = *++YYCURSOR"
      compile to faster or equal assembly code compared to
      "++YYCURSOR; yych = *YYCURSOR"
      I took it for granted, but maybe I'm mistaken. I can experiment with GCC
      and Clang.

      I don't see how the new API can 'compromise the majority of users for an
      edge case' though.
      Users won't see any difference by default, they won't need to write any
      new interface code.
      Even for those who want to use 'std::ifstream' we can have a convenience
      switch '--ifstream'
      that will override the default. Only some extreme users will use the new
      API directly.

      Surely, ifstreams will be slower than a C buffer, but it's up to users to
      choose the interface.

      As for 're2c will be more difficult to learn and that's bad already'
      ---- true.

      Finally, as for maintainability, I don't think re2c's code will get worse.
      For now, all those 13 cases I mentioned are scattered over code.cc in
      various combinations,
      duplicated many times. Unifying them can actually clean up the code.

      So, I mostly see one possible danger: slower code.
      What do you think?

      Ulya

      P.S.
      By the way, for me priorities look like:
      1. Speed
      2. Speed
      3. Speed
      4. Flexibility
      5. Backward compatibility
      6. Programmer friendliness
      7. Maintainability
      :)

       
  • Anonymous

    Anonymous - 2014-12-06

    Ulya,

    As backward compatibility is always important, adding a way to configure re2c's buffer operations would already do the trick. As simplicity is also important, maybe adding a '--ifstream' switch would be counterproductive in this regard. Quite possibly, the same effect could be had with an example of how to configure re2c to output calls to std::ifstream.

    Regarding ifstream being slower than a plain old C buffer, I don't doubt it might be less efficient in an apples-to-apples comparison. Nevertheless, due to the way re2c generates its code, if a file is being accessed through an ifstream object then ifstream will end up being used to fill up the C buffer. Therefore, as re2c currently stands, re2c's code requires two separate buffers to be operated in series: the C buffer being supplied with calls to ifstream, and ifstream's buffer being filled from the input stream. This issue can only be avoided if it becomes possible to configure re2c's output or if the output of re2c is edited to replace calls to the C buffer with calls to ifstream's interface.
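
For illustration, a minimal sketch of the two-buffer arrangement described above, assuming a small fixed C buffer that is refilled from a std::istream whenever the cursor reaches the limit. BufferedInput, fill and count_char are hypothetical names; this is neither re2c output nor the libre2c adapter, only the same "if (YYLIMIT <= YYCURSOR) refill" pattern written by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>

// Hypothetical adapter: the scanner reads from buf, which is fed by the stream.
struct BufferedInput {
    static constexpr std::size_t SIZE = 8;  // tiny on purpose, to force refills
    std::istream &is;
    char buf[SIZE];
    char *cursor;  // plays the role of YYCURSOR
    char *limit;   // plays the role of YYLIMIT

    explicit BufferedInput(std::istream &s) : is(s), cursor(buf), limit(buf) {}

    // Shifts any unread bytes to the front, then reads more from the stream.
    // Returns false once no input is left at all.
    bool fill() {
        std::size_t unread = static_cast<std::size_t>(limit - cursor);
        std::memmove(buf, cursor, unread);
        cursor = buf;
        limit = buf + unread;
        is.read(limit, static_cast<std::streamsize>(SIZE - unread));
        limit += is.gcount();
        return cursor < limit;
    }
};

// Hand-written stand-in for a generated scanner: counts occurrences of wanted.
inline std::size_t count_char(BufferedInput &in, char wanted) {
    std::size_t n = 0;
    for (;;) {
        if (in.limit <= in.cursor && !in.fill()) return n;  // refill or stop
        if (*in.cursor == wanted) ++n;
        ++in.cursor;
    }
}

// Small usage check: 14 input bytes, so the 8-byte buffer is refilled once.
inline void check_buffered() {
    std::istringstream ss("banana bandana");
    BufferedInput in(ss);
    assert(count_char(in, 'a') == 6u);
}
```

Every byte passes through the stream's own buffer and then through buf, which is the double buffering in question.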

     
    • Ulya Trofimovich

      Hi Rui!
      I didn't quite understand your letter.

      > As backward compatibility is always important, adding a way to
      > configure re2c's buffer operations would already do the trick. As
      > simplicity is also important, maybe adding a '--ifstream' switch would
      > be counterproductive in this regard. Quite possibly, the same effect
      > could be had with an example of how to configure re2c to output calls
      > to std::ifstream.

      What can be simpler than a command-line switch? :)

      > Regarding ifstream being slower than a plain old C buffer, I don't
      > doubt it might be less efficient in an apples-to-apples comparison.
      > Nevertheless, due to the way re2c generates its code, if a file is
      > being accessed through an ifstream object then ifstream will end up
      > being used to fill up the C buffer. Therefore, as re2c currently
      > stands, re2c's code requires two separate buffers to be operated in
      > series: the C buffer being supplied with calls to ifstream, and
      > ifstream's buffer being filled from the input stream. This issue can
      > only be avoided if it becomes possible to configure re2c's output or
      > if the output of re2c is edited to replace calls to the C buffer with
      > calls to ifstream's interface.

      It is possible right now to use ifstream directly (without any copying
      to a C buffer), if you use an iterator (as Dan suggested).
      It will also be possible if we add the new API as I suggested (the only
      difference will be that you will use a switch instead of writing an
      iterator).

      Ulya

       
  • Ulya Trofimovich

    • status: open --> closed
     
