re2c scanner generator / Feature Requests / #21 Support to configure how re2c code interfaced with the symbol buffer?

Ulya Trofimovich - 2014-12-05

It seems to be possible. YYMARKER will also need
'tellg/seekg', and something else may be missing.
Either way, the generalized API won't break the current model.
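
To illustrate why YYMARKER needs 'tellg/seekg', here is a minimal hand-written sketch (not re2c output) that saves and restores a stream position the way generated code saves and restores YYMARKER. The function name `scan_number` and the toy grammar are assumptions made for this example only.

```cpp
#include <istream>
#include <sstream>
#include <string>

static bool is_digit(int c) { return c >= '0' && c <= '9'; }

// Matches [0-9]+ ("." [0-9]+)? at the current stream position and returns the
// lexeme. Returns an empty string, consuming nothing, if no digit is found.
std::string scan_number(std::istream &is)
{
    std::string lexeme;
    while (is_digit(is.peek())) lexeme += static_cast<char>(is.get());
    if (lexeme.empty() || is.peek() != '.') return lexeme;

    // Equivalent of "YYMARKER = YYCURSOR": remember where the integer ended.
    const std::istream::pos_type marker = is.tellg();
    is.get();  // tentatively consume '.'
    if (!is_digit(is.peek())) {
        is.clear();        // peek() may have hit end of input and set eofbit
        is.seekg(marker);  // equivalent of "YYCURSOR = YYMARKER"
        return lexeme;
    }
    lexeme += '.';
    while (is_digit(is.peek())) lexeme += static_cast<char>(is.get());
    return lexeme;
}
```

For the input "12.x" the sketch returns "12" and leaves the stream positioned at the '.', which is the backtracking behaviour YYMARKER provides in the buffer model.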

I'll experiment with this idea and try to get it done.

Thanks for an interesting suggestion. :)

Ulya


Ulya Trofimovich - 2014-12-05

assigned_to: Ulya Trofimovich

Dan Nuffer - 2014-12-05

C++ iterators (anything beyond an input iterator) provide the abstraction you need: they behave like pointers and can be used directly with re2c. You could use something like http://www.boost.org/doc/libs/1_56_0/libs/spirit/doc/html/spirit/support/multi_pass.html#spirit.support.multi_pass.reading_from_standard_input_streams
or write your own iterator without much trouble.
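
As a rough illustration of iterators standing in for pointers, the sketch below is a hand-written loop in the style of generated code (it is not actual re2c output). The function `count_numbers` and its template parameter are hypothetical names chosen for this example.

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Counts runs of decimal digits in [YYCURSOR, YYLIMIT). Only pointer-like
// operations are used, so the same code accepts a raw pointer or any
// random-access iterator.
template <typename Iter>
std::size_t count_numbers(Iter YYCURSOR, Iter YYLIMIT)
{
    std::size_t count = 0;
    while (!(YYLIMIT <= YYCURSOR)) {
        const char yych = *YYCURSOR;
        ++YYCURSOR;
        if (yych < '0' || yych > '9') continue;
        ++count;
        // Consume the rest of the digit run.
        while (!(YYLIMIT <= YYCURSOR) && *YYCURSOR >= '0' && *YYCURSOR <= '9') {
            ++YYCURSOR;
        }
    }
    return count;
}
```

The same template instantiates for `const char *` and for `std::deque<char>::const_iterator`, so the input need not live in one contiguous C buffer.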

- Ulya Trofimovich - 2014-12-05
  
  I think the iterator class needs the following operations defined
  to be used with re2c:
  
  *a (dereference, e.g. "*YYCURSOR")
  ++a (increment, e.g. "++YYCURSOR")
  a = b (assignment, e.g. "YYMARKER = YYCURSOR")
  a <= b (less or equal, e.g. "YYLIMIT <= YYCURSOR")
  a - b (subtraction, e.g. "(YYLIMIT - YYCURSOR) < n")
  
  Multiple iterators exist at the same time, and they all must remain valid
  (hence not 'input iterator', if I guess correctly).
  
  So should I close the ticket?
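
The operations listed in this comment can be checked at compile time. In standard C++, `a <= b` and `a - b` are only guaranteed for random-access iterators, so a conservative test is to inspect the iterator category. The trait name `supports_scanner_ops` is hypothetical, and the check is a sketch: it is stricter than needed for a scanner that never compares or subtracts cursors.

```cpp
#include <iterator>
#include <list>
#include <type_traits>
#include <vector>

// True when Iter is at least a random-access iterator, the standard category
// that guarantees *a, ++a, a = b, a <= b and a - b together.
template <typename Iter>
struct supports_scanner_ops
    : std::is_base_of<std::random_access_iterator_tag,
                      typename std::iterator_traits<Iter>::iterator_category> {};

static_assert(supports_scanner_ops<const char *>::value,
              "raw pointers qualify");
static_assert(supports_scanner_ops<std::vector<char>::const_iterator>::value,
              "vector iterators qualify");
static_assert(!supports_scanner_ops<std::list<char>::const_iterator>::value,
              "bidirectional iterators lack <= and -");
static_assert(!supports_scanner_ops<std::istreambuf_iterator<char>>::value,
              "input iterators are single-pass");
```

The last assertion matches the guess above: an input iterator cannot keep several positions valid at once, so it fails the check.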
  
- Anonymous - 2014-12-05
  
  Given the choice, it's undoubtedly better that a code generator generates code in a convenient manner. I don't believe that adding dependencies to circumvent this issue, and battling this limitation with complexity, is acceptable.
  
  Please bear in mind that generating a lexer is pretty much a "run once" kind of thing. Between your suggestion and editing the output of re2c by hand, editing the generated file is the better way to work around the issue.
  
  - Ulya Trofimovich - 2014-12-05
    
    My analysis of the re2c code generator revealed the following
    cases in which re2c touches the input stream:
    
    ++YYCURSOR;
    
    yych = *YYCURSOR;
    
    yych = *++YYCURSOR;
    
    yych = *(YYMARKER = YYCURSOR);
    
    yych = *(YYMARKER = ++YYCURSOR);
    
    YYCURSOR = YYMARKER;
    
    YYCURSOR = YYCTXMARKER;
    
    YYMARKER = YYCURSOR;
    
    YYMARKER = ++YYCURSOR;
    
    YYCTXMARKER = YYCURSOR + 1;
    
    if (YYLIMIT <= YYCURSOR) ...
    
    if ((YYLIMIT - YYCURSOR) < n) ...
    
    YYDEBUG (label, *YYCURSOR);
    
    All these cases can be generalized using four primitives:
    
    YYGETC () ---- resolves to "*YYCURSOR" or "is.peek ()"
    YYMOVE () ---- resolves to "++YYCURSOR" or "is.get ()"
    YYTELL () ---- resolves to "YYCURSOR" or "is.tellg ()"
    YYSEEK (p) ---- resolves to "YYCURSOR = p" or "is.seekg (p, is.beg)"
    
    However, we'll have to decompose cases 3, 4 and 5 above,
    which will affect the generated C/C++ code. It shouldn't
    break the code, but all the tests will have to be updated.
    
    So it's not that hard to generalize the API for std::ifstream.
    The question is whether the new API is good enough.
    
    I can implement it (in fact, by the time Dan replied I had gone
    halfway through it), but I think Dan has much more experience
    in C++, and in re2c in particular, so I shouldn't add a new API without
    his approval.
    
    Ulya
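
The decomposition described in this comment can be sketched with the four proposed primitives bound to the pointer model. The macro bodies, the `State` struct and the function names are assumptions made for illustration; nothing here is emitted by re2c.

```cpp
// Pointer-model bindings for the four proposed primitives (hypothetical).
#define YYGETC()  (*YYCURSOR)
#define YYMOVE()  (++YYCURSOR)
#define YYTELL()  (YYCURSOR)
#define YYSEEK(p) (YYCURSOR = (p))

struct State {
    const char *cursor;
    const char *marker;
    char yych;
};

// Case 5 from the list above, as a single compound expression.
State compound(const char *YYCURSOR)
{
    const char *YYMARKER;
    const char yych = *(YYMARKER = ++YYCURSOR);
    State s = { YYCURSOR, YYMARKER, yych };
    return s;
}

// The same case decomposed into three primitive steps.
State decomposed(const char *YYCURSOR)
{
    YYMOVE();
    const char *YYMARKER = YYTELL();
    const char yych = YYGETC();
    State s = { YYCURSOR, YYMARKER, yych };
    return s;
}

// "YYCURSOR = YYMARKER" expressed through YYSEEK.
const char *restore(const char *YYCURSOR, const char *YYMARKER)
{
    YYSEEK(YYMARKER);
    return YYCURSOR;
}
```

Both forms leave the cursor, the marker and `yych` in the same state; whether they compile to equally fast code is a separate question that only measurement can answer.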
    

Dan Nuffer - 2014-12-05

Here's my 2 cents:

To me the main priorities of re2c are:
1. Speed
2. Backward compatibility
3. Programmer friendliness
4. Flexibility
5. Maintainability

Sometimes it is hard to strike a balance between them :-)
The main use case for re2c is fast parsing of in-memory buffers, not ifstreams, so any change which compromises the majority of users for an edge case isn't something I'd like.

It is hard to predict how different compilers will optimize the generated code, so it has to be tested: a simpler "api" that requires more statements might speed things up, slow them down, or make no difference.
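
One way to run such a test is to write the same loop in both styles and compare them, either by timing or by inspecting the assembly (for example with `g++ -O2 -S` or `clang++ -O2 -S`). The sketch below is hand-written, not re2c output, and the names `spaces_compound`, `spaces_decomposed` and `time_ns` are hypothetical. A single timed call is only a starting point for a real benchmark.

```cpp
#include <chrono>
#include <cstddef>
#include <string>

// Compound style: advance and load in one expression.
std::size_t spaces_compound(const char *YYCURSOR)
{
    std::size_t n = 0;
    char yych = *YYCURSOR;
    while (yych != '\0') {
        if (yych == ' ') ++n;
        yych = *++YYCURSOR;
    }
    return n;
}

// Decomposed style: advance and load as two separate statements.
std::size_t spaces_decomposed(const char *YYCURSOR)
{
    std::size_t n = 0;
    char yych = *YYCURSOR;
    while (yych != '\0') {
        if (yych == ' ') ++n;
        ++YYCURSOR;
        yych = *YYCURSOR;
    }
    return n;
}

// Runs one scan over `input`, stores its result, and returns the elapsed
// wall-clock time in nanoseconds.
template <typename Scan>
long long time_ns(Scan scan, const std::string &input, std::size_t &result)
{
    const auto t0 = std::chrono::steady_clock::now();
    result = scan(input.c_str());
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```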

The other thing to keep in mind is that scanning from an fstream will be much slower than reading the data to a buffer and then scanning it.

Ulya, as you pointed out, re2c basically has an API now which consists of four operations. Those all work out great for in-memory scanning. Changing them to the four operations that are applicable to streams isn't clearly a benefit to me. Currently, to build a basic scanner, the developer just has to worry about a few concepts based on a buffering model. It is fairly easy to build an adapter around any other sort of input model in order to use it (e.g. FILE*, iostreams, iterators, sockets, etc.; see https://sourceforge.net/p/re2c/code-git/ci/master/tree/libre2c/libre2c/)
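
A minimal sketch of such an adapter, assuming a `std::istream` source: the scanner only ever sees a cursor and a limit, and asks the adapter for more data when it runs out. The names `BufferedInput`, `fill` and `count_lines` are hypothetical and are not taken from libre2c.

```cpp
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>

const std::size_t BUFFER_SIZE = 16;  // tiny on purpose, so refills happen often

// Exposes a std::istream through the buffer model.
struct BufferedInput {
    std::istream &is;
    char buf[BUFFER_SIZE];
    const char *cursor;  // plays the role of YYCURSOR
    const char *limit;   // plays the role of YYLIMIT

    explicit BufferedInput(std::istream &s) : is(s), cursor(buf), limit(buf) {}

    // Plays the role of YYFILL: keep the unread bytes, then top up the buffer.
    // Returns false when no unread input remains.
    bool fill()
    {
        const std::size_t kept = static_cast<std::size_t>(limit - cursor);
        std::memmove(buf, cursor, kept);
        is.read(buf + kept, static_cast<std::streamsize>(BUFFER_SIZE - kept));
        cursor = buf;
        limit = buf + kept + static_cast<std::size_t>(is.gcount());
        return cursor < limit;
    }
};

// Hand-written stand-in for a generated scanner: counts newline characters.
std::size_t count_lines(BufferedInput &in)
{
    std::size_t lines = 0;
    for (;;) {
        if (in.limit <= in.cursor && !in.fill()) return lines;
        if (*in.cursor++ == '\n') ++lines;
    }
}
```

Only `fill` knows where the bytes come from, so swapping the stream for a FILE* or a socket would leave the scanning loop unchanged.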

Adding a new option that could create code that uses a different set of macros has its own costs: re2c would be more difficult to learn (it's bad enough already), and the code gets more complex and harder to maintain.

It might be worth experimenting with to see if it improves things overall, but I'm a bit skeptical...

- Ulya Trofimovich - 2014-12-05
  
  Dan, as you pointed out, we'll need to check if complex expressions like
  "yych = *++YYCURSOR"
  compile to faster or equal assembly code compared to
  "++YYCURSOR; yych = *YYCURSOR".
  I took it for granted, but maybe I'm mistaken. I can experiment with GCC
  and Clang.
  
  I don't see how the new API can 'compromise the majority of users for an
  edge case' though.
  Users won't see any difference by default, and they won't need to write any
  new interface code.
  Even for those who want to use 'std::ifstream' we can have a convenience
  switch '--ifstream' that will override the default. Only some extreme users
  will use the new API directly.
  
  Surely, ifstreams will be slower than a C buffer, but it's up to users to
  choose the interface.
  
  As for 're2c will be more difficult to learn and that's bad already'
  ---- true.
  
  Finally, as for maintainability, I don't think re2c's code will get worse.
  For now, all those 13 cases I mentioned are scattered over code.cc in
  various combinations and duplicated many times. Unifying them can actually
  clean up the code.
  
  So, I mostly see one possible danger: slower code.
  What do you think?
  
  Ulya
  
  P.S.
  By the way, for me priorities look like:
  1. Speed
  2. Speed
  3. Speed
  4. Flexibility
  5. Backward compatibility
  6. Programmer friendliness
  7. Maintainability
  :)
  

Anonymous - 2014-12-06

Ulya,

As backward compatibility is always important, adding a way to configure re2c's buffer operations would already do the trick. As simplicity is also important, maybe adding a '--ifstream' switch would be counterproductive in this regard. Quite possibly, the same effect could be had with an example of how to configure re2c to output calls to std::ifstream.

Regarding ifstream being slower than a plain old C buffer, I don't doubt it might be less efficient in an apples-to-apples comparison. Nevertheless, due to the way re2c generates its code, if a file is being accessed through an ifstream object then ifstream will end up being used to fill up the C buffer. Therefore, as re2c currently stands, re2c's code requires two separate buffers to be operated in series: the C buffer being supplied with calls to ifstream, and ifstream's buffer being filled from the input stream. This issue can only be avoided if it becomes possible to configure re2c's output or if the output of re2c is edited to replace calls to the C buffer with calls to ifstream's interface.

- Ulya Trofimovich - 2014-12-06
  
  Hi Rui!
  I didn't quite understand your message.
  
  > As backward compatibility is always important, adding a way to
  > configure re2c's buffer operations would already do the trick. As
  > simplicity is also important, maybe adding a '--ifstream' switch would
  > be counterproductive in this regard. Quite possibly, the same effect
  > could be had with an example of how to configure re2c to output calls
  > to std::ifstream.
  
  What can be simpler than a command-line switch? :)
  
  > Regarding ifstream being slower than a plain old C buffer, I don't
  > doubt it might be less efficient in an apples-to-apples comparison.
  > Nevertheless, due to the way re2c generates its code, if a file is
  > being accessed through an ifstream object then ifstream will end up
  > being used to fill up the C buffer. Therefore, as re2c currently
  > stands, re2c's code requires two separate buffers to be operated in
  > series: the C buffer being supplied with calls to ifstream, and
  > ifstream's buffer being filled from the input stream. This issue can
  > only be avoided if it becomes possible to configure re2c's output or
  > if the output of re2c is edited to replace calls to the C buffer with
  > calls to ifstream's interface.
  
  It is possible right now to use ifstream directly (without any copying
  to a C buffer) if you use an iterator (as Dan suggested).
  It will also be possible if we add the new API I suggested (the only
  difference being that you would use a switch instead of writing an
  iterator).
  
  Ulya
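
The iterator route mentioned in this reply can be sketched as a small pointer-like class over a seekable stream. `StreamCursor`, `end_of` and `count_digits` are hypothetical names, and the design is an assumption made for illustration: each cursor stores only an offset and seeks on every dereference, which keeps several cursors valid at once but is likely to be slow.

```cpp
#include <cstddef>
#include <cstdio>   // EOF
#include <istream>
#include <sstream>

// Pointer-like cursor over a seekable std::istream (no copy to a C buffer).
class StreamCursor {
    std::istream *is_;
    std::streamoff pos_;
public:
    StreamCursor(std::istream &is, std::streamoff pos) : is_(&is), pos_(pos) {}

    char operator*() const
    {
        is_->clear();  // drop a stale eofbit or failbit before seeking
        is_->seekg(pos_, std::ios_base::beg);
        const int c = is_->peek();
        return c == EOF ? '\0' : static_cast<char>(c);
    }
    StreamCursor &operator++() { ++pos_; return *this; }
    StreamCursor operator+(std::ptrdiff_t n) const
    { return StreamCursor(*is_, pos_ + n); }
    friend std::ptrdiff_t operator-(const StreamCursor &a, const StreamCursor &b)
    { return static_cast<std::ptrdiff_t>(a.pos_ - b.pos_); }
    friend bool operator<=(const StreamCursor &a, const StreamCursor &b)
    { return a.pos_ <= b.pos_; }
};

// Returns a cursor one past the last character of the stream (for YYLIMIT).
StreamCursor end_of(std::istream &is)
{
    is.clear();
    is.seekg(0, std::ios_base::end);
    return StreamCursor(is, is.tellg());
}

// Hand-written loop in the style of generated code, using only the
// operations the cursor provides.
std::size_t count_digits(StreamCursor YYCURSOR, StreamCursor YYLIMIT)
{
    std::size_t n = 0;
    while (!(YYLIMIT <= YYCURSOR)) {
        const char yych = *YYCURSOR;
        if (yych >= '0' && yych <= '9') ++n;
        ++YYCURSOR;
    }
    return n;
}
```

Because assignment copies only a stream pointer and an offset, "YYMARKER = YYCURSOR" costs nothing extra; the cost is paid on each dereference instead.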
  

Ulya Trofimovich - 2015-10-13

status: open --> closed
