RE/flex lexical analyzer generator Wiki

The regex-centric, fast lexical analyzer generator for C++

Brought to you by: engelen


Project Members:

Robert van Engelen (admin)

Flex reimagined. Fast, flexible, adds Boost 💪

See Constructing Lexical Analyzers with RE/flex - a Modern Alternative to Flex for C++ and the RE/flex User Guide for more details.

RE/flex is faster than Flex and much faster than regex libraries such as Boost.Regex, C++11 std::regex, PCRE2 and RE2. For example, tokenizing a representative C source code file into 244 tokens takes:

Command / Function	Software	Time (μs)
reflex --fast	RE/flex 1.1.5	13
flex -+ --full	Flex 2.5.35	17
reflex --full	RE/flex 1.1.5	29
boost::spirit::lex::lexertl::actor_lexer::iterator_type	Boost.Spirit.Lex 1.66.0	40
hs_compile_multi(), hs_scan()	Hyperscan 5.1.0	209
reflex -m=boost-perl	Boost.Regex 1.66.0	230
pcre2_match()	PCRE2 (pre-compiled) 10.30	318
RE2::Consume()	RE2 (pre-compiled) 2018-04-01	417
reflex -m=boost	Boost.Regex POSIX 1.66.0	450
RE2::Consume()	RE2 POSIX (pre-compiled) 2018-04-01	1226
flex -+	Flex 2.5.35	3968
std::regex::cregex_iterator()	C++11 std::regex	5979

Note: Best times of 10 tests, where each test time is the average in microseconds over 100 runs (using clang 9.0.0 with -O2 on a 2.9 GHz Intel Core i7 with 16 GB 2133 MHz LPDDR3). Hyperscan is disqualified as a potential scanner due to its event handler requirements.
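The measurement scheme described in the note (best of 10 tests, each averaged over 100 runs) can be sketched with std::chrono. This is only an illustration of the scheme, not the benchmark code behind the table above; the helper name best_average_us and its parameters are assumptions made for this sketch.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <limits>

// Hypothetical helper (not part of RE/flex): calls `work` `runs` times per test,
// averages the elapsed time in microseconds, and returns the smallest average
// seen over `tests` tests.
double best_average_us(const std::function<void()>& work, int tests = 10, int runs = 100)
{
  using clock = std::chrono::steady_clock;
  double best = std::numeric_limits<double>::max();
  for (int t = 0; t < tests; ++t)
  {
    const auto start = clock::now();
    for (int r = 0; r < runs; ++r)
      work();
    const std::chrono::duration<double, std::micro> elapsed = clock::now() - start;
    best = std::min(best, elapsed.count() / runs); // average time of this test
  }
  return best;
}
```

To time a tokenizer, pass a callable that tokenizes the same input buffer on every call. Taking the best of several averages reduces the influence of one-off disturbances such as cache warm-up.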

Features

Fully compatible with Flex to eliminate a learning curve, making a transition to RE/flex frustration-free.
Extensive documentation in the online manual.
Works with Bison/Yacc and supports reentrant, bison-bridge and bison-locations.
Generates scanners for lexical analysis on ASCII, UTF-8/16/32, EBCDIC, ISO-8859-1 files, C++ streams, and (wide) strings.
Adds Unicode support, integrated Unicode pattern matching on UTF-8/16/32 files and wide strings (no need to write a YY_INPUT routine for decoding).
Adds Unicode property matching \p{C} and C++11, Java, C#, and Python Unicode properties for identifier name matching.
Adds indent \i and dedent \j patterns to match rules on text with indentation, including \t (tab) adjustments.
Adds lazy quantifiers to the POSIX regular expression syntax, so no more hacks to work around the greedy repetitions in Flex.
Adds word boundary anchors to the POSIX regular expression syntax.
Adds %class and %init to customize the generated Lexer classes.
Adds %include to modularize lex specifications.
Adds an extensible hierarchy of pattern matcher engines, with a choice of regex engines, including the RE/flex regex engine and Boost.Regex.
Generates MT-safe (reentrant) code by default.
Generates clean source code that defines a C++ Lexer class derived from an abstract lexer class.
RE/flex generates lex.yy.cpp files while Flex generates lex.yy.cc files (in C++ mode with flex option -+), to distinguish the generated files.
Generates Graphviz files to visualize FSMs with the Graphviz dot tool.
Includes many examples, such as a tokenizer for C/C++ code, a tokenizer for Python code, a tokenizer for Java code, and more.
Converts the official Unicode scripts Scripts.txt and UnicodeData.txt to UTF-8 patterns by applying a RE/flex scanner to convert these scripts to C++ code. Future Unicode standards can be automatically converted using these scanners that are written in RE/flex itself.
Converts regex expressions for use with regex engines that lack certain regex features.
The RE/flex regex library makes C++11 std::regex and Boost.Regex much easier to use in plain C++ code for pattern matching on (wide) strings, files, and streams.
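
To illustrate the lazy quantifiers mentioned in the feature list, here is a minimal sketch of the difference between a greedy and a lazy repetition when matching C-style comments. It uses plain C++11 std::regex (ECMAScript syntax) rather than a RE/flex lexer specification, and the helper first_match is an assumption made for this sketch.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first substring of `text` matched by
// `pattern`, or an empty string if there is no match.
std::string first_match(const std::string& pattern, const std::string& text)
{
  const std::regex re(pattern); // std::regex defaults to ECMAScript syntax
  std::smatch m;
  return std::regex_search(text, m, re) ? m.str() : std::string();
}
```

With the input "/* a */ x /* b */", the greedy pattern "/\\*.*\\*/" runs to the last "*/" and swallows the code between the two comments, while the lazy pattern "/\\*.*?\\*/" stops at the first "*/" and matches only "/* a */".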

The RE/flex software is self-contained. No other libraries are required. Boost.Regex is optional to use as a regex engine.

The RE/flex repo includes tokenizers for Java, Python, and C/C++.

Installation

Windows users

Use reflex/bin/reflex.exe from the command line or add a Custom Build Step in MSVC++ as follows:

1. select the project name in Solution Explorer then Property Pages from the View menu (see also custom-build steps in Visual Studio);
2. add an extra path to the reflex/include folder in the Include Directories under VC++ Directories, which should look like $(VC_IncludePath);$(WindowsSDK_IncludePath);C:\Users\YourUserName\Documents\reflex\include (this assumes the reflex source package is in your Documents folder);
3. enter "C:\Users\YourUserName\Documents\reflex\bin\reflex.exe" --header-file "C:\Users\YourUserName\Documents\mylexer.l" in the Command Line property under Custom Build Step (this assumes mylexer.l is in your Documents folder);
4. enter lex.yy.h lex.yy.cpp in the Outputs property;
5. specify Execute Before as PreBuildEvent.

To compile your program with MSVC++, make sure to drag the folders reflex/lib and reflex/unicode to the Source Files in the Solution Explorer panel of your project. After running reflex.exe drag the generated lex.yy.h and lex.yy.cpp files there as well. If you are using specific reflex command-line options such as --flex, add these in step 3.

Unix/Linux and Mac OS

You have two options: 1) quick install or 2) configure and make.

Quick install

For a quick clean build assuming your environment is pretty much standard:

$ ./clean.sh
$ ./build.sh

This compiles the reflex tool and installs it locally in reflex/bin. You can add this location to your $PATH variable to enable the new reflex command:

export PATH=$PATH:/reflex_install_path/bin

The libreflex.a and libreflex.so libraries are saved locally in lib. Link against one of these libraries when you use the RE/flex regex engine in your code. The RE/flex header files are locally located in include/reflex.

To install the library and the reflex command in /usr/local/lib and /usr/local/bin:

$ sudo ./allinstall.sh

Configure and make

The configure script accepts configuration and installation options. To view these options, run:

$ ./configure --help

Run configure and make:

$ ./configure && make

After this successfully completes, you can optionally run make install to install the reflex command and libreflex library:

$ sudo make install

Optional libraries to install:

To use Boost.Regex as a regex engine with the RE/flex library and scanner generator, install Boost and link your code against libboost_regex.a
To visualize the FSM graphs generated with reflex option --graphs-file, install Graphviz dot.

Usage

There are two ways you can use this project:

as a scanner generator for C++, similar to Flex;
as an extensible regex matching library for C++.

For the first option, simply build the reflex tool and run it on the command line on a lex specification:

$ reflex --flex --bison --graphs-file lexspec.l

This generates a scanner for Bison from the Flex specification lexspec.l and saves the finite state machine (FSM) as a Graphviz .gv file that can be visualized with the Graphviz dot tool:

$ dot -Tpdf reflex.INITIAL.gv > reflex.INITIAL.pdf
$ open reflex.INITIAL.pdf

Visualize DFA graphs with Graphviz dot

Several examples are included to get you started. See the manual for more details.

For the second option, simply use the new RE/flex matcher classes to start pattern matching on strings, wide strings, files, and streams.

You can select matchers that are based on different regex engines:

RE/flex regex: #include <reflex/matcher.h> and use reflex::Matcher;
Boost.Regex: #include <reflex/boostmatcher.h> and use reflex::BoostMatcher or reflex::BoostPosixMatcher;
C++11 std::regex: #include <reflex/stdmatcher.h> and use reflex::StdMatcher or reflex::StdPosixMatcher.

Each matcher may differ in regex syntax features (see the full documentation), but they have the same methods and iterators:

matches() returns nonzero if the input from begin to end matches;
find() searches the input and returns nonzero if a match was found;
scan() scans the input and returns nonzero if the input at the current position matches;
split() returns nonzero for a split of the input at the next match;
find.begin()...find.end() filter iterator;
scan.begin()...scan.end() tokenizer iterator;
split.begin()...split.end() splitter iterator.

For example:

#include <reflex/boostmatcher.h> // reflex::BoostMatcher, reflex::Input, boost::regex
// use a BoostMatcher to check if the birthdate string is a valid date
if (reflex::BoostMatcher("\\d{4}-\\d{2}-\\d{2}", birthdate).matches() != 0)
  std::cout << "Valid date!" << std::endl;

With a group capture to fetch the year:

#include <reflex/boostmatcher.h> // reflex::BoostMatcher, reflex::Input, boost::regex
// use a BoostMatcher to check the birthdate string and capture the year
reflex::BoostMatcher matcher("(\\d{4})-\\d{2}-\\d{2}", birthdate);
if (matcher.matches() != 0)
  std::cout << std::string(matcher[1].first, matcher[1].second) << " was a good year!" << std::endl;
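
Note that the pattern above checks only the shape of the string, so an input such as "2020-99-99" also matches. A minimal sketch of a stricter check that uses the captured groups is given below. It relies on C++11 std::regex only, and the helper is_plausible_date is an assumption made for this sketch, not part of RE/flex.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if `s` has the form YYYY-MM-DD with a month in
// 01..12 and a day in 01..31. Month lengths and leap years are not checked.
bool is_plausible_date(const std::string& s)
{
  static const std::regex re("(\\d{4})-(\\d{2})-(\\d{2})"); // compiled once
  std::smatch m;
  if (!std::regex_match(s, m, re))
    return false;
  const int month = std::stoi(m[2].str()); // second capture group
  const int day = std::stoi(m[3].str());   // third capture group
  return month >= 1 && month <= 12 && day >= 1 && day <= 31;
}
```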

To search a string for words \w+:

#include <reflex/boostmatcher.h> // reflex::BoostMatcher, reflex::Input, boost::regex
// use a BoostMatcher to search for words in a sentence
reflex::BoostMatcher matcher("\\w+", "How now brown cow.");
while (matcher.find() != 0)
  std::cout << "Found " << matcher.text() << std::endl;

The split method is roughly the inverse of the find method and returns text located between matches. For example using non-word matching \W+:

#include <reflex/boostmatcher.h> // reflex::BoostMatcher, reflex::Input, boost::regex
// use a BoostMatcher to split a sentence into words at non-word characters
reflex::BoostMatcher matcher("\\W+", "How now brown cow.");
while (matcher.split() != 0)
  std::cout << "Found " << matcher.text() << std::endl;
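
The relationship between finding and splitting can also be shown with the C++ standard library alone. The sketch below is not the RE/flex API: find_all and split_all are hypothetical helpers built on std::sregex_iterator and std::sregex_token_iterator, and their behavior at the end of the input may differ from that of the RE/flex split method.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects every substring of `text` that matches `pattern`.
std::vector<std::string> find_all(const std::string& pattern, const std::string& text)
{
  const std::regex re(pattern);
  std::vector<std::string> out;
  for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it)
    out.push_back(it->str());
  return out;
}

// Hypothetical helper: collects the pieces of `text` located between matches of `pattern`.
std::vector<std::string> split_all(const std::string& pattern, const std::string& text)
{
  const std::regex re(pattern);
  // Submatch index -1 selects the parts of the input that did not match.
  std::sregex_token_iterator it(text.begin(), text.end(), re, -1), end;
  return std::vector<std::string>(it, end);
}
```

For the sentence "How now brown cow.", finding \w+ and splitting at \W+ yield the same four words with these helpers.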

To pattern match the content of a file that may use UTF-8, 16, or 32 encodings:

#include <reflex/boostmatcher.h> // reflex::BoostMatcher, reflex::Input, boost::regex
// use a BoostMatcher to search and display words from a FILE
FILE *fd = fopen("somefile.txt", "r");
if (fd == NULL)
  exit(EXIT_FAILURE);
reflex::BoostMatcher matcher("\\w+", fd);
while (matcher.find() != 0)
  std::cout << "Found " << matcher.text() << std::endl;
fclose(fd);

Same again, but this time with a C++ input stream:

#include <reflex/boostmatcher.h> // reflex::BoostMatcher, reflex::Input, boost::regex
// use a BoostMatcher to search and display words from a stream
std::ifstream file("somefile.txt", std::ifstream::in);
reflex::BoostMatcher matcher("\\w+", file);
while (matcher.find() != 0)
  std::cout << "Found " << matcher.text() << std::endl;
file.close();

Stuffing the search results into a container using RE/flex iterators:

#include <reflex/boostmatcher.h> // reflex::BoostMatcher, reflex::Input, boost::regex
#include <vector>         // std::vector
// use a BoostMatcher to convert words of a sentence into a string vector
reflex::BoostMatcher matcher("\\w+", "How now brown cow.");
std::vector<std::string> words(matcher.find.begin(), matcher.find.end());

Use C++11 range-based loops with RE/flex iterators:

#include <reflex/stdmatcher.h> // reflex::StdMatcher, reflex::Input, std::regex
// use a StdMatcher to search for words in a sentence
for (auto& match : reflex::StdMatcher("\\w+", "How now brown cow.").find)
  std::cout << "Found " << match.text() << std::endl;

RE/flex also allows you to convert expressive regex syntax forms such as \p Unicode classes, character class set operations such as [a-z--[aeiou]], escapes such as \X, and (?x) mode modifiers, to a regex string that the underlying regex library understands and will be able to use:

std::string reflex::Matcher::convert(const std::string& regex)
std::string reflex::BoostMatcher::convert(const std::string& regex)
std::string reflex::StdMatcher::convert(const std::string& regex)

For example:

#include <reflex/matcher.h> // reflex::Matcher, reflex::Input, reflex::Pattern
// use a Matcher to check if sentence is in Greek:
static const reflex::Pattern pattern(reflex::Matcher::convert("[\\p{Greek}\\p{Zs}\\pP]+"));
if (reflex::Matcher(pattern, sentence).matches() != 0)
  std::cout << "This is Greek" << std::endl;

Conversion is fast (it runs in linear time in the size of the regex), but it is not without some overhead. Making converted regex patterns static, as shown above, means the conversion cost is paid only once, no matter how many times the pattern is used for matching.
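
The compile-once idiom is not specific to RE/flex. As a minimal sketch using only C++11 std::regex, the function-local static below is initialized on the first call and reused afterwards. The counter compile_count and the helpers make_word_pattern and is_word are assumptions made for this illustration.

```cpp
#include <regex>
#include <string>

// Counts how often the pattern is compiled (for illustration only).
static int compile_count = 0;

// Hypothetical helper: compiles the word pattern and records the compilation.
static std::regex make_word_pattern()
{
  ++compile_count;
  return std::regex("\\w+");
}

// Returns true if `text` consists of a single word. The static pattern is
// initialized on the first call only, so later calls skip the compilation.
bool is_word(const std::string& text)
{
  static const std::regex pattern = make_word_pattern();
  return std::regex_match(text, pattern);
}
```

After any number of calls to is_word, compile_count stays at 1. Since C++11 the initialization of a function-local static is also thread-safe.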

You can use convert with option reflex::convert_flag::unicode to make . (dot), \w, \s and so on match Unicode.

License and copyright

RE/flex is distributed under the BSD-3 license LICENSE.txt.
Use, modification, and distribution are subject to the BSD-3 license.