flexc++ - Generate a C++ scanner class and parsing function
flexc++ [options] rules-file
Flexc++(1) was designed after flex(1) and
flex++(1). Like these latter two programs flexc++ generates
code performing pattern-matching on text, possibly executing actions when
certain regular expressions are recognized.
Flexc++, contrary to flex and flex++,
generates code that is explicitly intended for use by C++ programs.
The well-known flex(1) program generates C source-code and
flex++(1) merely offers a C++-like shell around the
yylex function generated by flex(1) and hardly supports
present-day ideas about C++ software development.
Contrary to this, flexc++ creates a C++ class
offering a predefined member function lex matching input against
regular expressions and possibly executing C++ code once regular
expressions were matched. The code generated by flexc++ is pure
C++, allowing its users to apply all of the features offered by that
language.
Not every aspect of flexc++ is covered by the man-pages. In
addition to what’s summarized by the man-pages the flexc++
manual offers a chapter covering pre-loading of input lines (allowing you
to, e.g., display lines in which errors are observed even though not all of
the line’s tokens have already been scanned), as well as a chapter
covering technical documentation about the inner workings of
flexc++.
Before version 2.08.00 the lexical scanner’s specification
file (e.g., lexer) could be split into several files using
//include directives, but //include directives required that
all files were specified relative to the location of the lexer file
itself. E.g., if inc/part1 had to include part2, available in
the same (inc) directory as part1, then part1 had to
specify //include inc/part2 instead of merely //include
part2.
Starting with version 2.08.00 //include directives use the
directory of the file containing the directive as the current
directory. In the provided example part1 should simply contain
//include part2. See the flexc++api(3) and
flexc++input(7) man-pages for details.
Flexc++ offers several man-pages. These man-pages contain
the following main sections:
This man-page
This man-page offers the following sections:
- o
- 1. QUICK START: a quick start overview about how to use
flexc++;
- o
- 2. QUICK START: FLEXC++ and BISONC++: a quick start overview about
how to use flexc++ in combination with bisonc++(1);
- o
- 3. GENERATED FILES: files generated by flexc++ and their
purposes
- o
- 4. OPTIONS: options available for flexc++.
The flexc++api(3) man-page:
This man-page describes the classes generated by flexc++,
describing flexc++’s actions from the programmer’s
point of view.
- o
- 1. INTERACTIVE SCANNERS: how to create an interactive scanner
- o
- 2. THE CLASS INTERFACE: SCANNER.H: Constructors and members of the
scanner class generated by flexc++
- o
- 3. NAMING CONVENTION: symbols defined by flexc++ in the
scanner class.
- o
- 4. CONSTRUCTORS: constructors defined in the scanner class.
- o
- 5. PUBLIC MEMBER FUNCTION: public members declared in the scanner
class.
- o
- 6. PRIVATE MEMBER FUNCTIONS: private members declared in the
scanner class.
- o
- 7. SCANNER CLASS HEADER EXAMPLE: an example of a generated scanner
class header
- o
- 8. THE SCANNER BASE CLASS: the scanner class is derived from a base
class. The base class is described in this section
- o
- 9. PUBLIC ENUMS AND -TYPES: enums and types declared by the base
class
- o
- 10. PROTECTED ENUMS AND -TYPES: enumerations and types used by the
scanner and scanner base classes
- o
- 11. NO PUBLIC CONSTRUCTORS: the scanner base class does not offer
public constructors.
- o
- 12. PUBLIC MEMBER FUNCTIONS: several members defined by the scanner
base class have public access rights.
- o
- 13. PROTECTED CONSTRUCTORS: the base class can be constructed by a
derived class. Usually this is the scanner class generated by
flexc++.
- o
- 14. PROTECTED MEMBER FUNCTIONS: this section covers the base class
member functions that can only be used by scanner class or scanner base
class members
- o
- 15. PROTECTED DATA MEMBERS: this section covers the base class data
members that can only be used by scanner class or scanner base class
members
- o
- 16. FLEX++ TO FLEXC++ MEMBERS: a short overview of frequently used
flex(1) members that received different names in
flexc++.
- o
- 17. THE CLASS INPUT: the scanner’s job is completely
decoupled from the actual input stream. The class Input, nested
within the scanner base class, handles the communication with the input
streams. The class Input is described in this section.
- o
- 18. INPUT CONSTRUCTORS: the class Input can easily be
replaced by another class. The constructor-requirements are described in
this section.
- o
- 19. REQUIRED PUBLIC MEMBER FUNCTIONS: this section covers the
required public members of a self-made Input class
The flexc++input(7) man-page:
This man-page describes how flexc++’s input files
should be organized. It contains the following sections:
- o
- 1. SPECIFICATION FILE(S): the format and contents of flexc++
input files, specifying the Scanner’s characteristics
- o
- 2. FILE SWITCHING: how to switch to another input specification
file
- o
- 3. DIRECTIVES: directives that can be used in input specification
files
- o
- 4. MINI SCANNERS: how to declare mini-scanners
- o
- 5. DEFINITIONS: how to define symbolic names for regular
expressions
- o
- 6. %% SEPARATOR: the separator between the input specification
sections
- o
- 7. REGULAR EXPRESSIONS: regular expressions supported by
flexc++
- o
- 8. SPECIFICATION EXAMPLE: an example of a specification file
A bare-bones, no-frills scanner is generated as follows:
- o
- First define a subdirectory scanner, and change-dir to
scanner. This directory is going to contain all scanner-related
files, created next.
- o
- Create a file lexer defining the regular expressions to recognize,
and the tokens to return. Use token values exceeding 0xff for symbolic
tokens if plain ASCII character values are also returned as token values,
so that the two kinds of values cannot collide. Example (assume
capitalized words are token-symbols defined in an enum defined by the
scanner class):
%%
[ \t\n]+ // skip white space chars.
[0-9]+ return NUMBER;
[[:alpha:]_][[:alpha:][:digit:]_]* return IDENTIFIER;
. return matched()[0];
- o
- Execute:
flexc++ lexer
This generates four files: Scanner.h, Scanner.ih, Scannerbase.h, and
lex.cc.
- o
- Edit Scanner.h to add the enum defining the token-symbols in
(usually) the public section of the class Scanner. E.g.,
class Scanner: public ScannerBase
{
    public:
        enum Tokens
        {
            IDENTIFIER = 0x100,
            NUMBER
        };
    // ... (etc, as generated by flexc++)
};
- o
- Change-dir to scanner’s base directory, and there create a
file main.cc defining int main:
#include <iostream>
#include <string>
#include "scanner/Scanner.h"
using namespace std;
int main()
{
    Scanner scanner;                    // define a Scanner object
    while (int token = scanner.lex())   // get all tokens
    {
        string const &text = scanner.matched();
        switch (token)
        {
            case Scanner::IDENTIFIER:
                cout << "identifier: " << text << '\n';
            break;
            case Scanner::NUMBER:
                cout << "number: " << text << '\n';
            break;
            default:
                cout << "char. token: `" << text << "'\n";
            break;
        }
    }
}
- o
- Compile all .cc files, creating a.out:
g++ *.cc scanner/*.cc
- o
- To `tokenize’ main.cc, execute:
a.out < main.cc
To interface flexc++ to the bisonc++(1) parser
generator proceed as follows:
- o
- Start from the directory containing main.cc used in the previous
section; the lexical scanner developed there is also used here.
- o
- Create a directory parser and change-dir to that directory.
- o
- Define the following grammar in the file grammar:
%scanner ../scanner/Scanner.h
%token-path ../scanner/tokens.h
%token IDENTIFIER NUMBER CHAR
%%
startrule:
startrule tokenshow
|
tokenshow
;
tokenshow:
token
{
    std::cout << "matched: " << d_scanner.matched() << '\n';
}
;
token:
IDENTIFIER
|
NUMBER
|
CHAR
;
- o
- Create the parser by executing:
bisonc++ grammar
This generates five files: parse.cc, Parserbase.h, Parser.h,
Parser.ih and ../scanner/tokens.h,
where the last file contains the class Tokens
defining the enumeration Tokens_ specifying the
symbolic token names.
- o
- Now that the parser has been defined, edit the (three) lines in the file
../scanner/lexer containing return statements. Change these lines
as follows (the first two lines of the file lexer remain as-is):
[0-9]+ return Tokens::NUMBER;
[[:alpha:]_][[:alpha:][:digit:]_]* return Tokens::IDENTIFIER;
. return Tokens::CHAR;
This allows the scanner to return Parser tokens to the generated
parser.
- o
- Modify the scanner so that it returns these Parser tokens by
executing:
flexc++ lexer
- o
- Next, add the line
#include "tokens.h"
to the file scanner/Scanner.ih, informing the scanner about the
existence of the tokens expected by the parser.
- If ever you have to use members from the parser’s base class
generated by bisonc++(1), then
#include "../parser/Parserbase.h"
should be added to the file scanner/Scanner.ih. In that case
including the file tokens.h in scanner/Scanner.ih is
optional.
- o
- Change-dir to the scanner’s parent directory and rewrite the
main.cc file defined in the previous section to contain:
#include "parser/Parser.h"
int main()
{
    Parser parser;
    parser.parse();
}
- o
- Compile all sources:
g++ *.cc */*.cc
- o
- Execute the program, providing it with some source file to be processed:
a.out < main.cc
Flexc++ generates four files from a well-formed input
file:
- o
- A file containing the implementation of the lex member function and
its support functions. By default this file is named lex.cc.
- o
- A file containing the scanner’s class interface. By default this
file is named Scanner.h. The scanner class itself is generated once
and is thereafter `owned’ by the programmer, who may change it
ad-lib. Newly added members (data members, function members) will
survive future flexc++ runs as flexc++ will never rewrite an
existing scanner class interface file, unless explicitly ordered to do
so.
- o
- A file containing the interface of the scanner class’s base
class. The scanner class is publicly derived from this base class.
It is used to minimize the size of the scanner interface itself. The
scanner base class is `owned’ by flexc++ and should never be
hand-modified. By default the scanner’s base class is provided in
the file Scannerbase.h. At each new flexc++ run this file is
rewritten unless flexc++ is explicitly ordered not to do
so.
- o
- A file containing the implementation header. This file should
contain includes and declarations that are only required when compiling
the members of the scanner class. By default this file is named
Scanner.ih. This file, like the file containing the scanner
class’s interface is never rewritten by flexc++ unless
flexc++ is explicitly ordered to do so.
Where available, single letter options are listed between
parentheses following their associated long-option variants. Single letter
options require arguments if their associated long options require arguments
as well. Options affecting the class header or implementation header file
are ignored if these files already exist. Options accepting a
`filename’ do not accept path names, i.e., they cannot contain
directory separators (/); options accepting a
`pathname’ may contain directory separators.
Some options may generate errors. This happens when an option
conflicts with the contents of an existing file which flexc++ cannot
modify (e.g., a scanner class header file exists, but doesn’t define
a name space, but a --namespace option was provided). To solve the
error the offending option could be omitted, the existing file could be
removed, or the existing file could be hand-edited according to the
option’s specification. Note that flexc++ currently does not
handle the opposite error condition: if a previously used option is omitted,
then flexc++ does not detect the inconsistency. In those cases you
may encounter compilation errors.
- o
- --baseclass-header=filename (-b)
Use filename as the name of the file to contain the scanner
class’s base class. Defaults to the name of the scanner class plus
base.h
- It is an error if this option is used and an already existing
scanner-class header file does not include `filename’.
- o
- --baseclass-skeleton=pathname (-B)
Use pathname as the path to the file containing the skeleton of the
scanner class’s base class. Its filename defaults to
flexc++base.h.
- o
- --case-insensitive
Use this option to generate a scanner case insensitively matching
regular expressions. All regular expressions specified in
flexc++’s input file are interpreted case insensitively and
the resulting scanner object will case insensitively interpret its
input.
- When this option is specified the resulting scanner does not distinguish
between the following rules:
First // initial F is transformed to f
first
FIRST // all capitals are transformed to lower case chars
With a case-insensitive scanner only the first rule can be matched, and
flexc++ will issue warnings for the second and third rule about
rules that cannot be matched.
- Input processed by a case-insensitive scanner is also handled case
insensitively. The above mentioned First rule is matched for all of
the following input words: first First FIRST firST.
- Although the matching process proceeds case insensitively, the matched
text (as returned by the scanner’s matched() member) always
contains the original, unmodified text. So, with the above input
matched() returns, respectively first, First, FIRST and
firST, while matching the rule First.
- o
- --class-header=filename (-c)
Use filename as the name of the file to contain the scanner class.
Defaults to the name of the scanner class plus the suffix .h
- o
- --class-name=className
Use className (rather than Scanner) as the name of the scanner
class. Unless overridden by other options generated files will be given
the (transformed to lower case) className* name instead of
scanner*.
- It is an error if this option is used and an already existing
scanner-class header file does not define class
`className’
- o
- --class-skeleton=pathname (-C)
Use pathname as the path to the file containing the skeleton of the
scanner class. Its filename defaults to flexc++.h.
- o
- --construction (-K)
Write details about the lexical scanner to the file
`rules-file’.output. Details cover the used character
ranges, information about the regexes, the raw NFA states, and the final
DFAs.
- o
- --debug (-d)
Provide lex and its support functions with debugging code, showing
the actual parsing process on the standard output stream. When included,
the debugging output is active by default, but its activity may be
controlled using the setDebug(bool on-off) member. Note that
#ifdef DEBUG macros are not used anymore. By rerunning
flexc++ without the --debug option an equivalent scanner is
generated not containing the debugging code. This option does not provide
debug information about flexc++ itself. For that use the options
--own-parser and/or --own-tokens (see below).
- o
- --filenames=genericName (-f)
Generic name of generated files (header files, not the lex-function
source file, see the --lex-source option for that). By default the
header file names will be equal to the name of the generated class.
- o
- --help (-h)
Write basic usage information to the standard output stream and
terminate.
- o
- --implementation-header=filename (-i)
Use filename as the name of the file to contain the implementation
header. Defaults to the name of the generated scanner class plus the
suffix .ih. The implementation header should contain all directives
and declarations only used by the implementations of the
scanner’s member functions. It is the only header file that is
included by the source file containing lex()’s
implementation. User defined implementation of other class members may use
the same convention, thus concentrating all directives and declarations
that are required for the compilation of other source files belonging to
the scanner class in one header file.
- It is an error if this option is used and an already existing
’filename’ file does not include the scanner class
header file.
- o
- --implementation-skeleton=pathname (-I)
Use pathname as the path to the file containing the skeleton of the
implementation header. Its filename defaults to flexc++.ih.
- o
- --lex-skeleton=pathname (-L)
Use pathname as the path to the file containing the lex()
member function’s skeleton. Its filename defaults to
flexc++.cc.
- o
- --lex-function-name=funname
Use funname rather than lex as the name of the member function
performing the lexical scanning.
- o
- --lex-source=filename (-l)
Define filename as the name of the source file to contain the scanner
member function lex. Defaults to lex.cc.
- o
- --matched-rules (-R)
The generated scanner will write the numbers of matched rules to the
standard output. It is implied by the --debug option. Displaying
the matched rules can be suppressed by calling the generated
scanner’s member setDebug(false) (or, of course, by
re-generating the scanner without specifying
--matched-rules).
- o
- --max-depth=depth (-m)
Set the maximum inclusion depth of the lexical scanner’s
specification files to depth. By default the maximum depth is set
to 10. When more than depth specification files are used the
scanner throws a Max stream stack size exceeded
std::length_error exception.
- o
- --namespace=identifier
Define the scanner class in the namespace identifier. By default no
namespace is used. If this option is used the implementation header is
provided with a commented out using namespace declaration
for the requested namespace. In addition, the scanner and scanner base
class header files also use the specified namespace to define their
include guard directives.
- It is an error if this option is used and an already existing
scanner-class header file does not define namespace
identifier.
- o
- --no-baseclass-header
Do not write the file containing the scanner’s base class interface
even if it doesn’t yet exist. By default the file containing the
scanner’s base class interface is (re)written each time
flexc++ is called.
- o
- --no-lines
Do not put #line preprocessor directives in the file containing the
scanner’s lex function. By default #line directives
are entered at the beginning of the action statements in the generated
lex.cc file, allowing the compiler and debuggers to associate
errors with lines in your rules file, rather than with the
source file containing the lex function itself.
- o
- --no-lex-source
Do not write the file containing the scanner’s lex member function
and its support functions, even if that file doesn’t yet exist. By default
the file containing the scanner’s lex member function is
(re)written each time flexc++ is called. This option should
normally be avoided, as this file contains tables which are
altered whenever the rules file is modified.
- o
- --own-parser (-P)
The actions performed by flexc++’s own parser are written to
the standard output stream.
- This option does not result in the generated program optionally
displaying the actions of its lex function. If that is what you
want, use the --debug option.
- o
- --own-tokens (-T)
The tokens returned as well as the text matched by flexc++ are
written to the standard output stream when this option is used.
- This option does not result in the generated program displaying
returned tokens and matched text. If that is what you want, use the
--print-tokens option.
- o
- --print-tokens (-t)
The tokens returned as well as the text matched by the generated
lex function are displayed on the standard output stream, just
before returning the token to lex’s caller. Displaying
tokens and matched text is suppressed again when the lex.cc file is
generated without using this option. The function showing the tokens
(ScannerBase::print_) is called from Scanner::printTokens,
which is defined in-line in Scanner.h. Calling
ScannerBase::print_, therefore, can also easily be controlled by an
option controlled by the program using the scanner object.
- This option does not show the tokens returned and text matched by
flexc++ itself when reading its input files. If that is what you
want, use the --own-tokens option.
- o
- --regex-calls
Show the function call order when parsing regular expressions. This option
is normally not required: its main purpose is to help developers
understand what happens when regular expressions are parsed.
- o
- --show-filenames (-F)
Write the names of the files that are generated to the standard error
stream.
- o
- --skeleton-directory=pathname (-S)
Defines the directory containing the skeleton files. This option can be
overridden by the specific skeleton-specifying options (-B, -C, -I,
and -L).
- o
- --target-directory=pathname
Specifies the directory where generated files should be written. By default
this is the directory where flexc++ is called.
- o
- --usage (-h)
Write basic usage information to the standard output stream and
terminate.
- o
- --verbose (-V)
Write various pieces of additional information, not covered by the
--construction and --show-filenames options, to the standard output
stream.
- o
- --version (-v)
Display flexc++’s version number and terminate.
Flexc++’s default skeleton files are in
/usr/share/flexc++.
By default, flexc++ generates the following files:
- o
- Scanner.h: the header file containing the scanner class’s
interface.
- o
- Scannerbase.h: the header file containing the interface of the
scanner class’s base class.
- o
- Scanner.ih: the internal header file that is meant to be included
by the scanner class’s source files (e.g., it is included by
lex.cc, see the next item’s file), and that should contain
all declarations required for compiling the scanner class’s
sources.
- o
- lex.cc: the source file implementing the scanner class member
function lex (and support functions), performing the lexical scan.
Flexc++ was originally started as a programming project by
Jean-Paul van Oosten and Richard Berendsen in the 2007-2008 academic year.
After graduating, Richard left the project and moved to Amsterdam. Jean-Paul
remained in Groningen, and after on-and-off activities on the project, in
close cooperation with Frank B. Brokken, Frank undertook a rewrite of the
project’s code around 2010. During the development of flexc++,
the lookahead-operator handling continuously threatened the completion of
the project. But in version 2.00.00 the lookahead operator received a
completely new implementation (with a bug fix in version 2.04.00), which
solved previously encountered problems with the lookahead-operator.
This is free software, distributed under the terms of the GNU
General Public License (GPL).
Frank B. Brokken (f.b.brokken@rug.nl),
Jean-Paul van Oosten (j.p.van.oosten@rug.nl),
Richard Berendsen (richardberendsen@xs4all.nl) (until 2010).