ragel - compile regular languages into executable state
machines
Ragel compiles executable finite state machines from regular
languages. Ragel can generate C, C++, Objective-C, D, Go, or Java code.
Ragel state machines can not only recognize byte sequences as regular
expression machines do, but can also execute code at arbitrary points in the
recognition of a regular language. User code is embedded using inline
operators that do not disrupt the regular language syntax.
The core language consists of standard regular expression
operators, such as union, concatenation and kleene star, accompanied by
action embedding operators. Ragel also provides operators that let you
control any non-determinism that you create, construct scanners using the
longest match paradigm, and build state machines using the statechart model.
It is also possible to influence the execution of a state machine from
inside an embedded action by jumping or calling to other parts of the
machine and reprocessing input.
Ragel provides a very flexibile interface to the host language
that attempts to place minimal restrictions on how the generated code is
used and integrated into the application. The generated code has no
dependencies.
- -h, -H, -?,
--help
- Display help and exit.
- -v
- Print version information and exit.
- -o file
- Write output to file. If -o is not given, a default file name is chosen by
replacing the file extenstion of the input file. For source files ending
in .rh the suffix .h is used. For all other source files a suffix based on
the output language is used (.c, .cpp, .m, etc.). If -o is not given for
Graphviz output the generated dot file is written to standard output.
- -s
- Print some statistics on standard error.
- --error-format=gnu
- Print error messages using the format "file:line:column:"
(default)
- --error-format=msvc
- Print error messages using the format "file(line,column):"
- -d
- Do not remove duplicate actions from action lists.
- -I dir
- Add dir to the list of directories to search for included and imported
files
- -n
- Do not perform state minimization.
- -m
- Perform minimization once, at the end of the state machine
compilation.
- -l
- Minimize after nearly every operation. Lists of like operations such as
unions are minimized once at the end. This is the default minimization
option.
- -e
- Minimize after every operation.
- -x
- Compile the state machines and emit an XML representation of the host data
and the machines.
- -V
- Generate a dot file for Graphviz.
- -p
- Display printable characters on labels.
- -S <spec>
- FSM specification to output.
- -M <machine>
- Machine definition/instantiation to output.
- -C
- The host language is C, C++, Obj-C or Obj-C++. This is the default host
language option.
- -D
- The host language is D.
- -J
- The host language is Java.
- -Z
- The host language is Go.
- -R
- The host language is Ruby.
- -L
- Inhibit writing of #line directives.
- -T0
- (C/D/Java/Ruby/C#/Go) Generate a table driven FSM. This is the default
code style. The table driven FSM represents the state machine as static
data. There are tables of states, transitions, indicies and actions. The
current state is stored in a variable. The execution is a loop that looks
that given the current state and current character to process looks up the
transition to take using a binary search, executes any actions and moves
to the target state. In general, the table driven FSM produces a smaller
binary and requires a less expensive host language compile but results in
slower running code. The table driven FSM is suitable for any FSM.
- -T1
- (C/D/Ruby/C#/Go) Generate a faster table driven FSM by expanding action
lists in the action execute code.
- -F0
- (C/D/Ruby/C#/Go) Generate a flat table driven FSM. Transitions are
represented as an array indexed by the current alphabet character. This
eliminates the need for a binary search to locate transitions and produces
faster code, however it is only suitable for small alphabets.
- -F1
- (C/D/Ruby/C#/Go) Generate a faster flat table driven FSM by expanding
action lists in the action execute code.
- -G0
- (C/D/C#/Go) Generate a goto driven FSM. The goto driven FSM represents the
state machine as a series of goto statements. While in the machine, the
current state is stored by the processor's instruction pointer. The
execution is a flat function where control is passed from state to state
using gotos. In general, the goto FSM produces faster code but results in
a larger binary and a more expensive host language compile.
- -G1
- (C/D/C#/Go) Generate a faster goto driven FSM by expanding action lists in
the action execute code.
- -G2
- (C/D/Go) Generate a really fast goto driven FSM by embedding action lists
in the state machine control code.
- -P<N>
- (C/D) N-Way Split really fast goto-driven FSM.
NOTE: This is a very brief description of Ragel input. Ragel is
described in more detail in the user guide available from the homepage (see
below).
Ragel normally passes input files straight to the output. When it
sees an FSM specification that contains machine instantiations it stops to
generate the state machine. If there are write statements (such as
"write exec") then ragel emits the corresponding code. There can
be any number of FSM specifications in an input file. A multi-line FSM
specification starts with '%%{' and ends with '}%%'. A single line FSM
specification starts with %% and ends at the first newline.
The basic machines are the base operands of the regular language
expressions.
- 'hello'
- Concat literal. Produces a concatenation of the characters in the string.
Supports escape sequences with '\'. The result will have a start state and
a transition to a new state for each character in the string. The last
state in the sequence will be made final. To make the string
case-insensitive, append an 'i' to the string, as in 'cmd'i.
- "hello"
- Identical to single quote version.
- [hello]
- Or literal. Produces a union of characters. Supports character ranges with
'-', negating the sense of the union with an initial '^' and escape
sequences with '\'. The result will have two states with a transition
between them for each character or range.
NOTE: '', "", and [] produce null FSMs. Null machines
have one state that is both a start state and a final state and match the
zero length string. A null machine may be created with the null builtin
machine.
- integer
- Makes a two state machine with one transition on the given integer
number.
- hex
- Makes a two state machine with one transition on the given hexidecimal
number.
- /simple_regex/
- A simple regular expression. Supports the notation '.', '*' and '[]',
character ranges with '-', negating the sense of an OR expression with and
initial '^' and escape sequences with '\'. Also supports one trailing
flag: i. Use it to produce a case-insensitive regular expression, as in
/GET/i.
- lit .. lit
- Specifies a range. The allowable upper and lower bounds are concat
literals of length one and number machines. For example, 0x10..0x20,
0..63, and 'a'..'z' are valid ranges.
- variable_name
- References the machine definition assigned to the variable name
given.
- builtin_machine
- There are several builtin machines available. They are all two state
machines for the purpose of matching common classes of characters. They
are:
- any
- Any character in the alphabet.
- ascii
- Ascii characters 0..127.
- extend
- Ascii extended characters. This is the range -128..127 for signed
alphabets and the range 0..255 for unsigned alphabets.
- alpha
- Alphabetic characters /[A-Za-z]/.
- digit
- Digits /[0-9]/.
- alnum
- Alpha numerics /[0-9A-Za-z]/.
- lower
- Lowercase characters /[a-z]/.
- upper
- Uppercase characters /[A-Z]/.
- xdigit
- Hexidecimal digits /[0-9A-Fa-f]/.
- cntrl
- Control characters 0..31.
- graph
- Graphical characters /[!-~]/.
- print
- Printable characters /[ -~]/.
- punct
- Punctuation. Graphical characters that are not alpha-numerics
/[!-/:-@\[-`{-~]/.
- space
- Whitespace /[\t\v\f\n\r ]/.
- null
- Zero length string. Equivalent to '', "" and [].
- empty
- Empty set. Matches nothing.
Operators are grouped by precedence, group 1 being the lowest and
group 6 the highest.
GROUP 1:
- expr , expr
- Join machines together without drawing any transitions, setting up a start
state or any final states. Start state must be explicitly specified with
the "start" label. Final states may be specified with the an
epsilon transitions to the implicitly created "final"
state.
GROUP 2:
- expr |
expr
- Produces a machine that matches any string in machine one or machine
two.
- expr &
expr
- Produces a machine that matches any string that is in both machine one and
machine two.
- expr -
expr
- Produces a machine that matches any string that is in machine one but not
in machine two.
- expr --
expr
- Strong Subtraction. Matches any string in machine one that does not have
any string in machine two as a substring.
GROUP 3:
- expr .
expr
- Produces a machine that matches all the strings in machine one followed by
all the strings in machine two.
- expr :>
expr
- Entry-Guarded Concatenation: terminates machine one upon entry to machine
two.
- expr :>>
expr
- Finish-Guarded Concatenation: terminates machine one when machine two
finishes.
- expr <:
expr
- Left-Guarded Concatenation: gives a higher priority to machine one.
NOTE: Concatenation is the default operator. Two machines next to
each other with no operator between them results in the concatenation
operation.
GROUP 4:
- label:
expr
- Attaches a label to an expression. Labels can be used by epsilon
transitions and fgoto and fcall statements in actions. Also note that the
referencing of a machine definition causes the implicit creation of label
by the same name.
GROUP 5:
- expr ->
label
- Draws an epsilon transition to the state defined by label. Label must be a
name in the current scope. Epsilon transitions are resolved when comma
operators are evaluated and at the root of the expression tree of machine
assignment/instantiation.
GROUP 6: Actions
An action may be a name predefined with an action statement or may
be specified directly with '{' and '}' in the expression.
- expr >
action
- Embeds action into starting transitions.
- expr @
action
- Embeds action into transitions that go into a final state.
- expr $
action
- Embeds action into all transitions. Does not include pending out
transitions.
- expr %
action
- Embeds action into pending out transitions from final states.
GROUP 6: EOF Actions
When a machine's finish routine is called the current state's EOF
actions are executed.
- expr >/
action
- Embed an EOF action into the start state.
- expr </
action
- Embed an EOF action into all states except the start state.
- expr $/
action
- Embed an EOF action into all states.
- expr %/
action
- Embed an EOF action into final states.
- expr @/
action
- Embed an EOF action into all states that are not final.
- expr <>/
action
- Embed an EOF action into all states that are not the start state and that
are not final (middle states).
GROUP 6: Global Error Actions
Global error actions are stored in states until the final state
machine has been fully constructed. They are then transferred to error
transitions, giving the effect of a default action.
- expr >!
action
- Embed a global error action into the start state.
- expr <!
action
- Embed a global error action into all states except the start state.
- expr $!
action
- Embed a global error action into all states.
- expr %!
action
- Embed a global error action into the final states.
- expr @!
action
- Embed a global error action into all states which are not final.
- expr <>!
action
- Embed a global error action into all states which are not the start state
and are not final (middle states).
GROUP 6: Local Error Actions
Local error actions are stored in states until the named machine
is fully constructed. They are then transferred to error transitions, giving
the effect of a default action for a section of the total machine. Note that
the name may be omitted, in which case the action will be transferred to
error actions upon construction of the current machine.
- expr >^
action
- Embed a local error action into the start state.
- expr <^
action
- Embed a local error action into all states except the start state.
- expr $^
action
- Embed a local error action into all states.
- expr %^
action
- Embed a local error action into the final states.
- expr @^
action
- Embed a local error action into all states which are not final.
- expr <>^
action
- Embed a local error action into all states which are not the start state
and are not final (middle states).
GROUP 6: To-State Actions
To state actions are stored in states and executed any time the
machine moves into a state. This includes regular transitions, and transfers
of control such as fgoto. Note that setting the current state from outside
the machine (for example during initialization) does not count as a
transition into a state.
- expr >~
action
- Embed a to-state action action into the start state.
- expr <~
action
- Embed a to-state action into all states except the start state.
- expr $~
action
- Embed a to-state action into all states.
- expr %~
action
- Embed a to-state action into the final states.
- expr @~
action
- Embed a to-state action into all states which are not final.
- expr <>~
action
- Embed a to-state action into all states which are not the start state and
are not final (middle states).
GROUP 6: From-State Actions
From state actions are executed whenever a state takes a
transition on a character. This includes the error transition and a
transition to self.
- expr >*
action
- Embed a from-state action into the start state.
- expr <*
action
- Embed a from-state action into every state except the start state.
- expr $*
action
- Embed a from-state action into all states.
- expr %*
action
- Embed a from-state action into the final states.
- expr @*
action
- Embed a from-state action into all states which are not final.
- expr <>*
action
- Embed a from-state action into all states which are not the start state
and are not final (middle states).
GROUP 6: Priority Assignment
Priorities are assigned to names within transitions. Only
priorities on the same name are allowed to interact. In the first form of
priorities the name defaults to the name of the machine definition the
priority is assigned in. Transitions do not have default priorities.
- expr >
int
- Assigns the priority int in all transitions leaving the start state.
- expr @
int
- Assigns the priority int in all transitions that go into a final
state.
- expr $
int
- Assigns the priority int in all existing transitions.
- expr %
int
- Assigns the priority int in all pending out transitions.
A second form of priority assignment allows the programmer to
specify the name to which the priority is assigned, allowing interactions to
cross machine definition boundaries.
- expr >
(name,int)
- Assigns the priority int to name in all transitions leaving the start
state.
- expr @ (name,
int)
- Assigns the priority int to name in all transitions that go into a final
state.
- expr $ (name,
int)
- Assigns the priority int to name in all existing transitions.
- expr % (name,
int)
- Assigns the priority int to name in all pending out transitions.
GROUP 7:
- expr *
- Produces the kleene star of a machine. Matches zero or more repetitions of
the machine.
- expr **
- Longest-Match Kleene Star. This version of kleene star puts a higher
priority on staying in the machine over wrapping around and starting over.
This operator is equivalent to ( ( expr ) $0 %1 )*.
- expr ?
- Produces a machine that accepts the machine given or the null string. This
operator is equivalent to ( expr | '' ).
- expr +
- Produces the machine concatenated with the kleen star of itself. Matches
one or more repetitions of the machine. This operator is equivalent to (
expr . expr* ).
- expr {n}
- Produces a machine that matches exactly n repetitions of expr.
- expr
{,n}
- Produces a machine that matches anywhere from zero to n repetitions of
expr.
- expr
{n,}
- Produces a machine that matches n or more repetitions of expr.
- expr
{n,m}
- Produces a machine that matches n to m repetitions of expr.
GROUP 8:
- ! expr
- Produces a machine that matches any string not matched by the given
machine. This operator is equivalent to ( *extend - expr ).
- ^ expr
- Character-Level Negation. Matches any single character not matched by the
single character machine expr.
GROUP 9:
- ( expr )
- Forces precedence on operators.
- fc
- The current character. Equivalent to *p.
- fpc
- A pointer to the current character. Equivalent to p.
- fcurs
- An integer value representing the current state.
- ftargs
- An integer value representing the target state.
- fentry(<label>)
- An integer value representing the entry point <label>.
- fhold;
- Do not advance over the current character. Equivalent to --p;.
- fexec
<expr>;
- Sets the current character to something else. Equivalent to p =
(<expr>)-1;
- fgoto
<label>;
- Jump to the machine defined by <label>.
- fgoto
*<expr>;
- Jump to the entry point given by <expr>. The expression must
evaluate to an integer value representing a state.
- fnext
<label>;
- Set the next state to be the entry point defined by <label>. The
fnext statement does not immediately jump to the specified state. Any
action code following the statement is executed.
- fnext
*<expr>;
- Set the next state to be the entry point given by <expr>. The
expression must evaluate to an integer value representing a state.
- fcall
<label>;
- Call the machine defined by <label>. The next fret will jump to the
target of the transition on which the action is invoked.
- fcall
*<expr>;
- Call the entry point given by <expr>. The next fret will jump to the
target of the transition on which the action is invoked.
- fret;
- Return to the target state of the transition on which the last fcall was
made.
- fbreak;
- Save the current state and immediately break out of the machine.
Ragel was written by Adrian Thurston
<thurston@complang.org>. Objective-C output contributed by Erich
Ocean. D output contributed by Alan West. Ruby output contributed by Victor
Hugo Borja. C Sharp code generation contributed by Daniel Tang.
Contributions to Java code generation by Colin Fleming. Go code generation
contributed by Justine Tunney.