PEG(1) | General Commands Manual | PEG(1) |
peg, leg - parser generators
peg [-hvV -ooutput] [filename ...]
leg [-hvV -ooutput] [filename ...]
peg and leg are tools for generating recursive-descent parsers: programs that perform pattern matching on text. They process a Parsing Expression Grammar (PEG) [Ford 2004] to produce a program that recognises legal sentences of that grammar. peg processes PEGs written using the original syntax described by Ford; leg processes PEGs written using slightly different syntax and conventions that are intended to make it an attractive replacement for parsers built with lex(1) and yacc(1). Unlike lex and yacc, peg and leg support unlimited backtracking, provide ordered choice as a means for disambiguation, and can combine scanning (lexical analysis) and parsing (syntactic analysis) into a single activity.
peg reads the specified filenames, or standard input if no filenames are given, for a grammar describing the parser to generate. peg then generates a C source file that defines a function yyparse(). This C source file can be included in, or compiled and then linked with, a client program. Each time the client program calls yyparse() the parser consumes input text according to the parsing rules, starting from the first rule in the grammar. yyparse() returns non-zero if the input could be parsed according to the grammar; it returns zero if the input could not be parsed.
The prefix 'yy' or 'YY' is prepended to all externally-visible symbols in the generated parser. This is intended to reduce the risk of namespace pollution in client programs. (The choice of 'yy' is historical; see lex(1) and yacc(1), for example.)
peg and leg provide the following options (reconstructed here from the synopsis above):
-h
prints a summary of the available options and then exits.
-o output
writes the generated parser to the file output instead of to the standard output.
-v
writes verbose information to standard error while working.
-V
writes version information to standard error and then exits.
The following peg input specifies a grammar with a single rule (called 'start') that is satisfied when the input contains the string "username".
start <- "username"
(The quotation marks are not part of the matched text; they serve to indicate a literal string to be matched.) In other words, yyparse() in the generated C source will return non-zero only if the next eight characters read from the input spell the word "username". If the input contains anything else, yyparse() returns zero and no input will have been consumed. (Subsequent calls to yyparse() will also return zero, since the parser is effectively blocked looking for the string "username".) To ensure progress we can add an alternative clause to the 'start' rule that will match any single character if "username" is not found.
start <- "username"
/ .
yyparse() now always returns non-zero (except at the very end of the input). To do something useful we can add actions to the rules. These actions are performed after a complete match is found (starting from the first rule) and are chosen according to the 'path' taken through the grammar to match the input. (Linguists would call this path a 'phrase marker'.)
start <- "username" { printf("%s\n", getlogin()); }
/ < . > { putchar(yytext[0]); }
The first line instructs the parser to print the user's login name whenever it sees "username" in the input. If that match fails, the second line tells the parser to echo the next character on the input to the standard output. Our parser is now performing useful work: it will copy the input to the output, replacing all occurrences of "username" with the user's login name.
Note the angle brackets ('<' and '>') that were added to the second alternative. These have no effect on the meaning of the rule, but serve to delimit the text made available to the following action in the variable yytext.
If the above grammar is placed in the file username.peg, running the command
peg -o username.c username.peg
will save the corresponding parser in the file username.c. To create a complete program this parser could be included by a C program as follows.
#include <stdio.h> /* printf(), putchar() */
#include <unistd.h> /* getlogin() */
#include "username.c" /* yyparse() */
int main()
{
while (yyparse()) /* repeat until EOF */
;
return 0;
}
A grammar consists of a set of named rules.
name <- pattern
The pattern contains one or more of the following elements.
name
The name of another rule, which is matched at this point as if its pattern appeared in place of the name.
"characters" or 'characters'
A literal string, matched character by character. Single and double quotation marks are interchangeable.
[characters]
A class of characters, matching any single character listed between the brackets. A range of characters can be written as 'a-z', and a class beginning with '^' matches any single character not listed.
.
A dot matches any single character. (Note that the only time this fails is at the end of input, where there is no character to match.)
( pattern )
Parentheses group a pattern into a single element.
{ action }
An action: arbitrary C source code to be executed after a complete match is found (see below).
< and >
Angle brackets delimit the text that is made available to subsequent actions in the variable yytext.
For example, the following matches any single letter or underscore.
[a-zA-Z_]
Similarly, the following matches any single non-digit character.
[^0-9]
The above elements can be made optional and/or repeatable with the following suffixes:
element?
The element is optional: it is matched if possible, but the pattern succeeds either way.
element*
The element is repeated zero or more times.
element+
The element is repeated one or more times (it must match at least once).
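As an illustration, the following fragment (rule names invented for this example) uses each suffix: a number is an optional sign, one or more digits, and any amount of trailing whitespace.
number <- sign? digit+ space*
sign   <- [-+]
digit  <- [0-9]
space  <- ' ' / '\t'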
The above elements and suffixes can be converted into predicates (that match arbitrary input text and subsequently succeed or fail without consuming that input) with the following prefixes:
&element
Succeeds only if element matches at this point; no input is consumed.
!element
Succeeds only if element does not match at this point; no input is consumed.
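For example, the following sketch (rule names invented here) uses '!' to keep a keyword from being recognised as the prefix of an identifier, and '&' to require a digit to be present without consuming it:
keyword    <- "while" ![a-zA-Z_0-9]
identifier <- !keyword [a-zA-Z_] [a-zA-Z_0-9]*
number     <- &[0-9] [0-9]+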
For example, the following pattern matches the end of file, after the last character of the input has already been consumed.
!.
A special form of the '&' predicate is provided:
&{ expression }
In this predicate, expression is an arbitrary C expression that is evaluated when the predicate is tested; the predicate succeeds only if the expression yields a non-zero ('true') result.
Several elements (with or without prefixes and suffixes) can be combined into a sequence by writing them one after the other. The entire sequence matches only if each individual element within it matches, from left to right.
Sequences can be separated into disjoint alternatives by the alternation operator '/'.
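Because alternatives are tried strictly in order, a longer literal must be listed before any of its prefixes. In the hypothetical rule below, swapping the alternatives would make "interface" unreachable, since "in" would always match first:
keyword <- "interface" / "int" / "in"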
Finally, the pound sign (#) introduces a comment (discarded) that continues until the end of the line.
To summarise the above, the parser tries to match the input text against a pattern containing literals, names (representing other rules), and various operators (written as prefixes, suffixes, juxtaposition for sequencing, and an infix alternation operator) that modify how the elements within the pattern are matched. Matches are made from left to right, 'descending' into named sub-rules as they are encountered. If the matching process fails, the parser 'backtracks' ('rewinding' the input appropriately in the process) to find the nearest alternative 'path' through the grammar. In other words the parser performs a depth-first, left-to-right search for the first successfully-matching path through the rules. If one is found, the actions along the successful path are executed (in the order they were encountered).
Note that predicates are evaluated immediately during the search for a successful match, since they contribute to the success or failure of the search. Actions, however, are evaluated only after a successful match has been found.
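For instance, given the input "interface", the first alternative below matches "inter", fails on "net", and the parser rewinds; only the second alternative's action is ever executed, the action attached to the failed path being discarded along with it:
word <- "inter" "net" { puts("internet"); }
      / "interface"   { puts("interface"); }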
The grammar for peg grammars is shown below. This will both illustrate and formalise the above description.
Grammar <- Spacing Definition+ EndOfFile
Definition <- Identifier LEFTARROW Expression
Expression <- Sequence ( SLASH Sequence )*
Sequence <- Prefix*
Prefix <- AND Action
/ ( AND / NOT )? Suffix
Suffix <- Primary ( QUERY / STAR / PLUS )?
Primary <- Identifier !LEFTARROW
/ OPEN Expression CLOSE
/ Literal
/ Class
/ DOT
/ Action
/ BEGIN
/ END
Identifier <- < IdentStart IdentCont* > Spacing
IdentStart <- [a-zA-Z_]
IdentCont <- IdentStart / [0-9]
Literal <- ['] < ( !['] Char )* > ['] Spacing
/ ["] < ( !["] Char )* > ["] Spacing
Class <- '[' < ( !']' Range )* > ']' Spacing
Range <- Char '-' Char / Char
Char <- '\\' [abefnrtv'"\[\]\\]
/ '\\' [0-3][0-7][0-7]
/ '\\' [0-7][0-7]?
/ '\\' '-'
/ !'\\' .
LEFTARROW <- '<-' Spacing
SLASH <- '/' Spacing
AND <- '&' Spacing
NOT <- '!' Spacing
QUERY <- '?' Spacing
STAR <- '*' Spacing
PLUS <- '+' Spacing
OPEN <- '(' Spacing
CLOSE <- ')' Spacing
DOT <- '.' Spacing
Spacing <- ( Space / Comment )*
Comment <- '#' ( !EndOfLine . )* EndOfLine
Space <- ' ' / '\t' / EndOfLine
EndOfLine <- '\r\n' / '\n' / '\r'
EndOfFile <- !.
Action <- '{' < [^}]* > '}' Spacing
BEGIN <- '<' Spacing
END <- '>' Spacing
leg is a variant of peg that adds some features of lex(1) and yacc(1). It differs from peg in the following ways.
%{ declarations... %}
A declaration section can appear anywhere a rule definition is expected. The text between the delimiters is copied verbatim into the generated C source code, before the code that implements the parser.
name = pattern
The 'assignment' operator '=' replaces the left arrow '<-', and rule names may contain hyphens (each hyphen becomes an underscore in the generated C source).
seq-1 | seq-2
The alternation operator is the vertical bar '|' rather than the slash '/', and a rule definition may optionally be terminated with a semicolon ';'.
pattern ~{ action }
An error action, introduced by a tilde, is executed if the preceding pattern fails to match.
variable:rule
A rule invocation may be preceded by a variable name and a colon; within the actions of the enclosing rule the variable holds the semantic value returned by the invoked rule, and an action can set the semantic value of its own rule by assigning to '$$'.
%% trailer...
A double percent sign terminates the rules section; everything after it is copied verbatim into the generated C source code, after the parser implementation.
This example shows how ignored whitespace can be obvious when reading the grammar and yet unobtrusive when placed liberally at the end of every rule associated with a lexical element.
- = [ \t\n\r]*
number = [0-9]+ -
name = [a-zA-Z_][a-zA-Z_0-9]* -
l-paren = '(' -
r-paren = ')' -
The peg rule
name <- sequence-1
/ sequence-2
/ sequence-3
is therefore written
name = sequence-1
| sequence-2
| sequence-3
;
in leg (with the final semicolon being optional).
rule = e1 e2 e3 ~{ error("e[12] ok; e3 has failed"); }
| ...
rule = (e1 e2 e3) ~{ error("one of e[123] has failed"); }
| ...
The desk calculator example below illustrates the use of '$$' and ':'.
The extensions in leg described above allow useful parsers and evaluators (including declarations, grammar rules, and supporting C functions such as 'main') to be kept within a single source file. To illustrate this we show a simple desk calculator supporting the four common arithmetic operators and named variables. The intermediate results of arithmetic evaluation will be accumulated on an implicit stack by returning them as semantic values from sub-rules.
%{
#include <stdio.h> /* printf() */
#include <stdlib.h> /* atoi() */
int vars[26];
%}
Stmt = - e:Expr EOL { printf("%d\n", e); }
| ( !EOL . )* EOL { printf("error\n"); }
Expr = i:ID ASSIGN s:Sum { $$ = vars[i] = s; }
| s:Sum { $$ = s; }
Sum = l:Product
( PLUS r:Product { l += r; }
| MINUS r:Product { l -= r; }
)* { $$ = l; }
Product = l:Value
( TIMES r:Value { l *= r; }
| DIVIDE r:Value { l /= r; }
)* { $$ = l; }
Value = i:NUMBER { $$ = i; }
| i:ID !ASSIGN { $$ = vars[i]; }
| OPEN i:Expr CLOSE { $$ = i; }
NUMBER = < [0-9]+ > - { $$ = atoi(yytext); }
ID = < [a-z] > - { $$ = yytext[0] - 'a'; }
ASSIGN = '=' -
PLUS = '+' -
MINUS = '-' -
TIMES = '*' -
DIVIDE = '/' -
OPEN = '(' -
CLOSE = ')' -
- = [ \t]*
EOL = '\n' | '\r\n' | '\r' | ';'
%%
int main()
{
while (yyparse())
;
return 0;
}
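Assuming the grammar above is saved in a file named calc.leg (the file name is illustrative), it could be turned into a runnable calculator like this:
leg -o calc.c calc.leg
cc -o calc calc.c
echo 'a = 6 * 7' | ./calc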
The grammar for leg grammars is shown below. This will both illustrate and formalise the above description.
grammar = -
( declaration | definition )+
trailer? end-of-file
declaration = '%{' < ( !'%}' . )* > RPERCENT
trailer = '%%' < .* >
definition = identifier EQUAL expression SEMICOLON?
expression = sequence ( BAR sequence )*
sequence = error+
error = prefix ( TILDE action )?
prefix = AND action
| ( AND | NOT )? suffix
suffix = primary ( QUERY | STAR | PLUS )?
primary = identifier COLON identifier !EQUAL
| identifier !EQUAL
| OPEN expression CLOSE
| literal
| class
| DOT
| action
| BEGIN
| END
identifier = < [-a-zA-Z_][-a-zA-Z_0-9]* > -
literal = ['] < ( !['] char )* > ['] -
| ["] < ( !["] char )* > ["] -
class = '[' < ( !']' range )* > ']' -
range = char '-' char | char
char = '\\' [abefnrtv'"\[\]\\]
| '\\' [0-3][0-7][0-7]
| '\\' [0-7][0-7]?
| !'\\' .
action = '{' < braces* > '}' -
braces = '{' braces* '}'
| !'}' .
EQUAL = '=' -
COLON = ':' -
SEMICOLON = ';' -
BAR = '|' -
AND = '&' -
NOT = '!' -
QUERY = '?' -
STAR = '*' -
PLUS = '+' -
OPEN = '(' -
CLOSE = ')' -
DOT = '.' -
BEGIN = '<' -
END = '>' -
TILDE = '~' -
RPERCENT = '%}' -
- = ( space | comment )*
space = ' ' | '\t' | end-of-line
comment = '#' ( !end-of-line . )* end-of-line
end-of-line = '\r\n' | '\n' | '\r'
end-of-file = !.
The following symbols can be redefined in declaration sections to modify the generated parser code.
The default parsing function is equivalent to
int yyparse() { return yyparsefrom(yy_foo); }
where 'foo' is the name of the first rule in the grammar.
Note that if YY_CTX_LOCAL is defined (see below) then an additional first argument, containing the parser context, is passed to YY_INPUT.
#define YY_INPUT(buf, result, max_size) \
{ \
int yyc= getchar(); \
result= (EOF == yyc) ? 0 : (*(buf)= yyc, 1); \
}
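If input is to come from somewhere other than the standard input, YY_INPUT can be redefined. The sketch below delivers characters from an in-memory string one at a time; the names yy_str, yy_pos and drain are invented for this example, and only the buf/result/max_size interface comes from the manual.

```c
#include <string.h>

/* Input state for this sketch: a NUL-terminated string and a cursor. */
static const char *yy_str;  /* text to deliver to the parser */
static int         yy_pos;  /* index of the next character   */

/* Deliver one character per invocation; report 0 characters at end. */
#define YY_INPUT(buf, result, max_size)                     \
{                                                           \
  int yyc = yy_str[yy_pos];                                 \
  result = ('\0' == yyc) ? 0 : (yy_pos++, *(buf) = yyc, 1); \
}

/* Demonstration helper: pull characters through YY_INPUT into 'out'. */
static int drain(char *out, int max)
{
  int n = 0, result = 0;
  do {
    YY_INPUT(out + n, result, 1);
    n += result;
  } while (result && n < max);
  return n;  /* number of characters delivered */
}
```

In a real client the macro definition would simply precede the #include of the generated parser; drain exists only to demonstrate the macro's behaviour in isolation.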
The default definition of YY_BEGIN
#define YY_BEGIN (yybegin= yypos, 1)
therefore saves the current input position and returns 1 ('true') as the result of the predicate. The default definition of YY_END
#define YY_END (yyend= yypos, 1)
similarly saves the current input position in yyend.
The default definition
#define YY_PARSE(T) T
leaves yyparse() and yyparsefrom() with global visibility. If they should not be externally visible in other source files, this macro can be redefined to declare them 'static':
#define YY_PARSE(T) static T
Note that if this symbol is undefined then the compiled parser will statically allocate its global state and will be neither reentrant nor thread-safe. Note also that the yycontext structure is initialised lazily, the first time yyparse() is called with it; this lazy initialisation works correctly only if the structure contains all zeroes, so the structure must be cleared to zero before the first call to yyparse().
#include <stdio.h>
#define YY_CTX_LOCAL
#include "the-generated-parser.peg.c"
int main()
{
yycontext ctx;
memset(&ctx, 0, sizeof(yycontext));
while (yyparse(&ctx));
return 0;
}
The following variables can be referred to within actions.
char *yytext
The text matched by the most recent pair of angle-bracket delimiters '<' and '>'.
int yyleng
The number of characters in yytext.
yycontext *yy
A pointer to the current parser context (see YY_CTX_LOCAL and YY_CTX_MEMBERS above).
Programs that wish to release all the resources associated with a parser can use the following function.
void yyrelease(yycontext *yy)
Frees all the memory that the parser allocated within the given context while matching input.
Note that the storage for the yycontext structure itself is never allocated or reclaimed implicitly. The application must allocate these structures in automatic storage, or use calloc() and free() to manage them explicitly. The example in the following section demonstrates one approach to resource management.
The yy variable passed to actions contains the state of the parser plus any additional fields defined by YY_CTX_MEMBERS. These fields can be used to store application-specific information that is global to a particular call of yyparse(). A trivial but complete leg example follows in which the yycontext structure is extended with a count of the number of newline characters seen in the input so far (the grammar otherwise consumes and ignores the entire input). The caller of yyparse() uses this count to print the number of lines of input that were read.
%{
#define YY_CTX_LOCAL 1
#define YY_CTX_MEMBERS \
int count;
%}
Char = ('\n' | '\r\n' | '\r') { yy->count++; }
| .
%%
#include <stdio.h>
#include <string.h>
int main()
{
/* create a local parser context in automatic storage */
yycontext yy;
/* the context *must* be initialised to zero before first use */
memset(&yy, 0, sizeof(yy));
while (yyparse(&yy))
;
printf("%d newlines\n", yy.count);
/* release all resources associated with the context */
yyrelease(&yy);
return 0;
}
peg and leg warn about the following conditions while converting a grammar into a parser.
syntax error
The input grammar was malformed in some way.
rule 'name' used but not defined
The grammar referred to a rule named 'name' for which no definition was given.
rule 'name' defined but not used
The grammar defined a rule named 'name' and then never referred to it.
possible infinite left recursion in rule 'name'
The definition of the rule 'name' invokes itself before consuming any input.
Left recursion, especially that found in standards documents, is often 'direct' and implies trivial repetition.
For example, the following rule (describing C abstract declarators) is directly left-recursive.
# (6.7.6)
direct-abstract-declarator =
LPAREN abstract-declarator RPAREN
| direct-abstract-declarator? LBRACKET assign-expr? RBRACKET
| direct-abstract-declarator? LBRACKET STAR RBRACKET
| direct-abstract-declarator? LPAREN param-type-list? RPAREN
The recursion can easily be eliminated by converting the parts of the pattern following the recursion into a repeatable suffix.
# (6.7.6)
direct-abstract-declarator =
direct-abstract-declarator-head?
direct-abstract-declarator-tail*
direct-abstract-declarator-head =
LPAREN abstract-declarator RPAREN
direct-abstract-declarator-tail =
LBRACKET assign-expr? RBRACKET
| LBRACKET STAR RBRACKET
| LPAREN param-type-list? RPAREN
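The transformation above is an instance of a general pattern: a directly left-recursive rule of the form r = r a | b can always be rewritten as the iteration r = b a*. In leg notation (rule names invented here):
# left-recursive: rejected by peg and leg
expr = expr PLUS term | term
# equivalent iterative form: accepted
expr = term ( PLUS term )*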
A parser that accepts empty input will always succeed. Consider the following example, not atypical of a first attempt to write a PEG-based parser:
Program = Expression*
Expression = "whatever"
%%
int main() {
while (yyparse())
puts("success!");
return 0;
}
This program loops forever, no matter what (if any) input is provided on stdin. Many fixes are possible, the easiest being to insist that the parser always consumes some non-empty input. Changing the first line to
Program = Expression+
accomplishes this. If the parser is expected to consume the entire input, then explicitly requiring the end-of-file is also highly recommended:
Program = Expression+ !.
This works because the parser will only fail to match ("!" predicate) any character at all ("." expression) when it attempts to read beyond the end of the input.
You have to type 'man peg' to read the manual page for leg(1).
The 'yy' and 'YY' prefixes cannot be changed.
Left recursion is detected in the input grammar but is not handled correctly in the generated parser.
Diagnostics for errors in the input grammar are obscure and not particularly helpful.
The operators ! and ~ should really be named the other way around.
Several commonly-used lex(1) features (yywrap(), yyin, etc.) are completely absent.
The generated parser does not contain '#line' directives to direct C compiler errors back to the grammar description when appropriate.
D. Val Schorre, META II, a syntax-oriented compiler writing language, 19th ACM National Conference, 1964, pp. 41.301--41.311. Describes a self-implementing parser generator for analytic grammars with no backtracking.
Alexander Birman, The TMG Recognition Schema, Ph.D. dissertation, Princeton, 1970. A mathematical treatment of the power and complexity of recursive-descent parsing with backtracking.
Bryan Ford, Parsing Expression Grammars: A Recognition-Based Syntactic Foundation, ACM SIGPLAN Symposium on Principles of Programming Languages, 2004. Defines PEGs and analyses them in relation to context-free and regular grammars. Introduces the syntax adopted in peg.
The standard Unix utilities lex(1) and yacc(1) which influenced the syntax and features of leg.
The source code for peg and leg whose grammar parsers are written using themselves.
The latest version of this software and documentation:
http://piumarta.com/software/peg
peg, leg and this manual page were written by Ian Piumarta (first-name at last-name dot com) while investigating the viability of regular and parsing-expression grammars for efficiently extracting type and signature information from C header files.
Please send bug reports and suggestions for improvements to the author at the above address.
September 2013 | Version 0.1 |