fst-compiler(1) | fst-compiler | fst-compiler(1) |
fst-compiler, fst-compiler-utf8 - Two compilers for SFST programs
fst-compiler grammar-file [ output-file ]
fst-compiler-utf8 grammar-file [ output-file ]
fst-compiler is a compiler for finite-state transducer programs. It generates a minimized finite state transducer which can be used with fst-mor, fst-infl, fst-print, fst-compare, fst-parse, and fst-lattice. The compact transducer representation which is generated with the -c flag, is supported by fst-infl2, fst-train, and fst-match. The memory-efficient transducer representation which is generated with the -l flag, is only supported by fst-infl3.
The first program argument is the name of a file which contains the transducer program. The programming language is described below. The second argument is the name of the file to which the resulting transducer will be written in binary form. If a second argument is missing, the output will be written to stdout.
fst-compiler-utf8 differs from fst-compiler only in the character encoding. fst-compiler-utf8 supports UTF8 encoding of the source files whereas fst-compiler is to be used for 8-Bit character codes like latin1 which are an extension of the ASCII code. Information about the encoding is stored in the transducer files and used by the other SFST programs.
A transducer program consists of an (optional) sequence of alphabet and variable definitions followed by a single transducer expression which defines the result transducer.
Alphabet
An alphabet definition consists of the keyword ALPHABET followed by = and some transducer expression e.g.
This command redefines the alphabet as the set of symbol pairs occurring on the transitions of the transducer. Occurrences of two-level operators, negation operators and unquoted periods always have to be preceded by an alphabet definition.
Variables
There are two different types of variables. Symbol set variables are enclosed by hash signs (#) and take symbol sequences (see below) as values:
Transducer variables are enclosed by dollar signs and take transducer expressions as values:
Variables whose name starts with the symbol `=' are special agreement variables. If an agreement variable occurs more than once in a transducer expression, it will always have the same value. Consider the following transducer program:
The result transducer recognizes the strings aXa, bXb, and cXc. Only acyclic transducers (i.e. transducers with a finite set of string mappings) can be assigned to agreement variables.
Symbols
A symbol is either
- a single character like A s 5,
- a quoted character like \* or \_,
- the null symbol <>.
Symbol sequence
A symbol sequence is a sequence of characters, multi-character symbols and character ranges, e.g. a-z \. <x>.
symbol range
A symbol range is either
- a single symbol
- a symbol sequence enclosed in square brackets like [A-Za-z] or
- a symbol sequence starting with ^ and enclosed in square brackets like [^A-Za-z] (designating the complement of [a-zA-Z]) or
- the period (which represents any symbol from the alphabet)
Transducer expressions
A transducer expression (TE) is recursively defined as follows:
[a-z]:[a-Z]
Further Features
Comments start with the symbol % and extend up to the end of the line. Blanks are ignored unless they are quoted. Expressions terminate at the end of a line unless the end of line is preceded by a backslash. The command
can be used to insert source code from a file named fname. The command
stores the regular expression RE in the file fname. The command
tells the compiler to use the Hopcroft minimisation algorithm from now on, and
switches back to the default minimisation algorithm (Brzozowski). The command
Here is an example of a simple transducer program. Assuming that the file "adj-stems" contains the two lines
easy late big
this transducer will correctly analyze the adjective forms easy, easier, easiest and late, later, and latest.
ALPHABET = [a-zA-Z] y:i e:<> <ADJ>:<>
$R$ = y<=>i (<ADJ>:<> e)
$R2$ = e<=><> (<ADJ>:<> e)
$R$ = $R$ & $R2$
$Stems$ = "adj-stems"
$S$ = $Stems$ <ADJ> (<pos>:<>|<cmp>:{er}|<sup>:{est})
$S$ || $R$
fst-compiler returns 0 unless some error occurs.
The compiler gets the operator precedence wrong in case of two-level rules and interprets the expression "ab c<=>d ef" as "a(b c<=>d (ef))". Therefore, you should always surround the left context of two-level rules with parenthesis: (ab) c<=>d (ef)
fst-mor, fst-infl, fst-infl2, fst-infl3, fst-print, fst-compact, fst-parse, fst-compare, fst-compact, fst-lowmem, fst-lattice, fst-train
Helmut Schmid, Institute for Computational Linguistics, University of Stuttgart, Email: schmid@ims.uni-stuttgart.de, This software is available under the GNU Public License.
December 2004 |