txt(3) | Library Functions Manual | txt(3) |
txt - standard text processing module
The Standard Text Processing module is an original implementation of an object collection dedicated to text processing. Although text scanning is the most common operation performed in the field of text processing, the module also provides specialized objects to store and index text data. Text sorting and transliteration are also part of this module.
Scanning concepts
Text scanning is the ability to extract lexical elements or lexemes from a
stream. A scanner or lexical analyzer is the principal object used to
perform this task. A scanner is built by adding special objects that act
as pattern matchers. When a pattern is matched, a special object called a
lexeme is returned.
Pattern object
A Pattern object is a special object that acts as a model for the string to
match. There are several ways to build a pattern. The simplest way to build
one is with a regular expression. Another type of pattern is the balanced
pattern. In its first form, a pattern object can be created with a regular
expression object.
# create a pattern object
const pat (afnix:txt:Pattern "$d+")
In this example, the pattern object is built to match integers.
pat:check "123" # true
pat:match "123" # 123
The check method returns true if the input string matches the pattern. The match method returns the string that matches the pattern. Since the pattern object can also operate with stream objects, the match method is the appropriate way to extract a matching string. The pattern object is, as usual, available with the appropriate predicate.
afnix:txt:pattern-p pat # true
Another form of pattern object is the balanced pattern. A balanced pattern is defined by a starting string and an ending string. There are two types of balanced patterns. One is the single balanced pattern and the other is the recursive balanced pattern. The single balanced pattern is appropriate for those lexical elements that are delimited by a single character. For example, the classical C-string is a single balanced pattern delimited by the double-quote character.
# create a balanced pattern
const pat (afnix:txt:Pattern "ELEMENT" "<" ">")
pat:check "<xml>" # true
pat:match "<xml>" # xml
In the case of the C-string, the pattern might be more appropriately defined with an additional escape character. Such a character is used by the pattern matcher to grab characters that might be part of the pattern definition.
# create a balanced pattern with an escape character
const pat (afnix:txt:Pattern "STRING" "'" '\')
pat:check "'hello'" # true
pat:match "'hello'" # "hello"
In this form, a balanced pattern with an escape character is created. The same string is used for both the starting and ending strings. Another constructor that takes two strings can be used when the starting and ending strings differ. The last pattern form is the recursive balanced form. In this form, a starting and an ending string are used to delimit the pattern, but a recursive use of the starting and ending strings is allowed. In order to have an exact match, the number of starting strings must equal the number of ending strings. For example, the C-comment pattern can be viewed as a recursive balanced pattern.
# create a c-comment pattern
const pat (afnix:txt:Pattern "COMMENT" "/*" "*/")
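Assuming the pattern above was created in its recursive form, a nested comment illustrates the balancing rule: the match succeeds only when each starting string is paired with an ending string. A sketch:

```
# check a nested comment against the recursive pattern
pat:check "/* level 1 /* level 2 */ level 1 */"
```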
Lexeme object
The Lexeme object is the object built by a scanner that contains the matched
string. A lexeme is therefore a tagged string. Additionally, a lexeme can
carry additional information like a source name and index.
# create an empty lexeme
const lexm (afnix:txt:Lexeme)
afnix:txt:lexeme-p lexm # true
The default lexeme is created without any value. A value can be set with the set-value method and retrieved with the get-value method.
lexm:set-value "hello"
lexm:get-value # hello
The set-tag and get-tag methods are similar, except that they operate with an integer. The source name and index are handled the same way with their respective methods.
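Mirroring the set-value example, the tag is set and queried with an integer; a short sketch, with an arbitrary tag value:

```
# set and get the lexeme tag
lexm:set-tag 1000
lexm:get-tag # 1000
```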
# check for the source
lexm:set-source "world"
lexm:get-source # world
# check for the source index
lexm:set-index 2000
lexm:get-index # 2000
Text scanning
Text scanning is the ability to extract lexical elements or lexemes from an
input stream. Generally, the lexemes are the results of a matching operation
which is defined by a pattern object. As a result, the definition of a
scanner object is the object itself plus one or several pattern objects.
Scanner construction
By default, a scanner is created without pattern objects. The length method
returns the number of pattern objects. As usual, a predicate is associated
with the scanner object.
# the default scanner
const scan (afnix:txt:Scanner)
afnix:txt:scanner-p scan # true
# the length method
scan:length # 0
The scanner construction proceeds by adding pattern objects. Each pattern can be created independently and later added to the scanner. For example, a scanner that reads reals, integers and strings can be defined as follows:
# create the scanner patterns
const REAL    (afnix:txt:Pattern "REAL"    [$d+.$d*])
const STRING  (afnix:txt:Pattern "STRING"  "\"" '\')
const INTEGER (afnix:txt:Pattern "INTEGER" [$d+|"0x"$x+])
# add the patterns to the scanner
scan:add INTEGER REAL STRING
The order of pattern integration defines the priority at which a token is recognized. The symbol name for each pattern is optional, since the functional style of the language permits creating the patterns directly in the add call. This writing style makes the scanner definition easier to read.
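As a sketch of that direct style, a pattern may be created inline in the add call, assuming the scan object from the previous example:

```
# add a pattern directly to the scanner
scan:add (afnix:txt:Pattern "REAL" [$d+.$d*])
```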
Using the scanner
Once constructed, the scanner can be used as is. A stream is generally the
best way to operate. If the scanner reaches the end-of-stream or cannot
recognize a lexeme, the nil object is returned. With a loop, it is easy to
get all lexemes.
while (trans valid (is:valid-p)) {
  # try to get the lexeme
  trans lexm (scan:scan is)
  # check for nil lexeme and print the value
  if (not (nil-p lexm)) (println (lexm:get-value))
  # update the valid flag
  valid:= (and (is:valid-p) (not (nil-p lexm)))
}
In this loop, it is first necessary to check for the end of the stream. This is done with the help of the special loop construct that initializes the valid symbol. As soon as the lexeme is built, it can be used. The lexeme holds the value as well as its tag.
Text sorting
Sorting is one of the primary functions implemented inside the text processing
module. There are three sorting functions available in the module.
Ascending and descending order sorting
The sort-ascent function operates with a vector object and sorts the elements
in ascending order. Any kind of object can be sorted as long as it
supports a comparison method. The elements are sorted in place by using a
quick sort algorithm.
# create an unsorted vector
const v-i (Vector 7 5 3 4 1 8 0 9 2 6)
# sort the vector in place
afnix:txt:sort-ascent v-i
# print the vector
for (e) (v-i) (println e)
The sort-descent function is similar to the sort-ascent function except that the objects are sorted in descending order.
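A minimal sketch of the descending sort, mirroring the sort-ascent example:

```
# create an unsorted vector
const v-d (Vector 7 5 3 4 1 8 0 9 2 6)
# sort the vector in place, in descending order
afnix:txt:sort-descent v-d
# print the vector: 9 8 7 6 5 4 3 2 1 0
for (e) (v-d) (println e)
```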
Lexical sorting
The sort-lexical function operates with a vector object and sorts the elements
in ascending order using a lexicographic ordering relation. Objects in the
vector must be literal objects or an exception is raised.
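A minimal sketch with strings, which are literal objects and can therefore be ordered lexicographically:

```
# create a vector of strings
const v-s (Vector "orange" "apple" "grape")
# sort the vector in place, lexicographically
afnix:txt:sort-lexical v-s
# print the vector: apple grape orange
for (e) (v-s) (println e)
```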
Transliteration
Transliteration is the process of changing characters by mapping one to
another. The transliteration process operates with a source character
and produces a target character with the help of a mapping table. The
transliteration process is not necessarily reversible, as often noted in
the literature.
Literate object
The Literate object is a transliteration object that is bound by default to
the identity mapping. As usual, a predicate is associated with the
object.
# create a transliterate object
const tl (afnix:txt:Literate)
# check the object
afnix:txt:literate-p tl # true
The transliteration process can also operate with an escape character in order to map a two-character sequence into a single character, as commonly found in programming languages.
# create a transliterate object with an escape character
const tl (afnix:txt:Literate '\')
Transliteration configuration
The set-map method configures the transliteration mapping table while the
set-escape-map method configures the escape mapping table. The mapping is done
by setting the source character and the target character. For instance, if one
wants to map the tabulation character to a white space, the mapping table is
set as follows:
tl:set-map '\t' ' '
The escape mapping table operates the same way. It should be noted that the mapping algorithm first translates the input character, possibly yielding an escape character, and then the escape mapping takes place. Note also that the set-escape method can be used to set the escape character.
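A sketch of the escape configuration, assuming the backslash as the escape character and that set-escape-map takes a source and a target character like set-map does:

```
# set the escape character
tl:set-escape '\'
# map the escaped sequence \n to a white space
tl:set-escape-map 'n' ' '
```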
Transliteration process
The transliteration process is done either with a string or an input stream.
In the first case, the translate method operates with a string and returns a
translated string. On the other hand, the read method returns a character
when operating with a stream.
# set the mapping characters
tl:set-map 'h' 'w'
tl:set-map 'e' 'o'
tl:set-map 'l' 'r'
tl:set-map 'o' 'd'
# translate a string
tl:translate "helo" # word
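The stream form can be sketched as follows, assuming an input stream bound to the symbol is; each read call returns one translated character:

```
# translate characters from an input stream
while (is:valid-p) (println (tl:read is))
```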
Pattern
The Pattern class is a pattern matching class based either on a regular
expression or on a balanced string. In the regex mode, the pattern is defined
with a regex and a match is said to occur when a regex match is achieved.
In the balanced string mode, the pattern is defined with a start string and
an end string. The balanced mode can be single or recursive.
Additionally, an escape character can be associated with the class. A name
and a tag are also bound to the pattern object as a means to ease the
integration within a scanner.
Predicate
Inheritance
Constructors
Constants
Methods
Lexeme
The Lexeme class is a literal object that is designed to hold a matching
pattern. A lexeme consists of a string (i.e. the lexeme value), a tag and
possibly a source name (i.e. file name) and a source index (i.e. line
number).
Predicate
Inheritance
Constructors
Methods
Scanner
The Scanner class is a text scanner or lexical analyzer that operates on an
input stream and matches one or several patterns. The scanner is
built by adding patterns to the scanner object. With an input stream, the
scanner object attempts to build a buffer that matches at least one pattern.
When such a match occurs, a lexeme is built. When building a lexeme, the
pattern tag is used to mark the lexeme.
Predicate
Inheritance
Constructors
Methods
Literate
The Literate class is a transliteration mapping class. Transliteration is the
process of changing characters by mapping one to another. The
transliteration process operates with a source character and produces a
target character with the help of a mapping table. This transliteration
object can also operate with an escape table. In the presence of an escape
character, an escape mapping table is used instead of the regular one.
Predicate
Inheritance
Constructors
Methods
Functions
AFNIX | AFNIX Module |