agrep - search a file for a string or regular expression, with
approximate matching capabilities
agrep [ -#cdehiklnpstvwxBDGIS ] pattern [ -f
patternfile ] [ filename... ]
agrep searches the input filenames (standard input
is the default, but see a warning under LIMITATIONS) for records containing
strings which either exactly or approximately match a pattern.
A record is by default a line, but it can be defined differently using the
-d option (see below). Normally, each record found is copied to the standard
output. Approximate matching allows finding records that contain the pattern
with several errors including substitutions, insertions, and deletions. For
example, Massechusets matches Massachusetts with two errors (one
substitution and one insertion). Running agrep -2 Massechusets foo
outputs all lines in foo containing any string with at most 2 errors from
Massechusets.
agrep supports many kinds of queries including arbitrary
wild cards, sets of patterns, and in general, regular expressions. See
PATTERNS below. It supports most of the options supported by the grep
family plus several more (but it is not 100% compatible with grep). For more
information on the algorithms used by agrep see Wu and Manber, "Fast
Text Searching With Errors," Technical report #91-11, Department of
Computer Science, University of Arizona, June 1991 (available by anonymous
ftp from cs.arizona.edu in agrep/agrep.ps.1), and Wu and Manber, "Agrep
-- A Fast Approximate Pattern Searching Tool", To appear in USENIX
Conference 1992 January (available by anonymous ftp from cs.arizona.edu in
agrep/agrep.ps.2).
As with the rest of the grep family, the characters
`$', `^', `∗', `[',
`]', `^', `|', `(', `)',
`!', and `\' can cause unexpected results when included in the
pattern, as these characters are also meaningful to the shell. To
avoid these problems, one should always enclose the entire pattern argument
in single quotes, i.e., 'pattern'. Do not use double quotes (").
When agrep is applied to more than one input file, the name
of the file is displayed preceding each line which matches the pattern. The
filename is not displayed when processing a single file, so if you actually
want the filename to appear, use /dev/null as a second file in the
list.
- -#
- # is a non-negative integer (at most 8) specifying the maximum
number of errors permitted in finding the approximate matches (defaults to
zero). Generally, each insertion, deletion, or substitution counts as one
error. It is possible to adjust the relative cost of insertions, deletions
and substitutions (see -I -D and -S options).
- -c
- Display only the count of matching records.
- -d 'delim'
- Define delim to be the separator between two records. The default
value is '$', namely a record is by default a line. delim can be a
string of size at most 8 (with possible use of ^ and $), but not a regular
expression. Text between two delim's, before the first
delim, and after the last delim is considered as one record.
For example, -d '$$' defines paragraphs as records and -d '^From '
defines mail messages as records. agrep matches each record
separately. This option does not currently work with regular
expressions.
- -e pattern
- Same as a simple pattern argument, but useful when the
pattern begins with a `-'.
- -f
patternfile
- patternfile contains a set of (simple) patterns. The output is all
lines that match at least one of the patterns in patternfile.
Currently, the -f option works only for exact match and for simple
patterns (any meta symbol is interpreted as a regular character); it is
compatible only with -c, -h, -i, -l, -s, -v, -w, and -x options. see
LIMITATIONS for size bounds.
- -h
- Do not display filenames.
- -i
- Case-insensitive search — e.g., "A" and "a" are
considered equivalent.
- -k
- No symbol in the pattern is treated as a meta character. For example,
agrep -k 'a(b|c)*d' foo will find the occurrences of a(b|c)*d in foo
whereas agrep 'a(b|c)*d' foo will find substrings in foo that match the
regular expression 'a(b|c)*d'.
- -l
- List only the files that contain a match. This option is useful for
looking for files containing a certain pattern. For example, " agrep
-l 'wonderful' * " will list the names of those files in current
directory that contain the word 'wonderful'.
- -n
- Each line that is printed is prefixed by its record number in the
file.
- -p
- Find records in the text that contain a supersequence of the pattern. For
example, agrep -p DCS foo will match "Department of
Computer Science."
- -s
- Work silently, that is, display nothing except error messages. This is
useful for checking the error status.
- -t
- Output the record starting from the end of delim to (and including)
the next delim. This is useful for cases where delim should
come at the end of the record.
- -v
- Inverse mode — display only those records that do not
contain the pattern.
- -w
- Search for the pattern as a word — i.e., surrounded by
non-alphanumeric characters. The non-alphanumeric must surround the
match; they cannot be counted as errors. For example, agrep -w -1
car will match cars, but not characters.
- -x
- The pattern must match the whole line.
- -y
- Used with -B option. When -y is on, agrep will always output the best
matches without giving a prompt.
- -B
- Best match mode. When -B is specified and no exact matches are found,
agrep will continue to search until the closest matches (i.e., the ones
with minimum number of errors) are found, at which point the following
message will be shown: "the best match contains x errors, there are y
matches, output them? (y/n)" The best match mode is not supported for
standard input, e.g., pipeline input. When the -#, -c, or -l options are
specified, the -B option is ignored. In general, -B may be slower than -#,
but not by very much.
- -Dk
- Set the cost of a deletion to k (k is a positive integer).
This option does not currently work with regular expressions.
- -G
- Output the files that contain a match.
- -Ik
- Set the cost of an insertion to k (k is a positive integer).
This option does not currently work with regular expressions.
- -Sk
- Set the cost of a substitution to k (k is a positive
integer). This option does not currently work with regular
expressions.
agrep supports a large variety of patterns, including
simple strings, strings with classes of characters, sets of strings, wild
cards, and regular expressions.
- Strings
- any sequence of characters, including the special symbols `^' for
beginning of line and `$' for end of line. The special characters listed
above ( `$', `^', `∗', `[',
`^', `|', `(', `)', `!', and `\'
) should be preceded by `\' if they are to be matched as regular
characters. For example, \^abc\\ corresponds to the string ^abc\, whereas
^abc corresponds to the string abc at the beginning of a line.
- Classes of
characters
- a list of characters inside [] (in order) corresponds to any character
from the list. For example, [a-ho-z] is any character between a and h or
between o and z. The symbol `^' inside [] complements the list. For
example, [^i-n] denote any character in the character set except character
'i' to 'n'. The symbol `^' thus has two meanings, but this is consistent
with egrep. The symbol `.' (don't care) stands for any symbol (except for
the newline symbol).
- Boolean
operations
- agrep supports an `and' operation `;' and an `or' operation `,',
but not a combination of both. For example, 'fast;network' searches for
all records containing both words.
- Wild cards
- The symbol '#' is used to denote a wild card. # matches zero or any number
of arbitrary characters. For example, ex#e matches example. The symbol #
is equivalent to .* in egrep. In fact, .* will work too, because it is a
valid regular expression (see below), but unless this is part of an actual
regular expression, # will work faster.
- Combination
of exact and approximate matching
- any pattern inside angle brackets <> must match the text exactly
even if the match is with errors. For example, <mathemat>ics matches
mathematical with one error (replacing the last s with an a), but
mathe<matics> does not match mathematical no matter how many errors
we allow.
- Regular
expressions
- The syntax of regular expressions in agrep is in general the same
as that for egrep. The union operation `|', Kleene closure `*', and
parentheses () are all supported. Currently '+' is not supported. Regular
expressions are currently limited to approximately 30 characters
(generally excluding meta characters). Some options (-d, -w, -f, -t, -x,
-D, -I, -S) do not currently work with regular expressions. The maximal
number of errors for regular expressions that use '*' or '|' is 4.
- agrep -2 -c ABCDEFG
foo
- gives the number of lines in file foo that contain ABCDEFG within two
errors.
- agrep -1 -D2 -S2
'ABCD#YZ' foo
- outputs the lines containing ABCD followed, within arbitrary distance, by
YZ, with up to one additional insertion (-D2 and -S2 make deletions and
substitutions too "expensive").
- agrep -5 -p abcdefghij
/path/to/dictionary/words
- outputs the list of all words containing at least 5 of the first 10
letters of the alphabet in order. (Try it: any list starting with
academia and ending with sacrilegious must mean something!)
- agrep -1
'abc[0-9](de|fg)*[x-z]' foo
- outputs the lines containing, within up to one error, the string that
starts with abc followed by one digit, followed by zero or more
repetitions of either de or fg, followed by either x, y, or z.
- agrep -d '^From '
'breakdown;internet' mbox
- outputs all mail messages (the pattern '^From ' separates mail
messages in a mail file) that contain keywords 'breakdown' and
'internet'.
- agrep -d '$$' -1
'<word1> <word2>' foo
- finds all paragraphs that contain word1 followed by word2 with one error
in place of the blank. In particular, if word1 is the last word in a line
and word2 is the first word in the next line, then the space will be
substituted by a newline symbol and it will match. Thus, this is a way to
overcome separation by a newline. Note that -d '$$' (or another delim
which spans more than one line) is necessary, because otherwise agrep
searches only one line at a time.
- agrep '^agrep' <this
manual>
- outputs all the examples of the use of agrep in this man pages.
Any bug reports or comments will be appreciated! Please mail them
to sw@cs.arizona.edu or udi@cs.arizona.edu
Regular expressions do not support the '+' operator (match 1 or
more instances of the preceding token). These can be searched for by using
this syntax in the pattern:
'pattern(pattern)*'
(search for strings containing one instance of the pattern,
followed by 0 or more instances of the pattern).
The following can cause an infinite loop: agrep pattern *
> output_file. If the number of matches is high, they may be deposited in
output_file before it is completely read leading to more matches of the
pattern within output_file (the matches are against the whole directory).
It's not clear whether this is a "bug" (grep will do the same),
but be warned.
The maximum size of the patternfile is limited to be 250Kb,
and the maximum number of patterns is limited to be 30,000.
Standard input is the default if no input file is given. However,
if standard input is keyed in directly (as opposed to through a pipe, for
example) agrep may not work for some non-simple patterns.
There is no size limit for simple patterns. More complicated
patterns are currently limited to approximately 30 characters. Lines are
limited to 1024 characters. Records are limited to 48K, and may be truncated
if they are larger than that. The limit of record length can be changed by
modifying the parameter Max_record in agrep.h.
Exit status is 0 if any matches are found, 1 if none, 2 for syntax
errors or inaccessible files.
Sun Wu and Udi Manber, Department of Computer Science, University
of Arizona, Tucson, AZ 85721. {sw|udi}@cs.arizona.edu.