extract++(1) | General Commands Manual | extract++(1) |
extract++ - SWISH++ text extractor
extract++ [ options ] directory... file...
extract++ is the SWISH++ text extractor, a utility to extract what text there is from a (mostly) binary file (similar to the strings(1) command) prior to indexing. Original files are untouched.
Text is extracted from the specified files and files in the specified directories; text from files in subdirectories of specified directories is also extracted by default (unless the -r, --no-recurse, -f, or --filter option or the RecurseSubdirs or ExtractFilter variable is given).
Ordinarily, text is extracted from a file either only if its filename matches one of the patterns in the set specified with the -e or --pattern option or the IncludeFile variable (unless standard input is used; see next paragraph), or only if its filename is not among the set specified with the -E or --no-pattern option or the ExcludeFile variable.
If there is a single filename of `-', the list of directories and files to extract is instead taken from standard input (one per line). In this case, filename patterns of files to extract need not be specified explicitly: all files are extracted regardless of whether they match a pattern, unless they are among the set not to extract specified with either the -E or --no-pattern option or the ExcludeFile variable. In other words, extract++ assumes you know what you're doing when specifying filenames in this manner.
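The selection rules in the two paragraphs above can be sketched as a small predicate. This is only one reading of the text, not the extract++ implementation: the function name should_extract, its parameters, and the use of Python's fnmatchcase for pattern matching are all assumptions made for illustration. The behavior when no include patterns are given is also a guess.

```python
from fnmatch import fnmatchcase
from os.path import basename


def should_extract(path, include, exclude, from_stdin=False):
    """Return True if `path` would be selected under this reading of the rules.

    `include` and `exclude` stand in for the -e/IncludeFile and
    -E/ExcludeFile pattern sets (hypothetical parameter names).
    """
    name = basename(path)
    # Exclude patterns apply in every mode, including standard input.
    if any(fnmatchcase(name, pattern) for pattern in exclude):
        return False
    # Names read from standard input need not match an include pattern.
    if from_stdin:
        return True
    # If include patterns were given, the name must match one of them.
    if include:
        return any(fnmatchcase(name, pattern) for pattern in include)
    # Assumption: with no include patterns, anything not excluded is accepted.
    return True
```

For example, with an include set of '*.doc', a file named a.html is skipped when given on the command line but accepted when its name arrives on standard input, unless an exclude pattern also matches it.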
Ordinarily, the text extracted from a file is written to another file in the same directory having the same filename but with the ``.txt'' extension appended by default, e.g., ``foo.doc'' becomes ``foo.doc.txt'' after extraction. (See also the -x or --extension option or the ExtractExtension variable.) However, extraction is not performed if the extracted text file already exists.
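The naming and skip-if-present behavior described above can be mirrored in a few lines, for instance to predict which files a run would touch. The helper names extracted_path and needs_extraction are hypothetical, and the sketch assumes the extension is given without its leading dot; the text does not say which form the -x option expects.

```python
import os


def extracted_path(path, extension="txt"):
    """Name of the text file for `path`: the extension is appended, not substituted."""
    return f"{path}.{extension}"


def needs_extraction(path, extension="txt"):
    """True when no extracted text file exists yet for `path`."""
    return not os.path.exists(extracted_path(path, extension))
```

Under these assumptions, extracted_path("foo.doc") gives "foo.doc.txt", matching the example in the text.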
If either the -f or --filter option or the ExtractFilter variable is given, then only a single file specified on the command line is extracted to standard output. In this case, filename patterns are not used and the existence of an extracted text file is irrelevant.
Via the FilterFile configuration file variable, files having particular patterns can be filtered prior to extraction. (See the examples in swish++.conf(5).)
extract++ performs the same character mapping, character entity conversions, and word determination heuristics used by index++(1), and additionally:
extract++ was developed to be able to index non-text files in proprietary formats such as Microsoft Office documents. There are a couple of reasons why the functionality of extract++ isn't simply built into index++(1):
Options begin with either a `-' for short options or a ``--'' for long options. Either a `-' or ``--'' by itself explicitly ends the options; however, the difference is that `-' is returned as the first non-option whereas ``--'' is skipped entirely. Long option names may be abbreviated so long as the abbreviation is unambiguous.
For a short option that takes an argument, the argument is either taken to be the remaining characters of the same option, if any, or, if not, is taken from the next command-line argument unless said argument begins with a `-'.
Short options that take no arguments can be grouped (but the last option in the group can take an argument), e.g., -lrv4 is equivalent to -l -r -v4.
For a long option that takes an argument, the argument is either taken to be the characters after a `=', if any, or, if not, is taken from the next command-line argument unless said argument begins with a `-'.
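The option rules above can be sketched as a small parser. This is an illustration of the rules as stated, not the parser extract++ uses: the names parse_args and take_next are hypothetical, the caller must say which options take arguments, abbreviation of long option names is left out, and raising an error for a missing argument is this sketch's own choice because the text does not say what happens in that case.

```python
def take_next(argv, i, option):
    """Return (value, new_index), using the next token as the option's argument."""
    if i < len(argv) and not argv[i].startswith("-"):
        return argv[i], i + 1
    # The text does not say what extract++ does here; raising is an assumption.
    raise ValueError(f"{option} requires an argument")


def parse_args(argv, short_with_arg="", long_with_arg=()):
    """Split argv into ([(option, value_or_None), ...], [non-options])."""
    options, i = [], 0
    while i < len(argv):
        token = argv[i]
        if token == "--":
            i += 1          # "--" ends the options and is skipped entirely
            break
        if token == "-" or not token.startswith("-"):
            break           # "-" ends the options and stays as the first non-option
        i += 1
        if token.startswith("--"):
            name, eq, value = token[2:].partition("=")
            if not eq:
                value = None
                if name in long_with_arg:
                    value, i = take_next(argv, i, "--" + name)
            options.append(("--" + name, value))
            continue
        group = token[1:]
        for pos, letter in enumerate(group):
            if letter in short_with_arg:
                # The argument is the rest of this token, else the next token.
                rest = group[pos + 1:]
                if rest:
                    value = rest
                else:
                    value, i = take_next(argv, i, "-" + letter)
                options.append(("-" + letter, value))
                break
            options.append(("-" + letter, None))
    return options, argv[i:]
```

Called with ["-lrv4", "."] and short_with_arg="v", the sketch yields the same options as ["-l", "-r", "-v4", "."], which is the equivalence the text describes.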
The following variables can be set in a configuration file. Variables and command-line options can be mixed.
To extract text from all Microsoft Office files on a web server:
    cd /home/www/htdocs
    extract++ -v3 -e '*.doc' -e '*.ppt' -e '*.xls' .
(See the examples in swish++.conf(5).)
Exits with one of the values given below:
index++(1), search++(1), strings(1), swish++.conf(5), glob(7)
Adobe Systems Incorporated. PostScript Language Reference Manual, 2nd ed. Addison-Wesley, Reading, MA. pp. 346-359.
International Standards Organization. ``ISO/IEC 9945-2: Information Technology -- Portable Operating System Interface (POSIX) -- Part 2: Shell and Utilities,'' 1993.
Paul J. Lucas <pauljlucas@mac.com>
November 1, 2002 | SWISH++ |