index++(1) | General Commands Manual | index++(1) |
index++ - SWISH++ indexer
index++ [ options ] directory... file...
index++ is the SWISH++ file indexer. It indexes the specified files and files in the specified directories; files in subdirectories of specified directories are also indexed by default (unless either the -r or --no-recurse option or the RecurseSubdirs variable is given). A file is indexed either only if its filename matches one of the patterns in the set specified with either the -e or --pattern option or the IncludeFile variable (unless standard input is used; see next paragraph), or only if its filename is not in the set specified with either the -E or --no-pattern option or the ExcludeFile variable.
If there is a single filename of `-', the list of directories and files to index is instead taken from standard input (one per line). In this case, filename patterns of files to index need not be specified explicitly: all files, regardless of whether they match a pattern (unless they are in the set not to index specified with either the -E or --no-pattern option or the ExcludeFile variable), are indexed, i.e., index++ assumes you know what you're doing when specifying filenames in this manner.
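The selection rules in the two paragraphs above can be sketched as follows. This is only an illustration, not the index++ implementation: the function name should_index and its parameters are hypothetical, the patterns are bare shell-style globs without the module prefix (such as html:) used on the command line, and it assumes that exclude patterns take precedence over include patterns.

```python
from fnmatch import fnmatchcase
import posixpath


def should_index(path, include_patterns, exclude_patterns, from_stdin=False):
    """Decide whether a file would be indexed (illustrative sketch only)."""
    name = posixpath.basename(path)
    # Assumption: a file in the "do not index" set is never indexed.
    if any(fnmatchcase(name, p) for p in exclude_patterns):
        return False
    # Filenames read from standard input need not match an include pattern.
    if from_stdin or not include_patterns:
        return True
    return any(fnmatchcase(name, p) for p in include_patterns)


print(should_index("docs/index.html", ["*.*htm*"], []))
print(should_index("docs/notes.bak", [], ["*.bak"], from_stdin=True))
```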
In any case, care must be taken not to specify files or subdirectories in directories that are also specified: since directories are recursively indexed by default (unless either the -r or --no-recurse option or the RecurseSubdirs variable is given), explicitly specifying a subdirectory or file in a directory that is also specified will result in those files being indexed more than once.
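A list of arguments can be checked for this kind of overlap before it is handed to index++. The sketch below is a hypothetical helper, not part of SWISH++: it compares paths purely by name (it does not touch the file system) and assumes that every argument that is a prefix of another argument is a directory.

```python
from pathlib import PurePosixPath


def nested_arguments(paths):
    """Return the arguments that lie inside another argument in the list."""
    parsed = [PurePosixPath(p) for p in paths]
    nested = []
    for original, candidate in zip(paths, parsed):
        # candidate.parents lists every enclosing directory of the path.
        if any(other in candidate.parents for other in parsed):
            nested.append(original)
    return nested


# "docs/api" is inside "docs", so it would be indexed twice.
print(nested_arguments(["docs", "docs/api", "src/main.c"]))
```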
Characters in the ISO 8859-1 (Latin 1) character set are mapped to their closest ASCII equivalent before further examination and indexing. (Individual indexing modules may also do their own character mapping.)
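The effect of such a mapping can be approximated as shown below. This is not the table index++ uses: it is a stand-in technique (Unicode decomposition followed by removal of combining marks), and the function name map_to_ascii is hypothetical. Characters with no decomposition, such as the German sharp s, are left unchanged by this sketch.

```python
import unicodedata


def map_to_ascii(text):
    """Strip accents from Latin-1 letters (approximation, see lead-in)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining accents that decomposition splits off.
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(map_to_ascii("résumé"))
```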
Stop words, words that occur too frequently or have no information content, are not indexed. (There is a default built-in set of a few hundred such English words.) Additionally, several heuristics are used to determine which words should not be indexed.
First, a word is checked to see if it looks like an acronym. A word is considered an acronym only if it starts with a capital letter and is composed exclusively of capital letters, digits, and punctuation symbols, e.g., ``AT&T.'' If a word looks like an acronym, it is indexed and no further checks are done.
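The acronym test described above can be written down directly. The sketch below is illustrative only; the function name is hypothetical, and it assumes the word has already been mapped to ASCII.

```python
import string

ACRONYM_CHARS = set(string.ascii_uppercase + string.digits + string.punctuation)


def looks_like_acronym(word):
    """True if the word starts with a capital letter and contains only
    capital letters, digits, and punctuation symbols."""
    if not word or word[0] not in string.ascii_uppercase:
        return False
    return all(c in ACRONYM_CHARS for c in word)


print(looks_like_acronym("AT&T"))
print(looks_like_acronym("At&t"))
```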
Second, there are several other checks that are applied. A word is not indexed if it:
Via the FilterFile configuration file variable, files matching particular patterns can be filtered prior to indexing. Via the FilterAttachment configuration file variable, e-mail attachments whose MIME types match particular patterns can be filtered prior to indexing. (See FILTERS in swish++.conf(5).)
In order to add words from new documents to an existing index, either the entire set of documents can be reindexed or the new documents alone can be incrementally indexed. In many cases, reindexing everything is sufficient since index++ is really fast. For a very large document set, however, this may use too many resources.
However, there is a pitfall for incremental indexing: if any of the -f, --word-files, -p, or --word-percent options or WordFilesMax or WordPercentMax variables are used, then words that are too frequent are discarded. If new documents are added containing very few of those words, then they could no longer be too frequent. However, there is no way to get them back since they were discarded.
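The way around this problem is not to discard any words, by specifying a maximum word percentage of 101% (with the -p or --word-percent option or the WordPercentMax variable). However, because no words are discarded, the size of the index file will be larger, perhaps significantly so.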
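The arithmetic behind the pitfall can be illustrated with a small sketch. The function below is hypothetical and is not taken from the index++ source; in particular, the use of "greater than or equal" is an assumption, chosen because it is consistent with 101% being the value that discards nothing.

```python
def is_too_frequent(files_with_word, total_files, max_percent):
    """True if the word occurs in at least max_percent of the files."""
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return files_with_word * 100 >= max_percent * total_files


# A word in 90 of 100 files is discarded at a limit of 80%.
print(is_too_frequent(90, 100, 80))
# After 100 new files without the word it would be at 45%,
# but it was already discarded and cannot be recovered.
print(is_too_frequent(90, 200, 80))
# With a limit of 101%, even a word in every file is kept.
print(is_too_frequent(100, 100, 101))
```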
In practice, the loss of words may not matter much, especially if new documents are very similar to old documents, so that words that were too frequent in the old set would also be too frequent in the new set.
Another way around this problem is to do periodic full indexing.
index++ is written in a modular fashion where different types of files have different indexing modules. Currently, there are 7 modules: Text (plain text), HTML (HTML and XHTML), ID3 (ID3 tags found in MP3 files), LaTeX, Mail (RFC 822 and Usenet News), Manual (Unix manual pages in nroff(1) with man(7) macros), and RTF (Rich Text Format).
The Text module simply indexes plain text files, performing character mapping and word determination as already described.
Additional processing is done for HTML and XHTML files. The additional processing is:
In compliance with the HTML specification, any one of no quotes, single quotes, or double quotes may be used to contain attribute values and attributes can appear in any order. Values containing whitespace, however, must be quoted. The specification is vague as to whether whitespace surrounding the = is legal, but index++ allows it.
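The three quoting styles and the optional whitespace around the = can be handled by a single regular expression, as in the sketch below. It only illustrates the rules stated above and is not how index++ parses tags; the names ATTRIBUTE_RE and parse_attributes are hypothetical.

```python
import re

# name, optional whitespace, =, optional whitespace, then a double-quoted,
# single-quoted, or unquoted value (unquoted values contain no whitespace).
ATTRIBUTE_RE = re.compile(
    r"""([A-Za-z_:][-\w:.]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+))"""
)


def parse_attributes(tag_text):
    """Return a dict of lower-cased attribute names to values."""
    attributes = {}
    for match in ATTRIBUTE_RE.finditer(tag_text):
        value = next(g for g in match.group(2, 3, 4) if g is not None)
        attributes[match.group(1).lower()] = value
    return attributes


print(parse_attributes("<IMG SRC = \"home.gif\" ALT='Home' CLASS=no_index>"))
```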
ID3 tags are used to store audio meta information for MP3 files (generally). Since audio files contain mostly binary information, only the ID3 tag text fields are indexed. ID3 tag versions 1.x and 2.x (through 2.4) are supported (except for encrypted frames). If a file contains both 1.x and 2.x tags, only the 2.x tag is indexed. The processing done for files containing an ID3 tag is:
Additional processing is done for LaTeX files. If a \title command is found within the first TitleLines lines of the file (default is 12), then the value of the title is stored in the generated index file as the file's title rather than the file's name. (Every non-space whitespace character in the title is converted to a space; leading and trailing spaces are removed.)
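The title rule can be sketched as follows. This is an illustration under simplifying assumptions, not the LaTeX module itself: the function name is hypothetical, and the regular expression does not handle nested braces inside the title.

```python
import re

TITLE_RE = re.compile(r"\\title\s*\{([^{}]*)\}")


def extract_latex_title(source, title_lines=12):
    """Return the \\title value found in the first title_lines lines,
    or None if there is none."""
    head = "\n".join(source.splitlines()[:title_lines])
    match = TITLE_RE.search(head)
    if match is None:
        return None
    # Convert every non-space whitespace character to a space,
    # then remove leading and trailing spaces.
    title = re.sub(r"[^\S ]", " ", match.group(1))
    return title.strip(" ")


print(extract_latex_title("\\documentclass{article}\n\\title{ A\tShort\nTitle }\n"))
```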
Additional processing is done for mail and news files. The additional processing is:
Indexing mail and news files is most effective only when there is exactly one message per file. While Usenet news files are usually this way, mail files are not. Mail files, e.g., mailboxes, usually consist of multiple messages. Such files would need to be split up into files of individual messages prior to indexing since there's no point in indexing a single mailbox: every search result would return a rank of 100 for the same file. Therefore, the splitmail++(1) utility is included in the SWISH++ distribution.
Additional processing is done for Unix manual page files. The additional processing is:
The RTF module simply indexes Rich Text Format files, ignoring all formatting commands.
Options begin with either a `-' for short options or a ``--'' for long options. Either a `-' or ``--'' by itself explicitly ends the options; either short or long options may be used. Long option names may be abbreviated so long as the abbreviation is unambiguous.
For a short option that takes an argument, the argument is either taken to be the remaining characters of the same option, if any, or, if not, is taken from the next option unless said option begins with a `-'.
Short options that take no arguments can be grouped (but the last option in the group can take an argument), e.g., -lrv4 is equivalent to -l -r -v4.
For a long option that takes an argument, the argument is either taken to be the characters after a `=', if any, or, if not, is taken from the next option unless said option begins with a `-'.
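The grouping rule for short options described above can be sketched as follows. The function is hypothetical and only mirrors the -lrv4 example; which options take an argument is passed in by the caller rather than taken from index++ itself.

```python
def expand_short_group(arg, options_with_argument):
    """Expand a group such as -lrv4 into separate short options."""
    if not arg.startswith("-") or arg.startswith("--") or len(arg) < 2:
        raise ValueError("not a short option group: " + arg)
    letters = arg[1:]
    expanded = []
    for position, letter in enumerate(letters):
        if letter in options_with_argument:
            # The remaining characters, if any, are this option's argument.
            expanded.append("-" + letter + letters[position + 1:])
            break
        expanded.append("-" + letter)
    return expanded


# Assuming only -v takes an argument, as in the example above.
print(expand_short_group("-lrv4", {"v"}))
```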
The following variables can be set in a configuration file. Variables and command-line options can be mixed, the latter taking priority.
All these examples assume you change your working directory to your web server's document root prior to indexing.
To index all HTML and text files on a web server:
index++ -v3 -e 'html:*.*htm*' -e 'text:*.txt' .
To index all files not under directories named CVS:
find . -name CVS -prune -o -type f -a -print | index++ -e 'html:*.*htm*' -
When using the Windows command interpreter, single quotes around filename patterns don't work; you must use double quotes:
index++ -v3 -e "html:*.*htm*" -e "text:*.txt" .
This is a problem with Windows, not SWISH++. (Double quotes will also work under Unix.)
In an HTML or XHTML document, there may be sections that should not be indexed. For example, if every page of a web site contains a navigation menu such as:
<SELECT NAME="menu">
<OPTION>Home
<OPTION>Automotive
<OPTION>Clothing
<OPTION>Hardware
</SELECT>
or a common header and footer, then, ordinarily, those words would be indexed for every page and therefore be discarded because they would be too frequent. However, via either the -C or --no-class option or the ExcludeClass variable, one or more class names can be specified; for HTML or XHTML elements belonging to one of those classes, the text up to the tag that ends them will not be indexed. Given a class name of, say, no_index, the above menu can be changed to:
<SELECT NAME="menu" CLASS="no_index">
and then everything up to the </SELECT> tag will not be indexed.
For an HTML element that has an optional end tag (such as the <P> element), the text up to the tag that ends it will not be indexed, which is either the element's own end tag or a tag of some other element that implicitly ends it. For example, in:
<P CLASS="no_index">
This was the poem that Alice read:
<BLOCKQUOTE>
<B>Jabberwocky</B><BR>
`Twas brillig, and the slithy toves<BR>
Did gyre and gimble in the wabe;<BR>
All mimsy were the borogoves,<BR>
And the mome raths outgrabe.
</BLOCKQUOTE>
the <BLOCKQUOTE> tag implicitly ends the <P> element (as do all block-level elements) so the only text that is not indexed above is: ``This was the poem that Alice read.''
For an HTML or XHTML element that does not have an end tag, only the text within the start tag will not be indexed. For example, in:
<IMG SRC="home.gif" ALT="Home" CLASS="no_index">
the word ``Home'' will not be indexed even though it ordinarily would have been if the CLASS attribute were not there.
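The basic idea of class-based exclusion can be sketched with a small parser. This is a simplified illustration, not the HTML module of index++: the class and function names are hypothetical, it only handles elements closed by their own end tag (the implicit ending of elements such as <P> is not modeled), and it never collects attribute text such as ALT in the first place.

```python
from html.parser import HTMLParser

# Elements without an end tag; they must never start a skipped region.
VOID_ELEMENTS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}


class ClassExcludingExtractor(HTMLParser):
    def __init__(self, excluded_classes):
        super().__init__()
        self.excluded = set(excluded_classes)
        self.skip_tag = None   # element whose text is being skipped
        self.skip_depth = 0    # nesting depth of that element
        self.words = []

    def handle_starttag(self, tag, attrs):
        if self.skip_tag is not None:
            if tag == self.skip_tag:
                self.skip_depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag not in VOID_ELEMENTS and self.excluded.intersection(classes):
            self.skip_tag, self.skip_depth = tag, 1

    def handle_endtag(self, tag):
        if tag == self.skip_tag:
            self.skip_depth -= 1
            if self.skip_depth == 0:
                self.skip_tag = None

    def handle_data(self, data):
        if self.skip_tag is None:
            self.words.extend(data.split())


def indexable_words(html_text, excluded_classes):
    parser = ClassExcludingExtractor(excluded_classes)
    parser.feed(html_text)
    parser.close()
    return parser.words


page = ('<P>Welcome</P><SELECT NAME="menu" CLASS="no_index">'
        '<OPTION>Home<OPTION>Automotive</SELECT><P>Specials today</P>')
print(indexable_words(page, ["no_index"]))
```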
(See Filters under EXAMPLES in swish++.conf(5).)
Exits with one of the values given below:
extract++(1), find(1), nroff(1), search++(1), splitmail++(1), swish++.conf(5), glob(7), man(7).
Nathaniel S. Borenstein. ``The text/enriched MIME Content-type,'' Request for Comments 1563, Network Working Group of the Internet Engineering Task Force, January 1994.
David H. Crocker. ``Standard for the Format of ARPA Internet Text Messages,'' Request for Comments 822, Department of Electrical Engineering, University of Delaware, August 1982.
Frank Dawson and Tim Howes. ``vCard MIME Directory Profile,'' Request for Comments 2426, Network Working Group of the Internet Engineering Task Force, September 1998.
Ned Freed and Nathaniel S. Borenstein. ``Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies,'' Request for Comments 2045, RFC 822 Extensions Working Group of the Internet Engineering Task Force, November 1996.
David Goldsmith and Mark Davis. ``UTF-7, a mail-safe transformation format of Unicode,'' Request for Comments 2152, Network Working Group of the Internet Engineering Task Force, May 1997.
International Standards Organization. ISO 8859-1: Information Processing -- 8-bit single-byte coded graphic character sets -- Part 1: Latin alphabet No. 1, 1987.
--. ISO 8879: Information Processing -- Text and Office Systems -- Standard Generalized Markup Language (SGML), 1986.
--. ISO/IEC 9945-2: Information Technology -- Portable Operating System Interface (POSIX) -- Part 2: Shell and Utilities, 1993.
Leslie Lamport. LaTeX: A Document Preparation System, 2nd ed., Addison-Wesley, Reading, MA, 1994.
Martin Nilsson. ID3 tag version 2, March 1998.
--. ID3 tag version 2.3.0, February 1999.
--. ID3 tag version 2.4.0 - Main Structure, November 2000.
--. ID3 tag version 2.4.0 - Native Frames, November 2000.
Steven Pemberton, et al. XHTML 1.0: The Extensible HyperText Markup Language, World Wide Web Consortium, January 2000.
Dave Raggett, Arnaud Le Hors, and Ian Jacobs. ``On SGML and HTML: SGML constructs used in HTML: Entities,'' HTML 4.0 Specification, §3.2.3, World Wide Web Consortium, April 1998.
--. ``The global structure of an HTML document: The document head: The title attribute,'' HTML 4.0 Specification, §7.4.3, World Wide Web Consortium, April 1998.
--. ``The global structure of an HTML document: The document head: Meta data,'' HTML 4.0 Specification, §7.4.4, World Wide Web Consortium, April 1998.
--. ``The global structure of an HTML document: The document body: Element identifiers: the id and class attributes,'' HTML 4.0 Specification, §7.5.2, World Wide Web Consortium, April 1998.
--. ``Tables: Elements for constructing tables: The TABLE element,'' HTML 4.0 Specification, §11.2.1, World Wide Web Consortium, April 1998.
--. ``Objects, Images, and Applets: Generic inclusion: the OBJECT element,'' HTML 4.0 Specification, §13.3, World Wide Web Consortium, April 1998.
--. ``Objects, Images, and Applets: How to specify alternate text,'' HTML 4.0 Specification, §13.8, World Wide Web Consortium, April 1998.
--. ``Index of Elements,'' HTML 4.0 Specification, World Wide Web Consortium, April 1998.
Marcin Sawicki, et al. Ruby Annotation, World Wide Web Consortium, April 2001.
The Unicode Consortium. ``Encoding Forms,'' The Unicode Standard 3.0, §2.3, Addison-Wesley, 2000.
Francois Yergeau. ``UTF-8, a transformation format of ISO 10646,'' Request for Comments 2279, Network Working Group of the Internet Engineering Task Force, January 1998.
Paul J. Lucas <pauljlucas@mac.com>
March 25, 2004 | SWISH++ |