search++(1) | General Commands Manual | search++(1) |
search++ - SWISH++ searcher
search++ [ options ] query
search++ is the SWISH++ searcher. It searches a previously generated index for the words specified in a query. In addition to running from the command-line, it can run as a daemon process functioning as a ``search++ server.''
The formal grammar of a query is:
In practice, however, the query is the set of words sought after, possibly restricted to meta data, and possibly combined with the operators ``and,'' ``or,'' ``near,'' ``not,'' and ``not near.'' The asterisk (*) can be used as a wildcard character at the end of words. Note that an asterisk and parentheses are shell meta-characters and as such must either be escaped (backslashed) or quoted when passed to a shell.
Although syntactically legal, it is a semantic error to have ``near'' just before ``not'' since such queries are nonsensical, e.g.:
mouse near not computer
Queries are evaluated in left-to-right order, i.e., ``and'' has the same precedence as ``or.'' For more about query syntax, see the EXAMPLES.
The same character mapping and word determination heuristics used by index++(1) are used on queries prior to searching.
The results are output either in ``classic'' or XML format. In either case, the components of the results are:
The ``classic'' results format is plain text as:
rank path-name file-size file-title
It can be parsed easily in Perl with:
($rank,$path,$size,$title) = split( / /, $_, 4 );
(The separator can be changed via the -R or --separator options or the ResultSeparator variable.)
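For readers not using Perl, the same split can be sketched in Python. This is only an illustration: the names parse_classic_results and Result are hypothetical, and it assumes that rank and file-size are integers and that lines starting with `#' are comment lines to be skipped.

```python
from typing import Iterable, List, NamedTuple


class Result(NamedTuple):
    rank: int
    path: str
    size: int
    title: str


def parse_classic_results(lines: Iterable[str], separator: str = " ") -> List[Result]:
    """Parse classic-format result lines; comment and blank lines are skipped."""
    results = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        # Split into at most four fields so the title keeps any separators
        # it contains (the counterpart of Perl's split(/ /, $_, 4)).
        rank, path, size, title = line.split(separator, 3)
        # Assumption: rank and file-size are integers.
        results.append(Result(int(rank), path, int(size), title))
    return results
```

As with the Perl one-liner, a path name containing the separator would be split in the wrong place, which is one reason to choose a different separator.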
Prior to results lines, comment lines may also appear containing additional information about the query results. Comment lines are in the format of:
# comment-key: comment-value
The keys and values are:
The XML results format is given by the DTD:
<!ELEMENT SearchResults (IgnoredList?, ResultCount, ResultList?)>
<!ELEMENT IgnoredList (Ignored+)>
<!ELEMENT Ignored (#PCDATA)>
<!ELEMENT ResultCount (#PCDATA)>
<!ELEMENT ResultList (File+)>
<!ELEMENT File (Rank, Path, Size, Title)>
<!ELEMENT Rank (#PCDATA)>
<!ELEMENT Path (#PCDATA)>
<!ELEMENT Size (#PCDATA)>
<!ELEMENT Title (#PCDATA)>
and by the XML schema located at:
http://homepage.mac.com/pauljlucas/software/swish/SearchResults/SearchResults.xsd
For example:
<?xml version="1.0" encoding="us-ascii"?>
<!DOCTYPE SearchResults SYSTEM
 "http://homepage.mac.com/pauljlucas/software/swish/SearchResults.dtd">
<SearchResults
 xmlns="http://homepage.mac.com/pauljlucas/software/swish/SearchResults"
 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 xsi:schemaLocation="http://homepage.mac.com/pauljlucas/software/swish/SearchResults
                     SearchResults.xsd">
<IgnoredList>
<Ignored>stop-word</Ignored>
...
</IgnoredList>
<ResultCount>42</ResultCount>
<ResultList>
<File>
<Rank>rank</Rank>
<Path>path-name</Path>
<Size>file-size</Size>
<Title>file-title</Title>
</File>
...
</ResultList>
</SearchResults>
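A client consuming the XML format might read it along the following lines. This is a minimal sketch using Python's standard xml.etree.ElementTree module; the function name parse_xml_results and the returned dictionary layout are hypothetical, the namespace is copied from the example document, and rank and file-size are assumed to be integers.

```python
import xml.etree.ElementTree as ET

# Namespace as it appears in the example document.
NS = {"sr": "http://homepage.mac.com/pauljlucas/software/swish/SearchResults"}


def parse_xml_results(xml_bytes: bytes) -> dict:
    """Extract ignored words, the result count, and the files from the XML."""
    root = ET.fromstring(xml_bytes)
    # IgnoredList and ResultList are optional in the DTD; findall() simply
    # returns an empty list when they are absent.
    ignored = [e.text for e in root.findall("sr:IgnoredList/sr:Ignored", NS)]
    count = int(root.findtext("sr:ResultCount", namespaces=NS))
    files = [
        {
            # Assumption: rank and file-size are integers.
            "rank": int(f.findtext("sr:Rank", namespaces=NS)),
            "path": f.findtext("sr:Path", namespaces=NS),
            "size": int(f.findtext("sr:Size", namespaces=NS)),
            "title": f.findtext("sr:Title", namespaces=NS),
        }
        for f in root.findall("sr:ResultList/sr:File", NS)
    ]
    return {"ignored": ignored, "count": count, "files": files}
```

The input is taken as bytes so that the parser can honor the encoding named in the XML declaration.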
search++ can alternatively run as a daemon process (via either the -b or --daemon-type options or the SearchDaemon variable) functioning as a ``search++ server'' by listening to a Unix domain socket (specified by either the -u or --socket-file options or the SocketFile variable), a TCP socket (specified by either the -a or --socket-address options or the SocketAddress variable), or both. Unix domain sockets are preferred for both performance and security. For search-intensive applications, such as a search engine on a heavily used web site, this can yield a large performance improvement since the start-up cost (fork(2), exec(2), and initialization) is paid only once.
If the process was started with root privileges, it relinquishes them immediately after initialization and before servicing any requests.
Search clients connect to a daemon via a socket and send a query in the same manner as on the command line (including the first word being ``search++''). The only exception is that shell meta-characters must not be escaped (backslashed) since no shell is involved. Search results are returned via the same socket. See the EXAMPLES.
A daemon can serve multiple query requests simultaneously since it is multi-threaded. When started, it ``pre-threads,'' meaning that it creates a pool of threads in advance that service an indefinite number of requests. This is a further performance improvement since a thread is not created and destroyed per request.
There is an initial, minimum number of threads in the thread pool. The number of threads grows dynamically when there are more requests than threads, but not more than a specified maximum to prevent the server from thrashing. (See the -t, --min-threads, -T, and --max-threads options or the ThreadsMin or ThreadsMax variables.) If the number of threads reaches the maximum, subsequent requests are queued until existing threads become available to service them after completing in-progress requests. (See either the -q or --queue-size options or the SocketQueueSize variable.)
If there are more than the minimum number of threads and some remain idle longer than a specified timeout period (because the number of requests per unit time has dropped), then threads will die off until the pool returns to its original minimum size. (See either the -O or --thread-timeout options or the ThreadTimeout variable.)
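The growth and shrinkage of the thread pool can be pictured with a small, deterministic model. The sketch below is not the daemon's code and creates no real threads; the class PoolModel and its method names are hypothetical, and it deliberately ignores the queue size limit and the actual measurement of idle time.

```python
from dataclasses import dataclass, field


@dataclass
class PoolModel:
    min_threads: int
    max_threads: int
    threads: int = field(init=False, default=0)  # current pool size
    busy: int = 0                                # threads servicing a request
    queued: int = 0                              # requests waiting for a thread

    def __post_init__(self) -> None:
        # "Pre-threading": the pool starts at its minimum size.
        self.threads = self.min_threads

    def request_arrives(self) -> str:
        if self.busy < self.threads:
            self.busy += 1
            return "idle thread"
        if self.threads < self.max_threads:
            # More requests than threads: grow, but never past the maximum.
            self.threads += 1
            self.busy += 1
            return "new thread"
        self.queued += 1
        return "queued"

    def request_done(self) -> None:
        if self.queued:
            self.queued -= 1  # the freed thread picks up a queued request
        else:
            self.busy -= 1

    def idle_timeout(self) -> None:
        """Model one idle thread exceeding the timeout period."""
        if self.threads > self.min_threads and self.busy < self.threads:
            self.threads -= 1
```

With a minimum of 2 and a maximum of 3, four simultaneous requests are handled by two idle threads, one newly created thread, and one queue entry; once the load drops, idle timeouts shrink the pool back to 2 and no further.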
A single daemon can search only a single index. To search multiple indices concurrently, multiple daemons can be run, each searching its own index and using its own socket. An index must not be modified or deleted while a daemon is using it.
Options begin with either a `-' for short options or a ``--'' for long options. Either a `-' or ``--'' by itself explicitly ends the options; however, the difference is that `-' is returned as the first non-option whereas ``--'' is skipped entirely. Either short or long options may be used. Long option names may be abbreviated so long as the abbreviation is unambiguous.
For a short option that takes an argument, the argument is either taken to be the remaining characters of the same option, if any, or, if not, is taken from the next option unless said option begins with a `-'.
Short options that take no arguments can be grouped (but the last option in the group can take an argument), e.g., -Bq511 is equivalent to -B -q 511.
For a long option that takes an argument, the argument is either taken to be the characters after a `=', if any, or, if not, is taken from the next option unless said option begins with a `-'.
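These conventions are close to those of GNU-style option parsers, so they can be tried out with Python's standard getopt module. This is only an approximation and not the parser search++ uses: for instance, Python's getopt takes the next argument as an option's value even when it begins with a `-'. Only -B, -q, and --queue-size, which are named in this manual, appear in the sketch; the helper name parse is hypothetical.

```python
import getopt

# "B" takes no argument; "q:" and "queue-size=" each take one.
SHORT_OPTS = "Bq:"
LONG_OPTS = ["queue-size="]


def parse(argv):
    """Return (options, remaining arguments) for the given argument list."""
    return getopt.getopt(argv, SHORT_OPTS, LONG_OPTS)


# Grouped short options, with the last one taking the rest as its argument.
print(parse(["-Bq511", "mouse"]))
# ([('-B', ''), ('-q', '511')], ['mouse'])

# An unambiguous abbreviation of a long option, with '=' before the argument.
print(parse(["--queue=511", "mouse"]))
# ([('--queue-size', '511')], ['mouse'])

# "--" is skipped entirely, whereas "-" is returned as the first non-option.
print(parse(["-B", "--", "-q"]))
# ([('-B', '')], ['-q'])
print(parse(["-B", "-", "mouse"]))
# ([('-B', '')], ['-', 'mouse'])
```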
The following variables can be set in a configuration file. Variables and command-line options can be mixed, the latter taking priority.
The query:
computer mouse
is the same as and short for:
computer and mouse
(because ``and'' is implicit) and would return only those documents that contain both words. The query:
cat or kitten or feline
would return only those documents regarding cats. The query:
mouse and computer or keyboard
is the same as:
(mouse and computer) or keyboard
(because queries are evaluated left-to-right) in that they will both return only those documents regarding either mice attached to a computer or any kind of keyboard. However, neither of those is the same as:
mouse and (computer or keyboard)
which would return only those documents regarding mice (including the rodents) and either a computer or a keyboard.
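The effect of left-to-right evaluation can be reproduced with sets of document numbers. The following is a toy model, not the way search++ is implemented: the function evaluate and the sample index are hypothetical, and the model handles only flat queries of words, ``and,'' and ``or'' without parentheses.

```python
from typing import Dict, List, Set


def evaluate(tokens: List[str], index: Dict[str, Set[int]]) -> Set[int]:
    """Evaluate a flat query strictly left to right over sets of document IDs."""
    result = set(index.get(tokens[0], set()))
    op = "and"
    for tok in tokens[1:]:
        if tok in ("and", "or"):
            op = tok
            continue
        docs = index.get(tok, set())
        result = result & docs if op == "and" else result | docs
        op = "and"  # adjacent words are joined by an implicit "and"
    return result


# Hypothetical index: word -> documents containing it.
index = {"mouse": {1, 2, 3}, "computer": {1, 2, 4}, "keyboard": {4, 5}}

# (mouse and computer) or keyboard
print(sorted(evaluate("mouse and computer or keyboard".split(), index)))
# [1, 2, 4, 5]

# mouse and (computer or keyboard), written out by hand
print(sorted(index["mouse"] & (index["computer"] | index["keyboard"])))
# [1, 2]

# Same precedence: (keyboard or mouse) and computer, not keyboard or (mouse and computer)
print(sorted(evaluate("keyboard or mouse and computer".split(), index)))
# [1, 2, 4]
```

The last query shows that ``and'' does not bind more tightly than ``or'': under the conventional precedence it would have returned documents 1, 2, 4, and 5.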
The query:
comput*
would return only those documents that contain words beginning with ``comput'' such as ``computation,'' ``computational,'' ``computer,'' ``computerize,'' ``computing,'' and others. Wildcarded words can be used anywhere ordinary words can be. The query:
comput* (medicine or doctor*)
would return only those documents that contain something about computer use in medicine or by doctors.
The query:
mouse or mice and not computer*
would return only those documents regarding mice (the rodents) and not the kind attached to a computer.
Using ``near'' is the same as using ``and'' except that it not only requires both words to be in the documents, but that they be near each other, i.e., it returns potentially fewer documents than the corresponding ``and'' query. The query:
computer near mouse
would return only those documents where both words are near each other. The query:
mouse near (computer or keyboard)
is the same as:
(mouse near computer) or (mouse near keyboard)
i.e., ``near'' gets distributed across parenthesized subqueries.
Using ``not near'' is the same as using ``and not'' except that it allows the right-hand side words to be in the documents, just not near the left-hand side words, i.e., it returns potentially more documents than the corresponding ``and not'' query. Of course the word(s) on the right-hand side need not be in the documents at all, i.e., they would be considered ``infinitely far'' apart. The query:
mouse or mice not near computer*
would return only those documents regarding mice (the rodents) more effectively than the query:
mouse or mice and not computer*
because the latter would exclude documents about mice (the rodents) where computers just so happened to be mentioned in the same documents.
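One plausible reading of ``near'' and ``not near'' can be sketched with word positions. This is an illustration only: the function names and the window parameter are hypothetical, the distance that search++ actually treats as ``near'' is not stated in this section, and wildcards are not modeled.

```python
from typing import List


def positions(words: List[str], target: str) -> List[int]:
    """Positions at which target occurs in the document's word list."""
    return [i for i, w in enumerate(words) if w == target]


def near(words: List[str], left: str, right: str, window: int) -> bool:
    """True if some occurrence of left is within window words of right."""
    return any(
        abs(i - j) <= window
        for i in positions(words, left)
        for j in positions(words, right)
    )


def not_near(words: List[str], left: str, right: str, window: int) -> bool:
    """True if left occurs and right is absent or never within the window."""
    return left in words and not near(words, left, right, window)


def and_not(words: List[str], left: str, right: str) -> bool:
    """True if left occurs and right does not occur at all."""
    return left in words and right not in words
```

A document that mentions ``mouse'' at the start and ``computer'' many words later satisfies not_near but fails and_not, which is the case where ``not near'' returns more documents. A document without ``computer'' at all satisfies both, matching the ``infinitely far'' description.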
The query:
author = hawking
would return only those documents whose author attribute contains ``hawking.'' The query:
author = hawking radiation
would return only those documents regarding radiation whose author attribute contains ``hawking.'' The query:
author = (stephen hawking)
would return only those documents whose author is Stephen Hawking. The query:
author = (stephen hawking) or (black near hole*)
would return only those documents whose author is Stephen Hawking or that contain the word ``black'' near ``hole'' or ``holes'' regardless of the author. Note that the second set of parentheses is necessary; otherwise, the query would have been the same as:
(author = (stephen hawking) or black) near hole*
which would have additionally required both ``stephen'' and ``hawking'' to be near ``hole'' or ``holes.''
To send a query request to a search daemon using Perl, first open the socket and connect to the daemon (see [Wall], pp. 439-440):
use Socket;
$SocketFile = '/tmp/search.socket';
socket( SEARCH, PF_UNIX, SOCK_STREAM, 0 )
  or die "can not open socket: $!\n";
connect( SEARCH, sockaddr_un( $SocketFile ) )
  or die "can not connect to \"$SocketFile\": $!\n";
Autoflush must be set for the socket filehandle (see [Wall], p. 781); otherwise, the server thread will hang, because the client's I/O buffering waits for the buffer to fill, which never happens since queries are short:
select( (select( SEARCH ), $| = 1)[0] );
Next, send a query request (beginning with the word ``search++'' and any options, just as on the command line) to the daemon via the socket filehandle. Be sure to include a trailing newline: the server reads an entire line of input, so it waits until it sees one.
$query = 'mouse and computer';
print SEARCH "search++ $query\n";
Finally, read the results back and print them:
print while <SEARCH>;
close( SEARCH );
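The same exchange can be sketched in other languages. The Python version below is a minimal sketch under a few assumptions: the function name query_daemon is hypothetical, the query and results are treated as ASCII (as the XML example's declaration suggests), and the daemon is assumed to close the connection once the results are sent, which is also what the Perl read loop relies on.

```python
import socket


def query_daemon(socket_file: str, query: str, encoding: str = "ascii") -> str:
    """Send one query to a search daemon over a Unix domain socket."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(socket_file)
        # The request starts with "search++" and ends with a newline, since
        # the server reads a whole line. sendall() writes immediately, so no
        # autoflush step is needed here.
        s.sendall(f"search++ {query}\n".encode(encoding))
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:  # the server closed the connection: no more results
                break
            chunks.append(data)
    return b"".join(chunks).decode(encoding, errors="replace")
```

For example, query_daemon('/tmp/search.socket', 'mouse and computer') would return the results as a single string, provided a daemon is listening on that socket file.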
Exits with one of the values given below:
index++(1), perlfunc(1), exec(2), fork(2), unlink(2), accept(3), bind(3), listen(3), select(3), swish++.conf(5), launchd(8), searchmonitor(8)
Tim Bray, et al. Extensible Markup Language (XML) 1.0, February 10, 1998.
Bradford Nichols, Dick Buttlar, and Jacqueline Proulx Farrell. Pthreads Programming, O'Reilly & Associates, Sebastopol, CA, 1996.
M.F. Porter. ``An Algorithm For Suffix Stripping,'' Program, 14(3), July 1980, pp. 130-137.
W. Richard Stevens. Unix Network Programming, Vol 1, 2nd ed., Prentice-Hall, Upper Saddle River, NJ, 1998.
Larry Wall, et al. Programming Perl, 3rd ed., O'Reilly & Associates, Inc., Sebastopol, CA, 2000.
Paul J. Lucas <pauljlucas@mac.com>
June 16, 2005 | SWISH++ |