WWW::Search(3pm) | User Contributed Perl Documentation | WWW::Search(3pm) |
WWW::Search - Virtual base class for WWW searches
  use WWW::Search;
  my $sEngine = 'AltaVista';
  my $oSearch = new WWW::Search($sEngine);
This class is the parent for all access methods supported by the "WWW::Search" library. This library implements a Perl API to web-based search engines.
See README for a list of search engines currently supported, and for a lot of interesting high-level information about this distribution.
The number of search results can be limited, and there is a pause between requests to avoid overloading either the client or the server.
Here is a sample program:
  my $sQuery = 'Columbus Ohio sushi restaurant';
  my $oSearch = new WWW::Search('AltaVista');
  $oSearch->native_query(WWW::Search::escape_query($sQuery));
  $oSearch->login($sUser, $sPassword);
  while (my $oResult = $oSearch->next_result())
    {
    print $oResult->url, "\n";
    } # while
  $oSearch->logout;
Results are objects of type "WWW::SearchResult" (see WWW::SearchResult for details). Note that different backends support different result fields. All backends are required to support title and url.
For specific search engines, see WWW::Search::TheEngineName (replacing TheEngineName with a particular search engine).
For details about the results of a search, see WWW::SearchResult.
$oSearch = new WWW::Search('SearchEngineName');
where SearchEngineName is replaced with a particular search engine. For example:
$oSearch = new WWW::Search('Yahoo');
If no search engine is specified, a default (currently 'Null::Empty') will be chosen for you.
  use WWW::Search;
  my @asEngines = sort &WWW::Search::installed_engines();
  local $" = ', ';
  print (" + These WWW::Search backends are installed: @asEngines\n");
  # Choose a backend at random (yes, this is rather silly):
  my $oSearch = WWW::Search->new($asEngines[rand(scalar(@asEngines))]);
Example:
  $oSearch->native_query('search-engine-specific+escaped+query+string',
                         { option1 => 'able', option2 => 'baker' } );
The hash of options following the query string is optional. The query string is backend-specific. There are two kinds of options: options specific to the backend, and generic options applicable to multiple backends.
Generic options all begin with 'search_'. Currently a few are supported:
Some backends may not implement these generic options, but any which do implement them must provide these semantics.
Backend-specific options are described in the documentation for each backend. In most cases the options and their values are packed together to create the query portion of the final URL.
Details about how the search string and option hash are interpreted might be found in the search-engine-specific manual pages (WWW::Search::SearchEngineName).
Same arguments as "native_query()" above.
Currently, this feature is supported by only a few backends; consult the documentation for each backend to see if it is implemented.
$oSearch->cookie_jar('/tmp/my_cookies');
If you give an HTTP::Cookies object, it is up to you to save the cookies if/when you wish.
  use HTTP::Cookies;
  my $oJar = HTTP::Cookies->new(...);
  $oSearch->cookie_jar($oJar);
If you pass in no arguments, the cookie jar (if any) is returned.
  my $oJar = $oSearch->cookie_jar;
  unless (ref $oJar) { print "No jar" };
If you don't want to put passwords in the environment, one solution would be to subclass LWP::UserAgent and use $ENV{WWW_SEARCH_USERAGENT} instead (see user_agent below).
env_proxy() must be called before the first retrieval is attempted.
Example:
  $ENV{http_proxy     } = 'http://my.proxy.com:80';
  $ENV{http_proxy_user} = 'bugsbun';
  $ENV{http_proxy_pwd } = 'c4rr0t5';
  $oSearch->env_proxy('yes');  # Turn on with any true value
  ...
  $oSearch->env_proxy(0);  # Turn off with zero
  ...
  if ($oSearch->env_proxy)  # Test
Takes the same arguments as LWP::UserAgent::proxy().
This routine should be called before calling any of the result functions (any method with "result" in its name).
Example:
  # Turn on and set address:
  $oSearch->http_proxy(['http','ftp'] => 'http://proxy:8080');
  # Turn off:
  $oSearch->http_proxy('');
These routines set/get username and password used in proxy authentication. Authentication is attempted only if all three items (proxy URL, username and password) have been set.
Example:
  $oSearch->http_proxy_user("myuser");
  $oSearch->http_proxy_pwd("mypassword");
  $oSearch->http_proxy_user(undef);   # Example for no authentication
  $username = $oSearch->http_proxy_user();
Defaults to 500.
Example:
  $max = $oSearch->maximum_to_retrieve(100);
You can also spell this method "maximum_to_return".
Defaults to 60.
Example:
$oSearch->timeout(120);
Note: This might take a while, because a web backend will keep asking the search engine for "next page of results" over and over until there are no more next pages, and THEN return from this function.
If an error occurs at any time during query processing, it will be indicated in the response().
Example:
  @results = $oSearch->results();
  # Go have a cup of coffee while the previous line executes...
  foreach $oResult (@results)
    {
    print $oResult->url(), "\n";
    } # foreach

  while ($oResult = $oSearch->next_result())
    {
    print $oResult->url(), "\n";
    } # while
When there are no more results, or if an error occurs, next_result() will return undef.
If an error occurs at any time during query processing, it will be indicated in the response().
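Because next_result() returns undef both at the normal end of the results and after an error, a client can tell the two apart only by looking at response() afterwards. The following is a minimal sketch of that pattern. The helper name drain_search is made up for this illustration, and the sketch assumes that the default backend (created when no engine name is given) yields no results, as its name 'Null::Empty' suggests.

```perl
use strict;
use warnings;
use WWW::Search;

# Hypothetical helper (not part of WWW::Search): fetch every result,
# then report whether the search ended normally or with an error.
sub drain_search
  {
  my ($oSearch) = @_;
  my @aoResults;
  # next_result() returns undef at the end of the results AND on error:
  while (defined(my $oResult = $oSearch->next_result))
    {
    push @aoResults, $oResult;
    } # while
  # response() is an HTTP::Response object, so its methods apply:
  my $oResponse = $oSearch->response;
  my $sStatus = $oResponse->is_success
    ? 'completed'
    : 'failed: ' . $oResponse->status_line;
  return (\@aoResults, $sStatus);
  } # drain_search

# No engine name given, so the default backend is used
# (assumed here to return an empty result list):
my $oSearch = WWW::Search->new();
$oSearch->native_query(WWW::Search::escape_query('sushi'));
my ($raoResults, $sStatus) = drain_search($oSearch);
printf "%d result(s), search %s\n", scalar(@$raoResults), $sStatus;
```

To use the helper with a real engine, pass the engine name to new() as shown earlier; the meaning of a failed response is backend-specific.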
The only guaranteed valid offset is 0, which will replay the results from the beginning. In particular, seeking past the end of the current cached results probably will not do what you might think it should.
Results are cached, so this does not re-issue the query or cause IO (unless you go off the end of the results). To re-do the query, create a new search object.
Example:
$oSearch->seek_result(0);
  if (! $oSearch->response->is_success)
    {
    print STDERR "Error:  " . $oSearch->response->as_string() . "\n";
    } # if
Note to backend authors: even if the backend does not involve the web, it should return an HTTP::Response object.
Returns an HTTP::Response object describing the result of the submission request. Consult the documentation for each backend to find out the meaning of the response.
Example:
  $escaped = WWW::Search::escape_query('+hi +mom');
  # $escaped is now '%2Bhi+%2Bmom'
See also "unescape_query()". NOTE that this is not a method, it is a plain function.
Example:
  $unescaped = WWW::Search::unescape_query('%22hi+mom%22');
  # $unescaped eq q{"hi mom"}
NOTE that this is not a method, it is a plain function.
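The two examples above follow the usual URL query encoding: spaces become '+', and other special characters become '%' followed by two hex digits. The sketch below reproduces just those two documented examples with plain Perl. It is an illustration only, not the code used by WWW::Search; the function names sketch_escape and sketch_unescape are made up, and only ASCII input is considered.

```perl
use strict;
use warnings;

# Illustration only (not the library's implementation).
# Encode everything except ASCII letters, digits and spaces as %XX,
# then turn each space into '+'.
sub sketch_escape
  {
  my ($s) = @_;
  $s =~ s/([^A-Za-z0-9 ])/sprintf('%%%02X', ord($1))/ge;
  $s =~ tr/ /+/;
  return $s;
  } # sketch_escape

# Reverse the steps: '+' back to space first, then decode %XX.
sub sketch_unescape
  {
  my ($s) = @_;
  $s =~ tr/+/ /;
  $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
  return $s;
  } # sketch_unescape

my $sEscaped   = sketch_escape('+hi +mom');
my $sUnescaped = sketch_unescape('%22hi+mom%22');
print "$sEscaped\n";    # %2Bhi+%2Bmom
print "$sUnescaped\n";  # "hi mom"
```

Like the real functions, these are plain functions called with a string argument, not methods called on a search object.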
NOTE that this is not a method, it is a plain function.
Returns the user-agent object.
If a backend needs the low-level LWP::UserAgent or LWP::RobotUA to have a particular name, $oSearch->agent_name() and possibly $oSearch->agent_email() should be called to set the desired values *before* calling $oSearch->user_agent().
If the environment variable WWW_SEARCH_USERAGENT has a value, it will be used as the class for a new user agent object. This class should be a subclass of LWP::UserAgent. For example,
  $ENV{WWW_SEARCH_USERAGENT} = 'My::Own::UserAgent';
  # If this env.var. has no value,
  # LWP::UserAgent or LWP::RobotUA will be used.
  $oSearch = new WWW::Search('MyBackend');
  $oSearch->agent_name('MySpider');
  if ($iBackendWebsiteRequiresNonRobot)
    {
    $oSearch->user_agent('non-robot');
    }
  else
    {
    $oSearch->agent_email('me@here.com');
    $oSearch->user_agent();
    }
Backends should use robot-style user-agents whenever possible.
  $oSearch->http_referer('http://prev.engine.com/wherever/setup.html');
  $oResponse = $oSearch->http_request('GET', $url);
$oSearch->http_method('POST');
  ...
  $oSearch->maximum_to_return(10);
  while ($oSearch->next_result) { ... }
  my $urlSave = $oSearch->next_url;
Then, when you start up the next session (e.g. after the user clicks your "next" button), restore this value before calling for the results:
  $oSearch->native_query(...);
  $oSearch->next_url($urlSave);
  $oSearch->maximum_to_return(20);
  while ($oSearch->next_result) { ... }
WARNING: It is entirely up to you to keep your interface in sync with the number of hits per page being returned from the backend. And, we make no guarantees whether this method will work for any given backend. (Their caching scheme might not enable you to jump into the middle of a list of search results, for example.)
If a backend defines this method, it is in total control of the WWW fetch, parsing, and preparing for the next page of results. See the WWW::Search::AltaVista module for example usage of the _native_retrieve_some method.
An easier way to achieve this in a backend is to inherit _native_retrieve_some from WWW::Search, and do only the HTML parsing. Simply define a method _parse_tree which takes one argument, an HTML::TreeBuilder object, and returns an integer, the number of results found on this page. See the WWW::Search::Yahoo module for example usage of the _parse_tree method.
A backend should, in general, define either _parse_tree() or _native_retrieve_some(), but not both.
Additional features of the default _native_retrieve_some method:
Sets $self->{_prev_url} to the URL of the page just retrieved.
Calls $self->preprocess_results_page() on the raw HTML of the page.
Then, parses the page with an HTML::TreeBuilder object and passes that populated object to $self->_parse_tree().
Additional notes on using the _parse_tree method:
The built-in HTML::TreeBuilder object used to parse the page has store_comments turned ON. If a backend needs to use a subclassed or modified HTML::TreeBuilder object, the backend should set $self->{'_treebuilder'} to that object before any results are retrieved. The best place to do this is at the end of _native_setup_search.
  my $oTree = new myTreeBuilder;
  $oTree->store_pis(1);  # for example
  $self->{'_treebuilder'} = $oTree;
When _parse_tree() is called, the $self->next_url is cleared. During parsing, the backend should set $self->next_url to the appropriate URL for the next page of results. (If _parse_tree() does not set the value, the search will end after parsing this page of results.)
When _parse_tree() is called, the URL for the page being parsed can be found in $self->{_prev_url}.
Takes one argument, a string (the HTML webpage); returns one string (the same HTML, modified).
This method is called from within _native_retrieve_some (above) before the HTML of the page is parsed.
See the WWW::Search::Ebay distribution 2.07 or higher for example usage.
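As an illustration of the string-in, string-out contract described above, here is a minimal sketch of a preprocess_results_page override that deletes script blocks before the page is parsed. The package name WWW::Search::DemoBackend and the sample HTML are made up. To keep the sketch runnable on its own, the package does not inherit from WWW::Search; a real backend would.

```perl
use strict;
use warnings;

package WWW::Search::DemoBackend;  # hypothetical backend name

# A real backend would also declare:  use parent 'WWW::Search';

# Takes the raw HTML of a results page, returns the modified HTML.
sub preprocess_results_page
  {
  my ($self, $sPage) = @_;
  # Delete <script>...</script> blocks (case-insensitive, across newlines):
  $sPage =~ s{<script\b.*?</script>}{}gis;
  return $sPage;
  } # preprocess_results_page

package main;

my $sRaw = qq{<html><script>var x = 1;</script><a href="/r1">hit</a></html>};
# Called as a class method here only because this is a standalone demo:
my $sClean = WWW::Search::DemoBackend->preprocess_results_page($sRaw);
print "$sClean\n";
```

A regex is used only to keep the sketch short; whether such cleanup is needed at all depends on the pages a given backend has to parse.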
Returns the value of the $TEST_CASES variable of the backend engine.
If the value is undef, the key will not be added to the string.
At one time, for testing purposes, we asked backends to use this function rather than piecing the URL together by hand, to ensure that URLs are identical across platforms and software versions. But this is no longer necessary.
Example:
  $self->{_options} = {
                       'opt3' => 'val3',
                       'search_url' => 'http://www.deja.com/dnquery.xp',
                       'opt1' => 'val1',
                       'QRY' => $native_query,
                       'opt2' => 'val2',
                      };
  $self->{_next_url} = $self->{_options}{'search_url'} .'?'.
                       $self->hash_to_cgi_string($self->{_options});
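The following sketch shows the kind of string such a call might produce. It is not the library's code. It implements the one documented rule (keys whose value is undef are omitted) plus two assumptions made for this illustration: keys are emitted in sorted order, in line with the goal of identical URLs mentioned above, and generic options (keys beginning with 'search_') are left out of the string. The function name sketch_hash_to_cgi_string and the sample values are made up, and values are not URL-escaped here.

```perl
use strict;
use warnings;

# Illustration only (not the library's implementation).
sub sketch_hash_to_cgi_string
  {
  my ($rhOptions) = @_;
  my @asPairs;
  foreach my $sKey (sort keys %$rhOptions)  # ASSUMPTION: sorted key order
    {
    # ASSUMPTION: generic 'search_' options are not sent to the engine:
    next if ($sKey =~ m/\Asearch_/);
    # Documented behavior: keys with undef values are not added:
    next unless defined($rhOptions->{$sKey});
    push @asPairs, $sKey .'='. $rhOptions->{$sKey};
    } # foreach
  return join('&', @asPairs);
  } # sketch_hash_to_cgi_string

my %hOptions = (
                'opt3' => 'val3',
                'search_url' => 'http://www.deja.com/dnquery.xp',
                'opt1' => 'val1',
                'QRY' => 'sushi',
                'opt2' => undef,
               );
my $sURL = $hOptions{'search_url'} .'?'. sketch_hash_to_cgi_string(\%hOptions);
print "$sURL\n";
# http://www.deja.com/dnquery.xp?QRY=sushi&opt1=val1&opt3=val3
```

Note that 'QRY' sorts before 'opt1' because Perl's default sort compares strings by character code, and uppercase letters come first.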
"WWW::Search" supports backends to separate search engines. Each backend is implemented as a subclass of "WWW::Search". WWW::Search::Yahoo provides a good sample backend.
A backend must have the routine "_native_setup_search()". A backend must have the routine "_native_retrieve_some()" or "_parse_tree()".
"_native_setup_search()" is invoked before the search. It is passed a single argument: the escaped, native version of the query.
"_native_retrieve_some()" is the core of a backend. It will be called periodically to fetch URLs. It should retrieve several hits from the search service and add them to the cache. It should return the number of hits found, or undef when there are no more hits.
Internally, "_native_retrieve_some()" typically sends an HTTP request to the search service, parses the HTML, extracts the links and descriptions, then saves the URL for the next page of results. See the code for the "WWW::Search::AltaVista" module for an example.
Alternatively, a backend can define the method "_parse_tree()" instead of "_native_retrieve_some()". See the "WWW::Search::Ebay" module for a good example.
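To make the division of labor concrete, here is a minimal sketch of a backend that defines "_native_setup_search()" and "_native_retrieve_some()" and serves canned data instead of contacting a search service. Several things here are assumptions made for this illustration rather than statements from this manual: the backend name DemoStatic, the canned URLs, the use of $self->{cache} as the result cache, the use of WWW::Search::Result objects with add_url() and title(), and the %INC entry that lets the backend live in the same file as the client. Consult the source of an existing backend before relying on any of them.

```perl
use strict;
use warnings;

package WWW::Search::DemoStatic;  # hypothetical backend name
use parent 'WWW::Search';
use WWW::Search::Result;

# Canned "hits" standing in for a real search service:
my @aaCanned = (
                [ 'http://example.com/a', 'First canned hit'  ],
                [ 'http://example.com/b', 'Second canned hit' ],
               );

# Invoked once before the search, with the escaped native query.
sub _native_setup_search
  {
  my ($self, $sNativeQuery) = @_;
  $self->{_demo_pending} = [ @aaCanned ];
  } # _native_setup_search

# Adds hits to the cache; returns the number of hits found,
# or undef when there are no more hits.
sub _native_retrieve_some
  {
  my $self = shift;
  my $raPending = $self->{_demo_pending} || [];
  return undef unless @$raPending;
  my $iHits = 0;
  while (my $raHit = shift @$raPending)
    {
    my $oHit = WWW::Search::Result->new;
    $oHit->add_url($raHit->[0]);
    $oHit->title($raHit->[1]);
    push @{$self->{cache}}, $oHit;  # ASSUMPTION: cache location
    $iHits++;
    } # while
  return $iHits;
  } # _native_retrieve_some

package main;

# The backend is defined in this file, so mark it as already loaded:
$INC{'WWW/Search/DemoStatic.pm'} = __FILE__;
my $oSearch = WWW::Search->new('DemoStatic');
$oSearch->native_query('anything');
my @asLines;
while (my $oResult = $oSearch->next_result)
  {
  push @asLines, $oResult->url .'  '. $oResult->title;
  } # while
print "$_\n" foreach @asLines;
```

A backend distributed for others to use would live in its own file, WWW/Search/TheEngineName.pm, and would not need the %INC line.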
If you implement a new backend, please let the authors know.
The bugs are there for you to find (some people call them Easter Eggs).
Desired features:
John Heidemann <johnh@isi.edu>
Maintained by Martin Thurn, "mthurn@cpan.org", <http://www.sandcrawler.com/SWB/cpan-modules.html>.
Copyright (c) 1996 University of Southern California. All rights reserved.
Redistribution and use in source and binary forms are permitted provided that the above copyright notice and this paragraph are duplicated in all such forms and that any documentation, advertising materials, and other materials related to such distribution and use acknowledge that the software was developed by the University of Southern California, Information Sciences Institute. The name of the University may not be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
2020-09-10 | perl v5.30.3 |