CHECKBOT(1p)              User Contributed Perl Documentation              CHECKBOT(1p)
Checkbot - WWW Link Verifier
checkbot [--cookies] [--debug] [--file file name] [--help]
         [--mailto email addresses] [--noproxy list of domains] [--verbose]
         [--url start URL] [--match match string] [--exclude exclude string]
         [--proxy proxy URL] [--internal-only] [--ignore ignore string]
         [--filter substitution regular expression] [--style style file URL]
         [--note note] [--sleep seconds] [--timeout timeout]
         [--interval seconds] [--dontwarn HTTP response codes]
         [--enable-virtual] [--language language code]
         [--suppress suppression file] [start URLs]
Checkbot verifies the links in a specific portion of the World Wide Web. It creates HTML pages with diagnostics.
Checkbot uses LWP to find URLs on pages and to check them. It supports the same schemes as LWP does, and finds the same links that HTML::LinkExtor will find.
Checkbot considers links to be either 'internal' or 'external'. Internal links are links within the web space that needs to be checked. If an internal link points to a web document, this document is retrieved and its links are extracted and processed. External links are only checked to verify that they work. Checkbot checks links as it finds them, so internal and external links are checked at the same time, even though they are treated differently.
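For example, the following run (using the hypothetical host www.example.org) treats every URL matching that host as internal and merely verifies that links pointing anywhere else work:

checkbot --url http://www.example.org/ --match www.example.org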
Options for Checkbot are:
--file file name
Write the Checkbot report to this file. The default value for this option is "checkbot.html".
--url start URL
Set the start URL for Checkbot. If no scheme is specified for the URL, the file protocol is assumed.
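As a sketch, a schemeless start URL such as this hypothetical local path is read through the file protocol:

checkbot --url /var/www/documents/index.html

which is equivalent to

checkbot --url file:///var/www/documents/index.html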
--match match string
If no explicit match string is given, the start URLs (see option "--url") will be used as a match string instead. In this case the last page name, if any, will be trimmed. For example, a start URL like "http://some.site/index.html" will result in a default match string of "http://some.site/".
The match string can be a Perl regular expression. For example, to check the main server page and all HTML pages directly underneath it, but not the HTML pages in the subdirectories of the server, the match string would be "www.someserver.xyz/($|[^/]+\.html)".
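Putting that together as a command line, with www.someserver.xyz standing in for a real host:

checkbot --url http://www.someserver.xyz/ --match 'www.someserver.xyz/($|[^/]+\.html)'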
--exclude exclude string
URLs matching this string are considered external even when they match the match string. The exclude string can be a Perl regular expression. For example, to consider all URLs with a query string external, use "[=\?]". This can be useful when a URL with a query string unlocks the path to a huge database which would otherwise be checked.
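For instance, this sketch (hypothetical host) makes every URL containing a query string external:

checkbot --url http://www.example.org/ --exclude '[=\?]'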
For example "/old/new/" would replace occurrences of 'old' with 'new' in each URL.
--ignore ignore string
URLs matching this string are ignored by Checkbot. The ignore string can be a Perl regular expression. For example "www.server.com\/(one|two)" would match all URLs starting with either www.server.com/one or www.server.com/two.
--note note
Include this note verbatim in the mail message. This option is only meaningful in combination with the "--mailto" option.
--dontwarn HTTP response codes
Do not include warnings for these HTTP response codes on the result pages. Checkbot uses the response codes generated by the server, even if a response code is not defined in RFC 2616 (HTTP/1.1). In addition to the normal HTTP response codes, Checkbot defines a few response codes of its own for situations which are not technically a problem, but which cause problems in many cases anyway. These codes are:

901 Host name expected but not found
The scheme of this URL supports a host name, but none was found in the URL. This usually indicates a mistake in the URL. An exception is that this check is not applied to news: URLs.

902 Unqualified host name found
The host name does not contain the domain part. This usually means that the pages work fine when viewed within the original domain, but not when viewed from outside it.

903 Double slash in URL path
The URL has a double slash in its path. This is legal, but some web servers cannot handle it very well and it may cause Checkbot to run away. See also the comments below.

904 Unknown scheme in URL
The URL starts with a scheme that Checkbot does not know about. This is often caused by mistyping the scheme of the URL, but the scheme can also be a legal one. In that case please let me know so that it can be added to Checkbot.
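As a sketch, and assuming the option value is matched as a regular expression against the codes (the plural option name suggests several codes can be combined this way), warnings for codes 902 and 903 could be turned off with:

checkbot --url http://www.example.org/ --dontwarn '(902|903)'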
--suppress suppression file
Suppress the errors listed in this file. The format of the suppression file is a simple whitespace-delimited format: the error code comes first, followed by the URL. Each error code and URL combination is listed on a new line. Comments can be added to the file by starting the line with a "#" character. For example:
# 301 Moved Permanently
301 http://www.w3.org/P3P

# 403 Forbidden
403 http://www.herring.com/
For further flexibility a regular expression can be used instead of a normal URL. The regular expression must be enclosed in forward slashes. For example, to suppress all 403 errors on Wikipedia:
403 /http:\/\/wikipedia.org\/.*/
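The suppression file is then passed on the command line; suppressions.txt and the start URL here are hypothetical:

checkbot --suppress suppressions.txt http://www.example.org/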
Deprecated options which will disappear in a future release:
--allow-simple-hosts
Use of this option is deprecated. Please use the --dontwarn mechanism for error 902 instead.
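That is, instead of the deprecated option, run something like (hypothetical start URL):

checkbot --dontwarn 902 http://www.example.org/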
In some cases Checkbot can run away, checking an ever-growing set of URLs. There are two common causes. First, there might be a database application as part of the web site which generates a new page based on links on another page. Since Checkbot tries to travel through all links this will create an infinite number of pages. This kind of run-away effect is usually predictable. It can be avoided by using the --exclude option.
Second, a server configuration problem can cause a loop in generating URLs for pages that really do not exist. This will result in URLs of the form http://some.server/images/images/images/logo.png, with ever more 'images' included. Checkbot cannot check for this because the server should have indicated that the requested pages do not exist. There is no easy way to solve this other than fixing the offending web server or the broken links.
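Until the server is fixed, one stopgap is to cut off the repeating path component with --exclude, for example:

checkbot --url http://some.server/ --exclude 'images/images/'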
The error

Can't locate object method "new" via package "LWP::Protocol::https::Socket"

usually means that the current installation of LWP does not support checking of SSL links (i.e. links starting with https://). This problem can be solved by installing the Crypt::SSLeay module.
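One common way to install the module from CPAN (shown as a sketch; a packaged version from your operating system may be preferable):

perl -MCPAN -e 'install Crypt::SSLeay'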
The simplest use of Checkbot is to check a set of pages on a server. To check my checkbot pages I would use:
checkbot http://degraaff.org/checkbot/
Checkbot runs can take some time, so Checkbot can send a notification mail when the run is done:
checkbot --mailto hans@degraaff.org http://degraaff.org/checkbot/
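Long runs combine well with cron. A sketch crontab entry (the weekly schedule is an assumption) that checks the same pages every Monday at 04:00 and mails the result:

0 4 * * 1 checkbot --mailto hans@degraaff.org http://degraaff.org/checkbot/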
It is possible to check a set of local files without using a web server. This only works for static files, but it may be useful in some cases.
checkbot file:///var/www/documents/
This script uses the "LWP" modules.
This script can send mail when "Mail::Send" is present.
Hans de Graaff <hans@degraaff.org>
perl v5.14.2                          2008-10-15                          CHECKBOT(1p)