LINKCHECKER(1) | LinkChecker commandline usage | LINKCHECKER(1)
linkchecker - command line client to check HTML documents and websites for broken links
linkchecker [options] [file-or-url]...
LinkChecker features recursive checking, proxy support, username/password authorization, cookie support, restriction of link checking with regular expression filters for URLs, honoring of the robots.txt exclusion protocol, and plugin support.

The most common use checks the given domain recursively:
linkchecker http://www.example.com/
Beware that this checks the whole site, which can have thousands of URLs. Use the -r option to restrict the recursion depth.
Don't check URLs with /secret in their name. All other links are checked as usual:
linkchecker --ignore-url=/secret mysite.example.com
Checking a local HTML file on Unix:
linkchecker ../bla.html
Checking a local HTML file on Windows:
linkchecker c:\temp\test.html
You can skip the http:// URL part if the domain starts with www.:
linkchecker www.example.com
You can skip the ftp:// URL part if the domain starts with ftp.:
linkchecker -r0 ftp.example.com
Generate a sitemap graph and convert it with the graphviz dot utility:
linkchecker -odot -v www.example.com | dot -Tps > sitemap.ps
Configuration files can specify all command line options. They can also specify some options that cannot be set on the command line. See linkcheckerrc(5) for more info.
Note that by default only errors and warnings are logged. You should use the --verbose option to get the complete URL list, especially when outputting a sitemap graph format.
LinkChecker accepts Python regular expressions. See http://docs.python.org/howto/regex.html for an introduction.
As an extension, a leading exclamation mark negates the regular expression.
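For example, combining the negation with the --ignore-url option shown above restricts full checking to a single site: every URL that does not start with the site root is only ignored (the host name is a placeholder):

linkchecker --ignore-url='!^http://www\.example\.com/' http://www.example.com/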
A cookie file contains standard HTTP header (RFC 2616) data with the following possible names:

Host (required) - sets the domain the cookies are valid for.

Path (optional) - gives the path the cookies are valid for; the default path is /.

Set-cookie (required) - sets the cookie name and value. Can be given more than once.
Multiple entries are separated by a blank line. The example below will send two cookies to all URLs starting with http://example.com/hello/ and one to all URLs starting with https://example.org/:
Host: example.com
Path: /hello
Set-cookie: ID="smee"
Set-cookie: spam="egg"

Host: example.org
Set-cookie: baggage="elitist"; comment="hologram"
To use a proxy on Unix or Windows, set the $http_proxy, $https_proxy or $ftp_proxy environment variables to the proxy URL. The URL should be of the form http://[user:pass@]host[:port]. LinkChecker also detects manual proxy settings of Internet Explorer under Windows systems, and gconf or KDE on Linux systems. On a Mac, use the Internet Config to select a proxy. You can also set a comma-separated domain list in the $no_proxy environment variable to ignore any proxy settings for these domains. Setting an HTTP proxy on Unix for example looks like this:
export http_proxy="http://proxy.example.com:8080"
Proxy authentication is also supported:
export http_proxy="http://user1:mypass@proxy.example.org:8081"
Setting a proxy on the Windows command prompt:
set http_proxy=http://proxy.example.com:8080
All URLs have to pass a preliminary syntax test. Minor quoting mistakes will issue a warning; all other invalid syntax issues are errors. After the syntax check passes, the URL is queued for connection checking. All connection check types are described below.
For FTP links we do the following (an example follows this list):

1) connect to the specified host
2) try to login with the given user and password. The default user is anonymous, the default password is anonymous@.
3) try to change to the given directory
4) list the file with the NLST command
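For example, checking a file in an FTP directory with explicit credentials (host, user and password are placeholders):

linkchecker ftp://user:password@ftp.example.com/pub/README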
For Telnet links we try to connect and, if user and password are given, log in to the given Telnet server.
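A hypothetical example, assuming the credentials are embedded in the URL:

linkchecker telnet://user:secret@telnet.example.com/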
For NNTP links we try to connect to the given NNTP server. If a news group or article is specified, we try to request it from the server.
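For example, requesting a news group from a specific server (server and group names are placeholders):

linkchecker nntp://news.example.com/comp.lang.python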
An unsupported link will only print a warning; no further checking will be made. The complete list of recognized but unsupported link types can be found in the linkcheck/checker/unknownurl.py source file. The most prominent of these are JavaScript links.
There are two plugin types: connection and content plugins. Connection plugins are run after a successful connection to the URL host. Content plugins are run if the URL type has content (mailto: URLs have no content, for example) and if the check is not forbidden (e.g. by HTTP robots.txt). See linkchecker --list-plugins for a list of plugins and their documentation. All plugins are enabled via the linkcheckerrc(5) configuration file.
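As an illustration, enabling a plugin amounts to adding a section with the plugin's name to linkcheckerrc; assuming an AnchorCheck plugin appears in the linkchecker --list-plugins output, the entry could look like this:

[AnchorCheck]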
Before descending recursively into a URL, it has to fulfill several conditions. They are checked in this order:
1. A URL must be valid.

2. A URL must be parseable. This currently includes HTML files, Opera bookmarks files, and directories. If a file type cannot be determined (for example it does not have a common HTML file extension, and the content does not look like HTML), it is assumed to be non-parseable.

3. The URL content must be retrievable. This is usually the case except for example mailto: or unknown URL types.

4. The maximum recursion level must not be exceeded. It is configured with the --recursion-level option and is unlimited by default.

5. It must not match the ignored URL list. This is controlled with the --ignore-url option.

6. The Robots Exclusion Protocol must allow links in the URL to be followed recursively. This is checked by searching for a "nofollow" directive in the HTML header data.
Note that the directory recursion reads all files in that directory, not just a subset like index.htm*.
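Conditions 4 and 5 can be combined on the command line; for example, to descend at most two levels while skipping all URLs ending in .pdf (the pattern is only an illustration):

linkchecker -r2 --ignore-url='\.pdf$' http://www.example.com/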
URLs on the command line starting with ftp. are treated like ftp://ftp., and URLs starting with www. are treated like http://www.. You can also give local files as arguments.
If you have your system configured to automatically establish a connection to the internet (e.g. with diald), it will connect when checking links not pointing to your local host. Use the --ignore-url option to prevent this.
Javascript links are not supported.
If your platform does not support threading, LinkChecker disables it automatically.
You can supply multiple user/password pairs in a configuration file.
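A minimal sketch, assuming the [authentication] entry format documented in linkcheckerrc(5); the URL pattern, user and password are placeholders:

[authentication]
entry=
  ^https?://www\.example\.com/ myuser mypassword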
When checking news: links, the given NNTP host doesn't need to be the same as the host of the user browsing your pages.
NNTP_SERVER - specifies default NNTP server
http_proxy - specifies default HTTP proxy server
ftp_proxy - specifies default FTP proxy server
no_proxy - comma-separated list of domains to not contact over a proxy server
LC_MESSAGES, LANG, LANGUAGE - specify output language
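For example, in a Unix shell (the values are placeholders):

export NNTP_SERVER="news.example.com"
export no_proxy="example.com,example.org"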
The return value is 2 when a program error occurred.
The return value is 1 when invalid links were found, or when link warnings were found and warnings are enabled.
Else the return value is zero.
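This makes the result easy to test in shell scripts:

linkchecker http://www.example.com/
echo "exit status: $?"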
LinkChecker consumes memory for each queued URL to check. With thousands of queued URLs the amount of consumed memory can become quite large. This might slow down the program or even the whole system.
~/.linkchecker/linkcheckerrc - default configuration file
~/.linkchecker/blacklist - default blacklist logger output filename
linkchecker-out.TYPE - default logger file output name
http://docs.python.org/library/codecs.html#standard-encodings - valid output encodings
http://docs.python.org/howto/regex.html - regular expression documentation
Bastian Kleineidam <bastian.kleineidam@web.de>
Copyright © 2000-2014 Bastian Kleineidam
2010-07-01 | LinkChecker