httrack - offline browser : copy websites to a local directory
httrack [ url ]... [ -filter ]... [ +filter ]... [ -O,
--path ] [ -w, --mirror ] [ -W, --mirror-wizard ] [ -g,
--get-files ] [ -i, --continue ] [ -Y, --mirrorlinks ] [
-P, --proxy ] [ -%f, --httpproxy-ftp[=N] ] [ -%b,
--bind ] [ -rN, --depth[=N] ] [ -%eN, --ext-depth[=N] ] [
-mN, --max-files[=N] ] [ -MN, --max-size[=N] ] [ -EN,
--max-time[=N] ] [ -AN, --max-rate[=N] ] [ -%cN,
--connection-per-second[=N] ] [ -GN, --max-pause[=N] ] [ -cN,
--sockets[=N] ] [ -TN, --timeout[=N] ] [ -RN,
--retries[=N] ] [ -JN, --min-rate[=N] ] [ -HN,
--host-control[=N] ] [ -%P, --extended-parsing[=N] ] [ -n,
--near ] [ -t, --test ] [ -%L, --list ] [ -%S,
--urllist ] [ -NN, --structure[=N] ] [ -%D,
--cached-delayed-type-check ] [ -%M, --mime-html ] [ -LN,
--long-names[=N] ] [ -KN, --keep-links[=N] ] [ -x,
--replace-external ] [ -%x, --disable-passwords ] [ -%q,
--include-query-string ] [ -o, --generate-errors ] [ -X,
--purge-old[=N] ] [ -%p, --preserve ] [ -%T,
--utf8-conversion ] [ -bN, --cookies[=N] ] [ -u,
--check-type[=N] ] [ -j, --parse-java[=N] ] [ -sN,
--robots[=N] ] [ -%h, --http-10 ] [ -%k, --keep-alive ] [
-%B, --tolerant ] [ -%s, --updatehack ] [ -%u,
--urlhack ] [ -%A, --assume ] [ -@iN, --protocol[=N] ] [
-%w, --disable-module ] [ -F, --user-agent ] [ -%R,
--referer ] [ -%E, --from ] [ -%F, --footer ] [ -%l,
--language ] [ -%a, --accept ] [ -%X, --headers ] [ -C,
--cache[=N] ] [ -k, --store-all-in-cache ] [ -%n,
--do-not-recatch ] [ -%v, --display ] [ -Q, --do-not-log ]
[ -q, --quiet ] [ -z, --extra-log ] [ -Z, --debug-log ]
[ -v, --verbose ] [ -f, --file-log ] [ -f2,
--single-log ] [ -I, --index ] [ -%i, --build-top-index ]
[ -%I, --search-index ] [ -pN, --priority[=N] ] [ -S,
--stay-on-same-dir ] [ -D, --can-go-down ] [ -U,
--can-go-up ] [ -B, --can-go-up-and-down ] [ -a,
--stay-on-same-address ] [ -d, --stay-on-same-domain ] [ -l,
--stay-on-same-tld ] [ -e, --go-everywhere ] [ -%H,
--debug-headers ] [ -%!, --disable-security-limits ] [ -V,
--userdef-cmd ] [ -%W, --callback ] [ -K, --keep-links[=N]
] [
httrack allows you to download a World Wide Web site from
the Internet to a local directory, building recursively all directories,
getting HTML, images, and other files from the server to your computer.
HTTrack arranges the original site's relative link-structure. Simply open a
page of the "mirrored" website in your browser, and you can browse
the site from link to link, as if you were viewing it online. HTTrack can
also update an existing mirrored site, and resume interrupted downloads.
- httrack www.someweb.com/bob/
-
mirror site www.someweb.com/bob/ and only this site
- httrack www.someweb.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg
-mime:application/*
-
mirror the two sites together (with shared links) and accept any .jpg files
on .com sites
- httrack www.someweb.com/bob/bobby.html +* -r6
- means get all files starting from bobby.html, with 6 link-depth, and
possibility of going everywhere on the web
- httrack www.someweb.com/bob/bobby.html --spider -P
proxy.myhost.com:8080
- runs the spider on www.someweb.com/bob/bobby.html using a proxy
- httrack --update
- updates a mirror in the current folder
- httrack
- will bring you to the interactive mode
- httrack --continue
- continues a mirror in the current folder
- -O
- path for mirror/logfiles+cache (-O path mirror[,path cache and logfiles])
(--path <param>)
- -w
- *mirror web sites (--mirror)
- -W
- mirror web sites, semi-automatic (asks questions) (--mirror-wizard)
- -g
- just get files (saved in the current directory) (--get-files)
- -i
- continue an interrupted mirror using the cache (--continue)
- -Y
- mirror ALL links located in the first level pages (mirror links)
(--mirrorlinks)
- -P
- proxy use (-P proxy:port or -P user:pass@proxy:port) (--proxy
<param>)
- -%f
- *use proxy for ftp (f0 don t use) (--httpproxy-ftp[=N])
- -%b
- use this local hostname to make/send requests (-%b hostname) (--bind
<param>)
- -rN
- set the mirror depth to N (* r9999) (--depth[=N])
- -%eN
- set the external links depth to N (* %e0) (--ext-depth[=N])
- -mN
- maximum file length for a non-html file (--max-files[=N])
- -mN,N2
- maximum file length for non html (N) and html (N2)
- -MN
- maximum overall size that can be uploaded/scanned (--max-size[=N])
- -EN
- maximum mirror time in seconds (60=1 minute, 3600=1 hour)
(--max-time[=N])
- -AN
- maximum transfer rate in bytes/seconds (1000=1KB/s max)
(--max-rate[=N])
- -%cN
- maximum number of connections/seconds (*%c10)
(--connection-per-second[=N])
- -GN
- pause transfer if N bytes reached, and wait until lock file is deleted
(--max-pause[=N])
- -cN
- number of multiple connections (*c8) (--sockets[=N])
- -TN
- timeout, number of seconds after a non-responding link is shutdown
(--timeout[=N])
- -RN
- number of retries, in case of timeout or non-fatal errors (*R1)
(--retries[=N])
- -JN
- traffic jam control, minimum transfert rate (bytes/seconds) tolerated for
a link (--min-rate[=N])
- -HN
- host is abandoned if: 0=never, 1=timeout, 2=slow, 3=timeout or slow
(--host-control[=N])
- -%P
- *extended parsing, attempt to parse all links, even in unknown tags or
Javascript (%P0 don t use) (--extended-parsing[=N])
- -n
- get non-html files near an html file (ex: an image located outside)
(--near)
- -t
- test all URLs (even forbidden ones) (--test)
- -%L
- <file> add all URL located in this text file (one URL per line)
(--list <param>)
- -%S
- <file> add all scan rules located in this text file (one scan rule
per line) (--urllist <param>)
- -NN
- structure type (0 *original structure, 1+: see below)
(--structure[=N])
- -or
- user defined structure (-N "%h%p/%n%q.%t")
- -%N
- delayed type check, don t make any link test but wait for files download
to start instead (experimental) (%N0 don t use, %N1 use for unknown
extensions, * %N2 always use)
- -%D
- cached delayed type check, don t wait for remote type during updates, to
speedup them (%D0 wait, * %D1 don t wait)
(--cached-delayed-type-check)
- -%M
- generate a RFC MIME-encapsulated full-archive (.mht) (--mime-html)
- -LN
- long names (L1 *long names / L0 8-3 conversion / L2 ISO9660 compatible)
(--long-names[=N])
- -KN
- keep original links (e.g. http://www.adr/link) (K0 *relative link, K
absolute links, K4 original links, K3 absolute URI links, K5 transparent
proxy link) (--keep-links[=N])
- -x
- replace external html links by error pages (--replace-external)
- -%x
- do not include any password for external password protected websites (%x0
include) (--disable-passwords)
- -%q
- *include query string for local files (useless, for information purpose
only) (%q0 don t include) (--include-query-string)
- -o
- *generate output html file in case of error (404..) (o0 don t generate)
(--generate-errors)
- -X
- *purge old files after update (X0 keep delete) (--purge-old[=N])
- -%p
- preserve html files as is (identical to -K4 -%F "" )
(--preserve)
- -%T
- links conversion to UTF-8 (--utf8-conversion)
- -bN
- accept cookies in cookies.txt (0=do not accept,* 1=accept)
(--cookies[=N])
- -u
- check document type if unknown (cgi,asp..) (u0 don t check, * u1 check but
/, u2 check always) (--check-type[=N])
- -j
- *parse Java Classes (j0 don t parse, bitmask: |1 parse default, |2 don t
parse .class |4 don t parse .js |8 don t be aggressive)
(--parse-java[=N])
- -sN
- follow robots.txt and meta robots tags (0=never,1=sometimes,* 2=always,
3=always (even strict rules)) (--robots[=N])
- -%h
- force HTTP/1.0 requests (reduce update features, only for old servers or
proxies) (--http-10)
- -%k
- use keep-alive if possible, greately reducing latency for small files and
test requests (%k0 don t use) (--keep-alive)
- -%B
- tolerant requests (accept bogus responses on some servers, but not
standard!) (--tolerant)
- -%s
- update hacks: various hacks to limit re-transfers when updating (identical
size, bogus response..) (--updatehack)
- -%u
- url hacks: various hacks to limit duplicate URLs (strip //,
www.foo.com==foo.com..) (--urlhack)
- -%A
- assume that a type (cgi,asp..) is always linked with a mime type (-%A
php3,cgi=text/html;dat,bin=application/x-zip) (--assume
<param>)
- -can
- also be used to force a specific file type: --assume
foo.cgi=text/html
- -@iN
- internet protocol (0=both ipv6+ipv4, 4=ipv4 only, 6=ipv6 only)
(--protocol[=N])
- -%w
- disable a specific external mime module (-%w htsswf -%w htsjava)
(--disable-module <param>)
- -F
- user-agent field sent in HTTP headers (-F "user-agent name")
(--user-agent <param>)
- -%R
- default referer field sent in HTTP headers (--referer <param>)
- -%E
- from email address sent in HTTP headers (--from <param>)
- -%F
- footer string in Html code (-%F "Mirrored [from host %s [file %s [at
%s]]]" (--footer <param>)
- -%l
- preffered language (-%l "fr, en, jp, *" (--language
<param>)
- -%a
- accepted formats (-%a "text/html,image/png;q=0.9,*/*;q=0.1"
(--accept <param>)
- -%X
- additional HTTP header line (-%X "X-Magic: 42" (--headers
<param>)
- -C
- create/use a cache for updates and retries (C0 no cache,C1 cache is
prioritary,* C2 test update before) (--cache[=N])
- -k
- store all files in cache (not useful if files on disk)
(--store-all-in-cache)
- -%n
- do not re-download locally erased files (--do-not-recatch)
- -%v
- display on screen filenames downloaded (in realtime) - * %v1 short version
- %v2 full animation (--display)
- -Q
- no log - quiet mode (--do-not-log)
- -q
- no questions - quiet mode (--quiet)
- -z
- log - extra infos (--extra-log)
- -Z
- log - debug (--debug-log)
- -v
- log on screen (--verbose)
- -f
- *log in files (--file-log)
- -f2
- one single log file (--single-log)
- -I
- *make an index (I0 don t make) (--index)
- -%i
- make a top index for a project folder (* %i0 don t make)
(--build-top-index)
- -%I
- make an searchable index for this mirror (* %I0 don t make)
(--search-index)
- -pN
- priority mode: (* p3) (--priority[=N])
- -p0
- just scan, don t save anything (for checking links)
- -p1
- save only html files
- -p2
- save only non html files
- -*p3
- save all files
- -p7
- get html files before, then treat other files
- -S
- stay on the same directory (--stay-on-same-dir)
- -D
- *can only go down into subdirs (--can-go-down)
- -U
- can only go to upper directories (--can-go-up)
- -B
- can both go up&down into the directory structure
(--can-go-up-and-down)
- -a
- *stay on the same address (--stay-on-same-address)
- -d
- stay on the same principal domain (--stay-on-same-domain)
- -l
- stay on the same TLD (eg: .com) (--stay-on-same-tld)
- -e
- go everywhere on the web (--go-everywhere)
- -%H
- debug HTTP headers in logfile (--debug-headers)
- -#X
- *use optimized engine (limited memory boundary checks)
(--fast-engine)
- -#0
- filter test (-#0 *.gif www.bar.com/foo.gif ) (--debug-testfilters
<param>)
- -#1
- simplify test (-#1 ./foo/bar/../foobar)
- -#2
- type test (-#2 /foo/bar.php)
- -#C
- cache list (-#C *.com/spider*.gif (--debug-cache <param>)
- -#R
- cache repair (damaged cache) (--repair-cache)
- -#d
- debug parser (--debug-parsing)
- -#E
- extract new.zip cache meta-data in meta.zip
- -#f
- always flush log files (--advanced-flushlogs)
- -#FN
- maximum number of filters (--advanced-maxfilters[=N])
- -#h
- version info (--version)
- -#K
- scan stdin (debug) (--debug-scanstdin)
- -#L
- maximum number of links (-#L1000000) (--advanced-maxlinks[=N])
- -#p
- display ugly progress information (--advanced-progressinfo)
- -#P
- catch URL (--catch-url)
- -#R
- old FTP routines (debug) (--repair-cache)
- -#T
- generate transfer ops. log every minutes (--debug-xfrstats)
- -#u
- wait time (--advanced-wait)
- -#Z
- generate transfer rate statistics every minutes (--debug-ratestats)
- -%!
- bypass built-in security limits aimed to avoid bandwidth abuses
(bandwidth, simultaneous connections) (--disable-security-limits)
- -IMPORTANT
- NOTE: DANGEROUS OPTION, ONLY SUITABLE FOR EXPERTS
- -USE
- IT WITH EXTREME CARE
- -V
- execute system command after each files ($0 is the filename: -V "rm
\$0") (--userdef-cmd <param>)
- -%W
- use an external library function as a wrapper (-%W
myfoo.so[,myparameters]) (--callback <param>)
- -N0
- Site-structure (default)
- -N1
- HTML in web/, images/other files in web/images/
- -N2
- HTML in web/HTML, images/other in web/images
- -N3
- HTML in web/, images/other in web/
- -N4
- HTML in web/, images/other in web/xxx, where xxx is the file extension
(all gif will be placed onto web/gif, for example)
- -N5
- Images/other in web/xxx and HTML in web/HTML
- -N99
- All files in web/, with random names (gadget !)
- -N100
- Site-structure, without www.domain.xxx/
- -N101
- Identical to N1 except that "web" is replaced by the site s
name
- -N102
- Identical to N2 except that "web" is replaced by the site s
name
- -N103
- Identical to N3 except that "web" is replaced by the site s
name
- -N104
- Identical to N4 except that "web" is replaced by the site s
name
- -N105
- Identical to N5 except that "web" is replaced by the site s
name
- -N199
- Identical to N99 except that "web" is replaced by the site s
name
- -N1001
- Identical to N1 except that there is no "web" directory
- -N1002
- Identical to N2 except that there is no "web" directory
- -N1003
- Identical to N3 except that there is no "web" directory (option
set for g option)
- -N1004
- Identical to N4 except that there is no "web" directory
- -N1005
- Identical to N5 except that there is no "web" directory
- -N1099
- Identical to N99 except that there is no "web" directory
Details: User-defined option N
%n Name of file without file type (ex: image)
%N Name of file, including file type (ex: image.gif)
%t File type (ex: gif)
%p Path [without ending /] (ex: /someimages)
%h Host name (ex: www.someweb.com)
%M URL MD5 (128 bits, 32 ascii bytes)
%Q query string MD5 (128 bits, 32 ascii bytes)
%k full query string
%r protocol name (ex: http)
%q small query string MD5 (16 bits, 4 ascii bytes)
%s? Short name version (ex: %sN)
%[param] param variable in query string
%[param:before:after:empty:notfound] advanced variable extraction
Details: User-defined option N and advanced variable
extraction
%[param:before:after:empty:notfound]
- -param
- : parameter name
- -before
- : string to prepend if the parameter was found
- -after
- : string to append if the parameter was found
- -notfound
- : string replacement if the parameter could not be found
- -empty
- : string replacement if the parameter was empty
- -all
- fields, except the first one (the parameter name), can be empty
- -K0
- foo.cgi?q=45 -> foo4B54.html?q=45 (relative URI, default)
- -K
- -> http://www.foobar.com/folder/foo.cgi?q=45 (absolute URL)
(--keep-links[=N])
- -K3
- -> /folder/foo.cgi?q=45 (absolute URI)
- -K4
- -> foo.cgi?q=45 (original URL)
- -K5
- -> http://www.foobar.com/folder/foo4B54.html?q=45 (transparent proxy
URL)
- --mirror
-
<URLs> *make a mirror of site(s) (default)
- --get
-
<URLs> get the files indicated, do not seek other URLs (-qg)
- --list
-
<text file> add all URL located in this text file (-%L)
- --mirrorlinks
- <URLs> mirror all links in 1st level pages (-Y)
- --testlinks
-
<URLs> test links in pages (-r1p0C0I0t)
- --spider
-
<URLs> spider site(s), to test links: reports Errors & Warnings
(-p0C0I0t)
- --testsite
-
<URLs> identical to --spider
- --skeleton
-
<URLs> make a mirror, but gets only html files (-p1)
- --update
-
update a mirror, without confirmation (-iC2)
- --continue
-
continue a mirror, without confirmation (-iC1)
- --catchurl
-
create a temporary proxy to capture an URL or a form post URL
- --clean
-
erase cache & log files
- --http10
-
force http/1.0 requests (-%h)
/etc/httrack.conf
The system wide configuration file.
- HOME
- Is being used if you defined in /etc/httrack.conf the line path
~/websites/#
Errors/Warnings are reported to hts-log.txt by default, or
to stderr if the -v option was specified.
These are the principals limits of HTTrack for that moment. Note
that we did not heard about any other utility that would have solved
them.
- Several scripts generating complex filenames may not find
them (ex: img.src='image'+a+Mobj.dst+'.gif')
- Some java classes may not find some files on them (class
included)
- Cgi-bin links may not work properly in some cases
(parameters needed). To avoid them: use filters like -*cgi-bin*
Please reports bugs to <bugs@httrack.com>. Include a
complete, self-contained example that will allow the bug to be reproduced,
and say which version of httrack you are using. Do not forget to detail
options used, OS version, and any other information you deem necessary.
Copyright (C) 1998-2023 Xavier Roche and other contributors
This program is free software: you can redistribute it and/or
modify it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or (at your
option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see
<http://www.gnu.org/licenses/>.
The most recent released version of httrack can be found at:
http://www.httrack.com
Xavier Roche <roche@httrack.com>
The HTML documentation (available online at
http://www.httrack.com/html/ ) contains more detailed information.
Please also refer to the httrack FAQ (available online at
http://www.httrack.com/html/faq.html )