waymore - Tool to discover extensive data from online archives
waymore [-h] [-i INPUT] [-n] [-mode {U,R,B}] [-oU OUTPUT_URLS] [-oR OUTPUT_RESPONSES] [-f] [-fc FC] [-mc MC] [-l <signed integer>] [-from <yyyyMMddhhmmss>] [-to <yyyyMMddhhmmss>]
[-ci {h,d,m,none}] [-ra REGEX_AFTER] [-url-filename] [-xwm] [-xcc] [-xav] [-xus] [-xvt] [-lcc LCC] [-lcy LCY] [-t <seconds>] [-p <integer>] [-r RETRIES] [-m <integer>]
[-ko [KEYWORDS_ONLY]] [-lr LIMIT_REQUESTS] [-ow] [-nlf] [-c CONFIG] [-wrlr WAYBACK_RATE_LIMIT_RETRY] [-urlr URLSCAN_RATE_LIMIT_RETRY] [-co] [-nd] [-v] [--version]
waymore is a versatile tool designed to extract comprehensive
information from various sources including the Wayback Machine, Common
Crawl, Alien Vault OTX, URLScan, and VirusTotal. Whether you're searching
for historical web data or analyzing security threats, waymore provides a
seamless experience with its intuitive interface and extensive features.
- -h, --help:
- Display command usage and options, including detailed explanations of each
available option.
- -i INPUT, --input INPUT:
- The target domain (or file of domains) to find links for. This can be a
domain only, or a domain with a specific path. If passing a domain only (to
get everything for that domain), don't prefix it with "www."
- -n, --no-subs:
- Don't include subdomains of the target domain (only used if input is not a
domain with a specific path).
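For example, to gather links for a single domain while ignoring its
subdomains (example.com is a placeholder target):
$ waymore -i example.com -n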
- -mode {U,R,B}:
- The mode to run: U (retrieve URLs only), R (download Responses only) or B
(Both).
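For example, to only download archived responses for a placeholder target:
$ waymore -i example.com -mode R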
- -oU OUTPUT_URLS, --output-urls OUTPUT_URLS:
- The file to save the links output to, including path if necessary. If the
argument is not passed, a "results" directory will be created in the path
specified by the DEFAULT_OUTPUT_DIR key in the config.yml file (typically
defaults to "~/.config/waymore/"). Within that, a directory will be created
for the target domain (or domain with path) passed with "-i" (or for each
line of a file passed with "-i").
- -oR OUTPUT_RESPONSES, --output-responses OUTPUT_RESPONSES:
- The directory to save the response output files to, including path if
necessary. If the argument is not passed, a "results" directory will be
created in the path specified by the DEFAULT_OUTPUT_DIR key in the
config.yml file (typically defaults to "~/.config/waymore/"). Within that, a
directory will be created for the target domain (or domain with path) passed
with "-i" (or for each line of a file passed with "-i").
- -f, --filter-responses-only:
- The initial links from the Wayback Machine will not be filtered (by MIME
type and response code); only the responses that are downloaded will be
filtered. For example, it may be useful to still see all available paths
from the links even if you don't want to check the content.
- -fc FC:
- Filter HTTP status codes for retrieved URLs and responses. Comma-separated
list of codes (default: the FILTER_CODE values from config.yml). Passing
this argument will override the value from config.yml.
- -mc MC:
- Only match HTTP status codes for retrieved URLs and responses.
Comma-separated list of codes. Passing this argument overrides the config
FILTER_CODE and -fc.
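For example, to keep only results that returned a 200 or 301 status
(illustrative codes, placeholder target):
$ waymore -i example.com -mc 200,301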
- -l <signed integer>, --limit <signed integer>:
- How many responses will be saved (if -mode is R or B). A positive value
will get the first N results, a negative value will get the last N results.
A value of 0 will get ALL responses (default: 5000).
- -from <yyyyMMddhhmmss>, --from-date <yyyyMMddhhmmss>:
- What date to get responses from. If not specified, it will get from the
earliest possible results. A partial value can be passed, e.g. 2016, 201805,
etc.
- -to <yyyyMMddhhmmss>, --to-date <yyyyMMddhhmmss>:
- What date to get responses to. If not specified, it will get to the latest
possible results. A partial value can be passed, e.g. 2016, 201805, etc.
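Combining these (with placeholder values), the following would save only the
LAST 100 responses captured during 2020:
$ waymore -i example.com -l -100 -from 2020 -to 2020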
- -ci {h,d,m,none}, --capture-interval {h,d,m,none}:
- Filters the search on the Wayback Machine (archive.org) to get at most 1
capture per hour (h), day (d) or month (m). This filter is used for
responses only. The default is 'd', but it can also be set to 'none' to not
filter anything and get all responses.
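For example, to download at most one capture per month for a placeholder
target:
$ waymore -i example.com -mode R -ci m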
- -ra REGEX_AFTER, --regex-after REGEX_AFTER:
- RegEx used to filter the links found from all sources AND the responses
downloaded. Only positive matches will be output.
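For example, to only keep links and responses whose URL ends in .js (the
pattern and target are illustrative):
$ waymore -i example.com -ra "\.js$"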
- -url-filename:
- Set the file name of downloaded responses to the URL that generated the
response; otherwise it will be set to the hash value of the response. Using
the hash value means multiple URLs that generated the same response will
only result in one file being saved for that response.
- -xwm:
- Exclude checks for links from Wayback Machine (archive.org)
- -xcc:
- Exclude checks for links from commoncrawl.org
- -xav:
- Exclude checks for links from alienvault.com
- -xus:
- Exclude checks for links from urlscan.io
- -xvt:
- Exclude checks for links from virustotal.com
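These can be combined, e.g. to query only the Wayback Machine and URLScan
for a placeholder target:
$ waymore -i example.com -xcc -xav -xvt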
- -lcc LCC:
- Limit the number of Common Crawl index collections searched, e.g. '-lcc
10' will just search the latest 10 collections (default: 3). As of July
2023 there are 95 collections. Setting this to 0 will search ALL
collections. If you don't want to search Common Crawl at all, use the -xcc
option.
- -lcy LCY:
- Limit the Common Crawl index collections searched by the year of the index
data. The earliest index has data from 2008. Setting this to 0 (default)
will search collections of any year (but still in conjunction with -lcc).
For example, if you are only interested in data from 2015 and after, pass
-lcy 2015. If you don't want to search Common Crawl at all, use the -xcc
option.
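For example (placeholder values), to search only the latest 5 collections
with index data from 2015 onwards:
$ waymore -i example.com -lcc 5 -lcy 2015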
- -t <seconds>, --timeout <seconds>:
- This is for archived responses only! How many seconds to wait for the
server to send data before giving up (default: 30 seconds).
- -p <integer>, --processes <integer>:
- Basic multithreading is done when getting requests for a file of URLs.
This argument determines the number of processes (threads) used (default:
1).
- -r RETRIES, --retries RETRIES:
- The number of retries for requests that get a connection error or are rate
limited (default: 1).
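For example (placeholder values), to download responses with 5 threads, a 60
second timeout and 3 retries:
$ waymore -i example.com -mode R -p 5 -t 60 -r 3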
- -m <integer>, --memory-threshold <integer>:
- The memory threshold percentage. If the machine's memory goes above the
threshold, the program will be stopped and ended gracefully before running
out of memory (default: 95).
- -ko [KEYWORDS_ONLY], --keywords-only [KEYWORDS_ONLY]:
- Only return links and responses that contain keywords that you are
interested in. This can reduce the time it takes to get results. If you
provide the flag with no value, keywords are taken from the comma-separated
list in the "config.yml" file under the "FILTER_KEYWORDS" key; otherwise you
can pass a specific RegEx value to use, e.g. -ko "admin" to only get links
containing the word admin, or -ko "\.js(\?|$)" to only get JS files. The
RegEx check is NOT case sensitive.
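For example, to restrict results to links containing the word admin
(illustrative keyword, placeholder target):
$ waymore -i example.com -ko "admin"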
- -lr LIMIT_REQUESTS, --limit-requests LIMIT_REQUESTS:
- Limit the number of requests that will be made when getting links from a
source (this doesn't apply to Common Crawl). Some targets can require a huge
number of requests that are just not feasible to make, so this can be used
to manage that situation. This defaults to 0 (zero), which means there is no
limit.
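For example, to make at most 100 requests per source (illustrative value):
$ waymore -i example.com -lr 100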
- -ow, --output-overwrite:
- If the URL output file (default waymore.txt) already exists, it will be
overwritten instead of being appended to.
- -nlf, --new-links-file:
- If this argument is passed, a .new file will also be written that will
contain links for the latest run.
- -c CONFIG, --config CONFIG:
- Path to the YML config file. If not passed, it looks for the file
'config.yml' in the same directory as the runtime file 'waymore.py'.
- -wrlr WAYBACK_RATE_LIMIT_RETRY, --wayback-rate-limit-retry
WAYBACK_RATE_LIMIT_RETRY:
- The number of minutes the user wants to wait for a rate limit pause on the
Wayback Machine (archive.org) instead of stopping with a 429 error (default:
3).
- -urlr URLSCAN_RATE_LIMIT_RETRY, --urlscan-rate-limit-retry
URLSCAN_RATE_LIMIT_RETRY:
- The number of minutes the user wants to wait for a rate limit pause on
URLScan.io instead of stopping with a 429 error (default: 1).
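For example (placeholder values), to wait up to 5 minutes on Wayback Machine
rate limits and 2 minutes on URLScan.io rate limits:
$ waymore -i example.com -wrlr 5 -urlr 2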
- -co, --check-only:
- This will make a few minimal requests to show you how many requests, and
roughly how long it could take, to get URLs from the sources and download
responses from the Wayback Machine.
- -nd, --notify-discord:
- Whether to send a notification to Discord when waymore completes. It
requires WEBHOOK_DISCORD to be provided in the config.yml file.
- -v, --verbose:
- Verbose output
- --version:
- Show version number
Common usage:
Just get the URLs from all sources for redbull.com (-mode U is
just for URLs, so no responses are downloaded):
$ waymore -i redbull.com -mode U
The URLs are saved in the same path as config.yml (typically
~/.config/waymore) under results/redbull.com/waymore.txt
Get ALL the URLs from the Wayback Machine for redbull.com (with -f, no
filters are applied to the links, and no URLs are retrieved from Common
Crawl, Alien Vault, URLScan and Virus Total because -xcc, -xav, -xus and
-xvt are passed respectively). Save the FIRST 200 responses that are found,
starting from 2022 (-l 200 -from 2022):
$ waymore -i redbull.com -f -xcc -xav -xus -xvt -l 200 -from 2022
You can pipe waymore to other tools. Any errors are sent to stderr
and any links found are sent to stdout. The output file is still created in
addition to the links being piped to the next program. However, archived
responses are not piped to the next program, but they are still written to
files. For example:
$ waymore -i redbull.com -mode U | unfurl keys | sort -u
You can also pass the input through stdin instead of -i:
$ cat redbull_subs.txt | waymore
Sometimes you may just want to check how many requests will be made, and
roughly how long waymore is likely to take, if you run it for a particular
domain. You can do a quick check by using the -co/--check-only argument. For
example:
$ waymore -i redbull.com --check-only
Aquila Macedo <aquilamacedo@riseup.net>