waymore - Tool to discover extensive data from online archives
waymore [-h] [-i INPUT] [-n] [-mode {U,R,B}] [-oU OUTPUT_URLS] [-oR OUTPUT_RESPONSES] [-f] [-fc FC] [-mc MC] [-l <signed integer>] [-from <yyyyMMddhhmmss>] [-to <yyyyMMddhhmmss>]
[-ci {h,d,m,none}] [-ra REGEX_AFTER] [-url-filename] [-xwm] [-xcc] [-xav] [-xus] [-xvt] [-lcc LCC] [-lcy LCY] [-t <seconds>] [-p <integer>] [-r RETRIES] [-m <integer>]
[-ko [KEYWORDS_ONLY]] [-lr LIMIT_REQUESTS] [-ow] [-nlf] [-c CONFIG] [-wrlr WAYBACK_RATE_LIMIT_RETRY] [-urlr URLSCAN_RATE_LIMIT_RETRY] [-co] [-nd] [-v] [--version]
waymore is a versatile tool designed to extract comprehensive
information from various sources including the Wayback Machine, Common
Crawl, Alien Vault OTX, URLScan, and VirusTotal. Whether you're searching
for historical web data or analyzing security threats, waymore provides a
seamless experience with its intuitive interface and extensive features.
- -h, --help:
- Display command usage and options, including detailed explanations of each
available option.
- -i INPUT, --input INPUT:
- The target domain (or file of domains) to find links for. This can be a
domain only, or a domain with a specific path. If passing a domain only (to
get everything for that domain), don't prefix it with "www."
- -n, --no-subs:
- Don't include subdomains of the target domain (only used if input is not a
domain with a specific path).
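For example, to gather links for a single domain while ignoring its
subdomains (example.com is a placeholder target):
$ waymore -i example.com -n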
- -mode {U,R,B}:
- The mode to run: U (retrieve URLs only), R (download Responses only) or B
(Both).
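For example, to only download archived responses for a placeholder target:
$ waymore -i example.com -mode R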
- -oU OUTPUT_URLS, --output-urls OUTPUT_URLS:
- The file to save the links output to, including path if necessary. If the
argument is not passed, a "results" directory will be created in the path
specified by the DEFAULT_OUTPUT_DIR key in the config.yml file (typically
defaults to "~/.config/waymore/"). Within that, a directory will be created
for the target domain (or domain with path) passed with "-i" (or for each
line of a file passed with "-i").
- -oR OUTPUT_RESPONSES, --output-responses OUTPUT_RESPONSES:
- The directory to save the response output files to, including path if
necessary. If the argument is not passed, a "results" directory will be
created in the path specified by the DEFAULT_OUTPUT_DIR key in the
config.yml file (typically defaults to "~/.config/waymore/"). Within that, a
directory will be created for the target domain (or domain with path) passed
with "-i" (or for each line of a file passed with "-i").
- -f, --filter-responses-only:
- The initial links from the Wayback Machine will not be filtered (by MIME
type and response code); only the responses that are downloaded will be
filtered. For example, it may be useful to still see all available paths
from the links even if you don't want to check the content.
- -fc FC:
- Filter HTTP status codes for retrieved URLs and responses. Comma-separated
list of codes (default: the FILTER_CODE values from config.yml). Passing
this argument will override the value from config.yml.
- -mc MC:
- Only match HTTP status codes for retrieved URLs and responses.
Comma-separated list of codes. Passing this argument overrides the config
FILTER_CODE and -fc.
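For example, to keep only results that returned a 200 or 301 status
(illustrative codes, placeholder target):
$ waymore -i example.com -mc 200,301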
- -l <signed integer>, --limit <signed integer>:
- How many responses will be saved (if -mode is R or B). A positive value
will get the first N results, a negative value will get the last N results.
A value of 0 will get ALL responses (default: 5000).
- -from <yyyyMMddhhmmss>, --from-date <yyyyMMddhhmmss>:
- What date to get responses from. If not specified, it will get from the
earliest possible results. A partial value can be passed, e.g. 2016, 201805,
etc.
- -to <yyyyMMddhhmmss>, --to-date <yyyyMMddhhmmss>:
- What date to get responses to. If not specified, it will get to the latest
possible results. A partial value can be passed, e.g. 2016, 201805, etc.
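Combining these (with placeholder values), the following would save only the
LAST 100 responses captured during 2020:
$ waymore -i example.com -l -100 -from 2020 -to 2020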
- -ci {h,d,m,none}, --capture-interval {h,d,m,none}:
- Filters the search on the Wayback Machine (archive.org) to get at most 1
capture per hour (h), day (d) or month (m). This filter is used for
responses only. The default is 'd', but it can also be set to 'none' to not
filter anything and get all responses.
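For example, to download at most one capture per month for a placeholder
target:
$ waymore -i example.com -mode R -ci m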
- -ra REGEX_AFTER, --regex-after REGEX_AFTER:
- RegEx used to filter the links found from all sources AND the responses
downloaded. Only positive matches will be output.
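For example, to only keep links and responses whose URL ends in .js (the
pattern and target are illustrative):
$ waymore -i example.com -ra "\.js$"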
- -url-filename:
- Set the file name of downloaded responses to the URL that generated the
response; otherwise it will be set to the hash value of the response. Using
the hash value means multiple URLs that generated the same response will
only result in one file being saved for that response.
- -xwm:
- Exclude checks for links from Wayback Machine (archive.org)
- -xcc:
- Exclude checks for links from commoncrawl.org
- -xav:
- Exclude checks for links from alienvault.com
- -xus:
- Exclude checks for links from urlscan.io
- -xvt:
- Exclude checks for links from virustotal.com
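These can be combined, e.g. to query only the Wayback Machine and URLScan
for a placeholder target:
$ waymore -i example.com -xcc -xav -xvt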
- -lcc LCC:
- Limit the number of Common Crawl index collections searched, e.g. '-lcc
10' will just search the latest 10 collections (default: 3). As of July
2023 there are 95 collections. Setting this to 0 will search ALL
collections. If you don't want to search Common Crawl at all, use the -xcc
option.
- -lcy LCY:
- Limit the Common Crawl index collections searched by the year of the index
data. The earliest index has data from 2008. Setting this to 0 (default)
will search collections of any year (but still in conjunction with -lcc).
For example, if you are only interested in data from 2015 and after, pass
-lcy 2015. If you don't want to search Common Crawl at all, use the -xcc
option.
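For example (placeholder values), to search only the latest 5 collections
with index data from 2015 onwards:
$ waymore -i example.com -lcc 5 -lcy 2015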
- -t <seconds>, --timeout <seconds>:
- This is for archived responses only! How many seconds to wait for the
server to send data before giving up (default: 30 seconds).
- -p <integer>, --processes <integer>:
- Basic multithreading is done when getting requests for a file of URLs.
This argument determines the number of processes (threads) used (default:
1).
- -r RETRIES, --retries RETRIES:
- The number of retries for requests that get a connection error or are rate
limited (default: 1).
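For example (placeholder values), to download responses with 5 threads, a 60
second timeout and 3 retries:
$ waymore -i example.com -mode R -p 5 -t 60 -r 3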
- -m <integer>, --memory-threshold <integer>:
- The memory threshold percentage. If the machine's memory goes above the
threshold, the program will be stopped and ended gracefully before running
out of memory (default: 95).
- -ko [KEYWORDS_ONLY], --keywords-only [KEYWORDS_ONLY]:
- Only return links and responses that contain keywords that you are
interested in. This can reduce the time it takes to get results. If you
provide the flag with no value, keywords are taken from the comma-separated
list in the "config.yml" file under the "FILTER_KEYWORDS" key; otherwise you
can pass a specific RegEx value to use, e.g. -ko "admin" to only get links
containing the word admin, or -ko "\.js(\?|$)" to only get JS files. The
RegEx check is NOT case sensitive.
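For example, to restrict results to links containing the word admin
(illustrative keyword, placeholder target):
$ waymore -i example.com -ko "admin"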
- -lr LIMIT_REQUESTS, --limit-requests LIMIT_REQUESTS:
- Limit the number of requests that will be made when getting links from a
source (this doesn't apply to Common Crawl). Some targets can require a huge
number of requests that are just not feasible to make, so this can be used
to manage that situation. This defaults to 0 (zero), which means there is no
limit.
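For example, to make at most 100 requests per source (illustrative value):
$ waymore -i example.com -lr 100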
- -ow, --output-overwrite:
- If the URL output file (default waymore.txt) already exists, it will be
overwritten instead of being appended to.
- -nlf, --new-links-file:
- If this argument is passed, a .new file will also be written that will
contain links for the latest run.
- -c CONFIG, --config CONFIG:
- Path to the YML config file. If not passed, it looks for the file
'config.yml' in the same directory as the runtime file 'waymore.py'.
- -wrlr WAYBACK_RATE_LIMIT_RETRY, --wayback-rate-limit-retry
WAYBACK_RATE_LIMIT_RETRY:
- The number of minutes the user wants to wait for a rate limit pause on the
Wayback Machine (archive.org) instead of stopping with a 429 error (default:
3).
- -urlr URLSCAN_RATE_LIMIT_RETRY, --urlscan-rate-limit-retry
URLSCAN_RATE_LIMIT_RETRY:
- The number of minutes the user wants to wait for a rate limit pause on
URLScan.io instead of stopping with a 429 error (default: 1).
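For example (placeholder values), to wait up to 5 minutes on Wayback Machine
rate limits and 2 minutes on URLScan.io rate limits:
$ waymore -i example.com -wrlr 5 -urlr 2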
- -co, --check-only:
- This will make a few minimal requests to show you how many requests, and
roughly how long it could take, to get URLs from the sources and download
responses from the Wayback Machine.
- -nd, --notify-discord:
- Whether to send a notification to Discord when waymore completes. It
requires WEBHOOK_DISCORD to be provided in the config.yml file.
- -v, --verbose:
- Verbose output
- --version:
- Show version number
Common usage:
Just get the URLs from all sources for redbull.com (-mode U is
just for URLs, so no responses are downloaded):
$ waymore -i redbull.com -mode U
The URLs are saved in the same path as config.yml (typically
~/.config/waymore) under results/redbull.com/waymore.txt
Get ALL the URLs from the Wayback Machine for redbull.com (with -f, no
filters are applied to the links, and no URLs are retrieved from Common
Crawl, Alien Vault, URLScan and Virus Total because -xcc, -xav, -xus and
-xvt are passed respectively). Save the FIRST 200 responses that are found,
starting from 2022 (-l 200 -from 2022):
$ waymore -i redbull.com -f -xcc -xav -xus -xvt -l 200 -from 2022
You can pipe waymore to other tools. Any errors are sent to stderr
and any links found are sent to stdout. The output file is still created in
addition to the links being piped to the next program. However, archived
responses are not piped to the next program, but they are still written to
files. For example:
$ waymore -i redbull.com -mode U | unfurl keys | sort -u
You can also pass the input through stdin instead of -i:
$ cat redbull_subs.txt | waymore
Sometimes you may just want to check how many requests will be made, and
roughly how long waymore is likely to take, if you run it for a particular
domain. You can do a quick check by using the -co/--check-only argument. For
example:
$ waymore -i redbull.com --check-only
Aquila Macedo <aquilamacedo@riseup.net>