On the command-line#
Introduction#
Trafilatura includes a command-line interface and can be conveniently used without writing code.
For the very first steps, please refer to this multilingual, step-by-step Introduction to the command-line interface and this section of the Introduction to Cultural Analytics & Python.
For instructions related to specific platforms see:
Command Prompt (tutorial for Windows systems)
As well as these compendia:
Introduction to the Bash Command Line (The Programming Historian)
Basic Bash Command Line Tips You Should Know (freeCodeCamp)
Quickstart#
URLs can be used directly (-u/--URL):
$ trafilatura -u "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"
# outputs main content and comments as plain text ...
$ trafilatura --xml --URL "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"
# outputs main text with basic XML structure ...
$ trafilatura -h
# displays help message
You can also pipe an HTML document (or a response body) to trafilatura:
# use the contents of an already existing file
$ cat myfile.html | trafilatura
# alternative syntax
$ < myfile.html trafilatura
# use a custom download utility and pipe it to trafilatura
$ wget -qO- "https://de.creativecommons.org/index.php/was-ist-cc/" | trafilatura
Extraction parameters#
Choice of HTML elements#
Several elements can be included or discarded (see list of options below):
Text elements: comments, tables
Structural elements: formatting, images, links
Comments and text from HTML <table> elements are extracted by default; the --no-comments and --no-tables options deactivate this behavior.
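For instance, to extract the main text only and skip both comments and tables (reusing the sample URL from the quickstart):
$ trafilatura --no-comments --no-tables -u "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"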
Further options:
--formatting: keep structural elements related to formatting (<b>/<strong>, <i>/<emph>, etc.)
--links: keep link targets (in href="...")
--images: keep track of images along with their targets (<img> attributes: alt, src, title)
Note
Certain elements are only visible in the output if the chosen format allows it (e.g. images and XML).
Including extra elements works best with conversion to XML/XML-TEI. If the output is buggy, removing a constraint (e.g. formatting) can greatly improve the result.
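For example, combining XML output with the extra elements listed above:
$ trafilatura --xml --formatting --links --images -u "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"
# outputs XML including formatting, link targets and image references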
Output format#
Output as TXT without metadata is the default; another format can be selected in two different ways:
--csv, --json, --xml or --xmltei
-out or --output-format {txt,csv,json,xml,xmltei}
Hint
Combining TXT, CSV and JSON formats with certain structural elements (e.g. formatting or links) triggers output in TXT+Markdown format.
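Both ways are equivalent; for example, to obtain JSON output:
$ trafilatura --json -u "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"
$ trafilatura -out json -u "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"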
Optimizing for precision and recall#
The arguments --precision & --recall can be passed to the extractor.
They slightly affect processing and the volume of textual output: --precision stands for more selective extraction (higher accuracy, yielding fewer but more central elements), while --recall stands for more opportunistic extraction (taking more elements into account).
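For instance, with the sample URL from the quickstart:
$ trafilatura --precision -u "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"
# more selective extraction
$ trafilatura --recall -u "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"
# more opportunistic extraction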
Language identification#
Passing the argument --target-language along with a 2-letter code (ISO 639-1) will trigger language filtering of the output if the identification component has been installed and if the target language is available.
Note
Additional components are required: pip install trafilatura[all]
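For example, to filter the output for German text (reusing the URL from the quickstart):
$ trafilatura --target-language "de" -u "https://de.creativecommons.org/index.php/was-ist-cc/"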
Process files locally#
In case web pages have already been downloaded and stored, it’s possible to process single files or directories as a whole.
Two major command line arguments are necessary here:
--inputdir to select a directory to read files from
-o or --outputdir to define a directory to eventually store the results
Note
In case no directory is selected, results are printed to standard output (STDOUT, e.g. in the terminal window).
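For instance, to process a whole directory of stored pages (directory names as in the other examples on this page):
$ trafilatura --inputdir html-sources/ -o txtfiles/
# processes all files in html-sources/ and writes the results to txtfiles/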
Process a list of links#
Note
Beware of a tacit scraping etiquette: a server may block you after the download of a certain number of pages from the same website/domain in a short period of time.
In addition, some websites may block the user-agent of the requests library. Thus, trafilatura waits a few seconds between requests by default.
Two major command line arguments are necessary here:
-i or --inputfile to select an input list to read links from. This option allows for bulk download and processing of a list of URLs from a file listing one link per line. The input list will be read sequentially; only lines beginning with a valid URL will be read, so the file can also contain other information, which will be discarded.
-o or --outputdir to define a directory to eventually store the results. The output directory can be created on demand, but it must be writable.
$ trafilatura -i list.txt -o txtfiles/ # output as raw text
$ trafilatura --xml -i list.txt -o xmlfiles/ # output in XML format
Hint
Backup of HTML sources can be useful for archival and further processing:
$ trafilatura --inputfile links.txt --outputdir converted/ --backup-dir html-sources/ --xml
Internet Archive#
Using the option --archived will trigger queries to the Internet Archive for web pages which could not be downloaded.
There is a fair chance to find archived versions of pages from larger websites, whereas pages from lesser-known websites may not have been preserved there. The retrieval process is slow as it depends on a single web portal, so it is best performed for a relatively small number of URLs.
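For example, combined with a list of links (file names as in the examples above):
$ trafilatura --archived -i list.txt -o txtfiles/
# falls back on the Internet Archive for URLs that could not be downloaded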
Link discovery#
Link discovery can be performed over web feeds (Atom and RSS) or sitemaps.
Both homepages and particular sitemaps or feed URLs can be used as input.
The --list option is useful to list URLs prior to processing. This option can be combined with an input file (-i) containing a list of sources which will then be processed in parallel.
For more information please refer to the tutorial on content discovery.
Feeds#
# looking for feeds
$ trafilatura --feed "https://www.dwds.de/" --list
# already known feed
$ trafilatura --feed "https://www.dwds.de/api/feed/themenglossar/Corona" --list
# processing a list in parallel
$ trafilatura -i mylist.txt --feed --list
YouTube tutorial: Extracting links from web feeds
Sitemaps#
# run link discovery through a sitemap for sitemaps.org and store the resulting links in a file
$ trafilatura --sitemap "https://www.sitemaps.org/" --list > mylinks.txt
# using an already known sitemap URL
$ trafilatura --sitemap "https://www.sitemaps.org/sitemap.xml" --list
# targeting webpages in German
$ trafilatura --sitemap "https://www.sitemaps.org/" --list --target-language "de"
For more information on sitemap use and filters for lists of links see this blog post: Using sitemaps to crawl websites.
YouTube tutorial: Listing all website contents with sitemaps
URL inspection prior to download and processing#
# keep only URLs below a given subpart of the site
$ trafilatura --sitemap "https://www.sitemaps.org/" --list --url-filter "https://www.sitemaps.org/de"
# keep only URLs containing a given string
$ trafilatura --sitemap "https://www.sitemaps.org/" --list --url-filter "protocol"
Using a subpart of the site also acts like a filter, for example --sitemap "https://www.sitemaps.org/de/".
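For instance:
$ trafilatura --sitemap "https://www.sitemaps.org/de/" --list
# only lists pages found below the /de/ subpart of the site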
For more information on sitemap use and filters for lists of links see this blog post: Using sitemaps to crawl websites and this tutorial on link filtering.
Configuration#
Text extraction can be parametrized by providing a custom configuration file (that is, a variant of settings.cfg) with the --config-file option, which overrides the standard settings. Useful adjustments include download parameters, minimal extraction length, or de-duplication settings.
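A minimal sketch of a corresponding call, assuming a custom file mysettings.cfg adapted from the settings.cfg shipped with the package (the file name here is hypothetical):
$ trafilatura --config-file mysettings.cfg -u "https://www.sitemaps.org/"
# extraction now follows the parameters defined in mysettings.cfg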
Further information#
For all usage instructions see trafilatura -h
:
trafilatura [-h] [-i INPUTFILE | --inputdir INPUTDIR | -u URL]
[--parallel PARALLEL] [-b BLACKLIST] [--list]
[-o OUTPUTDIR] [--backup-dir BACKUP_DIR] [--keep-dirs]
[--hash-as-name] [--feed [FEED] | --sitemap [SITEMAP] |
--crawl [CRAWL] | --explore [EXPLORE]] [--archived]
[--url-filter URL_FILTER [URL_FILTER ...]] [-f]
[--formatting] [--links] [--images] [--no-comments]
[--no-tables] [--only-with-metadata]
[--target-language TARGET_LANGUAGE] [--deduplicate]
[--config-file CONFIG_FILE]
[-out {txt,csv,json,xml,xmltei} | --csv | --json | --xml | --xmltei]
[--validate-tei] [-v] [--version]
Command-line interface for Trafilatura

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         increase logging verbosity (-v or -vv)
  --version             show version information and exit

Input:
  URLs, files or directories to process

  -i INPUTFILE, --inputfile INPUTFILE
                        name of input file for batch processing
  --inputdir INPUTDIR   read files from a specified directory (relative path)
  -u URL, --URL URL     custom URL download
  --parallel PARALLEL   specify a number of cores/threads for downloads and/or processing
  -b BLACKLIST, --blacklist BLACKLIST
                        file containing unwanted URLs to discard during processing

Output:
  Determines if and how files will be written

  --list                display a list of URLs without downloading them
  -o OUTPUTDIR, --outputdir OUTPUTDIR
                        write results in a specified directory (relative path)
  --backup-dir BACKUP_DIR
                        preserve a copy of downloaded files in a backup directory
  --keep-dirs           keep input directory structure and file names
  --hash-as-name        use hash value as output file name instead of random default

Navigation:
  Link discovery and web crawling

  --feed [FEED]         look for feeds and/or pass a feed URL as input
  --sitemap [SITEMAP]   look for sitemaps for the given website and/or enter a sitemap URL
  --crawl [CRAWL]       crawl a fixed number of pages within a website starting from the given URL
  --explore [EXPLORE]   explore the given websites (combination of sitemap and crawl)
  --archived            try to fetch URLs from the Internet Archive if downloads fail
  --url-filter URL_FILTER [URL_FILTER ...]
                        only process/output URLs containing these patterns (space-separated strings)

Extraction:
  Customization of text and metadata processing

  -f, --fast            fast (without fallback detection)
  --formatting          include text formatting (bold, italic, etc.)
  --links               include links along with their targets (experimental)
  --images              include image sources in output (experimental)
  --no-comments         don't output any comments
  --no-tables           don't output any table elements
  --only-with-metadata  only output those documents with title, URL and date (for formats supporting metadata)
  --target-language TARGET_LANGUAGE
                        select a target language (ISO 639-1 codes)
  --deduplicate         filter out duplicate documents and sections
  --config-file CONFIG_FILE
                        override standard extraction parameters with a custom config file

Format:
  Selection of the output format

  -out {txt,csv,json,xml,xmltei}, --output-format {txt,csv,json,xml,xmltei}
                        determine output format, possible choices: txt, csv, json, xml, xmltei
  --csv                 CSV output
  --json                JSON output
  --xml                 XML output
  --xmltei              XML TEI output
  --validate-tei        validate XML TEI output