On the command-line#

Introduction#

Trafilatura includes a command-line interface and can be conveniently used without writing code.

For the very first steps, please refer to the multilingual, step-by-step Introduction to the command-line interface and the corresponding section of the Introduction to Cultural Analytics & Python.


Quickstart#

URLs can be used directly (-u/--URL):

$ trafilatura -u "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"
# outputs main content and comments as plain text ...
$ trafilatura --xml --URL "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"
# outputs main text with basic XML structure ...
$ trafilatura -h
# displays help message

You can also pipe an HTML document (or response body) to trafilatura:

# use the contents of an already existing file
$ cat myfile.html | trafilatura
# alternative syntax
$ < myfile.html trafilatura
# use a custom download utility and pipe it to trafilatura
$ wget -qO- "https://de.creativecommons.org/index.php/was-ist-cc/" | trafilatura

Extraction parameters#

Choice of HTML elements#

Several elements can be included or discarded (see list of options below):

  • Text elements: comments, tables

  • Structural elements: formatting, images, links

Comments and text contained in HTML <table> elements are extracted by default; the --no-comments and --no-tables options deactivate this behavior.

Further options:

  • --formatting: Keep structural elements related to formatting (<b>/<strong>, <i>/<em>, etc.)

  • --links: Keep link targets (in href="...")

  • --images: Keep track of images along with their targets (<img> attributes: alt, src, title)

Note

Certain elements are only visible in the output if the chosen format allows it (e.g. images and XML).

Including extra elements works best with conversion to XML/XML-TEI.
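For instance, the flags can be combined with XML output so that the extra elements remain visible. The following sketch uses a small hypothetical test page instead of a live URL:

```shell
# create a small test page (hypothetical content, for illustration)
cat > sample.html <<'EOF'
<html><body><article>
<h1>Title</h1>
<p>Some <b>bold</b> text with a <a href="https://example.org/">link</a>.</p>
</article></body></html>
EOF
# keep formatting and link targets; XML output makes them visible
< sample.html trafilatura --xml --formatting --links
```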

Output format#

Output as TXT without metadata is the default; another format can be selected in two different ways:

  • --csv, --json, --xml or --xmltei

  • -out or --output-format {txt,csv,json,xml,xmltei}

Hint

Combining TXT, CSV and JSON formats with certain structural elements (e.g. formatting or links) triggers output in TXT+Markdown format.
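The Markdown trigger can be seen with a minimal local file (hypothetical content; no network access needed):

```shell
# a minimal page to demonstrate the Markdown trigger (hypothetical content)
printf '<html><body><article><h1>Title</h1><p><b>Bold</b> text.</p></article></body></html>' > page.html
# default TXT output combined with --formatting yields TXT+Markdown
< page.html trafilatura --formatting
```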

Process files locally#

If web pages have already been downloaded and stored, single files or whole directories can be processed.

Two major command line arguments are necessary here:

  • --inputdir to select a directory to read files from

  • -o or --outputdir to define a directory in which to store the results

Note

In case no directory is selected, results are printed to standard output (STDOUT, e.g. in the terminal window).
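A minimal round trip could look as follows (directory and file names are hypothetical):

```shell
# gather previously downloaded pages in a directory (hypothetical file names)
mkdir -p downloaded output
printf '<html><body><p>Stored page.</p></body></html>' > downloaded/page1.html
# process the whole directory; results are written to the output directory
trafilatura --inputdir downloaded -o output
```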

Configuration#

Text extraction can be parametrized by providing a custom configuration file (a variant of settings.cfg) with the --config-file option, which overrides the standard settings.
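A custom file only needs to override the values you want to change. The exact key names depend on the Trafilatura version, so check the settings.cfg shipped with your installation; a sketch:

```ini
[DEFAULT]
# discard extracted documents shorter than this many characters
MIN_EXTRACTED_SIZE = 250
# seconds allotted to the extraction of a single document
EXTRACTION_TIMEOUT = 30
```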

Further information#

For all usage instructions see trafilatura -h:

trafilatura [-h] [-i INPUTFILE | --inputdir INPUTDIR | -u URL]
               [--parallel PARALLEL] [-b BLACKLIST] [--list]
               [-o OUTPUTDIR] [--backup-dir BACKUP_DIR] [--keep-dirs]
               [--hash-as-name] [--feed [FEED] | --sitemap [SITEMAP] |
               --crawl [CRAWL] | --explore [EXPLORE]] [--archived]
               [--url-filter URL_FILTER [URL_FILTER ...]] [-f]
               [--formatting] [--links] [--images] [--no-comments]
               [--no-tables] [--only-with-metadata]
               [--target-language TARGET_LANGUAGE] [--deduplicate]
               [--config-file CONFIG_FILE]
               [-out {txt,csv,json,xml,xmltei} | --csv | --json | --xml | --xmltei]
               [--validate-tei] [-v]

Command-line interface for Trafilatura

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         increase logging verbosity (-v or -vv)

Input:
  URLs, files or directories to process

  -i INPUTFILE, --inputfile INPUTFILE
                        name of input file for batch processing
  --inputdir INPUTDIR   read files from a specified directory (relative path)
  -u URL, --URL URL     custom URL download
  --parallel PARALLEL   specify a number of cores/threads for downloads
                        and/or processing
  -b BLACKLIST, --blacklist BLACKLIST
                        file containing unwanted URLs to discard during
                        processing

Output:
  Determines if and how files will be written

  --list                display a list of URLs without downloading them
  -o OUTPUTDIR, --outputdir OUTPUTDIR
                        write results in a specified directory (relative path)
  --backup-dir BACKUP_DIR
                        preserve a copy of downloaded files in a backup
                        directory
  --keep-dirs           keep input directory structure and file names
  --hash-as-name        use hash value as output file name instead of random
                        default

Navigation:
  Link discovery and web crawling

  --feed URL            look for feeds and/or pass a feed URL as input
  --sitemap URL         look for sitemaps for the given website and/or enter
                        a sitemap URL
  --crawl URL           crawl a fixed number of pages within a website
                        starting from the given URL
  --explore URL         explore the given websites (combination of sitemap
                        and crawl)
  --archived            try to fetch URLs from the Internet Archive if
                        downloads fail
  --url-filter URL_FILTER
                        only process/output URLs containing these patterns
                        (space-separated strings)

Extraction:
  Customization of text and metadata processing

  -f, --fast            fast (without fallback detection)
  --formatting          include text formatting (bold, italic, etc.)
  --links               include links along with their targets (experimental)
  --images              include image sources in output (experimental)
  --no-comments         don't output any comments
  --no-tables           don't output any table elements
  --only-with-metadata  only output those documents with title, URL and date
                        (for formats supporting metadata)
  --target-language TARGET_LANGUAGE
                        select a target language (ISO 639-1 codes)
  --deduplicate         filter out duplicate documents and sections
  --config-file CONFIG_FILE
                        override standard extraction parameters with a custom
                        config file
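The extraction options can be applied to batch processing with an input file. In this sketch the URL list is hypothetical, and downloading it requires network access:

```shell
# a file with one URL per line (hypothetical addresses)
printf '%s\n' "https://example.org/a" "https://example.org/b" > list.txt
# keep only English documents and filter out duplicate sections
trafilatura -i list.txt --target-language en --deduplicate -o output
```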

Format:
  Selection of the output format

  -out {txt,csv,json,xml,xmltei}, --output-format {txt,csv,json,xml,xmltei}
                        determine output format
  --csv                 CSV output
  --json                JSON output
  --xml                 XML output
  --xmltei              XML TEI output
  --validate-tei        validate XML TEI output