Quickstart#
Trafilatura simplifies the process of turning raw HTML into structured, meaningful data. Getting started with it is straightforward. This page walks through its main functions.
The primary installation method is with a Python package manager: pip install trafilatura. For more details, see the installation documentation.
With Python#
The only required argument is the input document (here a downloaded HTML file); all other arguments are optional.
# import the necessary functions
>>> from trafilatura import fetch_url, extract
# grab an HTML file to extract data from
>>> downloaded = fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')
# output main content and comments as plain text
>>> result = extract(downloaded)
# change the output format to XML (allowing for preservation of document structure)
>>> result = extract(downloaded, output_format="xml")
# discard potential comments and change the output to JSON
>>> extract(downloaded, output_format="json", include_comments=False)
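Both fetch_url and extract return None when they fail (no download, or no content found), and the JSON output is returned as a string. The following is a minimal sketch of how these cases might be handled. The helper names parse_json_output and fetch_and_extract are hypothetical and not part of Trafilatura.

```python
import json


def parse_json_output(result):
    """Turn the string returned by extract(..., output_format="json") into a dict.

    Returns None when extraction produced no output.
    """
    if result is None:
        return None
    return json.loads(result)


def fetch_and_extract(url):
    """Download a page and return its extracted content as a dict, or None on failure."""
    # imported here so that parse_json_output stays usable on stored results
    from trafilatura import fetch_url, extract

    downloaded = fetch_url(url)
    if downloaded is None:  # the download failed
        return None
    result = extract(downloaded, output_format="json", include_comments=False)
    return parse_json_output(result)
```

A call such as fetch_and_extract('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/') should then give a dictionary or None. The available keys depend on the Trafilatura version, so reading them with .get() (for example doc.get("title")) is the safer choice.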
The use of fallback algorithms can also be bypassed in fast mode:
# faster mode without backup extraction
>>> result = extract(downloaded, no_fallback=True)
For a full list of options see Python usage.
The extraction targets the main text part of a webpage. To extract all text content in a html2txt manner, use this function:
>>> from trafilatura import html2txt
>>> html2txt(downloaded)
On the command-line#
URLs can be used directly (-u/--URL):
# outputs main content and comments as plain text
$ trafilatura -u "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"
# displays help message with all possible options
$ trafilatura -h
You can also pipe an HTML document (or a response body) to trafilatura:
$ cat myfile.html | trafilatura # use the contents of an already existing file
$ < myfile.html trafilatura # same here
Extraction options are also available on the command line and can be combined:
$ < myfile.html trafilatura --json --no-tables
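The same piping can be driven from a Python script through the standard subprocess module, which can be convenient when the command-line tool is installed but the calling code should stay decoupled from the library. This is only a sketch: build_command and extract_with_cli are hypothetical helper names, and it assumes the trafilatura executable is on the PATH.

```python
import subprocess


def build_command(*options):
    """Build the argument list for a trafilatura call that reads HTML from stdin."""
    return ["trafilatura", *options]


def extract_with_cli(html, *options):
    """Pipe an HTML string to the trafilatura command and return its standard output."""
    completed = subprocess.run(
        build_command(*options),
        input=html,           # sent to the process on stdin, like `cat myfile.html |`
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # exchange str rather than bytes
        check=True,           # raise CalledProcessError on a non-zero exit status
    )
    return completed.stdout
```

For example, extract_with_cli(html, "--json", "--no-tables") corresponds to the combined options shown above, with html holding the contents of myfile.html.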
Further steps#
For more information, please refer to the usage documentation and tutorials.
Hint
Explore Trafilatura’s features interactively with this Python Notebook: Trafilatura overview