Quickstart
The primary installation method is with a Python package manager: pip install trafilatura. See the installation documentation.
With Python
The only required argument is the input document (here a downloaded HTML file); all other arguments are optional.
>>> import trafilatura
>>> downloaded = trafilatura.fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')
>>> trafilatura.extract(downloaded)
# outputs main content and comments as plain text ...
>>> result = trafilatura.extract(downloaded, output_format="xml")
>>> print(result)
# formatting preserved in XML structure ...
>>> trafilatura.extract(downloaded, output_format="xml", include_comments=False)
# outputs main content without comments as XML ...
The use of fallback algorithms can also be bypassed in fast mode:
>>> result = trafilatura.extract(downloaded, no_fallback=True)
# shorter alternative to import and use the functions
>>> from trafilatura import fetch_url, extract
>>> extract(fetch_url('...'))
On the command-line
URLs can be used directly (-u/--URL):
$ trafilatura -u "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"
# outputs main content and comments as plain text ...
$ trafilatura -h
# displays help message
You can also pipe an HTML document (or a response body) to trafilatura:
$ cat myfile.html | trafilatura # use the contents of an already existing file
$ < myfile.html trafilatura # same here
For more information, please refer to the usage documentation and tutorials.