Installation
============

Python
------

Trafilatura runs using `Python `_, currently one of the most frequently used programming languages.

This software library/package is tested on Linux, macOS and Windows systems. It is compatible with Python 3 (3.5 upwards):

- `Installing Python 3 on Mac OS X `_ (& `official documentation for Mac `_)
- `Installing Python 3 on Windows `_ (& `official documentation for Windows `_)
- `Installing Python 3 on Linux `_ (& `official documentation for Unix `_)
- Beginners guide: `downloading Python `_

You then need a version of Python to interact with, as well as the Python packages needed for the task. A recent version of Python 3 is necessary. Some systems already have such an environment installed; to check, run the following command in a terminal window:

.. code-block:: bash

    $ python3 --version
    Python 3.8.6 # version 3.6 or higher is fine

If Python is not installed, please refer to the excellent `Djangogirls tutorial: Python installation `_.


Trafilatura package
-------------------

Trafilatura is packaged as a software library available from the package repository `PyPI `_. As such, it can be installed with tools such as ``pip`` or ``pipenv``.


Installing Python packages
~~~~~~~~~~~~~~~~~~~~~~~~~~

- Straightforward: `Installing packages in python using pip `_ (& `official documentation `_)
- `Using pip on Windows `_
- Advanced: `Pipenv & Virtual Environments `_


Basics
~~~~~~

Please refer to `this section `_ for an introduction to command-line usage.

.. code-block:: bash

    $ pip install trafilatura # pip3 where applicable

This project is under active development. Please keep it up-to-date to benefit from the latest improvements:

.. code-block:: bash

    # to make sure you have the latest version
    $ pip install -U trafilatura
    # latest available code base
    $ pip install -U git+https://github.com/adbar/trafilatura.git

On **Mac OS** it can be necessary to install certificates by hand if you get errors like ``[SSL: CERTIFICATE_VERIFY_FAILED]`` while downloading webpages: execute ``pip install certifi`` and perform the post-installation step by clicking on ``/Applications/Python 3.X/Install Certificates.command``. For more information see this `help page on SSL errors `_.


Command-line tool
~~~~~~~~~~~~~~~~~

If you installed the library successfully but cannot start the command-line tool, try adding the user-level ``bin`` directory to your ``PATH`` environment variable. If you are using a Unix derivative (e.g. Linux, OS X), you can achieve this by running the following command: ``export PATH="$HOME/.local/bin:$PATH"``.

For local or user installations where trafilatura cannot be used from the command-line, please refer to `the official Python documentation `_ and this page on `finding executables from the command-line `_.


Graphical user interface
~~~~~~~~~~~~~~~~~~~~~~~~

See `this link `_ for installation instructions.


Additional functionality
------------------------

A few additional libraries can be installed for extended functionality and faster processing, namely language detection and faster encoding detection. The ``cchardet`` package may not work on all systems, but it is highly recommended.

.. code-block:: bash

    $ pip install cchardet # speed-up only
    $ pip install trafilatura[all] # all additional functionality

For improved date extraction you can use ``pip install htmldate[speed]``.

You can also install or update relevant packages separately; *trafilatura* will detect which ones are present on your system and opt for the best available combination.

*For information on dependency management of Python packages, see* `this discussion thread `_
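
To see which of the optional packages mentioned above are importable in the current environment, a small check along the following lines may help. This is only a sketch for users and not a description of how *trafilatura* performs its own detection: the helper name ``available_modules`` is hypothetical, and the module names are assumed to match the package names ``cchardet`` and ``htmldate`` given in this section.

```python
from importlib.util import find_spec

# Assumption: the import names match the package names mentioned in this section.
OPTIONAL_MODULES = ("cchardet", "htmldate")


def available_modules(names=OPTIONAL_MODULES):
    """Map each top-level module name to True if it can be found for import.

    Hypothetical helper for illustration; it only looks the module up
    and does not actually import it.
    """
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, present in available_modules().items():
        status = "installed" if present else "missing"
        print(f"{name}: {status}")
```

Running the script with the same interpreter used for ``pip`` (for example ``python3``) avoids checking a different environment than the one the packages were installed into.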