Installation
============


Python
------

Trafilatura runs using Python, currently one of the most frequently used programming languages.

This software library/package is tested on Linux, macOS and Windows systems. It is compatible with all recent versions of Python. The following guides cover the installation of Python itself:

-  Installing Python 3 on Mac OS X (& official documentation for Mac)
-  Installing Python 3 on Windows (& official documentation for Windows)
-  Installing Python 3 on Linux (& official documentation for Unix)
-  Beginners guide: downloading Python


You then need a version of Python to interact with, as well as the Python packages needed for the task. A recent version of Python 3 is necessary. Some systems already have such an environment installed. To check, run the following command in a terminal window:

.. code-block:: bash

    $ python3 --version
    Python 3.8.6 # version 3.6 or higher is fine

In case Python is not installed, please refer to the excellent Djangogirls tutorial: Python installation.


Trafilatura package
-------------------

Trafilatura is packaged as a software library available from the package repository PyPI. As such, it can be installed with ``pip`` or ``pipenv``.


Installing Python packages
~~~~~~~~~~~~~~~~~~~~~~~~~~

-  Straightforward: Installing packages in python using pip (& official documentation)
-  Using pip on Windows
-  Advanced: Pipenv & Virtual Environments


Basics
~~~~~~

For an introduction to command-line usage, please refer to the corresponding section of the documentation.

.. code-block:: bash

    $ pip install trafilatura # pip3 where applicable

This project is under active development. Please keep it up-to-date to benefit from the latest improvements:

.. code-block:: bash

    # to make sure you have the latest version
    $ pip install -U trafilatura
    # latest available code base
    $ pip install --force-reinstall -U git+https://github.com/adbar/trafilatura

On **macOS** it can be necessary to install certificates by hand if you get errors like ``[SSL: CERTIFICATE_VERIFY_FAILED]`` while downloading webpages: execute ``pip install certifi`` and perform the post-installation step by clicking on ``/Applications/Python 3.X/Install Certificates.command``. For more information, see the help page on SSL errors.


Older Python versions
~~~~~~~~~~~~~~~~~~~~~

-  Last version for Python 3.5: ``pip install trafilatura==0.9.3``
-  Last version for Python 3.4: ``pip install trafilatura==0.8.2``


Command-line tool
~~~~~~~~~~~~~~~~~

If you installed the library successfully but cannot start the command-line tool, try adding the user-level ``bin`` directory to your ``PATH`` environment variable. If you are using a Unix derivative (e.g. Linux, OS X), you can achieve this by running the following command: ``export PATH="$HOME/.local/bin:$PATH"``.

For local or user installations where trafilatura cannot be used from the command-line, please refer to the official Python documentation and to the page on finding executables from the command-line.


Additional functionality
------------------------

Optional modules
~~~~~~~~~~~~~~~~

A few additional libraries can be installed for extended functionality and faster processing, namely language detection and faster encoding detection. The ``cchardet`` package may not work on all systems but it is highly recommended.

.. code-block:: bash

    $ pip install cchardet  # single package only
    $ pip install trafilatura[all]  # all additional functionality

*For information on dependency management of Python packages, see this discussion thread.*

.. hint::
    Everything works even if not all packages are installed (e.g. because installation fails).

    You can also install or update relevant packages separately. *trafilatura* will detect which ones are present on your system and opt for the best available combination.


cchardet
    Faster encoding detection, also possibly more accurate (especially for encodings used in Asia)
htmldate[all]
    Faster and more precise date extraction with a series of dedicated packages
py3langid
    Language detection on extracted main text
pycurl
    Faster downloads, possibly less robust though
urllib3[brotli]
    Potentially faster file downloads (not essential)


Graphical user interface
------------------------

.. toctree::
   :maxdepth: 2

   installation-gui
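

Checking the installation
-------------------------

The version requirement and the optional modules described on this page can be checked from Python itself. The following is a minimal sketch rather than part of trafilatura: the helper names ``check_python`` and ``find_installed`` are hypothetical, and the import names in ``OPTIONAL_MODULES`` are assumed to match the packages listed under *Optional modules* (``urllib3[brotli]`` is assumed to provide a ``brotli`` module).

.. code-block:: python

    import sys
    from importlib.util import find_spec

    # Assumed import names for the optional packages; they may differ
    # from the names used on PyPI.
    OPTIONAL_MODULES = ["cchardet", "htmldate", "py3langid", "pycurl", "brotli"]


    def check_python(minimum=(3, 6)):
        """Return True if the running interpreter is at least `minimum`."""
        return sys.version_info[:2] >= tuple(minimum)


    def find_installed(modules):
        """Map each top-level module name to True if it can be located."""
        return {name: find_spec(name) is not None for name in modules}


    if __name__ == "__main__":
        print("Python version OK: {}".format(check_python()))
        status = find_installed(["trafilatura"] + OPTIONAL_MODULES)
        for name, present in status.items():
            print("{:<12} {}".format(name, "installed" if present else "missing"))

A module reported as missing is not an error in itself: as noted above, the optional packages only add speed or functionality.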