Troubleshooting#
Content extraction#
Something is missing#
The extractor uses several fallbacks to make sure enough text is returned. Content extraction is a tradeoff between precision and recall, that is, between keeping unwanted text out and getting all of the wanted text in. Accepting more unwanted text makes it easier to gather more of the relevant text in the output. Here are ways to tackle the issue:
- Opting for favor_recall (Python) or --recall (CLI)
- Changing the minimum acceptable length in the settings
- Using the more basic baseline or html2txt functions instead (which is also faster)

(see also reported issues with The New Yorker)
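The options above can be chained so that stricter settings are tried first and more permissive ones only when too little text comes back. The following is a minimal sketch, not part of Trafilatura itself: `first_sufficient`, `trafilatura_extractors` and the `min_chars` threshold are hypothetical names introduced for illustration, and the threshold is unrelated to the minimum length in Trafilatura's own settings. It assumes that `baseline` returns a tuple whose second element is the text.

```python
from typing import Callable, Iterable, Optional


def first_sufficient(html: str,
                     extractors: Iterable[Callable[[str], Optional[str]]],
                     min_chars: int = 200) -> Optional[str]:
    """Return the first result with at least min_chars characters.

    If no extractor reaches the threshold, return the longest non-empty
    result, or None if every extractor came back empty.
    """
    best = None
    for extractor in extractors:
        text = extractor(html)
        if not text:
            continue
        if len(text) >= min_chars:
            return text
        if best is None or len(text) > len(best):
            best = text
    return best


def trafilatura_extractors():
    """Extractors ordered from most precise to most permissive (assumed order)."""
    import trafilatura  # imported lazily so first_sufficient stays dependency-free
    return [
        lambda html: trafilatura.extract(html),
        lambda html: trafilatura.extract(html, favor_recall=True),
        # baseline is assumed to return (body element, text, text length)
        lambda html: trafilatura.baseline(html)[1],
        lambda html: trafilatura.html2txt(html),
    ]
```

With Trafilatura installed, a call could look like `first_sufficient(html, trafilatura_extractors())`. Ordering the list from strict to permissive keeps precision high for pages that already extract well.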
Beyond raw HTML#
While downloading and processing raw HTML documents is much faster, it can be necessary to fully render the web page before further processing, e.g. because a page makes extensive use of JavaScript or because content is injected from multiple sources.
In such cases the way to go is to use a browser automation library like Playwright. For available alternatives see this list of headless browsers.
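As a rough sketch of this approach, the page can be rendered in a headless browser and the resulting HTML handed to the extractor. The example assumes Playwright's synchronous Python API and an installed Chromium build (`pip install playwright`, then `playwright install chromium`). The helper names `render_with_playwright` and `extract_rendered` are hypothetical, and waiting for `networkidle` is only one possible readiness criterion.

```python
from typing import Callable, Optional


def render_with_playwright(url: str, timeout_ms: int = 30000) -> str:
    """Load the page in headless Chromium and return the rendered HTML."""
    # Imported lazily: Playwright is an optional, separately installed dependency.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Wait until network activity settles so injected content is present.
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()


def extract_rendered(url: str,
                     render: Callable[[str], str] = render_with_playwright,
                     extract: Optional[Callable[[str], Optional[str]]] = None) -> Optional[str]:
    """Render the page first, then run text extraction on the rendered HTML."""
    if extract is None:
        import trafilatura
        extract = trafilatura.extract
    return extract(render(url))
```

Because both steps are passed in as callables, the renderer can be swapped for another headless browser without touching the extraction step.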
Bypassing paywalls#
A browser automation library can also be useful to bypass issues related to cookies and paywalls as it can be combined with a corresponding browser extension, e.g. bypass-paywalls-chrome.
Downloads#
HTTP library#
A possible fix is to use another download utility (see pycurl with Python and wget or curl on the command line):

- Installing the additional download utility pycurl manually or using pip3 install trafilatura[all] can alleviate the problem: another download library is used, leading to different results.
- Several alternatives are available on the command line, e.g. wget -O - "my_url" | trafilatura instead of trafilatura -u "my_url".
Note
Downloads may fail because your IP address or user agent is blocked. Trafilatura’s crawling and download capabilities do not bypass such restrictions.
Web page no longer available on the Internet#
Download issues can be addressed by retrieving the files from elsewhere, for example from existing web archives such as the Internet Archive or CommonCrawl.
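For the Internet Archive, one way to look for an archived copy is the Wayback Machine availability endpoint, which returns JSON describing the closest snapshot. The sketch below uses only the standard library. The function names are hypothetical, and the response layout (`archived_snapshots` → `closest` → `url`) is assumed from the public API and may change.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

WAYBACK_API = "https://archive.org/wayback/available"


def availability_query(url: str) -> str:
    """Build the availability query for the given page URL."""
    return WAYBACK_API + "?" + urlencode({"url": url})


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the URL of the closest available snapshot, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest") or {}
    return closest.get("url") if closest.get("available") else None


def find_snapshot(url: str, timeout: float = 10.0) -> Optional[str]:
    """Ask the Wayback Machine for an archived copy (performs a network request)."""
    with urlopen(availability_query(url), timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

The snapshot URL returned by `find_snapshot` can then be downloaded and processed like any other page.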
Download first and extract later#
Since download and extraction have distinct characteristics, it can be useful to separate the infrastructure needed for each. Using a custom IP or network infrastructure for downloads can also prevent your usual IP from getting banned.
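One way to separate the two stages is to store the raw HTML on disk first and run the extraction later, possibly on another machine. This is a minimal sketch under stated assumptions: the helpers `cache_path`, `download_all` and `extract_all` are hypothetical, file names are derived from a SHA-256 hash of the URL, and the download and extraction functions are passed in as parameters.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, Iterable, Optional


def cache_path(directory: Path, url: str) -> Path:
    """Map a URL to a stable file name inside the cache directory."""
    return directory / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")


def download_all(urls: Iterable[str], directory: Path,
                 fetch: Callable[[str], Optional[str]]) -> int:
    """Stage 1: fetch each URL once and store the raw HTML. Returns files written."""
    directory.mkdir(parents=True, exist_ok=True)
    saved = 0
    for url in urls:
        target = cache_path(directory, url)
        if target.exists():
            continue  # already downloaded in an earlier run
        html = fetch(url)
        if html:
            target.write_text(html, encoding="utf-8")
            saved += 1
    return saved


def extract_all(directory: Path,
                extract: Callable[[str], Optional[str]]) -> Dict[str, Optional[str]]:
    """Stage 2: run extraction over the stored files, without network access."""
    return {path.name: extract(path.read_text(encoding="utf-8"))
            for path in sorted(directory.glob("*.html"))}
```

With Trafilatura installed, `download_all(urls, Path("pages"), trafilatura.fetch_url)` followed by `extract_all(Path("pages"), trafilatura.extract)` would be one possible pairing. Keeping the raw files also allows re-running the extraction with different settings without downloading again.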