Core functions

Extraction

trafilatura.extract(filecontent, url=None, record_id=None, no_fallback=False, favor_precision=False, favor_recall=False, include_comments=True, output_format='txt', tei_validation=False, target_language=None, include_tables=True, include_images=False, include_formatting=False, include_links=False, deduplicate=False, date_extraction_params=None, only_with_metadata=False, with_metadata=False, max_tree_size=None, url_blacklist=None, author_blacklist=None, settingsfile=None, config=<configparser.ConfigParser object>)

Main function exposed by the package: a wrapper for text extraction and conversion to the chosen output format.

Parameters

filecontent – HTML code as string.
url – URL of the webpage.
record_id – Add an ID to the metadata.
no_fallback – Skip the backup extraction with readability-lxml and justext.
favor_precision – Prefer less text but correct extraction (weak effect).
favor_recall – When unsure, prefer more text (experimental).
include_comments – Extract comments along with the main text.
output_format – Define an output format: ‘txt’, ‘csv’, ‘json’, ‘xml’, or ‘xmltei’.
tei_validation – Validate the XML-TEI output with respect to the TEI standard.
target_language – Define a language to discard invalid documents (ISO 639-1 format).
include_tables – Take into account information within the HTML <table> element.
include_images – Take images into account (experimental).
include_formatting – Keep structural elements related to formatting (only valuable if output_format is set to XML).
include_links – Keep links along with their targets (experimental).
deduplicate – Remove duplicate segments and documents.
date_extraction_params – Provide extraction parameters to htmldate as dict().
only_with_metadata – Only keep documents featuring all essential metadata (date, title, url).
with_metadata – Similar to only_with_metadata (will be deprecated).
max_tree_size – Discard documents with too many elements.
url_blacklist – Provide a blacklist of URLs as set() to filter out documents.
author_blacklist – Provide a blacklist of author names as set() to filter out authors.
settingsfile – Use a configuration file to override the standard settings.
config – Directly provide a configparser configuration.

Returns

A string in the desired format or None.
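As a minimal sketch of the call described above, the following requests JSON output and turns the returned string into a dictionary. The helper names `parse_result` and `extract_json` are hypothetical, and the field names found in the JSON depend on the installed version, so none are assumed here.

```python
import json
from typing import Optional


def parse_result(result: Optional[str]) -> dict:
    """Turn the JSON string returned by extract() into a dict.

    extract() returns None when nothing usable was found, so that case
    is mapped to an empty dict instead of raising an error.
    """
    return json.loads(result) if result else {}


def extract_json(html: str, url: Optional[str] = None) -> dict:
    """Hypothetical wrapper: run extract() with JSON output, comments excluded."""
    # Imported lazily so parse_result() stays usable without the package.
    from trafilatura import extract

    result = extract(html, url=url, output_format="json", include_comments=False)
    return parse_result(result)
```

Mapping None to an empty dict is only one possible convention; callers that need to tell "no content" apart from "empty document" may prefer to keep the None.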

trafilatura.bare_extraction(filecontent, url=None, no_fallback=False, favor_precision=False, favor_recall=False, include_comments=True, output_format='python', target_language=None, include_tables=True, include_images=False, include_formatting=False, include_links=False, deduplicate=False, date_extraction_params=None, only_with_metadata=False, with_metadata=False, max_tree_size=None, url_blacklist=None, author_blacklist=None, as_dict=True, config=<configparser.ConfigParser object>)

Internal function for text extraction returning bare Python variables.

Parameters

filecontent – HTML code as string.
url – URL of the webpage.
no_fallback – Skip the backup extraction with readability-lxml and justext.
favor_precision – Prefer less text but correct extraction (weak effect).
favor_recall – Prefer more text even when unsure (experimental).
include_comments – Extract comments along with the main text.
output_format – Define an output format, Python being the default and the interest of this internal function. Other values: ‘txt’, ‘csv’, ‘json’, ‘xml’, or ‘xmltei’.
target_language – Define a language to discard invalid documents (ISO 639-1 format).
include_tables – Take into account information within the HTML <table> element.
include_images – Take images into account (experimental).
include_formatting – Keep structural elements related to formatting (present in XML format, converted to markdown otherwise).
include_links – Keep links along with their targets (experimental).
deduplicate – Remove duplicate segments and documents.
date_extraction_params – Provide extraction parameters to htmldate as dict().
only_with_metadata – Only keep documents featuring all essential metadata (date, title, url).
with_metadata – Similar to only_with_metadata (will be deprecated).
max_tree_size – Discard documents with too many elements.
url_blacklist – Provide a blacklist of URLs as set() to filter out documents.
author_blacklist – Provide a blacklist of author names as set() to filter out authors.
as_dict – Legacy option, return a dictionary instead of a class with attributes.
config – Directly provide a configparser configuration.

Returns

A Python dict() containing all the extracted information or None.

Raises

ValueError – Extraction problem.
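The sketch below shows one way to consume the result of bare_extraction, covering both the None return value and the documented ValueError. It assumes a dictionary is returned (the as_dict=True default in the signature above); the keys "title" and "text" and the helper names `summarize` and `extract_summary` are assumptions made for illustration.

```python
from typing import Optional


def summarize(doc: Optional[dict]) -> str:
    """Build a one-line summary from the dict returned by bare_extraction()."""
    if doc is None:  # nothing could be extracted
        return "no content"
    # "title" and "text" are assumed key names; .get() avoids a KeyError
    # if the installed version uses different ones.
    title = doc.get("title") or "untitled"
    text = doc.get("text") or ""
    return f"{title}: {len(text)} characters"


def extract_summary(html: str) -> str:
    """Hypothetical wrapper around bare_extraction() with error handling."""
    # Imported lazily so summarize() stays usable without the package.
    from trafilatura import bare_extraction

    try:
        doc = bare_extraction(html, include_comments=False)
    except ValueError:  # documented above as "Extraction problem"
        doc = None
    return summarize(doc)
```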

trafilatura.baseline(filecontent)

Use baseline extraction function targeting text paragraphs and/or JSON metadata.

Parameters: filecontent – HTML code as binary string or string.
Returns: An LXML <body> element containing the extracted paragraphs, the main text as string, and its length as integer.
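Since baseline returns three values, a caller typically unpacks them. The following sketch uses baseline as a last resort when the main extraction yields nothing; the helper names `choose_text` and `extract_with_baseline` are hypothetical, and this fallback order is an illustration rather than the package's own logic.

```python
from typing import Optional, Tuple


def choose_text(main_result: Optional[str], baseline_result: Tuple) -> Optional[str]:
    """Prefer the main extraction; otherwise use the baseline text if non-empty."""
    # baseline() returns (body element, text, length), as described above.
    _body, text, length = baseline_result
    if main_result:
        return main_result
    return text if length > 0 else None


def extract_with_baseline(html: str) -> Optional[str]:
    """Hypothetical wrapper: try extract() first, then baseline()."""
    # Imported lazily so choose_text() stays usable without the package.
    from trafilatura import baseline, extract

    return choose_text(extract(html), baseline(html))
```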

trafilatura.extract_metadata(filecontent, default_url=None, date_config=None, fastmode=False, author_blacklist=None)

Main process for metadata extraction.

Parameters

filecontent – HTML code as string.
default_url – Previously known URL of the downloaded document.
date_config – Provide extraction parameters to htmldate as dict().
author_blacklist – Provide a blacklist of author names as set() to filter out authors.

Returns

A dict() containing the extracted metadata information or None.
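A possible use of the metadata is to check which of the essential fields (date, title, url) are missing before keeping a document. The sketch assumes the return value behaves like a dict, as documented above; if the installed version returns an object instead, it would need converting first. The names `ESSENTIAL_FIELDS`, `missing_essentials` and `check_page` are hypothetical.

```python
from typing import List, Optional

# Fields named as essential in the description of only_with_metadata.
ESSENTIAL_FIELDS = ("date", "title", "url")


def missing_essentials(metadata: Optional[dict]) -> List[str]:
    """List the essential fields that are absent or empty."""
    if metadata is None:  # no metadata could be extracted at all
        return list(ESSENTIAL_FIELDS)
    return [field for field in ESSENTIAL_FIELDS if not metadata.get(field)]


def check_page(html: str, known_url: Optional[str] = None) -> List[str]:
    """Hypothetical wrapper: extract metadata and report what is missing."""
    # Imported lazily so missing_essentials() stays usable without the package.
    from trafilatura import extract_metadata

    # Assumption: the result supports dict-style .get() access.
    metadata = extract_metadata(html, default_url=known_url)
    return missing_essentials(metadata)
```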

Link discovery

trafilatura.sitemaps.sitemap_search(url, target_lang=None)

Look for sitemaps for the given URL and gather links.

Parameters

url – Webpage or sitemap URL as string. Triggers URL-based filter if the webpage isn’t a homepage.
target_lang – Define a language to filter URLs based on heuristics (two-letter string, ISO 639-1 format).

Returns

The extracted links as a list (sorted list of unique links).

trafilatura.feeds.find_feed_urls(url, target_lang=None)

Try to find feed URLs.

Parameters

url – Webpage or feed URL as string. Triggers URL-based filter if the webpage isn’t a homepage.
target_lang – Define a language to filter URLs based on heuristics (two-letter string, ISO 639-1 format).

Returns

The extracted links as a list (sorted list of unique links).
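Both functions return sorted lists of unique links, so their results can be combined into a single list. The sketch below is one way to do that; `merge_links` and `discover_links` are hypothetical helper names, and the calls need network access.

```python
from typing import Iterable, List, Optional


def merge_links(*link_lists: Optional[Iterable[str]]) -> List[str]:
    """Merge several link lists into one sorted list of unique links."""
    # A None entry is treated as an empty list.
    return sorted({link for links in link_lists for link in (links or [])})


def discover_links(url: str, target_lang: Optional[str] = None) -> List[str]:
    """Hypothetical wrapper: gather links from sitemaps and feeds."""
    # Imported lazily so merge_links() stays usable without the package.
    from trafilatura.feeds import find_feed_urls
    from trafilatura.sitemaps import sitemap_search

    return merge_links(
        sitemap_search(url, target_lang=target_lang),
        find_feed_urls(url, target_lang=target_lang),
    )
```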

trafilatura.spider.focused_crawler(homepage, max_seen_urls=10, max_known_urls=100000, todo=None, known_links=None, lang=None, config=<configparser.ConfigParser object>, rules=None)

Basic crawler targeting pages of interest within a website.

Parameters

homepage – URL of the first page to fetch, preferably the homepage of a website.
max_seen_urls – maximum number of pages to visit, stop iterations at this number or at the exhaustion of pages on the website, whichever comes first.
max_known_urls – stop if the total number of pages “known” exceeds this number.
todo – provide a previously generated list of pages to visit / crawl frontier, must be in collections.deque format.
known_links – provide a previously generated set of links.
lang – try to target links according to language heuristics.
config – use a different configuration (configparser format).
rules – provide politeness rules (urllib.robotparser.RobotFileParser() format). New in version 0.9.1.

Returns

Two values: the list of pages to visit (deque format, possibly empty if there are no further pages to visit) and the set of known links.
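Because the crawler accepts its own output as input (todo and known_links), a crawl can be run in several batches or resumed later. The sketch below shows this, with hypothetical helpers for saving the state in a JSON-friendly form and for building politeness rules from the text of a robots.txt file. All helper names are illustrative; only the focused_crawler arguments come from the signature above.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List, Set, Tuple
from urllib.robotparser import RobotFileParser


def dump_state(to_visit: Iterable[str], known_links: Iterable[str]) -> Dict[str, List[str]]:
    """Convert the crawl state into plain lists (e.g. for json.dump)."""
    return {"todo": list(to_visit), "known": sorted(known_links)}


def load_state(state: Dict[str, List[str]]) -> Tuple[Deque[str], Set[str]]:
    """Restore the deque and set expected by the todo and known_links arguments."""
    return deque(state["todo"]), set(state["known"])


def parse_rules(robots_txt: str) -> RobotFileParser:
    """Build a RobotFileParser from robots.txt content, usable as rules."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def crawl_in_batches(homepage: str, batches: int = 3, pages_per_batch: int = 10):
    """Hypothetical wrapper: crawl in several rounds, reusing the state."""
    # Imported lazily so the helpers above stay usable without the package.
    from trafilatura.spider import focused_crawler

    to_visit, known_links = focused_crawler(homepage, max_seen_urls=pages_per_batch)
    for _ in range(batches - 1):
        if not to_visit:  # nothing left to visit
            break
        to_visit, known_links = focused_crawler(
            homepage,
            max_seen_urls=pages_per_batch,
            todo=to_visit,
            known_links=known_links,
        )
    return to_visit, known_links
```

Keeping the frontier and the known links outside the crawler makes it possible to stop after any batch and continue in a later session.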

Helpers

trafilatura.fetch_url(url, decode=True, no_ssl=False, config=<configparser.ConfigParser object>)

Fetches page using urllib3 and decodes the response.

Parameters

url – URL of the page to fetch.
decode – Decode response instead of returning urllib3 response object (boolean).
no_ssl – Don’t try to establish a secure connection (to prevent SSLError).
config – Pass configuration values for output control.

Returns

HTML code as string, or urllib3 response object (headers + body), or empty string in case the result is invalid, or None if there was a problem with the network.
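The different return values call for a check before the download is passed on. The sketch below tells the three cases apart for the default decode=True and only runs the extraction on a usable result; `describe_download` and `fetch_and_extract` are hypothetical helper names.

```python
from typing import Optional


def describe_download(result: Optional[str]) -> str:
    """Classify the value returned by fetch_url() when decode=True."""
    if result is None:  # problem with the network
        return "network problem"
    if result == "":  # response received but judged invalid
        return "invalid result"
    return "ok"


def fetch_and_extract(url: str) -> Optional[str]:
    """Hypothetical wrapper: download a page and extract its main text."""
    # Imported lazily so describe_download() stays usable without the package.
    from trafilatura import extract, fetch_url

    downloaded = fetch_url(url)
    if describe_download(downloaded) != "ok":
        return None
    return extract(downloaded, url=url)
```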

trafilatura.utils.decode_response(response): Read the urllib3 object corresponding to the server response, check whether it is GZip-compressed and decompress it if so, then try to guess its encoding and decode it to return a Unicode string.

trafilatura.load_html(htmlobject): Load object given as input and validate its type (accepted: LXML tree, trafilatura/urllib3 response, bytestring and string).

trafilatura.utils.sanitize(text): Convert text and discard incompatible and invalid characters.

trafilatura.utils.trim(string): Remove unnecessary spaces within a text string.

XML processing

trafilatura.xml.xmltotxt(xmloutput, include_formatting, include_links): Convert to plain text format and optionally preserve formatting as markdown.

trafilatura.xml.validate_tei(tei): Check whether an XML document conforms to the guidelines of the Text Encoding Initiative.

External processing

trafilatura.external.try_readability(htmlinput, url): Safety net: try with the generic algorithm readability.

trafilatura.external.try_justext(tree, url, target_language): Second safety net: try with the generic algorithm justext.
