Core functions#
Extraction#
extract()#
- trafilatura.extract(filecontent, url=None, record_id=None, no_fallback=False, favor_precision=False, favor_recall=False, include_comments=True, output_format='txt', tei_validation=False, target_language=None, include_tables=True, include_images=False, include_formatting=False, include_links=False, deduplicate=False, date_extraction_params=None, only_with_metadata=False, with_metadata=False, max_tree_size=None, url_blacklist=None, author_blacklist=None, settingsfile=None, config=<configparser.ConfigParser object>, **kwargs)
- Main function exposed by the package: wrapper for text extraction and conversion to the chosen output format.
- Parameters:
filecontent – HTML code as string.
url – URL of the webpage.
record_id – Add an ID to the metadata.
no_fallback – Skip the backup extraction with readability-lxml and justext.
favor_precision – Prefer less text but correct extraction.
favor_recall – When unsure, prefer more text.
include_comments – Extract comments along with the main text.
output_format – Define an output format: ‘txt’, ‘csv’, ‘json’, ‘xml’, or ‘xmltei’.
tei_validation – Validate the XML-TEI output with respect to the TEI standard.
target_language – Define a language to discard invalid documents (ISO 639-1 format).
include_tables – Take into account information within the HTML <table> element.
include_images – Take images into account (experimental).
include_formatting – Keep structural elements related to formatting (only valuable if output_format is set to XML).
include_links – Keep links along with their targets (experimental).
deduplicate – Remove duplicate segments and documents.
date_extraction_params – Provide extraction parameters to htmldate as dict().
only_with_metadata – Only keep documents featuring all essential metadata (date, title, url).
max_tree_size – Discard documents with too many elements.
url_blacklist – Provide a blacklist of URLs as set() to filter out documents.
author_blacklist – Provide a blacklist of Author Names as set() to filter out authors.
settingsfile – Use a configuration file to override the standard settings.
config – Directly provide a configparser configuration.
- Returns:
A string in the desired format or None.
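A minimal usage sketch combining download and extraction (the URL is illustrative, not part of the API):

```python
from trafilatura import fetch_url, extract

# fetch a page, then extract its main text as XML with links preserved
downloaded = fetch_url("https://example.org/article")  # illustrative URL
if downloaded is not None:
    result = extract(downloaded, output_format="xml", include_links=True)
    print(result)  # XML as a string, or None if extraction failed
```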
bare_extraction()#
- trafilatura.bare_extraction(filecontent, url=None, no_fallback=False, favor_precision=False, favor_recall=False, include_comments=True, output_format='python', target_language=None, include_tables=True, include_images=False, include_formatting=False, include_links=False, deduplicate=False, date_extraction_params=None, only_with_metadata=False, with_metadata=False, max_tree_size=None, url_blacklist=None, author_blacklist=None, as_dict=True, config=<configparser.ConfigParser object>)
Internal function for text extraction returning bare Python variables.
- Parameters:
filecontent – HTML code as string.
url – URL of the webpage.
no_fallback – Skip the backup extraction with readability-lxml and justext.
favor_precision – Prefer less text but correct extraction.
favor_recall – Prefer more text even when unsure.
include_comments – Extract comments along with the main text.
output_format – Define an output format; ‘python’ is the default and the point of this internal function. Other values: ‘txt’, ‘csv’, ‘json’, ‘xml’, or ‘xmltei’.
target_language – Define a language to discard invalid documents (ISO 639-1 format).
include_tables – Take into account information within the HTML <table> element.
include_images – Take images into account (experimental).
include_formatting – Keep structural elements related to formatting (present in XML format, converted to markdown otherwise).
include_links – Keep links along with their targets (experimental).
deduplicate – Remove duplicate segments and documents.
date_extraction_params – Provide extraction parameters to htmldate as dict().
only_with_metadata – Only keep documents featuring all essential metadata (date, title, url).
max_tree_size – Discard documents with too many elements.
url_blacklist – Provide a blacklist of URLs as set() to filter out documents.
author_blacklist – Provide a blacklist of Author Names as set() to filter out authors.
as_dict – Legacy option, return a dictionary instead of a class with attributes.
config – Directly provide a configparser configuration.
- Returns:
A Python dict() containing all the extracted information or None.
- Raises:
ValueError – Extraction problem.
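A minimal sketch of this internal interface; the HTML snippet stands in for previously downloaded content, and the dict keys shown are assumptions based on the documented dict return:

```python
from trafilatura import bare_extraction

# stand-in for previously downloaded HTML
html_doc = "<html><body><article><p>Sample text to extract.</p></article></body></html>"
document = bare_extraction(html_doc, url="https://example.org/post")  # illustrative URL
if document is not None:  # None signals an extraction problem
    print(document["title"], document["text"])  # dict access (as_dict=True by default)
```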
baseline()#
- trafilatura.baseline(filecontent)
Use the baseline extraction function targeting text paragraphs and/or JSON metadata.
- Parameters:
filecontent – HTML code as binary string or string.
- Returns:
An LXML <body> element containing the extracted paragraphs, the main text as a string, and its length as an integer.
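A short sketch of this fallback interface, assuming the three-part return value described above:

```python
from trafilatura import baseline

html_doc = "<html><body><p>This is the main text of the page.</p></body></html>"
postbody, text, length = baseline(html_doc)  # <body> element, text string, length
print(text, length)
```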
html2txt()#
try_readability()#
try_justext()#
extract_metadata()#
- trafilatura.extract_metadata(filecontent, default_url=None, date_config=None, fastmode=False, author_blacklist=None)
Main process for metadata extraction.
- Parameters:
filecontent – HTML code as string.
default_url – Previously known URL of the downloaded document.
date_config – Provide extraction parameters to htmldate as dict().
author_blacklist – Provide a blacklist of Author Names as set() to filter out authors.
- Returns:
A dict() containing the extracted metadata information or None.
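A hedged sketch; the URL is illustrative, and the dict keys queried (title, author, date) are assumptions based on the documented dict return:

```python
from trafilatura import extract_metadata, fetch_url

downloaded = fetch_url("https://example.org/article")  # illustrative URL
if downloaded is not None:
    metadata = extract_metadata(downloaded, default_url="https://example.org/article")
    if metadata is not None:  # None signals a metadata extraction problem
        print(metadata["title"], metadata["author"], metadata["date"])
```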
extract_comments()#
Link discovery#
sitemap_search()#
- trafilatura.sitemaps.sitemap_search(url, target_lang=None)
Look for sitemaps for the given URL and gather links.
- Parameters:
url – Webpage or sitemap URL as string. Triggers URL-based filter if the webpage isn’t a homepage.
target_lang – Define a language to filter URLs based on heuristics (two-letter string, ISO 639-1 format).
- Returns:
The extracted links as a list (sorted list of unique links).
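A minimal sketch (the homepage URL is illustrative):

```python
from trafilatura.sitemaps import sitemap_search

# gather unique links listed in the sitemaps of a website
links = sitemap_search("https://www.example.org", target_lang="en")
print(len(links))
```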
find_feed_urls()#
- trafilatura.feeds.find_feed_urls(url, target_lang=None)
Try to find feed URLs.
- Parameters:
url – Webpage or feed URL as string. Triggers URL-based filter if the webpage isn’t a homepage.
target_lang – Define a language to filter URLs based on heuristics (two-letter string, ISO 639-1 format).
- Returns:
The extracted links as a list (sorted list of unique links).
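A minimal sketch (the URL is illustrative); per the parameter description, a known feed URL can be passed directly as well:

```python
from trafilatura.feeds import find_feed_urls

# discover feeds from a homepage and collect the links they list
links = find_feed_urls("https://www.example.org", target_lang="en")
for link in links[:5]:
    print(link)
```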
focused_crawler()#
- trafilatura.spider.focused_crawler(homepage, max_seen_urls=10, max_known_urls=100000, todo=None, known_links=None, lang=None, config=<configparser.ConfigParser object>, rules=None)
Basic crawler targeting pages of interest within a website.
- Parameters:
homepage – URL of the first page to fetch, preferably the homepage of a website.
max_seen_urls – Maximum number of pages to visit; iteration stops at this number or when the website’s pages are exhausted, whichever comes first.
max_known_urls – Stop if the total number of “known” pages exceeds this number.
todo – Provide a previously generated list of pages to visit / crawl frontier; must be in collections.deque format.
known_links – Provide a previously generated set of links.
lang – Try to target links according to language heuristics.
config – Use a different configuration (configparser format).
rules – Provide politeness rules (urllib.robotparser.RobotFileParser() format). New in version 0.9.1.
- Returns:
The pages still to visit as a deque (possibly empty if there are no further pages to visit) and the set of known links.
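A sketch showing how the frontier and the set of known links can be fed back in to resume a crawl (the URL is illustrative):

```python
from trafilatura.spider import focused_crawler

# first run: start from the homepage, visit at most 10 pages
to_visit, known_links = focused_crawler("https://www.example.org", max_seen_urls=10)

# later run: resume by passing the frontier and known links back in
to_visit, known_links = focused_crawler(
    "https://www.example.org",
    max_seen_urls=10,
    todo=to_visit,
    known_links=known_links,
)
```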
Helpers#
fetch_url()#
- trafilatura.fetch_url(url, decode=True, no_ssl=False, config=<configparser.ConfigParser object>)
Fetch a page using urllib3 and decode the response.
- Parameters:
url – URL of the page to fetch.
decode – Decode response instead of returning urllib3 response object (boolean).
no_ssl – Don’t try to establish a secure connection (to prevent SSLError).
config – Pass configuration values for output control.
- Returns:
HTML code as a string, a urllib3 response object (headers + body), an empty string if the result is invalid, or None in case of a network problem.
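A minimal sketch of both return modes (the URL is illustrative):

```python
from trafilatura import fetch_url

html = fetch_url("https://example.org/article")  # decoded HTML string by default
if html:  # an empty string or None signals an invalid result or a network problem
    print(len(html))

response = fetch_url("https://example.org/article", decode=False)  # urllib3 response object
```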