Core functions

Extraction

trafilatura.extract(filecontent, url=None, record_id=None, no_fallback=False, favor_precision=False, favor_recall=False, include_comments=True, output_format='txt', tei_validation=False, target_language=None, include_tables=True, include_images=False, include_formatting=False, include_links=False, deduplicate=False, date_extraction_params=None, only_with_metadata=False, with_metadata=False, max_tree_size=None, url_blacklist=None, author_blacklist=None, settingsfile=None, config=<configparser.ConfigParser object>)
Main function exposed by the package:

Wrapper for text extraction and conversion to chosen output format.

Parameters
  • filecontent – HTML code as string.

  • url – URL of the webpage.

  • record_id – Add an ID to the metadata.

  • no_fallback – Skip the backup extraction with readability-lxml and justext.

  • favor_precision – Prefer less text but correct extraction (weak effect).

  • favor_recall – When unsure, prefer more text (experimental).

  • include_comments – Extract comments along with the main text.

  • output_format – Define an output format: ‘txt’, ‘csv’, ‘json’, ‘xml’, or ‘xmltei’.

  • tei_validation – Validate the XML-TEI output with respect to the TEI standard.

  • target_language – Define a language to discard invalid documents (ISO 639-1 format).

  • include_tables – Take into account information within the HTML <table> element.

  • include_images – Take images into account (experimental).

  • include_formatting – Keep structural elements related to formatting (only valuable if output_format is set to XML).

  • include_links – Keep links along with their targets (experimental).

  • deduplicate – Remove duplicate segments and documents.

  • date_extraction_params – Provide extraction parameters to htmldate as dict().

  • only_with_metadata – Only keep documents featuring all essential metadata (date, title, url).

  • with_metadata – Similar to only_with_metadata (will be deprecated).

  • max_tree_size – Discard documents with too many elements.

  • url_blacklist – Provide a blacklist of URLs as set() to filter out documents.

  • author_blacklist – Provide a blacklist of author names as set() to filter out authors.

  • settingsfile – Use a configuration file to override the standard settings.

  • config – Directly provide a configparser configuration.

Returns

A string in the desired format or None.

trafilatura.bare_extraction(filecontent, url=None, no_fallback=False, favor_precision=False, favor_recall=False, include_comments=True, output_format='python', target_language=None, include_tables=True, include_images=False, include_formatting=False, include_links=False, deduplicate=False, date_extraction_params=None, only_with_metadata=False, with_metadata=False, max_tree_size=None, url_blacklist=None, author_blacklist=None, as_dict=True, config=<configparser.ConfigParser object>)

Internal function for text extraction returning bare Python variables.

Parameters
  • filecontent – HTML code as string.

  • url – URL of the webpage.

  • no_fallback – Skip the backup extraction with readability-lxml and justext.

  • favor_precision – Prefer less text but correct extraction (weak effect).

  • favor_recall – Prefer more text even when unsure (experimental).

  • include_comments – Extract comments along with the main text.

  • output_format – Define an output format, Python being the default and the interest of this internal function. Other values: ‘txt’, ‘csv’, ‘json’, ‘xml’, or ‘xmltei’.

  • target_language – Define a language to discard invalid documents (ISO 639-1 format).

  • include_tables – Take into account information within the HTML <table> element.

  • include_images – Take images into account (experimental).

  • include_formatting – Keep structural elements related to formatting (present in XML format, converted to markdown otherwise).

  • include_links – Keep links along with their targets (experimental).

  • deduplicate – Remove duplicate segments and documents.

  • date_extraction_params – Provide extraction parameters to htmldate as dict().

  • only_with_metadata – Only keep documents featuring all essential metadata (date, title, url).

  • with_metadata – Similar to only_with_metadata (will be deprecated).

  • max_tree_size – Discard documents with too many elements.

  • url_blacklist – Provide a blacklist of URLs as set() to filter out documents.

  • author_blacklist – Provide a blacklist of author names as set() to filter out authors.

  • as_dict – Legacy option, return a dictionary instead of a class with attributes.

  • config – Directly provide a configparser configuration.

Returns

A Python dict() containing all the extracted information or None.

Raises

ValueError – Extraction problem.

trafilatura.baseline(filecontent)

Use baseline extraction function targeting text paragraphs and/or JSON metadata.

Parameters

filecontent – HTML code as binary string or string.

Returns

An LXML <body> element containing the extracted paragraphs, the main text as a string, and its length as an integer.

trafilatura.extract_metadata(filecontent, default_url=None, date_config=None, fastmode=False, author_blacklist=None)

Main process for metadata extraction.

Parameters
  • filecontent – HTML code as string.

  • default_url – Previously known URL of the downloaded document.

  • date_config – Provide extraction parameters to htmldate as dict().

  • author_blacklist – Provide a blacklist of author names as set() to filter out authors.

Returns

A dict() containing the extracted metadata information or None.

Helpers

trafilatura.fetch_url(url, decode=True, no_ssl=False, config=<configparser.ConfigParser object>)

Fetches page using urllib3 and decodes the response.

Parameters
  • url – URL of the page to fetch.

  • decode – Decode response instead of returning urllib3 response object (boolean).

  • no_ssl – Don’t try to establish a secure connection (to prevent SSLError).

  • config – Pass configuration values for output control.

Returns

HTML code as string, or urllib3 response object (headers + body), or empty string in case the result is invalid, or None if there was a problem with the network.

trafilatura.utils.decode_response(response)

Read the urllib3 object corresponding to the server response, check whether it could be GZip-compressed and decompress it if so, then try to guess its encoding and decode it to return a Unicode string.

trafilatura.load_html(htmlobject)

Load the object given as input and validate its type (accepted: LXML tree, trafilatura/urllib3 response, bytestring and string).

trafilatura.utils.sanitize(text)

Convert text and discard incompatible and invalid characters.

trafilatura.utils.trim(string)

Remove unnecessary spaces within a text string.

XML processing

trafilatura.xml.xmltotxt(xmloutput, include_formatting, include_links)

Convert to plain text format and optionally preserve formatting as markdown.

trafilatura.xml.validate_tei(tei)

Check if an XML document conforms to the guidelines of the Text Encoding Initiative.

External processing

trafilatura.external.try_readability(htmlinput, url)

Safety net: try with the generic algorithm readability.

trafilatura.external.try_justext(tree, url, target_language)

Second safety net: try with the generic algorithm justext.