Uses & citations#

Trafilatura is used by other software packages and cited in research publications. This page lists projects and publications that use or mention the library.

To add further projects, please edit this page and suggest changes.

Notable projects using this software#

Institutional users#

Various repositories#

  • Benson, to turn a list of URLs into mp3s of the contents of each web page

  • CommonCrawl downloader, to derive massive amounts of language data

  • GLAM Workbench for cultural heritage (web archives section)

  • Obsei, a text collection and analysis tool

  • Vulristics, a framework for analyzing publicly available information about vulnerabilities

Citations in papers#

To reference this software in a publication, please cite the following paper:

@inproceedings{barbaresi-2021-trafilatura,
  title = {{Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction}},
  author = "Barbaresi, Adrien",
  booktitle = "Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
  pages = "122--131",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.acl-demo.15",
  year = 2021,
}

The date extraction component htmldate is referenced in the following publication:

@article{barbaresi-2020-htmldate,
  title = {{htmldate: A Python package to extract publication dates from web pages}},
  author = "Barbaresi, Adrien",
  journal = "Journal of Open Source Software",
  volume = 5,
  number = 51,
  pages = 2439,
  url = {https://doi.org/10.21105/joss.02439},
  publisher = {The Open Journal},
  year = 2020,
}

Research using Trafilatura#

Research using Htmldate#

Ports#

Go port: go-trafilatura