Uses & citations#

Trafilatura is used by other software packages and cited in research publications. This page lists projects mentioning the library.

To add further projects, please edit this page and suggest changes.

Notable projects using this software#

CommonCrawl downloader, a tool to derive massive amounts of language data
GLAM Workbench for cultural heritage and web archives
Obsei, a text analysis tool
The Internet Archive’s sandcrawler, which crawls and processes the scholarly web for the Fatcat catalog of research publications
SciencesPo’s médialab, through its Minet webmining tool
Vulristics, a framework for analyzing publicly available information about vulnerabilities

Ports#

Go port: go-trafilatura

Citations in papers#

To reference this software in a publication, please cite the following paper:

Barbaresi, A. (2021). Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction. In Proceedings of ACL/IJCNLP 2021: System Demonstrations (pp. 122-131).

@inproceedings{barbaresi-2021-trafilatura,
  title = {{Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction}},
  author = "Barbaresi, Adrien",
  booktitle = "Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
  pages = "122--131",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.acl-demo.15",
  year = 2021,
}

Research using Trafilatura#

Bozarth, L., & Budak, C. (2021). An Analysis of the Partnership between Retailers and Low-credibility News Publishers. Journal of Quantitative Description: Digital Media, 1.
Braun, D. (2021). Automated Semantic Analysis, Legal Assessment, and Summarization of Standard Form Contracts (Doctoral dissertation, Universität München).
Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., … & Leahy, C. (2020). The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv preprint arXiv:2101.00027.
Karabulut, M., & Mayda, İ. (2020). Development of Browser Extension for HTML Web Page Content Extraction. In 2020 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA) (pp. 1-6). IEEE.
Khusainov, A., Suleymanov, D., Gilmullin, R., Minsafina, A., Kubedinova, L., & Abdurakhmonova, N. First Results of the “TurkLang-7” Project: Creating Russian-Turkic Parallel Corpora and MT Systems.
Laippala, V., Rönnqvist, S., Hellström, S., Luotolahti, J., Repo, L., Salmela, A., … & Pyysalo, S. (2020). From Web Crawl to Clean Register-Annotated Corpora. In Proceedings of the 12th Web as Corpus Workshop (pp. 14-22).
Madrid-Morales, D. (2021). Who Set the Narrative? Assessing the Influence of Chinese Media in News Coverage of COVID-19 in 30 African Countries.
Robertson, F., Lagus, J., & Kajava, K. (2021). A COVID-19 news coverage mood map of Europe. In Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation (pp. 110-115).