Uses & citations

Trafilatura is used at several institutions, included in other software packages and cited in research publications. This page lists projects and publications mentioning the library.

To add further references, please edit this page and suggest changes.

Notable projects using this software

Institutional users

Various repositories

  • Benson, to turn a list of URLs into mp3s of the contents of each web page

  • CommonCrawl downloader, to derive massive amounts of language data

  • GLAM Workbench for cultural heritage (web archives section)

  • Obsei, a text collection and analysis tool

  • Vulristics, a framework for analyzing publicly available information about vulnerabilities

Citations in papers

Trafilatura as a whole

To reference this software in a publication, please cite the following paper:

Reference DOI: 10.18653/v1/2021.acl-demo.15
Zenodo archive DOI: 10.5281/zenodo.3460969
@inproceedings{barbaresi-2021-trafilatura,
  title = {{Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction}},
  author = "Barbaresi, Adrien",
  booktitle = "Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
  pages = "122--131",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.acl-demo.15",
  year = 2021,
}

Date extraction (htmldate)

The date extraction component htmldate is referenced in the following publication:

JOSS article DOI: 10.21105/joss.02439
@article{barbaresi-2020-htmldate,
  title = {{htmldate: A Python package to extract publication dates from web pages}},
  author = "Barbaresi, Adrien",
  journal = "Journal of Open Source Software",
  volume = 5,
  number = 51,
  pages = 2439,
  url = {https://doi.org/10.21105/joss.02439},
  publisher = {The Open Journal},
  year = 2020,
}

Publications citing Trafilatura

  • Alakukku, L. (2022). “Domain specific boilerplate removal from web pages with entropy and clustering”, Master’s thesis, Aalto University.

  • Alhamzeh, A., Bouhaouel, M., Egyed-Zsigmond, E., & Mitrović, J. (2021). “DistilBERT-based Argumentation Retrieval for Answering Comparative Questions”, Proceedings of CLEF 2021 – Conference and Labs of the Evaluation Forum.

  • Bozarth, L., & Budak, C. (2021). “An Analysis of the Partnership between Retailers and Low-credibility News Publishers”, Journal of Quantitative Description: Digital Media, 1.

  • Braun, D. (2021). “Automated Semantic Analysis, Legal Assessment, and Summarization of Standard Form Contracts”, PhD Thesis, Technische Universität München.

  • Fröbe, M., Hagen, M., Bevendorff, J., Völske, M., Stein, B., Schröder, C., … & Potthast, M. (2021). “The Impact of Main Content Extraction on Near-Duplicate Detection”, arXiv preprint arXiv:2111.10864.

  • Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., … & Leahy, C. (2020). “The Pile: An 800GB Dataset of Diverse Text for Language Modeling”, arXiv preprint arXiv:2101.00027.

  • Harrando, I., & Troncy, R. (2021). “Explainable Zero-Shot Topic Extraction Using a Common-Sense Knowledge Graph”, In 3rd Conference on Language, Data and Knowledge (LDK 2021). OpenAccess Series in Informatics, Dagstuhl Publishing.

  • Karabulut, M., & Mayda, İ. (2020). “Development of Browser Extension for HTML Web Page Content Extraction”, In 2020 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA) (pp. 1-6). IEEE.

  • Khusainov, A., Suleymanov, D., Gilmullin, R., Minsafina, A., Kubedinova, L., & Abdurakhmonova, N. “First Results of the ‘TurkLang-7’ Project: Creating Russian-Turkic Parallel Corpora and MT Systems”, In CMCL (pp. 90-101).

  • Laippala, V., Rönnqvist, S., Hellström, S., Luotolahti, J., Repo, L., Salmela, A., … & Pyysalo, S. (2020). “From Web Crawl to Clean Register-Annotated Corpora”, Proceedings of the 12th Web as Corpus Workshop (pp. 14-22).

  • Madrid-Morales, D. (2021). “Who Set the Narrative? Assessing the Influence of Chinese Media in News Coverage of COVID-19 in 30 African Countries”, Global Media and China, 6(2), 129-151.

  • Meier-Vieracker, S. (2022). “Fußballwortschatz digital – Korpuslinguistische Ressourcen für den Sprachunterricht”, Korpora Deutsch als Fremdsprache (KorDaF), 2022/01 (pre-print).

  • Meng, K. (2021). “An End-to-End Computational System for Monitoring and Verifying Factual Claims” (pre-print).

  • Robertson, F., Lagus, J., & Kajava, K. (2021). “A COVID-19 news coverage mood map of Europe”, Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation (pp. 110-115).

  • Salmela, A. (2022). “Distinguishing Noise and Main Text Content from Web-Sourced Plain Text Documents Using Sequential Neural Networks”, Master’s thesis, University of Turku.

  • Sawczyn, A., Binkowski, J., Janiak, D., Augustyniak, Ł., & Kajdanowicz, T. (2021). “Fact-checking: relevance assessment of references in the Polish political domain”, Procedia Computer Science, 192, 1285-1293.

  • Ter-Akopyan, B. (2022). “Identification of Political Leaning in German News”, Master’s thesis, Ludwig Maximilian University of Munich.

  • Zinn, J. O., & Müller, M. (2021). “Understanding discourse and language of risk”, Journal of Risk Research, 1-14.

Publications citing htmldate

  • Grabovoy, A., Bakhteev, O., & Chekhovich, Y. (2021). “The automatic approach for scientific papers dating”, 2021 Ivannikov ISPRAS Open Conference (ISPRAS), pp. 107-113, IEEE, DOI: 10.1109/ISPRAS53967.2021.00020.

  • Hanley, H. W., Kumar, D., & Durumeric, Z. (2022). “Happenstance: Utilizing Semantic Search to Track Russian State Media Narratives about the Russo-Ukrainian War On Reddit”, arXiv preprint arXiv:2205.14484.

  • Kupi, M. (2021). “Late to the Party? Agile Methods in British and German Government Institutions”, Master’s thesis, Hertie School Berlin.

  • Smits, T., & Ros, R. (2021). “Distant reading 940,000 online circulations of 26 iconic photographs”, New Media & Society, DOI: 10.1177/14614448211049.

Ports

Go port

go-trafilatura