Uses & citations
Trafilatura is used by other software packages and cited in research publications. This page lists the projects and papers that use or mention the library.
To add further projects, please edit this page and suggest changes.
Notable projects using this software
CommonCrawl downloader, a tool to derive massive amounts of language data
GLAM Workbench for cultural heritage and web archives
Obsei, a text analysis tool
The Internet Archive’s sandcrawler, which crawls and processes the scholarly web for the Fatcat catalog of research publications
SciencesPo’s médialab, through its Minet webmining tool
Vulristics, a framework for analyzing publicly available information about vulnerabilities
Citations in papers
To reference this software in a publication, please cite the following paper:
Barbaresi, A. Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction, in Proceedings of ACL/IJCNLP 2021: System Demonstrations, 2021, p. 122-131. DOI: 10.18653/v1/2021.acl-demo.15
@inproceedings{barbaresi-2021-trafilatura,
title = {{Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction}},
author = "Barbaresi, Adrien",
booktitle = "Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
pages = "122--131",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-demo.15",
year = 2021,
}
The date extraction component htmldate is referenced in the following publication:
Barbaresi, A. “htmldate: A Python package to extract publication dates from web pages”, Journal of Open Source Software, 5(51), 2439, 2020. DOI: 10.21105/joss.02439
@article{barbaresi-2020-htmldate,
title = {{htmldate: A Python package to extract publication dates from web pages}},
author = "Barbaresi, Adrien",
journal = "Journal of Open Source Software",
volume = 5,
number = 51,
pages = 2439,
url = {https://doi.org/10.21105/joss.02439},
publisher = {The Open Journal},
year = 2020,
}
Research using Trafilatura
Alhamzeh, A., Bouhaouel, M., Egyed-Zsigmond, E., & Mitrović, J. (2021). DistilBERT-based Argumentation Retrieval for Answering Comparative Questions. Proceedings of CLEF 2021 – Conference and Labs of the Evaluation Forum.
Bozarth, L., & Budak, C. (2021). An Analysis of the Partnership between Retailers and Low-credibility News Publishers. Journal of Quantitative Description: Digital Media, 1.
Braun, D. (2021). Automated Semantic Analysis, Legal Assessment, and Summarization of Standard Form Contracts (Doctoral dissertation, Universität München).
Fröbe, M., Hagen, M., Bevendorff, J., Völske, M., Stein, B., Schröder, C., … & Potthast, M. (2021). The Impact of Main Content Extraction on Near-Duplicate Detection. arXiv preprint arXiv:2111.10864.
Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., … & Leahy, C. (2020). The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv preprint arXiv:2101.00027.
Harrando, I., & Troncy, R. (2021). Explainable Zero-Shot Topic Extraction Using a Common-Sense Knowledge Graph. In 3rd Conference on Language, Data and Knowledge (LDK 2021). OpenAccess Series in Informatics, Dagstuhl Publishing.
Karabulut, M., & Mayda, İ. (2020). Development of Browser Extension for HTML Web Page Content Extraction. In 2020 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA) (pp. 1-6). IEEE.
Khusainov, A., Suleymanov, D., Gilmullin, R., Minsafina, A., Kubedinova, L., & Abdurakhmonova, N. First Results of the “TurkLang-7” Project: Creating Russian-Turkic Parallel Corpora and MT Systems.
Laippala, V., Rönnqvist, S., Hellström, S., Luotolahti, J., Repo, L., Salmela, A., … & Pyysalo, S. (2020). From Web Crawl to Clean Register-Annotated Corpora. In Proceedings of the 12th Web as Corpus Workshop (pp. 14-22).
Madrid-Morales, D. (2021). Who Set the Narrative? Assessing the Influence of Chinese Media in News Coverage of COVID-19 in 30 African Countries.
Meng, K. (2021). An End-to-End Computational System for Monitoring and Verifying Factual Claims. (pre-print)
Robertson, F., Lagus, J., & Kajava, K. (2021). A COVID-19 news coverage mood map of Europe. In Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation (pp. 110-115).
Sawczyn, A., Binkowski, J., Janiak, D., Augustyniak, Ł., & Kajdanowicz, T. (2021). Fact-checking: relevance assessment of references in the Polish political domain. Procedia Computer Science, 192, 1285-1293.
Zinn, J. O., & Müller, M. (2021). Understanding discourse and language of risk. Journal of Risk Research, 1-14.
Research using Htmldate
Kupi, M. (2021). Late to the Party? Agile Methods in British and German Government Institutions, Master’s Thesis, Hertie School Berlin.
Smits, T., & Ros, R. (2021). Distant reading 940,000 online circulations of 26 iconic photographs. New Media & Society, DOI: 10.1177/14614448211049.
Ports
- Go port