Uses & citations#

Trafilatura is used at several institutions, included in other software packages and cited in research publications. This page lists projects and publications mentioning the library.

To add further references, please edit this page and suggest changes.

Notable projects using this software#

Known institutional users#

Data against Feminicide
Media Cloud platform for media analysis
The Internet Archive’s sandcrawler, which crawls and processes the scholarly web for the Fatcat catalog of research publications
SciencesPo médialab through its Minet webmining package

Various software repositories#

Benson, to turn a list of URLs into mp3s of the contents of each web page
CommonCrawl downloader, to derive massive amounts of language data
GLAM Workbench for cultural heritage (web archives section)
Obsei, a text collection and analysis tool
Vulristics, a framework for analyzing publicly available information about vulnerabilities

For more, see this list of software using Trafilatura.

Citations in papers#

Trafilatura as a whole#

To reference this software in a publication, please cite the following paper:

Barbaresi, A. “Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction”, in Proceedings of ACL/IJCNLP 2021: System Demonstrations, 2021, pp. 122-131. DOI: 10.18653/v1/2021.acl-demo.15

Reference DOI: 10.18653/v1/2021.acl-demo.15

@inproceedings{barbaresi-2021-trafilatura,
  title = {{Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction}},
  author = "Barbaresi, Adrien",
  booktitle = "Proceedings of the Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
  pages = "122--131",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2021.acl-demo.15",
  year = 2021,
}

Date extraction (htmldate)#

The date extraction component htmldate is referenced in the following publication:

Barbaresi, A. “htmldate: A Python package to extract publication dates from web pages”, Journal of Open Source Software, 5(51), 2439, 2020. DOI: 10.21105/joss.02439

@article{barbaresi-2020-htmldate,
  title = {{htmldate: A Python package to extract publication dates from web pages}},
  author = "Barbaresi, Adrien",
  journal = "Journal of Open Source Software",
  volume = 5,
  number = 51,
  pages = 2439,
  url = {https://doi.org/10.21105/joss.02439},
  publisher = {The Open Journal},
  year = 2020,
}

Publications citing Trafilatura#

Alakukku, L. (2022). “Domain specific boilerplate removal from web pages with entropy and clustering”, Master’s thesis, Aalto University.
Alhamzeh, A., Bouhaouel, M., Egyed-Zsigmond, E., & Mitrović, J. (2021). “DistilBERT-based Argumentation Retrieval for Answering Comparative Questions”, Proceedings of CLEF 2021 – Conference and Labs of the Evaluation Forum.
Bender, M., Bubenhofer, N., Dreesen, P., Georgi, C., Rüdiger, J. O., & Vogel, F. (2022). “Techniken und Praktiken der Verdatung”, Diskurse–digital, 135-158.
Bozarth, L., & Budak, C. (2021). “An Analysis of the Partnership between Retailers and Low-credibility News Publishers”, Journal of Quantitative Description: Digital Media, 1.
Braun, D. (2021). “Automated Semantic Analysis, Legal Assessment, and Summarization of Standard Form Contracts”, PhD Thesis, Technische Universität München.
Chen, X., Zeynali, A., Camargo, C., Flöck, F., Gaffney, D., Grabowicz, P., … & Samory, M. (2022). “SemEval-2022 Task 8: Multilingual news article similarity”, In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022) (pp. 1094-1106).
Di Giovanni, M., Tasca, T., & Brambilla, M. (2022). “DataScience-Polimi at SemEval-2022 Task 8: Stacking Language Models to Predict News Article Similarity”, In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022) (pp. 1229-1234).
Fröbe, M., Hagen, M., Bevendorff, J., Völske, M., Stein, B., Schröder, C., … & Potthast, M. (2021). “The Impact of Main Content Extraction on Near-Duplicate Detection”. arXiv preprint arXiv:2111.10864.
Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., … & Leahy, C. (2020). “The Pile: An 800GB Dataset of Diverse Text for Language Modeling”, arXiv preprint arXiv:2101.00027.
Harrando, I., & Troncy, R. (2021). “Explainable Zero-Shot Topic Extraction Using a Common-Sense Knowledge Graph”, In 3rd Conference on Language, Data and Knowledge (LDK 2021). OpenAccess Series in Informatics, Dagstuhl Publishing.
Indig, B., Sárközi-Lindner, Z., & Nagy, M. (2022). “Use the Metadata, Luke! – An Experimental Joint Metadata Search and N-gram Trend Viewer for Personal Web Archives”, In Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities (pp. 47-52).
Jung, G., Han, S., Kim, H., Kim, K., & Cha, J. (2022). “Extracting the Main Content of Web Pages Using the First Impression Area”, IEEE Access.
Karabulut, M., & Mayda, İ. (2020). “Development of Browser Extension for HTML Web Page Content Extraction”, In 2020 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA) (pp. 1-6). IEEE.
Khusainov, A., Suleymanov, D., Gilmullin, R., Minsafina, A., Kubedinova, L., & Abdurakhmonova, N. “First Results of the ‘TurkLang-7’ Project: Creating Russian-Turkic Parallel Corpora and MT Systems”, In CMCL (pp. 90-101).
Laippala, V., Rönnqvist, S., Hellström, S., Luotolahti, J., Repo, L., Salmela, A., … & Pyysalo, S. (2020). “From Web Crawl to Clean Register-Annotated Corpora”, Proceedings of the 12th Web as Corpus Workshop (pp. 14-22).
Madrid-Morales, D. (2021). “Who Set the Narrative? Assessing the Influence of Chinese Media in News Coverage of COVID-19 in 30 African Countries”, Global Media and China, 6(2), 129-151.
Meier-Vieracker, S. (2022). “Fußballwortschatz digital–Korpuslinguistische Ressourcen für den Sprachunterricht.” Korpora Deutsch als Fremdsprache (KorDaF), 2022/01 (pre-print).
Meng, K. (2021). “An End-to-End Computational System for Monitoring and Verifying Factual Claims” (pre-print).
Miquelina, N., Quaresma, P., & Nogueira, V. B. (2022). “Generating a European Portuguese BERT Based Model Using Content from Arquivo.pt Archive”, In International Conference on Intelligent Data Engineering and Automated Learning (pp. 280-288). Springer, Cham.
Robertson, F., Lagus, J., & Kajava, K. (2021). “A COVID-19 news coverage mood map of Europe”, Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation (pp. 110-115).
Salmela, A. (2022). “Distinguishing Noise and Main Text Content from Web-Sourced Plain Text Documents Using Sequential Neural Networks”, Master’s thesis, University of Turku.
Sawczyn, A., Binkowski, J., Janiak, D., Augustyniak, Ł., & Kajdanowicz, T. (2021). “Fact-checking: relevance assessment of references in the Polish political domain”, Procedia Computer Science, 192, 1285-1293.
Schamel, T., Braun, D., & Matthes, F. (2022). “Structured Extraction of Terms and Conditions from German and English Online Shops”, In Proceedings of The Fifth Workshop on e-Commerce and NLP (ECNLP 5) (pp. 181-190).
Sutter, T., Bozkir, A. S., Gehring, B., & Berlich, P. (2022). “Avoiding the Hook: Influential Factors of Phishing Awareness Training on Click-Rates and a Data-Driven Approach to Predict Email Difficulty Perception”, IEEE Access, 10, 100540-100565.
Ter-Akopyan, B. (2022). “Identification of Political Leaning in German News”, Master’s thesis, Ludwig Maximilian University of Munich.
Waheed, A., Qunaibi, S., Barradas, D., & Weinberg, Z. (2022). “Darwin’s Theory of Censorship: Analysing the Evolution of Censored Topics with Dynamic Topic Models”, In Proceedings of the 21st Workshop on Privacy in the Electronic Society (pp. 103-108).
Zinn, J. O., & Müller, M. (2021). “Understanding discourse and language of risk”, Journal of Risk Research, 1-14.

Publications citing htmldate#

See the citation page of htmldate’s documentation.

Ports#

Go port: go-trafilatura