Working with corpus data¶

What comes next after gathering texts from the Web? This page lists options for working with the output generated by Trafilatura.

Generic solutions in Python¶

Data science¶

Load the input into the data analysis library Pandas:
- read_csv
- read_json
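A minimal sketch of both loaders follows. The sample data is made up and held in memory so the snippet runs on its own. The column names (url, title, text), the tab separator, and the one-object-per-line JSON layout are assumptions, so adjust them to match the actual export.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for exported files;
# the column names and the tab separator are assumptions.
csv_data = (
    "url\ttitle\ttext\n"
    "https://example.org/a\tFirst page\tSome extracted text.\n"
    "https://example.org/b\tSecond page\tMore text here.\n"
)
json_lines = (
    '{"url": "https://example.org/a", "title": "First page", "text": "Some extracted text."}\n'
    '{"url": "https://example.org/b", "title": "Second page", "text": "More text here."}\n'
)

# read_csv and read_json accept file paths as well as file-like objects.
df_csv = pd.read_csv(io.StringIO(csv_data), sep="\t")
df_json = pd.read_json(io.StringIO(json_lines), lines=True)

# Example of a derived column: whitespace-separated token count per document.
df_csv["n_tokens"] = df_csv["text"].str.split().str.len()
```

With real files, pass the file path in place of the io.StringIO object.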

Natural language processing¶

For a hands-on approach to NLP pipelines, see TextBlob or the Natural Language Toolkit (NLTK).

Accessible tutorials:

Specific tools:

Topic modeling, including word2vec models: Gensim tutorials
Scattertext is a tool for finding distinguishing terms in corpora, and presenting them in an interactive scatter plot.

Formats and software used in corpus linguistics¶

Input/output formats: TXT, XML and XML-TEI are common in corpus linguistics.

Han, N.-R. (2022). “Transforming Data”, The Open Handbook of Linguistic Data.

The XML and XML-TEI formats¶

See A Gentle Introduction to XML, or the Python package xmltodict, which lets you read the files directly and work with the data as if it were in JSON format.
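As an alternative that needs no extra package, the sketch below uses Python's standard-library xml.etree.ElementTree instead of xmltodict to turn an XML document into a JSON-like dictionary. The element and attribute names in the sample (doc, main, p, title, source) are assumptions made for illustration and may differ from the actual files.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical XML document; element and attribute names are assumptions.
xml_data = """<doc title="Example page" source="https://example.org/a">
  <main>
    <head>Heading</head>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </main>
</doc>"""

# For a file on disk, use ET.parse(path).getroot() instead.
root = ET.fromstring(xml_data)

# Collect metadata attributes and paragraph texts into a plain dictionary.
record = {
    "metadata": dict(root.attrib),
    "paragraphs": [p.text for p in root.iter("p")],
}

# The dictionary can then be serialized or processed like any JSON data.
as_json = json.dumps(record, ensure_ascii=False, indent=2)
```

Only the p elements are collected here. Headings and other elements would need their own handling.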

Corpus analysis tools¶

AntConc is expected to work with TXT files
CorpusExplorer supports CSV, TXT and various XML formats
Corpus Workbench (CWB) uses verticalized texts whose origin can be in TXT or XML format
LancsBox supports various formats, notably TXT & XML
TXM (textometry platform) can take TXT, XML & XML-TEI files as input
Voyant supports various formats, notably TXT, XML & XML-TEI
Wmatrix can work with TXT and XML
WordSmith supports TXT and XML
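To illustrate the verticalized format mentioned for Corpus Workbench, here is a rough sketch that writes one token per line and wraps them in a structural tag. The regex tokenizer is deliberately naive, and the function name verticalize and the text element with its id attribute are hypothetical choices. A real pipeline would typically use a proper tokenizer and add annotation columns such as part-of-speech tags.

```python
import re


def verticalize(text, text_id):
    """Return a one-token-per-line rendering of `text` wrapped in a <text> element."""
    # Naive tokenization: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    lines = [f'<text id="{text_id}">'] + tokens + ["</text>"]
    return "\n".join(lines)


vertical = verticalize("Hello, world!", "doc1")
```

The result has the opening tag on its first line, then one token per line (Hello, a comma, world, an exclamation mark), then the closing tag.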

Further corpus analysis software can be found on corpus-analysis.com.

Generic NLP solutions¶

For natural language processing, see this list of open-source/off-the-shelf NLP tools for German and further lists for other languages.
