Work with the data
Generic solutions in Python
Topic modeling, including word2vec models: Gensim tutorials
XML and XML-TEI
See A Gentle Introduction to XML for background. The Python module xmltodict reads XML files directly and lets you work with the data as nested dictionaries, much as if it were in JSON format.
The textometry platform TXM can read both plain XML and XML-TEI files and supports annotation and exploration of corpus data.
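To make the "XML as JSON-like data" idea concrete, here is a minimal sketch that uses only the Python standard library rather than xmltodict itself. The sample document and the helpers `strip_ns` and `element_to_dict` are hypothetical, written for illustration; the `@` prefix for attributes and the `#text` key loosely follow the conventions xmltodict uses, but the output is not guaranteed to match it.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical TEI-like snippet, invented for illustration.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p n="1">First paragraph.</p>
      <p n="2">Second paragraph.</p>
    </body>
  </text>
</TEI>"""


def strip_ns(tag):
    """Drop the '{namespace}' prefix ElementTree puts on names."""
    return tag.rsplit("}", 1)[-1]


def element_to_dict(elem):
    """Convert an element to nested dicts and lists (hypothetical helper).

    Simplification: text that follows a child element (mixed content)
    is ignored.
    """
    node = {"@" + strip_ns(key): value for key, value in elem.attrib.items()}
    text = (elem.text or "").strip()
    if text:
        node["#text"] = text
    for child in elem:
        key = strip_ns(child.tag)
        value = element_to_dict(child)
        if key in node:
            # Repeated child tags are collected into a list.
            if not isinstance(node[key], list):
                node[key] = [node[key]]
            node[key].append(value)
        else:
            node[key] = value
    return node


root = ET.fromstring(SAMPLE)
data = {strip_ns(root.tag): element_to_dict(root)}

# Navigate the result like parsed JSON.
paragraphs = data["TEI"]["text"]["body"]["p"]
print(json.dumps(paragraphs, indent=2))
```

Because mixed content is dropped, this sketch suits data-like XML better than richly annotated running text, where a dedicated XML or TEI tool is the safer choice.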
NLP
For a first, hands-on approach to NLP pipelines, see TextBlob.
Scattertext is a tool for finding distinguishing terms in corpora and presenting them in an interactive scatter plot.
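As a rough illustration of what "distinguishing terms" means, the sketch below ranks words by a smoothed log-odds ratio between two corpora, using only the Python standard library. This is a swapped-in stand-in and not Scattertext's API or its own scoring method; the two corpora and the helpers `term_counts` and `log_odds` are hypothetical.

```python
import math
from collections import Counter

# Two tiny hypothetical corpora, invented for illustration.
corpus_a = ["the court ruled on the appeal", "the judge heard the appeal"]
corpus_b = ["the team won the match", "the coach praised the team"]


def term_counts(documents):
    """Count lowercase, whitespace-separated tokens across documents."""
    return Counter(token for doc in documents for token in doc.lower().split())


def log_odds(counts_a, counts_b, smoothing=1.0):
    """Smoothed log-odds of each term in corpus A relative to corpus B.

    Positive scores mark terms more typical of A, negative scores
    terms more typical of B.
    """
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + smoothing * len(vocab)
    total_b = sum(counts_b.values()) + smoothing * len(vocab)
    scores = {}
    for term in vocab:
        p_a = (counts_a[term] + smoothing) / total_a
        p_b = (counts_b[term] + smoothing) / total_b
        scores[term] = math.log(p_a / (1 - p_a)) - math.log(p_b / (1 - p_b))
    return scores


scores = log_odds(term_counts(corpus_a), term_counts(corpus_b))
ranked = sorted(scores, key=scores.get, reverse=True)
print("Most typical of corpus A:", ranked[:3])
print("Most typical of corpus B:", ranked[-3:])
```

Words shared evenly by both corpora, such as "the" here, score close to zero, which is why frequency alone is a poor guide to what distinguishes one corpus from another.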
For natural language processing in German, see this list of open-source, off-the-shelf NLP tools; further lists cover other languages.