Work with the data
==================

Generic solutions in Python
---------------------------

- `Part-of-Speech Tagging `_
- `Natural Language Toolkit (NLTK) `_
- `TF-IDF with Scikit-Learn `_
- Topic modeling, including word2vec models: `Gensim tutorials `_
- Load the input into the data analysis library Pandas:

  - `read_csv `_
  - `read_json `_

XML and XML-TEI
---------------

See `A Gentle Introduction to XML `_, or the module `xmltodict `_, which provides a way to read the files directly and work with the data as if it were in JSON format.

The textometry platform `TXM `_ can read both XML and TEI-XML files and supports annotation and exploration of corpus data.

NLP
---

For a hands-on first approach to NLP pipelines, see `TextBlob `_.

`Scattertext `_ is a tool for finding distinguishing terms in corpora and presenting them in an interactive scatter plot.

For natural language processing, see this list of open-source, off-the-shelf `NLP tools for German `_ and `further lists for other languages `_.
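
To make the idea of "distinguishing terms" more concrete, here is a minimal sketch in plain Python that uses only the standard library. It does not use Scattertext's API or its scoring method. The function names ``tokenize`` and ``distinguishing_terms``, the add-one smoothing, and the two tiny example corpora are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def distinguishing_terms(corpus_a, corpus_b, top_n=2):
    """Rank terms by the smoothed log ratio of their relative frequencies.

    Hypothetical helper: positive scores lean towards corpus_a,
    negative scores towards corpus_b.
    Returns (terms typical of corpus_a, terms typical of corpus_b).
    """
    counts_a = Counter(tok for doc in corpus_a for tok in tokenize(doc))
    counts_b = Counter(tok for doc in corpus_b for tok in tokenize(doc))
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())

    scores = {}
    for term in vocab:
        # Add-one smoothing keeps the ratio finite for terms that are
        # missing from one of the two corpora.
        p_a = (counts_a[term] + 1) / (total_a + len(vocab))
        p_b = (counts_b[term] + 1) / (total_b + len(vocab))
        scores[term] = math.log(p_a / p_b)

    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n], ranked[-top_n:][::-1]


# Invented example documents, for illustration only.
legal = ["The court ruled on the appeal.",
         "The judge heard the appeal in court."]
sports = ["The striker scored a late goal.",
          "The goal decided the match."]

typical_legal, typical_sports = distinguishing_terms(legal, sports)
print("legal: ", typical_legal)
print("sports:", typical_sports)
```

Words that both corpora use about equally often, such as "the", end up with a score near zero, while words concentrated in one corpus move to either end of the ranking. For real corpora, the tools listed above provide better tokenization, more robust scoring, and visualisation.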