Work with the data
==================

Generic solutions in Python
---------------------------

- `Part-of-Speech Tagging <https://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/POS-Keywords.html>`_
- `Natural Language Toolkit (NLTK) <https://www.nltk.org/>`_
- `TF-IDF with Scikit-Learn <https://melaniewalsh.github.io/Intro-Cultural-Analytics/Text-Analysis/TF-IDF-Scikit-Learn.html>`_
- Topic modeling, including word2vec models: `Gensim tutorials <https://radimrehurek.com/gensim/auto_examples/>`_
- Load the input into the data analysis library Pandas:

  - `read_csv <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html>`_
  - `read_json <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html>`_
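
As a minimal sketch of the loading step, the following reads a small CSV and a small JSON sample into Pandas. The sample strings and the column names ``id`` and ``text`` are made up for illustration; with real data, a file path or URL would take the place of the ``io.StringIO`` wrapper.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for real files on disk.
csv_text = "id,text\n1,Ein erster Satz.\n2,Ein zweiter Satz.\n"
json_text = (
    '[{"id": 1, "text": "Ein erster Satz."},'
    ' {"id": 2, "text": "Ein zweiter Satz."}]'
)

# read_csv and read_json accept a path, a URL, or a file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text), orient="records")

# Both calls return a DataFrame with one row per document.
print(df_csv.shape)
print(df_json["text"].tolist())
```

Once the data is in a DataFrame, the ``text`` column can be passed on to the other tools listed above.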


XML and XML-TEI
---------------

For background on the format, see `A Gentle Introduction to XML <https://tei-c.org/release/doc/tei-p5-doc/en/html/SG.html>`_. The module `xmltodict <https://github.com/martinblech/xmltodict>`_ reads XML files directly and lets you work with the data as if it were in JSON format.
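
The sketch below shows roughly what working with XML "as if it were JSON" looks like with ``xmltodict``. The XML snippet is a made-up example, not a real TEI file; it assumes the library's default conventions, where attributes are prefixed with ``@``, the text of an element that has attributes is stored under ``#text``, and repeated elements become a list.

```python
import json

import xmltodict

# Hypothetical XML sample, not taken from an actual corpus.
xml_text = """
<text>
  <title lang="en">A sample document</title>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</text>
""".strip()

# Parse the XML string into nested dictionaries and lists.
doc = xmltodict.parse(xml_text)

title = doc["text"]["title"]
paragraphs = doc["text"]["p"]  # repeated <p> elements become a list

print(title["@lang"], title["#text"])
print(paragraphs)

# The parsed structure can be serialized as JSON directly.
print(json.dumps(doc, indent=2))
```

For files on disk, the same call works on the content of a file opened in binary mode.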


The textometry platform `TXM <https://txm.gitpages.huma-num.fr/textometrie/en/>`_ can read both XML and XML-TEI files and supports annotation and exploration of corpus data.


NLP
---

For a hands-on introduction to NLP pipelines, see `TextBlob <https://textblob.readthedocs.io/en/dev/>`_.

`Scattertext <https://github.com/JasonKessler/scattertext>`_ is a tool for finding distinguishing terms in corpora and presenting them in an interactive scatter plot.


For natural language processing, see this list of open-source, off-the-shelf `NLP tools for German <https://github.com/adbar/German-NLP>`_ and `further lists for other languages <https://github.com/adbar/German-NLP#Comparable-lists>`_.