2. Quickstart
2.1. Installation
Install via pip:
$ pip install cltk
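To confirm that the install succeeded, one option is to query the package metadata from Python. This is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of CLTK.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# `installed_version` is an illustrative helper, not a CLTK function.
cltk_version = installed_version("cltk")
if cltk_version is None:
    print("cltk is not installed in this environment")
else:
    print(f"cltk {cltk_version} is installed")
```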
2.2. Use
cltk.nlp.NLP() has pre-configured processing pipelines for a number of languages. Calling cltk.nlp.NLP.analyze() returns a cltk.core.data_types.Doc object, which contains all processed information.
To process text:
>>> from cltk import NLP
>>> vitruvius = "Architecti est scientia pluribus disciplinis et variis eruditionibus ornata, quae ab ceteris artibus perficiuntur. Opera ea nascitur et fabrica et ratiocinatione."
>>> cltk_nlp = NLP(language="lat")
𐤀 CLTK version '1.0.11'.
Pipeline for language 'Latin' (ISO: 'lat'): `LatinNormalizeProcess`, `LatinStanzaProcess`, `LatinEmbeddingsProcess`, `StopsProcess`, `LatinNERProcess`, `LatinLexiconProcess`.
>>> cltk_doc = cltk_nlp.analyze(text=vitruvius)
Some NLP Process steps require downloaded models, which you will be prompted to download. You may then inspect the output Doc, which contains the information produced by each Process step:
>>> cltk_doc.tokens[:5]
['Architecti', 'est', 'scientia', 'pluribus', 'disciplinis']
>>> cltk_doc.lemmata[:5]
['mrchiteo', 'sum', 'scientia', 'multus', 'disciplina']
>>> cltk_doc.morphosyntactic_features[2] # 'scientia'
{Case: [nominative], Degree: [positive], Gender: [feminine], Number: [singular]}
>>> cltk_doc.pos[:5]
['VERB', 'AUX', 'NOUN', 'ADJ', 'NOUN']
>>> cltk_doc.sentences_tokens
[['Architecti', 'est', 'scientia', 'pluribus', 'disciplinis', ...], ...]
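Because the token-level lists are parallel (one entry per token), they can be combined with ordinary Python tools. The sketch below uses values copied verbatim from the output above so that it runs without CLTK or its models; with a real Doc, the three lists would presumably come from cltk_doc.tokens, cltk_doc.lemmata, and cltk_doc.pos.

```python
from collections import Counter

# Values copied from the session output above (not recomputed here).
tokens = ['Architecti', 'est', 'scientia', 'pluribus', 'disciplinis']
lemmata = ['mrchiteo', 'sum', 'scientia', 'multus', 'disciplina']
pos_tags = ['VERB', 'AUX', 'NOUN', 'ADJ', 'NOUN']

# Align the parallel lists into one (token, lemma, POS) row per token.
rows = list(zip(tokens, lemmata, pos_tags))
for token, lemma, pos in rows:
    print(f"{token:<12} {lemma:<12} {pos}")

# Tally how often each part-of-speech tag occurs in this slice.
pos_counts = Counter(pos_tags)
print(pos_counts.most_common(1))  # [('NOUN', 2)]
```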
Most processes add their information to a list of Word objects at Doc.words:
>>> cltk_doc.words[1].string
'est'
>>> cltk_doc.words[1].stop
True
>>> cltk_doc.words[1].lemma
'sum'
>>> cltk_doc.words[4].definition[:200]
'disciplīna\n\n\n ae, \nf\n\n\ndiscipulus, \ninstruction, tuition, teaching, training, education\n: puerilis: adulescentīs in disciplinam ei tradere:\n te in disciplinam meam tradere: in disciplina'
>>> cltk_doc.words[4].pos
'NOUN'
>>> cltk_doc.words[4].category
{F: [neg], N: [pos], V: [neg]}
>>> cltk_doc.words[4].features
{Case: [ablative], Degree: [positive], Gender: [feminine], Number: [plural]}
>>> cltk_doc.words[4].dependency_relation
'obl'
>>> cltk_doc.words[4].governor # this word's "parent"
8
>>> cltk_doc.words[8].string # looking at this word
'ornata'
>>> cltk_doc.words[4].embedding[:5]
array([-0.10924 , -0.048127, 0.15953 , -0.19465 , 0.17935 ], dtype=float32)
>>> cltk_doc.words[2].embedding[:5] # 'scientia'
array([-0.28462 , 0.64238 , -0.40037 , 0.39382 , 0.060418], dtype=float32)
>>> cltk_doc.words[5].index_sentence # sentence to which a token belongs
0
>>> cltk_doc.words[20].index_sentence
1
For more, see Pipelines, Processes, Docs, and Words.
2.3. Tutorials
Demonstration notebooks are available at https://github.com/cltk/cltk/blob/master/notebooks.