6. Pipelines, Processes, Docs, and Words
Tip
See notebook https://github.com/cltk/cltk/blob/master/notebooks/CLTK%20data%20types.ipynb for a detailed walkthrough of CLTK data types.
The CLTK contains five important, native data types:
cltk.core.data_types.Word
: Contains all processed information for each word token. Has attributes including Word.string, Word.lemma, Word.pos, Word.governor, and Word.embedding. A Process adds data to each Word. See notebook https://github.com/cltk/cltk/blob/master/notebooks/CLTK%20Demonstration.ipynb for a full demonstration of what kind of information is stored in Word.
cltk.core.data_types.Sentence
: Contains sentence_embeddings, a weighted average of the word embeddings of the sentence.
cltk.core.data_types.Doc
: Contains Doc.raw, which is the original input string to NLP().analyze(), and Doc.words, which is a list of Word objects. It is the input and output of each Process and the final output of NLP(). See notebook https://github.com/cltk/cltk/blob/master/notebooks/CLTK%20Demonstration.ipynb for a full demonstration of what kind of information is stored in Doc.
cltk.core.data_types.Process
: Takes and returns a Doc. Each process does some processing of information within the Doc, then annotates each Word object at Doc.words.
cltk.core.data_types.Pipeline
: Has a list of Process objects at Pipeline.processes. Predefined pipelines have been made for some languages (see Languages), while custom pipelines may be created for these or other languages. See notebook https://github.com/cltk/cltk/blob/master/notebooks/Make%20custom%20Process%20and%20add%20to%20Pipeline.ipynb for an example of creating a new Process and adding it to a custom Pipeline. For an illustration of how Process objects inherit from one another, see figure Inheritance of Pipeline class.
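The relationship between these types can be sketched with a few toy classes. This is a minimal illustration, not the CLTK implementation: the classes below are simplified stand-ins that only borrow the names Word, Doc, Process, and Pipeline, and WhitespaceTokenizer, LowercaseLemmatizer, the run() method, and the analyze() helper are hypothetical names chosen for this example. It assumes each process is applied in list order and that each one takes a Doc and returns a Doc, as described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Word:
    """Toy stand-in for a word token; only two attributes are modeled."""
    string: str
    lemma: Optional[str] = None


@dataclass
class Doc:
    """Toy stand-in holding the raw input and the list of Word objects."""
    raw: str
    words: List[Word] = field(default_factory=list)


class Process:
    """Toy stand-in: every process takes a Doc and returns a Doc."""

    def run(self, doc: Doc) -> Doc:
        raise NotImplementedError


class WhitespaceTokenizer(Process):
    """Hypothetical process: fills Doc.words by splitting Doc.raw on whitespace."""

    def run(self, doc: Doc) -> Doc:
        doc.words = [Word(string=token) for token in doc.raw.split()]
        return doc


class LowercaseLemmatizer(Process):
    """Hypothetical process: annotates each Word with a naive lowercase 'lemma'."""

    def run(self, doc: Doc) -> Doc:
        for word in doc.words:
            word.lemma = word.string.lower()
        return doc


@dataclass
class Pipeline:
    """Toy stand-in: an ordered list of processes."""
    processes: List[Process]


def analyze(pipeline: Pipeline, text: str) -> Doc:
    """Hypothetical helper: wrap the text in a Doc and pass it through each process in order."""
    doc = Doc(raw=text)
    for process in pipeline.processes:
        doc = process.run(doc)
    return doc


pipeline = Pipeline(processes=[WhitespaceTokenizer(), LowercaseLemmatizer()])
doc = analyze(pipeline, "Gallia est omnis divisa")
print([(word.string, word.lemma) for word in doc.words])
```

The order of the list matters in this sketch: the lemmatizer finds nothing to annotate unless the tokenizer has already populated Doc.words.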