6. Pipelines, Processes, Docs, and Words
Tip
See notebook https://github.com/cltk/cltk/blob/master/notebooks/CLTK%20data%20types.ipynb for a detailed walkthrough of CLTK data types.
The CLTK contains four important, native data types:
cltk.core.data_types.Word
: Contains all processed information for each word token. Its attributes include Word.string, Word.lemma, Word.pos, Word.governor, and Word.embedding. A Process adds data to each Word. See notebook https://github.com/cltk/cltk/blob/master/notebooks/CLTK%20Demonstration.ipynb for a full demonstration of the kinds of information stored in Word.
cltk.core.data_types.Doc
: Contains Doc.raw, which is the original input string to NLP().analyze(), and Doc.words, which is a list of Word objects. A Doc is the input and output of each Process and the final output of NLP().analyze(). See notebook https://github.com/cltk/cltk/blob/master/notebooks/CLTK%20Demonstration.ipynb for a full demonstration of the kinds of information stored in Doc.
cltk.core.data_types.Process
: Takes and returns a Doc. Each process does some processing of the information within the Doc, then annotates each Word object at Doc.words.
cltk.core.data_types.Pipeline
: Has a list of Process objects at Pipeline.processes. Predefined pipelines have been made for some languages (see Languages), while custom pipelines may be created for these or other languages. See notebook https://github.com/cltk/cltk/blob/master/notebooks/Make%20custom%20Process%20and%20add%20to%20Pipeline.ipynb for an example of creating a new Process and adding it to a custom Pipeline.
For an illustration of how language-specific pipelines inherit from Pipeline, see figure Inheritance of Pipeline class.
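The relationship between Word, Doc, and Process described above can be sketched with plain Python classes. This is a minimal illustration and not the CLTK implementation: the classes below are simplified stand-ins that borrow the names Word, Doc, and Process, and the two process subclasses (WhitespaceTokenizeProcess, LowercaseLemmaProcess) are hypothetical. The lowercase "lemma" is a toy placeholder, not real lemmatization.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Word:
    """Simplified stand-in for cltk.core.data_types.Word (assumption)."""
    string: str
    lemma: Optional[str] = None
    pos: Optional[str] = None


@dataclass
class Doc:
    """Simplified stand-in for cltk.core.data_types.Doc (assumption)."""
    raw: str
    words: List[Word] = field(default_factory=list)


class Process:
    """Simplified stand-in: every process takes a Doc and returns a Doc."""

    def run(self, input_doc: Doc) -> Doc:
        raise NotImplementedError


class WhitespaceTokenizeProcess(Process):
    """Hypothetical process: fills Doc.words from Doc.raw."""

    def run(self, input_doc: Doc) -> Doc:
        input_doc.words = [Word(string=tok) for tok in input_doc.raw.split()]
        return input_doc


class LowercaseLemmaProcess(Process):
    """Hypothetical process: annotates each Word with a toy 'lemma'."""

    def run(self, input_doc: Doc) -> Doc:
        for word in input_doc.words:
            word.lemma = word.string.lower()
        return input_doc


# Each process receives the Doc returned by the previous one.
doc = Doc(raw="Gallia est omnis divisa")
for process in (WhitespaceTokenizeProcess(), LowercaseLemmaProcess()):
    doc = process.run(doc)

print([(w.string, w.lemma) for w in doc.words])
```

The point of the sketch is the data flow: Doc.raw stays untouched, while each process adds or fills attributes on the Word objects held in Doc.words.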
![digraph Pipeline {
fontname = "Bitstream Vera Sans"
fontsize = 8
node [
fontname = "Bitstream Vera Sans"
fontsize = 8
shape = "record"
]
edge [
arrowtail = "empty"
]
Pipeline [
label = "{Pipeline|\l| run(): Doc}"
]
LatinPipeline [
label = "{LatinPipeline|\l|processes: [LatinStanzaProcess,\l LatinEmbeddingsProcess,\l StopsProcess,\l LatinNERProcess]}"
]
GreekPipeline [
label = "{GreekPipeline|\l|processes: [GreekStanzaProcess,\l GreekEmbeddingsProcess,\l StopsProcess,\l GreekNERProcess]}"
]
EtcPipeline [
label = "{…|\l|processes: List[Process]}"
]
Pipeline -> LatinPipeline [dir=back]
Pipeline -> GreekPipeline [dir=back]
Pipeline -> EtcPipeline [dir=back]
}](../_images/graphviz-227c2615216fb31fe89b403cbbcf22e80f018946.png)
Inheritance of Pipeline class
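The inheritance shown in the figure can be approximated as follows. This is a sketch under stated assumptions rather than CLTK's own code: Doc, Process, and Pipeline are toy stand-ins, the `applied` field exists only to record the order in which processes ran, the `run()` method is modeled on the `run(): Doc` label in the figure, and every class prefixed with `Toy` is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Type


@dataclass
class Doc:
    """Toy stand-in for Doc; `applied` is a hypothetical bookkeeping field."""
    raw: str
    applied: List[str] = field(default_factory=list)


class Process:
    """Toy stand-in: records its own class name, then returns the Doc."""

    def run(self, input_doc: Doc) -> Doc:
        input_doc.applied.append(type(self).__name__)
        return input_doc


class ToyStanzaProcess(Process):
    pass


class ToyEmbeddingsProcess(Process):
    pass


class ToyStopsProcess(Process):
    pass


@dataclass
class Pipeline:
    """Toy base class: holds an ordered list of Process classes."""
    processes: List[Type[Process]] = field(default_factory=list)

    def run(self, raw: str) -> Doc:
        doc = Doc(raw=raw)
        for process_cls in self.processes:
            doc = process_cls().run(doc)
        return doc


@dataclass
class ToyLatinPipeline(Pipeline):
    """Hypothetical language-specific subclass with a predefined process list."""
    processes: List[Type[Process]] = field(
        default_factory=lambda: [ToyStanzaProcess, ToyEmbeddingsProcess, ToyStopsProcess]
    )


# A predefined pipeline runs its default processes in order.
result = ToyLatinPipeline().run("arma virumque cano")
print(result.applied)

# A custom pipeline supplies its own process list to the base class.
custom = Pipeline(processes=[ToyStopsProcess]).run("arma virumque cano")
print(custom.applied)
```

In this sketch a language-specific pipeline differs from the base class only in its default `processes` list, which mirrors the figure: each subclass box lists its own processes while the behavior comes from Pipeline.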