6. Pipelines, Processes, Docs, and Words
Tip
See notebook https://github.com/cltk/cltk/blob/master/notebooks/CLTK%20data%20types.ipynb for a detailed walkthrough of CLTK data types.
The CLTK contains five important, native data types:
- `cltk.core.data_types.Word`: Contains all processed information for each word token. Its attributes include `Word.string`, `Word.lemma`, `Word.pos`, `Word.governor`, and `Word.embedding`. A `Process` adds data to each `Word`. See notebook https://github.com/cltk/cltk/blob/master/notebooks/CLTK%20Demonstration.ipynb for a full demonstration of the information stored in `Word`.
- `cltk.core.data_types.Sentence`: Contains `sentence_embeddings`, a weighted average of the word embeddings of the sentence.
- `cltk.core.data_types.Doc`: Contains `Doc.raw`, which is the original input string to `NLP().analyze()`, and `Doc.words`, which is a list of `Word` objects. A `Doc` is the input and output of each `Process` and the final output of `NLP()`. See notebook https://github.com/cltk/cltk/blob/master/notebooks/CLTK%20Demonstration.ipynb for a full demonstration of the information stored in `Doc`.
- `cltk.core.data_types.Process`: Takes and returns a `Doc`. Each process does some processing of the information within the `Doc`, then annotates each `Word` object at `Doc.words`.
- `cltk.core.data_types.Pipeline`: Has a list of `Process` objects at `Pipeline.processes`. Predefined pipelines exist for some languages (see Languages); custom pipelines may be created for these or for other languages. See notebook https://github.com/cltk/cltk/blob/master/notebooks/Make%20custom%20Process%20and%20add%20to%20Pipeline.ipynb for an example of creating a new `Process` and adding it to a custom `Pipeline`. For an illustration of how language-specific pipelines inherit from `Pipeline`, see the figure Inheritance of Pipeline class.
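The relationship between `Word`, `Doc`, and `Process` can be sketched with a few stand-in classes. The code below is a minimal illustration and does not use CLTK's actual classes: the dataclasses mirror only a subset of the attributes named above, the `run()` method name on `Process` is an assumption, and `WhitespaceTokenizeProcess` and `LowercaseLemmaProcess` are hypothetical processes invented for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Word:
    """Simplified stand-in for cltk.core.data_types.Word (subset of attributes)."""
    string: str
    lemma: Optional[str] = None
    pos: Optional[str] = None


@dataclass
class Doc:
    """Simplified stand-in for cltk.core.data_types.Doc."""
    raw: str
    words: List[Word] = field(default_factory=list)


class Process:
    """Simplified stand-in: a process takes a Doc and returns a Doc."""

    def run(self, doc: Doc) -> Doc:  # method name is an assumption
        raise NotImplementedError


class WhitespaceTokenizeProcess(Process):
    """Hypothetical process: fills Doc.words by splitting Doc.raw on whitespace."""

    def run(self, doc: Doc) -> Doc:
        doc.words = [Word(string=token) for token in doc.raw.split()]
        return doc


class LowercaseLemmaProcess(Process):
    """Hypothetical process: sets a toy 'lemma' (the lowercased string) on each Word."""

    def run(self, doc: Doc) -> Doc:
        for word in doc.words:
            word.lemma = word.string.lower()
        return doc


# Each process receives the Doc produced by the previous one.
doc = Doc(raw="Gallia est omnis divisa")
for process in (WhitespaceTokenizeProcess(), LowercaseLemmaProcess()):
    doc = process.run(doc)

print([(word.string, word.lemma) for word in doc.words])
```

The point of the sketch is the data flow: the same `Doc` object is passed from process to process, and each process enriches the `Word` objects it finds at `Doc.words` while leaving `Doc.raw` untouched.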
![Class diagram: LatinPipeline (processes: LatinStanzaProcess, LatinEmbeddingsProcess, StopsProcess, LatinNERProcess), GreekPipeline (processes: GreekStanzaProcess, GreekEmbeddingsProcess, StopsProcess, GreekNERProcess), and further pipelines (processes: a list of Process objects) each inherit from Pipeline, which defines run(): Doc](../_images/graphviz-227c2615216fb31fe89b403cbbcf22e80f018946.png)
Inheritance of Pipeline class
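The inheritance shown in the figure can be approximated in plain Python. This is a sketch under stated assumptions, not CLTK's implementation: the class names are taken from the figure, but the process bodies are empty placeholders, the toy `Doc` only records which processes ran, and the way `Pipeline.run()` instantiates and chains the processes is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Type


@dataclass
class Doc:
    """Toy stand-in for a CLTK Doc; `applied` only records which processes ran."""
    raw: str
    applied: List[str] = field(default_factory=list)


class Process:
    """Toy stand-in: a real process would annotate the words of the Doc."""

    def run(self, doc: Doc) -> Doc:
        doc.applied.append(type(self).__name__)
        return doc


# Class names come from the figure; the bodies are placeholders.
class LatinStanzaProcess(Process):
    pass


class LatinEmbeddingsProcess(Process):
    pass


class StopsProcess(Process):
    pass


class LatinNERProcess(Process):
    pass


class Pipeline:
    """Base class: subclasses only declare which processes to run, and in what order."""
    processes: List[Type[Process]] = []

    def run(self, doc: Doc) -> Doc:  # chaining logic is an assumption
        for process_cls in self.processes:
            doc = process_cls().run(doc)
        return doc


class LatinPipeline(Pipeline):
    processes = [
        LatinStanzaProcess,
        LatinEmbeddingsProcess,
        StopsProcess,
        LatinNERProcess,
    ]


result = LatinPipeline().run(Doc(raw="arma virumque cano"))
print(result.applied)
```

Under this design a language-specific pipeline differs from the base class only in its `processes` list, which is why a custom pipeline can be assembled by subclassing and listing the desired processes.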