8.1.5. cltk.dependency package
Init for cltk.dependency.
8.1.5.1. Submodules
8.1.5.2. cltk.dependency.processes module
Process classes for accessing the Stanza project.
- class cltk.dependency.processes.StanzaProcess(language=None)
Bases: Process
A Process type to capture everything that the stanza project can do for a given language.
Note: stanza has only partial functionality available for some languages.
>>> from cltk.core.data_types import Doc
>>> from cltk.dependency.processes import StanzaProcess
>>> from cltk.languages.example_texts import get_example_text
>>> process_stanza = StanzaProcess(language="lat")
>>> isinstance(process_stanza, StanzaProcess)
True
>>> from stanza.models.common.doc import Document
>>> output_doc = process_stanza.run(Doc(raw=get_example_text("lat")))
>>> isinstance(output_doc.stanza_doc, Document)
True
- language: str = None
- algorithm
- static stanza_to_cltk_word_type(stanza_doc)
Take an entire stanza document, extract each word, and encode it in the way expected by the CLTK’s Word type.
>>> from cltk.core.data_types import Doc, Word
>>> from cltk.dependency.processes import StanzaProcess
>>> from cltk.languages.example_texts import get_example_text
>>> process_stanza = StanzaProcess(language="lat")
>>> cltk_words = process_stanza.run(Doc(raw=get_example_text("lat"))).words
>>> isinstance(cltk_words, list)
True
>>> isinstance(cltk_words[0], Word)
True
>>> cltk_words[0]
Word(index_char_start=None, index_char_stop=None, index_token=0, index_sentence=0, string='Gallia', pos=noun, lemma='Gallia', stem=None, scansion=None, xpos='A1|grn1|casA|gen2', upos='NOUN', dependency_relation='nsubj', governor=1, features={Case: [nominative], Gender: [feminine], InflClass: [ind_eur_a], Number: [singular]}, category={F: [neg], N: [pos], V: [neg]}, stop=None, named_entity=None, syllables=None, phonetic_transcription=None, definition=None)
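The conversion described above can be pictured with a minimal sketch, assuming each stanza word is available as a plain dictionary like the ones shown in the stanza_wrapper examples. SimpleWord and to_simple_words are hypothetical stand-ins, not the CLTK's Word type or its conversion code. The shift from stanza's 1-based ids to 0-based indices mirrors the index_token=0 seen in the output above; applying the same shift to head is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SimpleWord:
    """Hypothetical, pared-down stand-in for the CLTK's Word type."""

    index_token: int
    index_sentence: int
    string: str
    lemma: Optional[str] = None
    upos: Optional[str] = None
    xpos: Optional[str] = None
    dependency_relation: Optional[str] = None
    governor: Optional[int] = None


def to_simple_words(sentences: List[List[Dict]]) -> List[SimpleWord]:
    """Flatten parsed sentences (lists of stanza-style word dicts) into words."""
    words = []
    for sentence_index, sentence in enumerate(sentences):
        for word in sentence:
            words.append(
                SimpleWord(
                    index_token=word["id"] - 1,  # stanza ids start at 1
                    index_sentence=sentence_index,
                    string=word["text"],
                    lemma=word.get("lemma"),
                    upos=word.get("upos"),
                    xpos=word.get("xpos"),
                    dependency_relation=word.get("deprel"),
                    # Assumption: head is shifted like id, so a root (head 0) becomes -1.
                    governor=word["head"] - 1 if "head" in word else None,
                )
            )
    return words


parsed = [[{"id": 1, "text": "ὅτι", "lemma": "ὅτι", "upos": "ADV",
            "xpos": "Df", "head": 13, "deprel": "advmod"}]]
first = to_simple_words(parsed)[0]
print(first.index_token, first.string, first.governor)
```

Keeping the sentence index on every word lets a flat word list be regrouped into sentences later.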
- class cltk.dependency.processes.GreekStanzaProcess(language='grc', description='Default process for Stanza for the Ancient Greek language.')
Bases: StanzaProcess
Stanza processor for Ancient Greek.
- language: str = 'grc'
- description: str = 'Default process for Stanza for the Ancient Greek language.'
- class cltk.dependency.processes.LatinStanzaProcess(language='lat', description='Default process for Stanza for the Latin language.')
Bases: StanzaProcess
Stanza processor for Latin.
- language: str = 'lat'
- description: str = 'Default process for Stanza for the Latin language.'
- class cltk.dependency.processes.OCSStanzaProcess(language='chu', description='Default process for Stanza for the Old Church Slavonic language.')
Bases: StanzaProcess
Stanza processor for Old Church Slavonic.
- language: str = 'chu'
- description: str = 'Default process for Stanza for the Old Church Slavonic language.'
- class cltk.dependency.processes.OldFrenchStanzaProcess(language='fro', description='Default process for Stanza for the Old French language.')
Bases: StanzaProcess
Stanza processor for Old French.
- language: str = 'fro'
- description: str = 'Default process for Stanza for the Old French language.'
- class cltk.dependency.processes.GothicStanzaProcess(language='got', description='Default process for Stanza for the Gothic language.')
Bases: StanzaProcess
Stanza processor for Gothic.
- language: str = 'got'
- description: str = 'Default process for Stanza for the Gothic language.'
- class cltk.dependency.processes.CopticStanzaProcess(language='cop', description='Default process for Stanza for the Coptic language.')
Bases: StanzaProcess
Stanza processor for Coptic.
- language: str = 'cop'
- description: str = 'Default process for Stanza for the Coptic language.'
- class cltk.dependency.processes.ChineseStanzaProcess(language='lzh', description='Default process for Stanza for the Classical Chinese language.')
Bases: StanzaProcess
Stanza processor for Classical Chinese.
- language: str = 'lzh'
- description: str = 'Default process for Stanza for the Classical Chinese language.'
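The language-specific classes above differ only in their default language and description. A sketch of that pattern using dataclasses follows; BaseProcess, LatinProcess, GreekProcess and REGISTRY are hypothetical names used for illustration and are not the CLTK's classes.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Type


@dataclass
class BaseProcess:
    """Hypothetical base class holding the settings every process shares."""

    language: Optional[str] = None
    description: Optional[str] = None


@dataclass
class LatinProcess(BaseProcess):
    # Subclasses only override the defaults.
    language: str = "lat"
    description: str = "Default process for the Latin language."


@dataclass
class GreekProcess(BaseProcess):
    language: str = "grc"
    description: str = "Default process for the Ancient Greek language."


# Look up a process class by its default language code.
REGISTRY: Dict[str, Type[BaseProcess]] = {
    cls.language: cls for cls in (LatinProcess, GreekProcess)
}

process = REGISTRY["grc"]()
print(process.language, "-", process.description)
```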
- class cltk.dependency.processes.TreeBuilderProcess(language=None)
Bases: Process
A Process that takes a doc containing sentences of CLTK words and returns a dependency tree for each sentence.
TODO: JS help to make this work, illustrate better.
>>> from cltk import NLP
>>> nlp = NLP(language="got", suppress_banner=True)
>>> from cltk.dependency.processes import TreeBuilderProcess
>>> nlp.pipeline.add_process(TreeBuilderProcess)
>>> from cltk.languages.example_texts import get_example_text
>>> doc = nlp.analyze(text=get_example_text("got"))
>>> len(doc.trees)
4
8.1.5.3. cltk.dependency.stanza_wrapper module
Wrapper for the Python Stanza package. About: https://github.com/stanfordnlp/stanza.
- class cltk.dependency.stanza_wrapper.StanzaWrapper(language, treebank=None, stanza_debug_level='ERROR', interactive=True, silent=False)
Bases: object
CLTK’s wrapper for the Stanza project.
- nlps = {}
- parse(text)
Run all available stanza parsing on input text.
>>> from cltk.dependency.stanza_wrapper import StanzaWrapper
>>> from cltk.languages.example_texts import get_example_text
>>> stanza_wrapper = StanzaWrapper(language='grc', stanza_debug_level="INFO", interactive=False, silent=True)
>>> greek_nlp = stanza_wrapper.parse(get_example_text("grc"))
>>> from stanza.models.common.doc import Document, Token
>>> isinstance(greek_nlp, Document)
True
>>> nlp_greek_first_sent = greek_nlp.sentences[0]
>>> isinstance(nlp_greek_first_sent.tokens[0], Token)
True
>>> nlp_greek_first_sent.tokens[0].text
'ὅτι'
>>> nlp_greek_first_sent.tokens[0].words
[{
  "id": 1,
  "text": "ὅτι",
  "lemma": "ὅτι",
  "upos": "ADV",
  "xpos": "Df",
  "head": 13,
  "deprel": "advmod",
  "start_char": 0,
  "end_char": 3
}]
>>> nlp_greek_first_sent.tokens[0].start_char
0
>>> nlp_greek_first_sent.tokens[0].end_char
3
>>> nlp_greek_first_sent.tokens[0].misc
>>> nlp_greek_first_sent.tokens[0].pretty_print()
'<Token id=1;words=[<Word id=1;text=ὅτι;lemma=ὅτι;upos=ADV;xpos=Df;head=13;deprel=advmod>]>'
>>> nlp_greek_first_sent.tokens[0].to_dict()
[{'id': 1, 'text': 'ὅτι', 'lemma': 'ὅτι', 'upos': 'ADV', 'xpos': 'Df', 'head': 13, 'deprel': 'advmod', 'start_char': 0, 'end_char': 3}]
>>> first_word = nlp_greek_first_sent.tokens[0].words[0]
>>> first_word.id
1
>>> first_word.text
'ὅτι'
>>> first_word.lemma
'ὅτι'
>>> first_word.upos
'ADV'
>>> first_word.xpos
'Df'
>>> first_word.feats
>>> first_word.head
13
>>> first_word.parent
[
  {
    "id": 1,
    "text": "ὅτι",
    "lemma": "ὅτι",
    "upos": "ADV",
    "xpos": "Df",
    "head": 13,
    "deprel": "advmod",
    "start_char": 0,
    "end_char": 3
  }
]
>>> first_word.misc
>>> first_word.deprel
'advmod'
>>> first_word.pos
'ADV'
- _load_pipeline()
Instantiate stanza.Pipeline().
TODO: Make sure that logging captures what it should from the default stanza printout.
TODO: Make note that full lemmatization is not possible for Old French.
>>> import stanza
>>> from cltk.dependency.stanza_wrapper import StanzaWrapper
>>> stanza_wrapper = StanzaWrapper(language='grc', stanza_debug_level="INFO", interactive=False, silent=True)
>>> with suppress_stdout():
...     nlp_obj = stanza_wrapper._load_pipeline()
>>> isinstance(nlp_obj, stanza.pipeline.core.Pipeline)
True
>>> stanza_wrapper = StanzaWrapper(language='fro', stanza_debug_level="INFO", interactive=False, silent=True)
>>> with suppress_stdout():
...     nlp_obj = stanza_wrapper._load_pipeline()
>>> isinstance(nlp_obj, stanza.pipeline.core.Pipeline)
True
- _is_model_present()
Checks if the model is already downloaded.
>>> from cltk.dependency.stanza_wrapper import StanzaWrapper
>>> stanza_wrapper = StanzaWrapper(language='grc', stanza_debug_level="INFO", interactive=False, silent=True)
>>> stanza_wrapper._is_model_present()
True
- Return type: bool
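One way to check for a downloaded model is to test whether its file exists on disk. The sketch below assumes a layout of <resources>/<language code>/tokenize/<treebank>.pt; that layout, the function name is_model_present, and its parameters are assumptions for illustration rather than the wrapper's actual logic.

```python
import tempfile
from pathlib import Path


def is_model_present(resources_dir: Path, stanza_code: str, treebank: str) -> bool:
    """Return True if the (assumed) tokenizer model file exists."""
    model_path = resources_dir / stanza_code / "tokenize" / f"{treebank}.pt"
    return model_path.is_file()


# Demonstrate with a throwaway directory instead of a real model download.
with tempfile.TemporaryDirectory() as tmp:
    resources = Path(tmp)
    before = is_model_present(resources, "grc", "proiel")
    model_file = resources / "grc" / "tokenize" / "proiel.pt"
    model_file.parent.mkdir(parents=True)
    model_file.touch()
    after = is_model_present(resources, "grc", "proiel")

print(before, after)
```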
- _get_default_treebank()
Return description of a language’s default treebank if none supplied.
>>> from cltk.dependency.stanza_wrapper import StanzaWrapper
>>> stanza_wrapper = StanzaWrapper(language='grc', stanza_debug_level="INFO", interactive=False, silent=True)
>>> stanza_wrapper._get_default_treebank()
'proiel'
- Return type: str
- _is_valid_treebank()
Check whether the optional treebank value is valid for the chosen language.
>>> from cltk.dependency.stanza_wrapper import StanzaWrapper
>>> stanza_wrapper = StanzaWrapper(language='grc', treebank='proiel', stanza_debug_level="INFO", interactive=False, silent=True)
>>> stanza_wrapper._is_valid_treebank()
True
- Return type: bool
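The default-treebank lookup and the validity check can both be driven by one table. In this sketch only the proiel default for grc comes from the examples above; the table TREEBANKS, the extra perseus entry, and all three function names are illustrative assumptions.

```python
from typing import Dict, List, Optional

# Assumption: the first treebank listed for a language is its default.
TREEBANKS: Dict[str, List[str]] = {"grc": ["proiel", "perseus"]}


def get_default_treebank(language: str) -> str:
    """Return the first (default) treebank listed for the language."""
    return TREEBANKS[language][0]


def is_valid_treebank(language: str, treebank: str) -> bool:
    """Return True if the treebank is listed for the language."""
    return treebank in TREEBANKS.get(language, [])


def resolve_treebank(language: str, treebank: Optional[str] = None) -> str:
    """Fall back to the default when no treebank is given; reject unknown ones."""
    if treebank is None:
        return get_default_treebank(language)
    if not is_valid_treebank(language, treebank):
        raise ValueError(f"Unknown treebank {treebank!r} for language {language!r}")
    return treebank


print(resolve_treebank("grc"))
print(is_valid_treebank("grc", "no-such-treebank"))
```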
- is_wrapper_available()
Maps an ISO 639-3 language id (e.g., lat for Latin) to that used by stanza (la); confirms that this is a language the CLTK supports (i.e., is it pre-modern or not).
>>> from cltk.dependency.stanza_wrapper import StanzaWrapper
>>> stanza_wrapper = StanzaWrapper(language='grc', stanza_debug_level="INFO", interactive=False, silent=True)
>>> stanza_wrapper.is_wrapper_available()
True
- Return type: bool
- _get_stanza_code()
For a known-supported language, use the CLTK’s internal code to look up the code used by Stanza.
>>> from cltk.dependency.stanza_wrapper import StanzaWrapper
>>> stanza_wrapper = StanzaWrapper(language='grc', stanza_debug_level="INFO", interactive=False, silent=True)
>>> stanza_wrapper._get_stanza_code()
'grc'
>>> stanza_wrapper.language = "xxx"
>>> stanza_wrapper._get_stanza_code()
Traceback (most recent call last):
  ...
KeyError: 'Somehow ``StanzaWrapper.language`` got renamed to something invalid. This should never happen.'
- Return type: str
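The lookup described above amounts to a dictionary from ISO 639-3 ids to Stanza codes. The sketch below includes only the two pairs that appear in this section (lat to la, grc to grc); the names CLTK_TO_STANZA, is_wrapper_available and get_stanza_code are stand-ins, and the error message is a simplified placeholder.

```python
from typing import Dict

# Only the pairs shown in this section; a real table would cover more languages.
CLTK_TO_STANZA: Dict[str, str] = {"lat": "la", "grc": "grc"}


def is_wrapper_available(language: str) -> bool:
    """Return True if the ISO 639-3 id has a known Stanza code."""
    return language in CLTK_TO_STANZA


def get_stanza_code(language: str) -> str:
    """Translate an ISO 639-3 id into the code Stanza expects."""
    try:
        return CLTK_TO_STANZA[language]
    except KeyError:
        raise KeyError(f"No Stanza code known for language '{language}'.") from None


print(get_stanza_code("lat"))
print(is_wrapper_available("xxx"))
```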
8.1.5.4. cltk.dependency.tree module
A data structure for representing dependency tree graphs.
- class cltk.dependency.tree.Form(form, form_id=0)
Bases: Element
For the word (i.e., node) of a dependency tree and its attributes. Inherits from the Element class of Python’s xml.etree library.
>>> from cltk.dependency.tree import Form
>>> desc_form = Form('described')
>>> desc_form
described_0
>>> desc_form.set('Tense', 'Past')
>>> desc_form
described_0
>>> desc_form / 'VBN'
described_0/VBN
>>> desc_form.full_str()
'described_0 [Tense=Past,pos=VBN]'
- get_dependencies(relation)
Extract dependents of this form for the specified dependency relation.
>>> from cltk.dependency.tree import Form
>>> john = Form('John', 1) / 'NNP'
>>> loves = Form('loves', 2) / 'VRB'
>>> mary = Form('Mary', 3) / 'NNP'
>>> loves >> john | 'subj'
subj(loves_2/VRB, John_1/NNP)
>>> loves >> mary | 'obj'
obj(loves_2/VRB, Mary_3/NNP)
>>> loves.get_dependencies('subj')
[subj(loves_2/VRB, John_1/NNP)]
>>> loves.get_dependencies('obj')
[obj(loves_2/VRB, Mary_3/NNP)]
- Return type: List[Dependency]
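The operator syntax in the examples above (/ to tag a part of speech, >> to link a head to a dependent, | to name the relation) can be built from Python's operator hooks on top of xml.etree.ElementTree.Element. The following is a simplified sketch under that assumption, with hypothetical class names MiniForm and MiniDependency; it is not the CLTK's source.

```python
from typing import List, Optional
from xml.etree.ElementTree import Element


class MiniDependency:
    """Hypothetical edge between a head form and a dependent form."""

    def __init__(self, head: "MiniForm", dep: "MiniForm", relation: Optional[str] = None):
        self.head, self.dep, self.relation = head, dep, relation

    def __or__(self, relation: str) -> "MiniDependency":
        # `dependency | 'subj'` names the relation and records it on the dependent.
        self.relation = relation
        self.dep.set("relation", relation)
        return self

    def __repr__(self) -> str:
        return f"{self.relation or ''}({self.head!r}, {self.dep!r})"


class MiniForm(Element):
    """Hypothetical tree node: the tag is the word, attributes are its features."""

    def __init__(self, form: str, form_id: int = 0) -> None:
        super().__init__(form, {"form_id": str(form_id)})

    def __truediv__(self, pos_tag: str) -> "MiniForm":
        # `form / 'NNP'` stores the part-of-speech tag.
        self.set("pos", pos_tag)
        return self

    def __rshift__(self, other: "MiniForm") -> MiniDependency:
        # `head >> dep` attaches dep as a child and returns an untyped edge.
        self.append(other)
        return MiniDependency(self, other)

    def get_dependencies(self, relation: str) -> List[MiniDependency]:
        return [MiniDependency(self, child, relation)
                for child in self if child.get("relation") == relation]

    def __repr__(self) -> str:
        pos = self.get("pos")
        return f"{self.tag}_{self.get('form_id')}" + (f"/{pos}" if pos else "")


john = MiniForm("John", 1) / "NNP"
loves = MiniForm("loves", 2) / "VRB"
mary = MiniForm("Mary", 3) / "NNP"
loves >> john | "subj"  # >> binds tighter than |, so the edge is built first
loves >> mary | "obj"
print(loves.get_dependencies("subj"))
```

Because >> has higher precedence than |, the expression reads left to right without parentheses: the edge is created first and then labeled.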
- full_str(include_relation=True)
Returns a string containing all features of the Form. The ID is attached to the text, and the relation is optionally suppressed.
>>> from cltk.dependency.tree import Form
>>> loves = Form('loves', 2) / 'VRB'
>>> loves.full_str()
'loves_2 [pos=VRB]'
>>> john = Form('John', 1) / 'NNP'
>>> loves >> john | 'subj'
subj(loves_2/VRB, John_1/NNP)
>>> john.full_str(True)
'John_1 [pos=NNP,relation=subj]'
- Return type: str
- static to_form(word)
Converts a CLTK Word object to a Form.
TODO: The Form info that prints is incomplete/ugly; correct str repr of Form
TODO: Fix these doctests; it’s ugly to import so many Forms, but is this required?
>>> from cltk.core.data_types import Word
>>> from cltk.dependency.tree import Form
>>> from cltk.morphology.universal_dependencies_features import Case, Gender, Number, POS
>>> noun = POS.noun
>>> nominative = Case.nominative
>>> feminine = Gender.feminine
>>> singular = Number.singular
>>> cltk_word = Word(index_char_start=None, index_char_stop=None, index_token=0, index_sentence=0, string='Gallia', pos=noun, lemma='Gallia', stem=None, scansion=None, xpos='A1|grn1|casA|gen2', upos='NOUN', dependency_relation='nsubj', governor=1, features={Case: [nominative], Gender: [feminine], Number: [singular]}, category={F: [neg], N: [pos], V: [neg]}, stop=False, named_entity='LOCATION', syllables=None, phonetic_transcription=None, definition='')
>>> cltk_word.features[Case] = Case.nominative
>>> cltk_word.features[Gender] = Gender.feminine
>>> cltk_word.features[Number] = Number.singular
>>> f = Form.to_form(cltk_word)
>>> f.full_str()
'Gallia_0 [lemma=mallis,pos=NOUN,upos=NOUN,xpos=A1|grn1|casA|gen2,Case=nominative,Gender=feminine,Number=singular]'
- Return type: Form
- class cltk.dependency.tree.Dependency(head, dep, relation=None)
Bases: object
The asymmetric binary relationship (or edge) between a governing Form (the “head”) and a subordinate Form (the “dependent”).
In principle the relationship could capture any form-to-form relation that the system deems of interest, be it syntactic, semantic, or discursive.
If the relation attribute is not specified, then the dependency simply states that there’s some asymmetric relationship between the head and the dependent. This is an untyped dependency.
For a typed dependency, a string value is supplied for the relation attribute.
- class cltk.dependency.tree.DependencyTree(root)
Bases: ElementTree
The hierarchical tree representing the entirety of a parse.
- get_dependencies()
Returns a list of all the dependency relations in the tree, generated by depth-first search.
>>> from cltk.core.data_types import Doc
>>> from cltk.dependency.tree import DependencyTree
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.dependency.processes import StanzaProcess
>>> process_stanza = StanzaProcess(language="lat")
>>> output_doc = process_stanza.run(Doc(raw=get_example_text("lat")))
>>> a_sentence = output_doc.sentences[0]
>>> t = DependencyTree.to_tree(a_sentence)
>>> len(t.get_dependencies())
34
- Return type: List[Dependency]
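A depth-first collection of edges can be sketched with plain xml.etree elements, assuming each dependent is stored as a child of its head and carries a relation attribute. The helper collect_dependencies, the tuple representation, and the toy parse (whose relation labels are only illustrative) are assumptions, not the DependencyTree implementation.

```python
from typing import List, Tuple
from xml.etree.ElementTree import Element, SubElement

Edge = Tuple[str, str, str]  # (head, dependent, relation)


def collect_dependencies(node: Element) -> List[Edge]:
    """Visit each child, record its edge, then descend before moving to siblings."""
    edges: List[Edge] = []
    for child in node:
        edges.append((node.tag, child.tag, child.get("relation", "")))
        edges.extend(collect_dependencies(child))
    return edges


# Hand-made toy parse; not the output of a real parser.
root = Element("divisa")
gallia = SubElement(root, "Gallia", relation="nsubj")
SubElement(gallia, "omnis", relation="amod")
SubElement(root, "est", relation="cop")

for edge in collect_dependencies(root):
    print(edge)
```

The edge from Gallia to omnis is emitted before the edge from divisa to est, which is what distinguishes depth-first from breadth-first order here.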
- print_tree(all_features=False)
Prints a pretty-printed (indented) representation of the dependency tree. If all_features is True, then each node is printed with its complete feature bundles.
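An indented rendering of this kind can be sketched as a recursive walk that prefixes each node with one indent per level. The function format_tree, the "relation | word" label layout, and the four-space indent are assumptions for illustration; the actual output format of print_tree may differ.

```python
from typing import List
from xml.etree.ElementTree import Element, SubElement


def format_tree(node: Element, all_features: bool = False, depth: int = 0) -> List[str]:
    """Return one line per node, indented by its depth in the tree."""
    label = node.tag
    relation = node.get("relation")
    if relation:
        label = f"{relation} | {label}"
    if all_features:
        # Show every attribute except the relation, which is already in the label.
        features = ",".join(f"{k}={v}" for k, v in node.attrib.items() if k != "relation")
        if features:
            label += f" [{features}]"
    lines = ["    " * depth + label]
    for child in node:
        lines.extend(format_tree(child, all_features, depth + 1))
    return lines


# Hand-made toy tree.
root = Element("divisa", pos="adjective")
SubElement(root, "Gallia", relation="nsubj", pos="noun")

print("\n".join(format_tree(root)))
print("\n".join(format_tree(root, all_features=True)))
```

Returning the lines instead of printing them directly keeps the formatting logic easy to test.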
- static to_tree(sentence)
Factory method to create trees from sentence parses, i.e. lists of words.
>>> from cltk.core.data_types import Doc
>>> from cltk.dependency.tree import DependencyTree
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.dependency.processes import StanzaProcess
>>> process_stanza = StanzaProcess(language="lat")
>>> output_doc = process_stanza.run(Doc(raw=get_example_text("lat")))
>>> a_sentence = output_doc.sentences[0]
>>> t = DependencyTree.to_tree(a_sentence)
>>> t.findall(".")
[divisa_3/adjective]
- Return type: DependencyTree
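Building a tree from a flat list of parsed words comes down to attaching each word to the node of its governor. The sketch below assumes each word records its own index and its governor's index, with -1 marking the root; ParsedWord, build_tree, that root convention, and the toy values are illustrative assumptions rather than the CLTK's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional
from xml.etree.ElementTree import Element


@dataclass
class ParsedWord:
    """Hypothetical minimal word record."""

    index_token: int
    string: str
    governor: int  # index of the head word; -1 marks the root (assumption)
    dependency_relation: str


def build_tree(words: List[ParsedWord]) -> Element:
    """Create one node per word, then hang each node under its governor."""
    nodes: Dict[int, Element] = {
        w.index_token: Element(
            w.string,
            {"form_id": str(w.index_token), "relation": w.dependency_relation},
        )
        for w in words
    }
    root: Optional[Element] = None
    for w in words:
        if w.governor == -1:
            root = nodes[w.index_token]
        else:
            nodes[w.governor].append(nodes[w.index_token])
    if root is None:
        raise ValueError("no root word (governor == -1) found")
    return root


# Toy values, not real parser output.
sentence = [
    ParsedWord(0, "Gallia", 3, "nsubj"),
    ParsedWord(1, "est", 3, "cop"),
    ParsedWord(2, "omnis", 0, "amod"),
    ParsedWord(3, "divisa", -1, "root"),
]
tree = build_tree(sentence)
print(tree.tag, [child.tag for child in tree])
```

Creating all nodes before linking them means the words can appear in any order relative to their governors.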
8.1.5.5. cltk.dependency.utils module
Misc helper functions for extracting dependency info from CLTK data structures.