8.1.5. cltk.dependency package¶
Init for cltk.dependency.
8.1.5.1. Submodules¶
8.1.5.2. cltk.dependency.processes module¶
Process classes for accessing the Stanza project.
- class cltk.dependency.processes.StanzaProcess(language=None)[source]¶
Bases: Process

A Process type to capture everything that the stanza project can do for a given language.

Note

stanza has only partial functionality available for some languages.

>>> from cltk.languages.example_texts import get_example_text
>>> process_stanza = StanzaProcess(language="lat")
>>> isinstance(process_stanza, StanzaProcess)
True
>>> from stanza.models.common.doc import Document
>>> output_doc = process_stanza.run(Doc(raw=get_example_text("lat")))
>>> isinstance(output_doc.stanza_doc, Document)
True
- language: str = None¶
- algorithm¶
- static stanza_to_cltk_word_type(stanza_doc)[source]¶
Take an entire stanza document, extract each word, and encode it in the way expected by the CLTK's Word type.

>>> from cltk.dependency.processes import StanzaProcess
>>> from cltk.languages.example_texts import get_example_text
>>> process_stanza = StanzaProcess(language="lat")
>>> cltk_words = process_stanza.run(Doc(raw=get_example_text("lat"))).words
>>> isinstance(cltk_words, list)
True
>>> isinstance(cltk_words[0], Word)
True
>>> cltk_words[0]
Word(index_char_start=None, index_char_stop=None, index_token=0, index_sentence=0, string='Gallia', pos=noun, lemma='Gallia', stem=None, scansion=None, xpos='A1|grn1|casA|gen2', upos='NOUN', dependency_relation='nsubj', governor=1, features={Case: [nominative], Gender: [feminine], InflClass: [ind_eur_a], Number: [singular]}, category={F: [neg], N: [pos], V: [neg]}, stop=None, named_entity=None, syllables=None, phonetic_transcription=None, definition=None)
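The conversion this method performs can be sketched independently of CLTK. The following is a minimal illustration, not CLTK's actual implementation: `SimpleWord` and `token_to_word` are hypothetical names, and the token dict mimics the shape Stanza's `Token.to_dict()` produces. Note the shift from Stanza's 1-based token ids to a 0-based `index_token`:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SimpleWord:
    """A stripped-down, hypothetical stand-in for CLTK's Word type."""
    index_token: int
    string: str
    lemma: Optional[str] = None
    upos: Optional[str] = None
    xpos: Optional[str] = None
    dependency_relation: Optional[str] = None
    governor: Optional[int] = None


def token_to_word(token: dict) -> SimpleWord:
    """Encode one Stanza-style token dict as a SimpleWord.

    Stanza numbers tokens from 1, so shift to 0-based indexing;
    a head of 0 (the sentence root) becomes governor=None.
    """
    head = token.get("head")
    return SimpleWord(
        index_token=token["id"] - 1,
        string=token["text"],
        lemma=token.get("lemma"),
        upos=token.get("upos"),
        xpos=token.get("xpos"),
        dependency_relation=token.get("deprel"),
        governor=None if head in (0, None) else head - 1,
    )


token = {"id": 1, "text": "Gallia", "lemma": "Gallia", "upos": "NOUN",
         "xpos": "A1|grn1|casA|gen2", "head": 4, "deprel": "nsubj"}
word = token_to_word(token)
```

The real method walks every sentence of the document and carries over morphological features as well; this sketch shows only the index bookkeeping.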
- class cltk.dependency.processes.GreekStanzaProcess(language='grc', description='Default process for Stanza for the Ancient Greek language.', authorship_info='``LatinSpacyProcess`` using Stanza model by Stanford University from https://stanfordnlp.github.io/stanza/ . Please cite: https://arxiv.org/abs/2003.07082')[source]¶
Bases: StanzaProcess

Stanza processor for Ancient Greek.
- language: str = 'grc'¶
- description: str = 'Default process for Stanza for the Ancient Greek language.'¶
- authorship_info: str = '``LatinSpacyProcess`` using Stanza model by Stanford University from https://stanfordnlp.github.io/stanza/ . Please cite: https://arxiv.org/abs/2003.07082'¶
- class cltk.dependency.processes.LatinStanzaProcess(language='lat', description='Default process for Stanza for the Latin language.')[source]¶
Bases: StanzaProcess

Stanza processor for Latin.
- language: str = 'lat'¶
- description: str = 'Default process for Stanza for the Latin language.'¶
- class cltk.dependency.processes.OCSStanzaProcess(language='chu', description='Default process for Stanza for the Old Church Slavonic language.')[source]¶
Bases: StanzaProcess

Stanza processor for Old Church Slavonic.
- language: str = 'chu'¶
- description: str = 'Default process for Stanza for the Old Church Slavonic language.'¶
- class cltk.dependency.processes.OldFrenchStanzaProcess(language='fro', description='Default process for Stanza for the Old French language.')[source]¶
Bases: StanzaProcess

Stanza processor for Old French.
- language: str = 'fro'¶
- description: str = 'Default process for Stanza for the Old French language.'¶
- class cltk.dependency.processes.GothicStanzaProcess(language='got', description='Default process for Stanza for the Gothic language.')[source]¶
Bases: StanzaProcess

Stanza processor for Gothic.
- language: str = 'got'¶
- description: str = 'Default process for Stanza for the Gothic language.'¶
- class cltk.dependency.processes.CopticStanzaProcess(language='cop', description='Default process for Stanza for the Coptic language.')[source]¶
Bases: StanzaProcess

Stanza processor for Coptic.
- language: str = 'cop'¶
- description: str = 'Default process for Stanza for the Coptic language.'¶
- class cltk.dependency.processes.ChineseStanzaProcess(language='lzh', description='Default process for Stanza for the Classical Chinese language.')[source]¶
Bases: StanzaProcess

Stanza processor for Classical Chinese.
- language: str = 'lzh'¶
- description: str = 'Default process for Stanza for the Classical Chinese language.'¶
- class cltk.dependency.processes.TreeBuilderProcess(language=None)[source]¶
Bases: Process

A Process that takes a doc containing sentences of CLTK words and returns a dependency tree for each sentence.

TODO: JS help to make this work, illustrate better.

>>> from cltk import NLP
>>> nlp = NLP(language="got", suppress_banner=True)
>>> from cltk.dependency.processes import TreeBuilderProcess
>>> nlp.pipeline.add_process(TreeBuilderProcess)
>>> from cltk.languages.example_texts import get_example_text
>>> doc = nlp.analyze(text=get_example_text("got"))
>>> len(doc.trees)
4
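The core step a tree builder has to perform can be sketched in plain Python: group one sentence's words by their governor index and locate the root. This is illustrative only; `build_tree` is a hypothetical helper, not the CLTK API:

```python
def build_tree(governors):
    """Build a {head index: [dependent indices]} adjacency map from a
    list of per-word governor indices (None marks the sentence root).

    Returns (root_index, adjacency_map), the minimal information a
    dependency tree needs.
    """
    children = {i: [] for i in range(len(governors))}
    root = None
    for i, gov in enumerate(governors):
        if gov is None:
            root = i
        else:
            children[gov].append(i)
    return root, children


# Toy sentence "Gallia est divisa": both "Gallia" and "est"
# hang off the root "divisa" (index 2).
root, children = build_tree([2, 2, None])
```

A real implementation would also validate that exactly one root exists and that no governor index points outside the sentence.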
- class cltk.dependency.processes.SpacyProcess(language=None)[source]¶
Bases: Process

A Process type to capture everything that the spaCy project can do for a given language.

Note

spaCy has only partial functionality available for some languages.

>>> from cltk.languages.example_texts import get_example_text
>>> process_spacy = SpacyProcess(language="lat")
>>> isinstance(process_spacy, SpacyProcess)
True

# >>> from spacy.models.common.doc import Document
# >>> output_doc = process_spacy.run(Doc(raw=get_example_text("lat")))
# >>> isinstance(output_doc.spacy_doc, Document)
# True
- algorithm¶
- static spacy_to_cltk_word_type(spacy_doc)[source]¶
Take an entire spaCy document, extract each word, and encode it in the way expected by the CLTK's Word type.

This works only if sentence boundaries have been set by the loaded model.

See the note in the code about starting the word token index at 1.

>>> from cltk.dependency.processes import SpacyProcess
>>> from cltk.languages.example_texts import get_example_text
>>> process_spacy = SpacyProcess(language="lat")
>>> cltk_words = process_spacy.run(Doc(raw=get_example_text("lat"))).words
>>> isinstance(cltk_words, list)
True
>>> isinstance(cltk_words[0], Word)
True
>>> cltk_words[0]
Word(index_char_start=0, index_char_stop=6, index_token=0, index_sentence=0, string='Gallia', pos=None, lemma='Gallia', stem=None, scansion=None, xpos='proper_noun', upos='PROPN', dependency_relation='nsubj', governor=None, features={}, category={}, stop=False, named_entity=None, syllables=None, phonetic_transcription=None, definition=None)
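Unlike the Stanza path, spaCy tokens carry character offsets, which is how `index_char_start` and `index_char_stop` get populated in the doctest above. A hypothetical sketch of that bookkeeping; the `(text, idx, sent_index)` triples stand in for attributes of spaCy's `Token`, and `spacy_like_to_records` is not a real CLTK function:

```python
def spacy_like_to_records(tokens):
    """Turn (text, idx, sent_index) triples, mimicking a parsed spaCy
    doc, into flat word records with character spans.

    This is why the wrapper needs sentence boundaries set by the model:
    each record carries the index of the sentence it belongs to.
    """
    records = []
    for token_index, (text, idx, sent_index) in enumerate(tokens):
        records.append({
            "index_token": token_index,
            "index_sentence": sent_index,
            "index_char_start": idx,           # offset into the raw text
            "index_char_stop": idx + len(text),  # exclusive end offset
            "string": text,
        })
    return records


# "Gallia est ..." -- "Gallia" spans chars 0-6, "est" starts at 7
records = spacy_like_to_records([("Gallia", 0, 0), ("est", 7, 0)])
```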
- class cltk.dependency.processes.LatinSpacyProcess(language='lat', description="Process for Spacy for Patrick Burn's Latin model.", authorship_info='``LatinSpacyProcess`` using LatinCy model by Patrick Burns from https://arxiv.org/abs/2305.04365 . Please cite: https://arxiv.org/abs/2305.04365')[source]¶
Bases: SpacyProcess

Run a spaCy model.

https://huggingface.co/latincy
- language: Literal['lat'] = 'lat'¶
- description: str = "Process for Spacy for Patrick Burn's Latin model."¶
- authorship_info: str = '``LatinSpacyProcess`` using LatinCy model by Patrick Burns from https://arxiv.org/abs/2305.04365 . Please cite: https://arxiv.org/abs/2305.04365'¶
8.1.5.3. cltk.dependency.spacy_wrapper module¶
Wrapper for spaCy NLP software and models.
- class cltk.dependency.spacy_wrapper.SpacyWrapper(language, nlp=None, interactive=True, silent=False)[source]¶
Bases: object

SpacyWrapper is an interface between spaCy and the CLTK.
- nlps: dict[str, Any] = {}¶
- parse(text)[source]¶
>>> from cltk.languages.example_texts import get_example_text
>>> spacy_wrapper: SpacyWrapper = SpacyWrapper(language="lat")
>>> latin_spacy_doc: SpacyDoc = spacy_wrapper.parse(get_example_text("lat"))

- Parameters:
  text (str) – Text to analyze.
- Return type:
  Doc
- Returns:
- classmethod get_nlp(language)[source]¶
- Parameters:
  language (str) – Language parameter to retrieve an already-loaded model or the default model.
- Returns:
  A saved instance of SpacyWrapper.
- is_wrapper_available()[source]¶
Maps an ISO 639-3 language id (e.g., lat for Latin) to that used by spacy (la); confirms that this is a language the CLTK supports (i.e., is it pre-modern or not).

>>> spacy_wrapper: SpacyWrapper = SpacyWrapper(language='lat', interactive=False, silent=True)
>>> spacy_wrapper.is_wrapper_available()
True
- Return type:
bool
8.1.5.4. cltk.dependency.stanza_wrapper module¶
Wrapper for the Python Stanza package. About: https://github.com/stanfordnlp/stanza.
- class cltk.dependency.stanza_wrapper.StanzaWrapper(language, treebank=None, stanza_debug_level='ERROR', interactive=True, silent=False)[source]¶
Bases: object

CLTK's wrapper for the Stanza project.
- nlps = {}¶
- parse(text)[source]¶
Run all available stanza parsing on input text.

>>> from cltk.languages.example_texts import get_example_text
>>> stanza_wrapper = StanzaWrapper(language='grc', stanza_debug_level="INFO", interactive=False, silent=True)
>>> greek_nlp = stanza_wrapper.parse(get_example_text("grc"))
>>> from stanza.models.common.doc import Document, Token
>>> isinstance(greek_nlp, Document)
True
>>> nlp_greek_first_sent = greek_nlp.sentences[0]
>>> isinstance(nlp_greek_first_sent.tokens[0], Token)
True
>>> nlp_greek_first_sent.tokens[0].text
'ὅτι'
>>> nlp_greek_first_sent.tokens[0].words
[{ "id": 1, "text": "ὅτι", "lemma": "ὅτι", "upos": "ADV", "xpos": "Df", "head": 13, "deprel": "advmod", "start_char": 0, "end_char": 3 }]
>>> nlp_greek_first_sent.tokens[0].start_char
0
>>> nlp_greek_first_sent.tokens[0].end_char
3
>>> nlp_greek_first_sent.tokens[0].misc
>>> nlp_greek_first_sent.tokens[0].pretty_print()
'<Token id=1;words=[<Word id=1;text=ὅτι;lemma=ὅτι;upos=ADV;xpos=Df;head=13;deprel=advmod>]>'
>>> nlp_greek_first_sent.tokens[0].to_dict()
[{'id': 1, 'text': 'ὅτι', 'lemma': 'ὅτι', 'upos': 'ADV', 'xpos': 'Df', 'head': 13, 'deprel': 'advmod', 'start_char': 0, 'end_char': 3}]
>>> first_word = nlp_greek_first_sent.tokens[0].words[0]
>>> first_word.id
1
>>> first_word.text
'ὅτι'
>>> first_word.lemma
'ὅτι'
>>> first_word.upos
'ADV'
>>> first_word.xpos
'Df'
>>> first_word.feats
>>> first_word.head
13
>>> first_word.parent
[ { "id": 1, "text": "ὅτι", "lemma": "ὅτι", "upos": "ADV", "xpos": "Df", "head": 13, "deprel": "advmod", "start_char": 0, "end_char": 3 } ]
>>> first_word.misc
>>> first_word.deprel
'advmod'
>>> first_word.pos
'ADV'
- _load_pipeline()[source]¶
Instantiate stanza.Pipeline().

TODO: Make sure that logging captures what it should from the default stanza printout.
TODO: Make note that full lemmatization is not possible for Old French.

>>> stanza_wrapper = StanzaWrapper(language='grc', stanza_debug_level="INFO", interactive=False, silent=True)
>>> with suppress_stdout():
...     nlp_obj = stanza_wrapper._load_pipeline()
>>> isinstance(nlp_obj, stanza.pipeline.core.Pipeline)
True
>>> stanza_wrapper = StanzaWrapper(language='fro', stanza_debug_level="INFO", interactive=False, silent=True)
>>> with suppress_stdout():
...     nlp_obj = stanza_wrapper._load_pipeline()
>>> isinstance(nlp_obj, stanza.pipeline.core.Pipeline)
True
- _is_model_present()[source]¶
Checks if the model is already downloaded.
>>> stanza_wrapper = StanzaWrapper(language='grc', stanza_debug_level="INFO", interactive=False, silent=True)
>>> stanza_wrapper._is_model_present()
True
- Return type:
bool
- _get_default_treebank()[source]¶
Return description of a language’s default treebank if none supplied.
>>> stanza_wrapper = StanzaWrapper(language='grc', stanza_debug_level="INFO", interactive=False, silent=True)
>>> stanza_wrapper._get_default_treebank()
'proiel'
- Return type:
str
- _is_valid_treebank()[source]¶
Check whether the optional treebank value is valid for the chosen language.

>>> stanza_wrapper = StanzaWrapper(language='grc', treebank='proiel', stanza_debug_level="INFO", interactive=False, silent=True)
>>> stanza_wrapper._is_valid_treebank()
True
- Return type:
bool
- is_wrapper_available()[source]¶
Maps an ISO 639-3 language id (e.g., lat for Latin) to that used by stanza (la); confirms that this is a language the CLTK supports (i.e., is it pre-modern or not).

>>> stanza_wrapper = StanzaWrapper(language='grc', stanza_debug_level="INFO", interactive=False, silent=True)
>>> stanza_wrapper.is_wrapper_available()
True
- Return type:
bool
- _get_stanza_code()[source]¶
Using a known-supported language, map the CLTK's internal code to the code used by Stanza.

>>> stanza_wrapper = StanzaWrapper(language='grc', stanza_debug_level="INFO", interactive=False, silent=True)
>>> stanza_wrapper._get_stanza_code()
'grc'
>>> stanza_wrapper.language = "xxx"
>>> stanza_wrapper._get_stanza_code()
Traceback (most recent call last):
  ...
KeyError: 'Somehow ``StanzaWrapper.language`` got renamed to something invalid. This should never happen.'
- Return type:
str
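The lookup behind `_get_stanza_code` amounts to a dictionary mapping with a loud failure mode, as the doctest's `KeyError` shows. A hypothetical sketch; the mapping entries here are illustrative, not the CLTK's full table:

```python
# Hypothetical subset of the CLTK-internal -> Stanza language-code mapping
CLTK_TO_STANZA = {"grc": "grc", "lat": "la", "chu": "cu"}


def get_stanza_code(language: str) -> str:
    """Look up the Stanza code for a CLTK language code.

    Fails loudly on an unknown code rather than silently falling back,
    since an invalid value here indicates a programming error upstream.
    """
    try:
        return CLTK_TO_STANZA[language]
    except KeyError:
        raise KeyError(
            "Somehow ``StanzaWrapper.language`` got renamed to something "
            "invalid. This should never happen."
        )
```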
8.1.5.5. cltk.dependency.tree module¶
A data structure for representing dependency tree graphs.
- class cltk.dependency.tree.Form(form, form_id=0)[source]¶
Bases: Element

For the word (i.e., node) of a dependency tree and its attributes. Inherits from the Element class of Python's xml.etree library.

>>> desc_form = Form('described')
>>> desc_form
described_0
>>> desc_form.set('Tense', 'Past')
>>> desc_form
described_0
>>> desc_form / 'VBN'
described_0/VBN
>>> desc_form.full_str()
'described_0 [Tense=Past,pos=VBN]'
- get_dependencies(relation)[source]¶
Extract dependents of this form for the specified dependency relation.
>>> john = Form('John', 1) / 'NNP'
>>> loves = Form('loves', 2) / 'VRB'
>>> mary = Form('Mary', 3) / 'NNP'
>>> loves >> john | 'subj'
subj(loves_2/VRB, John_1/NNP)
>>> loves >> mary | 'obj'
obj(loves_2/VRB, Mary_3/NNP)
>>> loves.get_dependencies('subj')
[subj(loves_2/VRB, John_1/NNP)]
>>> loves.get_dependencies('obj')
[obj(loves_2/VRB, Mary_3/NNP)]
- Return type:
List[Dependency]
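The operator syntax in these doctests (`/` to set a part of speech, `>>` to attach a dependent, `|` to type the relation) can be imitated in a few lines. The classes below, `MiniForm` and `MiniDependency`, are hypothetical stand-ins rather than the CLTK's `Form` and `Dependency`, but they subclass `xml.etree`'s `Element` in the same way:

```python
import xml.etree.ElementTree as ET


class MiniDependency:
    """Hypothetical stand-in for cltk.dependency.tree.Dependency."""

    def __init__(self, head, dep, relation=None):
        self.head, self.dep, self.relation = head, dep, relation

    def __or__(self, relation):
        # `dependency | 'subj'` types the relation and records it
        # on the dependent node as well
        self.relation = relation
        self.dep.set("relation", relation)
        return self


class MiniForm(ET.Element):
    """Hypothetical stand-in for cltk.dependency.tree.Form."""

    def __init__(self, form, form_id=0):
        super().__init__(form, attrib={"form_id": str(form_id)})

    def __truediv__(self, pos):
        # `form / 'VRB'` attaches a part-of-speech attribute
        self.set("pos", pos)
        return self

    def __rshift__(self, dep):
        # `head >> dependent` makes the dependent a child node
        # and yields an (untyped) dependency edge
        self.append(dep)
        return MiniDependency(self, dep)


john = MiniForm("John", 1) / "NNP"
loves = MiniForm("loves", 2) / "VRB"
edge = loves >> john | "subj"   # >> binds tighter than |
```

Storing the dependent as an XML child of its head is what lets the tree reuse `Element`'s traversal machinery (e.g., `findall`) for free.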
- full_str(include_relation=True)[source]¶
Returns a string containing all features of the Form. The ID is attached to the text, and the relation is optionally suppressed.
>>> loves = Form('loves', 2) / 'VRB'
>>> loves.full_str()
'loves_2 [pos=VRB]'
>>> john = Form('John', 1) / 'NNP'
>>> loves >> john | 'subj'
subj(loves_2/VRB, John_1/NNP)
>>> john.full_str(True)
'John_1 [pos=NNP,relation=subj]'
- Return type:
str
- static to_form(word)[source]¶
Converts a CLTK Word object to a Form.

TODO: The Form info that prints is incomplete/ugly; correct the str repr of Form.
TODO: Fix these doctests; it's ugly to import so many Forms, but is this required?

>>> from cltk.morphology.universal_dependencies_features import Case, Gender, Number, POS
>>> noun = POS.noun
>>> nominative = Case.nominative
>>> feminine = Gender.feminine
>>> singular = Number.singular
>>> cltk_word = Word(index_char_start=None, index_char_stop=None, index_token=0, index_sentence=0, string='Gallia', pos=noun, lemma='Gallia', stem=None, scansion=None, xpos='A1|grn1|casA|gen2', upos='NOUN', dependency_relation='nsubj', governor=1, features={Case: [nominative], Gender: [feminine], Number: [singular]}, category={F: [neg], N: [pos], V: [neg]}, stop=False, named_entity='LOCATION', syllables=None, phonetic_transcription=None, definition='')
>>> cltk_word.features[Case] = Case.nominative
>>> cltk_word.features[Gender] = Gender.feminine
>>> cltk_word.features[Number] = Number.singular
>>> f = Form.to_form(cltk_word)
>>> f.full_str()
'Gallia_0 [lemma=mallis,pos=NOUN,upos=NOUN,xpos=A1|grn1|casA|gen2,Case=nominative,Gender=feminine,Number=singular]'
- Return type:
  Form
- class cltk.dependency.tree.Dependency(head, dep, relation=None)[source]¶
Bases: object

The asymmetric binary relationship (or edge) between a governing Form (the "head") and a subordinate Form (the "dependent").

In principle the relationship could capture any form-to-form relation that the system deems of interest, be it syntactic, semantic, or discursive.

If the relation attribute is not specified, then the dependency simply states that there is some asymmetric relationship between the head and the dependent. This is an untyped dependency.

For a typed dependency, a string value is supplied for the relation attribute.
- class cltk.dependency.tree.DependencyTree(root)[source]¶
Bases: ElementTree

The hierarchical tree representing the entirety of a parse.
- get_dependencies()[source]¶
Returns a list of all the dependency relations in the tree, generated by depth-first search.
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.dependency.processes import StanzaProcess
>>> process_stanza = StanzaProcess(language="lat")
>>> output_doc = process_stanza.run(Doc(raw=get_example_text("lat")))
>>> a_sentence = output_doc.sentences[0]
>>> t = DependencyTree.to_tree(a_sentence)
>>> len(t.get_dependencies())
34
- Return type:
List[Dependency]
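Since each dependent is stored under its head, collecting every relation is a depth-first walk over the tree. A hypothetical sketch using plain dicts instead of `Element` nodes:

```python
def collect_dependencies(node, edges=None):
    """Depth-first walk over a nested {form, relation, children} dict,
    collecting (head, relation, dependent) triples in DFS order."""
    if edges is None:
        edges = []
    for child in node.get("children", []):
        edges.append((node["form"], child["relation"], child["form"]))
        collect_dependencies(child, edges)
    return edges


# Root of a toy parse of the Latin example, with two of its dependents
tree = {"form": "divisa", "children": [
    {"form": "Gallia", "relation": "nsubj", "children": []},
    {"form": "est", "relation": "aux", "children": []},
]}
edges = collect_dependencies(tree)
```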
- print_tree(all_features=False)[source]¶
Prints an indented representation of the dependency tree. If all_features is True, each node is printed with its complete feature bundle.
- static to_tree(sentence)[source]¶
Factory method to create trees from sentence parses, i.e., lists of words.

>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.dependency.processes import StanzaProcess
>>> process_stanza = StanzaProcess(language="lat")
>>> output_doc = process_stanza.run(Doc(raw=get_example_text("lat")))
>>> a_sentence = output_doc.sentences[0]
>>> t = DependencyTree.to_tree(a_sentence)
>>> t.findall(".")
[divisa_3/adjective]
- Return type:
  DependencyTree
8.1.5.6. cltk.dependency.utils module¶
Misc helper functions for extracting dependency info from CLTK data structures.
- cltk.dependency.utils.get_governor_word(word, sentence)[source]¶
Submit a Word and a sentence (being a list of Word) and then return the governing word.

- Return type:
Optional[Word]
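The lookup `get_governor_word` performs can be sketched with a minimal stand-in for `Word`: follow the word's `governor` index into the sentence list, returning `None` for the root. `W` and `get_governor` below are hypothetical names, not the CLTK API:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class W:
    """Hypothetical two-field stand-in for CLTK's Word."""
    string: str
    governor: Optional[int]  # index into the sentence; None for the root


def get_governor(word: W, sentence: List[W]) -> Optional[W]:
    """Return the governing word of ``word`` within ``sentence``,
    or None when the word has no governor (it is the root)."""
    if word.governor is None:
        return None
    return sentence[word.governor]


# "Gallia est divisa": "Gallia" and "est" are both governed by "divisa"
sent = [W("Gallia", 2), W("est", 2), W("divisa", None)]
```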