8.1.8. cltk.lemmatize package¶
Init for cltk.lemmatize.
8.1.8.1. Submodules¶
8.1.8.2. cltk.lemmatize.ang module¶
class cltk.lemmatize.ang.OldEnglishDictionaryLemmatizer[source]¶
    Bases: cltk.lemmatize.naive_lemmatizer.DictionaryRegexLemmatizer

    Naive lemmatizer for Old English.

    >>> lemmatizer = OldEnglishDictionaryLemmatizer()
    >>> lemmatizer.lemmatize_token('ġesāƿen')
    'geseon'
    >>> lemmatizer.lemmatize_token('ġesāƿen', return_frequencies=True)
    ('geseon', -6.519245611523386)
    >>> lemmatizer.lemmatize_token('ġesāƿen', return_frequencies=True, best_guess=False)
    [('geseon', -6.519245611523386), ('gesaƿan', 0), ('saƿan', 0)]
    >>> lemmatizer.lemmatize(['Same', 'men', 'cweþaþ', 'on', 'Englisc', 'þæt', 'hit', 'sie', 'feaxede', 'steorra', 'forþæm', 'þær', 'stent', 'lang', 'leoma', 'of', 'hwilum', 'on', 'ane', 'healfe', 'hwilum', 'on', 'ælce', 'healfe'], return_frequencies=True, best_guess=False)
    [[('same', -8.534148632065651), ('sum', -5.166852802079177)], [('mann', -6.829400539827225)], [('cweþan', -9.227295812625597)], [('an', -5.02260319323463), ('on', -2.210686128731377)], [('englisc', -8.128683523957486)], [('þæt', -2.365584472144866), ('se', -2.9011463394704973)], [('hit', -4.300042127468392)], [('wesan', -7.435536343397541)], [('feaxede', -9.227295812625597)], [('steorra', -8.534148632065651)], [('forðam', -6.282856833459156)], [('þær', -3.964605623720711)], [('standan', -7.617857900191496)], [('lang', -6.829400539827225)], [('leoma', -7.841001451505705)], [('of', -3.9440920838876075)], [('hwilum', -6.282856833459156)], [('an', -5.02260319323463), ('on', -2.210686128731377)], [('an', -5.02260319323463)], [('healf', -7.841001451505705)], [('hwilum', -6.282856833459156)], [('an', -5.02260319323463), ('on', -2.210686128731377)], [('ælc', -7.841001451505705)], [('healf', -7.841001451505705)]]
8.1.8.3. cltk.lemmatize.backoff module¶
Lemmatization module. Includes several classes for different lemmatizing approaches (based on training data, regex pattern matching, etc.), which can be chained together using the backoff parameter. Also includes a pre-built chain that uses models in cltk_data.
The logic behind the backoff lemmatizer is based on backoff POS-tagging in NLTK and repurposes several of the tagging classes for lemmatization tasks. See here for more info on sequential backoff tagging in NLTK: http://www.nltk.org/_modules/nltk/tag/sequential.html
PJB: The Latin lemmatizer modules were completed as part of Google Summer of Code 2016. I have written up a detailed report of the summer work here: https://gist.github.com/diyclassics/fc80024d65cc237f185a9a061c5d4824.
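The backoff mechanism described above can be sketched in plain, standalone Python. Note that the class names below (SimpleLemmatizer and friends) are purely illustrative, not CLTK classes; the real lemmatizers in this module subclass NLTK's SequentialBackoffTagger instead.

```python
# Minimal, hypothetical sketch of sequential backoff lemmatization.
# Each lemmatizer tries to choose a lemma itself; on failure (None),
# it consults the next lemmatizer in the chain.

class SimpleLemmatizer:
    """Base class: try self, then fall back to the next lemmatizer."""

    def __init__(self, backoff=None):
        self.backoff = backoff

    def choose_lemma(self, token):
        raise NotImplementedError  # subclasses define the actual strategy

    def lemmatize_one(self, token):
        lemma = self.choose_lemma(token)
        if lemma is None and self.backoff is not None:
            return self.backoff.lemmatize_one(token)
        return lemma

    def lemmatize(self, tokens):
        return [(token, self.lemmatize_one(token)) for token in tokens]


class SimpleDictLemmatizer(SimpleLemmatizer):
    """Dictionary lookup; returns None on a miss so the backoff is consulted."""

    def __init__(self, lemmas, backoff=None):
        super().__init__(backoff)
        self.lemmas = lemmas

    def choose_lemma(self, token):
        return self.lemmas.get(token)


class SimpleDefaultLemmatizer(SimpleLemmatizer):
    """Assigns the same lemma to every token; never defers."""

    def __init__(self, lemma):
        super().__init__(None)
        self.lemma = lemma

    def choose_lemma(self, token):
        return self.lemma


# Chain: dictionary lookup first, then 'UNK' for anything unknown.
chain = SimpleDictLemmatizer({"virumque": "vir"},
                             backoff=SimpleDefaultLemmatizer("UNK"))
print(chain.lemmatize(["arma", "virumque"]))  # [('arma', 'UNK'), ('virumque', 'vir')]
```

The same shape, with `choose_tag()` in place of `choose_lemma()`, is what the NLTK-derived classes below implement.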
class cltk.lemmatize.backoff.SequentialBackoffLemmatizer(backoff, verbose=False)[source]¶
    Bases: nltk.tag.sequential.SequentialBackoffTagger

    Abstract base class for lemmatizers created as a subclass of NLTK's SequentialBackoffTagger. Lemmatizers in this class "[tag] words sequentially, left to right. Tagging of individual words is performed by the choose_tag() method, which should be defined by subclasses. If a tagger is unable to determine a tag for the specified token, then its backoff tagger is consulted." See: https://www.nltk.org/_modules/nltk/tag/sequential.html#SequentialBackoffTagger

    Variables:
        _taggers – A list of all the taggers in the backoff chain, including self.
        _repr – An instance of Repr() from reprlib, used to limit list and dict lengths in subclass __repr__ methods.
    tag(tokens)[source]¶
        Docs (mostly) inherited from TaggerI; cf. https://www.nltk.org/_modules/nltk/tag/api.html#TaggerI.tag

        Two tweaks: 1. properly handle 'verbose' listing of the current tagger in the case of None (i.e. if tag: etc.); 2. keep track of taggers and change the return value depending on the 'verbose' flag.

        Return type: list
        Parameters:
            tokens (List[str]) – List of tokens to tag
    tag_one(tokens, index, history)[source]¶
        Determine an appropriate tag for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, then its backoff tagger is consulted.

        Return type: tuple
        Parameters:
            tokens (List[str]) – The list of words that are being tagged.
            index (int) – The index of the word whose tag should be returned.
            history (List[str]) – A list of the tags for all words before index.
class cltk.lemmatize.backoff.DefaultLemmatizer(lemma=None, backoff=None, verbose=False)[source]¶
    Bases: cltk.lemmatize.backoff.SequentialBackoffLemmatizer

    Lemmatizer that assigns the same lemma to every token. Useful as the final tagger in a chain, e.g. to assign 'UNK' to all remaining unlemmatized tokens.

    Parameters:
        lemma (Optional[str]) – Lemma to assign to each token

    >>> default_lemmatizer = DefaultLemmatizer('UNK')
    >>> list(default_lemmatizer.lemmatize('arma virumque cano'.split()))
    [('arma', 'UNK'), ('virumque', 'UNK'), ('cano', 'UNK')]
    choose_tag(tokens, index, history)[source]¶
        Decide which tag should be used for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, return None; do not consult the backoff tagger. This method should be overridden by subclasses of SequentialBackoffTagger.

        Return type: str
        Parameters:
            tokens (List[str]) – The list of words that are being tagged.
            index (int) – The index of the word whose tag should be returned.
            history (List[str]) – A list of the tags for all words before index.
class cltk.lemmatize.backoff.IdentityLemmatizer(backoff=None, verbose=False)[source]¶
    Bases: cltk.lemmatize.backoff.SequentialBackoffLemmatizer

    Lemmatizer that returns a given token as its lemma. Like DefaultLemmatizer, useful as the final tagger in a chain, e.g. to assign a possible form to all remaining unlemmatized tokens, increasing the chance of a successful match.

    >>> identity_lemmatizer = IdentityLemmatizer()
    >>> list(identity_lemmatizer.lemmatize('arma virumque cano'.split()))
    [('arma', 'arma'), ('virumque', 'virumque'), ('cano', 'cano')]
    choose_tag(tokens, index, history)[source]¶
        Decide which tag should be used for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, return None; do not consult the backoff tagger. This method should be overridden by subclasses of SequentialBackoffTagger.

        Return type: str
        Parameters:
            tokens (List[str]) – The list of words that are being tagged.
            index (int) – The index of the word whose tag should be returned.
            history (List[str]) – A list of the tags for all words before index.
class cltk.lemmatize.backoff.DictLemmatizer(lemmas, backoff=None, source=None, verbose=False)[source]¶
    Bases: cltk.lemmatize.backoff.SequentialBackoffLemmatizer

    Standalone version of the 'model' function found in UnigramTagger; by defining it as its own class, it is clearer that this lemmatizer is based on dictionary lookup and does not use training data.
    choose_tag(tokens, index, history)[source]¶
        Looks up the token in the lemmas dict and returns the corresponding value as the lemma.

        Return type: str
        Parameters:
            tokens (List[str]) – List of tokens to be lemmatized
            index (int) – Index of the current token
            history (List[str]) – List of tokens that have already been lemmatized; NOT USED
class cltk.lemmatize.backoff.UnigramLemmatizer(train=None, model=None, backoff=None, source=None, cutoff=0, verbose=False)[source]¶
    Bases: cltk.lemmatize.backoff.SequentialBackoffLemmatizer, nltk.tag.sequential.UnigramTagger

    Standalone version of the 'train' function found in UnigramTagger; by defining it as its own class, it is clearer that this lemmatizer is based on training data and not on a dictionary.
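What unigram "training" amounts to can be sketched independently of NLTK. The helper below is hypothetical and greatly simplified (the real UnigramTagger also applies a cutoff and backoff), but it shows the core idea: for each token, keep the lemma it was most often paired with in the training data.

```python
# Hypothetical sketch of unigram training from (token, lemma) pairs.
from collections import Counter, defaultdict

def train_unigram_model(train_pairs):
    """Map each token to its most frequent lemma in the training pairs."""
    counts = defaultdict(Counter)
    for token, lemma in train_pairs:
        counts[token][lemma] += 1
    return {token: lemma_counts.most_common(1)[0][0]
            for token, lemma_counts in counts.items()}

# Toy training data: 'est' is usually a form of 'sum', rarely of 'edo'.
train = [("est", "sum"), ("est", "sum"), ("est", "edo"), ("arma", "arma")]
model = train_unigram_model(train)
print(model["est"])  # 'sum' (two votes beat one)
```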
class cltk.lemmatize.backoff.RegexpLemmatizer(regexps=None, source=None, backoff=None, verbose=False)[source]¶
    Bases: cltk.lemmatize.backoff.SequentialBackoffLemmatizer, nltk.tag.sequential.RegexpTagger

    Regular expression tagger, inheriting from SequentialBackoffLemmatizer and RegexpTagger.

    choose_tag(tokens, index, history)[source]¶
        Use regular expressions for rules-based lemmatizing based on word endings; tokens are matched against patterns with the base kept as a group, and a word-ending replacement is appended to the (base) group.

        Return type: str
        Parameters:
            tokens (List[str]) – List of tokens to be lemmatized
            index (int) – Index of the current token
            history (List[str]) – List of tokens that have already been lemmatized; NOT USED
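The matching logic behind this kind of ending-based rule can be sketched standalone. The rules below are made up for illustration (roughly the (pattern, replacement) shape such lemmatizers consume); they are not the rules CLTK ships.

```python
# Hypothetical sketch of regex ending-replacement lemmatization.
import re

# Each rule captures the base as a group and swaps the ending.
regexps = [
    (r"^(.+)(?:ibus)$", r"\1is"),   # e.g. omnibus -> omnis (illustrative)
    (r"^(.+)(?:arum)$", r"\1a"),    # e.g. puellarum -> puella (illustrative)
]

def regex_lemmatize(token):
    for pattern, replacement in regexps:
        if re.match(pattern, token):
            return re.sub(pattern, replacement, token)
    return None  # no rule matched; a backoff lemmatizer would be consulted

print(regex_lemmatize("puellarum"))  # 'puella'
```

Returning None on a miss is what lets such a rules-based stage slot into a backoff chain rather than force a guess.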
8.1.8.4. cltk.lemmatize.fro module¶
Lemmatizer for Old French. Rules are based on Brunot & Bruneau (1949).
class cltk.lemmatize.fro.OldFrenchDictionaryLemmatizer[source]¶
    Bases: cltk.lemmatize.naive_lemmatizer.DictionaryRegexLemmatizer

    Naive lemmatizer for Old French.

    >>> lemmatizer = OldFrenchDictionaryLemmatizer()
    >>> lemmatizer.lemmatize_token('corant')
    'corant'
    >>> lemmatizer.lemmatize_token('corant', return_frequencies=True)
    ('corant', -9.319508628976836)
    >>> lemmatizer.lemmatize_token('corant', return_frequencies=True, best_guess=False)
    [('corir', 0), ('corant', -9.319508628976836)]
    >>> lemmatizer.lemmatize(['corant', '.', 'vult', 'premir'], return_frequencies=True, best_guess=False)
    [[('corir', 0), ('corant', -9.319508628976836)], [('PUNK', 0)], [('vout', -7.527749159748781)], [('premir', 0)]]
8.1.8.5. cltk.lemmatize.grc module¶
Module for lemmatizing Ancient Greek.
class cltk.lemmatize.grc.GreekBackoffLemmatizer(train=None, seed=3, verbose=False)[source]¶
    Bases: object

    Suggested backoff chain; includes at least one of each major type of sequential backoff class from backoff.py.

    lemmatize(tokens)[source]¶
        Lemmatize a list of words.

        >>> lemmatizer = GreekBackoffLemmatizer()
        >>> from cltk.alphabet.text_normalization import cltk_normalize
        >>> word = cltk_normalize('διοτρεφές')
        >>> lemmatizer.lemmatize([word])
        [('διοτρεφές', 'διοτρεφής')]
        >>> republic = cltk_normalize("κατέβην χθὲς εἰς Πειραιᾶ μετὰ Γλαύκωνος τοῦ Ἀρίστωνος")
        >>> lemmatizer.lemmatize(republic.split())
        [('κατέβην', 'καταβαίνω'), ('χθὲς', 'χθές'), ('εἰς', 'εἰς'), ('Πειραιᾶ', 'Πειραιεύς'), ('μετὰ', 'μετά'), ('Γλαύκωνος', 'Γλαύκων'), ('τοῦ', 'ὁ'), ('Ἀρίστωνος', 'Ἀρίστων')]
8.1.8.6. cltk.lemmatize.lat module¶
Module for lemmatizing Latin.
class cltk.lemmatize.lat.RomanNumeralLemmatizer(default=None, backoff=None)[source]¶
    Bases: cltk.lemmatize.backoff.RegexpLemmatizer

    Lemmatizer for identifying Roman numerals in Latin text based on regex.

    >>> lemmatizer = RomanNumeralLemmatizer()
    >>> lemmatizer.lemmatize("i ii iii iv v vi vii vii ix x xx xxx xl l lx c cc".split())
    [('i', 'NUM'), ('ii', 'NUM'), ('iii', 'NUM'), ('iv', 'NUM'), ('v', 'NUM'), ('vi', 'NUM'), ('vii', 'NUM'), ('vii', 'NUM'), ('ix', 'NUM'), ('x', 'NUM'), ('xx', 'NUM'), ('xxx', 'NUM'), ('xl', 'NUM'), ('l', 'NUM'), ('lx', 'NUM'), ('c', 'NUM'), ('cc', 'NUM')]

    >>> lemmatizer = RomanNumeralLemmatizer(default="RN")
    >>> lemmatizer.lemmatize('i ii iii'.split())
    [('i', 'RN'), ('ii', 'RN'), ('iii', 'RN')]
    choose_tag(tokens, index, history)[source]¶
        Use regular expressions for rules-based lemmatizing based on word endings; tokens are matched against patterns with the base kept as a group, and a word-ending replacement is appended to the (base) group.

        Return type: str
        Parameters:
            tokens (List[str]) – List of tokens to be lemmatized
            index (int) – Index of the current token
            history (List[str]) – List of tokens that have already been lemmatized; NOT USED
class cltk.lemmatize.lat.LatinBackoffLemmatizer(train=None, seed=3, verbose=False)[source]¶
    Bases: object

    Suggested backoff chain; includes at least one of each major type of sequential backoff class from backoff.py.

    ### Putting it all together
    ### BETA version of the backoff lemmatizer, AKA BackoffLatinLemmatizer
    ### For comparison, there is also a TrainLemmatizer that replicates the
    ### original Latin lemmatizer from cltk.stem
8.1.8.7. cltk.lemmatize.naive_lemmatizer module¶
class cltk.lemmatize.naive_lemmatizer.DictionaryRegexLemmatizer[source]¶
    Bases: abc.ABC

    Implementation of a lemmatizer based on a dictionary of lemmas and forms, backing off to regex rules. Since a given form may map to multiple lemmas, a corpus-based frequency disambiguator is employed.

    Subclasses must provide methods to load the dictionary and corpora, and to specify regular expressions.
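The frequency-disambiguation step can be sketched with a toy corpus (the words and counts below are purely illustrative, not CLTK's data): when a form maps to several candidate lemmas, score each candidate by its log relative frequency in the corpus and keep the best-scoring one.

```python
# Hypothetical sketch of corpus-frequency disambiguation between lemmas.
import math
from collections import Counter

corpus = ["on", "an", "on", "mann", "on", "an", "geseon", "on"]  # toy corpus
counts = Counter(corpus)
total = sum(counts.values())

def log_relative_frequency(word):
    # Unseen words score 0, matching the doctests above where unknowns get 0.
    return math.log(counts[word] / total) if counts[word] > 0 else 0.0

def best_guess(candidates):
    """Pick the candidate lemma with the highest log relative frequency."""
    return max(candidates, key=log_relative_frequency)

print(best_guess(["an", "on"]))  # 'on': log(4/8) beats log(2/8)
```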
    _relative_frequency(word)[source]¶
        Computes the log relative frequency for a word form.

        Return type: float
    _apply_regex(token)[source]¶
        Looks for a match between the regex rules and the token. If one is found, applies the replacement part of the rule to the token and returns the result; otherwise returns the token unchanged.
    lemmatize_token(token, best_guess=True, return_frequencies=False)[source]¶
        Lemmatize a single token. If best_guess is True, then take the most frequent lemma when a form has multiple possible lemmatizations; if the form is not found, just return it. If best_guess is False, then always return the full set of possible lemmas, or the empty list if none is found. If return_frequencies is True, then also return the relative frequency of the lemma in a corpus.

        >>> from cltk.lemmatize.ang import OldEnglishDictionaryLemmatizer
        >>> lemmatizer = OldEnglishDictionaryLemmatizer()
        >>> lemmatizer.lemmatize_token('fōrestepeþ')
        'foresteppan'
        >>> lemmatizer.lemmatize_token('Caesar', return_frequencies=True, best_guess=True)
        ('Caesar', 0)

        Return type: Union[str, List[Union[str, Tuple[str, float]]]]
    lemmatize(tokens, best_guess=True, return_frequencies=False)[source]¶
        Lemmatize tokens in a list of strings.

        >>> from cltk.lemmatize.ang import OldEnglishDictionaryLemmatizer
        >>> lemmatizer = OldEnglishDictionaryLemmatizer()
        >>> lemmatizer.lemmatize(['eotenas','ond','ylfe','ond','orcneas'], return_frequencies=True, best_guess=True)
        [('eoten', -9.227295812625597), ('and', -2.8869365088978443), ('ylfe', -9.227295812625597), ('and', -2.8869365088978443), ('orcneas', -9.227295812625597)]

        Return type: Union[str, List[Union[str, Tuple[str, float]]]]
8.1.8.8. cltk.lemmatize.processes module¶
Processes for lemmatization.
class cltk.lemmatize.processes.LemmatizationProcess(language: str = None)[source]¶
    Bases: cltk.core.data_types.Process

    To be inherited for each language's lemmatization declarations.

    Example: LemmatizationProcess -> LatinLemmatizationProcess

    >>> from cltk.lemmatize.processes import LemmatizationProcess
    >>> from cltk.core.data_types import Process
    >>> issubclass(LemmatizationProcess, Process)
    True
class cltk.lemmatize.processes.GreekLemmatizationProcess(language: str = None)[source]¶
    Bases: cltk.lemmatize.processes.LemmatizationProcess

    The default Ancient Greek lemmatization algorithm.

    >>> from cltk.core.data_types import Process, Pipeline
    >>> from cltk.tokenizers import MultilingualTokenizationProcess
    >>> from cltk.languages.utils import get_lang
    >>> from cltk.languages.example_texts import get_example_text
    >>> from cltk.nlp import NLP
    >>> pipe = Pipeline(description="A custom Greek pipeline", processes=[MultilingualTokenizationProcess, GreekLemmatizationProcess], language=get_lang("grc"))
    >>> nlp = NLP(language='grc', custom_pipeline=pipe, suppress_banner=True)
    >>> nlp(get_example_text("grc")).lemmata[30:40]
    ['ἔλεγον.', 'καίτοι', 'ἀληθές', 'γε', 'ὡς', 'ἔπος', 'εἰπεῖν', 'οὐδὲν', 'εἰρήκασιν.', 'μάλιστα']

    description = 'Lemmatization process for Ancient Greek'¶

    algorithm¶
class cltk.lemmatize.processes.LatinLemmatizationProcess(language: str = None)[source]¶
    Bases: cltk.lemmatize.processes.LemmatizationProcess

    The default Latin lemmatization algorithm.

    >>> from cltk.core.data_types import Process, Pipeline
    >>> from cltk.tokenizers import LatinTokenizationProcess
    >>> from cltk.languages.utils import get_lang
    >>> from cltk.languages.example_texts import get_example_text
    >>> from cltk.nlp import NLP
    >>> pipe = Pipeline(description="A custom Latin pipeline", processes=[LatinTokenizationProcess, LatinLemmatizationProcess], language=get_lang("lat"))
    >>> nlp = NLP(language='lat', custom_pipeline=pipe, suppress_banner=True)
    >>> nlp(get_example_text("lat")).lemmata[30:40]
    ['institutis', ',', 'legibus', 'inter', 'se', 'differunt', '.', 'Gallos', 'ab', 'Aquitanis']

    description = 'Lemmatization process for Latin'¶

    algorithm¶
class cltk.lemmatize.processes.OldEnglishLemmatizationProcess(language: str = None)[source]¶
    Bases: cltk.lemmatize.processes.LemmatizationProcess

    The default Old English lemmatization algorithm.

    >>> from cltk.core.data_types import Process, Pipeline
    >>> from cltk.tokenizers import MultilingualTokenizationProcess
    >>> from cltk.languages.utils import get_lang
    >>> from cltk.languages.example_texts import get_example_text
    >>> from cltk.nlp import NLP
    >>> pipe = Pipeline(description="A custom Old English pipeline", processes=[MultilingualTokenizationProcess, OldEnglishLemmatizationProcess], language=get_lang("ang"))
    >>> nlp = NLP(language='ang', custom_pipeline=pipe, suppress_banner=True)
    >>> nlp(get_example_text("ang")).lemmata[30:40]
    ['siððan', 'ær', 'weorþan', 'feasceaft', 'findan', ',', 'he', 'se', 'frofre', 'gebidan']

    description = 'Lemmatization process for Old English'¶

    algorithm¶
class cltk.lemmatize.processes.OldFrenchLemmatizationProcess(language: str = None)[source]¶
    Bases: cltk.lemmatize.processes.LemmatizationProcess

    The default Old French lemmatization algorithm.

    >>> from cltk.core.data_types import Process, Pipeline
    >>> from cltk.tokenizers import MultilingualTokenizationProcess
    >>> from cltk.languages.utils import get_lang
    >>> from cltk.languages.example_texts import get_example_text
    >>> from cltk.nlp import NLP
    >>> pipe = Pipeline(description="A custom Old French pipeline", processes=[MultilingualTokenizationProcess, OldFrenchLemmatizationProcess], language=get_lang("fro"))
    >>> nlp = NLP(language='fro', custom_pipeline=pipe, suppress_banner=True)
    >>> nlp(get_example_text("fro")).lemmata[30:40]
    ['avenir', 'jadis', 'en', 'bretaingne', 'avoir', '.I.', 'molt', 'riche', 'chevalier', 'PUNK']

    description = 'Lemmatization process for Old French'¶

    algorithm¶