8.1.8. cltk.lemmatize package¶
Init for cltk.lemmatize.
8.1.8.1. Submodules¶
8.1.8.2. cltk.lemmatize.ang module¶
- class cltk.lemmatize.ang.OldEnglishDictionaryLemmatizer[source]¶
Bases:
DictionaryRegexLemmatizer
Naive lemmatizer for Old English.
TODO: Add silent and non-interactive options to this class
>>> lemmatizer = OldEnglishDictionaryLemmatizer()
>>> lemmatizer.lemmatize_token('ġesāƿen')
'geseon'
>>> lemmatizer.lemmatize_token('ġesāƿen', return_frequencies=True)
('geseon', -6.519245611523386)
>>> lemmatizer.lemmatize_token('ġesāƿen', return_frequencies=True, best_guess=False)
[('geseon', -6.519245611523386), ('gesaƿan', 0), ('saƿan', 0)]
>>> lemmatizer.lemmatize(['Same', 'men', 'cweþaþ', 'on', 'Englisc', 'þæt', 'hit', 'sie', 'feaxede', 'steorra', 'forþæm', 'þær', 'stent', 'lang', 'leoma', 'of', 'hwilum', 'on', 'ane', 'healfe', 'hwilum', 'on', 'ælce', 'healfe'], return_frequencies=True, best_guess=False)
[[('same', -8.534148632065651), ('sum', -5.166852802079177)], [('mann', -6.829400539827225)], [('cweþan', -9.227295812625597)], [('an', -5.02260319323463), ('on', -2.210686128731377)], [('englisc', -8.128683523957486)], [('þæt', -2.365584472144866), ('se', -2.9011463394704973)], [('hit', -4.300042127468392)], [('wesan', -7.435536343397541)], [('feaxede', -9.227295812625597)], [('steorra', -8.534148632065651)], [('forðam', -6.282856833459156)], [('þær', -3.964605623720711)], [('standan', -7.617857900191496)], [('lang', -6.829400539827225)], [('leoma', -7.841001451505705)], [('of', -3.9440920838876075)], [('hwilum', -6.282856833459156)], [('an', -5.02260319323463), ('on', -2.210686128731377)], [('an', -5.02260319323463)], [('healf', -7.841001451505705)], [('hwilum', -6.282856833459156)], [('an', -5.02260319323463), ('on', -2.210686128731377)], [('ælc', -7.841001451505705)], [('healf', -7.841001451505705)]]
8.1.8.3. cltk.lemmatize.backoff module¶
Lemmatization module; includes several classes for different lemmatizing approaches, based on training data, regex pattern matching, etc. These can be chained together using the backoff parameter. Also includes a pre-built chain that uses models in cltk_data.
The logic behind the backoff lemmatizer is based on backoff POS-tagging in NLTK and repurposes several of the tagging classes for lemmatization tasks. See here for more info on sequential backoff tagging in NLTK: http://www.nltk.org/_modules/nltk/tag/sequential.html
PJB: The Latin lemmatizer modules were completed as part of Google Summer of Code 2016. I have written up a detailed report of the summer work here: https://gist.github.com/diyclassics/fc80024d65cc237f185a9a061c5d4824.
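For illustration, a minimal sketch of such a backoff chain (the lemma dictionary here is invented): a DictLemmatizer consults a DefaultLemmatizer whenever its own lookup fails.
>>> from cltk.lemmatize.backoff import DictLemmatizer, DefaultLemmatizer
>>> backoff = DefaultLemmatizer('UNK')
>>> lemmatizer = DictLemmatizer(lemmas={'arma': 'arma'}, backoff=backoff)
>>> list(lemmatizer.lemmatize('arma ignotum'.split()))
[('arma', 'arma'), ('ignotum', 'UNK')]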
- class cltk.lemmatize.backoff.SequentialBackoffLemmatizer(backoff, verbose=False)[source]¶
Bases:
SequentialBackoffTagger
Abstract base class for lemmatizers created as a subclass of NLTK’s SequentialBackoffTagger. Lemmatizers in this class “[tag] words sequentially, left to right. Tagging of individual words is performed by the choose_tag() method, which should be defined by subclasses. If a tagger is unable to determine a tag for the specified token, then its backoff tagger is consulted.” See: https://www.nltk.org/_modules/nltk/tag/sequential.html#SequentialBackoffTagger
- Variables:
_taggers – A list of all the taggers in the backoff chain, inc. self.
_repr – An instance of Repr() from reprlib to handle list and dict length in subclass __repr__’s
- tag(tokens)[source]¶
Docs (mostly) inherited from TaggerI; cf. https://www.nltk.org/_modules/nltk/tag/api.html#TaggerI.tag
Two tweaks: (1) properly handle ‘verbose’ listing of the current tagger in the case of None (i.e. if tag: etc.); (2) keep track of taggers and change the return value depending on the ‘verbose’ flag.
- Return type:
list
- Parameters:
tokens (List[str]) – List of tokens to tag
- tag_one(tokens, index, history)[source]¶
Determine an appropriate tag for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, then its backoff tagger is consulted.
- Return type:
tuple
- Parameters:
tokens (List[str]) – The list of words that are being tagged.
index (int) – The index of the word whose tag should be returned.
history (List[str]) – A list of the tags for all words before index.
- class cltk.lemmatize.backoff.DefaultLemmatizer(lemma=None, backoff=None, verbose=False)[source]¶
Bases:
SequentialBackoffLemmatizer
Lemmatizer that assigns the same lemma to every token. Useful as the final tagger in a chain, e.g. to assign ‘UNK’ to all remaining unlemmatized tokens.
- Parameters:
lemma (str) – Lemma to assign to each token
>>> default_lemmatizer = DefaultLemmatizer('UNK')
>>> list(default_lemmatizer.lemmatize('arma virumque cano'.split()))
[('arma', 'UNK'), ('virumque', 'UNK'), ('cano', 'UNK')]
- choose_tag(tokens, index, history)[source]¶
Decide which tag should be used for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, return None – do not consult the backoff tagger. This method should be overridden by subclasses of SequentialBackoffTagger.
- Return type:
str
- Parameters:
tokens (List[str]) – The list of words that are being tagged.
index (int) – The index of the word whose tag should be returned.
history (List[str]) – A list of the tags for all words before index.
- class cltk.lemmatize.backoff.IdentityLemmatizer(backoff=None, verbose=False)[source]¶
Bases:
SequentialBackoffLemmatizer
Lemmatizer that returns a given token as its lemma. Like DefaultLemmatizer, useful as the final tagger in a chain, e.g. to assign a possible form to all remaining unlemmatized tokens, increasing the chance of a successful match.
>>> identity_lemmatizer = IdentityLemmatizer()
>>> list(identity_lemmatizer.lemmatize('arma virumque cano'.split()))
[('arma', 'arma'), ('virumque', 'virumque'), ('cano', 'cano')]
- choose_tag(tokens, index, history)[source]¶
Decide which tag should be used for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, return None – do not consult the backoff tagger. This method should be overridden by subclasses of SequentialBackoffTagger.
- Return type:
str
- Parameters:
tokens (List[str]) – The list of words that are being tagged.
index (int) – The index of the word whose tag should be returned.
history (List[str]) – A list of the tags for all words before index.
- class cltk.lemmatize.backoff.DictLemmatizer(lemmas, backoff=None, source=None, verbose=False)[source]¶
Bases:
SequentialBackoffLemmatizer
Standalone version of the ‘model’ function found in UnigramTagger; by defining it as its own class, it is clearer that this lemmatizer is based on dictionary lookup and does not use training data.
- choose_tag(tokens, index, history)[source]¶
Looks up the token in the lemmas dict and returns the corresponding value as the lemma.
- Return type:
str
- Parameters:
tokens (List[str]) – List of tokens to be lemmatized
index (int) – Index of the current token
history (List[str]) – List of tokens that have already been lemmatized; NOT USED
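A minimal usage sketch (the lemma dictionary is invented for illustration); without a backoff, tokens missing from the dictionary are paired with None:
>>> from cltk.lemmatize.backoff import DictLemmatizer
>>> lemmatizer = DictLemmatizer(lemmas={'arma': 'arma', 'virumque': 'vir'})
>>> list(lemmatizer.lemmatize('arma virumque cano'.split()))
[('arma', 'arma'), ('virumque', 'vir'), ('cano', None)]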
- class cltk.lemmatize.backoff.UnigramLemmatizer(train=None, model=None, backoff=None, source=None, cutoff=0, verbose=False)[source]¶
Bases:
SequentialBackoffLemmatizer, UnigramTagger
Standalone version of the ‘train’ function found in UnigramTagger; by defining it as its own class, it is clearer that this lemmatizer is based on training data and not on a dictionary.
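A minimal usage sketch (the training sentences are invented for illustration); train follows the NLTK convention of a list of sentences, each a list of (token, lemma) pairs, and with the default cutoff=0 every observed form is learned:
>>> from cltk.lemmatize.backoff import UnigramLemmatizer
>>> train = [[('arma', 'arma'), ('virumque', 'vir')], [('arma', 'arma')]]
>>> lemmatizer = UnigramLemmatizer(train=train)
>>> list(lemmatizer.lemmatize('arma virumque'.split()))
[('arma', 'arma'), ('virumque', 'vir')]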
- class cltk.lemmatize.backoff.RegexpLemmatizer(regexps=None, source=None, backoff=None, verbose=False)[source]¶
Bases:
SequentialBackoffLemmatizer, RegexpTagger
Regular expression tagger, inheriting from SequentialBackoffLemmatizer and RegexpTagger.
- choose_tag(tokens, index, history)[source]¶
Use regular expressions for rules-based lemmatizing based on word endings; tokens are matched against patterns with the base kept as a group, and a word-ending replacement is appended to the (base) group.
- Return type:
str
- Parameters:
tokens (List[str]) – List of tokens to be lemmatized
index (int) – Index of the current token
history (List[str]) – List of tokens that have already been lemmatized; NOT USED
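A minimal usage sketch (the rule is invented for illustration): each rule is a (pattern, replacement) pair applied to the token, so the grouped base is kept and the matched ending rewritten:
>>> from cltk.lemmatize.backoff import RegexpLemmatizer
>>> rules = [(r'(\w*)abus$', r'\1a')]  # e.g. first-declension dat./abl. plural
>>> lemmatizer = RegexpLemmatizer(regexps=rules)
>>> list(lemmatizer.lemmatize(['deabus']))
[('deabus', 'dea')]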
8.1.8.4. cltk.lemmatize.fro module¶
Lemmatizer for Old French. Rules are based on Brunot & Bruneau (1949).
- class cltk.lemmatize.fro.OldFrenchDictionaryLemmatizer[source]¶
Bases:
DictionaryRegexLemmatizer
Naive lemmatizer for Old French.
>>> lemmatizer = OldFrenchDictionaryLemmatizer()
>>> lemmatizer.lemmatize_token('corant')
'corant'
>>> lemmatizer.lemmatize_token('corant', return_frequencies=True)
('corant', -9.319508628976836)
>>> lemmatizer.lemmatize_token('corant', return_frequencies=True, best_guess=False)
[('corir', 0), ('corant', -9.319508628976836)]
>>> lemmatizer.lemmatize(['corant', '.', 'vult', 'premir'], return_frequencies=True, best_guess=False)
[[('corir', 0), ('corant', -9.319508628976836)], [('PUNK', 0)], [('vout', -7.527749159748781)], [('premir', 0)]]
8.1.8.5. cltk.lemmatize.grc module¶
Module for lemmatizing Ancient Greek.
- class cltk.lemmatize.grc.GreekBackoffLemmatizer(train=None, seed=3, verbose=False)[source]¶
Bases:
object
Suggested backoff chain; includes at least one of each type of major sequential backoff class from backoff.py.
- lemmatize(tokens)[source]¶
Lemmatize a list of words.
>>> lemmatizer = GreekBackoffLemmatizer()
>>> from cltk.alphabet.text_normalization import cltk_normalize
>>> word = cltk_normalize('διοτρεφές')
>>> lemmatizer.lemmatize([word])
[('διοτρεφές', 'διοτρεφής')]
>>> republic = cltk_normalize("κατέβην χθὲς εἰς Πειραιᾶ μετὰ Γλαύκωνος τοῦ Ἀρίστωνος")
>>> lemmatizer.lemmatize(republic.split())
[('κατέβην', 'καταβαίνω'), ('χθὲς', 'χθές'), ('εἰς', 'εἰς'), ('Πειραιᾶ', 'Πειραιεύς'), ('μετὰ', 'μετά'), ('Γλαύκωνος', 'Γλαύκων'), ('τοῦ', 'ὁ'), ('Ἀρίστωνος', 'Ἀρίστων')]
8.1.8.6. cltk.lemmatize.lat module¶
Module for lemmatizing Latin.
- class cltk.lemmatize.lat.RomanNumeralLemmatizer(default=None, backoff=None)[source]¶
Bases:
RegexpLemmatizer
Lemmatizer for identifying Roman numerals in Latin text based on regex.
>>> lemmatizer = RomanNumeralLemmatizer()
>>> lemmatizer.lemmatize("i ii iii iv v vi vii vii ix x xx xxx xl l lx c cc".split())
[('i', 'NUM'), ('ii', 'NUM'), ('iii', 'NUM'), ('iv', 'NUM'), ('v', 'NUM'), ('vi', 'NUM'), ('vii', 'NUM'), ('vii', 'NUM'), ('ix', 'NUM'), ('x', 'NUM'), ('xx', 'NUM'), ('xxx', 'NUM'), ('xl', 'NUM'), ('l', 'NUM'), ('lx', 'NUM'), ('c', 'NUM'), ('cc', 'NUM')]
>>> lemmatizer = RomanNumeralLemmatizer(default="RN")
>>> lemmatizer.lemmatize('i ii iii'.split())
[('i', 'RN'), ('ii', 'RN'), ('iii', 'RN')]
- choose_tag(tokens, index, history)[source]¶
Use regular expressions for rules-based lemmatizing based on word endings; tokens are matched against patterns with the base kept as a group, and a word-ending replacement is appended to the (base) group.
- Return type:
str
- Parameters:
tokens (List[str]) – List of tokens to be lemmatized
index (int) – Index of the current token
history (List[str]) – List of tokens that have already been lemmatized; NOT USED
- class cltk.lemmatize.lat.LatinBackoffLemmatizer(train=None, seed=3, verbose=False)[source]¶
Bases:
object
Suggested backoff chain; includes at least one of each type of major sequential backoff class from backoff.py.
### Putting it all together
### BETA Version of the Backoff Lemmatizer AKA BackoffLatinLemmatizer
### For comparison, there is also a TrainLemmatizer that replicates the
### original Latin lemmatizer from cltk.stem
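Usage mirrors GreekBackoffLemmatizer above; the exact lemmata returned depend on the models installed in cltk_data, so no output is asserted in this sketch:
>>> lemmatizer = LatinBackoffLemmatizer()
>>> lemmas = lemmatizer.lemmatize('arma virumque cano'.split())  # list of (token, lemma) pairs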
8.1.8.7. cltk.lemmatize.naive_lemmatizer module¶
- class cltk.lemmatize.naive_lemmatizer.DictionaryRegexLemmatizer[source]¶
Bases:
ABC
Implementation of a lemmatizer based on a dictionary of lemmas and forms, backing off to regex rules. Since a given form may map to multiple lemmas, a corpus-based frequency disambiguator is employed.
Subclasses must provide methods to load dictionary and corpora, and to specify regular expressions.
- _relative_frequency(word)[source]¶
Computes the log relative frequency for a word form
- Return type:
float
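In the doctests below, forms absent from the corpus counts receive a frequency of 0, so this presumably computes the natural log of the form's share of all corpus tokens, with 0.0 for unseen forms. A hypothetical sketch (names and signature invented; the real method reads its counts from the instance):
import math

def relative_frequency(form, counts, total):
    # Hypothetical sketch: log of the form's corpus share; 0.0 if unseen.
    return math.log(counts[form] / total) if form in counts else 0.0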
- _apply_regex(token)[source]¶
Looks for a match in the regex rules with the token. If found, applies the replacement part of the rule to the token and returns the result. Else just returns the token unchanged.
- lemmatize_token(token, best_guess=True, return_frequencies=False)[source]¶
Lemmatize a single token. If best_guess is true, then take the most frequent lemma when a form has multiple possible lemmatizations. If the form is not found, just return it. If best_guess is false, then always return the full set of possible lemmas, or the empty list if none is found. If return_frequencies is true, then also return the relative frequency of the lemma in a corpus.
>>> from cltk.lemmatize.ang import OldEnglishDictionaryLemmatizer
>>> lemmatizer = OldEnglishDictionaryLemmatizer()
>>> lemmatizer.lemmatize_token('fōrestepeþ')
'foresteppan'
>>> lemmatizer.lemmatize_token('Caesar', return_frequencies=True, best_guess=True)
('Caesar', 0)
- Return type:
Union[str, List[Union[str, Tuple[str, float]]]]
- lemmatize(tokens, best_guess=True, return_frequencies=False)[source]¶
Lemmatize tokens in a list of strings.
>>> from cltk.lemmatize.ang import OldEnglishDictionaryLemmatizer
>>> lemmatizer = OldEnglishDictionaryLemmatizer()
>>> lemmatizer.lemmatize(['eotenas','ond','ylfe','ond','orcneas'], return_frequencies=True, best_guess=True)
[('eoten', -9.227295812625597), ('and', -2.8869365088978443), ('ylfe', -9.227295812625597), ('and', -2.8869365088978443), ('orcneas', -9.227295812625597)]
- Return type:
Union[str, List[Union[str, Tuple[str, float]]]]
8.1.8.8. cltk.lemmatize.processes module¶
Processes for lemmatization.
- class cltk.lemmatize.processes.LemmatizationProcess(language=None)[source]¶
Bases:
Process
To be inherited for each language’s lemmatization declarations.
Example: LemmatizationProcess -> LatinLemmatizationProcess
>>> from cltk.lemmatize.processes import LemmatizationProcess
>>> from cltk.core.data_types import Process
>>> issubclass(LemmatizationProcess, Process)
True
- class cltk.lemmatize.processes.GreekLemmatizationProcess(language=None)[source]¶
Bases:
LemmatizationProcess
The default Ancient Greek lemmatization algorithm.
>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers import MultilingualTokenizationProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Greek pipeline", processes=[MultilingualTokenizationProcess, GreekLemmatizationProcess], language=get_lang("grc"))
>>> nlp = NLP(language='grc', custom_pipeline=pipe, suppress_banner=True)
>>> nlp(get_example_text("grc")).lemmata[30:40]
['ἔλεγον.', 'καίτοι', 'ἀληθές', 'γε', 'ὡς', 'ἔπος', 'εἰπεῖν', 'οὐδὲν', 'εἰρήκασιν.', 'μάλιστα']
- description = 'Lemmatization process for Ancient Greek'¶
- algorithm¶
- class cltk.lemmatize.processes.LatinLemmatizationProcess(language=None)[source]¶
Bases:
LemmatizationProcess
The default Latin lemmatization algorithm.
>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers import LatinTokenizationProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Latin pipeline", processes=[LatinTokenizationProcess, LatinLemmatizationProcess], language=get_lang("lat"))
>>> nlp = NLP(language='lat', custom_pipeline=pipe, suppress_banner=True)
>>> nlp(get_example_text("lat")).lemmata[30:40]
['institutis', ',', 'legibus', 'inter', 'se', 'differunt', '.', 'Gallos', 'ab', 'Aquitanis']
- description = 'Lemmatization process for Latin'¶
- algorithm¶
- class cltk.lemmatize.processes.OldEnglishLemmatizationProcess(language=None)[source]¶
Bases:
LemmatizationProcess
The default Old English lemmatization algorithm.
>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers import MultilingualTokenizationProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Old English pipeline", processes=[MultilingualTokenizationProcess, OldEnglishLemmatizationProcess], language=get_lang("ang"))
>>> nlp = NLP(language='ang', custom_pipeline=pipe, suppress_banner=True)
>>> nlp(get_example_text("ang")).lemmata[30:40]
['siððan', 'ær', 'weorþan', 'feasceaft', 'findan', ',', 'he', 'se', 'frofre', 'gebidan']
- description = 'Lemmatization process for Old English'¶
- algorithm¶
- class cltk.lemmatize.processes.OldFrenchLemmatizationProcess(language=None)[source]¶
Bases:
LemmatizationProcess
The default Old French lemmatization algorithm.
>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers import MultilingualTokenizationProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Old French pipeline", processes=[MultilingualTokenizationProcess, OldFrenchLemmatizationProcess], language=get_lang("fro"))
>>> nlp = NLP(language='fro', custom_pipeline=pipe, suppress_banner=True)
>>> nlp(get_example_text("fro")).lemmata[30:40]
['avenir', 'jadis', 'en', 'bretaingne', 'avoir', '.I.', 'molt', 'riche', 'chevalier', 'PUNK']
- description = 'Lemmatization process for Old French'¶
- algorithm¶