8.1.15. cltk.stem package

Init for cltk.stem.

8.1.15.1. Submodules

8.1.15.2. cltk.stem.akk module

Stemmer for Akkadian.

cltk.stem.akk.stem(noun, gender, mimation=True)[source]

Return the stem of a noun, given a declined form and its gender.

>>> stem("šarrū", 'm')
'šarr'
Return type:

str

8.1.15.3. cltk.stem.enm module

cltk.stem.enm.stem(word, exception_list={}, strip_pref=True, strip_suf=True)[source]
Parameters:

word – the word to stem (string)

The affix stemmer works by rule-based stripping. It can work on prefixes,

>>> stem('yesterday')
'day'

suffixes,

>>> stem('likingnes')
'liking'

or both

>>> stem('yisterdayes')
'day'

You can also define whether the stemmer will strip suffixes

>>> stem('yisterdayes', strip_suf = False)
'dayes'

or prefixes

>>> stem('yisterdayes', strip_pref = False)
'yisterday'

The stemmer also accepts a user-defined dictionary of exceptions, which effectively turns it into a dictionary look-up stemmer for the listed words

>>> stem('arisnesse', exception_list = {'arisnesse':'rise'})
'rise'
Return type:

str
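The rule-based stripping described above can be sketched in a few lines. This is a minimal illustration, not the code in cltk.stem.enm: the name affix_stem and the short PREFIXES and SUFFIXES tuples are hypothetical, chosen only so that the 'yisterdayes', 'likingnes' and 'arisnesse' examples behave as documented.

```python
# Hypothetical affix lists for illustration only; the real module
# defines its own, much longer lists.
PREFIXES = ("yister", "for", "un")
SUFFIXES = ("nesse", "nes", "es", "ing")


def affix_stem(word, exception_list=None, strip_pref=True, strip_suf=True):
    """Strip at most one prefix and one suffix from ``word``."""
    exceptions = exception_list or {}
    # Dictionary look-up takes priority over the stripping rules.
    if word in exceptions:
        return exceptions[word]
    if strip_pref:
        for prefix in PREFIXES:
            # Leave at least two characters so the stem is never empty.
            if word.startswith(prefix) and len(word) - len(prefix) >= 2:
                word = word[len(prefix):]
                break
    if strip_suf:
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 2:
                word = word[: -len(suffix)]
                break
    return word


print(affix_stem("yisterdayes"))
print(affix_stem("yisterdayes", strip_suf=False))
print(affix_stem("arisnesse", exception_list={"arisnesse": "rise"}))
```

The sketch uses None as the default for exception_list rather than a mutable {} default, which avoids sharing one dictionary across calls.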

8.1.15.4. cltk.stem.fro module

Stemmer for Old French.

cltk.stem.fro._matchremove_noun_endings(word)[source]

Remove the noun and adverb word endings

Return type:

str

cltk.stem.fro._matchremove_verb_endings(word)[source]

Remove the verb endings

Return type:

str

cltk.stem.fro.stem(word)[source]

Stem a word of Old French.

>>> stem('departissent')
'depart'
>>> stem('talent')
'talent'
Return type:

str

8.1.15.5. cltk.stem.gmh module

cltk.stem.gmh._stem_helper(word, rem_umlaut=True)[source]

rem_umlaut: remove umlauts from the text

cltk.stem.gmh.stem(word, exceptions={}, rem_umlauts=True)[source]

Stem a Middle High German word.

rem_umlauts: choose whether to remove umlauts from the string

exceptions: hard-coded dictionary for the cases where the algorithm fails

>>> stem('tagen')
'tag'
Return type:

str
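How the two options interact can be shown with a small, self-contained sketch. It assumes the exceptions dictionary is consulted before any rule is applied. The umlaut table, the two suffixes and the name stem_sketch are hypothetical and far simpler than whatever cltk.stem.gmh actually uses.

```python
# Hypothetical umlaut table; covers only three characters for illustration.
UMLAUT_MAP = str.maketrans({"ä": "a", "ö": "o", "ü": "u"})

# Hypothetical suffix list, longest first.
SUFFIXES = ("en", "e")


def stem_sketch(word, exceptions=None, rem_umlauts=True):
    """Look the word up in ``exceptions``; otherwise normalise and strip."""
    exceptions = exceptions or {}
    if word in exceptions:
        return exceptions[word]
    if rem_umlauts:
        word = word.translate(UMLAUT_MAP)
    for suffix in SUFFIXES:
        # Require a stem of at least three characters.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(stem_sketch("tagen"))
print(stem_sketch("künege", rem_umlauts=False))
print(stem_sketch("geben", exceptions={"geben": "geb"}))
```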

8.1.15.6. cltk.stem.lat module

Stem Latin words with an implementation of the Schinke Latin stemming algorithm (Schinke R, Greengrass M, Robertson AM and Willett P. (1996). ‘A stemming algorithm for Latin text databases’. Journal of Documentation, 52: 172-187).

Todo

Make this stemmer like lemma, with import from stem dir.

cltk.stem.lat._checkremove_que(word)[source]

If word ends in -que and if word is not in pass list, strip -que
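A minimal sketch of this check follows. The pass list is a hypothetical four-word sample and the function returns only the string; the real helper's pass list and return value may differ.

```python
# Hypothetical sample of words whose final -que is part of the word
# itself rather than the enclitic conjunction.
QUE_PASS_LIST = {"atque", "quoque", "neque", "itaque"}


def check_remove_que(word):
    """Strip a trailing -que unless the word is in the pass list."""
    if word in QUE_PASS_LIST:
        return word
    if word.endswith("que"):
        return word[:-3]
    return word


print(check_remove_que("populusque"))
print(check_remove_que("atque"))
```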

cltk.stem.lat._matchremove_simple_endings(word)[source]

Remove the noun, adjective, adverb word endings
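Ending removal of this kind is commonly done by longest-match suffix stripping with a minimum stem length. The sketch below assumes that approach, with an abbreviated, illustrative suffix list and a hypothetical function name; it is not the module's own table or code.

```python
# Abbreviated, illustrative list of nominal endings, sorted longest first
# so that the longest matching ending is the one removed.
SIMPLE_ENDINGS = sorted(
    ("ibus", "ius", "ae", "am", "as", "em", "es", "is", "os",
     "um", "us", "a", "e", "i", "o", "u"),
    key=len,
    reverse=True,
)


def remove_simple_ending(word, min_stem_length=2):
    """Remove the longest matching ending if the stem stays long enough."""
    for ending in SIMPLE_ENDINGS:
        if word.endswith(ending):
            stem = word[: -len(ending)]
            # The first (longest) match decides; shorter endings are
            # not tried if the resulting stem would be too short.
            return stem if len(stem) >= min_stem_length else word
    return word


print(remove_simple_ending("interdum"))
print(remove_simple_ending("mercaturis"))
```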

cltk.stem.lat._matchremove_verb_endings(word)[source]

Remove the verb endings

cltk.stem.lat.stem(word)[source]

Stem a Latin word.

>>> stem('interdum')
'interd'
>>> stem('mercaturis')
'mercatur'
Return type:

str

8.1.15.7. cltk.stem.processes module

Processes for stemming.

class cltk.stem.processes.StemmingProcess(language: str = None)[source]

Bases: cltk.core.data_types.Process

To be inherited for each language’s stemming declarations.

Example: StemmingProcess -> LatinStemmingProcess

>>> from cltk.stem.processes import StemmingProcess
>>> from cltk.core.data_types import Process
>>> issubclass(StemmingProcess, Process)
True
run(input_doc)[source]
Return type:

Doc
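The inheritance pattern can be illustrated without importing CLTK. In this sketch every class name (SketchWord, SketchDoc, SketchStemmingProcess, TrailingSStemmingProcess) is hypothetical, and the per-word loop inside run() is an assumption about how a base process might apply a language-specific algorithm; the source only states that run() returns a Doc and that subclasses provide a static algorithm(word).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchWord:
    """Stand-in for a token; not the CLTK Word type."""
    string: str
    stem: Optional[str] = None


@dataclass
class SketchDoc:
    """Stand-in for a document; not the CLTK Doc type."""
    words: List[SketchWord] = field(default_factory=list)


class SketchStemmingProcess:
    """Base class: subclasses supply ``algorithm``."""

    @staticmethod
    def algorithm(word: str) -> str:
        raise NotImplementedError

    def run(self, input_doc: SketchDoc) -> SketchDoc:
        # Assumed behaviour: stem every word and store the result on it.
        for word in input_doc.words:
            word.stem = self.algorithm(word.string)
        return input_doc


class TrailingSStemmingProcess(SketchStemmingProcess):
    """Toy subclass that drops a final 's' from words longer than two letters."""

    @staticmethod
    def algorithm(word: str) -> str:
        return word[:-1] if word.endswith("s") and len(word) > 2 else word


doc = SketchDoc([SketchWord("stems"), SketchWord("is")])
result = TrailingSStemmingProcess().run(doc)
print([w.stem for w in result.words])
```

Keeping algorithm static means a subclass only needs to point at a plain function that maps a string to a string.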

class cltk.stem.processes.LatinStemmingProcess(language: str = None)[source]

Bases: cltk.stem.processes.StemmingProcess

The default Latin stemming algorithm.

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.stem.processes import LatinStemmingProcess
>>> from cltk.tokenizers import LatinTokenizationProcess
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.languages.utils import get_lang
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Latin pipeline", processes=[LatinTokenizationProcess, LatinStemmingProcess], language=get_lang("lat"))
>>> nlp = NLP(language='lat', custom_pipeline = pipe, suppress_banner=True)
>>> nlp(get_example_text("lat")[:23]).stems
['Gall', 'est', 'omn', 'divis']
description = 'Default stemmer for the Latin language.'
static algorithm(word)[source]
Return type:

str

class cltk.stem.processes.MiddleEnglishStemmingProcess(language: str = None)[source]

Bases: cltk.stem.processes.StemmingProcess

The default Middle English stemming algorithm.

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.stem.processes import MiddleEnglishStemmingProcess
>>> from cltk.tokenizers import MiddleEnglishTokenizationProcess
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.languages.utils import get_lang
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Middle English pipeline", processes=[MiddleEnglishTokenizationProcess, MiddleEnglishStemmingProcess], language=get_lang("enm"))
>>> nlp = NLP(language='enm', custom_pipeline = pipe, suppress_banner=True)
>>> nlp(get_example_text("enm")[:29]).stems
['Whil', ',', 'as', 'olde', 'stor', 'lle']
description = 'Default stemmer for the Middle English language.'
static algorithm(word)[source]
Return type:

str

class cltk.stem.processes.MiddleHighGermanStemmingProcess(language: str = None)[source]

Bases: cltk.stem.processes.StemmingProcess

The default Middle High German stemming algorithm.

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.stem.processes import MiddleHighGermanStemmingProcess
>>> from cltk.tokenizers import MiddleHighGermanTokenizationProcess
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.languages.utils import get_lang
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom MHG pipeline", processes=[MiddleHighGermanTokenizationProcess, MiddleHighGermanStemmingProcess], language=get_lang("gmh"))
>>> nlp = NLP(language='gmh', custom_pipeline = pipe, suppress_banner=True)
>>> nlp(get_example_text("gmh")[:29]).stems
['uns', 'ist', 'in', 'alten', 'mær', 'wund']
description = 'Default stemmer for the Middle High German language.'
static algorithm(word)[source]
Return type:

str

class cltk.stem.processes.OldFrenchStemmingProcess(language: str = None)[source]

Bases: cltk.stem.processes.StemmingProcess

The default Old French stemming algorithm.

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.stem.processes import OldFrenchStemmingProcess
>>> from cltk.tokenizers import OldFrenchTokenizationProcess
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.languages.utils import get_lang
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Old French pipeline", processes=[OldFrenchTokenizationProcess, OldFrenchStemmingProcess], language=get_lang("fro"))
>>> nlp = NLP(language='fro', custom_pipeline = pipe, suppress_banner=True)
>>> nlp(get_example_text("fro")[:29]).stems
['un', 'aventu', 'vos', 'voil', 'di', 'mo']
description = 'Default stemmer for the Old French language.'
static algorithm(word)[source]
Return type:

str