8.1.15. cltk.stem package
Init for cltk.stem.
8.1.15.1. Submodules
8.1.15.2. cltk.stem.akk module
Stemmer for Akkadian.
8.1.15.3. cltk.stem.enm module
- cltk.stem.enm.stem(word, exception_list={}, strip_pref=True, strip_suf=True)
- Parameters:
word – the word to stem (a string)
exception_list – dictionary mapping words to user-defined stems
strip_pref – whether to strip prefixes
strip_suf – whether to strip suffixes
The affix stemmer works by rule-based stripping. It can strip prefixes,
>>> from cltk.stem.enm import stem
>>> stem('yesterday')
'day'
suffixes,
>>> stem('likingnes')
'liking'
or both:
>>> stem('yisterdayes')
'day'
You can also control whether the stemmer strips suffixes
>>> stem('yisterdayes', strip_suf=False)
'dayes'
or prefixes:
>>> stem('yisterdayes', strip_pref=False)
'yisterday'
The stemmer also accepts a user-defined dictionary, which essentially serves as a dictionary look-up stemmer:
>>> stem('arisnesse', exception_list={'arisnesse': 'rise'})
'rise'
- Return type:
str
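The rule-based stripping described above can be illustrated with a minimal, self-contained sketch. It is not the implementation in cltk.stem.enm: the function name `toy_stem`, the two affix lists, and the minimum stem length of three letters are all assumptions made for this example.

```python
# Hypothetical affix inventories, chosen only for illustration;
# they are NOT the lists used by cltk.stem.enm.
TOY_PREFIXES = ("yester", "yister", "for", "un")
TOY_SUFFIXES = ("nesse", "nes", "es", "ly")

# Assumed lower bound on what must remain after stripping an affix.
MIN_STEM_LEN = 3


def toy_stem(word, exception_list=None, strip_pref=True, strip_suf=True):
    """Strip at most one prefix and one suffix from ``word``."""
    # A user-supplied exception takes priority over every rule.
    if exception_list and word in exception_list:
        return exception_list[word]

    if strip_pref:
        # Try longer prefixes first so the most specific rule applies.
        for prefix in sorted(TOY_PREFIXES, key=len, reverse=True):
            if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
                word = word[len(prefix):]
                break

    if strip_suf:
        # Same strategy for suffixes: longest match wins.
        for suffix in sorted(TOY_SUFFIXES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
                word = word[: -len(suffix)]
                break

    return word


print(toy_stem("yisterdayes"))
print(toy_stem("yisterdayes", strip_suf=False))
print(toy_stem("yisterdayes", strip_pref=False))
print(toy_stem("arisnesse", exception_list={"arisnesse": "rise"}))
```

Checking the exception dictionary before any rule is applied is what lets a single function double as a look-up stemmer for irregular forms.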
8.1.15.4. cltk.stem.fro module
Stemmer for Old French.
8.1.15.5. cltk.stem.gmh module
8.1.15.6. cltk.stem.lat module
Stem Latin words with an implementation of the Schinke Latin stemming algorithm (Schinke R, Greengrass M, Robertson AM and Willett P. (1996). ‘A stemming algorithm for Latin text databases’. Journal of Documentation, 52: 172-187).
Todo
Make this stemmer like lemma, with import from stem dir.
- cltk.stem.lat._checkremove_que(word)
If word ends in -que and is not in the pass list, strip -que.
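The -que check can be sketched in isolation as follows. This is an illustration under stated assumptions, not the library's code: the function name `strip_enclitic_que` and the short `TOY_QUE_PASS_LIST` are hypothetical, and the pass list used by the actual stemmer is longer.

```python
# A small, illustrative subset of Latin words whose final "-que" belongs to
# the word itself rather than being the enclitic "and". Hypothetical list;
# the real pass list contains more entries.
TOY_QUE_PASS_LIST = frozenset(
    {"atque", "quoque", "neque", "itaque", "usque", "denique"}
)


def strip_enclitic_que(word, pass_list=TOY_QUE_PASS_LIST):
    """Return ``word`` without a trailing -que unless it is in ``pass_list``."""
    # The comparison is case-sensitive; callers are assumed to lowercase first.
    if word.endswith("que") and word not in pass_list:
        return word[:-3]
    return word


print(strip_enclitic_que("populusque"))
print(strip_enclitic_que("atque"))
```

Words in the pass list are returned unchanged, which keeps forms such as atque from being truncated.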
8.1.15.7. cltk.stem.processes module
Processes for stemming.
- class cltk.stem.processes.StemmingProcess(language=None)
Bases: Process
To be inherited for each language’s stemming declarations.
Example: StemmingProcess -> LatinStemmingProcess
>>> from cltk.stem.processes import StemmingProcess
>>> from cltk.core.data_types import Process
>>> issubclass(StemmingProcess, Process)
True
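The inheritance pattern (a generic process, a stemming base class, and one subclass per language) can be sketched without CLTK installed. Every name below (`ToyWord`, `ToyDoc`, `ToyProcess`, `ToyStemmingProcess`, `ToyLatinStemmingProcess`) is hypothetical, and the Latin rule is a placeholder rather than the Schinke algorithm; the sketch only shows how a subclass might supply its own algorithm while reusing shared logic.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyWord:
    """Hypothetical token holding a surface string and, once set, a stem."""
    string: str
    stem: Optional[str] = None


@dataclass
class ToyDoc:
    """Hypothetical document: just a list of words."""
    words: List[ToyWord] = field(default_factory=list)


class ToyProcess:
    """Stand-in for a generic pipeline step."""

    def __init__(self, language=None):
        self.language = language

    def run(self, doc):
        raise NotImplementedError


class ToyStemmingProcess(ToyProcess):
    """Shared stemming logic; subclasses only override ``algorithm``."""

    def algorithm(self, word):
        raise NotImplementedError

    def run(self, doc):
        # Apply the language-specific algorithm to every word in the document.
        for word in doc.words:
            word.stem = self.algorithm(word.string)
        return doc


class ToyLatinStemmingProcess(ToyStemmingProcess):
    description = "Toy stemmer for Latin."

    def algorithm(self, word):
        # Placeholder rule (NOT the Schinke algorithm): drop a final "-que".
        if word.endswith("que") and len(word) > 3:
            return word[:-3]
        return word


doc = ToyDoc(words=[ToyWord("senatus"), ToyWord("populusque")])
stems = [w.stem for w in ToyLatinStemmingProcess(language="lat").run(doc).words]
print(stems)
```

Keeping the loop over words in the base class means a new language needs only a new `algorithm` method and a description.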
- class cltk.stem.processes.LatinStemmingProcess(language=None)
Bases: StemmingProcess
The default Latin stemming algorithm.
>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.stem.processes import LatinStemmingProcess
>>> from cltk.tokenizers import LatinTokenizationProcess
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.languages.utils import get_lang
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(
...     description="A custom Latin pipeline",
...     processes=[LatinTokenizationProcess, LatinStemmingProcess],
...     language=get_lang("lat"),
... )
>>> nlp = NLP(language='lat', custom_pipeline=pipe, suppress_banner=True)
>>> nlp(get_example_text("lat")[:23]).stems
['Gall', 'est', 'omn', 'divis']
- description = 'Default stemmer for the Latin language.'
- class cltk.stem.processes.MiddleEnglishStemmingProcess(language=None)
Bases: StemmingProcess
The default Middle English stemming algorithm.
>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.stem.processes import MiddleEnglishStemmingProcess
>>> from cltk.tokenizers import MiddleEnglishTokenizationProcess
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.languages.utils import get_lang
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(
...     description="A custom Middle English pipeline",
...     processes=[MiddleEnglishTokenizationProcess, MiddleEnglishStemmingProcess],
...     language=get_lang("enm"),
... )
>>> nlp = NLP(language='enm', custom_pipeline=pipe, suppress_banner=True)
>>> nlp(get_example_text("enm")[:29]).stems
['Whil', ',', 'as', 'olde', 'stor', 'lle']
- description = 'Default stemmer for the Middle English language.'
- class cltk.stem.processes.MiddleHighGermanStemmingProcess(language=None)
Bases: StemmingProcess
The default Middle High German stemming algorithm.
>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.stem.processes import MiddleHighGermanStemmingProcess
>>> from cltk.tokenizers import MiddleHighGermanTokenizationProcess
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.languages.utils import get_lang
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(
...     description="A custom MHG pipeline",
...     processes=[MiddleHighGermanTokenizationProcess, MiddleHighGermanStemmingProcess],
...     language=get_lang("gmh"),
... )
>>> nlp = NLP(language='gmh', custom_pipeline=pipe, suppress_banner=True)
>>> nlp(get_example_text("gmh")[:29]).stems
['uns', 'ist', 'in', 'alten', 'mær', 'wund']
- description = 'Default stemmer for the Middle High German language.'
- class cltk.stem.processes.OldFrenchStemmingProcess(language=None)
Bases: StemmingProcess
The default Old French stemming algorithm.
>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.stem.processes import OldFrenchStemmingProcess
>>> from cltk.tokenizers import OldFrenchTokenizationProcess
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.languages.utils import get_lang
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(
...     description="A custom Old French pipeline",
...     processes=[OldFrenchTokenizationProcess, OldFrenchStemmingProcess],
...     language=get_lang("fro"),
... )
>>> nlp = NLP(language='fro', custom_pipeline=pipe, suppress_banner=True)
>>> nlp(get_example_text("fro")[:29]).stems
['un', 'aventu', 'vos', 'voil', 'di', 'mo']
- description = 'Default stemmer for the Old French language.'