8.1.19. cltk.tokenizers package¶
Init for cltk.tokenize.
8.1.19.1. Subpackages¶
8.1.19.2. Submodules¶
8.1.19.3. cltk.tokenizers.akk module¶
Code for word tokenization: Akkadian
- class cltk.tokenizers.akk.AkkadianWordTokenizer[source]¶
Bases:
WordTokenizer
Akkadian word and cuneiform tokenizer.
- tokenize(text)[source]¶
Operates on a single line of text and returns all words in the line as (word, language) tuples in a list.
Input: "1. isz-pur-ram a-na"
Output: [("isz-pur-ram", "akkadian"), ("a-na", "akkadian")]
- Param:
line: text string
- Returns:
list of tuples: (word, language)
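A minimal usage sketch, reusing the input/output pair documented above (the import path follows this module's name):
>>> from cltk.tokenizers.akk import AkkadianWordTokenizer
>>> tokenizer = AkkadianWordTokenizer()
>>> tokenizer.tokenize("1. isz-pur-ram a-na")
[('isz-pur-ram', 'akkadian'), ('a-na', 'akkadian')]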
- tokenize_sign(word)[source]¶
Takes a (word, language) tuple and splits the word into individual (sign, language) tuples in a list.
Input: ("{gisz}isz-pur-ram", "akkadian")
Output: [("gisz", "determinative"), ("isz", "akkadian"), ("pur", "akkadian"), ("ram", "akkadian")]
- Param:
(word, language) tuple, as produced by tokenize()
- Returns:
list of tuples: (sign, function or language)
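A corresponding sketch for tokenize_sign, again reusing the documented example tuple:
>>> from cltk.tokenizers.akk import AkkadianWordTokenizer
>>> tokenizer = AkkadianWordTokenizer()
>>> tokenizer.tokenize_sign(("{gisz}isz-pur-ram", "akkadian"))
[('gisz', 'determinative'), ('isz', 'akkadian'), ('pur', 'akkadian'), ('ram', 'akkadian')]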
8.1.19.4. cltk.tokenizers.arb module¶
Code for word tokenization: Arabic
- class cltk.tokenizers.arb.ArabicWordTokenizer[source]¶
Bases:
WordTokenizer
Class for word tokenization using the pyarabic package: https://pypi.org/project/PyArabic/
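A minimal usage sketch, assuming the class exposes the base tokenize(text) interface of WordTokenizer; the input is a fragment of the Arabic sample used by ArabicTokenizationProcess below, and the expected output is illustrative only:
>>> from cltk.tokenizers.arb import ArabicWordTokenizer
>>> tokenizer = ArabicWordTokenizer()
>>> tokenizer.tokenize("ذِكْرُ رَحْمَتِ رَبِّكَ")  # doctest: +SKIP
['ذِكْرُ', 'رَحْمَتِ', 'رَبِّكَ']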
8.1.19.5. cltk.tokenizers.enm module¶
Code for word tokenization: Middle English
- class cltk.tokenizers.enm.MiddleEnglishWordTokenizer[source]¶
Bases:
RegexWordTokenizer
A regex-based tokenizer for Middle English.
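A minimal usage sketch, assuming the regex patterns are bundled with the class so no constructor arguments are needed; the sample line and expected tokens follow the MiddleEnglishTokenizationProcess doctest later in this page. The Old French, Middle High German, and Old Norse regex tokenizers in the following sections are used the same way.
>>> from cltk.tokenizers.enm import MiddleEnglishWordTokenizer
>>> tokenizer = MiddleEnglishWordTokenizer()
>>> tokenizer.tokenize("Whilom, as olde stories tellen")  # doctest: +SKIP
['Whilom', ',', 'as', 'olde', 'stories', 'tellen']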
8.1.19.6. cltk.tokenizers.fro module¶
Code for word tokenization: Old French
- class cltk.tokenizers.fro.OldFrenchWordTokenizer[source]¶
Bases:
RegexWordTokenizer
A regex-based tokenizer for Old French.
8.1.19.7. cltk.tokenizers.gmh module¶
Code for word tokenization: Middle High German
- class cltk.tokenizers.gmh.MiddleHighGermanWordTokenizer[source]¶
Bases:
RegexWordTokenizer
A regex-based tokenizer for Middle High German.
8.1.19.8. cltk.tokenizers.line module¶
Tokenize lines.
- class cltk.tokenizers.line.LineTokenizer(language)[source]¶
Bases:
object
Tokenize text by line; designed for study of poetry.
- tokenize(untokenized_string, include_blanks=False)[source]¶
Tokenize lines by newline ('\n').
- Parameters:
untokenized_string (str): A string containing one or more sentences.
include_blanks (bool): If True, blank lines are preserved as '' in the returned list of strings. Default is False.
- Return type:
list of strings
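A minimal usage sketch, assuming newline splitting as described above; the Latin verse is illustrative filler, and the second call shows the documented include_blanks behavior:
>>> from cltk.tokenizers.line import LineTokenizer
>>> tokenizer = LineTokenizer('latin')
>>> untokenized_text = "Primus versus hic est.\nSecundus versus hic est.\n\nTertius versus hic est."
>>> tokenizer.tokenize(untokenized_text)  # doctest: +SKIP
['Primus versus hic est.', 'Secundus versus hic est.', 'Tertius versus hic est.']
>>> tokenizer.tokenize(untokenized_text, include_blanks=True)  # doctest: +SKIP
['Primus versus hic est.', 'Secundus versus hic est.', '', 'Tertius versus hic est.']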
8.1.19.9. cltk.tokenizers.non module¶
Code for word tokenization: Old Norse
- class cltk.tokenizers.non.OldNorseWordTokenizer[source]¶
Bases:
RegexWordTokenizer
A regex-based tokenizer for Old Norse.
8.1.19.10. cltk.tokenizers.processes module¶
Module for tokenizers.
TODO: Think about adding a check somewhere for when a contributor (not a user) chooses an unavailable item.
- class cltk.tokenizers.processes.TokenizationProcess(language=None)[source]¶
Bases:
Process
To be inherited for each language’s tokenization declarations.
Example:
TokenizationProcess -> LatinTokenizationProcess
>>> from cltk.tokenizers.processes import TokenizationProcess
>>> from cltk.core.data_types import Process
>>> issubclass(TokenizationProcess, Process)
True
>>> tok = TokenizationProcess()
- algorithm¶
The backoff tokenizer, from NLTK.
- class cltk.tokenizers.processes.MultilingualTokenizationProcess(language=None)[source]¶
Bases:
TokenizationProcess
The default tokenization algorithm.
>>> from cltk.tokenizers.processes import MultilingualTokenizationProcess
>>> from cltk.core.data_types import Doc
>>> from cltk.languages.example_texts import get_example_text
>>> tokenizer_process = MultilingualTokenizationProcess()
>>> output_doc = tokenizer_process.run(Doc(raw=get_example_text("non")[:29]))
>>> output_doc.tokens
['Gylfi', 'konungr', 'réð', 'þar', 'löndum']
>>> [word.index_char_start for word in output_doc.words]
[0, 6, 14, 18, 22]
>>> [word.index_char_stop for word in output_doc.words]
[5, 13, 17, 21, 28]
- description = 'Default tokenizer for languages lacking a dedicated tokenizer. This is a whitespace tokenizer inheriting from the NLTK.'¶
- class cltk.tokenizers.processes.AkkadianTokenizationProcess(language=None)[source]¶
Bases:
TokenizationProcess
The default Akkadian tokenization algorithm.
TODO: Change function or post-process to separate tokens from language labels
>>> from cltk.tokenizers import AkkadianTokenizationProcess
>>> from cltk.core.data_types import Doc
>>> from cltk.languages.example_texts import get_example_text
>>> tokenizer_process = AkkadianTokenizationProcess()
>>> output_doc = tokenizer_process.run(input_doc=Doc(raw=get_example_text("akk")))
>>> output_doc.tokens
[('u2-wa-a-ru', 'akkadian'), ('at-ta', 'akkadian'), ('e2-kal2-la-ka', 'akkadian'), ('_e2_-ka', 'sumerian'), ('wu-e-er', 'akkadian')]
- description = 'Default tokenizer for the Akkadian language.'¶
- algorithm¶
- class cltk.tokenizers.processes.ArabicTokenizationProcess(language=None)[source]¶
Bases:
TokenizationProcess
The default Arabic tokenization algorithm.
>>> from cltk.tokenizers import ArabicTokenizationProcess
>>> from cltk.core.data_types import Doc
>>> from cltk.languages.example_texts import get_example_text
>>> tokenizer_process = ArabicTokenizationProcess()
>>> output_doc = tokenizer_process.run(input_doc=Doc(raw=get_example_text("arb")[:34]))
>>> output_doc.tokens
['كهيعص', '﴿', '١', '﴾', 'ذِكْرُ', 'رَحْمَتِ', 'رَبِّكَ']
- description = 'Default tokenizer for the Arabic language.'¶
- algorithm¶
- class cltk.tokenizers.processes.GreekTokenizationProcess(language=None)[source]¶
Bases:
TokenizationProcess
The default Greek tokenization algorithm.
>>> from cltk.tokenizers import GreekTokenizationProcess
>>> from cltk.core.data_types import Doc
>>> from cltk.languages.example_texts import get_example_text
>>> tokenizer_process = GreekTokenizationProcess()
>>> output_doc = tokenizer_process.run(input_doc=Doc(raw=get_example_text("grc")[:23]))
>>> output_doc.tokens
['ὅτι', 'μὲν', 'ὑμεῖς', ',', 'ὦ', 'ἄνδρες']
- description = 'Default tokenizer for the Greek language.'¶
- algorithm¶
- class cltk.tokenizers.processes.LatinTokenizationProcess(language=None)[source]¶
Bases:
TokenizationProcess
The default Latin tokenization algorithm.
>>> from cltk.tokenizers import LatinTokenizationProcess
>>> from cltk.core.data_types import Doc
>>> from cltk.languages.example_texts import get_example_text
>>> tokenizer_process = LatinTokenizationProcess()
>>> output_doc = tokenizer_process.run(input_doc=Doc(raw=get_example_text("lat")[:23]))
>>> output_doc.tokens
['Gallia', 'est', 'omnis', 'divisa']
- description = 'Default tokenizer for the Latin language.'¶
- algorithm¶
- class cltk.tokenizers.processes.MiddleHighGermanTokenizationProcess(language=None)[source]¶
Bases:
TokenizationProcess
The default Middle High German tokenization algorithm.
>>> from cltk.tokenizers.processes import MiddleHighGermanTokenizationProcess
>>> from cltk.core.data_types import Doc
>>> from cltk.languages.example_texts import get_example_text
>>> tokenizer_process = MiddleHighGermanTokenizationProcess()
>>> output_doc = tokenizer_process.run(input_doc=Doc(raw=get_example_text("gmh")[:29]))
>>> output_doc.tokens
['Uns', 'ist', 'in', 'alten', 'mæren', 'wunder']
- description = 'Default Middle High German tokenizer'¶
- algorithm¶
- class cltk.tokenizers.processes.MiddleEnglishTokenizationProcess(language=None)[source]¶
Bases:
TokenizationProcess
The default Middle English tokenization algorithm.
>>> from cltk.tokenizers import MiddleEnglishTokenizationProcess
>>> from cltk.core.data_types import Doc
>>> from cltk.languages.example_texts import get_example_text
>>> tokenizer_process = MiddleEnglishTokenizationProcess()
>>> output_doc = tokenizer_process.run(input_doc=Doc(raw=get_example_text("enm")[:31]))
>>> output_doc.tokens
['Whilom', ',', 'as', 'olde', 'stories', 'tellen']
- description = 'Default Middle English tokenizer'¶
- algorithm¶
- class cltk.tokenizers.processes.OldFrenchTokenizationProcess(language=None)[source]¶
Bases:
TokenizationProcess
The default Old French tokenization algorithm.
>>> from cltk.tokenizers import OldFrenchTokenizationProcess
>>> from cltk.core.data_types import Doc
>>> from cltk.languages.example_texts import get_example_text
>>> tok = OldFrenchTokenizationProcess()
>>> output_doc = tok.run(input_doc=Doc(raw=get_example_text("fro")[:37]))
>>> output_doc.tokens
['Une', 'aventure', 'vos', 'voil', 'dire', 'Molt', 'bien']
- description = 'Default tokenizer for the Old French language.'¶
- algorithm¶
- class cltk.tokenizers.processes.MiddleFrenchTokenizationProcess(language=None)[source]¶
Bases:
TokenizationProcess
The default Middle French tokenization algorithm.
>>> from cltk.tokenizers import MiddleFrenchTokenizationProcess
>>> from cltk.core.data_types import Doc
>>> from cltk.languages.example_texts import get_example_text
>>> tokenizer_process = MiddleFrenchTokenizationProcess()
>>> output_doc = tokenizer_process.run(input_doc=Doc(raw=get_example_text("frm")[:37]))
>>> output_doc.tokens
['Attilius', 'Regulus', ',', 'general', 'de', "l'", 'armée']
- description = 'Default tokenizer for the Middle French language.'¶
- algorithm¶
- class cltk.tokenizers.processes.OldNorseTokenizationProcess(language=None)[source]¶
Bases:
TokenizationProcess
The default Old Norse tokenization algorithm.
>>> from cltk.tokenizers import OldNorseTokenizationProcess
>>> from cltk.core.data_types import Doc
>>> from cltk.languages.example_texts import get_example_text
>>> tok = OldNorseTokenizationProcess()
>>> output_doc = tok.run(input_doc=Doc(raw=get_example_text("non")[:29]))
>>> output_doc.tokens
['Gylfi', 'konungr', 'réð', 'þar', 'löndum']
- description = 'Default Old Norse tokenizer'¶
- algorithm¶
8.1.19.11. cltk.tokenizers.utils module¶
Tokenization utilities.
TODO: KJ consider moving to scripts dir.
8.1.19.12. cltk.tokenizers.word module¶
Language-specific word tokenizers. Primary purpose is to handle enclitics.
- class cltk.tokenizers.word.WordTokenizer[source]¶
Bases:
object
Base class for word tokenizers.
- abstract tokenize(text, model=None)[source]¶
Create a list of tokens from a string. This method should be overridden by subclasses of WordTokenizer.
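As a sketch of the intended override pattern, a hypothetical whitespace-splitting subclass (not part of CLTK, shown only for illustration) might look like this:
>>> from cltk.tokenizers.word import WordTokenizer
>>> class WhitespaceWordTokenizer(WordTokenizer):  # hypothetical example subclass
...     def tokenize(self, text, model=None):
...         return text.split()
>>> WhitespaceWordTokenizer().tokenize("arma virumque cano")
['arma', 'virumque', 'cano']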
- class cltk.tokenizers.word.PunktWordTokenizer(sent_tokenizer=None)[source]¶
Bases:
WordTokenizer
Class for Punkt word tokenization.
- class cltk.tokenizers.word.RegexWordTokenizer(patterns=None)[source]¶
Bases:
WordTokenizer
Class for regex-based word tokenization.
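A minimal sketch of direct use, assuming (this is an assumption about the internal format, not documented here) that patterns is a sequence of (regex, replacement) pairs applied to the text before splitting; in practice the language-specific subclasses above supply their own patterns:
>>> from cltk.tokenizers.word import RegexWordTokenizer
>>> patterns = [(r"-", " ")]  # assumed (regex, replacement) pair format
>>> tokenizer = RegexWordTokenizer(patterns=patterns)
>>> tokenizer.tokenize("twelf-month")  # doctest: +SKIP
['twelf', 'month']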