8.1.19. cltk.tokenizers package
Init for cltk.tokenizers.
8.1.19.1. Subpackages
8.1.19.2. Submodules
8.1.19.3. cltk.tokenizers.akk module
Code for word tokenization: Akkadian.

class cltk.tokenizers.akk.AkkadianWordTokenizer
Bases: cltk.tokenizers.word.WordTokenizer
Akkadian word and cuneiform tokenizer.

tokenize(text)
Operates on a single line of text and returns all words in the line as a list of tuples.
input: "1. isz-pur-ram a-na"
output: [("isz-pur-ram", "akkadian"), ("a-na", "akkadian")]
Parameters: text – a line of text as a string
Returns: list of tuples: (word, language)
tokenize_sign(word)
Takes a (word, language) tuple and splits the word into a list of individual (sign, language) tuples.
input: ("{gisz}isz-pur-ram", "akkadian")
output: [("gisz", "determinative"), ("isz", "akkadian"), ("pur", "akkadian"), ("ram", "akkadian")]
Parameters: word – a (word, language) tuple, as produced by tokenize()
Returns: list of tuples: (sign, function or language)
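A minimal usage sketch built from the documented input/output above:
>>> from cltk.tokenizers.akk import AkkadianWordTokenizer
>>> tokenizer = AkkadianWordTokenizer()
>>> tokenizer.tokenize("1. isz-pur-ram a-na")
[('isz-pur-ram', 'akkadian'), ('a-na', 'akkadian')]
>>> tokenizer.tokenize_sign(("{gisz}isz-pur-ram", "akkadian"))
[('gisz', 'determinative'), ('isz', 'akkadian'), ('pur', 'akkadian'), ('ram', 'akkadian')]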
8.1.19.4. cltk.tokenizers.arb module
Code for word tokenization: Arabic.

class cltk.tokenizers.arb.ArabicWordTokenizer
Bases: cltk.tokenizers.word.WordTokenizer
Word tokenizer using the pyarabic package: https://pypi.org/project/PyArabic/
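A minimal usage sketch; the sample string is borrowed from the ArabicTokenizationProcess doctest below, and no output is asserted here:
>>> from cltk.tokenizers.arb import ArabicWordTokenizer
>>> tokenizer = ArabicWordTokenizer()
>>> tokens = tokenizer.tokenize("ذِكْرُ رَحْمَتِ رَبِّكَ")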
8.1.19.5. cltk.tokenizers.enm module
Code for word tokenization: Middle English.

class cltk.tokenizers.enm.MiddleEnglishWordTokenizer
Bases: cltk.tokenizers.word.RegexWordTokenizer
A regex-based tokenizer for Middle English.
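A minimal usage sketch; the sample text is borrowed from the MiddleEnglishTokenizationProcess doctest below, and no output is asserted here:
>>> from cltk.tokenizers.enm import MiddleEnglishWordTokenizer
>>> tokenizer = MiddleEnglishWordTokenizer()
>>> tokens = tokenizer.tokenize("Whilom, as olde stories tellen")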
8.1.19.6. cltk.tokenizers.fro module
Code for word tokenization: Old French.

class cltk.tokenizers.fro.OldFrenchWordTokenizer
Bases: cltk.tokenizers.word.RegexWordTokenizer
A regex-based tokenizer for Old French.
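A minimal usage sketch (sample text from the OldFrenchTokenizationProcess doctest below; no output asserted):
>>> from cltk.tokenizers.fro import OldFrenchWordTokenizer
>>> tokenizer = OldFrenchWordTokenizer()
>>> tokens = tokenizer.tokenize("Une aventure vos voil dire")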
8.1.19.7. cltk.tokenizers.gmh module
Code for word tokenization: Middle High German.

class cltk.tokenizers.gmh.MiddleHighGermanWordTokenizer
Bases: cltk.tokenizers.word.RegexWordTokenizer
A regex-based tokenizer for Middle High German.
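A minimal usage sketch (sample text from the MiddleHighGermanTokenizationProcess doctest below; no output asserted):
>>> from cltk.tokenizers.gmh import MiddleHighGermanWordTokenizer
>>> tokenizer = MiddleHighGermanWordTokenizer()
>>> tokens = tokenizer.tokenize("Uns ist in alten mæren wunder")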
8.1.19.8. cltk.tokenizers.line module
Tokenize lines.

class cltk.tokenizers.line.LineTokenizer(language)
Bases: object
Tokenize text by line; designed for study of poetry.
tokenize(untokenized_string, include_blanks=False)
Tokenize lines by '\n'.
Parameters:
untokenized_string (str) – a string containing one or more sentences
include_blanks (bool) – if True, blank lines are preserved as "" in the returned list of strings; default is False
Return type: list of strings
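A minimal usage sketch, assuming "latin" is an accepted language name; the verse is illustrative:
>>> from cltk.tokenizers.line import LineTokenizer
>>> tokenizer = LineTokenizer("latin")
>>> lines = tokenizer.tokenize("arma virumque cano\nTroiae qui primus ab oris")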
8.1.19.9. cltk.tokenizers.non module
Code for word tokenization: Old Norse.

class cltk.tokenizers.non.OldNorseWordTokenizer
Bases: cltk.tokenizers.word.RegexWordTokenizer
A regex-based tokenizer for Old Norse.
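A minimal usage sketch (sample text from the OldNorseTokenizationProcess doctest below; no output asserted):
>>> from cltk.tokenizers.non import OldNorseWordTokenizer
>>> tokenizer = OldNorseWordTokenizer()
>>> tokens = tokenizer.tokenize("Gylfi konungr réð þar löndum")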
8.1.19.10. cltk.tokenizers.processes module
Module for tokenizers.
TODO: Think about adding a check somewhere if a contrib (not user) chooses an unavailable item.

class cltk.tokenizers.processes.TokenizationProcess(language: str = None)
Bases: cltk.core.data_types.Process
To be inherited for each language's tokenization declarations.
Example: TokenizationProcess -> LatinTokenizationProcess
>>> from cltk.tokenizers.processes import TokenizationProcess
>>> from cltk.core.data_types import Process
>>> issubclass(TokenizationProcess, Process)
True
>>> tok = TokenizationProcess()

algorithm
The backoff tokenizer, from NLTK.

class cltk.tokenizers.processes.MultilingualTokenizationProcess(language: str = None)
Bases: cltk.tokenizers.processes.TokenizationProcess
The default tokenization algorithm.
>>> from cltk.tokenizers.processes import MultilingualTokenizationProcess
>>> from cltk.core.data_types import Doc
>>> from cltk.languages.example_texts import get_example_text
>>> tokenizer_process = MultilingualTokenizationProcess()
>>> output_doc = tokenizer_process.run(Doc(raw=get_example_text("non")[:29]))
>>> output_doc.tokens
['Gylfi', 'konungr', 'réð', 'þar', 'löndum']
>>> [word.index_char_start for word in output_doc.words]
[0, 6, 14, 18, 22]
>>> [word.index_char_stop for word in output_doc.words]
[5, 13, 17, 21, 28]

description = 'Default tokenizer for languages lacking a dedicated tokenizer. This is a whitespace tokenizer inheriting from the NLTK.'

class cltk.tokenizers.processes.AkkadianTokenizationProcess(language: str = None)
Bases: cltk.tokenizers.processes.TokenizationProcess
The default Akkadian tokenization algorithm.
TODO: Change function or post-process to separate tokens from language labels.
>>> from cltk.tokenizers import AkkadianTokenizationProcess
>>> from cltk.core.data_types import Doc
>>> from cltk.languages.example_texts import get_example_text
>>> tokenizer_process = AkkadianTokenizationProcess()
>>> output_doc = tokenizer_process.run(input_doc=Doc(raw=get_example_text("akk")))
>>> output_doc.tokens
[('u2-wa-a-ru', 'akkadian'), ('at-ta', 'akkadian'), ('e2-kal2-la-ka', 'akkadian'), ('_e2_-ka', 'sumerian'), ('wu-e-er', 'akkadian')]

description = 'Default tokenizer for the Akkadian language.'

algorithm

class cltk.tokenizers.processes.ArabicTokenizationProcess(language: str = None)
Bases: cltk.tokenizers.processes.TokenizationProcess
The default Arabic tokenization algorithm.
>>> from cltk.tokenizers import ArabicTokenizationProcess
>>> from cltk.core.data_types import Doc
>>> from cltk.languages.example_texts import get_example_text
>>> tokenizer_process = ArabicTokenizationProcess()
>>> output_doc = tokenizer_process.run(input_doc=Doc(raw=get_example_text("arb")[:34]))
>>> output_doc.tokens
['كهيعص', '﴿', '١', '﴾', 'ذِكْرُ', 'رَحْمَتِ', 'رَبِّكَ']

description = 'Default tokenizer for the Arabic language.'

algorithm

class cltk.tokenizers.processes.GreekTokenizationProcess(language: str = None)
Bases: cltk.tokenizers.processes.TokenizationProcess
The default Greek tokenization algorithm.
>>> from cltk.tokenizers import GreekTokenizationProcess
>>> from cltk.core.data_types import Doc
>>> from cltk.languages.example_texts import get_example_text
>>> tokenizer_process = GreekTokenizationProcess()
>>> output_doc = tokenizer_process.run(input_doc=Doc(raw=get_example_text("grc")[:23]))
>>> output_doc.tokens
['ὅτι', 'μὲν', 'ὑμεῖς', ',', 'ὦ', 'ἄνδρες']

description = 'Default tokenizer for the Greek language.'

algorithm

class cltk.tokenizers.processes.LatinTokenizationProcess(language: str = None)
Bases: cltk.tokenizers.processes.TokenizationProcess
The default Latin tokenization algorithm.
>>> from cltk.tokenizers import LatinTokenizationProcess
>>> from cltk.core.data_types import Doc
>>> from cltk.languages.example_texts import get_example_text
>>> tokenizer_process = LatinTokenizationProcess()
>>> output_doc = tokenizer_process.run(input_doc=Doc(raw=get_example_text("lat")[:23]))
>>> output_doc.tokens
['Gallia', 'est', 'omnis', 'divisa']

description = 'Default tokenizer for the Latin language.'

algorithm

class cltk.tokenizers.processes.MiddleHighGermanTokenizationProcess(language: str = None)
Bases: cltk.tokenizers.processes.TokenizationProcess
The default Middle High German tokenization algorithm.
>>> from cltk.tokenizers import MiddleHighGermanTokenizationProcess
>>> from cltk.core.data_types import Doc
>>> from cltk.languages.example_texts import get_example_text
>>> tokenizer_process = MiddleHighGermanTokenizationProcess()
>>> output_doc = tokenizer_process.run(input_doc=Doc(raw=get_example_text("gmh")[:29]))
>>> output_doc.tokens
['Uns', 'ist', 'in', 'alten', 'mæren', 'wunder']

description = 'Default Middle High German tokenizer'

algorithm

class cltk.tokenizers.processes.MiddleEnglishTokenizationProcess(language: str = None)
Bases: cltk.tokenizers.processes.TokenizationProcess
The default Middle English tokenization algorithm.
>>> from cltk.tokenizers import MiddleEnglishTokenizationProcess
>>> from cltk.core.data_types import Doc
>>> from cltk.languages.example_texts import get_example_text
>>> tokenizer_process = MiddleEnglishTokenizationProcess()
>>> output_doc = tokenizer_process.run(input_doc=Doc(raw=get_example_text("enm")[:31]))
>>> output_doc.tokens
['Whilom', ',', 'as', 'olde', 'stories', 'tellen']

description = 'Default Middle English tokenizer'

algorithm

class cltk.tokenizers.processes.OldFrenchTokenizationProcess(language: str = None)
Bases: cltk.tokenizers.processes.TokenizationProcess
The default Old French tokenization algorithm.
>>> from cltk.tokenizers import OldFrenchTokenizationProcess
>>> from cltk.core.data_types import Doc
>>> from cltk.languages.example_texts import get_example_text
>>> tok = OldFrenchTokenizationProcess()
>>> output_doc = tok.run(input_doc=Doc(raw=get_example_text("fro")[:37]))
>>> output_doc.tokens
['Une', 'aventure', 'vos', 'voil', 'dire', 'Molt', 'bien']

description = 'Default tokenizer for the Old French language.'

algorithm

class cltk.tokenizers.processes.MiddleFrenchTokenizationProcess(language: str = None)
Bases: cltk.tokenizers.processes.TokenizationProcess
The default Middle French tokenization algorithm.
>>> from cltk.tokenizers import MiddleFrenchTokenizationProcess
>>> from cltk.core.data_types import Doc
>>> from cltk.languages.example_texts import get_example_text
>>> tokenizer_process = MiddleFrenchTokenizationProcess()
>>> output_doc = tokenizer_process.run(input_doc=Doc(raw=get_example_text("frm")[:37]))
>>> output_doc.tokens
['Attilius', 'Regulus', ',', 'general', 'de', "l'", 'armée']

description = 'Default tokenizer for the Middle French language.'

algorithm

class cltk.tokenizers.processes.OldNorseTokenizationProcess(language: str = None)
Bases: cltk.tokenizers.processes.TokenizationProcess
The default Old Norse tokenization algorithm.
>>> from cltk.tokenizers import OldNorseTokenizationProcess
>>> from cltk.core.data_types import Doc
>>> from cltk.languages.example_texts import get_example_text
>>> tok = OldNorseTokenizationProcess()
>>> output_doc = tok.run(input_doc=Doc(raw=get_example_text("non")[:29]))
>>> output_doc.tokens
['Gylfi', 'konungr', 'réð', 'þar', 'löndum']

description = 'Default Old Norse tokenizer'

algorithm

8.1.19.11. cltk.tokenizers.utils module
Tokenization utilities.
TODO: KJ consider moving to scripts dir.
8.1.19.12. cltk.tokenizers.word module
Language-specific word tokenizers. Primary purpose is to handle enclitics.

class cltk.tokenizers.word.WordTokenizer
Bases: object
Base class for word tokenizers.

abstract tokenize(text, model=None)
Create a list of tokens from a string. This method should be overridden by subclasses of WordTokenizer.
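A minimal subclass sketch; the whitespace strategy is an illustrative stand-in for a real language-specific rule:
>>> from cltk.tokenizers.word import WordTokenizer
>>> class WhitespaceWordTokenizer(WordTokenizer):
...     def tokenize(self, text, model=None):
...         return text.split()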
class cltk.tokenizers.word.PunktWordTokenizer(sent_tokenizer=None)
Bases: cltk.tokenizers.word.WordTokenizer
Class for Punkt word tokenization.
-
class
cltk.tokenizers.word.
RegexWordTokenizer
(patterns=None)[source]¶ Bases:
cltk.tokenizers.word.WordTokenizer
Class for regex-based word tokenization
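A minimal sketch, assuming patterns is a list of (regex, replacement) pairs applied to the text before it is split into tokens; the enclitic rule shown is hypothetical:
>>> from cltk.tokenizers.word import RegexWordTokenizer
>>> patterns = [(r"(\w+)que\b", r"\1 -que")]  # hypothetical enclitic-splitting rule
>>> tokenizer = RegexWordTokenizer(patterns=patterns)
>>> tokens = tokenizer.tokenize("senatus populusque")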