8.1.19. cltk.tokenizers package

Init for cltk.tokenizers.

8.1.19.1. Subpackages

8.1.19.2. Submodules

8.1.19.3. cltk.tokenizers.akk module

Code for word tokenization: Akkadian

class cltk.tokenizers.akk.AkkadianWordTokenizer[source]

Bases: WordTokenizer

Akkadian word and cuneiform tokenizer.

tokenize(text)[source]

Operates on a single line of text and returns all words in the line as (word, language) tuples in a list.

input: "1. isz-pur-ram a-na" output: [("isz-pur-ram", "akkadian"), ("a-na", "akkadian")]

Parameters:

text (str) – a single line of text

Returns:

list of tuples: (word, language)

tokenize_sign(word)[source]

Takes tuple (word, language) and splits the word up into individual sign tuples (sign, language) in a list.

input: (“{gisz}isz-pur-ram”, “akkadian”) output: [(“gisz”, “determinative”), (“isz”, “akkadian”), (“pur”, “akkadian”), (“ram”, “akkadian”)]

Param:

tuple created by word_tokenizer2

Returns:

list of tuples: (sign, function or language)

static compute_indices(text, tokens)[source]
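
A brief usage sketch based on the docstring examples above; exact output may vary with the installed CLTK version:

>>> from cltk.tokenizers.akk import AkkadianWordTokenizer
>>> tokenizer = AkkadianWordTokenizer()
>>> tokenizer.tokenize("1. isz-pur-ram a-na")
[('isz-pur-ram', 'akkadian'), ('a-na', 'akkadian')]
>>> tokenizer.tokenize_sign(("{gisz}isz-pur-ram", "akkadian"))
[('gisz', 'determinative'), ('isz', 'akkadian'), ('pur', 'akkadian'), ('ram', 'akkadian')]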

8.1.19.4. cltk.tokenizers.arb module

Code for word tokenization: Arabic

class cltk.tokenizers.arb.ArabicWordTokenizer[source]

Bases: WordTokenizer

Word tokenizer using the pyarabic package: https://pypi.org/project/PyArabic/

tokenize(text)[source]
Return type:

list

Parameters:

text (str) – text to be tokenized into word tokens
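
A brief usage sketch; the expected tokens mirror the ArabicTokenizationProcess doctest in cltk.tokenizers.processes below and may vary by CLTK/PyArabic version:

>>> from cltk.tokenizers.arb import ArabicWordTokenizer
>>> from cltk.languages.example_texts import get_example_text
>>> tokenizer = ArabicWordTokenizer()
>>> tokenizer.tokenize(get_example_text("arb")[:34])
['كهيعص', '﴿', '١', '﴾', 'ذِكْرُ', 'رَحْمَتِ', 'رَبِّكَ']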

8.1.19.5. cltk.tokenizers.enm module

Code for word tokenization: Middle English

class cltk.tokenizers.enm.MiddleEnglishWordTokenizer[source]

Bases: RegexWordTokenizer

A regex-based tokenizer for Middle English.
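
A brief usage sketch (the other regex-based tokenizers below follow the same pattern); the expected tokens assume this class backs MiddleEnglishTokenizationProcess, whose doctest appears in cltk.tokenizers.processes:

>>> from cltk.tokenizers.enm import MiddleEnglishWordTokenizer
>>> from cltk.languages.example_texts import get_example_text
>>> tokenizer = MiddleEnglishWordTokenizer()
>>> tokenizer.tokenize(get_example_text("enm")[:31])
['Whilom', ',', 'as', 'olde', 'stories', 'tellen']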

8.1.19.6. cltk.tokenizers.fro module

Code for word tokenization: Old French

class cltk.tokenizers.fro.OldFrenchWordTokenizer[source]

Bases: RegexWordTokenizer

A regex-based tokenizer for Old French.

8.1.19.7. cltk.tokenizers.gmh module

Code for word tokenization: Middle High German

class cltk.tokenizers.gmh.MiddleHighGermanWordTokenizer[source]

Bases: RegexWordTokenizer

A regex-based tokenizer for Middle High German.

8.1.19.8. cltk.tokenizers.line module

Tokenize lines.

class cltk.tokenizers.line.LineTokenizer(language)[source]

Bases: object

Tokenize text by line; designed for study of poetry.

tokenize(untokenized_string, include_blanks=False)[source]

Tokenize lines by '\n'.

Parameters:
  • untokenized_string (str) – a string containing one or more sentences.

  • include_blanks (bool) – if True, blank lines are preserved as "" in the returned list of strings; default is False.

Return type:

list of str
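
A usage sketch, assuming the tokenizer splits on newline characters as described above (the "lat" language argument is illustrative):

>>> from cltk.tokenizers.line import LineTokenizer
>>> tokenizer = LineTokenizer("lat")
>>> tokenizer.tokenize("Gallia est omnis divisa in partes tres.\nQuarum unam incolunt Belgae.")
['Gallia est omnis divisa in partes tres.', 'Quarum unam incolunt Belgae.']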

8.1.19.9. cltk.tokenizers.non module

Code for word tokenization: Old Norse

class cltk.tokenizers.non.OldNorseWordTokenizer[source]

Bases: RegexWordTokenizer

A regex-based tokenizer for Old Norse.

8.1.19.10. cltk.tokenizers.processes module

Module for tokenizers.

TODO: Think about adding a check somewhere if a contributor (not a user) chooses an unavailable item

class cltk.tokenizers.processes.TokenizationProcess(language=None)[source]

Bases: Process

To be inherited for each language’s tokenization declarations.

Example: TokenizationProcess -> LatinTokenizationProcess

>>> from cltk.tokenizers.processes import TokenizationProcess
>>> from cltk.core.data_types import Process
>>> issubclass(TokenizationProcess, Process)
True
>>> tok = TokenizationProcess()
algorithm

The backoff tokenizer, from NLTK.

run(input_doc)[source]
Return type:

Doc

class cltk.tokenizers.processes.MultilingualTokenizationProcess(language=None)[source]

Bases: TokenizationProcess

The default tokenization algorithm.

>>> from cltk.tokenizers.processes import MultilingualTokenizationProcess
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.core.data_types import Doc
>>> tokenizer_process = MultilingualTokenizationProcess()
>>> output_doc = tokenizer_process.run(Doc(raw=get_example_text("non")[:29]))
>>> output_doc.tokens
['Gylfi', 'konungr', 'réð', 'þar', 'löndum']
>>> [word.index_char_start for word in output_doc.words]
[0, 6, 14, 18, 22]
>>> [word.index_char_stop for word in output_doc.words]
[5, 13, 17, 21, 28]
description = 'Default tokenizer for languages lacking a dedicated tokenizer. This is a whitespace tokenizer inheriting from the NLTK.'
class cltk.tokenizers.processes.AkkadianTokenizationProcess(language=None)[source]

Bases: TokenizationProcess

The default Akkadian tokenization algorithm.

TODO: Change function or post-process to separate tokens from language labels

>>> from cltk.tokenizers import AkkadianTokenizationProcess
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.core.data_types import Doc
>>> tokenizer_process = AkkadianTokenizationProcess()
>>> output_doc = tokenizer_process.run(input_doc=Doc(raw=get_example_text("akk")))
>>> output_doc.tokens
[('u2-wa-a-ru', 'akkadian'), ('at-ta', 'akkadian'), ('e2-kal2-la-ka', 'akkadian'), ('_e2_-ka', 'sumerian'), ('wu-e-er', 'akkadian')]
description = 'Default tokenizer for the Akkadian language.'
algorithm
class cltk.tokenizers.processes.ArabicTokenizationProcess(language=None)[source]

Bases: TokenizationProcess

The default Arabic tokenization algorithm.

>>> from cltk.tokenizers import ArabicTokenizationProcess
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.core.data_types import Doc
>>> tokenizer_process = ArabicTokenizationProcess()
>>> output_doc = tokenizer_process.run(input_doc=Doc(raw=get_example_text("arb")[:34]))
>>> output_doc.tokens
['كهيعص', '﴿', '١', '﴾', 'ذِكْرُ', 'رَحْمَتِ', 'رَبِّكَ']
description = 'Default tokenizer for the Arabic language.'
algorithm
class cltk.tokenizers.processes.GreekTokenizationProcess(language=None)[source]

Bases: TokenizationProcess

The default Greek tokenization algorithm.

>>> from cltk.tokenizers import GreekTokenizationProcess
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.core.data_types import Doc
>>> tokenizer_process = GreekTokenizationProcess()
>>> output_doc = tokenizer_process.run(input_doc=Doc(raw=get_example_text("grc")[:23]))
>>> output_doc.tokens
['ὅτι', 'μὲν', 'ὑμεῖς', ',', 'ὦ', 'ἄνδρες']
description = 'Default tokenizer for the Greek language.'
algorithm
class cltk.tokenizers.processes.LatinTokenizationProcess(language=None)[source]

Bases: TokenizationProcess

The default Latin tokenization algorithm.

>>> from cltk.tokenizers import LatinTokenizationProcess
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.core.data_types import Doc
>>> tokenizer_process = LatinTokenizationProcess()
>>> output_doc = tokenizer_process.run(input_doc=Doc(raw=get_example_text("lat")[:23]))
>>> output_doc.tokens
['Gallia', 'est', 'omnis', 'divisa']
description = 'Default tokenizer for the Latin language.'
algorithm
class cltk.tokenizers.processes.MiddleHighGermanTokenizationProcess(language=None)[source]

Bases: TokenizationProcess

The default Middle High German tokenization algorithm.

>>> from cltk.tokenizers.processes import MiddleHighGermanTokenizationProcess
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.core.data_types import Doc
>>> tokenizer_process = MiddleHighGermanTokenizationProcess()
>>> output_doc = tokenizer_process.run(input_doc=Doc(raw=get_example_text("gmh")[:29]))
>>> output_doc.tokens
['Uns', 'ist', 'in', 'alten', 'mæren', 'wunder']
description = 'Default Middle High German tokenizer'
algorithm
class cltk.tokenizers.processes.MiddleEnglishTokenizationProcess(language=None)[source]

Bases: TokenizationProcess

The default Middle English tokenization algorithm.

>>> from cltk.tokenizers import MiddleEnglishTokenizationProcess
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.core.data_types import Doc
>>> tokenizer_process = MiddleEnglishTokenizationProcess()
>>> output_doc = tokenizer_process.run(input_doc=Doc(raw=get_example_text("enm")[:31]))
>>> output_doc.tokens
['Whilom', ',', 'as', 'olde', 'stories', 'tellen']
description = 'Default Middle English tokenizer'
algorithm
class cltk.tokenizers.processes.OldFrenchTokenizationProcess(language=None)[source]

Bases: TokenizationProcess

The default Old French tokenization algorithm.

>>> from cltk.tokenizers import OldFrenchTokenizationProcess
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.core.data_types import Doc
>>> tok = OldFrenchTokenizationProcess()
>>> output_doc = tok.run(input_doc=Doc(raw=get_example_text("fro")[:37]))
>>> output_doc.tokens
['Une', 'aventure', 'vos', 'voil', 'dire', 'Molt', 'bien']
description = 'Default tokenizer for the Old French language.'
algorithm
class cltk.tokenizers.processes.MiddleFrenchTokenizationProcess(language=None)[source]

Bases: TokenizationProcess

The default Middle French tokenization algorithm.

>>> from cltk.tokenizers import MiddleFrenchTokenizationProcess
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.core.data_types import Doc
>>> tokenizer_process = MiddleFrenchTokenizationProcess()
>>> output_doc = tokenizer_process.run(input_doc=Doc(raw=get_example_text("frm")[:37]))
>>> output_doc.tokens
['Attilius', 'Regulus', ',', 'general', 'de', "l'", 'armée']
description = 'Default tokenizer for the Middle French language.'
algorithm
class cltk.tokenizers.processes.OldNorseTokenizationProcess(language=None)[source]

Bases: TokenizationProcess

The default OldNorse tokenization algorithm.

>>> from cltk.tokenizers import OldNorseTokenizationProcess
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.core.data_types import Doc
>>> tok = OldNorseTokenizationProcess()
>>> output_doc = tok.run(input_doc=Doc(raw=get_example_text("non")[:29]))
>>> output_doc.tokens
['Gylfi', 'konungr', 'réð', 'þar', 'löndum']
description = 'Default Old Norse tokenizer'
algorithm

8.1.19.11. cltk.tokenizers.utils module

Tokenization utilities

TODO: KJ consider moving to scripts dir.

class cltk.tokenizers.utils.SentenceTokenizerTrainer(language=None, punctuation=None, strict=False, strict_punctuation=None, abbreviations=None)[source]

Bases: object

Train a sentence tokenizer.

train_sentence_tokenizer(text)[source]

Train a sentence tokenizer on the given text.

pickle_sentence_tokenizer(filename, tokenizer)[source]
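
A hedged sketch of the intended workflow; the training text and output filename below are placeholders, and a real training corpus should be far larger:

>>> from cltk.tokenizers.utils import SentenceTokenizerTrainer
>>> trainer = SentenceTokenizerTrainer(language="lat")
>>> training_text = "Gallia est omnis divisa in partes tres. Quarum unam incolunt Belgae."
>>> sentence_tokenizer = trainer.train_sentence_tokenizer(training_text)
>>> trainer.pickle_sentence_tokenizer("lat_punkt.pickle", sentence_tokenizer)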

8.1.19.12. cltk.tokenizers.word module

Language-specific word tokenizers. Primary purpose is to handle enclitics.

class cltk.tokenizers.word.WordTokenizer[source]

Bases: object

Base class for word tokenizers

abstract tokenize(text, model=None)[source]

Create a list of tokens from a string. This method should be overridden by subclasses of WordTokenizer.

abstract tokenize_sign(text, model=None)[source]

Create a list of tokens from a string, for cuneiform signs. This method should be overridden by subclasses of WordTokenizer.

static compute_indices(text, tokens)[source]
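
A minimal sketch of subclassing the base class; the whitespace and hyphen splitting below are illustrative, not part of CLTK:

>>> from cltk.tokenizers.word import WordTokenizer
>>> class WhitespaceWordTokenizer(WordTokenizer):
...     def tokenize(self, text, model=None):
...         return text.split()
...     def tokenize_sign(self, text, model=None):
...         return text.split("-")
>>> WhitespaceWordTokenizer().tokenize("arma virumque cano")
['arma', 'virumque', 'cano']
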
class cltk.tokenizers.word.PunktWordTokenizer(sent_tokenizer=None)[source]

Bases: WordTokenizer

Class for punkt word tokenization

tokenize(text)[source]
Return type:

list

Parameters:

text (str) – text to be tokenized into word tokens

class cltk.tokenizers.word.RegexWordTokenizer(patterns=None)[source]

Bases: WordTokenizer

Class for regex-based word tokenization

tokenize(text)[source]
Return type:

list

Parameters:
  • text (str) – text to be tokenized into word tokens

  • model – tokenizer object to be used (TODO: should this be passed in __init__?)
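
A hedged usage sketch: it assumes patterns is a list of (regex, replacement) pairs applied to the text before whitespace splitting, as the language-specific subclasses above suggest; check the CLTK source for the exact contract:

>>> from cltk.tokenizers.word import RegexWordTokenizer
>>> patterns = [(r"-", " - ")]  # hypothetical pattern: set hyphens off as separate tokens
>>> tokenizer = RegexWordTokenizer(patterns=patterns)
>>> tokenizer.tokenize("wel-nigh")
['wel', '-', 'nigh']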

class cltk.tokenizers.word.CLTKTreebankWordTokenizer[source]

Bases: TreebankWordTokenizer

static compute_indices(text, tokens)[source]
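
A sketch of how compute_indices appears to be used: given the original text and its tokens, it recovers character offsets for each token (compare the index_char_start values in the MultilingualTokenizationProcess doctest above). The values shown assume it returns the character start offset of each token:

>>> from cltk.tokenizers.word import CLTKTreebankWordTokenizer
>>> text = "Gallia est omnis divisa"
>>> tokens = ['Gallia', 'est', 'omnis', 'divisa']
>>> CLTKTreebankWordTokenizer.compute_indices(text, tokens)
[0, 7, 11, 17]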