8.1.14. cltk.sentence package

8.1.14.1. Submodules

8.1.14.2. cltk.sentence.grc module

Code for sentence tokenization: Greek.

Sentence tokenization for Ancient Greek is available using a regular-expression based tokenizer.

>>> from cltk.sentence.grc import GreekRegexSentenceTokenizer
>>> from cltk.languages.example_texts import get_example_text
>>> splitter = GreekRegexSentenceTokenizer()
>>> sentences = splitter.tokenize(get_example_text("grc"))
>>> sentences[:2]
['ὅτι μὲν ὑμεῖς, ὦ ἄνδρες Ἀθηναῖοι, πεπόνθατε ὑπὸ τῶν ἐμῶν κατηγόρων, οὐκ οἶδα: ἐγὼ δ᾽ οὖν καὶ αὐτὸς ὑπ᾽ αὐτῶν ὀλίγου ἐμαυτοῦ ἐπελαθόμην, οὕτω πιθανῶς ἔλεγον.', 'καίτοι ἀληθές γε ὡς ἔπος εἰπεῖν οὐδὲν εἰρήκασιν.']
>>> len(sentences)
9
class cltk.sentence.grc.GreekRegexSentenceTokenizer[source]

Bases: RegexSentenceTokenizer

RegexSentenceTokenizer for Ancient Greek.

8.1.14.3. cltk.sentence.lat module

Code for sentence tokenization: Latin.

>>> from cltk.sentence.lat import LatinPunktSentenceTokenizer
>>> from cltk.languages.example_texts import get_example_text
>>> splitter = LatinPunktSentenceTokenizer()
>>> sentences = splitter.tokenize(get_example_text("lat"))
>>> sentences[2]
'Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit.'
>>> len(sentences)
8
class cltk.sentence.lat.LatinLanguageVars[source]

Bases: PunktLanguageVars
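
LatinLanguageVars layers Latin-specific punctuation behavior onto NLTK's Punkt machinery. As a sketch of the general pattern (a hypothetical subclass, not CLTK's actual Latin settings), a PunktLanguageVars subclass customizes its sentence-end candidates by overriding the sent_end_chars class attribute:

>>> from nltk.tokenize.punkt import PunktLanguageVars
>>> class MyLanguageVars(PunktLanguageVars):
...     # Hypothetical override; NLTK's default is ('.', '?', '!').
...     sent_end_chars = ('.', '?', '!', ';')
>>> MyLanguageVars().sent_end_chars
('.', '?', '!', ';')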

class cltk.sentence.lat.LatinPunktSentenceTokenizer(strict=False)[source]

Bases: PunktSentenceTokenizer

Sentence tokenizer for Latin. Inherits from cltk.sentence.sentence.PunktSentenceTokenizer, which wraps NLTK's Punkt algorithm with a pretrained Latin model.
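
The strict flag is not documented above; a minimal sketch of its use, assuming (as the name suggests) that strict=True enables more aggressive splitting, e.g. at semicolons:

>>> from cltk.sentence.lat import LatinPunktSentenceTokenizer
>>> strict_splitter = LatinPunktSentenceTokenizer(strict=True)
>>> text = "Gallia est omnis divisa in partes tres; quarum unam incolunt Belgae."
>>> strict_splitter.tokenize(text)  # may split at the semicolon, unlike the default strict=False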

8.1.14.4. cltk.sentence.non module

Code for sentence tokenization: Old Norse.

Sentence tokenization for Old Norse is available using a regular-expression based tokenizer.

>>> from cltk.sentence.non import OldNorseRegexSentenceTokenizer
>>> from cltk.languages.example_texts import get_example_text
>>> splitter = OldNorseRegexSentenceTokenizer()
>>> sentences = splitter.tokenize(get_example_text("non"))
>>> sentences[:2]
['Gylfi konungr réð þar löndum er nú heitir Svíþjóð.', 'Frá honum er þat sagt at hann gaf einni farandi konu at launum skemmtunar sinnar eitt plógsland í ríki sínu þat er fjórir öxn drægi upp dag ok nótt.']
>>> len(sentences)
7
class cltk.sentence.non.OldNorseRegexSentenceTokenizer[source]

Bases: RegexSentenceTokenizer

RegexSentenceTokenizer for Old Norse.

8.1.14.5. cltk.sentence.processes module

Module for sentence tokenizers.

class cltk.sentence.processes.SentenceTokenizationProcess(language=None)[source]

Bases: Process

To be inherited by each language's sentence tokenization declarations.

Example: SentenceTokenizationProcess -> OldNorseSentenceTokenizationProcess

>>> from cltk.sentence.processes import SentenceTokenizationProcess
>>> from cltk.core.data_types import Process
>>> issubclass(SentenceTokenizationProcess, Process)
True
>>> tok = SentenceTokenizationProcess()

model = None

algorithm

run(input_doc)[source]

Return type:

Doc
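
A sketch of what a language-specific subclass might look like (the class below is hypothetical; as with OldNorseSentenceTokenizationProcess below, a real subclass supplies its own algorithm):

>>> from cltk.sentence.processes import SentenceTokenizationProcess
>>> from cltk.sentence.grc import GreekRegexSentenceTokenizer
>>> class MyGreekSentenceTokenizationProcess(SentenceTokenizationProcess):
...     """Hypothetical process that wires in the Greek regex tokenizer."""
...     @property
...     def algorithm(self):
...         return GreekRegexSentenceTokenizer()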

class cltk.sentence.processes.OldNorseSentenceTokenizationProcess(language=None)[source]

Bases: SentenceTokenizationProcess

The default Old Norse sentence tokenization algorithm.

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.sentence.processes import OldNorseSentenceTokenizationProcess
>>> from cltk.tokenizers import OldNorseTokenizationProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Old Norse pipeline",
...     processes=[OldNorseTokenizationProcess, OldNorseSentenceTokenizationProcess],
...     language=get_lang("non"))
>>> nlp = NLP(language='non', custom_pipeline=pipe, suppress_banner=True)
>>> output_doc = nlp.analyze(get_example_text("non"))
>>> len(output_doc.sentences_strings)
7

algorithm

8.1.14.6. cltk.sentence.san module

Sentence tokenization for Sanskrit.

>>> from cltk.sentence.san import SanskritRegexSentenceTokenizer
>>> from cltk.languages.example_texts import get_example_text
>>> splitter = SanskritRegexSentenceTokenizer()
>>> sentences = splitter.tokenize(get_example_text("san"))
>>> sentences[1]
'तेन त्यक्तेन भुञ्जीथा मा गृधः कस्य स्विद्धनम् ॥'
>>> len(sentences)
12
class cltk.sentence.san.SanskritLanguageVars[source]

Bases: PunktLanguageVars

sent_end_chars = ['।', '॥', '\\|', '\\|\\|']

Characters which are candidates for sentence boundaries.
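
A small check of the effect of these characters (note that the pipe variants are pre-escaped because they are regex metacharacters), assuming end characters stay attached to their sentences as in the example above:

>>> from cltk.sentence.san import SanskritRegexSentenceTokenizer
>>> splitter = SanskritRegexSentenceTokenizer()
>>> splitter.tokenize("धर्मो रक्षति रक्षितः । सत्यमेव जयते ॥")  # expected: ['धर्मो रक्षति रक्षितः ।', 'सत्यमेव जयते ॥']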

class cltk.sentence.san.SanskritRegexSentenceTokenizer[source]

Bases: RegexSentenceTokenizer

RegexSentenceTokenizer for Sanskrit.

8.1.14.7. cltk.sentence.sentence module

Tokenize sentences.

class cltk.sentence.sentence.SentenceTokenizer(language=None)[source]

Bases: ABC

Base class for sentence tokenization.

tokenize(text, model=None)[source]

Method for tokenizing sentences with pretrained Punkt models; can be overridden by language-specific tokenizers.

Return type:

list

Parameters:
  • text (str) – text to be tokenized into sentences

  • model – tokenizer object to be used
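
Because tokenize can be overridden, a custom splitter only needs to subclass SentenceTokenizer. A minimal, self-contained sketch (the class is hypothetical) that treats each non-empty line as a sentence:

>>> from cltk.sentence.sentence import SentenceTokenizer
>>> class LineSentenceTokenizer(SentenceTokenizer):
...     """Hypothetical tokenizer: every non-empty line is one sentence."""
...     def tokenize(self, text, model=None):
...         return [line.strip() for line in text.splitlines() if line.strip()]
>>> LineSentenceTokenizer().tokenize("arma virumque cano\nTroiae qui primus ab oris")
['arma virumque cano', 'Troiae qui primus ab oris']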

class cltk.sentence.sentence.PunktSentenceTokenizer(language=None, lang_vars=None)[source]

Bases: SentenceTokenizer

Base class for Punkt sentence tokenization.

missing_models_message = 'PunktSentenceTokenizer requires a language model.'
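
Instantiating this base class directly is expected to fail without a trained model (hence missing_models_message); the supported route is a language subclass such as LatinPunktSentenceTokenizer. A sketch of that pattern, with a hypothetical class name and assuming the Latin models are already downloaded:

>>> from cltk.sentence.sentence import PunktSentenceTokenizer
>>> from cltk.sentence.lat import LatinLanguageVars
>>> class MyLatinTokenizer(PunktSentenceTokenizer):
...     """Hypothetical subclass passing Latin-specific language variables."""
...     def __init__(self):
...         super().__init__(language="lat", lang_vars=LatinLanguageVars())
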
class cltk.sentence.sentence.RegexSentenceTokenizer(language=None, sent_end_chars=None)[source]

Bases: SentenceTokenizer

Base class for regex-based sentence tokenization.

tokenize(text, model=None)[source]

Method for tokenizing sentences with regular expressions.

Return type:

list

Parameters:
  • text (str) – text to be tokenized into sentences
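
The base class can also be instantiated directly with a custom sent_end_chars list. A minimal sketch, mirroring the pre-escaping of regex metacharacters seen in SanskritLanguageVars above (the language code and end characters here are illustrative):

>>> from cltk.sentence.sentence import RegexSentenceTokenizer
>>> splitter = RegexSentenceTokenizer(language="lat", sent_end_chars=["\\.", "\\?", "!"])
>>> splitter.tokenize("Gallia est omnis divisa in partes tres. Quarum unam incolunt Belgae.")  # expected: two sentences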