8.1.14. cltk.sentence package
8.1.14.1. Submodules
8.1.14.2. cltk.sentence.grc module
Code for sentence tokenization: Greek.
Sentence tokenization for Ancient Greek is available using a regular-expression-based tokenizer.
>>> from cltk.sentence.grc import GreekRegexSentenceTokenizer
>>> from cltk.languages.example_texts import get_example_text
>>> splitter = GreekRegexSentenceTokenizer()
>>> sentences = splitter.tokenize(get_example_text("grc"))
>>> sentences[:2]
['ὅτι μὲν ὑμεῖς, ὦ ἄνδρες Ἀθηναῖοι, πεπόνθατε ὑπὸ τῶν ἐμῶν κατηγόρων, οὐκ οἶδα: ἐγὼ δ᾽ οὖν καὶ αὐτὸς ὑπ᾽ αὐτῶν ὀλίγου ἐμαυτοῦ ἐπελαθόμην, οὕτω πιθανῶς ἔλεγον.', 'καίτοι ἀληθές γε ὡς ἔπος εἰπεῖν οὐδὲν εἰρήκασιν.']
>>> len(sentences)
9
class cltk.sentence.grc.GreekRegexSentenceTokenizer
Bases: cltk.sentence.sentence.RegexSentenceTokenizer
RegexSentenceTokenizer for Ancient Greek.
8.1.14.3. cltk.sentence.lat module
Code for sentence tokenization: Latin.
>>> from cltk.sentence.lat import LatinPunktSentenceTokenizer
>>> from cltk.languages.example_texts import get_example_text
>>> splitter = LatinPunktSentenceTokenizer()
>>> sentences = splitter.tokenize(get_example_text("lat"))
>>> sentences[2]
'Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit.'
>>> len(sentences)
8
class cltk.sentence.lat.LatinPunktSentenceTokenizer(strict=False)
Bases: cltk.sentence.sentence.PunktSentenceTokenizer
Sentence tokenizer for Latin. Inherits from NLTK’s PunktSentenceTokenizer.
8.1.14.4. cltk.sentence.san module
Sentence tokenization for Sanskrit.
>>> from cltk.sentence.san import SanskritRegexSentenceTokenizer
>>> from cltk.languages.example_texts import get_example_text
>>> splitter = SanskritRegexSentenceTokenizer()
>>> sentences = splitter.tokenize(get_example_text("san"))
>>> sentences[1]
'तेन त्यक्तेन भुञ्जीथा मा गृधः कस्य स्विद्धनम् ॥'
>>> len(sentences)
12
class cltk.sentence.san.SanskritLanguageVars
Bases: nltk.tokenize.punkt.PunktLanguageVars
sent_end_chars = ['।', '॥', '\\|', '\\|\\|']
class cltk.sentence.san.SanskritRegexSentenceTokenizer
Bases: cltk.sentence.sentence.RegexSentenceTokenizer
RegexSentenceTokenizer for Sanskrit.
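To illustrate how the sent_end_chars listed for SanskritLanguageVars might drive a regex split, here is a minimal standard-library sketch. It is not CLTK's implementation: the function name split_on_end_chars is hypothetical, and the choice to keep each terminator attached to its sentence is an assumption based on the example output above.

```python
import re

# Copied from SanskritLanguageVars.sent_end_chars as documented above.
# The pipe entries are already regex-escaped.
SENT_END_CHARS = ['।', '॥', '\\|', '\\|\\|']


def split_on_end_chars(text, sent_end_chars=SENT_END_CHARS):
    """Hypothetical helper: split text after each sentence-ending mark."""
    # Put longer alternatives first so "||" is not read as two "|".
    alternatives = sorted(sent_end_chars, key=len, reverse=True)
    # A capturing group makes re.split return the delimiters as well.
    pieces = re.split("(" + "|".join(alternatives) + ")", text)
    sentences = []
    # pieces alternates: text, delimiter, text, delimiter, ..., trailing text
    for i in range(0, len(pieces), 2):
        delimiter = pieces[i + 1] if i + 1 < len(pieces) else ""
        sentence = (pieces[i] + delimiter).strip()
        if sentence:
            sentences.append(sentence)
    return sentences


print(split_on_end_chars("a । b ॥ c || d | e"))
```

Sorting the alternatives by length matters only for the ASCII fallbacks: without it, a double pipe would be split into two single-pipe terminators.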
8.1.14.5. cltk.sentence.sentence module
Tokenize sentences.
class cltk.sentence.sentence.SentenceTokenizer(language=None)
Bases: object
Base class for sentence tokenization.
class cltk.sentence.sentence.PunktSentenceTokenizer(language=None, lang_vars=None)
Bases: cltk.sentence.sentence.SentenceTokenizer
Base class for Punkt sentence tokenization.
missing_models_message = 'PunktSentenceTokenizer requires a language model.'
class cltk.sentence.sentence.RegexSentenceTokenizer(language=None, sent_end_chars=None)
Bases: cltk.sentence.sentence.SentenceTokenizer
Base class for regex sentence tokenization.
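The constructor signatures above suggest a small hierarchy: a base class that records the language, and a regex subclass configured by a list of sentence-ending characters. The sketch below mirrors that shape using only the standard library. The Sketch* class names, the default of ["."], the single-character restriction, and the matching strategy are all assumptions for illustration, not CLTK's code.

```python
import re
from typing import List, Optional


class SketchSentenceTokenizer:
    """Hypothetical stand-in for a base class that only records the language."""

    def __init__(self, language: Optional[str] = None):
        self.language = language

    def tokenize(self, text: str) -> List[str]:
        raise NotImplementedError


class SketchRegexSentenceTokenizer(SketchSentenceTokenizer):
    """Hypothetical regex splitter driven by single-character terminators."""

    def __init__(self, language: Optional[str] = None,
                 sent_end_chars: Optional[List[str]] = None):
        super().__init__(language=language)
        # Assumed default; the real default is not stated in this reference.
        self.sent_end_chars = sent_end_chars or ["."]
        # Escape each character so it is matched literally inside the class.
        escaped = "".join(re.escape(c) for c in self.sent_end_chars)
        # One sentence = a run of non-terminators plus any terminators after it.
        self.pattern = re.compile(rf"[^{escaped}]+[{escaped}]*")

    def tokenize(self, text: str) -> List[str]:
        candidates = (m.group().strip() for m in self.pattern.finditer(text))
        return [sentence for sentence in candidates if sentence]


splitter = SketchRegexSentenceTokenizer(language="grc", sent_end_chars=[".", ";"])
print(splitter.tokenize("τί λέγεις; οὐδὲν λέγω."))
```

A language-specific tokenizer in this style would only need to supply its own terminator list, which matches how the Greek and Sanskrit classes above are described as thin subclasses of RegexSentenceTokenizer.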