8.1.14. cltk.sentence package¶
8.1.14.1. Submodules¶
8.1.14.2. cltk.sentence.grc module¶
Code for sentence tokenization: Greek.
Sentence tokenization for Ancient Greek is available using a regular-expression-based tokenizer.
>>> from cltk.sentence.grc import GreekRegexSentenceTokenizer
>>> from cltk.languages.example_texts import get_example_text
>>> splitter = GreekRegexSentenceTokenizer()
>>> sentences = splitter.tokenize(get_example_text("grc"))
>>> sentences[:2]
['ὅτι μὲν ὑμεῖς, ὦ ἄνδρες Ἀθηναῖοι, πεπόνθατε ὑπὸ τῶν ἐμῶν κατηγόρων, οὐκ οἶδα: ἐγὼ δ᾽ οὖν καὶ αὐτὸς ὑπ᾽ αὐτῶν ὀλίγου ἐμαυτοῦ ἐπελαθόμην, οὕτω πιθανῶς ἔλεγον.', 'καίτοι ἀληθές γε ὡς ἔπος εἰπεῖν οὐδὲν εἰρήκασιν.']
>>> len(sentences)
9
- class cltk.sentence.grc.GreekRegexSentenceTokenizer[source]¶
Bases: RegexSentenceTokenizer
RegexSentenceTokenizer for Ancient Greek.
8.1.14.3. cltk.sentence.lat module¶
Code for sentence tokenization: Latin.
>>> from cltk.sentence.lat import LatinPunktSentenceTokenizer
>>> from cltk.languages.example_texts import get_example_text
>>> splitter = LatinPunktSentenceTokenizer()
>>> sentences = splitter.tokenize(get_example_text("lat"))
>>> sentences[2]
'Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit.'
>>> len(sentences)
8
- class cltk.sentence.lat.LatinPunktSentenceTokenizer(strict=False)[source]¶
Bases: PunktSentenceTokenizer
Sentence tokenizer for Latin. Inherits from NLTK’s PunktSentenceTokenizer.
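The strict flag is set at construction time. The sketch below only shows the call pattern; what strict changes is an assumption here (it appears to add weaker stops, such as the semicolon and colon, to the sentence-boundary set, so strict tokenization tends to produce more, shorter segments):
>>> from cltk.sentence.lat import LatinPunktSentenceTokenizer
>>> from cltk.languages.example_texts import get_example_text
>>> strict_splitter = LatinPunktSentenceTokenizer(strict=True)
>>> strict_sentences = strict_splitter.tokenize(get_example_text("lat"))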
8.1.14.4. cltk.sentence.non module¶
Code for sentence tokenization: Old Norse.
Sentence tokenization for Old Norse is available using a regular-expression-based tokenizer.
>>> from cltk.sentence.non import OldNorseRegexSentenceTokenizer
>>> from cltk.languages.example_texts import get_example_text
>>> splitter = OldNorseRegexSentenceTokenizer()
>>> sentences = splitter.tokenize(get_example_text("non"))
>>> sentences[:2]
['Gylfi konungr réð þar löndum er nú heitir Svíþjóð.', 'Frá honum er þat sagt at hann gaf einni farandi konu at launum skemmtunar sinnar eitt plógsland í ríki sínu þat er fjórir öxn drægi upp dag ok nótt.']
>>> len(sentences)
7
- class cltk.sentence.non.OldNorseRegexSentenceTokenizer[source]¶
Bases: RegexSentenceTokenizer
RegexSentenceTokenizer for Old Norse.
8.1.14.5. cltk.sentence.processes module¶
Module for sentence tokenizers.
- class cltk.sentence.processes.SentenceTokenizationProcess(language=None)[source]¶
Bases: Process
To be inherited for each language’s sentence tokenization declarations.
Example: SentenceTokenizationProcess -> OldNorseSentenceTokenizationProcess
>>> from cltk.tokenizers.processes import TokenizationProcess
>>> from cltk.core.data_types import Process
>>> issubclass(SentenceTokenizationProcess, Process)
True
>>> tok = SentenceTokenizationProcess()
- model = None¶
- algorithm¶
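For a language without a bundled process, a subclass only needs to supply an algorithm. A minimal sketch (the language code "xyz" and the boundary characters are hypothetical, and overriding algorithm as a plain property is an assumption):
>>> from cltk.sentence.processes import SentenceTokenizationProcess
>>> from cltk.sentence.sentence import RegexSentenceTokenizer
>>> class DemoSentenceTokenizationProcess(SentenceTokenizationProcess):
...     @property
...     def algorithm(self):
...         # Hypothetical language code and end characters, for illustration only.
...         return RegexSentenceTokenizer(language="xyz", sent_end_chars=[".", "!", "?"])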
- class cltk.sentence.processes.OldNorseSentenceTokenizationProcess(language=None)[source]¶
Bases: SentenceTokenizationProcess
The default Old Norse sentence tokenization algorithm.
>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.sentence.processes import OldNorseSentenceTokenizationProcess
>>> from cltk.tokenizers import OldNorseTokenizationProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Old Norse pipeline", processes=[OldNorseTokenizationProcess, OldNorseSentenceTokenizationProcess], language=get_lang("non"))
>>> nlp = NLP(language='non', custom_pipeline=pipe, suppress_banner=True)
>>> output_doc = nlp.analyze(get_example_text("non"))
>>> len(output_doc.sentences_strings)
7
- algorithm¶
8.1.14.6. cltk.sentence.san module¶
Sentence tokenization for Sanskrit.
>>> from cltk.sentence.san import SanskritRegexSentenceTokenizer
>>> from cltk.languages.example_texts import get_example_text
>>> splitter = SanskritRegexSentenceTokenizer()
>>> sentences = splitter.tokenize(get_example_text("san"))
>>> sentences[1]
'तेन त्यक्तेन भुञ्जीथा मा गृधः कस्य स्विद्धनम् ॥'
>>> len(sentences)
12
- class cltk.sentence.san.SanskritLanguageVars[source]¶
Bases: PunktLanguageVars
- sent_end_chars = ['।', '॥', '\\|', '\\|\\|']¶
Characters which are candidates for sentence boundaries
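These characters (the danda and double danda, plus their escaped ASCII pipe stand-ins) can be inspected directly on the class:
>>> from cltk.sentence.san import SanskritLanguageVars
>>> SanskritLanguageVars.sent_end_chars
['।', '॥', '\\|', '\\|\\|']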
- class cltk.sentence.san.SanskritRegexSentenceTokenizer[source]¶
Bases: RegexSentenceTokenizer
RegexSentenceTokenizer for Sanskrit.
8.1.14.7. cltk.sentence.sentence module¶
Tokenize sentences.
- class cltk.sentence.sentence.SentenceTokenizer(language=None)[source]¶
Bases: ABC
Base class for sentence tokenization.
- class cltk.sentence.sentence.PunktSentenceTokenizer(language=None, lang_vars=None)[source]¶
Bases: SentenceTokenizer
Base class for Punkt sentence tokenization.
- missing_models_message = 'PunktSentenceTokenizer requires a language model.'¶
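The required language model is a trained NLTK Punkt model, of the kind subclasses such as LatinPunktSentenceTokenizer load from disk. For illustration only, such a model can be trained from raw text with NLTK’s own API (a sketch; this is not how CLTK distributes its models):
>>> from nltk.tokenize.punkt import PunktSentenceTokenizer as NltkPunkt, PunktTrainer
>>> trainer = PunktTrainer()
>>> trainer.train("Gallia est omnis divisa in partes tres. Quarum unam incolunt Belgae.", finalize=True)
>>> nltk_tokenizer = NltkPunkt(trainer.get_params())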
- class cltk.sentence.sentence.RegexSentenceTokenizer(language=None, sent_end_chars=None)[source]¶
Bases: SentenceTokenizer
Base class for regex sentence tokenization.
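Per the signature above, the regex base class can also be instantiated directly with a custom sent_end_chars list. A minimal sketch (the exact split behavior around the end characters is an assumption):
>>> from cltk.sentence.sentence import RegexSentenceTokenizer
>>> splitter = RegexSentenceTokenizer(language="lat", sent_end_chars=[".", "?", "!"])
>>> sents = splitter.tokenize("Gallia est omnis divisa in partes tres. Quarum unam incolunt Belgae.")
>>> # Expected, assuming splits occur after each end character:
>>> # ['Gallia est omnis divisa in partes tres.', 'Quarum unam incolunt Belgae.']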