8.1.2. cltk.core package¶
8.1.2.1. Submodules¶
8.1.2.2. cltk.core.cltk_logger module¶
CLTK’s logging module.
8.1.2.3. cltk.core.data_types module¶
Custom data types for the CLTK. These types form the building blocks of the NLP pipeline.
>>> from cltk.core.data_types import Language
>>> from cltk.core.data_types import Word
>>> from cltk.core.data_types import Process
>>> from cltk.core.data_types import Doc
>>> from cltk.core.data_types import Pipeline
class cltk.core.data_types.Language(name: str, glottolog_id: str, latitude: float, longitude: float, dates: List[int], family_id: str, parent_id: str, level: str, iso_639_3_code: str, type: str)[source]¶
Bases: object
For holding information about any given language. Used to encode data from ISO 639-3 and Glottolog at cltk.languages.glottolog.LANGUAGES. May be extended by the user for dialects or languages not documented by ISO 639-3.

>>> from cltk.core.data_types import Language
>>> from cltk.languages.utils import get_lang
>>> lat = get_lang("lat")
>>> isinstance(lat, Language)
True
>>> lat
Language(name='Latin', glottolog_id='lati1261', latitude=41.9026, longitude=12.4502, dates=[], family_id='indo1319', parent_id='impe1234', level='language', iso_639_3_code='lat', type='a')
- name: str¶
- glottolog_id: str¶
- latitude: float¶
- longitude: float¶
- dates: List[int]¶
- family_id: str¶
- parent_id: str¶
- level: str¶
- iso_639_3_code: str¶
- type: str¶
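The note above says Language may be extended by the user for dialects not documented by ISO 639-3. A minimal sketch of what such a record could look like, using a stdlib dataclass that mirrors the fields listed above (with cltk installed, import the real cltk.core.data_types.Language instead; the dialect values below are illustrative, not drawn from Glottolog):

```python
from dataclasses import dataclass
from typing import List

# Stdlib mirror of cltk.core.data_types.Language, for illustration only;
# with cltk installed, import the real class instead.
@dataclass
class Language:
    name: str
    glottolog_id: str
    latitude: float
    longitude: float
    dates: List[int]
    family_id: str
    parent_id: str
    level: str
    iso_639_3_code: str
    type: str

# A user-defined entry for a variety not covered by ISO 639-3;
# the field values here are illustrative placeholders.
vulgar_latin = Language(
    name="Vulgar Latin",
    glottolog_id="",
    latitude=41.9,
    longitude=12.5,
    dates=[],
    family_id="indo1319",
    parent_id="lati1261",
    level="dialect",
    iso_639_3_code="",
    type="h",
)
print(vulgar_latin.name)  # Vulgar Latin
```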
class cltk.core.data_types.Word(index_char_start: int = None, index_char_stop: int = None, index_token: int = None, index_sentence: int = None, string: str = None, pos: str = None, lemma: str = None, stem: str = None, scansion: str = None, xpos: str = None, upos: str = None, dependency_relation: str = None, governor: int = None, features: cltk.morphology.morphosyntax.MorphosyntacticFeatureBundle = {}, category: cltk.morphology.morphosyntax.MorphosyntacticFeatureBundle = {}, embedding: numpy.ndarray = None, stop: bool = None, named_entity: bool = None, syllables: List[str] = None, phonetic_transcription: str = None, definition: str = None)[source]¶
Bases: object
Contains attributes of each processed word in a list of words. Designed to be used in the words attribute of the Doc dataclass.

>>> from cltk.core.data_types import Word
>>> from cltk.languages.example_texts import get_example_text
>>> get_example_text("lat")[:25]
'Gallia est omnis divisa i'
>>> from cltk.languages.utils import get_lang
>>> lat = get_lang("lat")
>>> Word(index_char_start=0, index_char_stop=6, index_token=0, string=get_example_text("lat")[0:6], pos="nom")
Word(index_char_start=0, index_char_stop=6, index_token=0, index_sentence=None, string='Gallia', pos='nom', lemma=None, stem=None, scansion=None, xpos=None, upos=None, dependency_relation=None, governor=None, features={}, category={}, stop=None, named_entity=None, syllables=None, phonetic_transcription=None, definition=None)
- index_char_start: int = None¶
- index_char_stop: int = None¶
- index_token: int = None¶
- index_sentence: int = None¶
- string: str = None¶
- pos: str = None¶
- lemma: str = None¶
- stem: str = None¶
- scansion: str = None¶
- xpos: str = None¶
- upos: str = None¶
- dependency_relation: str = None¶
- governor: int = None¶
- features: cltk.morphology.morphosyntax.MorphosyntacticFeatureBundle = {}¶
- category: cltk.morphology.morphosyntax.MorphosyntacticFeatureBundle = {}¶
- embedding: numpy.ndarray = None¶
- stop: bool = None¶
- named_entity: bool = None¶
- syllables: List[str] = None¶
- phonetic_transcription: str = None¶
- definition: str = None¶
class cltk.core.data_types.Sentence(words: List[cltk.core.data_types.Word] = None, index: int = None, embedding: numpy.ndarray = None)[source]¶
Bases: object
The data container for sentences.
- words: List[cltk.core.data_types.Word] = None¶
- index: int = None¶
- embedding: numpy.ndarray = None¶
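Since Sentence has no doctest of its own, here is a stdlib-only sketch of how it holds a list of Word objects; the two dataclasses below mirror only a few fields of the real cltk.core.data_types classes, which should be imported instead when cltk is installed:

```python
from dataclasses import dataclass
from typing import List

# Partial stdlib mirrors of cltk.core.data_types.Word and Sentence,
# for illustration only.
@dataclass
class Word:
    string: str = None
    index_token: int = None

@dataclass
class Sentence:
    words: List[Word] = None
    index: int = None

# A Sentence is simply a container for an ordered list of Word objects.
sent = Sentence(
    words=[Word(string="Gallia", index_token=0), Word(string="est", index_token=1)],
    index=0,
)
print(len(sent.words))       # 2
print(sent.words[0].string)  # Gallia
```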
class cltk.core.data_types.Doc(language: str = None, words: List[cltk.core.data_types.Word] = None, pipeline: Optional[cltk.core.data_types.Pipeline] = None, raw: str = None, normalized_text: str = None, sentence_embeddings: Dict[int, numpy.ndarray] = None)[source]¶
Bases: object
The object returned to the user from the NLP() class. Contains overall attributes of submitted texts, plus, most importantly, the processed tokenized text words, being a list of Word types.

>>> from cltk import NLP
>>> from cltk.languages.example_texts import get_example_text
>>> cltk_nlp = NLP(language="lat", suppress_banner=True)
>>> cltk_doc = cltk_nlp.analyze(text=get_example_text("lat"))
>>> cltk_doc.raw[:38]
'Gallia est omnis divisa in partes tres'
>>> isinstance(cltk_doc.raw, str)
True
>>> cltk_doc.tokens[:10]
['Gallia', 'est', 'omnis', 'divisa', 'in', 'partes', 'tres', ',', 'quarum', 'unam']
>>> cltk_doc.tokens_stops_filtered[:10]
['Gallia', 'omnis', 'divisa', 'partes', 'tres', ',', 'incolunt', 'Belgae', ',', 'aliam']
>>> cltk_doc.pos[:3]
['NOUN', 'AUX', 'PRON']
>>> cltk_doc.morphosyntactic_features[:3]
[{Case: [nominative], Degree: [positive], Gender: [feminine], Number: [singular]}, {Mood: [indicative], Number: [singular], Person: [third], Tense: [present], VerbForm: [finite], Voice: [active]}, {Case: [nominative], Degree: [positive], Gender: [feminine], Number: [singular], PrononimalType: [indefinite]}]
>>> cltk_doc[0].gender
[feminine]
>>> cltk_doc[0]['Case']
[nominative]
>>> cltk_doc.lemmata[:5]
['mallis', 'sum', 'omnis', 'divido', 'in']
>>> len(cltk_doc.sentences)
9
>>> len(cltk_doc.sentences[0])
26
>>> type(cltk_doc.sentences[0][2])
<class 'cltk.core.data_types.Word'>
>>> cltk_doc.sentences[0][2].string
'omnis'
>>> len(cltk_doc.sentences_tokens)
9
>>> len(cltk_doc.sentences_tokens[0])
26
>>> isinstance(cltk_doc.sentences_tokens[0][2], str)
True
>>> cltk_doc.sentences_tokens[0][2]
'omnis'
>>> len(cltk_doc.sentences_strings)
9
>>> len(cltk_doc.sentences_strings[0])
150
>>> isinstance(cltk_doc.sentences_strings[0], str)
True
>>> cltk_doc.sentences_strings[1]
'Hi omnes lingua , institutis , legibus inter se differunt .'
>>> import numpy as np
>>> isinstance(cltk_doc.embeddings[1], np.ndarray)
True
- language: str = None¶
- words: List[cltk.core.data_types.Word] = None¶
- pipeline: cltk.core.data_types.Pipeline = None¶
- raw: str = None¶
- normalized_text: str = None¶
- embeddings_model = None¶
- sentence_embeddings: Dict[int, numpy.ndarray] = None¶
- property sentences¶
  Returns a list of Sentence objects, with each Sentence being a container for a list of Word objects.
  Return type: List[Sentence]
- property sentences_tokens¶
  Returns a list of lists, with the inner list being a list of word token strings.
  Return type: List[List[str]]
- property sentences_strings¶
  Returns a list of strings, with each string being a sentence reconstructed from the word tokens.
  Return type: List[str]
- property tokens¶
  Returns a list of string word tokens of all words in the doc.
  Return type: List[str]
- property tokens_stops_filtered¶
  Returns a list of string word tokens of all words in the doc, but with stopwords removed.
  Return type: List[str]
- property pos¶
  Returns a list of the POS tags of all words in the doc.
  Return type: List[str]
- property morphosyntactic_features¶
  Returns a list of dictionaries containing the morphosyntactic features of each word (when available). Each dictionary specifies feature names as keys and feature values as values.
  Return type: List[Dict[str, str]]
- property lemmata¶
  Returns a list of lemmata, indexed to the word tokens provided by Doc.tokens.
  Return type: List[str]
- property stems¶
  Returns a list of word stems, indexed to the word tokens provided by Doc.tokens.
  Return type: List[str]
- property embeddings¶
  Returns an embedding for each word.
  TODO: Consider option to use lemma
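The token-oriented properties above are simple projections over Doc.words. A stdlib-only sketch of that pattern, assuming simplified Word and Doc dataclasses that mirror only the fields involved (the real implementations live in cltk.core.data_types):

```python
from dataclasses import dataclass, field
from typing import List

# Partial stdlib mirrors, for illustration of how Doc's properties
# project over its `words` list; not the actual cltk implementation.
@dataclass
class Word:
    string: str = None
    lemma: str = None
    stop: bool = None

@dataclass
class Doc:
    words: List[Word] = field(default_factory=list)

    @property
    def tokens(self) -> List[str]:
        # One token string per Word, in document order.
        return [w.string for w in self.words]

    @property
    def tokens_stops_filtered(self) -> List[str]:
        # Same projection, skipping words flagged as stopwords.
        return [w.string for w in self.words if not w.stop]

doc = Doc(words=[Word("Gallia", "Gallia"), Word("est", "sum", stop=True)])
print(doc.tokens)                 # ['Gallia', 'est']
print(doc.tokens_stops_filtered)  # ['Gallia']
```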
class cltk.core.data_types.Process(language: str = None)[source]¶
Bases: abc.ABC
For each type of NLP process there needs to be a definition. It includes the type of data it expects (str, List[str], Word, etc.) and what field within Word it will populate. This base class is intended to be inherited by NLP process types (e.g., TokenizationProcess or DependencyProcess).

- language: str = None¶
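A stdlib-only sketch of the inheritance pattern just described; WhitespaceTokenizationProcess and its run method are hypothetical, shown only to illustrate a subclass that consumes str input (the interface of the actual cltk TokenizationProcess differs):

```python
from abc import ABC
from dataclasses import dataclass

# Stdlib sketch of the Process base class; the real one is
# cltk.core.data_types.Process.
@dataclass
class Process(ABC):
    language: str = None

@dataclass
class WhitespaceTokenizationProcess(Process):
    """Hypothetical subclass: expects a str, produces word-token strings."""

    def run(self, text: str):
        # Naive whitespace tokenization, standing in for a real algorithm.
        return text.split()

proc = WhitespaceTokenizationProcess(language="lat")
print(proc.run("Gallia est omnis divisa"))  # ['Gallia', 'est', 'omnis', 'divisa']
```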
class cltk.core.data_types.Pipeline(description: str, processes: List[Type[cltk.core.data_types.Process]], language: cltk.core.data_types.Language)[source]¶
Bases: object
Abstract Pipeline class to be inherited.

# TODO: Consider adding a Unicode normalization as a default first Process

>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.languages.utils import get_lang
>>> from cltk.tokenizers import LatinTokenizationProcess
>>> a_pipeline = Pipeline(description="A custom Latin pipeline", processes=[LatinTokenizationProcess], language=get_lang("lat"))
>>> a_pipeline.description
'A custom Latin pipeline'
>>> issubclass(a_pipeline.processes[0], Process)
True
- description: str¶
- processes: List[Type[cltk.core.data_types.Process]]¶
- language: cltk.core.data_types.Language¶
8.1.2.4. cltk.core.exceptions module¶
Custom exceptions for the cltk library.
exception cltk.core.exceptions.CLTKException[source]¶
Bases: Exception

Exception class for the cltk library.

>>> from cltk.core.exceptions import CLTKException
>>> raise CLTKException
Traceback (most recent call last):
  ...
  File "<doctest cltk.core.exceptions.CLTKException[1]>", line 1, in <module>
    raise CLTKException
cltk.core.exceptions.CLTKException
exception cltk.core.exceptions.UnimplementedAlgorithmError[source]¶
Bases: cltk.core.exceptions.CLTKException

Exception for when a language is supported by the CLTK but a particular algorithm is not available for that language.

>>> from cltk.core.exceptions import UnimplementedAlgorithmError
>>> raise UnimplementedAlgorithmError
Traceback (most recent call last):
  ...
  File "<doctest cltk.core.exceptions.UnimplementedAlgorithmError[1]>", line 1, in <module>
    raise UnimplementedAlgorithmError
cltk.core.exceptions.UnimplementedAlgorithmError
exception cltk.core.exceptions.UnknownLanguageError[source]¶
Bases: cltk.core.exceptions.CLTKException

Exception for when a user requests a language either not known to the CLTK or not yet implemented.

All known languages are listed at cltk.languages.glottolog.py. Implemented languages include those at cltk.languages.pipelines and some implemented miscellaneously throughout the library.

>>> from cltk.core.exceptions import UnknownLanguageError
>>> raise UnknownLanguageError
Traceback (most recent call last):
  ...
  File "<doctest cltk.core.exceptions.UnknownLanguageError[1]>", line 1, in <module>
    raise UnknownLanguageError
cltk.core.exceptions.UnknownLanguageError
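Because UnknownLanguageError subclasses CLTKException, callers can catch either the specific or the base exception. A stdlib-only sketch of that hierarchy; lookup_language and its language table are hypothetical stand-ins (the real lookup is cltk.languages.utils.get_lang):

```python
# Stdlib sketch mirroring the exception hierarchy documented above.
class CLTKException(Exception):
    """Base exception for the library."""

class UnknownLanguageError(CLTKException):
    """Raised for a language the library does not know."""

def lookup_language(iso_code: str) -> str:
    # Hypothetical lookup table, standing in for the real language registry.
    known = {"lat": "Latin", "grc": "Ancient Greek"}
    if iso_code not in known:
        raise UnknownLanguageError(f"Unknown ISO 639-3 code: {iso_code!r}")
    return known[iso_code]

try:
    lookup_language("xyz")
except CLTKException as err:  # the subclass is caught via the base class
    caught = type(err).__name__
print(caught)  # UnknownLanguageError
```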