8.1.6. cltk.embeddings package
Init for cltk.embeddings.
8.1.6.1. Submodules
8.1.6.2. cltk.embeddings.embeddings module
Module for accessing pre-trained fastText word embeddings and Word2Vec embeddings from NLPL. fastText provides two sets of models: one trained only on corpora taken from Wikipedia (249 languages) and one trained on a combination of Wikipedia and Common Crawl (157 languages, a subset of the former).
The Word2Vec models come in two versions, txt and bin; the txt version is approximately twice the size and contains information for retraining.
# TODO: Classes Word2VecEmbeddings and FastTextEmbeddings contain duplicative code. Consider combining them.
# TODO: Instead of returning None, return an empty numpy array of the correct length.
-
class cltk.embeddings.embeddings.Word2VecEmbeddings(iso_code, model_type='txt', interactive=True, silent=False, overwrite=False)
Bases: object
Wrapper for Word2Vec embeddings. Note: For models provided by fastText, use class FastTextEmbeddings.
-
_check_input_params()
Confirm that input parameters are valid and in a valid configuration.
- Return type: None
-
_build_nlpl_filepath()
Create the filepath where the model for the chosen language should be found.
- Return type: str
-
-
class cltk.embeddings.embeddings.FastTextEmbeddings(iso_code, training_set='wiki', model_type='vec', interactive=True, overwrite=False, silent=False)
Bases: object
Wrapper for fastText embeddings.
-
download_fasttext_models()
Perform a complete download of the fastText models and save them in the appropriate cltk_data dir.
TODO: Add tests.
TODO: Implement overwrite.
TODO: Error out better or continue to _load_model?
-
_check_input_params()
Look at the combination of parameters given to the class and determine whether any combination is invalid or any model is missing.
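A minimal sketch of this kind of parameter check, not the CLTK implementation. The function name `check_fasttext_params` and the two tuples of allowed values are hypothetical; the values themselves are the ones that appear in the constructor signature and doctest on this page.

```python
# Hypothetical stand-in for the validation step; the real method may
# check more (for example, whether a model exists for the language).
VALID_TRAINING_SETS = ("wiki", "common_crawl")
VALID_MODEL_TYPES = ("vec", "bin")


def check_fasttext_params(training_set: str = "wiki", model_type: str = "vec") -> None:
    """Raise ValueError if either parameter holds an unsupported value."""
    if training_set not in VALID_TRAINING_SETS:
        raise ValueError(
            f"Invalid training_set '{training_set}'. "
            f"Choose one of: {', '.join(VALID_TRAINING_SETS)}."
        )
    if model_type not in VALID_MODEL_TYPES:
        raise ValueError(
            f"Invalid model_type '{model_type}'. "
            f"Choose one of: {', '.join(VALID_MODEL_TYPES)}."
        )


# Valid combinations pass silently.
check_fasttext_params("common_crawl", "bin")
```

Failing early with a message that lists the accepted values makes a mix-up such as passing "bin" as the training set easy to diagnose.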
-
_load_model()
Load model into memory.
TODO: When testing, show that this is a Gensim type.
TODO: Suppress Gensim info printout from screen.
-
_is_fasttext_lang_available()
Return whether any fastText vectors are available for the input language. This covers only the embeddings added to the CLTK, not all fastText embeddings.
- Return type: bool
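One plausible way to implement such a check is a lookup in a fixed table of supported languages. The sketch below is an assumption, not the CLTK code: the mapping `ISO_TO_FASTTEXT` and the function name are hypothetical, and the table is an illustrative subset (only the "lat" to "la" pairing is confirmed by the doctest on this page).

```python
# Hypothetical, illustrative subset; the table used by the CLTK may
# contain different or additional entries.
ISO_TO_FASTTEXT = {
    "lat": "la",   # confirmed by the 'wiki.la.vec' filename in the doctest
    "san": "sa",   # assumed
    "got": "got",  # assumed
}


def is_fasttext_lang_available(iso_code: str) -> bool:
    """Return True if the ISO code has an entry in the lookup table."""
    return iso_code in ISO_TO_FASTTEXT
```

Because the table lists only languages that have been added deliberately, a False result says nothing about whether fastText itself publishes vectors for that language.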
-
_build_fasttext_filepath()
Create the filepath at which to save a downloaded fastText model.
Todo
Do better than test for just name. Try trimming up to user home dir.
>>> import os
>>> from cltk.embeddings.embeddings import FastTextEmbeddings
>>> embeddings_obj = FastTextEmbeddings(iso_code="lat", silent=True)
>>> vec_fp = embeddings_obj._build_fasttext_filepath()
>>> os.path.split(vec_fp)[1]
'wiki.la.vec'
>>> embeddings_obj = FastTextEmbeddings(iso_code="lat", model_type="bin", silent=True)
>>> bin_fp = embeddings_obj._build_fasttext_filepath()
>>> os.path.split(bin_fp)[1]
'wiki.la.bin'
>>> embeddings_obj = FastTextEmbeddings(iso_code="lat", training_set="common_crawl", model_type="vec", silent=True)
>>> vec_fp = embeddings_obj._build_fasttext_filepath()
>>> os.path.split(vec_fp)[1]
'cc.la.300.vec'
>>> embeddings_obj = FastTextEmbeddings(iso_code="lat", training_set="common_crawl", model_type="bin", silent=True)
>>> bin_fp = embeddings_obj._build_fasttext_filepath()
>>> os.path.split(bin_fp)[1]
'cc.la.300.bin'
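The filenames in the doctest follow two patterns, one per training set. The sketch below reproduces only that naming scheme; it is not the CLTK method. The function names, the `base_dir` parameter, and the example directory are hypothetical.

```python
import os


def build_fasttext_filename(
    fasttext_code: str, training_set: str = "wiki", model_type: str = "vec"
) -> str:
    """Build a model filename using the patterns seen in the doctest."""
    if training_set == "wiki":
        # e.g. 'wiki.la.vec'
        return f"wiki.{fasttext_code}.{model_type}"
    if training_set == "common_crawl":
        # e.g. 'cc.la.300.bin'
        return f"cc.{fasttext_code}.300.{model_type}"
    raise ValueError(f"Unknown training_set '{training_set}'.")


def build_fasttext_filepath(
    base_dir: str, fasttext_code: str, training_set: str = "wiki", model_type: str = "vec"
) -> str:
    """Join a (hypothetical) base directory with the model filename."""
    filename = build_fasttext_filename(fasttext_code, training_set, model_type)
    return os.path.join(base_dir, filename)


# The directory here is a placeholder, not the real cltk_data layout.
example_fp = build_fasttext_filepath("models", "la", "common_crawl", "bin")
```

Keeping the filename logic separate from the directory logic lets a test compare just the final path component, as the doctest does with `os.path.split`.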
8.1.6.3. cltk.embeddings.processes module
This module holds the embeddings Process classes.
-
class cltk.embeddings.processes.EmbeddingsProcess(language: str = None, variant: str = 'fasttext', embedding_length: int = None, idf_model: Optional[Dict[str, float]] = None, min_idf: Optional[numpy.float64] = None, max_idf: Optional[numpy.float64] = None)
Bases: cltk.core.data_types.Process
To be inherited for each language’s embeddings declarations.
Note
There can be no DefaultEmbeddingsProcess because word embeddings are naturally language-specific.
Example: EmbeddingsProcess <- LatinEmbeddingsProcess
>>> from cltk.core.data_types import Doc
>>> from cltk.embeddings.processes import EmbeddingsProcess
>>> from cltk.core.data_types import Process
>>> issubclass(EmbeddingsProcess, Process)
True
>>> emb_proc = EmbeddingsProcess()
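The inheritance pattern described above, where each language subclass only overrides defaults, can be sketched with plain dataclasses. This is an illustration under the assumption that the classes are dataclass-style, as the typed signatures suggest. Every class name ending in `Sketch` is a hypothetical stand-in, not a CLTK class, and the numpy-typed fields are omitted to keep the example free of dependencies.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ProcessSketch:
    """Hypothetical stand-in for cltk.core.data_types.Process."""

    language: Optional[str] = None


@dataclass
class EmbeddingsProcessSketch(ProcessSketch):
    """Base declaration: no language is set, so it is not usable as a default."""

    variant: str = "fasttext"
    embedding_length: Optional[int] = None
    idf_model: Optional[Dict[str, float]] = None


@dataclass
class LatinEmbeddingsProcessSketch(EmbeddingsProcessSketch):
    """Language-specific subclass: overrides defaults and adds a description."""

    language: str = "lat"
    description: str = "Default embeddings for Latin."


latin = LatinEmbeddingsProcessSketch()
```

The subclass inherits `variant` unchanged and replaces only the `language` default, which mirrors how the per-language classes listed below differ from the base class mainly in their default values.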
-
language: str = None
-
variant: str = 'fasttext'
-
embedding_length: int = None
-
idf_model: Optional[Dict[str, float]] = None
-
min_idf: Optional[numpy.float64] = None
-
max_idf: Optional[numpy.float64] = None
-
algorithm
-
class cltk.embeddings.processes.ArabicEmbeddingsProcess(language: str = 'arb', variant: str = 'fasttext', embedding_length: int = None, idf_model: Optional[Dict[str, float]] = None, min_idf: Optional[numpy.float64] = None, max_idf: Optional[numpy.float64] = None, description: str = 'Default embeddings for Arabic.')
Bases: cltk.embeddings.processes.EmbeddingsProcess
The default Arabic embeddings algorithm.
-
description: str = 'Default embeddings for Arabic.'
-
language: str = 'arb'
-
-
class cltk.embeddings.processes.AramaicEmbeddingsProcess(language: str = 'arb', variant: str = 'fasttext', embedding_length: int = None, idf_model: Optional[Dict[str, float]] = None, min_idf: Optional[numpy.float64] = None, max_idf: Optional[numpy.float64] = None, description: str = 'Default embeddings for Aramaic.')
Bases: cltk.embeddings.processes.EmbeddingsProcess
The default Aramaic embeddings algorithm.
-
description: str = 'Default embeddings for Aramaic.'
-
language: str = 'arb'
-
-
class cltk.embeddings.processes.GothicEmbeddingsProcess(language: str = 'got', variant: str = 'fasttext', embedding_length: int = None, idf_model: Optional[Dict[str, float]] = None, min_idf: Optional[numpy.float64] = None, max_idf: Optional[numpy.float64] = None, description: str = 'Default embeddings for Gothic.')
Bases: cltk.embeddings.processes.EmbeddingsProcess
The default Gothic embeddings algorithm.
-
description: str = 'Default embeddings for Gothic.'
-
language: str = 'got'
-
-
class cltk.embeddings.processes.GreekEmbeddingsProcess(language: str = 'grc', variant: str = 'nlpl', embedding_length: int = None, idf_model: Optional[Dict[str, float]] = None, min_idf: Optional[numpy.float64] = None, max_idf: Optional[numpy.float64] = None, description: str = 'Default embeddings for Ancient Greek.')
Bases: cltk.embeddings.processes.EmbeddingsProcess
The default Ancient Greek embeddings algorithm.
-
language: str = 'grc'
-
description: str = 'Default embeddings for Ancient Greek.'
-
variant: str = 'nlpl'
-
-
class cltk.embeddings.processes.LatinEmbeddingsProcess(language: str = 'lat', variant: str = 'fasttext', embedding_length: int = None, idf_model: Optional[Dict[str, float]] = None, min_idf: Optional[numpy.float64] = None, max_idf: Optional[numpy.float64] = None, description: str = 'Default embeddings for Latin.')
Bases: cltk.embeddings.processes.EmbeddingsProcess
The default Latin embeddings algorithm.
-
language: str = 'lat'
-
description: str = 'Default embeddings for Latin.'
-
-
class cltk.embeddings.processes.OldEnglishEmbeddingsProcess(language: str = 'ang', variant: str = 'fasttext', embedding_length: int = None, idf_model: Optional[Dict[str, float]] = None, min_idf: Optional[numpy.float64] = None, max_idf: Optional[numpy.float64] = None, description: str = 'Default embeddings for Old English.')
Bases: cltk.embeddings.processes.EmbeddingsProcess
The default Old English embeddings algorithm.
-
description: str = 'Default embeddings for Old English.'
-
language: str = 'ang'
-
-
class cltk.embeddings.processes.PaliEmbeddingsProcess(language: str = 'pli', variant: str = 'fasttext', embedding_length: int = None, idf_model: Optional[Dict[str, float]] = None, min_idf: Optional[numpy.float64] = None, max_idf: Optional[numpy.float64] = None, description: str = 'Default embeddings for Pali.')
Bases: cltk.embeddings.processes.EmbeddingsProcess
The default Pali embeddings algorithm.
-
description: str = 'Default embeddings for Pali.'
-
language: str = 'pli'
-
-
class cltk.embeddings.processes.SanskritEmbeddingsProcess(language: str = 'san', variant: str = 'fasttext', embedding_length: int = None, idf_model: Optional[Dict[str, float]] = None, min_idf: Optional[numpy.float64] = None, max_idf: Optional[numpy.float64] = None, description: str = 'Default embeddings for Sanskrit.')
Bases: cltk.embeddings.processes.EmbeddingsProcess
The default Sanskrit embeddings algorithm.
-
description: str = 'Default embeddings for Sanskrit.'
-
language: str = 'san'
-
-
-
class
-