8.1.6. cltk.embeddings package

Init for cltk.embeddings.

8.1.6.1. Submodules

8.1.6.2. cltk.embeddings.embeddings module

Module for accessing pre-trained fastText word embeddings and Word2Vec embeddings from NLPL. Two sets of models are available from fastText: one trained only on corpora taken from Wikipedia (249 languages), and one trained on a combination of Wikipedia and Common Crawl (157 languages, a subset of the former).

The Word2Vec models come in two versions, txt and bin; the txt version is approximately twice the size and contains information for retraining.

# TODO: Classes Word2VecEmbeddings and FastTextEmbeddings contain duplicative code. Consider combining them.

# TODO: Instead of returning None, return an empty numpy array of correct len.

class cltk.embeddings.embeddings.Word2VecEmbeddings(iso_code, model_type='txt', interactive=True, silent=False, overwrite=False)[source]

Bases: object

Wrapper for Word2Vec embeddings. Note: For models provided by fastText, use class FastTextEmbeddings.

get_word_vector(word)[source]

Return embedding array.
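The module-level TODO indicates that a lookup can return None instead of an array. Callers may therefore want to guard against that case. The following is a minimal sketch, not CLTK code: `vector_or_zeros` is a hypothetical helper, `lookup` stands in for a method such as `get_word_vector`, and a plain list is used in place of a numpy array so the example runs without dependencies.

```python
from typing import Callable, List, Optional, Sequence


def vector_or_zeros(
    lookup: Callable[[str], Optional[Sequence[float]]],
    word: str,
    length: int,
) -> List[float]:
    """Return the vector for ``word``, or a zero vector of ``length``.

    ``lookup`` is assumed to return ``None`` for an unknown word.
    """
    vector = lookup(word)
    if vector is None:
        return [0.0] * length
    return list(vector)


# Toy lookup table standing in for a loaded model (hypothetical data).
toy_vectors = {"arma": [0.1, 0.2, 0.3]}
known = vector_or_zeros(toy_vectors.get, "arma", 3)
unknown = vector_or_zeros(toy_vectors.get, "xyz", 3)
```

With a real model, `length` would presumably come from `get_embedding_length()`.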

get_embedding_length()[source]

Return the embedding length for selected model.

Return type:

int
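For illustration only, the vector length of a txt model can also be read from the file itself, assuming the file follows the standard word2vec text layout, in which the first line holds the vocabulary size and the vector size. This is a sketch under that assumption, not a description of how `get_embedding_length` is implemented; `read_word2vec_header` and the sample data are hypothetical.

```python
import io
from typing import TextIO, Tuple


def read_word2vec_header(handle: TextIO) -> Tuple[int, int]:
    """Return (vocabulary size, vector length) from a word2vec text header."""
    fields = handle.readline().split()
    if len(fields) != 2:
        raise ValueError("Expected a header of the form '<vocab_size> <vector_size>'")
    vocab_size, vector_size = (int(field) for field in fields)
    return vocab_size, vector_size


# Hypothetical two-word model with three dimensions, held in memory.
sample = io.StringIO("2 3\narma 0.1 0.2 0.3\nvirum 0.4 0.5 0.6\n")
vocab_size, vector_size = read_word2vec_header(sample)
```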

get_sims(word)[source]

Get similar words.
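Similar words are conventionally found by ranking the vocabulary by cosine similarity to the query word's vector. The sketch below illustrates that idea in pure Python; it is not the CLTK implementation, which presumably delegates to the underlying model library. The names `cosine` and `most_similar` and the toy vectors are illustrative assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    word: str, vectors: Dict[str, Sequence[float]], topn: int = 2
) -> List[Tuple[str, float]]:
    """Rank every other word in ``vectors`` by cosine similarity to ``word``."""
    target = vectors[word]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical 2-dimensional vectors, chosen so the ranking is easy to check by hand.
toy = {
    "rex": [1.0, 0.0],
    "regina": [0.9, 0.1],
    "mare": [0.0, 1.0],
}
ranking = most_similar("rex", toy)
```

Here "regina" ranks above "mare" because its vector points in nearly the same direction as that of "rex", while "mare" is orthogonal to it.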

_check_input_params()[source]

Confirm that input parameters are valid and in a valid configuration.

Return type:

None

_build_zip_filepath()[source]

Create filepath where .zip file will be saved.

Return type:

str

_build_nlpl_filepath()[source]

Create filepath where chosen language should be found.

Return type:

str

_is_nlpl_model_present()[source]

Check whether the model is present at an otherwise valid filepath.

Return type:

bool

_download_nlpl_models()[source]

Perform complete download of Word2Vec models and save them in appropriate cltk_data dir.

Return type:

None

_unzip_nlpl_model()[source]

Unzip the model.

Return type:

None

_load_model()[source]

Load model into memory.

TODO: When testing show that this is a Gensim type

TODO: Suppress Gensim info printout from screen

class cltk.embeddings.embeddings.FastTextEmbeddings(iso_code, training_set='wiki', model_type='vec', interactive=True, overwrite=False, silent=False)[source]

Bases: object

Wrapper for fastText embeddings.

get_word_vector(word)[source]

Return embedding array.

get_embedding_length()[source]

Return the embedding length for selected model.

Return type:

int

get_sims(word)[source]

Get similar words.

download_fasttext_models()[source]

Perform complete download of fastText models and save them in appropriate cltk_data dir.

TODO: Add tests

TODO: Implement overwrite

TODO: error out better or continue to _load_model?

_is_model_present()[source]

Check whether the model is present at an otherwise valid filepath.

_check_input_params()[source]

Look at the combination of parameters given to the class and determine whether any combination is invalid or any model is missing.
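A minimal sketch of this kind of check, assuming the only accepted values are those that appear in the class signature and the doctest on this page ('wiki' and 'common_crawl' for `training_set`, 'vec' and 'bin' for `model_type`). `check_fasttext_params` is a hypothetical stand-in; the real method may check more, such as language availability, and may raise a different error type.

```python
VALID_TRAINING_SETS = ("wiki", "common_crawl")
VALID_MODEL_TYPES = ("vec", "bin")


def check_fasttext_params(training_set: str, model_type: str) -> None:
    """Raise ValueError when a parameter is outside the assumed allowed values."""
    if training_set not in VALID_TRAINING_SETS:
        raise ValueError(
            f"training_set must be one of {VALID_TRAINING_SETS}, got {training_set!r}"
        )
    if model_type not in VALID_MODEL_TYPES:
        raise ValueError(
            f"model_type must be one of {VALID_MODEL_TYPES}, got {model_type!r}"
        )


check_fasttext_params("wiki", "vec")  # a valid combination passes silently

try:
    # "bin" is a model type, not a training set, so this is rejected.
    check_fasttext_params("bin", "vec")
except ValueError as error:
    message = str(error)
```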

_load_model()[source]

Load model into memory.

TODO: When testing show that this is a Gensim type

TODO: Suppress Gensim info printout from screen

_is_fasttext_lang_available()[source]

Return whether any fastText vectors are available for the input language. This does not cover all fastText embeddings, only those added to the CLTK.

Return type:

bool

_build_fasttext_filepath()[source]

Create filepath at which to save a downloaded fasttext model.

Todo

Do better than testing for just the file name. Try trimming the path up to the user's home dir.

>>> import os
>>> from cltk.embeddings.embeddings import FastTextEmbeddings
>>> embeddings_obj = FastTextEmbeddings(iso_code="lat", silent=True)
>>> vec_fp = embeddings_obj._build_fasttext_filepath()
>>> os.path.split(vec_fp)[1]
'wiki.la.vec'
>>> embeddings_obj = FastTextEmbeddings(iso_code="lat", model_type="bin", silent=True)
>>> bin_fp = embeddings_obj._build_fasttext_filepath()
>>> os.path.split(bin_fp)[1]
'wiki.la.bin'
>>> embeddings_obj = FastTextEmbeddings(iso_code="lat", training_set="common_crawl", model_type="vec", silent=True)
>>> vec_fp = embeddings_obj._build_fasttext_filepath()
>>> os.path.split(vec_fp)[1]
'cc.la.300.vec'
>>> embeddings_obj = FastTextEmbeddings(iso_code="lat", training_set="common_crawl", model_type="bin", silent=True)
>>> bin_fp = embeddings_obj._build_fasttext_filepath()
>>> os.path.split(bin_fp)[1]
'cc.la.300.bin'
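
The four file names in the doctest above follow a simple pattern, which can be sketched as below. `fasttext_filename` is a hypothetical function written for illustration and is not the method's actual body. It takes the short language code used in the file names ('la' for Latin) rather than the three-letter `iso_code`, and the fixed 300 is copied from the doctest output.

```python
def fasttext_filename(lang_code: str, training_set: str, model_type: str) -> str:
    """Build a file name such as 'wiki.la.vec' or 'cc.la.300.bin'."""
    if training_set == "wiki":
        return f"wiki.{lang_code}.{model_type}"
    if training_set == "common_crawl":
        # The 300 in the documented output is presumably the vector length.
        return f"cc.{lang_code}.300.{model_type}"
    raise ValueError(f"Unknown training_set: {training_set!r}")


# Reproduce the four names from the doctest, in the same order.
names = [
    fasttext_filename("la", training_set, model_type)
    for training_set in ("wiki", "common_crawl")
    for model_type in ("vec", "bin")
]
```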
_build_fasttext_url()[source]

Make the URL at which the requested model may be downloaded.

8.1.6.3. cltk.embeddings.processes module

This module holds the embeddings Process classes.

class cltk.embeddings.processes.EmbeddingsProcess(language: str = None, variant: str = 'fasttext', embedding_length: int = None, idf_model: Optional[Dict[str, float]] = None, min_idf: Optional[numpy.float64] = None, max_idf: Optional[numpy.float64] = None)[source]

Bases: cltk.core.data_types.Process

To be inherited for each language’s embeddings declarations.

Note

There can be no DefaultEmbeddingsProcess because word embeddings are naturally language-specific.

Example: LatinEmbeddingsProcess inherits from EmbeddingsProcess.

>>> from cltk.core.data_types import Doc
>>> from cltk.embeddings.processes import EmbeddingsProcess
>>> from cltk.core.data_types import Process
>>> issubclass(EmbeddingsProcess, Process)
True
>>> emb_proc = EmbeddingsProcess()

language: str = None
variant: str = 'fasttext'
embedding_length: int = None
idf_model: Optional[Dict[str, float]] = None
min_idf: Optional[numpy.float64] = None
max_idf: Optional[numpy.float64] = None
algorithm

run(input_doc)[source]

Compute the embeddings.

Return type:

Doc
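The language-specific classes documented after this point differ from the base class only in their default field values. The pattern can be sketched with stand-in dataclasses; the `Sketch...` classes are hypothetical, reproduce only a few fields, and assume that the real classes are dataclasses, as their signatures suggest.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchEmbeddingsProcess:
    """Stand-in for the base class; only a few fields are reproduced."""

    language: Optional[str] = None
    variant: str = "fasttext"
    embedding_length: Optional[int] = None


@dataclass
class SketchGreekEmbeddingsProcess(SketchEmbeddingsProcess):
    """Overrides defaults in the way GreekEmbeddingsProcess is documented to."""

    language: Optional[str] = "grc"
    variant: str = "nlpl"
    description: str = "Default embeddings for Ancient Greek."


base = SketchEmbeddingsProcess()
greek = SketchGreekEmbeddingsProcess()
```

Because every field has a default, a subclass can be instantiated without arguments, and a caller can still override any single field, for example `SketchGreekEmbeddingsProcess(variant="fasttext")`.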

class cltk.embeddings.processes.ArabicEmbeddingsProcess(language: str = 'arb', variant: str = 'fasttext', embedding_length: int = None, idf_model: Optional[Dict[str, float]] = None, min_idf: Optional[numpy.float64] = None, max_idf: Optional[numpy.float64] = None, description: str = 'Default embeddings for Arabic.')[source]

Bases: cltk.embeddings.processes.EmbeddingsProcess

The default Arabic embeddings algorithm.

description: str = 'Default embeddings for Arabic.'
language: str = 'arb'

class cltk.embeddings.processes.AramaicEmbeddingsProcess(language: str = 'arb', variant: str = 'fasttext', embedding_length: int = None, idf_model: Optional[Dict[str, float]] = None, min_idf: Optional[numpy.float64] = None, max_idf: Optional[numpy.float64] = None, description: str = 'Default embeddings for Aramaic.')[source]

Bases: cltk.embeddings.processes.EmbeddingsProcess

The default Aramaic embeddings algorithm.

description: str = 'Default embeddings for Aramaic.'
language: str = 'arb'

class cltk.embeddings.processes.GothicEmbeddingsProcess(language: str = 'got', variant: str = 'fasttext', embedding_length: int = None, idf_model: Optional[Dict[str, float]] = None, min_idf: Optional[numpy.float64] = None, max_idf: Optional[numpy.float64] = None, description: str = 'Default embeddings for Gothic.')[source]

Bases: cltk.embeddings.processes.EmbeddingsProcess

The default Gothic embeddings algorithm.

description: str = 'Default embeddings for Gothic.'
language: str = 'got'

class cltk.embeddings.processes.GreekEmbeddingsProcess(language: str = 'grc', variant: str = 'nlpl', embedding_length: int = None, idf_model: Optional[Dict[str, float]] = None, min_idf: Optional[numpy.float64] = None, max_idf: Optional[numpy.float64] = None, description: str = 'Default embeddings for Ancient Greek.')[source]

Bases: cltk.embeddings.processes.EmbeddingsProcess

The default Ancient Greek embeddings algorithm.

language: str = 'grc'
description: str = 'Default embeddings for Ancient Greek.'
variant: str = 'nlpl'

class cltk.embeddings.processes.LatinEmbeddingsProcess(language: str = 'lat', variant: str = 'fasttext', embedding_length: int = None, idf_model: Optional[Dict[str, float]] = None, min_idf: Optional[numpy.float64] = None, max_idf: Optional[numpy.float64] = None, description: str = 'Default embeddings for Latin.')[source]

Bases: cltk.embeddings.processes.EmbeddingsProcess

The default Latin embeddings algorithm.

language: str = 'lat'
description: str = 'Default embeddings for Latin.'

class cltk.embeddings.processes.OldEnglishEmbeddingsProcess(language: str = 'ang', variant: str = 'fasttext', embedding_length: int = None, idf_model: Optional[Dict[str, float]] = None, min_idf: Optional[numpy.float64] = None, max_idf: Optional[numpy.float64] = None, description: str = 'Default embeddings for Old English.')[source]

Bases: cltk.embeddings.processes.EmbeddingsProcess

The default Old English embeddings algorithm.

description: str = 'Default embeddings for Old English.'
language: str = 'ang'

class cltk.embeddings.processes.PaliEmbeddingsProcess(language: str = 'pli', variant: str = 'fasttext', embedding_length: int = None, idf_model: Optional[Dict[str, float]] = None, min_idf: Optional[numpy.float64] = None, max_idf: Optional[numpy.float64] = None, description: str = 'Default embeddings for Pali.')[source]

Bases: cltk.embeddings.processes.EmbeddingsProcess

The default Pali embeddings algorithm.

description: str = 'Default embeddings for Pali.'
language: str = 'pli'

class cltk.embeddings.processes.SanskritEmbeddingsProcess(language: str = 'san', variant: str = 'fasttext', embedding_length: int = None, idf_model: Optional[Dict[str, float]] = None, min_idf: Optional[numpy.float64] = None, max_idf: Optional[numpy.float64] = None, description: str = 'Default embeddings for Sanskrit.')[source]

Bases: cltk.embeddings.processes.EmbeddingsProcess

The default Sanskrit embeddings algorithm.

description: str = 'Default embeddings for Sanskrit.'
language: str = 'san'