8.1.6. cltk.embeddings package¶
Init for cltk.embeddings.
8.1.6.1. Submodules¶
8.1.6.2. cltk.embeddings.embeddings module¶
Module for accessing pre-trained fastText word embeddings and Word2Vec embeddings from NLPL. Two sets of fastText models are available: one trained only on corpora taken from Wikipedia (294 languages) and the other trained on a combination of Wikipedia and Common Crawl (157 languages, a subset of the former).
The Word2Vec models come in two versions, txt and bin, with the txt being approximately twice the size and containing information for retraining.
Note: In Oct 2022, we changed from the fasttext library to spaCy's floret, which includes fastText's source but avoids its packaging problems.
# TODO: Classes Word2VecEmbeddings and FastTextEmbeddings contain duplicative code. Consider combining them.
# TODO: Instead of returning None, return an empty numpy array of correct len.
- class cltk.embeddings.embeddings.CLTKWord2VecEmbeddings(iso_code, model_type='txt', interactive=True, silent=False, overwrite=False)[source]¶
Bases: object
Wrapper for self-hosted Word2Vec embeddings.
- _check_input_params()[source]¶
Confirm that input parameters are valid and in a valid configuration.
- Return type:
None
- class cltk.embeddings.embeddings.Word2VecEmbeddings(iso_code, model_type='txt', interactive=True, silent=False, overwrite=False)[source]¶
Bases: object
Wrapper for Word2Vec embeddings. Note: For models provided by fastText, use class FastTextEmbeddings.
- _check_input_params()[source]¶
Confirm that input parameters are valid and in a valid configuration.
- Return type:
None
- _build_nlpl_filepath()[source]¶
Create filepath where chosen language should be found.
- Return type:
str
- class cltk.embeddings.embeddings.FastTextEmbeddings(iso_code, training_set='wiki', model_type='vec', interactive=True, overwrite=False, silent=False)[source]¶
Bases: object
Wrapper for fastText embeddings.
- download_fasttext_models()[source]¶
Perform a complete download of the fastText models and save them in the appropriate cltk_data dir.
TODO: Add tests.
TODO: Implement overwrite.
TODO: Error out better, or continue to _load_model?
- _check_input_params()[source]¶
Check the combination of parameters given to the class and determine whether any combination is invalid or any models are missing.
- _load_model()[source]¶
Load model into memory.
TODO: When testing, show that this is a Gensim type.
TODO: Suppress Gensim info printout from screen.
- _is_fasttext_lang_available()[source]¶
Return whether fastText vectors are available for the input language. This covers only the fastText embeddings added to the CLTK, not all fastText embeddings.
- Return type:
bool
- _build_fasttext_filepath()[source]¶
Create the filepath at which to save a downloaded fastText model.
Todo
Do better than test for just name. Try trimming up to user home dir.
>>> import os
>>> from cltk.embeddings.embeddings import FastTextEmbeddings
>>> embeddings_obj = FastTextEmbeddings(iso_code="lat", silent=True)
>>> vec_fp = embeddings_obj._build_fasttext_filepath()
>>> os.path.split(vec_fp)[1]
'wiki.la.vec'
>>> embeddings_obj = FastTextEmbeddings(iso_code="lat", model_type="bin", silent=True)
>>> bin_fp = embeddings_obj._build_fasttext_filepath()
>>> os.path.split(bin_fp)[1]
'wiki.la.bin'
>>> embeddings_obj = FastTextEmbeddings(iso_code="lat", training_set="common_crawl", model_type="vec", silent=True)
>>> vec_fp = embeddings_obj._build_fasttext_filepath()
>>> os.path.split(vec_fp)[1]
'cc.la.300.vec'
>>> embeddings_obj = FastTextEmbeddings(iso_code="lat", training_set="common_crawl", model_type="bin", silent=True)
>>> bin_fp = embeddings_obj._build_fasttext_filepath()
>>> os.path.split(bin_fp)[1]
'cc.la.300.bin'
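The filenames in the doctest above follow fastText's published naming convention. The helper below is an illustrative sketch of that convention only (the function name and structure are hypothetical, not CLTK's actual `_build_fasttext_filepath` implementation): "wiki" models are named `wiki.<lang>.<ext>` and "common_crawl" models `cc.<lang>.300.<ext>`.

```python
# Hypothetical sketch of the fastText file-naming convention; not the
# actual CLTK implementation, which also resolves the cltk_data directory.
def build_fasttext_filename(
    fasttext_code: str, training_set: str = "wiki", model_type: str = "vec"
) -> str:
    if training_set == "wiki":
        # Wikipedia-only models: wiki.<lang>.<ext>
        return f"wiki.{fasttext_code}.{model_type}"
    if training_set == "common_crawl":
        # Wikipedia + Common Crawl models: cc.<lang>.300.<ext>
        return f"cc.{fasttext_code}.300.{model_type}"
    raise ValueError(f"Unknown training set: {training_set}")

print(build_fasttext_filename("la"))                         # wiki.la.vec
print(build_fasttext_filename("la", "common_crawl", "bin"))  # cc.la.300.bin
```

Note that the language code here is fastText's own code (e.g. ``la``), not the ISO code (``lat``) passed to the class; CLTK maps between the two internally.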
8.1.6.3. cltk.embeddings.processes module¶
This module holds the embeddings ``Process``es.
- class cltk.embeddings.processes.EmbeddingsProcess(language=None, variant='fasttext', embedding_length=None, idf_model=None, min_idf=None, max_idf=None)[source]¶
Bases: Process
To be inherited for each language's embeddings declarations.
Note
There can be no DefaultEmbeddingsProcess because word embeddings are naturally language-specific.
Example: EmbeddingsProcess <- LatinEmbeddingsProcess
>>> from cltk.core.data_types import Doc
>>> from cltk.embeddings.processes import EmbeddingsProcess
>>> from cltk.core.data_types import Process
>>> issubclass(EmbeddingsProcess, Process)
True
>>> emb_proc = EmbeddingsProcess()
- language: str = None¶
- variant: str = 'fasttext'¶
- embedding_length: int = None¶
- idf_model: Optional[Dict[str, float]] = None¶
- min_idf: Optional[float64] = None¶
- max_idf: Optional[float64] = None¶
- algorithm¶
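The per-language subclasses documented below all follow the same pattern: each pins `language` (and sometimes `variant` and `description`) while inheriting everything else. A minimal, self-contained sketch of that pattern, using plain dataclasses as hypothetical stand-ins for CLTK's actual `Process` types:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-ins only, not cltk.core.data_types.Process:
# the base class carries defaults, a language subclass overrides them.
@dataclass
class SketchEmbeddingsProcess:
    language: Optional[str] = None
    variant: str = "fasttext"

@dataclass
class SketchLatinEmbeddingsProcess(SketchEmbeddingsProcess):
    language: Optional[str] = "lat"   # pin the language code

proc = SketchLatinEmbeddingsProcess()
print(proc.language, proc.variant)  # lat fasttext
```

This is why no generic default process can exist: the base class has no meaningful `language` value until a subclass supplies one.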
- class cltk.embeddings.processes.ArabicEmbeddingsProcess(language='arb', variant='fasttext', embedding_length=None, idf_model=None, min_idf=None, max_idf=None, description='Default embeddings for Arabic.')[source]¶
Bases: EmbeddingsProcess
The default Arabic embeddings algorithm.
- description: str = 'Default embeddings for Arabic.'¶
- language: str = 'arb'¶
- class cltk.embeddings.processes.AramaicEmbeddingsProcess(language='arb', variant='fasttext', embedding_length=None, idf_model=None, min_idf=None, max_idf=None, description='Default embeddings for Aramaic.')[source]¶
Bases: EmbeddingsProcess
The default Aramaic embeddings algorithm.
- description: str = 'Default embeddings for Aramaic.'¶
- language: str = 'arb'¶
- class cltk.embeddings.processes.GothicEmbeddingsProcess(language='got', variant='fasttext', embedding_length=None, idf_model=None, min_idf=None, max_idf=None, description='Default embeddings for Gothic.')[source]¶
Bases: EmbeddingsProcess
The default Gothic embeddings algorithm.
- description: str = 'Default embeddings for Gothic.'¶
- language: str = 'got'¶
- class cltk.embeddings.processes.GreekEmbeddingsProcess(language='grc', variant='nlpl', embedding_length=None, idf_model=None, min_idf=None, max_idf=None, description='Default embeddings for Ancient Greek.', authorship_info='``LatinEmbeddingsProcess`` using word2vec model by University of Oslo from http://vectors.nlpl.eu/ . Please cite: https://aclanthology.org/W17-0237/')[source]¶
Bases: EmbeddingsProcess
The default Ancient Greek embeddings algorithm.
- language: str = 'grc'¶
- description: str = 'Default embeddings for Ancient Greek.'¶
- variant: str = 'nlpl'¶
- authorship_info: str = '``LatinEmbeddingsProcess`` using word2vec model by University of Oslo from http://vectors.nlpl.eu/ . Please cite: https://aclanthology.org/W17-0237/'¶
- class cltk.embeddings.processes.LatinEmbeddingsProcess(language='lat', variant='fasttext', embedding_length=None, idf_model=None, min_idf=None, max_idf=None, description='Default embeddings for Latin.', authorship_info='``LatinEmbeddingsProcess`` using word2vec model by University of Oslo from http://vectors.nlpl.eu/ . Please cite: https://aclanthology.org/W17-0237/')[source]¶
Bases: EmbeddingsProcess
The default Latin embeddings algorithm.
- language: str = 'lat'¶
- description: str = 'Default embeddings for Latin.'¶
- variant: str = 'fasttext'¶
- authorship_info: str = '``LatinEmbeddingsProcess`` using word2vec model by University of Oslo from http://vectors.nlpl.eu/ . Please cite: https://aclanthology.org/W17-0237/'¶
- class cltk.embeddings.processes.MiddleEnglishEmbeddingsProcess(language=None, variant='fasttext', embedding_length=None, idf_model=None, min_idf=None, max_idf=None)[source]¶
Bases: EmbeddingsProcess
The default Middle English embeddings algorithm.
- language: str = 'enm'¶
- variant: str = 'cltk'¶
- description = 'Default embeddings for Middle English'¶
- algorithm¶
- class cltk.embeddings.processes.OldEnglishEmbeddingsProcess(language='ang', variant='fasttext', embedding_length=None, idf_model=None, min_idf=None, max_idf=None, description='Default embeddings for Old English.')[source]¶
Bases: EmbeddingsProcess
The default Old English embeddings algorithm.
- description: str = 'Default embeddings for Old English.'¶
- language: str = 'ang'¶
- class cltk.embeddings.processes.PaliEmbeddingsProcess(language='pli', variant='fasttext', embedding_length=None, idf_model=None, min_idf=None, max_idf=None, description='Default embeddings for Pali.')[source]¶
Bases: EmbeddingsProcess
The default Pali embeddings algorithm.
- description: str = 'Default embeddings for Pali.'¶
- language: str = 'pli'¶
- class cltk.embeddings.processes.SanskritEmbeddingsProcess(language='san', variant='fasttext', embedding_length=None, idf_model=None, min_idf=None, max_idf=None, description='Default embeddings for Sanskrit.')[source]¶
Bases: EmbeddingsProcess
The default Sanskrit embeddings algorithm.
- description: str = 'Default embeddings for Sanskrit.'¶
- language: str = 'san'¶
8.1.6.4. cltk.embeddings.sentence module¶
For computing embeddings for lists of words.
- cltk.embeddings.sentence.rescale_idf(val, min_idf, max_idf)[source]¶
Rescale idf values.
- Return type:
float
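A sketch of what idf rescaling typically means here: linear min-max scaling of a raw idf value into the [0, 1] range. The exact formula used by CLTK's `rescale_idf` may differ; this is the standard reading.

```python
# Hedged sketch of min-max idf rescaling, assuming a linear formula;
# maps min_idf -> 0.0 and max_idf -> 1.0.
def rescale_idf_sketch(val: float, min_idf: float, max_idf: float) -> float:
    return (val - min_idf) / (max_idf - min_idf)

print(rescale_idf_sketch(5.0, 0.0, 10.0))  # 0.5
```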
- cltk.embeddings.sentence.compute_pc(x, npc=1)[source]¶
Compute the principal components. DO NOT MAKE THE DATA ZERO MEAN!
- Parameters:
x (ndarray) – X[i, :] is a data point
npc (int) – number of principal components to remove
- Return type:
ndarray
- Returns:
component_[i, :] is the i-th principal component
This has been adapted from the SIF paper code: https://openreview.net/pdf?id=SyK00v5xx.
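The computation can be sketched with NumPy's SVD (the SIF reference code uses scikit-learn's `TruncatedSVD`; the top right-singular vectors are the same components). Note that, per the warning above, the data is deliberately not mean-centered first.

```python
import numpy as np

# Sketch of compute_pc via SVD; not CLTK's exact implementation.
# The data is NOT zero-meaned, matching the SIF paper's procedure.
def compute_pc_sketch(x: np.ndarray, npc: int = 1) -> np.ndarray:
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return vt[:npc]  # row i is the i-th principal component

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 5))   # 10 data points of dimension 5
pc = compute_pc_sketch(x, npc=1)
print(pc.shape)  # (1, 5)
```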
- cltk.embeddings.sentence.remove_pc(x, npc=1)[source]¶
Remove the projection on the principal components. Calling this on a collection of sentence embeddings, prior to comparison, may improve accuracy.
- Parameters:
x (ndarray) – X[i, :] is a data point
npc (int) – number of principal components to remove
- Return type:
ndarray
- Returns:
XX[i, :] is the data point after removing its projection
This has been adapted from the SIF paper code: https://openreview.net/pdf?id=SyK00v5xx.
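In sketch form, the removal subtracts each row's projection onto the top component(s), following the SIF paper's common-component removal (this is an illustrative reimplementation, not CLTK's exact code):

```python
import numpy as np

# Sketch of remove_pc: project out the top principal component(s).
def remove_pc_sketch(x: np.ndarray, npc: int = 1) -> np.ndarray:
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    pc = vt[:npc]                 # (npc, dim) principal components
    return x - (x @ pc.T) @ pc    # subtract each row's projection

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 5))
xx = remove_pc_sketch(x)

# The result is orthogonal to the removed component:
_, _, vt = np.linalg.svd(x, full_matrices=False)
print(np.allclose(xx @ vt[0], 0.0))  # True
```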
- cltk.embeddings.sentence.get_sent_embeddings(sent, idf_model, min_idf, max_idf, dimensions=300)[source]¶
Provides the weighted average of a sentence’s word vectors.
Expectations: a word may appear only once in a sentence; multiple occurrences are collapsed. There must be two or more word embeddings, otherwise the principal component cannot be computed and removed.
- Parameters:
sent (Sentence) – a Sentence
idf_model (Dict[str, Union[float, float64]]) – a dictionary of tokens and idf values
min_idf (Union[float, float64]) – the min idf score to use for scaling
max_idf (Union[float, float64]) – the max idf score to use for scaling
dimensions (int) – the number of dimensions of the embedding
- Return type:
ndarray
- Returns:
values of the sentence embedding, or an array of zeroes if no sentence embedding could be computed
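The core of the weighted average can be sketched as follows. This is a simplified, hypothetical version: it takes a plain token list and a `{token: vector}` map rather than CLTK's `Sentence` object, and scales each idf weight linearly between `min_idf` and `max_idf`.

```python
import numpy as np

# Hedged sketch of an idf-weighted average sentence embedding.
# `tokens`, `word_vectors`, and `idf_model` are hypothetical inputs;
# CLTK's get_sent_embeddings operates on a Sentence object instead.
def weighted_sent_embedding(tokens, word_vectors, idf_model,
                            min_idf, max_idf, dimensions=300):
    vecs, weights = [], []
    for tok in set(tokens):  # multiple occurrences are collapsed
        if tok in word_vectors and tok in idf_model:
            scaled = (idf_model[tok] - min_idf) / (max_idf - min_idf)
            vecs.append(word_vectors[tok])
            weights.append(scaled)
    if not vecs:
        # Nothing could be computed: return zeroes, as documented above.
        return np.zeros(dimensions)
    return np.average(np.array(vecs), axis=0, weights=weights)

wv = {"amor": np.ones(3), "vincit": np.full(3, 3.0)}
idf = {"amor": 2.0, "vincit": 6.0}
emb = weighted_sent_embedding(["amor", "vincit"], wv, idf, 0.0, 8.0, dimensions=3)
print(emb)  # idf weights 0.25 and 0.75 -> [2.5 2.5 2.5]
```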