8.1.11. cltk.ner package

8.1.11.1. Submodules

8.1.11.2. cltk.ner.ner module

Named entity recognition (NER).

Note

For Greek and Latin, v. 0.1 could report True/False for whether a word was an entity of any sort (i.e., a proper noun). The data used for this is available at os.path.join(CLTK_DATA_DIR, "grc/model/grc_models_cltk/ner/proper_names.txt") and os.path.join(CLTK_DATA_DIR, "lat/model/lat_models_cltk/ner/proper_names.txt"), respectively.
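A True/False lookup of the kind described in the note can be sketched as a set-membership test. This is only an illustration, not the v. 0.1 implementation: the helper names load_proper_names and flag_proper_names are hypothetical, and the sketch assumes the names file holds one name per line, which the note does not confirm.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, Set


def load_proper_names(path: Path) -> Set[str]:
    """Read a names file, assuming one name per line (an assumed format)."""
    with open(path, encoding="utf-8") as file_open:
        return {line.strip() for line in file_open if line.strip()}


def flag_proper_names(tokens: Iterable[str], names: Set[str]) -> List[bool]:
    """Return True for each token found in the names set, else False."""
    return [token in names for token in tokens]


# Demonstrate with a throwaway file instead of the real CLTK data directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    names_path = Path(tmp_dir) / "proper_names.txt"
    names_path.write_text("Gallia\nBelgae\n", encoding="utf-8")
    names = load_proper_names(names_path)

flags = flag_proper_names(["Gallia", "est", "omnis", "Belgae"], names)
print(flags)  # [True, False, False, True]
```

To try this against the real data, the temporary file would be replaced by one of the paths given in the note.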

cltk.ner.ner.tag_ner(iso_code, input_tokens)[source]

Run NER for the chosen language. Some languages return a boolean (True/False) for each token; others return a string naming the entity type (e.g., LOC) for entities and False otherwise.

>>> from cltk.ner.ner import tag_ner
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> tokens = split_punct_ws(get_example_text(iso_code="lat"))

# >>> text = "ἐπὶ δ᾽ οὖν τοῖς πρώτοις τοῖσδε Περικλῆς ὁ Ξανθίππου ᾑρέθη λέγειν. καὶ ἐπειδὴ καιρὸς ἐλάμβανε, προελθὼν ἀπὸ τοῦ σήματος ἐπὶ βῆμα ὑψηλὸν πεποιημένον, ὅπως ἀκούοιτο ὡς ἐπὶ πλεῖστον τοῦ ὁμίλου, ἔλεγε τοιάδε."
# >>> tokens = split_punct_ws(text)
# >>> are_words_entities = tag_ner(iso_code="grc", input_tokens=tokens)
# >>> tokens[:9]
# ['ἐπὶ', 'δ᾽', 'οὖν', 'τοῖς', 'πρώτοις', 'τοῖσδε', 'Περικλῆς', 'ὁ', 'Ξανθίππου']
# >>> are_words_entities[:9] # TODO find working ex!
# [False, False, False, False, False, False, False, False, False]

>>> tokens = split_punct_ws(get_example_text(iso_code="fro"))
>>> are_words_entities = tag_ner(iso_code="fro", input_tokens=tokens)
>>> tokens[30:50]
['Bretaigne', 'A', 'I', 'molt', 'riche', 'chevalier', 'Hardi', 'et', 'coragous', 'et', 'fier', 'De', 'la', 'Table', 'Reonde', 'estoit', 'Le', 'roi', 'Artu', 'que']
>>> are_words_entities[30:50]
['LOC', False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, 'CHI']
Return type:

List[Union[str, bool]]
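Because the return value mixes strings and booleans, downstream code usually needs to pair tags with their tokens and drop the non-entities. The following is a minimal sketch of that step; entities_only is a hypothetical helper, not part of cltk, and the sample data is a shortened copy of the Old French doctest output above.

```python
from typing import List, Tuple, Union

Tag = Union[str, bool]


def entities_only(tokens: List[str], tags: List[Tag]) -> List[Tuple[str, Tag]]:
    """Pair tokens with tags and keep only those not tagged False."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    # "is not False" keeps both string labels and a boolean True.
    return [(token, tag) for token, tag in zip(tokens, tags) if tag is not False]


tokens = ["Bretaigne", "A", "I", "molt", "riche", "chevalier"]
tags = ["LOC", False, False, False, False, False]
found = entities_only(tokens, tags)
print(found)  # [('Bretaigne', 'LOC')]
```

Testing with "is not False" rather than plain truthiness is a deliberate choice here: it works the same for languages that return True and for those that return a label string.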

8.1.11.3. cltk.ner.processes module

This module holds the Process classes for NER.

class cltk.ner.processes.NERProcess(language=None)[source]

Bases: Process

To be inherited for each language’s NER declarations.

>>> from cltk.core.data_types import Doc
>>> from cltk.ner.processes import NERProcess
>>> from cltk.core.data_types import Process
>>> issubclass(NERProcess, Process)
True
>>> ner_proc = NERProcess()
language: str = None
algorithm
run(input_doc)[source]
Return type:

Doc

class cltk.ner.processes.GreekNERProcess(language='grc', description='Default NER for Greek.')[source]

Bases: NERProcess

The default Greek NER algorithm.

Todo

Update the doctest with the production model.

>>> from cltk.core.data_types import Doc, Word
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> text = "ἐπὶ δ᾽ οὖν τοῖς πρώτοις τοῖσδε Περικλῆς ὁ Ξανθίππου ᾑρέθη λέγειν. καὶ ἐπειδὴ καιρὸς ἐλάμβανε, προελθὼν ἀπὸ τοῦ σήματος ἐπὶ βῆμα ὑψηλὸν πεποιημένον, ὅπως ἀκούοιτο ὡς ἐπὶ πλεῖστον τοῦ ὁμίλου, ἔλεγε τοιάδε."
>>> tokens = [Word(string=token) for token in split_punct_ws(text)]
>>> a_process = GreekNERProcess()
>>> output_doc = a_process.run(Doc(raw=text, words=tokens))
>>> output_doc.words[7].string
'ὁ'
>>> output_doc.words[7].named_entity
False
>>> output_doc.words[8].string
'Ξανθίππου'
>>> output_doc.words[8].named_entity
False
language: str = 'grc'
description: str = 'Default NER for Greek.'
class cltk.ner.processes.OldEnglishNERProcess(language='ang', description='Default NER for Old English.')[source]

Bases: NERProcess

The default Old English NER algorithm.

Todo

Update the doctest with the production model.

>>> from cltk.core.data_types import Doc, Word
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> text = get_example_text(iso_code="ang")
>>> tokens = [Word(string=token) for token in split_punct_ws(text)]
>>> a_process = OldEnglishNERProcess()
>>> output_doc = a_process.run(Doc(raw=text, words=tokens))
>>> output_doc.words[2].string, output_doc.words[2].named_entity
('Gardena', 'LOCATION')
language: str = 'ang'
description: str = 'Default NER for Old English.'
class cltk.ner.processes.LatinNERProcess(language='lat', description='Default NER for Latin.')[source]

Bases: NERProcess

The default Latin NER algorithm.

>>> from cltk.core.data_types import Doc, Word
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> tokens = [Word(string=token) for token in split_punct_ws(get_example_text("lat"))]
>>> a_process = LatinNERProcess()
>>> output_doc = a_process.run(Doc(raw=get_example_text("lat"), words=tokens))
>>> [word.named_entity for word in output_doc.words][:20]
['LOCATION', False, False, False, False, False, False, False, False, False, 'LOCATION', False, 'LOCATION', False, False, False, False, 'LOCATION', False, 'LOCATION']
language: str = 'lat'
description: str = 'Default NER for Latin.'
class cltk.ner.processes.OldFrenchNERProcess(language='fro', description='Default NER for Old French.')[source]

Bases: NERProcess

The default Old French NER algorithm.

>>> from cltk.core.data_types import Doc, Word
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> tokens = [Word(string=token) for token in split_punct_ws(get_example_text("fro"))]
>>> a_process = OldFrenchNERProcess()
>>> output_doc = a_process.run(Doc(raw=get_example_text("fro"), words=tokens))
>>> output_doc.words[30].string
'Bretaigne'
>>> output_doc.words[30].named_entity
'LOC'
>>> output_doc.words[31].named_entity
False
language: str = 'fro'
description: str = 'Default NER for Old French.'
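The classes in this module share one pattern: a base process with a language attribute and a run method that takes a Doc and returns it with each word's named_entity filled in, plus per-language subclasses that set defaults. The sketch below imitates that pattern with stand-in classes so it runs without cltk or its models. ToyWord, ToyDoc, ToyNERProcess, and ToyLatinNERProcess are hypothetical names, and the tiny place-name lookup is an assumption made for illustration, not how the real processes tag text.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class ToyWord:
    """Stand-in for a word with a string and an NER annotation."""
    string: str
    named_entity: Union[str, bool, None] = None


@dataclass
class ToyDoc:
    """Stand-in for a document holding a list of words."""
    words: List[ToyWord] = field(default_factory=list)


@dataclass
class ToyNERProcess:
    """Base process: subclasses override the language default and tag()."""
    language: Optional[str] = None

    def tag(self, tokens: List[str]) -> List[Union[str, bool]]:
        # Placeholder tagger: marks nothing as an entity.
        return [False for _ in tokens]

    def run(self, input_doc: ToyDoc) -> ToyDoc:
        # Tag all tokens, then write each tag back onto its word.
        tags = self.tag([word.string for word in input_doc.words])
        for word, tag in zip(input_doc.words, tags):
            word.named_entity = tag
        return input_doc


@dataclass
class ToyLatinNERProcess(ToyNERProcess):
    """Language-specific subclass with its own defaults."""
    language: Optional[str] = "lat"
    description: str = "Toy NER for Latin."

    def tag(self, tokens: List[str]) -> List[Union[str, bool]]:
        known_places = {"Gallia"}  # illustrative lookup only
        return ["LOCATION" if token in known_places else False for token in tokens]


doc = ToyDoc(words=[ToyWord("Gallia"), ToyWord("est")])
output_doc = ToyLatinNERProcess().run(doc)
pairs = [(word.string, word.named_entity) for word in output_doc.words]
print(pairs)  # [('Gallia', 'LOCATION'), ('est', False)]
```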

8.1.11.4. cltk.ner.spacy_ner module

Module for all NER relying on spaCy.

class cltk.ner.spacy_ner.CustomTokenizer(vocab)[source]

Bases: DummyTokenizer

cltk.ner.spacy_ner.download_prompt(iso_code, message, model_url, interactive=True, silent=False)[source]

Ask user whether to download files.

TODO: Make the fastText and Stanza code use this function. Consider moving it to another module.

cltk.ner.spacy_ner.spacy_tag_ner(iso_code, text_tokens, model_path)[source]

Take a list of tokens and return, for each token, its entity label or False.

>>> text_tokens = ["Gallia", "est", "omnis", "divisa", "in", "partes", "tres", ",", "quarum", "unam", "incolunt", "Belgae", ",", "aliam", "Aquitani", ",", "tertiam", "qui", "ipsorum", "lingua", "Celtae", ",", "nostra", "Galli", "appellantur", "."]
>>> import os
>>> from cltk.utils import CLTK_DATA_DIR
>>> from cltk.ner.spacy_ner import spacy_tag_ner
>>> spacy_tag_ner('lat', text_tokens=text_tokens, model_path=os.path.join(CLTK_DATA_DIR, "lat", "model", "lat_models_cltk", "ner", "spacy_model"))
['LOCATION', False, False, False, False, False, False, False, False, False, False, 'LOCATION', False, False, 'LOCATION', False, False, False, False, False, 'LOCATION', False, False, 'LOCATION', False, False]
Return type:

List[Union[str, bool]]
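The label-or-False shape of this return value can be sketched without loading a model. In spaCy, Token.ent_type_ is an empty string for tokens outside any entity, so one plausible conversion maps empty strings to False and keeps everything else. Whether spacy_tag_ner does exactly this is an assumption, and labels_or_false is a hypothetical helper name.

```python
from typing import List, Union


def labels_or_false(ent_types: List[str]) -> List[Union[str, bool]]:
    """Map spaCy-style entity types to a label, or False for the empty string."""
    return [ent_type if ent_type else False for ent_type in ent_types]


# Entity types as they might look for "Gallia est omnis divisa".
ent_types = ["LOCATION", "", "", ""]
converted = labels_or_false(ent_types)
print(converted)  # ['LOCATION', False, False, False]
```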