8.1.11. cltk.ner package
8.1.11.1. Submodules
8.1.11.2. cltk.ner.ner module
Named entity recognition (NER).
Note
For Greek and Latin, v. 0.1 could report True/False for whether a word was an entity of any sort (i.e., a proper noun). The data used for this is available at os.path.join(CLTK_DATA_DIR, "grc/model/grc_models_cltk/ner/proper_names.txt") and os.path.join(CLTK_DATA_DIR, "lat/model/lat_models_cltk/ner/proper_names.txt"), respectively.
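The True/False lookup described in the note can be approximated with a plain set-membership test. The sketch below is not the v. 0.1 implementation. It assumes the proper-names file holds one name per line (the format is not stated here), and `load_proper_names` and `flag_proper_nouns` are hypothetical helper names. A temporary file stands in for proper_names.txt so the example runs without any CLTK data.

```python
import os
import tempfile
from typing import List, Set


def load_proper_names(path: str) -> Set[str]:
    """Read one name per line (assumed format) into a set."""
    with open(path, encoding="utf-8") as file_obj:
        return {line.strip() for line in file_obj if line.strip()}


def flag_proper_nouns(tokens: List[str], names: Set[str]) -> List[bool]:
    """Return True for each token found in the name set, else False."""
    return [token in names for token in tokens]


# A temporary file stands in for proper_names.txt (hypothetical contents).
with tempfile.TemporaryDirectory() as tmp_dir:
    names_path = os.path.join(tmp_dir, "proper_names.txt")
    with open(names_path, "w", encoding="utf-8") as file_obj:
        file_obj.write("Gallia\nBelgae\nAquitani\n")
    names = load_proper_names(names_path)

flags = flag_proper_nouns(["Gallia", "est", "omnis", "divisa"], names)
print(flags)
```

With real data, `names_path` would instead be built with os.path.join(CLTK_DATA_DIR, ...) as in the note above.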
cltk.ner.ner.tag_ner(iso_code, input_tokens)
Run NER for the chosen language. Some languages return a boolean True/False per token; others return a string naming the entity type (e.g., LOC).
>>> from cltk.ner.ner import tag_ner
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> tokens = split_punct_ws(get_example_text(iso_code="lat"))
>>> text = "ἐπὶ δ᾽ οὖν τοῖς πρώτοις τοῖσδε Περικλῆς ὁ Ξανθίππου ᾑρέθη λέγειν. καὶ ἐπειδὴ καιρὸς ἐλάμβανε, προελθὼν ἀπὸ τοῦ σήματος ἐπὶ βῆμα ὑψηλὸν πεποιημένον, ὅπως ἀκούοιτο ὡς ἐπὶ πλεῖστον τοῦ ὁμίλου, ἔλεγε τοιάδε."
>>> tokens = split_punct_ws(text)
>>> are_words_entities = tag_ner(iso_code="grc", input_tokens=tokens)
>>> tokens[:9]
['ἐπὶ', 'δ᾽', 'οὖν', 'τοῖς', 'πρώτοις', 'τοῖσδε', 'Περικλῆς', 'ὁ', 'Ξανθίππου']
>>> are_words_entities[:9]  # TODO check this result
[False, False, False, False, False, False, False, False, False]
>>> tokens = split_punct_ws(get_example_text(iso_code="fro"))
>>> are_words_entities = tag_ner(iso_code="fro", input_tokens=tokens)
>>> tokens[30:50]
['Bretaigne', 'A', 'I', 'molt', 'riche', 'chevalier', 'Hardi', 'et', 'coragous', 'et', 'fier', 'De', 'la', 'Table', 'Reonde', 'estoit', 'Le', 'roi', 'Artu', 'que']
>>> are_words_entities[30:50]
['LOC', False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, 'CHI']
Return type: List[Union[str, bool]]
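Because the result mixes strings and booleans, downstream code usually needs to pair each tag with its token and drop the False entries. The following is a minimal sketch of that step; `extract_entities` is a hypothetical helper, not part of cltk, and the sample data is copied from the Old French doctest so the example runs without CLTK installed.

```python
from typing import List, Tuple, Union

Tag = Union[str, bool]


def extract_entities(tokens: List[str], tags: List[Tag]) -> List[Tuple[str, Tag]]:
    """Keep (token, tag) pairs whose tag is anything other than False."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return [(token, tag) for token, tag in zip(tokens, tags) if tag is not False]


# Sample data copied from the Old French doctest (tokens[30:50]).
tokens = ['Bretaigne', 'A', 'I', 'molt', 'riche', 'chevalier', 'Hardi', 'et',
          'coragous', 'et', 'fier', 'De', 'la', 'Table', 'Reonde', 'estoit',
          'Le', 'roi', 'Artu', 'que']
tags = ['LOC'] + [False] * 18 + ['CHI']

entities = extract_entities(tokens, tags)
print(entities)  # [('Bretaigne', 'LOC'), ('que', 'CHI')]
```

The `tag is not False` test keeps both string labels and a boolean True, so the same helper works for languages that only report whether a token is an entity.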
8.1.11.3. cltk.ner.processes module
This module holds the Process classes for NER.
class cltk.ner.processes.NERProcess(language: str = None)
Bases: cltk.core.data_types.Process
To be inherited for each language’s NER declarations.
>>> from cltk.core.data_types import Doc
>>> from cltk.ner.processes import NERProcess
>>> from cltk.core.data_types import Process
>>> issubclass(NERProcess, Process)
True
>>> emb_proc = NERProcess()
language: str = None
algorithm
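The inheritance pattern described above (a base process that each language specializes) can be illustrated with stand-in classes. This is a simplified sketch, not CLTK's implementation: `SketchNERProcess` and `SketchLatinNERProcess` are hypothetical names, `run` takes a plain token list rather than a Doc, and the toy `algorithm` uses a hard-coded set instead of a trained model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

Tag = Union[str, bool]


@dataclass
class SketchNERProcess:
    """Hypothetical stand-in for a base NER process; not CLTK's class."""

    language: Optional[str] = None

    def algorithm(self, tokens: List[str]) -> List[Tag]:
        # Language-specific subclasses are expected to supply the tagger.
        raise NotImplementedError("language subclasses supply the tagger")

    def run(self, tokens: List[str]) -> List[Tuple[str, Tag]]:
        """Pair each token with the tag produced by the subclass's algorithm."""
        return list(zip(tokens, self.algorithm(tokens)))


@dataclass
class SketchLatinNERProcess(SketchNERProcess):
    """Toy Latin subclass: overrides the defaults and the algorithm."""

    language: Optional[str] = "lat"
    description: str = "Toy NER for Latin."

    def algorithm(self, tokens: List[str]) -> List[Tag]:
        known_places = {"Gallia"}  # illustrative data only
        return ["LOCATION" if token in known_places else False for token in tokens]


process = SketchLatinNERProcess()
result = process.run(["Gallia", "est", "omnis"])
print(process.language, result)
```

Each subclass only changes its default `language`, its `description`, and the tagger it delegates to, which mirrors how the language-specific classes below differ from one another in their signatures.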
class cltk.ner.processes.GreekNERProcess(language: str = 'grc', description: str = 'Default NER for Greek.')
Bases: cltk.ner.processes.NERProcess
The default Greek NER algorithm.
Todo
Update doctest w/ production model
>>> from cltk.core.data_types import Doc, Word
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> text = "ἐπὶ δ᾽ οὖν τοῖς πρώτοις τοῖσδε Περικλῆς ὁ Ξανθίππου ᾑρέθη λέγειν. καὶ ἐπειδὴ καιρὸς ἐλάμβανε, προελθὼν ἀπὸ τοῦ σήματος ἐπὶ βῆμα ὑψηλὸν πεποιημένον, ὅπως ἀκούοιτο ὡς ἐπὶ πλεῖστον τοῦ ὁμίλου, ἔλεγε τοιάδε."
>>> tokens = [Word(string=token) for token in split_punct_ws(text)]
>>> a_process = GreekNERProcess()
>>> output_doc = a_process.run(Doc(raw=text, words=tokens))
>>> output_doc.words[7].string
'ὁ'
>>> output_doc.words[7].named_entity
False
>>> output_doc.words[8].string
'Ξανθίππου'
>>> output_doc.words[8].named_entity
False
language: str = 'grc'
description: str = 'Default NER for Greek.'
class cltk.ner.processes.OldEnglishNERProcess(language: str = 'ang', description: str = 'Default NER for Old English.')
Bases: cltk.ner.processes.NERProcess
The default Old English NER algorithm.
Todo
Update doctest w/ production model
>>> from cltk.core.data_types import Doc, Word
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> text = get_example_text(iso_code="ang")
>>> tokens = [Word(string=token) for token in split_punct_ws(text)]
>>> a_process = OldEnglishNERProcess()
>>> output_doc = a_process.run(Doc(raw=text, words=tokens))
>>> output_doc.words[2].string, output_doc.words[2].named_entity
('Gardena', 'LOCATION')
language: str = 'ang'
description: str = 'Default NER for Old English.'
class cltk.ner.processes.LatinNERProcess(language: str = 'lat', description: str = 'Default NER for Latin.')
Bases: cltk.ner.processes.NERProcess
The default Latin NER algorithm.
>>> from cltk.core.data_types import Doc, Word
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> tokens = [Word(string=token) for token in split_punct_ws(get_example_text("lat"))]
>>> a_process = LatinNERProcess()
>>> output_doc = a_process.run(Doc(raw=get_example_text("lat"), words=tokens))
>>> [word.named_entity for word in output_doc.words][:20]
['LOCATION', False, False, False, False, False, False, False, False, False, 'LOCATION', False, 'LOCATION', False, False, False, False, 'LOCATION', False, 'LOCATION']
language: str = 'lat'
description: str = 'Default NER for Latin.'
class cltk.ner.processes.OldFrenchNERProcess(language: str = 'fro', description: str = 'Default NER for Old French.')
Bases: cltk.ner.processes.NERProcess
The default Old French NER algorithm.
>>> from cltk.core.data_types import Doc, Word
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> tokens = [Word(string=token) for token in split_punct_ws(get_example_text("fro"))]
>>> a_process = OldFrenchNERProcess()
>>> output_doc = a_process.run(Doc(raw=get_example_text("fro"), words=tokens))
>>> output_doc.words[30].string
'Bretaigne'
>>> output_doc.words[30].named_entity
'LOC'
>>> output_doc.words[31].named_entity
False
language: str = 'fro'
description: str = 'Default NER for Old French.'
8.1.11.4. cltk.ner.spacy_ner module
Module for all NER relying on spaCy.
cltk.ner.spacy_ner.download_prompt(iso_code, message, model_url, interactive=True, silent=False)
Ask user whether to download files.
TODO: Make ft and stanza use this fn. Consider moving to other module.
cltk.ner.spacy_ner.spacy_tag_ner(iso_code, text_tokens, model_path)
Take a list of tokens and return, for each token, an entity label or False.
>>> import os
>>> text_tokens = ["Gallia", "est", "omnis", "divisa", "in", "partes", "tres", ",", "quarum", "unam", "incolunt", "Belgae", ",", "aliam", "Aquitani", ",", "tertiam", "qui", "ipsorum", "lingua", "Celtae", ",", "nostra", "Galli", "appellantur", "."]
>>> from cltk.utils import CLTK_DATA_DIR
>>> spacy_tag_ner('lat', text_tokens=text_tokens, model_path=os.path.join(CLTK_DATA_DIR, "lat/model/lat_models_cltk/ner/spacy_model/"))
['LOCATION', False, False, False, False, False, False, False, False, False, False, 'LOCATION', False, False, 'LOCATION', False, False, False, False, False, 'LOCATION', False, False, 'LOCATION', False, False]
Return type: List[Union[str, bool]]
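A list of per-token labels like the one above is often easier to inspect once tokens are grouped by label. The sketch below shows one way to do that; `group_by_label` is a hypothetical helper, not a cltk function, and the tokens and tags are copied from the Latin doctest so no spaCy model is needed to run it.

```python
from collections import defaultdict
from typing import Dict, List, Union

Tag = Union[str, bool]


def group_by_label(tokens: List[str], tags: List[Tag]) -> Dict[str, List[str]]:
    """Collect tokens under their string label; boolean tags are skipped."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for token, tag in zip(tokens, tags):
        if isinstance(tag, str):
            grouped[tag].append(token)
    return dict(grouped)


# Tokens and tags copied from the Latin doctest output.
text_tokens = ["Gallia", "est", "omnis", "divisa", "in", "partes", "tres", ",",
               "quarum", "unam", "incolunt", "Belgae", ",", "aliam", "Aquitani",
               ",", "tertiam", "qui", "ipsorum", "lingua", "Celtae", ",",
               "nostra", "Galli", "appellantur", "."]
tags = ['LOCATION', False, False, False, False, False, False, False, False,
        False, False, 'LOCATION', False, False, 'LOCATION', False, False,
        False, False, False, 'LOCATION', False, False, 'LOCATION', False, False]

by_label = group_by_label(text_tokens, tags)
print(by_label)  # {'LOCATION': ['Gallia', 'Belgae', 'Aquitani', 'Celtae', 'Galli']}
```

Grouping on `isinstance(tag, str)` means a model that emits several label types (for example LOC and CHI in the Old French output) yields one key per label.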