8.1.11. cltk.ner package
8.1.11.1. Submodules
8.1.11.2. cltk.ner.ner module
Named entity recognition (NER).
Note
For Greek and Latin, v. 0.1 could report True/False for whether a word was an entity of any sort (i.e., a proper noun). The data used for this is available at os.path.join(CLTK_DATA_DIR, "grc/model/grc_models_cltk/ner/proper_names.txt") and os.path.join(CLTK_DATA_DIR, "lat/model/lat_models_cltk/ner/proper_names.txt"), respectively.
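The True/False check described in the note amounts to a set-membership test. The following is a minimal sketch, assuming that proper_names.txt holds one name per line (the file layout is not stated here). The helpers `load_proper_names` and `is_proper_name` are hypothetical names for illustration, not part of the CLTK API.

```python
from typing import Iterable, List, Set


def load_proper_names(lines: Iterable[str]) -> Set[str]:
    """Build a lookup set from lines of text, assuming one name per line."""
    return {line.strip() for line in lines if line.strip()}


def is_proper_name(tokens: List[str], names: Set[str]) -> List[bool]:
    """Return True/False per token, depending on whether it is in the set."""
    return [token in names for token in tokens]


# In-memory stand-in for the contents of a proper_names.txt file (assumed format).
sample_lines = ["Gallia\n", "Belgae\n", "\n", "Aquitani\n"]
names = load_proper_names(sample_lines)
flags = is_proper_name(["Gallia", "est", "omnis", "Belgae"], names)
print(flags)  # [True, False, False, True]
```

With the real data, the lines could come from a file object opened on either path given in the note, e.g. `open(path, encoding="utf-8")`.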
- cltk.ner.ner.tag_ner(iso_code, input_tokens)
Run NER for the chosen language. Some languages return a boolean True/False; others return a string giving the entity type (e.g., LOC).
>>> from cltk.ner.ner import tag_ner
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> tokens = split_punct_ws(get_example_text(iso_code="lat"))
# >>> text = "ἐπὶ δ᾽ οὖν τοῖς πρώτοις τοῖσδε Περικλῆς ὁ Ξανθίππου ᾑρέθη λέγειν. καὶ ἐπειδὴ καιρὸς ἐλάμβανε, προελθὼν ἀπὸ τοῦ σήματος ἐπὶ βῆμα ὑψηλὸν πεποιημένον, ὅπως ἀκούοιτο ὡς ἐπὶ πλεῖστον τοῦ ὁμίλου, ἔλεγε τοιάδε."
# >>> tokens = split_punct_ws(text)
# >>> are_words_entities = tag_ner(iso_code="grc", input_tokens=tokens)
# >>> tokens[:9]
# ['ἐπὶ', 'δ᾽', 'οὖν', 'τοῖς', 'πρώτοις', 'τοῖσδε', 'Περικλῆς', 'ὁ', 'Ξανθίππου']
# >>> are_words_entities[:9]  # TODO find working ex!
# [False, False, False, False, False, False, False, False, False]
>>> tokens = split_punct_ws(get_example_text(iso_code="fro"))
>>> are_words_entities = tag_ner(iso_code="fro", input_tokens=tokens)
>>> tokens[30:50]
['Bretaigne', 'A', 'I', 'molt', 'riche', 'chevalier', 'Hardi', 'et', 'coragous', 'et', 'fier', 'De', 'la', 'Table', 'Reonde', 'estoit', 'Le', 'roi', 'Artu', 'que']
>>> are_words_entities[30:50]
['LOC', False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, 'CHI']
- Return type:
List[Union[str, bool]]
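Because the result mixes strings and booleans, callers usually need to normalize it before use. The sketch below pairs each token with its label and drops non-entities. It works on plain lists shaped like the Old French output above rather than calling CLTK. The helper name `extract_entities` is hypothetical, and mapping a bare True to the placeholder label "ENTITY" is an assumption made for illustration.

```python
from typing import List, Tuple, Union

Tag = Union[str, bool]


def extract_entities(tokens: List[str], tags: List[Tag]) -> List[Tuple[str, str]]:
    """Pair each entity token with its label, skipping tokens tagged False."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    entities = []
    for token, tag in zip(tokens, tags):
        if tag is False:
            continue  # not an entity
        # Assumption: a bare True means "entity of unknown type".
        label = "ENTITY" if tag is True else tag
        entities.append((token, label))
    return entities


# Values copied from the start of the Old French slice shown above.
tokens = ["Bretaigne", "A", "I", "molt", "riche", "chevalier"]
tags = ["LOC", False, False, False, False, False]
print(extract_entities(tokens, tags))  # [('Bretaigne', 'LOC')]
```

Testing with `is False` and `is True` instead of truthiness keeps the two boolean cases distinct from string labels.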
8.1.11.3. cltk.ner.processes module
This module holds the Process classes for NER.
- class cltk.ner.processes.NERProcess(language=None)
Bases: Process
To be inherited for each language's NER declarations.
>>> from cltk.core.data_types import Doc
>>> from cltk.ner.processes import NERProcess
>>> from cltk.core.data_types import Process
>>> issubclass(NERProcess, Process)
True
>>> emb_proc = NERProcess()
- language: str = None
- algorithm
- class cltk.ner.processes.GreekNERProcess(language='grc', description='Default NER for Greek.')
Bases: NERProcess
The default Greek NER algorithm.
Todo
Update the doctest with the production model.
>>> from cltk.core.data_types import Doc, Word
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.ner.processes import GreekNERProcess
>>> from boltons.strutils import split_punct_ws
>>> text = "ἐπὶ δ᾽ οὖν τοῖς πρώτοις τοῖσδε Περικλῆς ὁ Ξανθίππου ᾑρέθη λέγειν. καὶ ἐπειδὴ καιρὸς ἐλάμβανε, προελθὼν ἀπὸ τοῦ σήματος ἐπὶ βῆμα ὑψηλὸν πεποιημένον, ὅπως ἀκούοιτο ὡς ἐπὶ πλεῖστον τοῦ ὁμίλου, ἔλεγε τοιάδε."
>>> tokens = [Word(string=token) for token in split_punct_ws(text)]
>>> a_process = GreekNERProcess()
>>> output_doc = a_process.run(Doc(raw=text, words=tokens))
>>> output_doc.words[7].string
'ὁ'
>>> output_doc.words[7].named_entity
False
>>> output_doc.words[8].string
'Ξανθίππου'
>>> output_doc.words[8].named_entity
False
- language: str = 'grc'
- description: str = 'Default NER for Greek.'
- class cltk.ner.processes.OldEnglishNERProcess(language='ang', description='Default NER for Old English.')
Bases: NERProcess
The default Old English NER algorithm.
Todo
Update the doctest with the production model.
>>> from cltk.core.data_types import Doc, Word
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.ner.processes import OldEnglishNERProcess
>>> from boltons.strutils import split_punct_ws
>>> text = get_example_text(iso_code="ang")
>>> tokens = [Word(string=token) for token in split_punct_ws(text)]
>>> a_process = OldEnglishNERProcess()
>>> output_doc = a_process.run(Doc(raw=text, words=tokens))
>>> output_doc.words[2].string, output_doc.words[2].named_entity
('Gardena', 'LOCATION')
- language: str = 'ang'
- description: str = 'Default NER for Old English.'
- class cltk.ner.processes.LatinNERProcess(language='lat', description='Default NER for Latin.')
Bases: NERProcess
The default Latin NER algorithm.
>>> from cltk.core.data_types import Doc, Word
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.ner.processes import LatinNERProcess
>>> from boltons.strutils import split_punct_ws
>>> tokens = [Word(string=token) for token in split_punct_ws(get_example_text("lat"))]
>>> a_process = LatinNERProcess()
>>> output_doc = a_process.run(Doc(raw=get_example_text("lat"), words=tokens))
>>> [word.named_entity for word in output_doc.words][:20]
['LOCATION', False, False, False, False, False, False, False, False, False, 'LOCATION', False, 'LOCATION', False, False, False, False, 'LOCATION', False, 'LOCATION']
- language: str = 'lat'
- description: str = 'Default NER for Latin.'
- class cltk.ner.processes.OldFrenchNERProcess(language='fro', description='Default NER for Old French.')
Bases: NERProcess
The default Old French NER algorithm.
>>> from cltk.core.data_types import Doc, Word
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.ner.processes import OldFrenchNERProcess
>>> from boltons.strutils import split_punct_ws
>>> tokens = [Word(string=token) for token in split_punct_ws(get_example_text("fro"))]
>>> a_process = OldFrenchNERProcess()
>>> output_doc = a_process.run(Doc(raw=get_example_text("fro"), words=tokens))
>>> output_doc.words[30].string
'Bretaigne'
>>> output_doc.words[30].named_entity
'LOC'
>>> output_doc.words[31].named_entity
False
- language: str = 'fro'
- description: str = 'Default NER for Old French.'
8.1.11.4. cltk.ner.spacy_ner module
Module for all NER relying on spaCy.
- cltk.ner.spacy_ner.download_prompt(iso_code, message, model_url, interactive=True, silent=False)
Ask user whether to download files.
TODO: Make ft and stanza use this fn. Consider moving to other module.
- cltk.ner.spacy_ner.spacy_tag_ner(iso_code, text_tokens, model_path)
Take a list of tokens and return, for each token, its entity label or False.
>>> import os
>>> from cltk.ner.spacy_ner import spacy_tag_ner
>>> text_tokens = ["Gallia", "est", "omnis", "divisa", "in", "partes", "tres", ",", "quarum", "unam", "incolunt", "Belgae", ",", "aliam", "Aquitani", ",", "tertiam", "qui", "ipsorum", "lingua", "Celtae", ",", "nostra", "Galli", "appellantur", "."]
>>> from cltk.utils import CLTK_DATA_DIR
>>> spacy_tag_ner('lat', text_tokens=text_tokens, model_path=os.path.join(CLTK_DATA_DIR, "lat", "model", "lat_models_cltk", "ner", "spacy_model"))
['LOCATION', False, False, False, False, False, False, False, False, False, False, 'LOCATION', False, False, 'LOCATION', False, False, False, False, False, 'LOCATION', False, False, 'LOCATION', False, False]
- Return type:
List[Union[str, bool]]
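Since the returned list is aligned position by position with the input tokens, the two can be zipped to collect the tokens under each label. This is a minimal sketch on plain lists shaped like the Latin output above. It does not load a spaCy model, and `group_by_label` is a hypothetical helper, not a CLTK function.

```python
from collections import defaultdict
from typing import Dict, List, Union


def group_by_label(
    tokens: List[str], labels: List[Union[str, bool]]
) -> Dict[str, List[str]]:
    """Collect tokens under their string label; boolean entries are skipped."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    grouped = defaultdict(list)
    for token, label in zip(tokens, labels):
        if isinstance(label, str):
            grouped[label].append(token)
    return dict(grouped)


# First twelve tokens and labels from the Latin example above.
tokens = ["Gallia", "est", "omnis", "divisa", "in", "partes", "tres", ",",
          "quarum", "unam", "incolunt", "Belgae"]
labels = ["LOCATION"] + [False] * 10 + ["LOCATION"]
print(group_by_label(tokens, labels))  # {'LOCATION': ['Gallia', 'Belgae']}
```

The explicit length check guards against silently truncated pairs, which `zip` would otherwise produce if the token list and label list ever got out of step.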