8.1.3.1.1.1.1. cltk.corpora.grc.tlg package¶
8.1.3.1.1.1.1.1. Submodules¶
8.1.3.1.1.1.1.6. cltk.corpora.grc.tlg.file_utils module¶
Higher-level (i.e., user-friendly) functions for quickly reading
TLG data after it has been processed by TLGU().
- cltk.corpora.grc.tlg.file_utils.tlg_plaintext_cleanup(text, rm_punctuation=False, rm_periods=False)[source]¶
Post-process Greek TLG text by removing and substituting unwanted characters. TODO: There is surely more junk to pull out. Please submit bugs!
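To illustrate the kind of cleanup described above, here is a minimal sketch. It is not the CLTK implementation: the function name `toy_cleanup`, the exact characters removed, and the choice to treat the two flags as independent are all assumptions made for this example.

```python
import re


def toy_cleanup(text, rm_punctuation=False, rm_periods=False):
    """Hypothetical stand-in for a TLG cleanup step (not CLTK's code)."""
    # Rejoin words that were hyphenated across a line end.
    text = re.sub(r"-\n", "", text)
    # Drop Latin letters and digits, assumed here to be leftover markup.
    text = re.sub(r"[A-Za-z0-9]", "", text)
    # Assumed punctuation set; the real function may remove other marks.
    marks = ""
    if rm_punctuation:
        marks += ",·:;\"'?!()[]{}"
    if rm_periods:
        marks += "."
    text = "".join(ch for ch in text if ch not in marks)
    # Collapse runs of whitespace left behind by the removals.
    return " ".join(text.split())


sample = "λόγος καὶ ἔρ-\nγον, 12 τέλος."
print(toy_cleanup(sample))
print(toy_cleanup(sample, rm_punctuation=True, rm_periods=True))
```

The real function's removal rules are likely broader, as its own TODO suggests.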
8.1.3.1.1.1.1.8. cltk.corpora.grc.tlg.index_lists module¶
8.1.3.1.1.1.1.9. cltk.corpora.grc.tlg.parse_tlg_indices module¶
For loading TLG .json files and searching, then pulling author ids.
- cltk.corpora.grc.tlg.parse_tlg_indices.get_female_authors()[source]¶
Open female authors index and return ordered set of author ids.
- cltk.corpora.grc.tlg.parse_tlg_indices.get_epithet_index()[source]¶
Return dict of epithets (key) to a set of all author ids of that epithet (value).
- cltk.corpora.grc.tlg.parse_tlg_indices.get_epithets()[source]¶
Return a list of all the epithet labels.
- cltk.corpora.grc.tlg.parse_tlg_indices.select_authors_by_epithet(query)[source]¶
Pass the exact name (case-insensitive) of an epithet; return an ordered set of author ids.
- cltk.corpora.grc.tlg.parse_tlg_indices.get_epithet_of_author(_id)[source]¶
Pass author id and return the name of its associated epithet.
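The epithet functions above all revolve around one mapping from epithet label to a set of author ids. The following sketch mimics that shape with made-up data. The names `TOY_EPITHET_INDEX`, `toy_select_authors_by_epithet`, and `toy_get_epithet_of_author` and the ids are hypothetical, and the sketch returns a plain `set` rather than an ordered one.

```python
# Made-up index: epithet label -> set of author ids (not real TLG data).
TOY_EPITHET_INDEX = {
    "Historici": {"T001", "T002"},
    "Tragici": {"T003"},
}


def toy_select_authors_by_epithet(query, index=TOY_EPITHET_INDEX):
    """Return the ids for an epithet, matching the whole label case-insensitively."""
    wanted = query.casefold()
    for epithet, ids in index.items():
        if epithet.casefold() == wanted:
            return set(ids)
    return None


def toy_get_epithet_of_author(_id, index=TOY_EPITHET_INDEX):
    """Reverse lookup: return the epithet label whose id set contains _id."""
    for epithet, ids in index.items():
        if _id in ids:
            return epithet
    return None


print(toy_select_authors_by_epithet("tragici"))
print(toy_get_epithet_of_author("T002"))
```

The geography and date functions further down follow the same pattern with a different index.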
- cltk.corpora.grc.tlg.parse_tlg_indices.get_geo_index()[source]¶
Get entire index of geographic name (key) and set of associated authors (value).
- cltk.corpora.grc.tlg.parse_tlg_indices.get_geographies()[source]¶
Return a list of all the geography labels.
- cltk.corpora.grc.tlg.parse_tlg_indices.select_authors_by_geo(query)[source]¶
Pass the exact name (case-insensitive) of a geography; return an ordered set of author ids.
- cltk.corpora.grc.tlg.parse_tlg_indices.get_geo_of_author(_id)[source]¶
Pass author id and return the name of its associated geography.
- cltk.corpora.grc.tlg.parse_tlg_indices.get_id_author()[source]¶
Returns entirety of id-author TLG index.
- cltk.corpora.grc.tlg.parse_tlg_indices.select_id_by_name(query)[source]¶
Do a case-insensitive regex match on author name and return the TLG id.
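A case-insensitive regex lookup of this kind can be sketched as follows. The index contents, the name-to-id direction of the mapping, the helper name `toy_select_id_by_name`, and the list-of-tuples return shape are assumptions for illustration; the real function may return something different.

```python
import re

# Made-up index: author name -> id (not real TLG data).
TOY_AUTHOR_IDS = {
    "Toy Author Alpha": "T001",
    "Toy Author Beta": "T002",
    "Another Writer": "T003",
}


def toy_select_id_by_name(query, index=TOY_AUTHOR_IDS):
    """Return (name, id) pairs whose name matches the regex, ignoring case."""
    pattern = re.compile(query, flags=re.IGNORECASE)
    return [(name, _id) for name, _id in index.items() if pattern.search(name)]


print(toy_select_id_by_name("alpha"))
print(toy_select_id_by_name("^toy"))
```

Because the query is a regular expression, a partial name can match several authors, so a caller should expect more than one hit.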
- cltk.corpora.grc.tlg.parse_tlg_indices.open_json(_file)[source]¶
Loads the json file as a dictionary and returns it.
- cltk.corpora.grc.tlg.parse_tlg_indices.get_works_by_id(_id)[source]¶
Pass author id and return a dictionary of its works.
- cltk.corpora.grc.tlg.parse_tlg_indices.check_id(_id)[source]¶
Pass author id and return a string with the author label.
- cltk.corpora.grc.tlg.parse_tlg_indices.get_date_author()[source]¶
Returns entirety of date-author index.
- cltk.corpora.grc.tlg.parse_tlg_indices.get_dates()[source]¶
Return a list of all the date labels.
- cltk.corpora.grc.tlg.parse_tlg_indices.get_date_of_author(_id)[source]¶
Pass author id and return the name of its associated date.
- cltk.corpora.grc.tlg.parse_tlg_indices._get_epoch(_str)[source]¶
Take incoming string, return its epoch.
- cltk.corpora.grc.tlg.parse_tlg_indices._check_number(_str)[source]¶
Check if the string contains only a number followed by "?".
8.1.3.1.1.1.1.10. cltk.corpora.grc.tlg.tlg_index module¶
Indices for the TLG.
Note: # TLG_MASTER_INDEX is the result of failed IDT parsing.
TODO: Add work names to TLG_WORKS_INDEX
TODO: Add all TLG index data.
8.1.3.1.1.1.1.11. cltk.corpora.grc.tlg.tlgu module¶
Wrapper for tlgu command line utility.
Original software at: http://tlgu.carmen.gr/.
TLGU software written by Dimitri Marinakis and available at http://tlgu.carmen.gr/ under GPLv2 license.
TODO: the arguments to convert_corpus() need some rationalization, and divide_works() should be incorporated into it.
- class cltk.corpora.grc.tlg.tlgu.TLGU(interactive=True)[source]¶
Bases: object
Check, install, and call TLGU.
- static convert(input_path=None, output_path=None, markup=None, rm_newlines=False, divide_works=False, lat=False, extra_args=None)[source]¶
Do conversion.
- Parameters:
input_path – TLG filepath to convert.
output_path – filepath of new converted text.
markup – Specificity of inline markup. Default None removes all numerical markup; 'full' gives most detailed, with reference numbers included before each text line.
rm_newlines – No spaces; removes line ends and hyphens before an ID code; hyphens and spaces before page and column ends are retained.
divide_works – Each work (book) is output as a separate file in the form output_file-xxx.txt; if an output file is not specified, this option has no effect.
lat – Primarily Latin text (PHI). Some TLG texts, notably doccan1.txt and doccan2.txt are mostly roman texts lacking explicit language change codes. Setting this option will force a change to Latin text after each citation block is encountered.
extra_args – Any other tlgu args to be passed, in list form and without dashes, e.g.: ['p', 'b', 'B'].
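Since extra_args is given without dashes, something has to add them before the options reach the command line. The sketch below shows one way that could look. The helper `format_extra_args` is hypothetical and is not part of CLTK; how the wrapper actually assembles its command is not documented here.

```python
def format_extra_args(extra_args=None):
    """Hypothetical helper: turn ['p', 'b', 'B'] into ['-p', '-b', '-B']."""
    # Treat a missing list as "no extra options".
    return ["-" + flag for flag in (extra_args or [])]


print(format_extra_args(["p", "b", "B"]))
print(format_extra_args())
```

Passing the bare letters keeps the caller's list independent of how the wrapper chooses to join its options.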