8.1.3.1.1.1.1. cltk.corpora.grc.tlg package
8.1.3.1.1.1.1.1. Submodules
8.1.3.1.1.1.1.6. cltk.corpora.grc.tlg.file_utils module
Higher-level (i.e., user-friendly) functions for quickly reading TLG data after it has been processed by TLGU().
- cltk.corpora.grc.tlg.file_utils.tlg_plaintext_cleanup(text, rm_punctuation=False, rm_periods=False)
Post-process Greek TLG text with removals and substitutions. TODO: Surely more junk to pull out. Please submit bugs!
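As a rough illustration of what removal-and-substitution post-processing can look like, the sketch below strips a few kinds of residue from a string. It is not the library's implementation: `cleanup_sketch` and the choice of characters to remove are assumptions made for this example, and only the two flag names mirror the signature above.

```python
import re


def cleanup_sketch(text, rm_punctuation=False, rm_periods=False):
    """Hypothetical stand-in for a plaintext cleanup pass (not CLTK's code)."""
    # Drop ASCII digits, assumed here to be leftover reference numbers.
    text = re.sub(r"[0-9]+", "", text)
    if rm_punctuation:
        # Remove common punctuation marks; periods are handled separately.
        punctuation = ",;:!?()[]\"'·"
        text = text.translate(str.maketrans("", "", punctuation))
    if rm_periods:
        text = text.replace(".", "")
    # Collapse the runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


sample = "μῆνιν ἄειδε, θεά 12 Πηληϊάδεω Ἀχιλῆος."
cleaned = cleanup_sketch(sample, rm_punctuation=True, rm_periods=True)
```

Keeping periods separate from other punctuation, as the two flags suggest, lets a caller strip commas and colons while preserving sentence boundaries.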
Reads TLG index and builds a list of absolute filepaths.
8.1.3.1.1.1.1.8. cltk.corpora.grc.tlg.index_lists module
8.1.3.1.1.1.1.9. cltk.corpora.grc.tlg.parse_tlg_indices module
For loading TLG .json files and searching, then pulling author ids.
Open female authors index and return ordered set of author ids.
- cltk.corpora.grc.tlg.parse_tlg_indices.get_epithet_index()
Return dict of epithets (key) to a set of all author ids of that epithet (value).
- cltk.corpora.grc.tlg.parse_tlg_indices.get_epithets()
Return a list of all the epithet labels.
Pass exact name (case insensitive) of epithet name, return ordered set of author ids.
Pass author id and return the name of its associated epithet.
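The epithet functions described above all read from one mapping of label to author ids. The sketch below shows that shape and the three kinds of lookup on toy data. The function names ending in `_sketch`, the labels, and the ids are all placeholders invented for the example, not the library's code or real TLG entries.

```python
# Toy data: these labels and author ids are placeholders,
# not entries from the real TLG index.
EPITHET_INDEX = {
    "Epithet A": {"0001", "0002"},
    "Epithet B": {"0003"},
}


def epithets_sketch(index):
    """Return the epithet labels in sorted order."""
    return sorted(index)


def authors_by_epithet_sketch(index, query):
    """Exact, case-insensitive match on the label; returns a set of ids."""
    for label, ids in index.items():
        if label.lower() == query.lower():
            return ids
    return set()


def epithet_of_author_sketch(index, author_id):
    """Return the first label whose id set contains author_id, else None."""
    for label, ids in index.items():
        if author_id in ids:
            return label
    return None
```

The same pattern would apply to any index keyed by label with sets of author ids as values.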
- cltk.corpora.grc.tlg.parse_tlg_indices.get_geo_index()
Get entire index of geographic name (key) and set of associated authors (value).
- cltk.corpora.grc.tlg.parse_tlg_indices.get_geographies()
Return a list of all the geography labels.
Pass exact name (case insensitive) of geography name, return ordered set of author ids.
Pass author id and return the name of its associated geography.
Returns entirety of id-author TLG index.
- cltk.corpora.grc.tlg.parse_tlg_indices.select_id_by_name(query)
Do a case-insensitive regex match on author name and return the TLG id.
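A case-insensitive regex lookup over an id-to-name index can be sketched as follows. Several details here are assumptions, since the description above does not fix them: the use of `re.search` rather than an anchored match, the return of every matching pair rather than a single id, and the toy index itself. `select_ids_sketch` is a hypothetical name, not the library function.

```python
import re

# Toy id-to-name index; the ids and names are placeholders for illustration.
AUTHOR_INDEX = {
    "0001": "Example Author",
    "0002": "Another Writer",
    "0003": "Example Historian",
}


def select_ids_sketch(index, query):
    """Return (id, name) pairs whose name matches query, ignoring case."""
    pattern = re.compile(query, flags=re.IGNORECASE)
    return [(_id, name) for _id, name in index.items() if pattern.search(name)]


matches = select_ids_sketch(AUTHOR_INDEX, "example")
```

Because the query is compiled as a regular expression, a caller can anchor it (for instance `^another`) to restrict matches to the start of the name.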
- cltk.corpora.grc.tlg.parse_tlg_indices.open_json(_file)
Loads the json file as a dictionary and returns it.
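Loading a JSON index file into a dictionary needs only the standard library. The sketch below writes a small placeholder index to a temporary file and reads it back. `open_json_sketch`, the file name, and the contents are assumptions for the example and do not reflect where or how the library stores its index files.

```python
import json
import os
import tempfile


def open_json_sketch(path):
    """Load a JSON file and return its top-level object (assumed to be a dict)."""
    with open(path, encoding="utf-8") as file_open:
        return json.load(file_open)


# Round-trip a small placeholder index through a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    index_path = os.path.join(tmp_dir, "toy_index.json")
    with open(index_path, "w", encoding="utf-8") as file_open:
        json.dump({"0001": "Example Author"}, file_open)
    loaded = open_json_sketch(index_path)
```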
- cltk.corpora.grc.tlg.parse_tlg_indices.get_works_by_id(_id)
Pass author id and return a dictionary of its works.
- cltk.corpora.grc.tlg.parse_tlg_indices.check_id(_id)
Pass author id and return a string with the author label.
Returns entirety of date-author index.
- cltk.corpora.grc.tlg.parse_tlg_indices.get_dates()
Return a list of all the date labels.
Pass author id and return the name of its associated date.
- cltk.corpora.grc.tlg.parse_tlg_indices._get_epoch(_str)
Take incoming string, return its epoch.
- cltk.corpora.grc.tlg.parse_tlg_indices._check_number(_str)
Check if the string contains only a number followed by ?.
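One reading of "only a number followed by ?" is a run of digits with a single trailing question mark and nothing else. The sketch below implements that reading with `re.fullmatch`. `check_number_sketch` is a hypothetical name, and whether the library function also accepts a bare number without the question mark is not stated above.

```python
import re


def check_number_sketch(_str):
    """Return True if _str is one or more digits followed by a single '?'."""
    return re.fullmatch(r"[0-9]+\?", _str) is not None
```

Using `fullmatch` rather than `search` is what enforces "only": any surrounding text, such as an era label, makes the check fail.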
8.1.3.1.1.1.1.10. cltk.corpora.grc.tlg.tlg_index module
Indices for the TLG.
Note: TLG_MASTER_INDEX is the result of failed IDT parsing.
TODO: Add work names to TLG_WORKS_INDEX
TODO: Add all TLG index data.
8.1.3.1.1.1.1.11. cltk.corpora.grc.tlg.tlgu module
Wrapper for the tlgu command-line utility.
Original software at: http://tlgu.carmen.gr/
TLGU software written by Dimitri Marinakis and available at http://tlgu.carmen.gr/ under GPLv2 license.
TODO: the arguments to convert_corpus() need some rationalization, and divide_works() should be incorporated into it.
- class cltk.corpora.grc.tlg.tlgu.TLGU(interactive=True)
Bases: object
Check, install, and call TLGU.
- static convert(input_path=None, output_path=None, markup=None, rm_newlines=False, divide_works=False, lat=False, extra_args=None)
Do conversion.
- Parameters:
input_path – TLG filepath to convert.
output_path – filepath of new converted text.
markup – Specificity of inline markup. Default None removes all numerical markup; 'full' gives most detailed, with reference numbers included before each text line.
rm_newlines – No spaces; removes line ends and hyphens before an ID code; hyphens and spaces before page and column ends are retained.
divide_works – Each work (book) is output as a separate file in the form output_file-xxx.txt; if an output file is not specified, this option has no effect.
lat – Primarily Latin text (PHI). Some TLG texts, notably doccan1.txt and doccan2.txt, are mostly roman texts lacking explicit language change codes. Setting this option will force a change to Latin text after each citation block is encountered.
extra_args – Any other tlgu args to be passed, in list form and without dashes, e.g.: ['p', 'b', 'B'].
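Since extra_args takes bare option letters, a wrapper presumably has to add the dashes before calling the utility. The sketch below shows one way that assembly could work. It is an assumption-laden illustration, not the library's code: `build_command_sketch`, the executable name `tlgu`, the placement of options before the two paths, and the file names are all hypothetical.

```python
def build_command_sketch(input_path, output_path, extra_args=None):
    """Assemble an argv list; 'tlgu' as the executable name is an assumption."""
    # Each bare option letter gets a leading dash, e.g. 'p' -> '-p'.
    flags = ["-" + arg for arg in (extra_args or [])]
    return ["tlgu", *flags, input_path, output_path]


command = build_command_sketch("in.txt", "out.txt", extra_args=["p", "b", "B"])
```

Building a list rather than a single shell string means the result can be handed to `subprocess.run` without shell quoting concerns.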
-
static