8.1.3.1.1.1.1. cltk.corpora.grc.tlg package

8.1.3.1.1.1.1.1. Submodules

8.1.3.1.1.1.1.2. cltk.corpora.grc.tlg.author_date module

8.1.3.1.1.1.1.3. cltk.corpora.grc.tlg.author_epithet module

8.1.3.1.1.1.1.4. cltk.corpora.grc.tlg.author_female module

8.1.3.1.1.1.1.5. cltk.corpora.grc.tlg.author_geo module

8.1.3.1.1.1.1.6. cltk.corpora.grc.tlg.file_utils module

Higher-level (i.e., user-friendly) functions for quickly reading TLG data after it has been processed by TLGU().

cltk.corpora.grc.tlg.file_utils.tlg_plaintext_cleanup(text, rm_punctuation=False, rm_periods=False)[source]

Post-process Greek TLG text by removing and substituting unwanted characters. TODO: Surely more junk to pull out. Please submit bugs!
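As a rough illustration of what the two boolean flags could control, here is a minimal sketch. It is not the library's implementation: the function name cleanup_sketch, the set of punctuation characters, and the rule of dropping ASCII letters and digits are all assumptions made for this example.

```python
import re

# Punctuation removed when rm_punctuation=True (illustrative choice only).
PUNCTUATION = ",;:·\"'()[]"


def cleanup_sketch(text, rm_punctuation=False, rm_periods=False):
    """Hypothetical stand-in for a TLG plaintext cleanup step."""
    # Assumption: ASCII letters and digits in converted Greek text are markup residue.
    text = re.sub(r"[0-9A-Za-z]", "", text)
    if rm_punctuation:
        text = text.translate(str.maketrans("", "", PUNCTUATION))
    if rm_periods:
        text = text.replace(".", "")
    # Collapse whitespace runs left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(cleanup_sketch("μῆνιν ἄειδε, θεά. 12 abc", rm_punctuation=True, rm_periods=True))
```

Keeping the two flags independent lets a caller strip commas and brackets while preserving sentence boundaries, which matters for sentence tokenization later on.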

cltk.corpora.grc.tlg.file_utils.assemble_tlg_author_filepaths()[source]

Reads the TLG index and builds a list of absolute filepaths to the author files.

cltk.corpora.grc.tlg.file_utils.assemble_tlg_works_filepaths()[source]

Reads the TLG index and builds a list of absolute filepaths to the works files.
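The idea of turning index keys into absolute filepaths can be sketched as follows. This is an assumption-laden illustration, not the library's code: the function name, the shape of the index (a dict keyed by file stem), the ".TXT" suffix, and the directory passed in are all hypothetical.

```python
import os


def assemble_filepaths_sketch(index, base_dir):
    """Build one absolute path per index key (hypothetical helper)."""
    # Expand "~" and normalize so every returned path is absolute.
    base = os.path.abspath(os.path.expanduser(base_dir))
    # Assumption: each key is a file stem and files carry a ".TXT" suffix.
    return [os.path.join(base, key + ".TXT") for key in index]


if __name__ == "__main__":
    toy_index = {"AUTHOR_A": "Example Author A", "AUTHOR_B": "Example Author B"}
    for path in assemble_filepaths_sketch(toy_index, "~/example_corpus/plaintext"):
        print(path)
```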

8.1.3.1.1.1.1.7. cltk.corpora.grc.tlg.id_author module

8.1.3.1.1.1.1.8. cltk.corpora.grc.tlg.index_lists module

8.1.3.1.1.1.1.9. cltk.corpora.grc.tlg.parse_tlg_indices module

For loading TLG .json files and searching, then pulling author ids.

cltk.corpora.grc.tlg.parse_tlg_indices.get_female_authors()[source]

Open female authors index and return ordered set of author ids.

cltk.corpora.grc.tlg.parse_tlg_indices.get_epithet_index()[source]

Return dict of epithets (key) to a set of all author ids of that epithet (value).

cltk.corpora.grc.tlg.parse_tlg_indices.get_epithets()[source]

Return a list of all the epithet labels.

cltk.corpora.grc.tlg.parse_tlg_indices.select_authors_by_epithet(query)[source]

Pass the exact name of an epithet (case-insensitive) and return an ordered set of author ids.

cltk.corpora.grc.tlg.parse_tlg_indices.get_epithet_of_author(_id)[source]

Pass author id and return the name of its associated epithet.
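The epithet functions above describe two lookups over one index: label to author ids, and author id back to label. A minimal sketch of both follows, using toy data. The ids, the function names select_by_label and label_of_id, and the behavior on a miss (empty set, None) are assumptions for illustration and may differ from the library.

```python
# Toy index in the shape described above: epithet label -> set of author ids.
# Labels and ids here are placeholders, not real TLG index entries.
TOY_EPITHET_INDEX = {
    "Epici/-ae": {"id-001", "id-002"},
    "Lyrici/-ae": {"id-003"},
}


def select_by_label(index, query):
    """Return the ids stored under a label, matching the label case-insensitively."""
    wanted = query.casefold()
    for label, ids in index.items():
        if label.casefold() == wanted:
            return set(ids)
    return set()  # Assumption: an unknown label yields an empty set.


def label_of_id(index, author_id):
    """Reverse lookup: return the first label whose id set contains author_id."""
    for label, ids in index.items():
        if author_id in ids:
            return label
    return None  # Assumption: an unknown id yields None.


if __name__ == "__main__":
    print(select_by_label(TOY_EPITHET_INDEX, "epici/-AE"))
    print(label_of_id(TOY_EPITHET_INDEX, "id-003"))
```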

cltk.corpora.grc.tlg.parse_tlg_indices.get_geo_index()[source]

Get entire index of geographic name (key) and set of associated authors (value).

cltk.corpora.grc.tlg.parse_tlg_indices.get_geographies()[source]

Return a list of all the geography labels.

cltk.corpora.grc.tlg.parse_tlg_indices.select_authors_by_geo(query)[source]

Pass the exact name of a geography (case-insensitive) and return an ordered set of author ids.

cltk.corpora.grc.tlg.parse_tlg_indices.get_geo_of_author(_id)[source]

Pass author id and return the name of its associated geography.

cltk.corpora.grc.tlg.parse_tlg_indices.get_lists()[source]

A list of the TLG’s lists.

cltk.corpora.grc.tlg.parse_tlg_indices.get_id_author()[source]

Returns entirety of id-author TLG index.

cltk.corpora.grc.tlg.parse_tlg_indices.select_id_by_name(query)[source]

Do a case-insensitive regex match on the author name and return the matching TLG id.
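A case-insensitive regex lookup over an id-to-author mapping might look like the sketch below. The toy mapping, the function name, and the choice to return a list of every matching id are assumptions; the library's return shape may differ.

```python
import re

# Toy id -> author-name mapping; ids and names are placeholders.
TOY_ID_AUTHOR = {
    "id-001": "Example Author Alpha",
    "id-002": "Example Author Beta",
    "id-003": "Another Writer",
}


def select_id_by_name_sketch(id_author, query):
    """Return ids whose author name matches the regex query, ignoring case."""
    pattern = re.compile(query, flags=re.IGNORECASE)
    # search() matches anywhere in the name; anchor with ^ or $ to narrow it.
    return [author_id for author_id, name in id_author.items() if pattern.search(name)]


if __name__ == "__main__":
    print(select_id_by_name_sketch(TOY_ID_AUTHOR, "example author"))
```

Because the query is treated as a regular expression, characters such as "." or "(" in a name need escaping (for example with re.escape) when a literal match is intended.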

cltk.corpora.grc.tlg.parse_tlg_indices.open_json(_file)[source]

Loads the json file as a dictionary and returns it.

cltk.corpora.grc.tlg.parse_tlg_indices.get_works_by_id(_id)[source]

Pass author id and return a dictionary of its works.

cltk.corpora.grc.tlg.parse_tlg_indices.check_id(_id)[source]

Pass author id and return a string with the author label.

cltk.corpora.grc.tlg.parse_tlg_indices.get_date_author()[source]

Returns entirety of date-author index.

cltk.corpora.grc.tlg.parse_tlg_indices.get_dates()[source]

Return a list of all the date labels.

cltk.corpora.grc.tlg.parse_tlg_indices.get_date_of_author(_id)[source]

Pass author id and return the name of its associated date.

cltk.corpora.grc.tlg.parse_tlg_indices._get_epoch(_str)[source]

Take incoming string, return its epoch.

cltk.corpora.grc.tlg.parse_tlg_indices._check_number(_str)[source]

Check if the string contains only a number followed by ‘?’.

cltk.corpora.grc.tlg.parse_tlg_indices._handle_splits(_str)[source]

Check if the incoming date has a ‘-’ or ‘/’ and, if so, handle the split.
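Read literally, the two helper descriptions above amount to a pattern test and a split on a separator. The sketch below shows one way to express each; it is a guess at the intent, not the library's logic, and both function names are hypothetical.

```python
import re


def is_uncertain_number(date_str):
    """True when the string is only digits followed by a literal '?'."""
    # Assumption: the '?' in the description is a literal question mark.
    return re.fullmatch(r"\d+\?", date_str) is not None


def split_date(date_str):
    """Split a date string on '-' or '/'; a string without either comes back whole."""
    return re.split(r"[-/]", date_str)


if __name__ == "__main__":
    print(is_uncertain_number("5?"))
    print(split_date("2-3"))
```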

cltk.corpora.grc.tlg.parse_tlg_indices.normalize_dates()[source]

Experiment to make sense of TLG dates. TODO: start here, parse everything with pass

8.1.3.1.1.1.1.10. cltk.corpora.grc.tlg.tlg_index module

Indices for the TLG.

Note: TLG_MASTER_INDEX is the result of failed IDT parsing.

TODO: Add work names to TLG_WORKS_INDEX.

TODO: Add all TLG index data.

8.1.3.1.1.1.1.11. cltk.corpora.grc.tlg.tlgu module

Wrapper for tlgu command line utility.

Original software at: http://tlgu.carmen.gr/.

TLGU software written by Dimitri Marinakis and available at http://tlgu.carmen.gr/ under GPLv2 license.

TODO: the arguments to convert_corpus() need some rationalization, and divide_works() should be incorporated into it.

class cltk.corpora.grc.tlg.tlgu.TLGU(interactive=True)[source]

Bases: object

Check, install, and call TLGU.

_check_and_download_tlgu_source()[source]

Check if tlgu downloaded, if not download it.

_check_install()[source]

Check if tlgu installed, if not install it.

static convert(input_path=None, output_path=None, markup=None, rm_newlines=False, divide_works=False, lat=False, extra_args=None)[source]

Do conversion.

Parameters:
  • input_path – TLG filepath to convert.

  • output_path – filepath of new converted text.

  • markup – Specificity of inline markup. Default None removes all numerical markup; ‘full’ gives most detailed, with reference numbers included before each text line.

  • rm_newlines – No spaces; removes line ends and hyphens before an ID code; hyphens and spaces before page and column ends are retained.

  • divide_works – Each work (book) is output as a separate file in the form output_file-xxx.txt; if an output file is not specified, this option has no effect.

  • lat – Primarily Latin text (PHI). Some TLG texts, notably doccan1.txt and doccan2.txt, are mostly Roman texts lacking explicit language change codes. Setting this option will force a change to Latin text after each citation block is encountered.

  • extra_args – Any other tlgu args to be passed, in list form and without dashes, e.g.: ['p', 'b', 'B'].
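Since extra_args is given without dashes, something has to turn each entry into a command-line flag. The sketch below shows one plausible way to do that and to assemble an argument list. It is not the wrapper's code: the function names, the de-duplication step, and the argument order (flags, then input, then output) are assumptions, and the mapping of the named parameters (markup, rm_newlines, divide_works, lat) to flags is deliberately left out.

```python
def build_flag_args(extra_args=None):
    """Turn dashless option letters into dashed flags, dropping repeats."""
    seen = []
    for arg in extra_args or []:
        if arg.startswith("-"):
            raise ValueError("pass options without dashes, e.g. 'p' not '-p'")
        if arg not in seen:
            seen.append(arg)
    return ["-" + arg for arg in seen]


def build_command(input_path, output_path, extra_args=None):
    """Assemble an argument list suitable for subprocess.run (not executed here)."""
    # Assumption: the executable is on PATH as "tlgu" and takes flags before paths.
    return ["tlgu", *build_flag_args(extra_args), input_path, output_path]


if __name__ == "__main__":
    print(build_command("in.txt", "out.txt", ["p", "b", "B"]))
```

Building a list rather than a single shell string avoids quoting problems when a path contains spaces.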

convert_corpus(corpus, markup=None, lat=None)[source]

Look for imported TLG or PHI files and convert them all to ~/cltk_data/grc/text/tlg/<plaintext>.

TODO: Add markup options to input.

TODO: Add rm_newlines, divide_works, and extra_args.

divide_works(corpus)[source]

Use the work-breaking option.

TODO: Maybe incorporate this into convert_corpus().

TODO: Write test for this.

8.1.3.1.1.1.1.12. cltk.corpora.grc.tlg.work_numbers module