8.1.18. cltk.text package

8.1.18.1. Submodules

8.1.18.2. cltk.text.akk module

cltk.text.akk._convert_consonant(sign)[source]

Uses a dictionary to replace ATF conventions with Unicode characters.

>>> signs = ["as,", "S,ATU", "tet,", "T,et", "sza", "ASZ"]
>>> [_convert_consonant(s) for s in signs]
['aṣ', 'ṢATU', 'teṭ', 'Ṭet', 'ša', 'AŠ']

Return type:

str
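The dictionary-based replacement described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: the name `convert_consonant_sketch` and the contents of `ATF_CONSONANTS` are assumptions chosen to reproduce the doctest output.

```python
# Hypothetical mapping for illustration; the library's own table may differ.
ATF_CONSONANTS = {
    "sz": "š",
    "SZ": "Š",
    "s,": "ṣ",
    "S,": "Ṣ",
    "t,": "ṭ",
    "T,": "Ṭ",
    "'": "ʾ",
}


def convert_consonant_sketch(sign: str) -> str:
    """Replace every ATF sequence found in `sign` with its Unicode character."""
    for atf, uni in ATF_CONSONANTS.items():
        sign = sign.replace(atf, uni)
    return sign


signs = ["as,", "S,ATU", "tet,", "T,et", "sza", "ASZ"]
print([convert_consonant_sketch(s) for s in signs])
```

Mixed-case sequences such as "Sz" are left untouched in this sketch.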

cltk.text.akk._convert_number_to_subscript(num)[source]

Converts a number into Unicode subscript digits.

>>> signs = ["a", "a1", "be2", "bad3", "buru14"]
>>> [_get_number_from_sign(s)[1] for s in signs]
[0, 1, 2, 3, 14]

Return type:

str
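The doctest above only shows the numbers extracted by `_get_number_from_sign`, so here is a minimal sketch of the subscript step itself. It assumes a non-negative integer and relies on the Unicode subscript digits occupying U+2080 to U+2089. The name `number_to_subscript_sketch` is hypothetical, and the library may do this differently.

```python
SUBSCRIPT_ZERO = 0x2080  # code point of SUBSCRIPT ZERO; digits 0-9 are contiguous


def number_to_subscript_sketch(num: int) -> str:
    """Map each decimal digit of a non-negative integer to its subscript form."""
    return "".join(chr(SUBSCRIPT_ZERO + int(digit)) for digit in str(num))


print([number_to_subscript_sketch(n) for n in (1, 2, 3, 14)])
```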

cltk.text.akk._get_number_from_sign(sign)[source]

Captures the number that follows a sign, for use by _convert_num.

input = ["a", "a1", "be2", "bad3", "buru14"]
output = [0, 1, 2, 3, 14]

Parameters:

sign (str) – string

Return type:

Tuple[str, int]

Returns:

string, integer
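One way to capture a trailing number is a regular expression anchored at the end of the sign. The sketch below is an assumption-laden illustration: the function name is hypothetical, and returning the sign unchanged as the first tuple element is a guess, since the documentation only states that a string and an integer are returned.

```python
import re
from typing import Tuple


def get_number_from_sign_sketch(sign: str) -> Tuple[str, int]:
    """Return the sign together with its trailing number, or 0 if there is none."""
    match = re.search(r"\d+$", sign)
    number = int(match.group()) if match else 0
    # Assumption: the sign is passed through unchanged.
    return sign, number


signs = ["a", "a1", "be2", "bad3", "buru14"]
print([get_number_from_sign_sketch(s)[1] for s in signs])
```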

class cltk.text.akk.ATFConverter(two_three=True)[source]

Bases: object

Class to convert tokens to unicode.

Transliterates ATF data from CDLI into readable unicode.

sz = š
s, = ṣ
t, = ṭ
' = ʾ

Sign values 2 and 3 follow the acute accent and grave accent conventions; otherwise sign values are printed as subscripts.

For in-depth reading on ATF formatting for CDLI and ORACC:

Oracc ATF Primer = http://oracc.museum.upenn.edu/doc/help/editinginatf/primer/index.html
ATF Structure = http://oracc.museum.upenn.edu/doc/help/editinginatf/primer/structuretutorial/index.html
ATF Inline = http://oracc.museum.upenn.edu/doc/help/editinginatf/primer/inlinetutorial/index.html
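To illustrate the accent convention for sign values 2 and 3, the sketch below places an acute accent (index 2) or a grave accent (index 3) on the first vowel of the sign. Everything here is an assumption for illustration: the name `accent_sign_sketch`, the choice of the first vowel, and the vowel set are not taken from the library, and how the `two_three` flag selects between accents and subscripts is not shown.

```python
import re
import unicodedata

ACUTE = "\u0301"  # combining acute accent, used for index 2
GRAVE = "\u0300"  # combining grave accent, used for index 3
VOWELS = "aeiuAEIU"  # assumed vowel inventory for transliterated signs


def accent_sign_sketch(sign: str) -> str:
    """Turn a trailing index of 2 or 3 into an accent on the first vowel."""
    match = re.fullmatch(r"(\D+)(\d+)", sign)
    if match is None:
        return sign  # no trailing index
    base, index = match.group(1), int(match.group(2))
    mark = {2: ACUTE, 3: GRAVE}.get(index)
    if mark is None:
        return sign  # other indices are left for subscript handling
    for pos, char in enumerate(base):
        if char in VOWELS:
            accented = base[: pos + 1] + mark + base[pos + 1 :]
            # NFC composes the vowel and the combining mark into one character.
            return unicodedata.normalize("NFC", accented)
    return sign


print([accent_sign_sketch(s) for s in ["a", "a2", "a3", "geme2", "bad3", "buru14"]])
```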

_convert_num(sign)[source]

Converts the number captured by _get_number_from_sign.

Return type:

str

process(tokens)[source]

Expects a list of tokens and returns the list converted from ATF format to print format.

>>> c = ATFConverter()
>>> c.process(["a", "a2", "a3", "geme2", "bad3", "buru14"])
['a', 'a₂', 'a₃', 'geme₂', 'bad₃', 'buru₁₄']

Return type:

List[str]

8.1.18.3. cltk.text.lat module

Functions for replacing j/J and v/V with i/I and u/U.

cltk.text.lat.replace_jv(text)[source]

Replace j/J with i/I and v/V with u/U.

>>> replace_jv("vem jam VEL JAM")
'uem iam UEL IAM'

Return type:

str
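A character-for-character substitution like this one can be expressed with a translation table. The following is only a sketch of the technique with a hypothetical function name; it is not necessarily how `replace_jv` is implemented.

```python
# Translation table mapping j/J/v/V to i/I/u/U in a single pass.
JV_TABLE = str.maketrans("jJvV", "iIuU")


def replace_jv_sketch(text: str) -> str:
    """Return `text` with j/J and v/V replaced by i/I and u/U."""
    return text.translate(JV_TABLE)


print(replace_jv_sketch("vem jam VEL JAM"))
```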

8.1.18.4. cltk.text.non module

Code for punctuation removal: Old Norse

class cltk.text.non.OldNorsePunctuationRemover[source]

Bases: object

filter(word)[source]

8.1.18.5. cltk.text.processes module

class cltk.text.processes.PunctuationRemovalProcess(language=None)[source]

Bases: Process

run(input_doc)[source]

Return type:

Doc
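Since the entries here list only signatures, the sketch below shows one plausible shape for a remover with a `filter(word)` method driven by a `run`-style function that returns a new document. All names (`WordSketch`, `DocSketch`, `PunctuationRemoverSketch`, `run_sketch`) are hypothetical stand-ins, and treating `filter` as a keep/drop predicate is an assumption; the actual classes may behave differently.

```python
import string
from dataclasses import dataclass, field
from typing import List


@dataclass
class WordSketch:
    """Hypothetical stand-in for a token."""
    text: str


@dataclass
class DocSketch:
    """Hypothetical stand-in for a document holding tokens."""
    words: List[WordSketch] = field(default_factory=list)


class PunctuationRemoverSketch:
    def __init__(self, punctuation: str = string.punctuation) -> None:
        self.punctuation = set(punctuation)

    def filter(self, word: WordSketch) -> bool:
        """Return True if the word should be kept (assumed semantics)."""
        # A word made up only of punctuation characters is dropped.
        return not all(char in self.punctuation for char in word.text)


def run_sketch(input_doc: DocSketch, remover: PunctuationRemoverSketch) -> DocSketch:
    """Return a new document containing only the words the remover keeps."""
    kept = [word for word in input_doc.words if remover.filter(word)]
    return DocSketch(words=kept)


doc = DocSketch([WordSketch(t) for t in ["Gylfi", ",", "konungr", "."]])
print([w.text for w in run_sketch(doc, PunctuationRemoverSketch()).words])
```

A language-specific remover could pass its own punctuation set to the constructor instead of `string.punctuation`.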

class cltk.text.processes.DefaultPunctuationRemovalProcess(language=None)[source]

Bases: PunctuationRemovalProcess

description = 'Default punctuation removal algorithm'

algorithm

class cltk.text.processes.DefaultPunctuationRemover[source]

Bases: object

filter(word)[source]

class cltk.text.processes.OldNorsePunctuationRemovalProcess(language=None)[source]

Bases: PunctuationRemovalProcess

description = 'Default Old Norse punctuation removal algorithm'

algorithm