8.1.1. cltk.alphabet package¶
Modules for accessing the alphabets and character sets of in-scope CLTK languages.
8.1.1.1. Subpackages¶
8.1.1.2. Submodules¶
8.1.1.3. cltk.alphabet.ang module¶
The Old English alphabet.
>>> from cltk.alphabet import ang
>>> ang.DIGITS[:5]
['ān', 'tƿeġen', 'þrēo', 'fēoƿer', 'fīf']
>>> ang.DIPHTHONGS[:5]
['ea', 'eo', 'ie']
8.1.1.4. cltk.alphabet.arb module¶
The Arabic alphabet.
>>> from cltk.alphabet import arb
>>> arb.LETTERS[:5]
('ا', 'ب', 'ت', 'ة', 'ث')
>>> arb.PUNCTUATION_MARKS
['،', '؛', '؟']
>>> arb.ALEF
'ا'
>>> arb.WEAK
('ا', 'و', 'ي', 'ى')
8.1.1.5. cltk.alphabet.arc module¶
The Imperial Aramaic alphabet, plus a simple script to transform a Hebrew transcription of an Imperial Aramaic text into its own Unicode block.
TODO: Add Hebrew-to-Aramaic converter
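The planned converter can be sketched as a straightforward code-point mapping from the Hebrew block to the Imperial Aramaic block (U+10840–U+10855). The function and variable names below are illustrative assumptions, not the cltk.alphabet.arc API:

```python
# Sketch only: map each Hebrew letter (including final forms) onto the
# corresponding character in the Imperial Aramaic Unicode block.
HEBREW_BASE = "אבגדהוזחטיכלמנסעפצקרשת"  # the 22 letters, final forms excluded
FINALS = {"ך": "כ", "ם": "מ", "ן": "נ", "ף": "פ", "ץ": "צ"}
ARAMAIC = [chr(cp) for cp in range(0x10840, 0x10840 + 22)]  # ALEPH .. TAW
HEB_TO_ARC = dict(zip(HEBREW_BASE, ARAMAIC))

def hebrew_to_aramaic(text: str) -> str:
    """Transliterate a Hebrew-script transcription into Imperial Aramaic."""
    out = []
    for ch in text:
        ch = FINALS.get(ch, ch)             # fold final forms to base letters
        out.append(HEB_TO_ARC.get(ch, ch))  # non-letters pass through
    return "".join(out)
```

Both scripts are 22-letter abjads in the same order, which is what makes the one-to-one zip above work.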
8.1.1.6. cltk.alphabet.ave module¶
The Avestan alphabet.
8.1.1.7. cltk.alphabet.ben module¶
The Bengali alphabet.
>>> from cltk.alphabet import ben
>>> ben.VOWELS[:5]
['অ', 'আ', 'ই', 'ঈ', 'উ']
>>> ben.DEPENDENT_VOWELS[:5]
['◌া', '◌ি', '◌ী', '◌ু', '◌ূ']
>>> ben.CONSONANTS[:5]
['ক', 'খ', 'গ', 'ঘ', 'ঙ']
8.1.1.8. cltk.alphabet.egy module¶
Convert MdC transliterated text to Unicode.
- cltk.alphabet.egy.mdc_unicode(string, q_kopf=True)[source]¶
The transliterated text is passed to the function as string; a search-and-replace operation is performed on the relevant characters. If the q_kopf parameter is False, 'q' is replaced with 'ḳ'.
- Parameters:
string (str) – the MdC-transliterated text
q_kopf (bool) – if False, replace 'q' with 'ḳ'
- Return type:
str
- Returns:
unicode_text, the converted text
8.1.1.9. cltk.alphabet.enm module¶
The Middle English alphabet. Sources:
From Old English to Standard English, Dennis Freeborn
The consonant sounds of Middle English are categorized as follows:
Stops: ⟨/b/, /p/, /d/, /t/, /g/, /k/⟩
Affricates and fricatives: ⟨/ǰ/, /č/, /v/, /f/, /ð/, /θ/, /z/, /s/, /ž/, /š/, /c̹/, /x/, /h/⟩
Nasals: ⟨/m/, /n/, /ɳ/⟩
Lateral Resonants: ⟨/l/⟩
Medial Resonants: ⟨/r/, /y/, /w/⟩
Thorn (þ) was gradually replaced by the digraph "th", while eth (ð), which had already fallen out of use by the 14th century, was later replaced by "d".
Wynn (ƿ) is the predecessor of "w". Modern transliteration schemes usually replace it with "w" so as to avoid confusion with the strikingly similar letter p.
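The letter substitutions just described can be illustrated with a small sketch, following the module's own convention of mapping both thorn and eth to "th" (the real, fuller routine is enm.normalize_middle_english, documented below; the names here are illustrative):

```python
# Illustrative only: replace the obsolete Middle English letters
# discussed above with their modern equivalents.
SUBSTITUTIONS = {"þ": "th", "ð": "th", "ƿ": "w"}

def replace_obsolete_letters(text: str) -> str:
    """Replace thorn, eth, and wynn in a Middle English string."""
    for old, new in SUBSTITUTIONS.items():
        text = text.replace(old, new)
    return text
```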
The vowel sounds in Middle English are divided into:
Long Vowels: ⟨/a:/, /e/, /e̜/, /i/ , /ɔ:/, /o/ , /u/⟩
Short Vowels: ⟨/a/, /ɛ/, /I/, /ɔ/, /U/, /ə/⟩
As established rules for Middle English orthography were effectively nonexistent, compiling a definitive list of diphthongs is non-trivial. The following aims to list the most commonly used diphthongs.
>>> from cltk.alphabet import enm
>>> enm.ALPHABET[:5]
['a', 'b', 'c', 'd', 'e']
>>> enm.CONSONANTS[:5]
['b', 'c', 'd', 'f', 'g']
- cltk.alphabet.enm.normalize_middle_english(text, to_lower=True, alpha_conv=True, punct=True)[source]¶
Normalizes a Middle English text string and returns the normalized string.
- Parameters:
text (str) – text to be normalized
to_lower (bool) – convert text to lowercase
alpha_conv (bool) – convert text to canonical form: æ -> ae, þ -> th, ð -> th, ȝ -> y at the beginning of a word and gh otherwise
punct (bool) – remove punctuation
>>> normalize_middle_english('Whan Phebus in the CraBbe had neRe hys cours ronne', to_lower=True)
'whan phebus in the crabbe had nere hys cours ronne'
>>> normalize_middle_english('I pray ȝow þat ȝe woll', alpha_conv=True)
'i pray yow that ye woll'
>>> normalize_middle_english("furst, to begynne:...", punct=True)
'furst to begynne'
- Return type:
str
8.1.1.10. cltk.alphabet.fro module¶
The normalizer aims to maximally reduce the variation between the orthography of texts written in the Anglo-Norman dialect and bring it in line with the "orthographe commune". It is heavily inspired by Pope (1956). Spelling variation is not consistent enough to ensure the highest accuracy; the normalizer in its current form should therefore be used as a last resort.
The normalizer, word tokenizer, stemmer, lemmatizer, and list of stopwords for OF/MF were developed as part of Google Summer of Code 2017. A full write-up of this work can be found at: https://gist.github.com/nat1881/6f134617805e2efbe5d275770e26d350
References: Pope, M.K. 1956. From Latin to Modern French with Especial Consideration of Anglo-Norman. Manchester: MUP.
Anglo-French spelling variants normalized to "orthographe commune", from M. K. Pope (1956):
word-final d - e.g. vertud vs vertu
use of <u> over <ou>
<eaus> for <eus>, <ceaus> for <ceus>
- triphthongs:
<iu> for <ieu>
<u> for <eu>
<ie> for <iee>
<ue> for <uee>
<ure> for <eure>
“epenthetic vowels” - e.g. averai for avrai
<eo> for <o>
<iw>, <ew> for <ieux>
final <a> for <e>
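Reading each item above as variant → normalized form, a few of the rules can be sketched as regex substitutions. This is an illustrative subset under that assumed reading; the actual cltk.alphabet.fro normalizer implements the full rule set:

```python
import re

# Illustrative subset of the substitutions listed above,
# read as Anglo-Norman variant -> "orthographe commune".
RULES = [
    (r"eaus\b", "eus"),  # <eaus> for <eus>, <ceaus> for <ceus>
    (r"iu", "ieu"),      # <iu> for <ieu>
    (r"d\b", ""),        # word-final d, e.g. vertud vs vertu
]

def normalize_anglo_norman(word: str) -> str:
    for pattern, replacement in RULES:
        word = re.sub(pattern, replacement, word)
    return word
```

Note that a blanket rule like stripping word-final d would overreach on real text; a production normalizer constrains each rule's context.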
8.1.1.11. cltk.alphabet.gmh module¶
The alphabet for Middle High German. Source:
Schreibkonventionen des klassischen Mittelhochdeutschen, Simone Berchtold
The consonants of Middle High German are categorized as:
Stops: ⟨p t k/c/q b d g⟩
Affricates: ⟨pf/ph tz/z⟩
Fricatives: ⟨v f s ȥ sch ch h⟩
Nasals: ⟨m n⟩
Liquids: ⟨l r⟩
Semivowels: ⟨w j⟩
Misc. notes:
c is used only at the beginning of loanwords and is pronounced the same as k (e.g. calant, cappitain)
Double consonants are pronounced the same way as their corresponding letters in Modern Standard German (e.g. pp/p)
Modern schl, schm, schn, schw are written in MHG as sl, sm, sn, sw
æ (also seen as ae), œ (also seen as oe) and iu denote the use of Umlaut over â, ô and û respectively
ȥ or ʒ is used in modern handbooks and grammars to indicate the s or s-like sound which arose from Germanic t in the High German consonant shift.
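The special characters mentioned in these notes can be folded to ASCII with a simple character mapping, a sketch of what the ascii option of gmh.normalize_middle_high_german presumably does; the mapping table here is an assumption, not the module's own:

```python
# Assumed ASCII fold for the MHG special characters discussed above.
ASCII_MAP = {"æ": "ae", "œ": "oe", "ȥ": "z", "ʒ": "z",
             "â": "a", "ô": "o", "û": "u"}

def mhg_to_ascii(text: str) -> str:
    """Fold MHG special characters to plain ASCII sequences."""
    for src, dst in ASCII_MAP.items():
        text = text.replace(src, dst)
    return text
```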
>>> from cltk.alphabet import gmh
>>> gmh.CONSONANTS[:5]
['b', 'd', 'g', 'h', 'f']
>>> gmh.VOWELS[:5]
['a', 'ë', 'e', 'i', 'o']
- cltk.alphabet.gmh.normalize_middle_high_german(text, to_lower_all=True, to_lower_beginning=False, alpha_conv=True, punct=True, ascii=False)[source]¶
Normalize input string.
>>> from cltk.alphabet import gmh
>>> from cltk.languages.example_texts import get_example_text
>>> gmh.normalize_middle_high_german(get_example_text("gmh"))[:50]
'uns ist in alten\nmæren wunders vil geseit\nvon hele'
- Parameters:
text (str) – input text
to_lower_beginning (bool)
to_lower_all (bool) – convert the whole text to lowercase
alpha_conv (bool) – convert the alphabet to canonical form
punct (bool) – remove punctuation
ascii (bool) – return the ASCII form
- Returns:
normalized text
8.1.1.12. cltk.alphabet.guj module¶
The Gujarati alphabet.
>>> from cltk.alphabet import guj
>>> guj.VOWELS[:5]
['અ', 'આ', 'ઇ', 'ઈ', 'ઉ']
>>> guj.CONSONANTS[:5]
['ક', 'ખ', 'ગ', 'ઘ', 'ચ']
8.1.1.13. cltk.alphabet.hin module¶
The Hindi alphabet.
>>> from cltk.alphabet import hin
>>> hin.VOWELS[:5]
['अ', 'आ', 'इ', 'ई', 'उ']
>>> hin.CONSONANTS[:5]
['क', 'ख', 'ग', 'घ', 'ङ']
>>> hin.SONORANT_CONSONANTS
['य', 'र', 'ल', 'व']
8.1.1.14. cltk.alphabet.kan module¶
The Kannada alphabet. The characters can be divided into 3 categories:
Swaras (Vowels) : 13 in modern Kannada and 14 in Classical
Vynjanas (Consonants) : They are further divided into 2 categories:
Structured Consonants : 25
Unstructured Consonants : 9 in modern Kannada and 11 in Classical
Yogavaahakas (part vowel, part consonant) : 2
Corresponding to each Swara and Yogavaahaka there is a symbol; thus consonant + vowel symbol = Kagunita.
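The Kagunita composition described above is plain Unicode string concatenation: a consonant followed by a dependent vowel sign. The variable names below are illustrative, not part of cltk.alphabet.kan:

```python
# Kagunita = consonant + dependent vowel sign (standard Kannada codepoints).
KA = "\u0C95"             # ಕ (consonant)
VOWEL_SIGN_AA = "\u0CBE"  # ಾ (dependent form of ಆ)

kagunita = KA + VOWEL_SIGN_AA  # renders as ಕಾ, a single syllable glyph
```

Rendering fuses the two code points into one glyph, but the string remains two characters long.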
>>> from cltk.alphabet import kan
>>> kan.VOWELS[:5]
['ಅ', 'ಆ', 'ಇ', 'ಈ', 'ಉ']
>>> kan.STRUCTURED_CONSONANTS[:5]
['ಕ', 'ಖ', 'ಗ', 'ಘ', 'ಙ']
8.1.1.15. cltk.alphabet.lat module¶
Alphabet and text normalization for Latin.
Principles of Text Cleaning gleaned from
http://udallasclassics.org/wp-content/uploads/maurer_files/APPARATUSABBREVIATIONS.pdf
Guidelines:
- […] Square brackets, or in recent editions wavy brackets "{…}", enclose words etc. that an editor thinks should be deleted (see "del.") or marked as out of place (see "secl.").
- […] Square brackets in a papyrus text, or in an inscription, enclose places where words have been lost through physical damage.
- If this happens in mid-line, editors use "[…]".
- If only the end of the line is missing, they use a single bracket "[…"
- If the line's beginning is missing, they use "…]"
- Within the brackets, often each dot represents one missing letter.
- [[…]] Double brackets enclose letters or words deleted by the medieval copyist himself.
- (…) Round brackets are used to supplement words abbreviated by the original copyist; e.g. in an inscription: "trib(unus) mil(itum) leg(ionis) III"
- <…> Diamond (= elbow = angular) brackets enclose words etc. that an editor has added (see "suppl.").
- † An obelus (pl. obeli) means that the word(s) is plainly corrupt, but the editor cannot see how to emend. If only one word is corrupt, a single obelus precedes the word; if two or more words are corrupt, two obeli enclose them. (Such at least is the rule, but it is often broken, especially in older editions, which sometimes dagger several words using only one obelus.) To dagger words in this way is to "obelize" them.
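These conventions map naturally onto regular expressions. The sketch below is illustrative only; the module's own swallow_* and disappear_* functions, documented next, are the supported API:

```python
import re

# Illustrative regexes for the editorial conventions described above.
EDITORIAL_PATTERNS = {
    "copyist_deletion": re.compile(r"\[\[[^\]]*\]\]"),  # [[...]]
    "editor_deletion": re.compile(r"\[[^\]\[]*\]"),     # [...]
    "abbreviation": re.compile(r"\(([^)]*)\)"),         # (unus) etc.
    "editor_addition": re.compile(r"<([^>]*)>"),        # <...>
    "obelized": re.compile(r"†[^†]*†"),                 # †corrupt†
}

def expand_abbreviations(text: str) -> str:
    """Accept the copyist's abbreviations by dropping the round brackets."""
    return EDITORIAL_PATTERNS["abbreviation"].sub(r"\1", text)
```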
- class cltk.alphabet.lat.JVReplacer[source]¶
Bases:
object
Replace J/V with I/U. The Latin alphabet does not distinguish between J/j and I/i or between V/v and U/u; yet many texts bear the influence of later editors and the predilections of other languages.
In practical terms, the JV substitution is recommended for all Latin text preprocessing; it helps to collapse the search space.
>>> replacer = JVReplacer()
>>> replacer.replace("Julius Caesar")
'Iulius Caesar'
>>> replacer.replace("In vino veritas.")
'In uino ueritas.'
- class cltk.alphabet.lat.LigatureReplacer[source]¶
Bases:
object
Replace 'æ' with 'ae' and 'œ' with 'oe' (likewise 'Æ' with 'AE' and 'Œ' with 'OE'). Classical Latin wrote the o and e separately (as has today again become the general practice), but the ligature was used in medieval and early modern writing, in part because the diphthongal sound had, by Late Latin, merged into the sound [e]. See: https://en.wikipedia.org/wiki/%C5%92
Æ (minuscule: æ) is a grapheme named æsc or ash, formed from the letters a and e, originally a ligature representing the Latin diphthong ae. It has been promoted to the full status of a letter in the alphabets of some languages, including Danish, Norwegian, Icelandic, and Faroese. See: https://en.wikipedia.org/wiki/%C3%86
>>> replacer = LigatureReplacer()
>>> replacer.replace("mæd")
'maed'
>>> replacer.replace("prœil")
'proeil'
- cltk.alphabet.lat.dehyphenate(text)[source]¶
Remove hyphens from text; intended for texts whose line breaks introduce hyphens that may creep into the text. Use with caution elsewhere.
- Parameters:
text (str) – text to dehyphenate
- Return type:
str
>>> dehyphenate('quid re-tundo hier')
'quid retundo hier'
- cltk.alphabet.lat.swallow(text, pattern_matcher)[source]¶
Utility function internal to this module
- Parameters:
text (
str
) – text to cleanpattern_matcher (
Pattern
) – pattern to match
- Return type:
str
- Returns:
the text without the matched pattern; spaces are not substituted
- cltk.alphabet.lat.swallow_braces(text)[source]¶
Remove text within braces, and drop the braces.
- Parameters:
text (str) – Text with braces
- Return type:
str
- Returns:
Text with the braces and any text inside removed
>>> swallow_braces("{PRO P. QVINCTIO ORATIO} Quae res in civitate {etc}... ")
'Quae res in civitate ...'
- cltk.alphabet.lat.drop_latin_punctuation(text)[source]¶
Drop all Latin punctuation except the hyphen and obelization markers, replacing the punctuation with a space. Please collapse hyphenated words and remove obelization marks separately beforehand.
The hyphen is important in Latin tokenization because the enclitic particle -ne is different from the interjection ne.
- Parameters:
text (str) – Text to clean
- Return type:
str
- Returns:
cleaned text
>>> drop_latin_punctuation('quid est ueritas?')
'quid est ueritas '
>>> drop_latin_punctuation("vides -ne , quod , planus est ")
'vides -ne quod planus est '
>>> drop_latin_punctuation("here is some trash, punct \/':;,!\?\._『@#\$%^&\*okay").replace(" ", " ")
'here is some trash punct okay'
- cltk.alphabet.lat.remove_accents(text)[source]¶
Remove accents. Note: AE replacement and macron replacement should happen elsewhere, if desired.
- Parameters:
text (str) – text with undesired accents
- Return type:
str
- Returns:
clean text
>>> remove_accents('suspensám')
'suspensam'
>>> remove_accents('quăm')
'quam'
>>> remove_accents('aegérrume')
'aegerrume'
>>> remove_accents('ĭndignu')
'indignu'
>>> remove_accents('îs')
'is'
>>> remove_accents('óccidentem')
'occidentem'
>>> remove_accents('frúges')
'fruges'
- cltk.alphabet.lat.remove_macrons(text)[source]¶
Remove macrons above vowels.
- Parameters:
text (str) – text with macronized vowels
- Return type:
str
- Returns:
clean text
>>> remove_macrons("canō")
'cano'
>>> remove_macrons("Īuliī")
'Iulii'
- cltk.alphabet.lat.swallow_angle_brackets(text)[source]¶
Swallow text within angle brackets, together with the word surrounding them.
>>> text = " <O> mea dext<e>ra illa CICERO RUFO Quo<quo>. modo proficiscendum <in> tuis. deesse HS <c> quae metu <exagitatus>, furore <es>set consilium "
>>> swallow_angle_brackets(text)
'mea illa CICERO RUFO modo proficiscendum tuis. deesse HS quae metu furore consilium'
- Return type:
str
- cltk.alphabet.lat.disappear_angle_brackets(text)[source]¶
Remove all angle brackets, keeping the surrounding text; no spaces are inserted.
- Parameters:
text (str) – text with angle brackets
- Return type:
str
- Returns:
text without angle brackets
- cltk.alphabet.lat.swallow_square_brackets(text)[source]¶
Swallow text inside square brackets, without substituting a space.
- Parameters:
text (str) – text to clean
- Return type:
str
- Returns:
text with the square brackets and the text inside them removed
>>> swallow_square_brackets("qui aliquod institui[t] exemplum")
'qui aliquod institui exemplum'
>>> swallow_square_brackets("posthac tamen cum haec [tamen] quaeremus,")
'posthac tamen cum haec quaeremus,'
- cltk.alphabet.lat.swallow_obelized_words(text)[source]¶
Swallow obelized words; handles both enclosed words and words flagged on the left. Considers plus signs and daggers as obelization markers.
- Parameters:
text (str) – Text with obelized words
- Return type:
str
- Returns:
clean text
>>> swallow_obelized_words("tu Fauonium †asinium† dicas")
'tu Fauonium dicas'
>>> swallow_obelized_words("tu Fauonium †asinium dicas")
'tu Fauonium dicas'
>>> swallow_obelized_words("meam +similitudinem+")
'meam'
>>> swallow_obelized_words("mea +ratio non habet")
'mea non habet'
- cltk.alphabet.lat.disappear_round_brackets(text)[source]¶
Remove round brackets and keep the text inside them intact.
- Parameters:
text (str) – Text with round brackets.
- Return type:
str
- Returns:
Clean text.
>>> disappear_round_brackets("trib(unus) mil(itum) leg(ionis) III")
'tribunus militum legionis III'
- cltk.alphabet.lat.swallow_editorial(text)[source]¶
Swallow common editorial marks.
- Parameters:
text (str) – Text with editorial marks
- Return type:
str
- Returns:
Clean text.
>>> swallow_editorial("{PRO P. QVINCTIO ORATIO} Quae res in civitate trib(unus) mil(itum) leg(ionis) III tu Fauonium †asinium† dicas meam +similitudinem+ mea +ratio non habet ... ")
'{PRO P. QVINCTIO ORATIO} Quae res in civitate tribunus militum legionis III tu Fauonium dicas meam mea non habet ...'
- cltk.alphabet.lat.accept_editorial(text)[source]¶
Accept common editorial suggestions.
- Parameters:
text (str) – Text with editorial suggestions
- Return type:
str
- Returns:
clean text
>>> accept_editorial("{PRO P. QVINCTIO ORATIO} Quae res in civitate trib(unus) mil(itum) leg(ionis) III tu Fauonium †asinium† dicas meam +similitudinem+ mea +ratio non habet ... ")
'Quae res in civitate tribunus militum legionis III tu Fauonium dicas meam mea non habet '
- cltk.alphabet.lat.truecase(word, case_counter)[source]¶
Truecase a word using a truecasing dictionary.
- Parameters:
word (str) – a word
case_counter (Dict[str, int]) – a counter: a dictionary of words/tokens and their relative frequency counts
- Returns:
the truecased word
>>> case_counts = {"caesar": 1, "Caesar": 99}
>>> truecase('CAESAR', case_counts)
'Caesar'
- cltk.alphabet.lat.normalize_lat(text, drop_accents=False, drop_macrons=False, jv_replacement=False, ligature_replacement=False)[source]¶
The function for all default Latin normalization.
>>> text = "canō Īuliī suspensám quăm aegérrume ĭndignu îs óccidentem frúges Julius Caesar. In vino veritas. mæd prœil"
>>> normalize_lat(text)
'canō Īuliī suspensám quăm aegérrume ĭndignu îs óccidentem frúges Julius Caesar. In vino veritas. mæd prœil'
>>> normalize_lat(text, drop_accents=True)
'canō Īuliī suspensam quăm aegerrume ĭndignu is óccidentem frúges Julius Caesar. In vino veritas. mæd prœil'
>>> normalize_lat(text, drop_accents=True, drop_macrons=True)
'cano Iulii suspensam quăm aegerrume ĭndignu is óccidentem frúges Julius Caesar. In vino veritas. mæd prœil'
>>> normalize_lat(text, drop_accents=True, drop_macrons=True, jv_replacement=True)
'cano Iulii suspensam quăm aegerrume ĭndignu is óccidentem frúges Iulius Caesar. In uino ueritas. mæd prœil'
>>> normalize_lat(text, drop_accents=True, drop_macrons=True, jv_replacement=True, ligature_replacement=True)
'cano Iulii suspensam quăm aegerrume ĭndignu is óccidentem frúges Iulius Caesar. In uino ueritas. maed proeil'
- Return type:
str
8.1.1.16. cltk.alphabet.non module¶
Old Norse runes, Unicode block: 16A0–16FF. Source: Viking Language 1, Jessie L. Byock
TODO: Document and test better.
- class cltk.alphabet.non.AutoName(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases:
Enum
- class cltk.alphabet.non.RunicAlphabetName(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases:
AutoName
- elder_futhark = 'elder_futhark'¶
- younger_futhark = 'younger_futhark'¶
- short_twig_younger_futhark = 'short_twig_younger_futhark'¶
- class cltk.alphabet.non.Rune(runic_alphabet, form, sound, transcription, name)[source]¶
Bases:
object
>>> Rune(RunicAlphabetName.elder_futhark, "ᚺ", "h", "h", "haglaz")
ᚺ
>>> Rune.display_runes(ELDER_FUTHARK)
['ᚠ', 'ᚢ', 'ᚦ', 'ᚨ', 'ᚱ', 'ᚲ', 'ᚷ', 'ᚹ', 'ᚺ', 'ᚾ', 'ᛁ', 'ᛃ', 'ᛇ', 'ᛈ', 'ᛉ', 'ᛊ', 'ᛏ', 'ᛒ', 'ᛖ', 'ᛗ', 'ᛚ', 'ᛜ', 'ᛟ', 'ᛞ']
- class cltk.alphabet.non.Transcriber[source]¶
Bases:
object
>>> little_jelling_stone = "᛬ᚴᚢᚱᛘᛦ᛬ᚴᚢᚾᚢᚴᛦ᛬ᚴ(ᛅᚱ)ᚦᛁ᛬ᚴᚢᛒᛚ᛬ᚦᚢᛋᛁ᛬ᛅ(ᚠᛏ)᛬ᚦᚢᚱᚢᛁ᛬ᚴᚢᚾᚢ᛬ᛋᛁᚾᛅ᛬ᛏᛅᚾᛘᛅᚱᚴᛅᛦ᛬ᛒᚢᛏ᛬"
>>> Transcriber.transcribe(little_jelling_stone, YOUNGER_FUTHARK)
'᛫kurmR᛫kunukR᛫k(ar)þi᛫kubl᛫þusi᛫a(ft)᛫þurui᛫kunu᛫sina᛫tanmarkaR᛫but᛫'
- static from_form_to_transcription(runic_alphabet)[source]¶
Make a dictionary whose keys are the forms of runes and whose values are their transcriptions. Used by the transcribe method.
- Parameters:
runic_alphabet (list)
- Returns:
dict
- static transcribe(rune_sentence, runic_alphabet)[source]¶
From a runic inscription, the transcribe method gives a conventional transcription.
- Parameters:
rune_sentence (str) – elements of this are from runic_alphabet or are punctuation
runic_alphabet (list)
8.1.1.17. cltk.alphabet.omr module¶
The alphabet for Marathi.
# Using the International Alphabet of Sanskrit Transliteration (IAST), these vowels are represented thus
>>> from cltk.alphabet import omr
>>> omr.VOWELS[:5]
['अ', 'आ', 'इ', 'ई', 'उ']
>>> omr.IAST_VOWELS[:5]
['a', 'ā', 'i', 'ī', 'u']
>>> list(zip(omr.SEMI_VOWELS, omr.IAST_SEMI_VOWELS))
[('य', 'y'), ('र', 'r'), ('ल', 'l'), ('व', 'w')]
8.1.1.18. cltk.alphabet.ory module¶
The Odia alphabet.
>>> from cltk.alphabet import ory
>>> ory.VOWELS["0B05"]
'ଅ'
>>> ory.STRUCTURED_CONSONANTS["0B15"]
'କ'
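Note that the ory tables are keyed by Unicode code points written as hex strings, as shown above; a key can be converted back to its character with chr (the helper name here is illustrative, not part of the module):

```python
# Convert a hex code-point key (as used by cltk.alphabet.ory) to its character.
def codepoint_key_to_char(key: str) -> str:
    return chr(int(key, 16))
```

For example, codepoint_key_to_char("0B05") gives 'ଅ' and codepoint_key_to_char("0B15") gives 'କ'.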
8.1.1.19. cltk.alphabet.osc module¶
The Oscan alphabet. Sources:
<https://www.unicode.org/charts/PDF/U10300.pdf>
Buck, C. A Grammar of Oscan and Umbrian.
8.1.1.20. cltk.alphabet.ota module¶
The Ottoman alphabet.
Misc. notes:
Based on the Persian alphabet transliteration in CLTK by Iman Nazar
Uses UTF-8 Encoding for Ottoman/Persian Letters
When printed, Arabic letters appear in the console left to right and inconsistently joined, but they link and flow correctly right to left when pasted into a word processor. The problems exist only in the terminal.
TODO: Add tests
8.1.1.21. cltk.alphabet.oty module¶
Alphabet for Old Tamil. GRANTHA_CONSONANTS are from the Grantha script, which was used between the 6th and 20th centuries to write Sanskrit and the classical language Manipravalam.
TODO: Add tests
8.1.1.22. cltk.alphabet.peo module¶
The Old Persian cuneiform script.
8.1.1.23. cltk.alphabet.pes module¶
The Persian alphabet.
TODO: Write tests.
8.1.1.24. cltk.alphabet.pli module¶
The Pali alphabet.
TODO: Add tests.
8.1.1.25. cltk.alphabet.processes module¶
This module holds the Process for normalizing text strings, usually before the text is sent to other processes.
- class cltk.alphabet.processes.NormalizeProcess(language=None)[source]¶
Bases:
Process
Generic process for text normalization.
- language: str = None¶
- algorithm¶
- class cltk.alphabet.processes.GreekNormalizeProcess(language=None)[source]¶
Bases:
NormalizeProcess
Text normalization for Ancient Greek.
>>> from cltk.core.data_types import Doc, Word
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> lang = "grc"
>>> orig_text = get_example_text(lang)
>>> non_normed_doc = Doc(raw=orig_text)
>>> normalize_proc = GreekNormalizeProcess(language=lang)
>>> normalized_text = normalize_proc.run(input_doc=non_normed_doc)
>>> normalized_text == orig_text
False
- language: str = 'grc'¶
- class cltk.alphabet.processes.LatinNormalizeProcess(language=None)[source]¶
Bases:
NormalizeProcess
Text normalization for Latin.
>>> from cltk.core.data_types import Doc, Word
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> lang = "lat"
>>> orig_text = get_example_text(lang)
>>> non_normed_doc = Doc(raw=orig_text)
>>> normalize_proc = LatinNormalizeProcess(language=lang)
>>> normalized_text = normalize_proc.run(input_doc=non_normed_doc)
>>> normalized_text == orig_text
False
- language: str = 'lat'¶
8.1.1.26. cltk.alphabet.san module¶
Data module for the Sanskrit language's alphabet and related characters.
8.1.1.27. cltk.alphabet.tel module¶
The Telugu alphabet.
TODO: Add tests.
8.1.1.28. cltk.alphabet.text_normalization module¶
Functions for preprocessing texts. Not language-specific.
- cltk.alphabet.text_normalization.remove_non_ascii(input_string)[source]¶
Remove non-ASCII characters. Source: http://stackoverflow.com/a/1342373
- cltk.alphabet.text_normalization.remove_non_latin(input_string, also_keep=None)[source]¶
Remove non-Latin characters. also_keep should be a list of additional characters (e.g. punctuation) that will not be filtered.
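A rough sketch of the behavior described, under assumed semantics (keep ASCII letters, whitespace, and anything in also_keep); this is not the CLTK implementation:

```python
import re

# Illustrative sketch: strip everything except ASCII letters, whitespace,
# and the characters listed in also_keep.
def remove_non_latin_sketch(input_string, also_keep=None):
    keep = "".join(also_keep) if also_keep else ""
    return re.sub(rf"[^A-Za-z\s{re.escape(keep)}]", "", input_string)
```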
- cltk.alphabet.text_normalization.split_trailing_punct(text, punctuation=None)[source]¶
Some tokenizers, including that in Stanza, do not always handle punctuation properly. For example, a trailing colon ("οἶδα:") is not split into an extra punctuation token. This function does such splitting on raw text before it is sent to such a tokenizer.
- Parameters:
text (str) – Input text string.
punctuation (Optional[List[str]]) – List of punctuation marks that should be split off when trailing a word.
- Return type:
str
- Returns:
Text string with trailing punctuation separated by a whitespace character.
>>> raw_text = "κατηγόρων’, οὐκ οἶδα: ἐγὼ δ᾽ οὖν"
>>> split_trailing_punct(text=raw_text)
'κατηγόρων ’, οὐκ οἶδα : ἐγὼ δ᾽ οὖν'
- cltk.alphabet.text_normalization.split_leading_punct(text, punctuation=None)[source]¶
Some tokenizers, including that in Stanza, do not always handle punctuation properly. For example, an opening curly quote ("‘κατηγόρων’") is not split into an extra punctuation token. This function does such splitting on raw text before it is sent to such a tokenizer.
- Parameters:
text (str) – Input text string.
punctuation (Optional[List[str]]) – List of punctuation marks that should be split off when leading a word.
- Return type:
str
- Returns:
Text string with leading punctuation separated by a whitespace character.
>>> raw_text = "‘κατηγόρων’, οὐκ οἶδα: ἐγὼ δ᾽ οὖν"
>>> split_leading_punct(text=raw_text)
'‘ κατηγόρων’, οὐκ οἶδα: ἐγὼ δ᾽ οὖν'
- cltk.alphabet.text_normalization.remove_odd_punct(text, punctuation=None)[source]¶
Remove certain characters that downstream processes do not handle well. It would be better to use split_leading_punct() and split_trailing_punct(); however, the default models from Stanza make very strange mistakes when, e.g., "‘" is made its own token.
What to do about the apostrophe following an elision (e.g., "δ᾽")?
>>> raw_text = "‘κατηγόρων’, οὐκ οἶδα: ἐγὼ δ᾽ οὖν"
>>> remove_odd_punct(raw_text)
'κατηγόρων, οὐκ οἶδα ἐγὼ δ᾽ οὖν'
- Return type:
str
8.1.1.29. cltk.alphabet.urd module¶
The Urdu alphabet.
TODO: Add tests.
8.1.1.30. cltk.alphabet.xlc module¶
The Lycian alphabet. Sources:
<https://www.unicode.org/charts/PDF/U10280.pdf>
8.1.1.31. cltk.alphabet.xld module¶
The Lydian alphabet. Sources:
Payne, A. and Wintjes, J. (2016) Lords of Asia Minor: An Introduction to the Lydians.