8.1.12. cltk.phonology package¶
The phonology module aims to provide tools that:
phonetically/phonologically transcribe words of a given language,
syllabify words.
For some languages, additional tools exist, such as a word stresser (i.e. a function that identifies which syllable is stressed).
These tasks are interesting in themselves for historical linguists and teachers. They are also essential for higher-level tasks such as prosody analysis.
As with all CLTK modules, the phonology module may be extended and improved if its features do not suit your needs, either because they are insufficient or because they do not follow the rules you want to test (agreement on the phonology of extinct languages is often weak).
8.1.12.1. Subpackages¶
- 8.1.12.1.1. cltk.phonology.ang package
- 8.1.12.1.2. cltk.phonology.arb package
- 8.1.12.1.3. cltk.phonology.enm package
- 8.1.12.1.4. cltk.phonology.gmh package
- 8.1.12.1.5. cltk.phonology.got package
- 8.1.12.1.6. cltk.phonology.grc package
- 8.1.12.1.7. cltk.phonology.lat package
- 8.1.12.1.7.1. Submodules
- 8.1.12.1.7.2. cltk.phonology.lat.phonology module
- 8.1.12.1.7.3. cltk.phonology.lat.syllabifier module
- 8.1.12.1.7.4. cltk.phonology.lat.transcription module
Phone
Word
Word._refresh()
Word._j_maker()
Word._w_maker()
Word._wj_block()
Word._uj_diph_maker()
Word._b_devoice()
Word._final_m_drop()
Word._n_place_assimilation()
Word._g_n_nasality_assimilation()
Word._ns_nf_lengthening()
Word._l_darken()
Word._j_z_doubling()
Word._long_vowel_catcher()
Word._e_i_closer_before_vowel()
Word._intervocalic_j()
Word.ALTERNATIONS
Word._alternate()
Word.syllabify()
Word._print_ipa()
Transcriber
- 8.1.12.1.8. cltk.phonology.non package
- 8.1.12.1.8.1. Subpackages
- 8.1.12.1.8.2. Submodules
- 8.1.12.1.8.3. cltk.phonology.non.orthophonology module
- 8.1.12.1.8.4. cltk.phonology.non.phonology module
- 8.1.12.1.8.5. cltk.phonology.non.syllabifier module
- 8.1.12.1.8.6. cltk.phonology.non.transcription module
- 8.1.12.1.8.7. cltk.phonology.non.utils module
8.1.12.2. Submodules¶
8.1.12.3. cltk.phonology.akk module¶
Functions and classes for Akkadian phonology.
- cltk.phonology.akk.get_cv_pattern(word, pprint=False)[source]¶
Return a patterned string representing the consonants and vowels of the input word.
>>> word = 'iparras' >>> get_cv_pattern(word) [('V', 1, 'i'), ('C', 1, 'p'), ('V', 2, 'a'), ('C', 2, 'r'), ('C', 2, 'r'), ('V', 2, 'a'), ('C', 3, 's')] >>> get_cv_pattern(word, True) 'V₁C₁V₂C₂C₂V₂C₃'
- Return type:
Union[List[Tuple[str, int, str]], str]
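The numbering in the pattern can be read as "index of first appearance within its class". The following is a minimal, self-contained sketch of that idea, not CLTK's implementation: the names `cv_pattern`, `pretty` and the vowel inventory `AKKADIAN_VOWELS` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumption: a small illustrative vowel inventory, not CLTK's actual list.
AKKADIAN_VOWELS = set("aeiuāēīūâêîû")
SUBSCRIPTS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")


def cv_pattern(word: str) -> List[Tuple[str, int, str]]:
    """Label each character C or V and number distinct sounds per class."""
    # class -> {character: index assigned at its first appearance}
    seen: Dict[str, Dict[str, int]] = {"V": {}, "C": {}}
    pattern = []
    for char in word:
        kind = "V" if char in AKKADIAN_VOWELS else "C"
        index = seen[kind].setdefault(char, len(seen[kind]) + 1)
        pattern.append((kind, index, char))
    return pattern


def pretty(pattern: List[Tuple[str, int, str]]) -> str:
    """Render the pattern as a string with subscript indices."""
    return "".join(kind + str(index).translate(SUBSCRIPTS) for kind, index, _ in pattern)


print(cv_pattern("iparras"))
print(pretty(cv_pattern("iparras")))
```

Repeated sounds reuse their index, which is why both occurrences of `r` and of `a` carry the number 2.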
- cltk.phonology.akk.syllabify(word)[source]¶
Split Akkadian words into a list of syllables.
>>> syllabify("napištašunu") ['na', 'piš', 'ta', 'šu', 'nu']
>>> syllabify("epištašu") ['e', 'piš', 'ta', 'šu']
- Return type:
List[str]
- cltk.phonology.akk.find_stress(word)[source]¶
Find the stressed syllable in a word. The general logic follows Huehnergard, 3rd edition (pp. 3-4). Syllables fall into three weight classes:
(a) Light: ending in a short vowel, e.g., -a, -ba.
(b) Heavy: ending in a long vowel marked with a macron, or in a short vowel plus a consonant, e.g., -ā, -bā, -ak, -bak.
(c) Ultraheavy: ending in a long vowel marked with a circumflex, or in any long vowel plus a consonant, e.g., -â, -bâ, -āk, -bāk, -âk, -bâk.
Stress is then assigned as follows:
(a) If the last syllable is ultraheavy, it bears the stress.
(b) Otherwise, stress falls on the last non-final heavy or ultraheavy syllable.
(c) Words that contain no non-final heavy or ultraheavy syllables are stressed on the first syllable.
>>> find_stress("napištašunu") ['na', '[piš]', 'ta', 'šu', 'nu']
- Return type:
List[str]
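The weight classes and stress rules described for find_stress can be sketched on an already syllabified word. This is an illustrative reading of those rules, not the CLTK source: `syllable_weight`, `mark_stress` and the vowel sets are hypothetical names and inventories.

```python
import unicodedata
from typing import List

# Assumption: illustrative vowel inventories for transliterated Akkadian.
SHORT = set("aeiu")
MACRON = set("āēīū")
CIRCUMFLEX = set("âêîû")
VOWELS = SHORT | MACRON | CIRCUMFLEX

LIGHT, HEAVY, ULTRAHEAVY = 0, 1, 2


def syllable_weight(syllable: str) -> int:
    """Classify a syllable as light, heavy or ultraheavy."""
    syllable = unicodedata.normalize("NFC", syllable)
    last = syllable[-1]
    if last in CIRCUMFLEX:
        return ULTRAHEAVY
    if last in MACRON:
        return HEAVY
    if last in SHORT:
        return LIGHT
    # Closed syllable: the weight depends on the length of its vowel.
    nucleus = next((ch for ch in reversed(syllable) if ch in VOWELS), "")
    return ULTRAHEAVY if nucleus in MACRON | CIRCUMFLEX else HEAVY


def mark_stress(syllables: List[str]) -> List[str]:
    """Return the syllables with the stressed one wrapped in brackets."""
    weights = [syllable_weight(s) for s in syllables]
    stressed = 0  # rule (c): fall back to the first syllable
    if weights[-1] == ULTRAHEAVY:  # rule (a)
        stressed = len(syllables) - 1
    else:  # rule (b): last non-final heavy or ultraheavy syllable
        for i in range(len(syllables) - 2, -1, -1):
            if weights[i] >= HEAVY:
                stressed = i
                break
    return [f"[{s}]" if i == stressed else s for i, s in enumerate(syllables)]


print(mark_stress(["na", "piš", "ta", "šu", "nu"]))
```

Here `piš` is the only non-final heavy syllable (short vowel plus consonant), so it receives the stress.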
8.1.12.4. cltk.phonology.orthophonology module¶
A module for representing the orthophonology of a language: the mapping from orthographic representations to IPA symbols.
Pre-modern languages are characterized by their non-standardized writing rules. Writers attempt to follow rules that fit morphology (words of the same family tend to have similar spellings) and phonology (words with similar pronunciations are written the same way). As languages evolve, their phonology changes faster than their writing rules. This module aims to unify writing rules with phonological rules by borrowing the representation of sound changes used in historical linguistics.
Based on many ideas in cltk.phonology.non.utils by Clément Besnier <clem@clementbesnier.fr>.
- class cltk.phonology.orthophonology.PhonologicalFeature(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases:
CLTKEnum
- class cltk.phonology.orthophonology.Consonantal(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases:
PhonologicalFeature
- neg = 1¶
- pos = 2¶
- class cltk.phonology.orthophonology.Voiced(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases:
PhonologicalFeature
- neg = 1¶
- pos = 2¶
- class cltk.phonology.orthophonology.Aspirated(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases:
PhonologicalFeature
- neg = 1¶
- pos = 2¶
- class cltk.phonology.orthophonology.Geminate(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases:
PhonologicalFeature
- neg = 1¶
- pos = 2¶
- class cltk.phonology.orthophonology.Roundedness(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases:
PhonologicalFeature
- neg = 1¶
- pos = 2¶
- class cltk.phonology.orthophonology.Length(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases:
PhonologicalFeature
- short = 1¶
- long = 2¶
- overlong = 3¶
- class cltk.phonology.orthophonology.Height(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases:
PhonologicalFeature
- close = 1¶
- near_close = 2¶
- close_mid = 3¶
- mid = 4¶
- open_mid = 5¶
- near_open = 6¶
- open = 7¶
- class cltk.phonology.orthophonology.Backness(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases:
PhonologicalFeature
- front = 1¶
- central = 2¶
- back = 3¶
- class cltk.phonology.orthophonology.Manner(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases:
PhonologicalFeature
- stop = 1¶
- fricative = 2¶
- affricate = 3¶
- nasal = 4¶
- lateral = 5¶
- trill = 6¶
- spirant = 7¶
- approximant = 8¶
- class cltk.phonology.orthophonology.Place(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases:
PhonologicalFeature
- bilabial = 1¶
- labio_dental = 2¶
- dental = 3¶
- alveolar = 4¶
- post_alveolar = 5¶
- retroflex = 6¶
- palatal = 7¶
- velar = 8¶
- uvular = 9¶
- glottal = 10¶
- class cltk.phonology.orthophonology.AbstractPhoneme(features=None, ipa=None)[source]¶
Bases:
object
An abstract phoneme is just a bundle of phonological features.
- merge(other)[source]¶
Returns a copy of this phoneme, with the features of other merged into this feature bundle. Other can be a list of phonemes, in which case the list is returned (for technical reasons). Other may also be a single feature value or a list of feature values.
- is_equal(other)[source]¶
Phonemes are equal if they share the same features. Note that the IPA symbol is not taken into account.
- matches(other)[source]¶
This phoneme matches other if other contains all the features of this phoneme, i.e. if this phoneme has an improper subset of other’s. If other is a disjunctive list, then a match is sought for any of the list. If other is a feature value or list of feature values, it is promoted to a phoneme first.
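The subset semantics of matches, and the way is_equal ignores the IPA symbol, can be pictured with plain sets. The sketch below is a simplified stand-in, not the AbstractPhoneme class itself: `FeatureBundle` is a hypothetical name, and the two small enums only mirror the names of features listed above.

```python
from enum import Enum
from typing import FrozenSet


class Voiced(Enum):
    neg = 1
    pos = 2


class Manner(Enum):
    stop = 1
    fricative = 2


class FeatureBundle:
    """Hypothetical stand-in for a phoneme: feature values plus an IPA label."""

    def __init__(self, *features: Enum, ipa: str = "") -> None:
        self.features: FrozenSet[Enum] = frozenset(features)
        self.ipa = ipa

    def is_equal(self, other: "FeatureBundle") -> bool:
        # Only the features count; the IPA symbol is ignored.
        return self.features == other.features

    def matches(self, other: "FeatureBundle") -> bool:
        # True when every feature of self is also present in other.
        return self.features <= other.features


any_stop = FeatureBundle(Manner.stop)
d = FeatureBundle(Manner.stop, Voiced.pos, ipa="d")

print(any_stop.matches(d))  # an underspecified "stop" matches the fuller /d/
print(d.matches(any_stop))  # the reverse does not hold
```

Under this reading, an underspecified phoneme works as a pattern: the fewer features it carries, the more phonemes it matches.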
- cltk.phonology.orthophonology.make_phoneme(*feature_values)[source]¶
Creates an abstract phoneme made of the feature specifications given in the vararg.
- Return type:
- cltk.phonology.orthophonology.PositionedPhoneme(phoneme, word_initial=False, word_final=False, syllable_initial=False, syllable_final=False, env_start=False, env_end=False)[source]¶
A decorator for phonemes, used in applying rules over words. Returns a copy of the input phoneme, with additional attributes, specifying whether the phoneme occurs at a word or syllable boundary, or its position in an environment.
- class cltk.phonology.orthophonology.AlwaysMatchingPseudoPhoneme[source]¶
Bases:
AbstractPhoneme
A pseudo-phoneme that matches all other phonemes.
- matches(other)[source]¶
This phoneme matches other if other contains all the features of this phoneme, i.e. if this phoneme has an improper subset of other’s. If other is a disjunctive list, then a match is sought for any of the list. If other is a feature value or list of feature values, it is promoted to a phoneme first.
- Return type:
bool
- class cltk.phonology.orthophonology.WordBoundaryPseudoPhoneme[source]¶
Bases:
AbstractPhoneme
A pseudo-phoneme that only matches at the start or end of a word.
- matches(other)[source]¶
This phoneme matches other if other contains all the features of this phoneme, i.e. if this phoneme has an improper subset of other’s. If other is a disjunctive list, then a match is sought for any of the list. If other is a feature value or list of feature values, it is promoted to a phoneme first.
- Return type:
bool
- class cltk.phonology.orthophonology.SyllableBoundaryPseudoPhoneme[source]¶
Bases:
AbstractPhoneme
A pseudo-phoneme that matches at word boundaries and matches positioned phonemes that are at syllable boundaries.
- matches(other)[source]¶
This phoneme matches other if other contains all the features of this phoneme, i.e. if this phoneme has an improper subset of other’s. If other is a disjunctive list, then a match is sought for any of the list. If other is a feature value or list of feature values, it is promoted to a phoneme first.
- Return type:
bool
- class cltk.phonology.orthophonology.PhonemeDisjunction(*phonemes)[source]¶
Bases:
list
A list of phonemes, with special properties for disjunctive (“or”) matching.
- class cltk.phonology.orthophonology.Consonant(place, manner, voiced, ipa, geminate=Geminate.neg, aspirated=Aspirated.neg)[source]¶
Bases:
AbstractPhoneme
Based on cltk.phonology.utils by @clemsciences. A consonant is a phoneme that is specified for the features listed in the IPA chart for consonants: Place, Manner, Voicing. These may be read directly off the IPA chart, which also gives the IPA symbol. The Consonantal feature is set to positive, and the aspirated is defaulted to negative. See http://www.ipachart.com/
- is_more_sonorous(other)[source]¶
Compares this phoneme to another for sonority. Used for Sonority Sequencing Principle (SSP) considerations.
- Return type:
bool
- class cltk.phonology.orthophonology.Vowel(height, backness, rounded, length, ipa)[source]¶
Bases:
AbstractPhoneme
The representation of a vowel by its features, as given in the IPA chart for vowels. See http://www.ipachart.com/
- lengthen()[source]¶
Returns a new Vowel with its Length lengthened, and “ː” appended to its IPA symbol.
- class cltk.phonology.orthophonology.BasePhonologicalRule(condition, action)[source]¶
Bases:
object
Base class for conditional phonological rules. A phonological rule relates an item (a phoneme) to its environment to define a transformation. Specifically, a rule specifies a condition and an action.
The condition characterizes the phonological environment of a phoneme in terms of the characteristics of the phoneme before it (if any) and after it (if any). In general it is a function taking three arguments (before, target, after: the phonemes in the environment) and returning a boolean for whether the rule should fire.
The action defines a transformation of the target phoneme, e.g. its vocalization. It is a function taking only the target, which returns the replacement phoneme OR a list of phonemes.
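As a rough illustration of the condition/action split, the sketch below uses plain strings in place of phoneme objects and assumes the action is called with the target phoneme. `Rule`, `apply_rules` and the sample voicing rule are hypothetical and only mimic the shape described here.

```python
from typing import Callable, List, Optional, Sequence, Union

Condition = Callable[[Optional[str], str, Optional[str]], bool]
Action = Callable[[str], Union[str, List[str]]]

VOWELS = set("aeiou")


class Rule:
    """Hypothetical rule: a condition on the environment plus an action."""

    def __init__(self, condition: Condition, action: Action) -> None:
        self.condition = condition
        self.action = action


# Sample rule: /s/ becomes /z/ between vowels.
voice_s = Rule(
    condition=lambda before, target, after: (
        target == "s" and before in VOWELS and after in VOWELS
    ),
    action=lambda target: "z",
)


def apply_rules(phonemes: Sequence[str], rules: Sequence[Rule]) -> List[str]:
    """Apply the first matching rule to each phoneme, left to right."""
    output: List[str] = []
    for i, target in enumerate(phonemes):
        # The neighbours are None at the edges of the word.
        before = phonemes[i - 1] if i > 0 else None
        after = phonemes[i + 1] if i + 1 < len(phonemes) else None
        for rule in rules:
            if rule.condition(before, target, after):
                result = rule.action(target)
                output.extend(result if isinstance(result, list) else [result])
                break  # later rules are not tested
        else:
            output.append(target)  # no rule fired: keep the phoneme
    return output


print(apply_rules(list("rosa"), [voice_s]))
```

Conditions are evaluated against the original sequence, so the output of one rule never feeds the environment of another in this sketch.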
- class cltk.phonology.orthophonology.PhonologicalRule(condition, action)[source]¶
Bases:
BasePhonologicalRule
The most general phonological rule can apply anywhere in the word. before and after phonemes may therefore be null when calling the condition.
- exception cltk.phonology.orthophonology.PhonemeNotFound(phoneme)[source]¶
Bases:
Exception
Exception raised when a search for a phoneme in the inventory fails.
- exception cltk.phonology.orthophonology.LetterNotFound(letter)[source]¶
Bases:
Exception
Exception raised when a search for a letter in the alphabet fails.
- class cltk.phonology.orthophonology.Orthophonology(sound_inventory, alphabet, diphthongs, digraphs, to_modern=({'m': 'm', 'n': 'n', 'n̥': 'ng', 'ŋ': 'ng', 'p': 'p', 'b': 'b', 't': 't', 'd': 'd', 'k': 'k', 'g': 'g', 't͡ʃ': 'ch', 'd͡ʒ': 'ge', 'f': 'f', 'v': 'v', 'θ': 'th', 'ð': 'th', 's': 's', 'z': 'z', 'ʃ': 'sh', 'ç': 'ch', 'x': 'ch', 'y': 'y', 'h': 'h', 'l': 'l', 'l̥': 'l', 'j': 'y', 'w': 'w', 'r': 'r', 'r̥': 'r', 'i': 'i', 'i:': 'ee', 'y:': 'y', 'u': 'u', 'u:': 'oo', 'e': 'e', 'e:': 'ee', 'ø': 'e', 'ø:': 'ee', 'o': 'o', 'o:': 'oo', 'æ': 'a', 'æ:': 'aa', 'ɑ': 'o', 'ɑ:': 'oo', 'æɑ': 'ao', 'æ:ɑ': 'ao', 'eo': 'eo', 'e:o': 'eeo', 'iu': 'iu', 'i:u': 'iiu'}, [('(^|(?<= ))hw', 'wh'), ('oo(.)(^|(?= ))', 'o\\1e')]))[source]¶
Bases:
object
The ortho-phonology of a language is described by:
The inventory of all the phonemes of the language.
A mapping of orthographic symbols to phonemes.
mappings of orthographic symbols pairs to:
diphthongs
phonemes (i.e. digraphs)
phonological rules for the contextual transformation of phonological representations.
The class is very clearly aimed at alphabetic orthographies. Its usefulness for e.g. pictographic orthographies is questionable.
- add_rule(rule)[source]¶
Adds a rule to the orthophonology. The order in which rules are added is critical, since the first rule that matches fires.
- _position_phonemes(phonemes)[source]¶
Mark syllable boundaries, and, in future, other positional/suprasegmental features?
- transcribe_word(word)[source]¶
The heart of the transcription process. Similar to the system in cltk.phonology.utils, the algorithm:
1) Applies digraphs and diphthongs to the text of the word.
2) Carries out a naive (“greedy”, per @clemsciences) substitution of letters to phonemes, according to the alphabet.
3) Applies the conditions of the rules to the environment of each phoneme in turn. The first rule matched fires. There is no restart and later rules are not tested. Also, if a rule returns multiple phonemes, these are never re-tested by the rule set.
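The first two steps (digraphs first, then single letters) amount to a greedy left-to-right scan. The sketch below illustrates that scan with plain dictionaries; `to_phonemes` and the toy `digraphs` and `alphabet` tables are assumptions for illustration and do not reproduce CLTK's data structures.

```python
from typing import Dict, List


def to_phonemes(word: str, digraphs: Dict[str, str], alphabet: Dict[str, str]) -> List[str]:
    """Greedy substitution: prefer a two-letter match, else map one letter."""
    phonemes: List[str] = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in digraphs:
            phonemes.append(digraphs[pair])
            i += 2
        else:
            # Raises KeyError for a letter missing from the toy alphabet.
            phonemes.append(alphabet[word[i]])
            i += 1
    return phonemes


# Toy tables, invented for this example.
digraphs = {"th": "θ", "sc": "ʃ"}
alphabet = {"a": "a", "i": "i", "p": "p", "s": "s", "c": "k", "t": "t", "h": "h"}

print(to_phonemes("scip", digraphs, alphabet))
print(to_phonemes("path", digraphs, alphabet))
```

Because the scan never backtracks, a digraph always wins over its component letters, which is what makes the substitution "greedy".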
- transcribe(text, as_phonemes=False)[source]¶
Transcribes a text, which is first tokenized into words; each word is then transcribed. If as_phonemes is true, returns a list of lists of phoneme objects, else returns a string concatenation of the IPA symbols of the phonemes.
- Return type:
Union[str, list]
- transcribe_to_modern(text)[source]¶
A very first attempt at transcribing from IPA to some modern orthography. The method is intended to provide the student with clues to the pronunciation of old orthographies.
- Return type:
str
- voice(consonant)[source]¶
Voices a consonant, by searching the sound inventory for a consonant having the same features as the argument, but +voice.
- Return type:
8.1.12.5. cltk.phonology.processes module¶
Processes for phonology.
8.1.12.6. cltk.phonology.syllabifier_processes module¶
This module implements syllabification processes for several languages. You may extend SyllabificationProcess and see pre-defined examples.
- class cltk.phonology.syllabifier_processes.SyllabificationProcess(language=None)[source]¶
Bases:
Process
This is the class to extend if you want to code your own syllabification process in the CLTK-style.
- class cltk.phonology.syllabifier_processes.GreekSyllabificationProcess(language=None)[source]¶
Bases:
SyllabificationProcess
Syllabification Process for Ancient Greek.
>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import GreekTokenizationProcess
>>> from cltk.text.processes import DefaultPunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk import NLP
>>> a_pipeline = Pipeline(description="A custom Greek pipeline", processes=[GreekTokenizationProcess, DefaultPunctuationRemovalProcess, GreekSyllabificationProcess], language=get_lang("grc"))
>>> nlp = NLP(language='grc', custom_pipeline=a_pipeline, suppress_banner=True)
>>> text = get_example_text("grc")
>>> cltk_doc = nlp(text)
>>> [word.syllables for word in cltk_doc.words[:5]]
[['ὅτι'], ['μὲν'], ['ὑμ', 'εῖς'], ['ὦ'], ['ἄν', 'δρ', 'ες']]
- description = 'The default Latin Syllabification process'¶
- algorithm¶
- class cltk.phonology.syllabifier_processes.LatinSyllabificationProcess(language=None)[source]¶
Bases:
SyllabificationProcess
Syllabification Process for Latin.
>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import LatinTokenizationProcess
>>> from cltk.text.processes import DefaultPunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk import NLP
>>> a_pipeline = Pipeline(description="A custom Latin pipeline", processes=[LatinTokenizationProcess, DefaultPunctuationRemovalProcess, LatinSyllabificationProcess], language=get_lang("lat"))
>>> nlp = NLP(language='lat', custom_pipeline=a_pipeline, suppress_banner=True)
>>> text = get_example_text("lat")
>>> cltk_doc = nlp(text)
>>> [word.syllables for word in cltk_doc.words[:5]]
[['gal', 'li', 'a'], ['est'], ['om', 'nis'], ['di', 'vi', 'sa'], ['in']]
- description = 'The default Latin Syllabification process'¶
- algorithm¶
- class cltk.phonology.syllabifier_processes.MiddleEnglishSyllabificationProcess(language=None)[source]¶
Bases:
SyllabificationProcess
Syllabification Process for Middle English.
>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import MiddleEnglishTokenizationProcess
>>> from cltk.text.processes import DefaultPunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Middle English pipeline", processes=[MiddleEnglishTokenizationProcess, DefaultPunctuationRemovalProcess, MiddleEnglishSyllabificationProcess], language=get_lang("enm"))
>>> nlp = NLP(language='enm', custom_pipeline=pipe, suppress_banner=True)
>>> text = get_example_text("enm").replace('\n', ' ')
>>> cltk_doc = nlp(text)
>>> [word.syllables for word in cltk_doc.words[:5]]
[['whi', 'lom'], ['as'], ['ol', 'de'], ['sto', 'ries'], ['tellen']]
- description = 'The default Middle English Syllabification process'¶
- algorithm¶
- class cltk.phonology.syllabifier_processes.MiddleHighGermanSyllabificationProcess(language=None)[source]¶
Bases:
SyllabificationProcess
Syllabification Process for Middle High German.
>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import MiddleHighGermanTokenizationProcess
>>> from cltk.text.processes import DefaultPunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Middle High German pipeline", processes=[MiddleHighGermanTokenizationProcess, DefaultPunctuationRemovalProcess, MiddleHighGermanSyllabificationProcess], language=get_lang("gmh"))
>>> nlp = NLP(language='gmh', custom_pipeline=pipe, suppress_banner=True)
>>> text = get_example_text("gmh")
>>> cltk_doc = nlp(text)
>>> [word.syllables for word in cltk_doc.words[:5]]
[['uns'], ['ist'], ['in'], ['al', 'ten'], ['mæ', 'ren']]
- description = 'The default Middle High German syllabification process'¶
- algorithm¶
- class cltk.phonology.syllabifier_processes.OldEnglishSyllabificationProcess(language=None)[source]¶
Bases:
SyllabificationProcess
Syllabification Process for Old English.
>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import MiddleEnglishTokenizationProcess
>>> from cltk.text.processes import DefaultPunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Old English pipeline", processes=[MiddleEnglishTokenizationProcess, DefaultPunctuationRemovalProcess, OldEnglishSyllabificationProcess], language=get_lang("ang"))
>>> nlp = NLP(language='ang', custom_pipeline=pipe, suppress_banner=True)
>>> text = get_example_text("ang")
>>> cltk_doc = nlp(text)
>>> [word.syllables for word in cltk_doc.words[:5]]
[['hwæt'], ['we'], ['gar', 'den', 'a'], ['in'], ['gear', 'da', 'gum']]
- description = 'The default Old English syllabification process'¶
- algorithm¶
- class cltk.phonology.syllabifier_processes.OldNorseSyllabificationProcess(language=None)[source]¶
Bases:
SyllabificationProcess
Syllabification Process for Old Norse.
>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import OldNorseTokenizationProcess
>>> from cltk.text.processes import OldNorsePunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Old Norse pipeline", processes=[OldNorseTokenizationProcess, OldNorsePunctuationRemovalProcess, OldNorseSyllabificationProcess], language=get_lang("non"))
>>> nlp = NLP(language='non', custom_pipeline=pipe, suppress_banner=True)
>>> text = get_example_text("non")
>>> cltk_doc = nlp(text)
>>> [word.syllables for word in cltk_doc.words[:5]]
[['gyl', 'fi'], ['ko', 'nungr'], ['réð'], ['þar'], ['lön', 'dum']]
- description = 'The default Old Norse syllabification process'¶
- algorithm¶
8.1.12.7. cltk.phonology.syllabify module¶
The syllabify module implements two main classes:
Syllabifier
Syllable
Syllabifier implements two general syllabification algorithms:
the Maximum Onset Principle,
the Sonority Sequencing Principle.
They are both based on phonetic principles.
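To give an intuition for the sonority-based approach, here is a deliberately simplified sketch: every vowel is treated as a nucleus, and each consonant cluster between two nuclei is split at its least sonorous consonant. The function name, the sample hierarchy and the handling of hiatus are assumptions; the Syllabifier class is more elaborate.

```python
from typing import List, Sequence


def syllabify_by_sonority(word: str, hierarchy: Sequence[Sequence[str]], vowels: Sequence[str]) -> List[str]:
    """Split a word at sonority minima between vowel nuclei."""
    # Groups are listed from most to least sonorous; a higher number is more sonorous.
    sonority = {
        phoneme: len(hierarchy) - level
        for level, group in enumerate(hierarchy)
        for phoneme in group
    }
    nuclei = [i for i, char in enumerate(word) if char in vowels]
    if not nuclei:
        return [word]
    boundaries = []
    for left, right in zip(nuclei, nuclei[1:]):
        cluster = range(left + 1, right)
        if cluster:
            # The least sonorous consonant opens the next syllable.
            boundaries.append(min(cluster, key=lambda i: sonority[word[i]]))
        else:
            # Adjacent vowels are treated as hiatus (a simplification).
            boundaries.append(right)
    starts = [0] + boundaries
    ends = boundaries + [len(word)]
    return [word[start:end] for start, end in zip(starts, ends)]


# Sample hierarchy, invented for this example.
hierarchy = [
    ["a", "e", "i", "o", "u"],
    ["l", "r"],
    ["m", "n"],
    ["f", "s"],
    ["p", "t", "k", "b", "d", "g"],
]

print(syllabify_by_sonority("feminarum", hierarchy, hierarchy[0]))
print(syllabify_by_sonority("alten", hierarchy, hierarchy[0]))
```

A consonant missing from the hierarchy raises a KeyError in this sketch, which echoes the requirement that the sound inventory be fully specified before syllabifying.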
The Syllable class provides a way to linguistically represent a syllable.
- cltk.phonology.syllabify.get_onsets(text, vowels='aeiou', threshold=0.0002)[source]¶
Source: Resonances in Middle High German: New Methodologies in Prosody, 2017, C. L. Hench
- Parameters:
text – str list: text to be analysed
vowels – str: valid vowels constituting the syllable
threshold – minimum frequency for a valid onset. C. Hench noted that the algorithm produces the best result for an untagged wordset of MHG when retaining onsets which appear in at least 0.02% of the words
Let’s test it on the opening lines of the Nibelungenlied
>>> text = ['uns', 'ist', 'in', 'alten', 'mæren', 'wunders', 'vil', 'geseit', 'von', 'helden', 'lobebæren', 'von', 'grôzer', 'arebeit', 'von', 'fröuden', 'hôchgezîten', 'von', 'weinen', 'und', 'von', 'klagen', 'von', 'küener', 'recken', 'strîten', 'muget', 'ir', 'nu', 'wunder', 'hœren', 'sagen'] >>> vowels = "aeiouæœôîöü" >>> get_onsets(text, vowels=vowels) ['lt', 'm', 'r', 'w', 'nd', 'v', 'g', 's', 'h', 'ld', 'l', 'b', 'gr', 'z', 'fr', 'd', 'chg', 't', 'n', 'kl', 'k', 'ck', 'str']
Of course, this is an insignificant sample, but we could try and see how modifying the threshold affects the returned onset:
>>> get_onsets(text, threshold = 0.05, vowels=vowels) ['m', 'r', 'w', 'nd', 'v', 'g', 's', 'h', 'b', 'z', 't', 'n']
- class cltk.phonology.syllabify.Syllabifier(low_vowels=None, mid_vowels=None, high_vowels=None, flaps=None, laterals=None, nasals=None, fricatives=None, plosives=None, language=None, break_geminants=False, variant=None, sep=None)[source]¶
Bases:
object
Provides two main methods that syllabify words given the phonology of their language.
- set_hierarchy(hierarchy)[source]¶
Sets an alternative sonority hierarchy. Note that you will also need to specify the vowel set with set_vowels so that the module can correctly identify each nucleus.
The order of the phonemes defined is by decreased consonantality
>>> s = Syllabifier() >>> s.set_hierarchy([['i', 'u'], ['e'], ['a'], ['r'], ['m', 'n'], ['f']]) >>> s.set_vowels(['i', 'u', 'e', 'a']) >>> s.syllabify('feminarum') ['fe', 'mi', 'na', 'rum']
- set_vowels(vowels)[source]¶
Define the vowel set of the syllabifier module
>>> s = Syllabifier() >>> s.set_vowels(['i', 'u', 'e', 'a']) >>> s.vowels ['i', 'u', 'e', 'a']
- syllabify(word, mode='SSP')[source]¶
- Parameters:
word (str) – word to syllabify
mode – syllabification algorithm: SSP (Sonority Sequencing Principle) or MOP (Maximum Onset Principle)
- Return type:
Union[List[str], str]
- Returns:
syllabified word
- syllabify_ssp(word)[source]¶
Syllabifies a word according to the Sonority Sequencing Principle
- Parameters:
word (str) – Word to be syllabified
- Return type:
List[str]
- Returns:
List consisting of syllables
First you need to define the manners of articulation:
>>> high_vowels = ['a']
>>> mid_vowels = ['e']
>>> low_vowels = ['i', 'u']
>>> flaps = ['r']
>>> nasals = ['m', 'n']
>>> fricatives = ['f']
>>> s = Syllabifier(high_vowels=high_vowels, mid_vowels=mid_vowels, low_vowels=low_vowels, flaps=flaps, nasals=nasals, fricatives=fricatives)
>>> s.syllabify("feminarum")
['fe', 'mi', 'na', 'rum']
Not specifying your alphabet results in an error:
>>> s.syllabify("foemina")
Traceback (most recent call last):
...
cltk.core.exceptions.CLTKException
Additionally, you can utilize the language parameter:
>>> s = Syllabifier(language='gmh')
>>> s.syllabify('lobebæren')
['lo', 'be', 'bæ', 'ren']
>>> s = Syllabifier(language='enm')
>>> s.syllabify("huntyng")
['hun', 'tyng']
>>> s = Syllabifier(language='ang')
>>> s.syllabify("arcebiscop")
['ar', 'ce', 'bis', 'cop']
The break_geminants parameter ensures a breakpoint is placed between geminates:
>>> geminant_s = Syllabifier(break_geminants=True)
>>> hierarchy = [["a", "á", "æ", "e", "é", "i", "í", "o", "ǫ", "ø", "ö", "œ", "ó", "u", "ú", "y", "ý"], ["j"], ["m"], ["n"], ["p", "b", "d", "g", "t", "k"], ["c", "f", "s", "h", "v", "x", "þ", "ð"], ["r"], ["l"]]
>>> geminant_s.set_hierarchy(hierarchy)
>>> geminant_s.set_vowels(hierarchy[0])
>>> geminant_s.syllabify("ennitungl")
['en', 'ni', 'tungl']
- onset_maximization(syllables)[source]¶
Applies the onset maximisation principle to syllables.
- Parameters:
syllables (List[str]) – list of syllables
- Return type:
List[str]
- legal_onsets(syllables)[source]¶
Filters syllables according to the legality principle
- Parameters:
syllables (
List
[str
]) – list of syllables
The method scans for invalid syllable onsets:
>>> s = Syllabifier(["i", "u", "y"], ["o", "ø", "e"], ["a"], ["r"], ["l"], ["m", "n"], ["f", "v", "s", "h"], ["k", "g", "b", "p", "t", "d"]) >>> s.set_invalid_onsets(['lm']) >>> s.legal_onsets(['a', 'lma', 'tigr']) ['al', 'ma', 'tigr']
You can also define invalid syllable ultima:
>>> s.set_invalid_ultima(['gr']) >>> s.legal_onsets(['al', 'ma', 'ti', 'gr']) ['al', 'ma', 'tigr']
- Return type:
List[str]
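One plausible way to picture the onset repair shown in the first example is to push offending consonants back into the preceding syllable. The sketch below does only that and is not the library's algorithm: `repair_onsets` is a hypothetical helper, and it assumes the banned clusters are non-empty strings.

```python
from typing import List, Sequence


def repair_onsets(syllables: Sequence[str], invalid_onsets: Sequence[str]) -> List[str]:
    """Move leading consonants leftwards until no banned onset remains."""
    repaired = list(syllables)
    for i in range(1, len(repaired)):
        # Each pass shortens the syllable, so the loop always terminates.
        while any(repaired[i].startswith(bad) for bad in invalid_onsets):
            repaired[i - 1] += repaired[i][0]
            repaired[i] = repaired[i][1:]
    return repaired


print(repair_onsets(["a", "lma", "tigr"], ["lm"]))
```

The first syllable is never altered, since there is no preceding syllable to receive its consonants.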
- syllabify_mop(word)[source]¶
>>> from cltk.phonology.gmh.syllabifier import DIPHTHONGS, TRIPHTHONGS, SHORT_VOWELS, LONG_VOWELS, CONSONANTS >>> gmh_syllabifier = Syllabifier() >>> gmh_syllabifier.set_short_vowels(SHORT_VOWELS) >>> gmh_syllabifier.set_vowels(SHORT_VOWELS+LONG_VOWELS) >>> gmh_syllabifier.set_diphthongs(DIPHTHONGS) >>> gmh_syllabifier.set_triphthongs(TRIPHTHONGS) >>> gmh_syllabifier.set_consonants(CONSONANTS)
>>> gmh_syllabifier.syllabify_mop('entslâfen') ['ent', 'slâ', 'fen']
>>> gmh_syllabifier.syllabify_mop('fröude') ['fröu', 'de']
>>> gmh_syllabifier.syllabify_mop('füerest') ['füe', 'rest']
>>> from cltk.phonology.enm.syllabifier import DIPHTHONGS, TRIPHTHONGS, SHORT_VOWELS, LONG_VOWELS >>> enm_syllabifier = Syllabifier() >>> enm_syllabifier.set_short_vowels(SHORT_VOWELS) >>> enm_syllabifier.set_vowels(SHORT_VOWELS+LONG_VOWELS) >>> enm_syllabifier.set_diphthongs(DIPHTHONGS) >>> enm_syllabifier.set_triphthongs(TRIPHTHONGS)
>>> enm_syllabifier.syllabify_mop('heldis') ['hel', 'dis'] >>> enm_syllabifier.syllabify_mop('greef') ['greef']
Once you syllabify the word, the result will be saved as a class variable
>>> enm_syllabifier.syllabify_mop('commaundyd') ['com', 'mau', 'ndyd']
- Parameters:
word (str) – word to syllabify
- Return type:
List[str]
- Returns:
syllabified word
- class cltk.phonology.syllabify.Syllable(text, vowels, consonants)[source]¶
Bases:
object
A syllable has three main constituents:
onset
nucleus
coda
Source: https://en.wikipedia.org/wiki/Syllable
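The three constituents can be recovered from a syllable with a single left-to-right pass: consonants before the first vowel form the onset, the vowel run forms the nucleus, and everything after it is the coda. The sketch below is a minimal illustration of that split; `split_syllable` is a hypothetical helper, not the Syllable class.

```python
from typing import List, Sequence, Tuple


def split_syllable(text: str, vowels: Sequence[str]) -> Tuple[List[str], List[str], List[str]]:
    """Return (onset, nucleus, coda) for a single syllable."""
    onset: List[str] = []
    nucleus: List[str] = []
    coda: List[str] = []
    for char in text:
        if char in vowels and not coda:
            nucleus.append(char)  # vowels before any coda consonant
        elif not nucleus:
            onset.append(char)  # consonants before the first vowel
        else:
            coda.append(char)  # everything after the nucleus
    return onset, nucleus, coda


print(split_syllable("gangr", ["a"]))
print(split_syllable("aurr", ["a", "u"]))
```

For `"gangr"` this yields the onset `['g']`, the nucleus `['a']` and the coda `['n', 'g', 'r']`.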
- _compute_syllable(text)[source]¶
>>> sylla1 = Syllable("armr", ["a"], ["r", "m"]) >>> sylla1.onset [] >>> sylla1.nucleus ['a'] >>> sylla1.coda ['r', 'm', 'r']
>>> sylla2 = Syllable("gangr", ["a"], ["g", "n", "r"]) >>> sylla2.onset ['g'] >>> sylla2.nucleus ['a'] >>> sylla2.coda ['n', 'g', 'r']
>>> sylla3 = Syllable("aurr", ["a", "u"], ["r"]) >>> sylla3.nucleus ['a', 'u'] >>> sylla3.coda ['r', 'r']
- Parameters:
text – a syllable
8.1.12.8. cltk.phonology.transcription_processes module¶
This module provides phonological/phonetic transcribers for several languages. PhonologicalTranscriptionProcess is the parent-class for all other custom transcription processes.
- class cltk.phonology.transcription_processes.PhonologicalTranscriptionProcess(language=None)[source]¶
Bases:
Process
General phonological transcription Process.
- class cltk.phonology.transcription_processes.GothicPhonologicalTranscriberProcess(language=None)[source]¶
Bases:
PhonologicalTranscriptionProcess
Phonological transcription Process for Gothic.
>>> from cltk.core.data_types import Process, Pipeline >>> from cltk.tokenizers.processes import OldNorseTokenizationProcess >>> from cltk.text.processes import DefaultPunctuationRemovalProcess >>> from cltk.languages.utils import get_lang >>> from cltk.languages.example_texts import get_example_text >>> from cltk.nlp import NLP >>> pipe = Pipeline(description="A custom Gothic pipeline", processes=[OldNorseTokenizationProcess, DefaultPunctuationRemovalProcess, GothicPhonologicalTranscriberProcess], language=get_lang("got")) >>> nlp = NLP(language='got', custom_pipeline=pipe, suppress_banner=True) >>> text = get_example_text("got") >>> cltk_doc = nlp(text) >>> [word.phonetic_transcription for word in cltk_doc.words[:5]] ['swa', 'liuhtjɛ', 'liuhaθ', 'jzwar', 'jn']
- description = 'The default Gothic transcription process'¶
- algorithm¶
- class cltk.phonology.transcription_processes.GreekPhonologicalTranscriberProcess(language=None)[source]¶
Bases:
PhonologicalTranscriptionProcess
Phonological transcription Process for Ancient Greek.
>>> from cltk.core.data_types import Process, Pipeline >>> from cltk.tokenizers.processes import GreekTokenizationProcess >>> from cltk.text.processes import DefaultPunctuationRemovalProcess >>> from cltk.languages.utils import get_lang >>> from cltk.languages.example_texts import get_example_text >>> from cltk.nlp import NLP >>> pipe = Pipeline(description="A custom Greek pipeline", processes=[GreekTokenizationProcess, DefaultPunctuationRemovalProcess, GreekPhonologicalTranscriberProcess], language=get_lang("grc")) >>> nlp = NLP(language='grc', custom_pipeline=pipe, suppress_banner=True) >>> text = get_example_text("grc") >>> cltk_doc = nlp(text) >>> [word.phonetic_transcription for word in cltk_doc.words[:5]] ['hó.ti', 'men', 'hy.mệːs', 'ɔ̂ː', 'ɑ́n.dres']
- description = 'The default Greek transcription process'¶
- algorithm¶
- class cltk.phonology.transcription_processes.LatinPhonologicalTranscriberProcess(language=None)[source]¶
Bases:
PhonologicalTranscriptionProcess
Phonological transcription Process for Latin.
>>> from cltk.core.data_types import Process, Pipeline >>> from cltk.tokenizers.processes import LatinTokenizationProcess >>> from cltk.text.processes import DefaultPunctuationRemovalProcess >>> from cltk.languages.utils import get_lang >>> from cltk.languages.example_texts import get_example_text >>> from cltk import NLP >>> a_pipeline = Pipeline(description="A custom Latin pipeline", processes=[LatinTokenizationProcess, DefaultPunctuationRemovalProcess, LatinPhonologicalTranscriberProcess], language=get_lang("lat")) >>> nlp = NLP(language="lat", custom_pipeline=a_pipeline, suppress_banner=True) >>> text = get_example_text("lat") >>> cltk_doc = nlp.analyze(text) >>> [word.phonetic_transcription for word in cltk_doc.words][:5] ['[gaɫlɪ̣ja]', '[ɛst̪]', '[ɔmn̪ɪs]', '[d̪ɪwɪsa]', '[ɪn̪]']
- description = 'The default Latin transcription process'¶
- algorithm¶
- class cltk.phonology.transcription_processes.MiddleHighGermanPhonologicalTranscriberProcess(language=None)[source]¶
Bases:
PhonologicalTranscriptionProcess
Phonological transcription Process for Middle High German.
>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import MiddleHighGermanTokenizationProcess
>>> from cltk.text.processes import DefaultPunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Middle High German pipeline", processes=[MiddleHighGermanTokenizationProcess, DefaultPunctuationRemovalProcess, MiddleHighGermanPhonologicalTranscriberProcess], language=get_lang("gmh"))
>>> nlp = NLP(language='gmh', custom_pipeline=pipe, suppress_banner=True)
>>> text = get_example_text("gmh")
>>> cltk_doc = nlp(text)
>>> [word.phonetic_transcription for word in cltk_doc.words[:5]]
['ʊns', 'ɪst', 'ɪn', 'alten', 'mɛren']
- description = 'The default Middle High German transcription process'¶
- algorithm¶
- class cltk.phonology.transcription_processes.OldEnglishPhonologicalTranscriberProcess(language=None)[source]¶
Bases:
PhonologicalTranscriptionProcess
Phonological transcription Process for Old English.
>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import MiddleEnglishTokenizationProcess
>>> from cltk.text.processes import DefaultPunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Old English pipeline", processes=[MiddleEnglishTokenizationProcess, DefaultPunctuationRemovalProcess, OldEnglishPhonologicalTranscriberProcess], language=get_lang("ang"))
>>> nlp = NLP(language='ang', custom_pipeline=pipe, suppress_banner=True)
>>> text = get_example_text("ang")
>>> cltk_doc = nlp(text)
>>> [word.phonetic_transcription for word in cltk_doc.words[:5]]
['ʍæt', 'we', 'gɑrˠdenɑ', 'in', 'gæːɑrˠdɑgum']
- description = 'The default Old English transcription process'¶
- algorithm¶
- class cltk.phonology.transcription_processes.OldNorsePhonologicalTranscriberProcess(language=None)[source]¶
Bases:
PhonologicalTranscriptionProcess
Phonological transcription Process for Old Norse.
>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import OldNorseTokenizationProcess
>>> from cltk.text.processes import DefaultPunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Old Norse pipeline", processes=[OldNorseTokenizationProcess, DefaultPunctuationRemovalProcess, OldNorsePhonologicalTranscriberProcess], language=get_lang("non"))
>>> nlp = NLP(language='non', custom_pipeline=pipe, suppress_banner=True)
>>> text = get_example_text("non")
>>> cltk_doc = nlp(text)
>>> [word.phonetic_transcription for word in cltk_doc.words[:5]]
['gylvi', 'kɔnunɣr', 'reːð', 'θar', 'lœndum']
- description = 'The default Old Norse poetry process'¶
- algorithm¶
- class cltk.phonology.transcription_processes.OldSwedishPhonologicalTranscriberProcess(language=None)[source]¶
Bases:
PhonologicalTranscriptionProcess
Phonological transcription Process for Old Swedish.
>>> from cltk.core.data_types import Process, Pipeline
>>> from cltk.tokenizers.processes import OldNorseTokenizationProcess
>>> from cltk.text.processes import DefaultPunctuationRemovalProcess
>>> from cltk.languages.utils import get_lang
>>> from cltk.languages.example_texts import get_example_text
>>> from cltk.nlp import NLP
>>> pipe = Pipeline(description="A custom Old Swedish pipeline", processes=[OldNorseTokenizationProcess, DefaultPunctuationRemovalProcess, OldSwedishPhonologicalTranscriberProcess], language=get_lang("non"))
>>> nlp = NLP(language='non', custom_pipeline=pipe, suppress_banner=True)
>>> text = "Far man kunu oc dör han för en hun far barn. oc sigher hun oc hænnæ frændær."
>>> cltk_doc = nlp(text)
>>> [word.phonetic_transcription for word in cltk_doc.words[:5]]
['far', 'man', 'kunu', 'ok', 'dør']
- description = 'The default Old Swedish transcription process'¶
- algorithm¶
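The examples for the transcription processes above all follow the same pattern: the transcriber is placed last in a Pipeline, after a tokenization process and punctuation removal, and each word of the resulting document then carries a phonetic_transcription value. The following is a minimal sketch of that data flow using only the standard library. It does not import CLTK and should not be read as CLTK's implementation: ToyWord, ToyDoc, toy_tokenize, toy_remove_punctuation, make_toy_transcriber, run_pipeline and the two-character rewrite rule are hypothetical names and rules introduced purely for illustration.

```python
import unicodedata
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ToyWord:
    """Hypothetical stand-in for a word object in a document."""
    string: str
    phonetic_transcription: Optional[str] = None


@dataclass
class ToyDoc:
    """Hypothetical stand-in for the document passed between processes."""
    raw: str
    words: List[ToyWord] = field(default_factory=list)


def toy_tokenize(doc: ToyDoc) -> ToyDoc:
    # Whitespace split only; real tokenizers are language-specific.
    doc.words = [ToyWord(token) for token in doc.raw.split()]
    return doc


def toy_remove_punctuation(doc: ToyDoc) -> ToyDoc:
    # Strip punctuation from token edges and drop tokens that end up empty.
    stripped = (word.string.strip(".,;:!?") for word in doc.words)
    doc.words = [ToyWord(s) for s in stripped if s]
    return doc


def make_toy_transcriber(algorithm: Callable[[str], str]) -> Callable[[ToyDoc], ToyDoc]:
    """Wrap a string-to-string function as a process over a whole document."""
    def run(doc: ToyDoc) -> ToyDoc:
        for word in doc.words:
            word.phonetic_transcription = algorithm(word.string)
        return doc
    return run


def run_pipeline(text: str, processes: List[Callable[[ToyDoc], ToyDoc]]) -> ToyDoc:
    # Apply each process in order, feeding the output of one into the next.
    doc = ToyDoc(raw=text)
    for process in processes:
        doc = process(doc)
    return doc


# Toy rule table (assumption, not a real phonology): c -> k, ö -> ø.
TOY_RULES = str.maketrans({"c": "k", "\u00f6": "\u00f8"})


def toy_algorithm(token: str) -> str:
    # Normalize to NFC so "ö" is a single code point before the table lookup.
    return unicodedata.normalize("NFC", token.lower()).translate(TOY_RULES)


doc = run_pipeline(
    "Far man kunu oc dör.",
    [toy_tokenize, toy_remove_punctuation, make_toy_transcriber(toy_algorithm)],
)
result = [word.phonetic_transcription for word in doc.words]
print(result)  # ['far', 'man', 'kunu', 'ok', 'dør']
```

The order of the processes matters in this sketch in the same way it does in the examples above: the transcriber only sees whatever tokens the earlier steps leave in the document, so it has to come after tokenization and punctuation removal.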