8.1.12.1.2.1.1.1.1. cltk.phonology.arb.utils.pyarabic package
8.1.12.1.2.1.1.1.1.1. Submodules
8.1.12.1.2.1.1.1.1.2. cltk.phonology.arb.utils.pyarabic.araby module
Arabic module
8.1.12.1.2.1.1.1.1.2.1. Features:
Arabic letter classification
Text tokenization
Strip Harakat (all, all except Shadda, tatweel, last haraka)
Separate and join letters and Harakat
Reduce tashkeel
Measure tashkeel similarity (Harakat, fully or partially vocalized, similarity with a template)
Letter normalization (ligatures and Hamza)
Includes code written by ‘Arabtechies’, ‘Arabeyes’, ‘Taha Zerrouki’.
Todo
Remove, rewrite, and/or refactor this due to GPL.
- cltk.phonology.arb.utils.pyarabic.araby.is_sukun(archar)
Checks for the Arabic Sukun mark.
@param archar: Arabic Unicode character. @type archar: unicode. @return: True if archar is a Sukun, otherwise False. @rtype: Boolean.
- cltk.phonology.arb.utils.pyarabic.araby.is_shadda(archar)
Checks for the Arabic Shadda mark.
@param archar: Arabic Unicode character. @type archar: unicode. @return: True if archar is a Shadda, otherwise False. @rtype: Boolean.
- cltk.phonology.arb.utils.pyarabic.araby.is_tatweel(archar)
Checks for the Arabic Tatweel letter modifier.
@param archar: Arabic Unicode character. @type archar: unicode. @return: True if archar is a Tatweel, otherwise False. @rtype: Boolean.
- cltk.phonology.arb.utils.pyarabic.araby.is_tanwin(archar)
Checks for Arabic Tanwin marks (FATHATAN, DAMMATAN, KASRATAN).
@param archar: Arabic Unicode character. @type archar: unicode. @return: True if archar is a Tanwin mark, otherwise False. @rtype: Boolean.
- cltk.phonology.arb.utils.pyarabic.araby.is_tashkeel(archar)
Checks for Arabic Tashkeel marks:
FATHA, DAMMA, KASRA, SUKUN,
SHADDA,
FATHATAN, DAMMATAN, KASRATAN.
@param archar: Arabic Unicode character. @type archar: unicode. @return: True if archar is a Tashkeel mark, otherwise False. @rtype: Boolean.
- cltk.phonology.arb.utils.pyarabic.araby.is_haraka(archar)
Checks for Arabic Harakat marks (FATHA, DAMMA, KASRA, SUKUN, TANWIN).
@param archar: Arabic Unicode character. @type archar: unicode. @return: True if archar is a Haraka, otherwise False. @rtype: Boolean.
- cltk.phonology.arb.utils.pyarabic.araby.is_shortharaka(archar)
Checks for Arabic short Harakat marks (FATHA, DAMMA, KASRA, SUKUN).
@param archar: Arabic Unicode character. @type archar: unicode. @return: True if archar is a short Haraka, otherwise False. @rtype: Boolean.
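The diacritic predicates above amount to membership tests over a handful of code points. The following is a minimal sketch, assuming the standard Unicode code points for these marks; the constant and set names are illustrative and are not taken from the module's source.

```python
# Standard Unicode code points for the Arabic diacritics named above.
FATHATAN, DAMMATAN, KASRATAN = "\u064b", "\u064c", "\u064d"
FATHA, DAMMA, KASRA = "\u064e", "\u064f", "\u0650"
SHADDA, SUKUN = "\u0651", "\u0652"

# Groupings that follow the docstrings above (the set names are assumptions).
TANWIN = frozenset({FATHATAN, DAMMATAN, KASRATAN})
SHORT_HARAKAT = frozenset({FATHA, DAMMA, KASRA, SUKUN})
HARAKAT = SHORT_HARAKAT | TANWIN
TASHKEEL = HARAKAT | {SHADDA}


def is_tanwin(archar):
    """True for FATHATAN, DAMMATAN, or KASRATAN."""
    return archar in TANWIN


def is_shortharaka(archar):
    """True for FATHA, DAMMA, KASRA, or SUKUN."""
    return archar in SHORT_HARAKAT


def is_haraka(archar):
    """True for the short Harakat and the Tanwin marks (Shadda excluded)."""
    return archar in HARAKAT


def is_tashkeel(archar):
    """True for any Haraka or for Shadda."""
    return archar in TASHKEEL
```

In this grouping, Shadda is a Tashkeel mark but not a Haraka, which matches the lists given in the docstrings.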
- cltk.phonology.arb.utils.pyarabic.araby.is_ligature(archar)
Checks for Arabic ligatures such as LamAlef (LAM_ALEF, LAM_ALEF_HAMZA_ABOVE, LAM_ALEF_HAMZA_BELOW, LAM_ALEF_MADDA_ABOVE).
@param archar: Arabic Unicode character. @type archar: unicode. @return: True if archar is a ligature, otherwise False. @rtype: Boolean.
- cltk.phonology.arb.utils.pyarabic.araby.is_hamza(archar)
Checks for Arabic Hamza forms. HAMZAT are (HAMZA, WAW_HAMZA, YEH_HAMZA, HAMZA_ABOVE, HAMZA_BELOW, ALEF_HAMZA_BELOW, ALEF_HAMZA_ABOVE).
@param archar: Arabic Unicode character. @type archar: unicode. @return: True if archar is a Hamza form, otherwise False. @rtype: Boolean.
- cltk.phonology.arb.utils.pyarabic.araby.is_alef(archar)
Checks for Arabic Alef forms. ALEFAT = (ALEF, ALEF_MADDA, ALEF_HAMZA_ABOVE, ALEF_HAMZA_BELOW, ALEF_WASLA, ALEF_MAKSURA).
@param archar: Arabic Unicode character. @type archar: unicode. @return: True if archar is an Alef form, otherwise False. @rtype: Boolean.
- cltk.phonology.arb.utils.pyarabic.araby.is_yehlike(archar)
Checks for Arabic Yeh forms. Yeh forms: YEH, YEH_HAMZA, SMALL_YEH, ALEF_MAKSURA.
@param archar: Arabic Unicode character. @type archar: unicode. @return: True if archar is a Yeh form, otherwise False. @rtype: Boolean.
- cltk.phonology.arb.utils.pyarabic.araby.is_wawlike(archar)
Checks for Arabic Waw-like forms. Waw forms: WAW, WAW_HAMZA, SMALL_WAW.
@param archar: Arabic Unicode character. @type archar: unicode. @return: True if archar is a Waw form, otherwise False. @rtype: Boolean.
- cltk.phonology.arb.utils.pyarabic.araby.is_teh(archar)
Checks for Arabic Teh forms. Teh forms: TEH, TEH_MARBUTA.
@param archar: Arabic Unicode character. @type archar: unicode. @return: True if archar is a Teh form, otherwise False. @rtype: Boolean.
- cltk.phonology.arb.utils.pyarabic.araby.is_small(archar)
Checks for Arabic small letters. Small letters: SMALL ALEF, SMALL WAW, SMALL YEH.
@param archar: Arabic Unicode character. @type archar: unicode. @return: True if archar is a small letter, otherwise False. @rtype: Boolean.
- cltk.phonology.arb.utils.pyarabic.araby.is_weak(archar)
Checks for Arabic weak letters. Weak letters: ALEF, WAW, YEH, ALEF_MAKSURA.
@param archar: Arabic Unicode character. @type archar: unicode. @return: True if archar is a weak letter, otherwise False. @rtype: Boolean.
- cltk.phonology.arb.utils.pyarabic.araby.is_moon(archar)
Checks for Arabic Moon letters.
@param archar: Arabic Unicode character. @type archar: unicode. @return: True if archar is a Moon letter, otherwise False. @rtype: Boolean.
- cltk.phonology.arb.utils.pyarabic.araby.is_sun(archar)
Checks for Arabic Sun letters.
@param archar: Arabic Unicode character. @type archar: unicode. @return: True if archar is a Sun letter, otherwise False. @rtype: Boolean.
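The docstrings do not list the letters themselves. As an illustration only, the sketch below uses the 14 sun letters of traditional Arabic grammar; the name SUN_LETTERS is an assumption, and how the module classifies Hamza and Alef variants as moon letters is not covered here.

```python
# The 14 sun letters of traditional grammar, by Unicode name:
# TEH, THEH, DAL, THAL, REH, ZAIN, SEEN, SHEEN,
# SAD, DAD, TAH, ZAH, LAM, NOON.
SUN_LETTERS = frozenset(
    "\u062a\u062b\u062f\u0630\u0631\u0632\u0633\u0634"
    "\u0635\u0636\u0637\u0638\u0644\u0646"
)


def is_sun(archar):
    """True if archar is one of the 14 sun letters listed above."""
    return archar in SUN_LETTERS
```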
- cltk.phonology.arb.utils.pyarabic.araby.order(archar)
Returns the Arabic letter's order, between 1 and 29. Alef is 1, Yeh is 28, and Hamza is 29. Teh Marbuta has the same order as Teh, 3.
@param archar: Arabic Unicode character. @type archar: unicode. @return: Arabic order. @rtype: integer.
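The ordering described above can be sketched as a lookup table. This is a minimal sketch under stated assumptions: the names ALPHABET and ORDER are illustrative, and returning 0 for a character outside the table is a choice made here, not something the docstring specifies.

```python
# The 28 letters in alphabetical order, ALEF first and YEH last.
ALPHABET = (
    "\u0627\u0628\u062a\u062b\u062c\u062d\u062e"  # alef beh teh theh jeem hah khah
    "\u062f\u0630\u0631\u0632\u0633\u0634\u0635"  # dal thal reh zain seen sheen sad
    "\u0636\u0637\u0638\u0639\u063a\u0641\u0642"  # dad tah zah ain ghain feh qaf
    "\u0643\u0644\u0645\u0646\u0647\u0648\u064a"  # kaf lam meem noon heh waw yeh
)
ORDER = {letter: i for i, letter in enumerate(ALPHABET, start=1)}
ORDER["\u0621"] = 29               # HAMZA comes after YEH
ORDER["\u0629"] = ORDER["\u062a"]  # TEH MARBUTA shares TEH's order


def order(archar):
    """Return the letter's order, or 0 if it is not in the table (assumption)."""
    return ORDER.get(archar, 0)
```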
- cltk.phonology.arb.utils.pyarabic.araby.name(archar)
Returns the Arabic letter's name in Arabic.
@param archar: Arabic Unicode character. @type archar: unicode. @return: Arabic name. @rtype: unicode.
- cltk.phonology.arb.utils.pyarabic.araby.arabicrange()
Returns a list of Arabic characters between ، (U+060C) and ْ (U+0652).
@return: list of Arabic characters. @rtype: list of unicode.
- cltk.phonology.arb.utils.pyarabic.araby.has_shadda(word)
Checks whether the Arabic word contains a Shadda.
@param word: Arabic word. @type word: unicode. @return: True if a Shadda exists. @rtype: Boolean.
- cltk.phonology.arb.utils.pyarabic.araby.is_vocalized(word)
Checks whether the Arabic word is vocalized. The word must not contain any spaces or punctuation.
@param word: Arabic word. @type word: unicode. @return: True if the word is vocalized. @rtype: Boolean.
- cltk.phonology.arb.utils.pyarabic.araby.is_vocalizedtext(text)
Checks whether the Arabic text is vocalized. The text can contain many words and spaces.
@param text: Arabic text. @type text: unicode. @return: True if the text is vocalized. @rtype: Boolean.
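These three checks can be approximated by scanning for diacritic code points. The sketch below is a simplification under stated assumptions: it treats a string as vocalized if it contains at least one Tashkeel mark, and it does not reproduce whatever distinction the module makes between single words and longer text.

```python
# FATHATAN through SUKUN (U+064B to U+0652), including SHADDA (U+0651).
TASHKEEL = frozenset("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")
SHADDA = "\u0651"


def has_shadda(word):
    """True if the word contains at least one Shadda."""
    return SHADDA in word


def is_vocalized(text):
    """True if any character of the input is a Tashkeel mark (simplified)."""
    return any(ch in TASHKEEL for ch in text)
```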
- cltk.phonology.arb.utils.pyarabic.araby.is_arabicstring(text)
Checks that the text contains only characters from the standard Arabic Unicode block. An Arabic string can contain spaces, digits, and punctuation, but only standard Arabic characters, not extended Arabic.
@param text: input text. @type text: unicode. @return: True if all characters are in the Arabic block. @rtype: Boolean.
- cltk.phonology.arb.utils.pyarabic.araby.is_arabicrange(text)
Checks that the text contains only Arabic Unicode block characters.
@param text: input text. @type text: unicode. @return: True if all characters are in the Arabic block. @rtype: Boolean.
- cltk.phonology.arb.utils.pyarabic.araby.is_arabicword(word)
Checks for a valid Arabic word. An Arabic word does not contain spaces, digits, or punctuation. To avoid some spelling errors, TEH_MARBUTA must be at the end.
@param word: input word. @type word: unicode. @return: True if the word is a valid Arabic word. @rtype: Boolean.
- cltk.phonology.arb.utils.pyarabic.araby.first_char(word)
Returns the first char.
@param word: given word. @type word: unicode. @return: the first char. @rtype: unicode char.
- cltk.phonology.arb.utils.pyarabic.araby.second_char(word)
Returns the second char.
@param word: given word. @type word: unicode. @return: the second char. @rtype: unicode char.
- cltk.phonology.arb.utils.pyarabic.araby.last_char(word)
Returns the last letter. Example: zerrouki; ‘i’ is the last.
@param word: given word. @type word: unicode. @return: the last letter. @rtype: unicode char.
- cltk.phonology.arb.utils.pyarabic.araby.secondlast_char(word)
Returns the second-to-last letter. Example: zerrouki; ‘k’ is the second-to-last.
@param word: given word. @type word: unicode. @return: the second-to-last letter. @rtype: unicode char.
- cltk.phonology.arb.utils.pyarabic.araby.strip_harakat(text)
Strips Harakat from an Arabic word, except Shadda. The stripped marks are:
FATHA, DAMMA, KASRA
SUKUN
FATHATAN, DAMMATAN, KASRATAN
@param text: Arabic text. @type text: unicode. @return: the stripped text. @rtype: unicode.
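A minimal sketch of this behavior, assuming the standard Unicode code points for the marks listed above. The pattern name HARAKAT_RE is illustrative and is not the module's own.

```python
import re

# FATHATAN..KASRA (U+064B to U+0650) plus SUKUN (U+0652).
# SHADDA (U+0651) is deliberately left out of the class so it survives.
HARAKAT_RE = re.compile("[\u064b-\u0650\u0652]")


def strip_harakat(text):
    """Remove the Harakat listed above and keep Shadda in place."""
    return HARAKAT_RE.sub("", text)
```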
- cltk.phonology.arb.utils.pyarabic.araby.strip_lastharaka(text)
Strips the last Haraka from an Arabic word, except Shadda. The stripped marks are:
FATHA, DAMMA, KASRA
SUKUN
FATHATAN, DAMMATAN, KASRATAN
@param text: Arabic text. @type text: unicode. @return: the stripped text. @rtype: unicode.
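For a single word, removing the last Haraka can be sketched as a check on the final character. This is an assumption-laden simplification: it handles one word at a time and says nothing about how the module treats multi-word text.

```python
# The Harakat listed above; SHADDA (U+0651) is not included.
HARAKAT = frozenset("\u064b\u064c\u064d\u064e\u064f\u0650\u0652")


def strip_lastharaka(word):
    """Drop a trailing Haraka, if any; a trailing Shadda is left alone."""
    if word and word[-1] in HARAKAT:
        return word[:-1]
    return word
```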
- cltk.phonology.arb.utils.pyarabic.araby.strip_tashkeel(text)
Strips vowels from a text, including Shadda. The stripped marks are:
FATHA, DAMMA, KASRA
SUKUN
SHADDA
FATHATAN, DAMMATAN, KASRATAN
@param text: Arabic text. @type text: unicode. @return: the stripped text. @rtype: unicode.
- cltk.phonology.arb.utils.pyarabic.araby.strip_tatweel(text)
Strips tatweel from a text and returns the resulting text.
@param text: Arabic text. @type text: unicode. @return: the stripped text. @rtype: unicode.
- cltk.phonology.arb.utils.pyarabic.araby.strip_shadda(text)
Strips Shadda from a text and returns the resulting text.
@param text: Arabic text. @type text: unicode. @return: the stripped text. @rtype: unicode.
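Stripping Tashkeel, tatweel, or Shadda each comes down to deleting a fixed set of characters. One way to sketch that is with `str.translate`; the helper `_deleter` and the table names are hypothetical and are not part of the module.

```python
def _deleter(chars):
    """Build a str.translate table that deletes every character in chars."""
    return {ord(ch): None for ch in chars}


# FATHATAN..SUKUN (U+064B to U+0652), including SHADDA (U+0651).
_TASHKEEL = _deleter("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")
_TATWEEL = _deleter("\u0640")  # ARABIC TATWEEL
_SHADDA = _deleter("\u0651")   # ARABIC SHADDA


def strip_tashkeel(text):
    """Remove all Harakat and Shadda."""
    return text.translate(_TASHKEEL)


def strip_tatweel(text):
    """Remove every tatweel (elongation) character."""
    return text.translate(_TATWEEL)


def strip_shadda(text):
    """Remove Shadda only, leaving other marks untouched."""
    return text.translate(_SHADDA)
```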
- cltk.phonology.arb.utils.pyarabic.araby.normalize_ligature(text)
Normalizes Lam Alef ligatures into two letters (LAM and ALEF) and returns the resulting text. Some systems present the LamAlef ligature as a single letter; this function converts it into two letters. The ligatures converted into LAM and ALEF are: LAM_ALEF, LAM_ALEF_HAMZA_ABOVE, LAM_ALEF_HAMZA_BELOW, LAM_ALEF_MADDA_ABOVE.
@param text: Arabic text. @type text: unicode. @return: the converted text. @rtype: unicode.
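The conversion described above can be sketched as a plain replacement. This sketch assumes the isolated presentation forms of the four ligatures (U+FEFB, U+FEF7, U+FEF9, U+FEF5) and follows the docstring in mapping every one of them to LAM followed by a bare ALEF; final forms and other ligatures are not handled.

```python
LAM, ALEF = "\u0644", "\u0627"

# Isolated presentation forms: LAM_ALEF, LAM_ALEF_HAMZA_ABOVE,
# LAM_ALEF_HAMZA_BELOW, LAM_ALEF_MADDA_ABOVE.
LIGATURES = "\ufefb\ufef7\ufef9\ufef5"


def normalize_ligature(text):
    """Replace each single-character LamAlef ligature with LAM + ALEF."""
    for ligature in LIGATURES:
        text = text.replace(ligature, LAM + ALEF)
    return text
```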
- cltk.phonology.arb.utils.pyarabic.araby.normalize_hamza(word)
Standardizes the Hamzat into one form of Hamza and replaces Madda with Hamza and Alef. Replaces the LamAlefs with simplified letters.
@param word: Arabic text. @type word: unicode. @return: the converted text. @rtype: unicode.
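Read literally, the description above suggests a character mapping. The sketch below follows that literal reading and is not the module's implementation: the table name `_TO_HAMZA` is hypothetical, the real rules may depend on position or neighboring characters, and the LamAlef simplification is left out.

```python
HAMZA, ALEF, ALEF_MADDA = "\u0621", "\u0627", "\u0622"

# WAW_HAMZA, YEH_HAMZA, HAMZA_ABOVE, HAMZA_BELOW,
# ALEF_HAMZA_BELOW, ALEF_HAMZA_ABOVE.
HAMZA_FORMS = "\u0624\u0626\u0654\u0655\u0625\u0623"

# str.translate accepts multi-character replacement strings as values.
_TO_HAMZA = {ord(ch): HAMZA for ch in HAMZA_FORMS}
_TO_HAMZA[ord(ALEF_MADDA)] = HAMZA + ALEF


def normalize_hamza(word):
    """Map every Hamza form to HAMZA, and ALEF_MADDA to HAMZA + ALEF."""
    return word.translate(_TO_HAMZA)
```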
- cltk.phonology.arb.utils.pyarabic.araby.separate(word, extract_shadda=False)
Separates the letters from the vowels in an Arabic word. If a letter has no haraka, the undefined haraka is assigned to it. Returns (letters, vowels).
@param word: the input word. @type word: unicode. @param extract_shadda: extract shadda as separate text. @type extract_shadda: Boolean. @return: (letters, vowels). @rtype: pair of unicode.
- cltk.phonology.arb.utils.pyarabic.araby.joint(letters, marks)
Joins the letters with the marks. The lengths of letters and marks must be equal. Returns the word.
@param letters: the word letters. @type letters: unicode. @param marks: the word marks. @type marks: unicode. @return: word. @rtype: unicode.
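The separate/join round trip can be sketched as follows. Several details are assumptions made for illustration: tatweel is used as the placeholder for a letter with no haraka, Shadda stays with the letters, the `extract_shadda` option is omitted, and a length mismatch raises `ValueError`.

```python
HARAKAT = frozenset("\u064b\u064c\u064d\u064e\u064f\u0650\u0652")
NOT_DEF_HARAKA = "\u0640"  # placeholder for "no haraka"; tatweel is an assumption


def separate(word):
    """Split a word into (letters, marks), one mark slot per letter."""
    letters, marks = [], []
    for ch in word:
        if ch in HARAKAT:
            if marks:           # a haraka before any letter is ignored
                marks[-1] = ch  # attach it to the most recent letter
        else:
            letters.append(ch)  # Shadda is kept with the letters here
            marks.append(NOT_DEF_HARAKA)
    return "".join(letters), "".join(marks)


def joint(letters, marks):
    """Interleave letters and marks, skipping placeholder marks."""
    if len(letters) != len(marks):
        raise ValueError("letters and marks must have the same length")
    return "".join(
        letter + (mark if mark != NOT_DEF_HARAKA else "")
        for letter, mark in zip(letters, marks)
    )
```

Keeping one mark slot per letter is what makes the two strings equal in length, so `joint(*separate(word))` reproduces a word whose marks each follow a letter.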
- cltk.phonology.arb.utils.pyarabic.araby.vocalizedlike(word1, word2)
Returns True if the two words have the same letters and the same Harakat. The two words can be fully or partially vocalized.
@param word1: first word. @type word1: unicode. @param word2: second word. @type word2: unicode. @return: True if the two words have similar vocalization. @rtype: Boolean.
- cltk.phonology.arb.utils.pyarabic.araby.waznlike(word1, wazn)
Checks whether word1 matches a wazn (pattern). The letters must be equal, except that the FEH, AIN, and LAM letters of the wazn act as generic letters. The two words can be fully or partially vocalized.
@param word1: input word. @type word1: unicode. @param wazn: given word template (وزن). @type wazn: unicode. @return: True if the two words have similar vocalization. @rtype: Boolean.
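The idea of FEH, AIN, and LAM acting as generic letters can be illustrated on the letters alone. The function name `waznlike_letters` is hypothetical, and this sketch ignores vocalization entirely, so it covers only part of what the description above implies.

```python
# Deletion table for FATHATAN..SUKUN (U+064B to U+0652).
_TASHKEEL = {ord(ch): None for ch in "\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652"}

# FEH, AIN, LAM: the generic (wildcard) letters of a wazn.
GENERIC = frozenset("\u0641\u0639\u0644")


def waznlike_letters(word, wazn):
    """Compare unvocalized letters; a generic wazn letter matches anything."""
    w = word.translate(_TASHKEEL)
    p = wazn.translate(_TASHKEEL)
    return len(w) == len(p) and all(
        pattern_ch in GENERIC or pattern_ch == word_ch
        for word_ch, pattern_ch in zip(w, p)
    )
```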
- cltk.phonology.arb.utils.pyarabic.araby.shaddalike(partial, fully)
Returns True if the two words have the same letters and the same Harakat. The first word is partially vocalized and the second is fully vocalized. If the partially vocalized word contains a Shadda, it must be at the same place in the fully vocalized word.
@param partial: the partially vocalized word. @type partial: unicode. @param fully: the fully vocalized word. @type fully: unicode. @return: True if every Shadda in partial is at the same place in fully. @rtype: Boolean.
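One way to read the Shadda rule above is as a two-pointer comparison after removing the other marks. The sketch below is an interpretation, not the module's code: it assumes that the fully vocalized word may carry extra Shadda marks that the partial word omits, and the helper `_strip_harakat` is hypothetical.

```python
import re

SHADDA = "\u0651"
# FATHATAN..KASRA and SUKUN; SHADDA (U+0651) is kept.
_HARAKAT_RE = re.compile("[\u064b-\u0650\u0652]")


def _strip_harakat(text):
    """Remove Harakat but keep Shadda."""
    return _HARAKAT_RE.sub("", text)


def shaddalike(partial, fully):
    """True if each Shadda in partial sits at the same place in fully."""
    if SHADDA not in partial:
        return True   # nothing to check
    if SHADDA not in fully:
        return False  # partial has a Shadda that fully lacks
    p, f = _strip_harakat(partial), _strip_harakat(fully)
    i = j = 0
    while i < len(p) and j < len(f):
        if p[i] == f[j]:
            i += 1
            j += 1
        elif f[j] == SHADDA:
            j += 1    # extra Shadda in fully is tolerated (assumption)
        else:
            return False  # letter mismatch, or Shadda missing in fully
    # Anything left over in fully may only be Shadda marks.
    return i == len(p) and all(ch == SHADDA for ch in f[j:])
```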
- cltk.phonology.arb.utils.pyarabic.araby.reduce_tashkeel(text)
Reduces the Tashkeel by deleting evident cases.
@param text: the fully vocalized input text. @type text: unicode. @return: partially vocalized text. @rtype: unicode.
- cltk.phonology.arb.utils.pyarabic.araby.vocalized_similarity(word1, word2)
Compares the letters and Harakat of two words. The two words can be fully or partially vocalized.
@param word1: first word. @type word1: unicode. @param word2: second word. @type word2: unicode. @return: True if the words are similar, otherwise the negative number of errors. @rtype: Boolean / int.
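The mixed Boolean / int return value can be sketched as an error count over aligned letters. The details are assumptions: an error is counted only where both words carry a haraka on the same letter and the two differ, words with different letters return False, Shadda is treated as a letter, and the helper `_pairs` is hypothetical.

```python
HARAKAT = frozenset("\u064b\u064c\u064d\u064e\u064f\u0650\u0652")


def _pairs(word):
    """Split a word into [letter, haraka-or-None] pairs."""
    out = []
    for ch in word:
        if ch in HARAKAT:
            if out:
                out[-1][1] = ch
        else:
            out.append([ch, None])
    return out


def vocalized_similarity(word1, word2):
    """True if compatible, else minus the number of conflicting Harakat."""
    p1, p2 = _pairs(word1), _pairs(word2)
    if [letter for letter, _ in p1] != [letter for letter, _ in p2]:
        return False  # different letters (assumed result)
    errors = sum(
        1
        for (_, m1), (_, m2) in zip(p1, p2)
        if m1 is not None and m2 is not None and m1 != m2
    )
    return True if errors == 0 else -errors


def vocalizedlike(word1, word2):
    """Boolean view of vocalized_similarity."""
    return vocalized_similarity(word1, word2) is True
```

Under these assumptions a partially vocalized word is compatible with any full vocalization that agrees wherever both are marked.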
8.1.12.1.2.1.1.1.1.3. cltk.phonology.arb.utils.pyarabic.stack module
Stack module
Includes code written by ‘Arabtechies’, ‘Arabeyes’, ‘Taha Zerrouki’.