8.1.12.1.2.1.1.1.1. cltk.phonology.arb.utils.pyarabic package
8.1.12.1.2.1.1.1.1.1. Submodules
8.1.12.1.2.1.1.1.1.2. cltk.phonology.arb.utils.pyarabic.araby module
Arabic module
8.1.12.1.2.1.1.1.1.2.1. Features:
Arabic letters classification
Text tokenization
Strip Harakat (all, except Shadda, tatweel, last_haraka)
Separate and join Letters and Harakat
Reduce tashkeel
Measure tashkeel similarity (Harakat, fully or partially vocalized, similarity with a template)
Letters normalization (Ligatures and Hamza)
Includes code written by ‘Arabtechies’, ‘Arabeyes’, ‘Taha Zerrouki’.
Todo
Remove, rewrite, and/or refactor this due to GPL.
-
cltk.phonology.arb.utils.pyarabic.araby.
is_sukun
(archar)[source]¶ Checks for Arabic Sukun Mark. @param archar: arabic unicode char @type archar: unicode @return: @rtype:Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
is_shadda
(archar)[source]¶ Checks for Arabic Shadda Mark. @param archar: arabic unicode char @type archar: unicode @return: @rtype:Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
is_tatweel
(archar)[source]¶ Checks for Arabic Tatweel letter modifier. @param archar: arabic unicode char @type archar: unicode @return: @rtype:Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
is_tanwin
(archar)[source]¶ Checks for Arabic Tanwin Marks (FATHATAN, DAMMATAN, KASRATAN). @param archar: arabic unicode char @type archar: unicode @return: @rtype:Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
is_tashkeel
(archar)[source]¶ Checks for Arabic Tashkeel Marks:
FATHA, DAMMA, KASRA, SUKUN,
SHADDA,
FATHATAN, DAMMATAN, KASRATAN.
@param archar: arabic unicode char @type archar: unicode @return: @rtype:Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
is_haraka
(archar)[source]¶ Checks for Arabic Harakat Marks (FATHA, DAMMA, KASRA, SUKUN, TANWIN). @param archar: arabic unicode char @type archar: unicode @return: @rtype:Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
is_shortharaka
(archar)[source]¶ Checks for Arabic short Harakat Marks (FATHA, DAMMA, KASRA, SUKUN). @param archar: arabic unicode char @type archar: unicode @return: @rtype:Boolean
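The mark predicates above differ only in which set of characters they accept. A minimal sketch of how those sets nest, assuming the standard Unicode code points for the eight marks; the `*_sketch` names and the set constants are hypothetical and are not the module's own definitions:

```python
# Standard Unicode code points for the Arabic marks named above.
FATHATAN, DAMMATAN, KASRATAN = "\u064b", "\u064c", "\u064d"
FATHA, DAMMA, KASRA = "\u064e", "\u064f", "\u0650"
SHADDA, SUKUN = "\u0651", "\u0652"

# Hypothetical groupings that mirror the lists in the descriptions.
TANWIN = frozenset({FATHATAN, DAMMATAN, KASRATAN})
SHORT_HARAKAT = frozenset({FATHA, DAMMA, KASRA, SUKUN})
HARAKAT = SHORT_HARAKAT | TANWIN          # everything except SHADDA
TASHKEEL = HARAKAT | {SHADDA}             # all eight marks


def is_tanwin_sketch(archar):
    return archar in TANWIN


def is_shortharaka_sketch(archar):
    return archar in SHORT_HARAKAT


def is_haraka_sketch(archar):
    return archar in HARAKAT


def is_tashkeel_sketch(archar):
    return archar in TASHKEEL
```

Under these assumptions, SHADDA is the only mark that counts as tashkeel but not as a haraka, and the tanwin marks count as harakat but not as short harakat.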
-
cltk.phonology.arb.utils.pyarabic.araby.
is_ligature
(archar)[source]¶ Checks for Arabic Ligatures like LamAlef. (LAM_ALEF, LAM_ALEF_HAMZA_ABOVE, LAM_ALEF_HAMZA_BELOW, LAM_ALEF_MADDA_ABOVE) @param archar: arabic unicode char @type archar: unicode @return: @rtype:Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
is_hamza
(archar)[source]¶ Checks for Arabic Hamza forms. HAMZAT are (HAMZA, WAW_HAMZA, YEH_HAMZA, HAMZA_ABOVE, HAMZA_BELOW, ALEF_HAMZA_BELOW, ALEF_HAMZA_ABOVE ) @param archar: arabic unicode char @type archar: unicode @return: @rtype:Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
is_alef
(archar)[source]¶ Checks for Arabic Alef forms. ALEFAT = (ALEF, ALEF_MADDA, ALEF_HAMZA_ABOVE, ALEF_HAMZA_BELOW, ALEF_WASLA, ALEF_MAKSURA ) @param archar: arabic unicode char @type archar: unicode @return: @rtype:Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
is_yehlike
(archar)[source]¶ Checks for Arabic Yeh forms. Yeh forms : YEH, YEH_HAMZA, SMALL_YEH, ALEF_MAKSURA @param archar: arabic unicode char @type archar: unicode @return: @rtype:Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
is_wawlike
(archar)[source]¶ Checks for Arabic Waw like forms. Waw forms : WAW, WAW_HAMZA, SMALL_WAW @param archar: arabic unicode char @type archar: unicode @return: @rtype:Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
is_teh
(archar)[source]¶ Checks for Arabic Teh forms. Teh forms : TEH, TEH_MARBUTA @param archar: arabic unicode char @type archar: unicode @return: @rtype:Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
is_small
(archar)[source]¶ Checks for Arabic Small letters. SMALL Letters : SMALL ALEF, SMALL WAW, SMALL YEH @param archar: arabic unicode char @type archar: unicode @return: @rtype:Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
is_weak
(archar)[source]¶ Checks for Arabic Weak letters. Weak Letters : ALEF, WAW, YEH, ALEF_MAKSURA @param archar: arabic unicode char @type archar: unicode @return: @rtype:Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
is_moon
(archar)[source]¶ Checks for Arabic Moon letters. Moon Letters : @param archar: arabic unicode char @type archar: unicode @return: @rtype:Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
is_sun
(archar)[source]¶ Checks for Arabic Sun letters. Sun Letters : @param archar: arabic unicode char @type archar: unicode @return: @rtype:Boolean
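The descriptions of is_moon and is_sun do not list the letters. As an illustration only, the sketch below uses the conventional grammatical split of the 28 basic letters into 14 sun and 14 moon letters. The sets the module actually uses may differ (for example, by including Hamza or Alef variants), and the names here are hypothetical:

```python
# Conventional sun letters: TEH, THEH, DAL, THAL, REH, ZAIN, SEEN,
# SHEEN, SAD, DAD, TAH, ZAH, LAM, NOON.
SUN = frozenset(
    "\u062a\u062b\u062f\u0630\u0631\u0632\u0633"
    "\u0634\u0635\u0636\u0637\u0638\u0644\u0646"
)

# Conventional moon letters: ALEF, BEH, JEEM, HAH, KHAH, AIN, GHAIN,
# FEH, QAF, KAF, MEEM, HEH, WAW, YEH.
MOON = frozenset(
    "\u0627\u0628\u062c\u062d\u062e\u0639\u063a"
    "\u0641\u0642\u0643\u0645\u0647\u0648\u064a"
)


def is_sun_sketch(archar):
    return archar in SUN


def is_moon_sketch(archar):
    return archar in MOON
```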
-
cltk.phonology.arb.utils.pyarabic.araby.
order
(archar)[source]¶ Returns the Arabic letter order, between 1 and 29. Alef order is 1, Yeh is 28, Hamza is 29. Teh Marbuta has the same order as Teh, 3. @param archar: arabic unicode char @type archar: unicode @return: arabic order. @rtype: integer
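The ordering rule can be sketched as a lookup table built from the 28 letters in alphabetical order, with the two special cases added afterwards. The names are hypothetical, and returning 0 for a character outside the table is an assumption, not documented behaviour:

```python
# The 28 basic letters in alphabetical order, ALEF first and YEH last.
ALPHABET = (
    "\u0627\u0628\u062a\u062b\u062c\u062d\u062e\u062f\u0630\u0631\u0632\u0633\u0634\u0635"
    "\u0636\u0637\u0638\u0639\u063a\u0641\u0642\u0643\u0644\u0645\u0646\u0647\u0648\u064a"
)

ORDER = {letter: position for position, letter in enumerate(ALPHABET, start=1)}
ORDER["\u0621"] = 29                   # HAMZA comes last
ORDER["\u0629"] = ORDER["\u062a"]      # TEH_MARBUTA shares the order of TEH


def order_sketch(archar):
    # Assumption: unknown characters map to 0.
    return ORDER.get(archar, 0)
```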
-
cltk.phonology.arb.utils.pyarabic.araby.
name
(archar)[source]¶ Returns the Arabic letter name, in Arabic. @param archar: arabic unicode char @type archar: unicode @return: arabic name. @rtype: unicode
-
cltk.phonology.arb.utils.pyarabic.araby.
arabicrange
()[source]¶ Returns a list of Arabic characters, covering the range from ، (U+060C) to ْ (U+0652). @return: list of Arabic characters. @rtype: unicode
-
cltk.phonology.arb.utils.pyarabic.araby.
has_shadda
(word)[source]¶ Checks if the arabic word contains shadda. @param word: arabic unicode char @type word: unicode @return: if shadda exists @rtype:Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
is_vocalized
(word)[source]¶ Checks if the Arabic word is vocalized. The word must not contain any spaces or punctuation. @param word: arabic unicode char @type word: unicode @return: if the word is vocalized @rtype:Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
is_vocalizedtext
(text)[source]¶ Checks if the Arabic text is vocalized. The text can contain many words and spaces. @param text: arabic unicode char @type text: unicode @return: if the text is vocalized @rtype:Boolean
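The three word-level checks above can be approximated by scanning for mark characters. This is a minimal sketch with hypothetical names; in particular, treating a text as vocalized when any of its words carries a mark is an assumption about is_vocalizedtext:

```python
SHADDA = "\u0651"
# FATHATAN through SUKUN: the eight tashkeel marks.
TASHKEEL = frozenset("\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652")


def has_shadda_sketch(word):
    return SHADDA in word


def is_vocalized_sketch(word):
    # A word counts as vocalized if it carries at least one mark.
    return any(char in TASHKEEL for char in word)


def is_vocalizedtext_sketch(text):
    # Assumption: one vocalized word is enough for the whole text.
    return any(is_vocalized_sketch(word) for word in text.split())
```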
-
cltk.phonology.arb.utils.pyarabic.araby.
is_arabicstring
(text)[source]¶ Checks that all characters belong to the standard Arabic Unicode block. An Arabic string can contain spaces, digits and punctuation, but only standard Arabic characters, not extended Arabic. @param text: input text @type text: unicode @return: True if all characters are in Arabic block @rtype: Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
is_arabicrange
(text)[source]¶ Checks for Arabic Unicode block characters. @param text: input text @type text: unicode @return: True if all characters are in Arabic block @rtype: Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
is_arabicword
(word)[source]¶ Checks for a valid Arabic word. An Arabic word does not contain spaces, digits or punctuation. To catch some spelling errors, TEH_MARBUTA must be at the end. @param word: input word @type word: unicode @return: True if all characters are in Arabic block @rtype: Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
first_char
(word)[source]¶ Return the first char @param word: given word @type word: unicode @return: the first char @rtype: unicode char
-
cltk.phonology.arb.utils.pyarabic.araby.
second_char
(word)[source]¶ Return the second char @param word: given word @type word: unicode @return: the second char @rtype: unicode char
-
cltk.phonology.arb.utils.pyarabic.araby.
last_char
(word)[source]¶ Return the last letter example: zerrouki; ‘i’ is the last. @param word: given word @type word: unicode @return: the last letter @rtype: unicode char
-
cltk.phonology.arb.utils.pyarabic.araby.
secondlast_char
(word)[source]¶ Return the second last letter example: zerrouki; ‘k’ is the second last. @param word: given word @type word: unicode @return: the second last letter @rtype: unicode char
-
cltk.phonology.arb.utils.pyarabic.araby.
strip_harakat
(text)[source]¶ Strip Harakat from an Arabic word, except Shadda. The stripped marks are:
FATHA, DAMMA, KASRA
SUKUN
FATHATAN, DAMMATAN, KASRATAN
@param text: arabic text. @type text: unicode. @return: return a stripped text. @rtype: unicode.
-
cltk.phonology.arb.utils.pyarabic.araby.
strip_lastharaka
(text)[source]¶ Strip the last Haraka from an Arabic word, except Shadda. The stripped marks are:
FATHA, DAMMA, KASRA
SUKUN
FATHATAN, DAMMATAN, KASRATAN
@param text: arabic text. @type text: unicode. @return: return a stripped text. @rtype: unicode.
-
cltk.phonology.arb.utils.pyarabic.araby.
strip_tashkeel
(text)[source]¶ Strip vowels from a text, including Shadda. The stripped marks are:
FATHA, DAMMA, KASRA
SUKUN
SHADDA
FATHATAN, DAMMATAN, KASRATAN
@param text: arabic text. @type text: unicode. @return: return a stripped text. @rtype: unicode.
-
cltk.phonology.arb.utils.pyarabic.araby.
strip_tatweel
(text)[source]¶ Strip tatweel from a text and return the resulting text.
@param text: arabic text. @type text: unicode. @return: return a stripped text. @rtype: unicode.
-
cltk.phonology.arb.utils.pyarabic.araby.
strip_shadda
(text)[source]¶ Strip Shadda from a text and return the resulting text.
@param text: arabic text. @type text: unicode. @return: return a stripped text. @rtype: unicode.
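The five strip functions differ only in which characters they remove. The sketch below shows one way to express that difference, assuming the standard Unicode code points. The `*_sketch` names are hypothetical, and treating "the last haraka" as a mark in final position is an assumption:

```python
import re

TATWEEL = "\u0640"
SHADDA = "\u0651"
# The seven harakat: FATHATAN, DAMMATAN, KASRATAN, FATHA, DAMMA, KASRA, SUKUN.
HARAKAT = "\u064b\u064c\u064d\u064e\u064f\u0650\u0652"

HARAKAT_RE = re.compile("[" + HARAKAT + "]")
TASHKEEL_RE = re.compile("[" + HARAKAT + SHADDA + "]")


def strip_harakat_sketch(text):
    # Removes every haraka but keeps SHADDA.
    return HARAKAT_RE.sub("", text)


def strip_lastharaka_sketch(text):
    # Assumption: only a haraka in final position is removed.
    if text and text[-1] in HARAKAT:
        return text[:-1]
    return text


def strip_tashkeel_sketch(text):
    # Removes every haraka and SHADDA as well.
    return TASHKEEL_RE.sub("", text)


def strip_tatweel_sketch(text):
    return text.replace(TATWEEL, "")


def strip_shadda_sketch(text):
    return text.replace(SHADDA, "")
```

For a word such as MEEM, FATHA, DAL, SHADDA, FATHA, the first function keeps the SHADDA on the DAL while the third returns the two bare letters.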
-
cltk.phonology.arb.utils.pyarabic.araby.
normalize_ligature
(text)[source]¶ Normalize Lam Alef ligatures into two letters (LAM and ALEF), and return the resulting text. Some systems present the LamAlef ligature as a single letter; this function converts it into two letters. The ligatures converted into LAM and ALEF are: LAM_ALEF, LAM_ALEF_HAMZA_ABOVE, LAM_ALEF_HAMZA_BELOW, LAM_ALEF_MADDA_ABOVE
@param text: arabic text. @type text: unicode. @return: return a converted text. @rtype: unicode.
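As a rough illustration of the ligature normalization described above, the sketch below maps the four listed ligatures to the two-letter sequence LAM + ALEF with `str.translate`. The code points are the isolated presentation forms from Unicode. The names are hypothetical, and mapping all four to a plain ALEF follows the wording of the description rather than the module source:

```python
LAM, ALEF = "\u0644", "\u0627"

# LAM_ALEF, LAM_ALEF_HAMZA_ABOVE, LAM_ALEF_HAMZA_BELOW, LAM_ALEF_MADDA_ABOVE
LIGATURES = "\ufefb\ufef7\ufef9\ufef5"

# Translation table: each ligature code point becomes the pair LAM + ALEF.
LIGATURE_TABLE = {ord(ligature): LAM + ALEF for ligature in LIGATURES}


def normalize_ligature_sketch(text):
    return text.translate(LIGATURE_TABLE)
```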
-
cltk.phonology.arb.utils.pyarabic.araby.
normalize_hamza
(word)[source]¶ Standardize the Hamzat into one form of hamza, replace Madda by hamza and alef. Replace the LamAlefs by simplified letters.
@param word: arabic text. @type word: unicode. @return: return a converted text. @rtype: unicode.
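A simplified sketch of the hamza standardization, covering only the first two steps of the description: every hamza form becomes a plain HAMZA, and ALEF_MADDA becomes HAMZA followed by ALEF. The LamAlef step is left out, the names are hypothetical, and the real function may apply further position-dependent rules:

```python
HAMZA, ALEF, ALEF_MADDA = "\u0621", "\u0627", "\u0622"

# WAW_HAMZA, YEH_HAMZA, HAMZA_ABOVE, HAMZA_BELOW,
# ALEF_HAMZA_ABOVE, ALEF_HAMZA_BELOW
HAMZA_FORMS = "\u0624\u0626\u0654\u0655\u0623\u0625"

HAMZA_TABLE = {ord(form): HAMZA for form in HAMZA_FORMS}
HAMZA_TABLE[ord(ALEF_MADDA)] = HAMZA + ALEF   # madda -> hamza + alef


def normalize_hamza_sketch(word):
    return word.translate(HAMZA_TABLE)
```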
-
cltk.phonology.arb.utils.pyarabic.araby.
separate
(word, extract_shadda=False)[source]¶ Separate the letters from the vowels in an Arabic word. If a letter has no haraka, the "not defined" haraka is assigned to it. Returns (letters, vowels). @param word: the input word @type word: unicode @param extract_shadda: extract shadda as separate text @type extract_shadda: Boolean @return: (letters, vowels) @rtype: couple of unicode
-
cltk.phonology.arb.utils.pyarabic.araby.
joint
(letters, marks)[source]¶ Join the letters with the marks. The lengths of letters and marks must be equal. Returns the word. @param letters: the word letters @type letters: unicode @param marks: the word marks @type marks: unicode @return: word @rtype: unicode
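The relationship between separate and joint can be sketched as a round trip: one string of letters and one string of marks of the same length, with a placeholder wherever a letter has no haraka. This simplified version does not implement extract_shadda and treats SHADDA as an ordinary letter. Using TATWEEL as the placeholder and raising ValueError on a length mismatch are both assumptions, and the names are hypothetical:

```python
HARAKAT = frozenset("\u064b\u064c\u064d\u064e\u064f\u0650\u0652")
NOT_DEF_HARAKA = "\u0640"  # assumption: TATWEEL marks "no haraka"


def separate_sketch(word):
    letters, marks = [], []
    for char in word:
        if char in HARAKAT:
            if marks:                  # skip a mark with no letter before it
                marks[-1] = char       # a later mark replaces an earlier one
        else:
            letters.append(char)
            marks.append(NOT_DEF_HARAKA)
    return "".join(letters), "".join(marks)


def joint_sketch(letters, marks):
    if len(letters) != len(marks):
        raise ValueError("letters and marks must have the same length")
    return "".join(
        letter + (mark if mark != NOT_DEF_HARAKA else "")
        for letter, mark in zip(letters, marks)
    )
```

With at most one haraka per letter, joining the two outputs of the first function reproduces the input word.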
-
cltk.phonology.arb.utils.pyarabic.araby.
vocalizedlike
(word1, word2)[source]¶ If the two words have the same letters and the same harakat, this function returns True. The two words can be fully or partially vocalized.
@param word1: first word @type word1: unicode @param word2: second word @type word2: unicode @return: if two words have similar vocalization @rtype: Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
waznlike
(word1, wazn)[source]¶ Checks whether word1 matches a wazn (pattern). The letters must be equal; the wazn uses the FEH, AIN and LAM letters as generic letters. The two words can be fully or partially vocalized.
@param word1: input word @type word1: unicode @param wazn: given word template وزن @type wazn: unicode @return: if two words have similar vocalization @rtype: Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
shaddalike
(partial, fully)[source]¶ If the two words have the same letters and the same harakat, this function returns True. The first word is partially vocalized, the second is fully vocalized. If the partially vocalized word contains a shadda, it must be at the same place in the fully vocalized word.
@param partial: the partially vocalized word @type partial: unicode @param fully: the fully vocalized word @type fully: unicode @return: if contains shadda @rtype: Boolean
-
cltk.phonology.arb.utils.pyarabic.araby.
reduce_tashkeel
(text)[source]¶ Reduce the Tashkeel, by deleting evident cases.
@param text: the input text fully vocalized. @type text: unicode. @return : partially vocalized text. @rtype: unicode.
-
cltk.phonology.arb.utils.pyarabic.araby.
vocalized_similarity
(word1, word2)[source]¶ If the two words have the same letters and the same harakat, this function returns True. The two words can be fully or partially vocalized.
@param word1: first word @type word1: unicode @param word2: second word @type word2: unicode @return: True if the words are similar, otherwise the negative number of errors @rtype: Boolean / int
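One possible reading of the return contract (True, or a negative error count) is sketched below. Letters are compared position by position, and a mark counts as an error only when both words carry a mark at that position and the marks differ, which lets a partially vocalized word match a fully vocalized one. The counting rules and all names are assumptions, not the module's algorithm:

```python
HARAKAT = frozenset("\u064b\u064c\u064d\u064e\u064f\u0650\u0652")


def _split(word):
    """Return parallel lists of letters and marks (None when unmarked)."""
    letters, marks = [], []
    for char in word:
        if char in HARAKAT:
            if marks:
                marks[-1] = char
        else:
            letters.append(char)
            marks.append(None)
    return letters, marks


def vocalized_similarity_sketch(word1, word2):
    letters1, marks1 = _split(word1)
    letters2, marks2 = _split(word2)
    # Assumption: each missing or extra letter counts as one error.
    errors = abs(len(letters1) - len(letters2))
    for l1, m1, l2, m2 in zip(letters1, marks1, letters2, marks2):
        if l1 != l2:
            errors += 1
        elif m1 is not None and m2 is not None and m1 != m2:
            errors += 1
    return True if errors == 0 else -errors


def vocalizedlike_sketch(word1, word2):
    # Boolean variant: similar only when no error was counted.
    return vocalized_similarity_sketch(word1, word2) is True
```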
-
cltk.phonology.arb.utils.pyarabic.araby.
tokenize
(text='')[source]¶ Tokenize text into words.
@param text: the input text. @type text: unicode. @return: list of words. @rtype: list.
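A minimal tokenization sketch using the standard `re` module. Python's `\w` matches Arabic letters but not the combining marks, so the mark range U+064B to U+0652 and the superscript Alef U+0670 are added to the character class to keep vocalized words whole. This version drops punctuation, which may differ from what tokenize returns, and the names are hypothetical:

```python
import re

# Word characters plus the Arabic marks that \w does not cover.
TOKEN_RE = re.compile(r"[\w\u064b-\u0652\u0670]+")


def tokenize_sketch(text=""):
    # Returns the runs of word characters; an empty text gives [].
    return TOKEN_RE.findall(text)
```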
8.1.12.1.2.1.1.1.1.3. cltk.phonology.arb.utils.pyarabic.stack module
Stack module
Includes code written by ‘Arabtechies’, ‘Arabeyes’, ‘Taha Zerrouki’.