8.1.16. cltk.stops package¶
8.1.16.1. Submodules¶
8.1.16.2. cltk.stops.akk module¶
This stopword list was compiled by M. Willis Monroe
8.1.16.3. cltk.stops.ang module¶
Old English: ‘Sourav Singh <ssouravsingh12@gmail.com>’. Adapted from the Introduction to Old English website at https://lrc.la.utexas.edu/eieol/engol.
8.1.16.4. cltk.stops.arb module¶
This list is inspired by the Arabic Stop Words Project: https://github.com/linuxscout/arabicstopwords
8.1.16.5. cltk.stops.cop module¶
This list is adapted from https://github.com/computationalstylistics/tidystopwords, which in turn is based on the Universal Dependencies treebanks.
8.1.16.6. cltk.stops.enm module¶
Middle English. Sources: people.stanford.edu/widner/content/text-mining-middle-ages (slide 13), textifier.com/resources/common-english-words.txt, en.wikipedia.org/wiki/Middle_English, en.wiktionary.org/wiki/Category:Middle_English_prepositions, en.wiktionary.org/wiki/Category:Middle_English_determiners, en.wiktionary.org/wiki/Category:Middle_English_conjunctions
8.1.16.7. cltk.stops.fro module¶
This list was compiled from the 100 most frequently occurring words in the french_text corpus, with content words removed. It also includes forms of auxiliary verbs taken from Anglade (1931), retrieved from https://fr.wikisource.org/wiki/Grammaire_élémentaire_de_l’ancien_français (available under a Creative Commons Attribution-ShareAlike 3.0 license).
Code used to determine the most frequent words in the corpus:
import os

import nltk
import re
from nltk.probability import FreqDist
from cltk.tokenize.word import WordTokenizer

# Determine the most common words (and their counts) in the Old French corpus,
# ignoring punctuation and upper-case.
# (n.b.: this file has since been moved to fro_models_cltk)
file_content = open(os.path.expanduser("~/cltk/cltk/stop/french/frenchtexts.txt")).read()
word_tokenizer = WordTokenizer('french')
words = word_tokenizer.tokenize(file_content)
fdist = FreqDist(words)

# Take the most common words from the frequency distribution.
common_words = fdist.most_common(125)
cw_list = [x[0] for x in common_words]

# Write the candidate stopwords to a .txt file, one per line.
with open('french_prov_stops.txt', 'a') as f:
    for item in cw_list:
        print(item, file=f)
8.1.16.8. cltk.stops.gmh module¶
Middle High German: “Eleftheria Chatziargyriou <ele.hatzy@gmail.com>”, compiled using the TF-IDF method (see the sketch below). Sources of texts: http://www.gutenberg.org/files/22636/22636-h/22636-h.htm, http://texte.mediaevum.de/12mhd.htm
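The TF-IDF selection is not spelled out in this docstring. The sketch below shows one common way such a list can be derived with scikit-learn (an assumption, not necessarily the exact procedure used for this list): words that occur in nearly every document receive the lowest IDF weights and are therefore natural stopword candidates.

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents; in practice the Middle High German source texts above would
# be split into documents (e.g. one per chapter or poem).
docs = [
    "der künec was dô vil rîche",
    "dô was der helt vil küene",
]
vectorizer = TfidfVectorizer()
vectorizer.fit(docs)
vocab = vectorizer.get_feature_names_out()
# Words with the lowest IDF appear in the most documents: stopword candidates.
candidates = [word for _idf, word in sorted(zip(vectorizer.idf_, vocab))][:100]
print(candidates)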
8.1.16.9. cltk.stops.grc module¶
Greek: ‘Kyle P. Johnson <kyle@kyle-p-johnson.com>’, from the Perseus Hopper source [http://sourceforge.net/projects/perseus-hopper], found at “/sgml/reading/build/stoplists”. That list contained only forms with an acute accent on the ultima; a grave-accented variant of each has been added (see the sketch below). The Perseus source is made available under the Mozilla Public License 1.1 (MPL 1.1) [http://www.mozilla.org/MPL/1.1/].
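The following is an illustrative sketch of the transformation just described: for a stopword whose acute accent falls on the ultima, a grave-accented variant can be produced by swapping acute vowels for their grave counterparts. It covers only bare (precomposed) acute vowels; combinations with breathings or iota subscript are not handled here.

# Map bare acute vowels to their grave counterparts (NFC-normalized text assumed).
ACUTE_TO_GRAVE = str.maketrans("άέήίόύώ", "ὰὲὴὶὸὺὼ")

def grave_variant(stopword: str) -> str:
    # Valid when the word's only acute accent sits on the ultima.
    return stopword.translate(ACUTE_TO_GRAVE)

print(grave_variant("καί"))    # καὶ
print(grave_variant("αὐτός"))  # αὐτὸς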
8.1.16.10. cltk.stops.hin module¶
Classical Hindi stopwords. This list is composed of the 100 most frequently occurring words in the classical_hindi corpus <https://github.com/cltk/hindi_text_ltrc> in CLTK. Source code: <https://gist.github.com/inishchith/ad4bc0da200110de638f5408c64bb14c>
8.1.16.11. cltk.stops.lat module¶
Latin: from the Perseus Hopper source at /sgml/reading/build/stoplists. Source at http://sourceforge.net/projects/perseus-hopper/. Perseus data is licensed under the Mozilla Public License 1.1 (MPL 1.1, http://www.mozilla.org/MPL/1.1/).
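For quick checks, the Latin list can be consulted through the Stops class documented in cltk.stops.words below; the short token list here is only an illustration.

from cltk.stops.words import Stops

# Minimal sketch: filter a few Latin tokens through the Perseus-derived list.
# Common function words such as 'est' are expected to be removed.
latin_stops = Stops(iso_code="lat")
print(latin_stops.remove_stopwords(tokens=["Gallia", "est", "omnis", "divisa"]))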
8.1.16.12. cltk.stops.non module¶
Old Norse: “Clément Besnier <clem@clementbesnier.fr>”. Stopwords were selected from Altnordisches Elementarbuch by Ranke and Hofmann, A New Introduction to Old Norse by Barnes, and Viking Language 1 by Byock (the latter provides a list of the most frequent words in the sagas, sorted by part of speech).
8.1.16.13. cltk.stops.omr module¶
Marathi: from the 100 most frequently occurring words in the Marathi corpus in CLTK.
8.1.16.14. cltk.stops.pan module¶
Panjabi: ‘Nimit Bhardwaj <nimitbhardwaj@gmail.com>’. These are the most frequent words in the Guru Granth Sahib (Sahib Singh edition), taken from the site http://gurbanifiles.org/gurmukhi/index.htm.
Note: This is in the Gurmukhi alphabet.
8.1.16.15. cltk.stops.processes module¶
class cltk.stops.processes.StopsProcess(language: str = None)[source]¶
Bases: cltk.core.data_types.Process

>>> from boltons.strutils import split_punct_ws
>>> from cltk.core.data_types import Doc, Word
>>> from cltk.stops.processes import StopsProcess
>>> from cltk.languages.example_texts import get_example_text
>>> lang = "lat"
>>> words = [Word(string=token) for token in split_punct_ws(get_example_text(lang))]
>>> stops_process = StopsProcess(language=lang)
>>> output_doc = stops_process.run(Doc(raw=get_example_text(lang), words=words))
>>> output_doc.words[1].string
'est'
>>> output_doc.words[1].stop
True

algorithm¶
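Downstream code can filter on the stop flag that StopsProcess sets on each Word. A minimal continuation of the example above, reusing its output_doc:

# Keep only the tokens that StopsProcess did not flag as stopwords.
content_words = [w.string for w in output_doc.words if not w.stop]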
8.1.16.16. cltk.stops.san module¶
Sanskrit: ‘Akhilesh S. Chobey <akhileshchobey03@gmail.com>’. Further explanations at: https://gist.github.com/Akhilesh28/b012159a10a642ed5c34e551db76f236
8.1.16.17. cltk.stops.words module¶
Stopwords for languages.
Stopwords are high-frequency, low-content function words (e.g. conjunctions, prepositions, pronouns, and common particles) that are typically filtered out of a text before analysis.
class cltk.stops.words.Stops(iso_code)[source]¶
Bases: object

Class for filtering stopwords.

>>> from cltk.stops.words import Stops
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> stops_obj = Stops(iso_code="lat")
>>> tokens = split_punct_ws(get_example_text("lat"))
>>> len(tokens)
178
>>> tokens[25:30]
['legibus', 'inter', 'se', 'differunt', 'Gallos']
>>> tokens_filtered = stops_obj.remove_stopwords(tokens=tokens)
>>> len(tokens_filtered)
142
>>> tokens_filtered[22:26]
['legibus', 'se', 'differunt', 'Gallos']
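Since the class is parameterized by ISO code, the same pattern applies to the other lists in this package. The following is a minimal sketch for Ancient Greek, assuming an example text for "grc" is available via get_example_text; token counts will of course differ from the Latin doctest above.

from boltons.strutils import split_punct_ws
from cltk.languages.example_texts import get_example_text
from cltk.stops.words import Stops

# Sketch: the same filtering pattern applied to Ancient Greek ("grc").
grc_stops = Stops(iso_code="grc")
grc_tokens = split_punct_ws(get_example_text("grc"))
grc_filtered = grc_stops.remove_stopwords(tokens=grc_tokens)
print(len(grc_tokens), len(grc_filtered))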