Tutorial: From a list of links to a frequency list
Get your system up and running
Installation: see dedicated page
Ensure you have installed the latest version:
pip install -U trafilatura
Additional software for this tutorial:
pip install -U SoMaJo
The following consists of command-line instructions. For an introduction see the page on command-line usage.
Process a list of links
For the collection and filtering of links see this tutorial and this blog post.
Two major options are necessary here:
-i or --inputfile to select an input list to read links from
-o or --outputdir to define a directory to eventually store the results
The input list is read sequentially: only lines beginning with a valid URL are processed, any other information contained in the file is discarded.
The output directory can be created on demand, but it has to be writable.
$ trafilatura -i list.txt -o txtfiles # output as raw text
$ trafilatura --xml -i list.txt -o xmlfiles # output in XML format
The second command creates a collection of XML files which can be edited with a basic text editor or a full-fledged text-editing package or IDE such as Atom.
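The same processing can also be scripted with trafilatura's Python functions fetch_url and extract. The following is a minimal sketch; the sequential file naming is an illustrative assumption, not something trafilatura imposes:

# download and extract every URL from the list; names the files 0.txt, 1.txt, ...
from pathlib import Path
from trafilatura import fetch_url, extract

outputdir = Path("txtfiles")
outputdir.mkdir(exist_ok=True)

with open("list.txt", encoding="utf-8") as f:
    urls = [line.strip() for line in f if line.startswith("http")]

for n, url in enumerate(urls):
    downloaded = fetch_url(url)   # returns None if the download fails
    if downloaded is None:
        continue
    text = extract(downloaded)    # pass output_format="xml" for XML output
    if text:
        (outputdir / f"{n}.txt").write_text(text, encoding="utf-8")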
Build frequency lists
Step-by-step
Tokenization
The SoMaJo tokenizer splits text into words and sentences. It is written in Python and achieves good results on German and English texts.
Assuming the output directory you are working with is called txtfiles:
# concatenate all files (write the result outside the directory so that reruns do not pick it up)
$ cat txtfiles/*.txt > all.txt
# output all tokens
$ somajo-tokenizer all.txt > tokens.txt
# sort the tokens by decreasing frequency and output up to 10 most frequent tokens
$ sort tokens.txt | uniq -c | sort -nrk1 | head -10
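The same counts can be obtained directly in Python with SoMaJo's API and collections.Counter. A minimal sketch, assuming English input (use "de_CMC" for German):

from collections import Counter
from somajo import SoMaJo

tokenizer = SoMaJo("en_PTB")
with open("all.txt", encoding="utf-8") as f:
    paragraphs = [line.strip() for line in f if line.strip()]

# tokenize_text yields sentences as lists of token objects
counts = Counter()
for sentence in tokenizer.tokenize_text(paragraphs):
    counts.update(token.text for token in sentence)

for token, freq in counts.most_common(10):
    print(freq, token)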
Filtering words
# further filtering: remove punctuation, delete empty lines, convert to lower case (GNU sed)
$ < tokens.txt sed -e "s/[[:punct:]]//g" -e "/^$/d" -e "s/.*/\L&/" > tokens-filtered.txt
# display most frequent tokens
$ < tokens-filtered.txt sort | uniq -c | sort -nrk1 | head -20
# store frequency information in a tab-separated CSV file
$ < tokens.txt sort | uniq -c | sort -nrk1 | sed -e "s|^ *||g" -e "s| |\t|" > txtfiles/frequencies.csv
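The same filtering and frequency file can be produced in Python with the standard library alone. A minimal sketch; note that string.punctuation only covers ASCII punctuation, whereas [[:punct:]] may match more characters depending on the locale:

import csv
import string
from collections import Counter

with open("tokens.txt", encoding="utf-8") as f:
    tokens = [line.strip() for line in f]

# remove punctuation, convert to lower case, drop empty strings
table = str.maketrans("", "", string.punctuation)
filtered = [t.translate(table).lower() for t in tokens]
filtered = [t for t in filtered if t]

counts = Counter(filtered)
with open("txtfiles/frequencies.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerows(counts.most_common())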
Further filtering steps:
- with a list of stopwords: egrep -vixFf stopwords.txt
- an alternative way to convert to lower case: uconv -x lower
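The stopword step translates directly to Python. A short sketch, assuming a stopwords.txt file with one word per line:

with open("stopwords.txt", encoding="utf-8") as f:
    stopwords = {line.strip().lower() for line in f}

with open("tokens-filtered.txt", encoding="utf-8") as f:
    tokens = [line.strip() for line in f if line.strip()]

# keep only tokens that are not in the stopword list (case-insensitive, like egrep -i)
tokens = [t for t in tokens if t.lower() not in stopwords]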
Collocations and multi-word units
# word bigrams
$ < tokens-filtered.txt tr "\n" " " | awk '{for (i=1; i<NF; i++) print $i, $(i+1)}' | sort | uniq -c | sort -nrk1 | head -20
# word trigrams (stop two words before the end so that no empty field is printed)
$ < tokens-filtered.txt tr "\n" " " | awk '{for (i=1; i<NF-1; i++) print $i, $(i+1), $(i+2)}' | sort | uniq -c | sort -nrk1 | head -20
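In Python, n-grams can be derived from the filtered token list with zip. A minimal sketch:

from collections import Counter

with open("tokens-filtered.txt", encoding="utf-8") as f:
    tokens = [line.strip() for line in f if line.strip()]

# adjacent pairs and triples of tokens
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

for ngram, freq in bigrams.most_common(20):
    print(freq, " ".join(ngram))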
Further information
Unix™ for Poets (count and sort words, compute ngram statistics, make a Concordance)
Analyzing Documents with Term Frequency - Inverse Document Frequency (tf-idf), both a corpus exploration method and a pre-processing step for many other text-mining measures and models
Additional information for XML files
Assuming the output directory you are working with is called xmlfiles:
# tokenize a file
$ somajo-tokenizer --xml xmlfiles/filename.xml
# remove tags
$ somajo-tokenizer --xml xmlfiles/filename.xml | sed -e "s|</*.*>||g" -e "/^$/d"
# continue with the steps above...
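SoMaJo can also tokenize XML from Python. A sketch under the assumption that sentence boundaries coincide with tags such as <p> and <head> in trafilatura's XML output (adjust eos_tags to the actual markup):

from somajo import SoMaJo

tokenizer = SoMaJo("en_PTB")   # use "de_CMC" for German
with open("xmlfiles/filename.xml", encoding="utf-8") as f:
    sentences = tokenizer.tokenize_xml(f.read(), eos_tags=["head", "p"])

for sentence in sentences:
    for token in sentence:
        if not token.markup:   # skip XML tags, keep only word tokens
            print(token.text)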