Python library

To use Linguistica 5 as a Python library, first initialize a Linguistica object. How you do this depends on your data source:

Data source

read_corpus(file_path[, encoding])        Create a Linguistica object with a corpus data file.
read_wordlist(file_path[, encoding])      Create a Linguistica object with a wordlist file.
from_corpus(corpus_object, **kwargs)      Create a Linguistica object with a corpus object.
from_wordlist(wordlist_object, **kwargs)  Create a Linguistica object with a wordlist object.

For instance, if the Brown corpus is available on your local drive (see Raw corpus text):

>>> import linguistica as lxa
>>> lxa_object = lxa.read_corpus('path/to/english-brown.txt')

Use read_wordlist() if you have a wordlist text file instead (see Wordlist).

Use from_corpus() or from_wordlist() if your data is an in-memory Python object (either a corpus text or a wordlist).
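As a minimal sketch of the in-memory case, the snippet below builds both kinds of object with the standard library only: a list of word tokens (a corpus object) and a dict of word types mapped to their token counts (a wordlist object). The sample text and the variable names are illustrative assumptions. The calls to from_corpus() and from_wordlist() are left as comments so that the snippet runs without Linguistica installed.

```python
from collections import Counter

# Illustrative sample text (an assumption, not taken from any real corpus).
text = "the dog walks and the dogs walked"

# Corpus object: a list of strings as word tokens.
corpus_tokens = text.split()

# Wordlist object: word types mapped to their token counts.
wordlist_counts = dict(Counter(corpus_tokens))

print(corpus_tokens)
print(wordlist_counts)

# With Linguistica installed, these objects could then be passed on:
# import linguistica as lxa
# lxa_from_corpus = lxa.from_corpus(corpus_tokens)
# lxa_from_wordlist = lxa.from_wordlist(wordlist_counts)
```

Splitting on whitespace is the simplest possible tokenization; real corpus text would usually need punctuation handling before the tokens are counted.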

Parameters

All of the functions introduced in Data source accept optional keyword arguments that set parameters of the Linguistica object. Different Linguistica modules make use of different parameters; see Full API documentation.

For example, to deal with only the first 500,000 word tokens in the Brown corpus:

>>> import linguistica as lxa
>>> lxa_object = lxa.read_corpus('path/to/english-brown.txt', max_word_tokens=500000)

Parameter          Meaning                                               Default
max_word_tokens    maximum number of word tokens to be handled           0 (= all)
max_word_types     maximum number of word types to be handled            1000
min_stem_length    minimum stem length                                   4
max_affix_length   maximum affix length                                  4
min_sig_count      minimum number of stems for a valid signature         5
min_context_count  minimum number of occurrences for a valid context     3
n_neighbors        number of syntactic word neighbors                    9
n_eigenvectors     number of eigenvectors (in dimensionality reduction)  11
suffixing          whether the language is suffixing                     1 (= yes)
keep_case          whether case distinctions (“the” vs “The”) are kept   0 (= no)

The method parameters() returns the parameters and their values as a dict:

>>> from pprint import pprint
>>> pprint(lxa_object.parameters())
{'keep_case': 0,
 'max_affix_length': 4,
 'max_word_tokens': 0,
 'max_word_types': 1000,
 'min_context_count': 3,
 'min_sig_count': 5,
 'min_stem_length': 4,
 'n_eigenvectors': 11,
 'n_neighbors': 9,
 'suffixing': 1}

To change one or multiple parameters of a Linguistica object, use change_parameters() with keyword arguments:

>>> lxa_object.parameters()['min_stem_length']  # before the change
4
>>> lxa_object.change_parameters(min_stem_length=3)
>>> lxa_object.parameters()['min_stem_length']  # after the change
3
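After several changes it can be useful to see which parameters no longer hold their default values. The sketch below compares a parameter dict, such as the one returned by parameters(), against the defaults listed in the table above. The helper changed_parameters() and the constant DEFAULT_PARAMETERS are hypothetical names introduced for illustration; they are not part of Linguistica, and a plain dict stands in for lxa_object.parameters() so that the snippet runs on its own.

```python
# Defaults as listed in the parameter table on this page.
DEFAULT_PARAMETERS = {
    'max_word_tokens': 0,
    'max_word_types': 1000,
    'min_stem_length': 4,
    'max_affix_length': 4,
    'min_sig_count': 5,
    'min_context_count': 3,
    'n_neighbors': 9,
    'n_eigenvectors': 11,
    'suffixing': 1,
    'keep_case': 0,
}


def changed_parameters(parameters):
    """Return {name: (default, current)} for values that differ from the defaults.

    Hypothetical helper, not part of Linguistica.
    """
    return {
        name: (DEFAULT_PARAMETERS[name], value)
        for name, value in parameters.items()
        if name in DEFAULT_PARAMETERS and value != DEFAULT_PARAMETERS[name]
    }


# Stand-in for lxa_object.parameters() after change_parameters(min_stem_length=3).
current = dict(DEFAULT_PARAMETERS, min_stem_length=3)
print(changed_parameters(current))  # {'min_stem_length': (4, 3)}
```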

To reset all parameters to their default values, use use_default_parameters():

>>> lxa_object.parameters()['min_stem_length']  # non-default value
3
>>> lxa_object.use_default_parameters()
>>> lxa_object.parameters()['min_stem_length']
4

linguistica.from_corpus(corpus_object, **kwargs)

Create a Linguistica object with a corpus object.

Parameters:
  • corpus_object – either a long string of text (with spaces separating word tokens) or a list of strings as word tokens

  • kwargs – keyword arguments for parameters and their values.

linguistica.from_wordlist(wordlist_object, **kwargs)

Create a Linguistica object with a wordlist object.

Parameters:
  • wordlist_object – either a dict of word types (as strings) mapped to their token counts or an iterable of word types (as strings).

  • kwargs – keyword arguments for parameters and their values.

linguistica.read_corpus(file_path, encoding='utf8', **kwargs)

Create a Linguistica object with a corpus data file.

Parameters:
  • file_path – path of input corpus file

  • encoding – encoding of the file at file_path. Default: 'utf8'

  • kwargs – keyword arguments for parameters and their values.

linguistica.read_wordlist(file_path, encoding='utf8', **kwargs)

Create a Linguistica object with a wordlist file.

Parameters:
  • file_path – path of input wordlist file where each line contains one word type (and, optionally, whitespace followed by the token count for that word).

  • encoding – encoding of the file at file_path. Default: 'utf8'

  • kwargs – keyword arguments for parameters and their values.
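To make the wordlist file format concrete, the following sketch writes a small file with one word type per line, some lines carrying a count and one without, and then reads it back. The function parse_wordlist() is a hypothetical reader written for illustration, not Linguistica's own implementation. Treating a missing count as 1 is an assumption of this sketch, since the description above does not say how such lines are counted.

```python
import os
import tempfile

# Illustrative wordlist content: one word type per line, count optional.
lines = ["the 2", "dog 1", "walks"]

with tempfile.NamedTemporaryFile('w', suffix='.txt', encoding='utf8',
                                 delete=False) as f:
    f.write("\n".join(lines) + "\n")
    wordlist_path = f.name

# With Linguistica installed, the file could be read directly at this point:
# import linguistica as lxa
# lxa_object = lxa.read_wordlist(wordlist_path)


def parse_wordlist(path, encoding='utf8'):
    """Hypothetical reader for the format described above (not Linguistica's own)."""
    counts = {}
    with open(path, encoding=encoding) as wordlist_file:
        for line in wordlist_file:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            word = fields[0]
            # A missing count is treated as 1 here; this is an assumption.
            counts[word] = int(fields[1]) if len(fields) > 1 else 1
    return counts


parsed = parse_wordlist(wordlist_path)
print(parsed)  # {'the': 2, 'dog': 1, 'walks': 1}

os.remove(wordlist_path)  # clean up the temporary file
```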