Python library¶
To use Linguistica 5 as a Python library, the first step is to create a Linguistica object. How to do so depends on the nature of your data source:
Data source¶
| Function | Use |
|---|---|
| read_corpus() | Create a Linguistica object with a corpus data file. |
| read_wordlist() | Create a Linguistica object with a wordlist file. |
| from_corpus() | Create a Linguistica object with a corpus object. |
| from_wordlist() | Create a Linguistica object with a wordlist object. |
For instance, if the Brown corpus is available on your local drive (see Raw corpus text):
>>> import linguistica as lxa
>>> lxa_object = lxa.read_corpus('path/to/english-brown.txt')
Use read_wordlist() if you have a wordlist text file instead (see Wordlist).
Use from_corpus() or from_wordlist() if your data is an in-memory Python object (either a corpus text or a wordlist).
Parameters¶
The functions introduced in Data source all allow optional keyword arguments which are parameters for the Linguistica object. Different Linguistica modules make use of different parameters; see Full API documentation.
For example, to deal with only the first 500,000 word tokens in the Brown corpus:
>>> import linguistica as lxa
>>> lxa_object = lxa.read_corpus('path/to/english-brown.txt', max_word_tokens=500000)
| Parameter | Meaning | Default |
|---|---|---|
| max_word_tokens | maximum number of word tokens to be handled | 0 (= all) |
| max_word_types | maximum number of word types to be handled | 1000 |
| min_stem_length | minimum stem length | 4 |
| max_affix_length | maximum affix length | 4 |
| min_sig_count | minimum number of stems for a valid signature | 5 |
| min_context_count | minimum number of occurrences for a valid context | 3 |
| n_neighbors | number of syntactic word neighbors | 9 |
| n_eigenvectors | number of eigenvectors (in dimensionality reduction) | 11 |
| suffixing | whether the language is suffixing | 1 (= yes) |
| keep_case | whether case distinctions (“the” vs “The”) are kept | 0 (= no) |
The method parameters() returns the parameters and their values as a dict:
>>> from pprint import pprint
>>> pprint(lxa_object.parameters())
{'keep_case': 0,
'max_affix_length': 4,
'max_word_tokens': 0,
'max_word_types': 1000,
'min_context_count': 3,
'min_sig_count': 5,
'min_stem_length': 4,
'n_eigenvectors': 11,
'n_neighbors': 9,
'suffixing': 1}
To change one or more parameters of a Linguistica object,
use change_parameters() with keyword arguments:
>>> lxa_object.parameters()['min_stem_length'] # before the change
4
>>> lxa_object.change_parameters(min_stem_length=3)
>>> lxa_object.parameters()['min_stem_length'] # after the change
3
To reset all parameters to their default values,
use use_default_parameters():
>>> lxa_object.parameters()['min_stem_length'] # non-default value
3
>>> lxa_object.use_default_parameters()
>>> lxa_object.parameters()['min_stem_length']
4
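Conceptually, the keyword arguments simply overlay the default parameter dict shown above. The following pure-Python sketch illustrates that merge semantics; it is an illustration under that assumption, not Linguistica's actual implementation:

```python
# Default parameter values, as shown by parameters() above.
DEFAULT_PARAMETERS = {
    'keep_case': 0,
    'max_affix_length': 4,
    'max_word_tokens': 0,
    'max_word_types': 1000,
    'min_context_count': 3,
    'min_sig_count': 5,
    'min_stem_length': 4,
    'n_eigenvectors': 11,
    'n_neighbors': 9,
    'suffixing': 1,
}

def merged_parameters(**kwargs):
    """Overlay user-supplied keyword arguments on the defaults."""
    parameters = dict(DEFAULT_PARAMETERS)
    parameters.update(kwargs)
    return parameters

print(merged_parameters(min_stem_length=3)['min_stem_length'])  # 3
print(merged_parameters()['min_stem_length'])                   # 4
```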
- linguistica.from_corpus(corpus_object, **kwargs)¶
Create a Linguistica object with a corpus object.
- Parameters:
corpus_object – either a long string of text (with spaces separating word tokens) or a list of strings as word tokens
kwargs – keyword arguments for parameters and their values.
- linguistica.from_wordlist(wordlist_object, **kwargs)¶
Create a Linguistica object with a wordlist object.
- Parameters:
wordlist_object – either a dict of word types (as strings) mapped to their token counts or an iterable of word types (as strings).
kwargs – keyword arguments for parameters and their values.
- linguistica.read_corpus(file_path, encoding='utf8', **kwargs)¶
Create a Linguistica object with a corpus data file.
- Parameters:
file_path – path of input corpus file
encoding – encoding of the file at file_path. Default: 'utf8'
kwargs – keyword arguments for parameters and their values.
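To make the expected input concrete, here is a pure-Python sketch that writes a tiny corpus file (made-up text) and reads it back with the default 'utf8' encoding; the resulting path is the kind of argument read_corpus() takes:

```python
import os
import tempfile

# A corpus file is plain running text with spaces separating word tokens.
text = 'the quick brown fox jumps over the lazy dog'
with tempfile.NamedTemporaryFile('w', suffix='.txt', encoding='utf8',
                                 delete=False) as f:
    f.write(text)
    file_path = f.name

# Reading it back with the same encoding recovers the word tokens.
with open(file_path, encoding='utf8') as f:
    tokens = f.read().split()

print(tokens[:3])  # ['the', 'quick', 'brown']
os.remove(file_path)
```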
- linguistica.read_wordlist(file_path, encoding='utf8', **kwargs)¶
Create a Linguistica object with a wordlist file.
- Parameters:
file_path – path of input wordlist file where each line contains one word type (and, optionally, a whitespace plus the token count for that word).
encoding – encoding of the file at file_path. Default: 'utf8'
kwargs – keyword arguments for parameters and their values.
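A sketch of parsing that line format in pure Python may help clarify it. The words and counts below are made up, and how Linguistica itself treats a missing count is not specified here (this sketch stores None):

```python
# Wordlist file format: one word type per line, optionally followed
# by whitespace and the token count for that word.
lines = [
    'apple 4',
    'banana 2',
    'cherry',  # count omitted; stored as None in this sketch
]

wordlist = {}
for line in lines:
    parts = line.split()
    word = parts[0]
    count = int(parts[1]) if len(parts) > 1 else None
    wordlist[word] = count

print(wordlist['apple'])  # 4
```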