Lingua::EN::Tagger - Part-of-speech tagger for English natural
language processing.
use Lingua::EN::Tagger;
# Create a parser object
my $p = Lingua::EN::Tagger->new;
# Add part of speech tags to a text
my $tagged_text = $p->add_tags($text);
...
# Get a list of all nouns and noun phrases with occurrence counts
my %word_list = $p->get_words($text);
...
# Get a readable version of the tagged text
my $readable_text = $p->get_readable($text);
The module is a probability-based, corpus-trained tagger that assigns POS
tags to English text using a lookup dictionary and a set of probability
values. Tags are chosen by conditional probability: the tagger examines the
preceding tag to determine the most likely tag for the current word. Unknown
words are classified according to their morphology, or can be configured to
be treated as nouns or another part of speech.
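The idea of combining a lexicon lookup with the preceding tag can be sketched in a few lines of plain Perl. This is a toy illustration, not the module's implementation: the lexicon, the transition table, the tag names, the probability values, and the helpers guess_unknown() and tag_words() are all invented for the example.

```perl
use strict;
use warnings;

# Toy lexicon: P(tag | word). All numbers are invented for illustration.
my %lexicon = (
    the => { det => 1.0 },
    dog => { nn  => 0.9, vb => 0.1 },
    can => { md  => 0.7, nn => 0.2, vb => 0.1 },
    run => { vb  => 0.8, nn => 0.2 },
);

# Toy transitions: P(tag | previous tag). Also invented.
my %transition = (
    start => { det => 0.6, nn => 0.3,  md => 0.05, vb => 0.05 },
    det   => { nn  => 0.9, vb => 0.05, md => 0.05 },
    nn    => { md  => 0.4, vb => 0.4,  nn => 0.2 },
    md    => { vb  => 0.9, nn => 0.1 },
    vb    => { det => 0.5, nn => 0.3,  vb => 0.2 },
);

# Score used when a tag pair is missing from the transition table.
my $FLOOR = 0.0001;

# Hypothetical morphology fallback for words missing from the lexicon.
sub guess_unknown {
    my ($word) = @_;
    return { vbg => 1.0 } if $word =~ /ing$/;
    return { rb  => 1.0 } if $word =~ /ly$/;
    return { nn  => 1.0 };    # default: treat unknown words as nouns
}

# Greedy left-to-right tagging: each word gets the tag that maximises
# P(tag | word) * P(tag | previous tag).
sub tag_words {
    my @words = @_;
    my $prev  = 'start';
    my @tagged;
    for my $word (@words) {
        my $candidates = $lexicon{ lc $word } || guess_unknown($word);
        my ( $best_tag, $best_score ) = ( '', -1 );
        for my $tag ( sort keys %$candidates ) {
            my $score = $candidates->{$tag}
                      * ( $transition{$prev}{$tag} // $FLOOR );
            ( $best_tag, $best_score ) = ( $tag, $score )
                if $score > $best_score;
        }
        push @tagged, "<$best_tag>$word</$best_tag>";
        $prev = $best_tag;
    }
    return join ' ', @tagged;
}

print tag_words(qw(The dog can run quickly)), "\n";
```

In this toy model "can" is tagged as a modal because it follows a noun, and "quickly", which is not in the lexicon, falls through to the suffix rule.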
The tagger also extracts as many nouns and noun phrases as it can,
using a set of regular expressions.
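As a rough sketch of how regular expressions can pull noun phrases out of XML-style tagged text, the example below matches zero or more adjectives followed by one or more nouns. The patterns, the tag names (jj, nn), and the helper extract_noun_phrases() are assumptions made for illustration; the module's own expressions are more elaborate and may differ.

```perl
use strict;
use warnings;

# Simplified, hypothetical patterns over XML-style tags.
my $ADJ  = qr{<jj>[^<]+</jj>\s*};
my $NOUN = qr{<nn>[^<]+</nn>\s*};

# Return a hash of noun phrase => occurrence count.
sub extract_noun_phrases {
    my ($tagged) = @_;
    my %count;
    while ( $tagged =~ /((?:$ADJ)*(?:$NOUN)+)/g ) {
        my $phrase = $1;
        $phrase =~ s/<[^>]+>//g;    # strip the tags
        $phrase =~ s/\s+$//;        # trim trailing whitespace
        $phrase =~ s/\s+/ /g;       # collapse internal whitespace
        $count{ lc $phrase }++;
    }
    return %count;
}

my $tagged = '<det>The</det> <jj>quick</jj> <nn>fox</nn> <vbd>saw</vbd> '
           . '<det>a</det> <nn>stone</nn> <nn>wall</nn>';
my %np = extract_noun_phrases($tagged);
print "$_: $np{$_}\n" for sort keys %np;
```

For the sample string this finds "quick fox" and "stone wall", each once.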
- new %PARAMS
- Class constructor. Takes a hash with the following parameters (shown with
default values):
- unknown_word_tag => ''
- Tag to assign to unknown words.
- stem => 0
- Stem single words using Lingua::Stem::EN.
- weight_noun_phrases => 0
- When returning occurrence counts for a noun phrase, multiply the count by
the number of words in the noun phrase.
- longest_noun_phrase => 5
- Ignore noun phrases longer than this many words. This affects only the
get_words() and get_nouns() methods.
- relax => 0
- Relax the Hidden Markov Model. This may improve accuracy for uncommon
words, particularly polysemous ones.
- add_tags TEXT
- Examine the string provided and return it fully tagged (XML style).
- add_tags_incrementally TEXT
- Examine the string provided and return it fully tagged (XML style) but do
not reset the internal part-of-speech state between invocations.
- get_words TEXT
- Given a text string, return as many nouns and noun phrases as possible.
Calls add_tags internally and works in three stages:
* Tag the text
* Extract all the maximal noun phrases (MNPs)
* Recursively extract all noun phrases from the MNPs
- get_readable TEXT
- Return an easy-on-the-eyes tagged version of a text string. Applies
add_tags and reformats to be easier to read.
- get_sentences TEXT
- Returns an anonymous array of sentences (without POS tags) from a
text.
- get_proper_nouns TAGGED_TEXT
- Given a POS-tagged text, this method returns a hash of all proper nouns
and their occurrence frequencies. The method is greedy and will return
multi-word phrases, if possible, so it would find "Linguistic Data
Consortium" as a single unit, rather than as three individual proper
nouns. This method does not stem the found words.
- get_nouns TAGGED_TEXT
- Given a POS-tagged text, this method returns all nouns and their
occurrence frequencies.
- get_max_noun_phrases TAGGED_TEXT
- Given a POS-tagged text, this method returns only the maximal noun
phrases. May be called directly, but is also used by get_noun_phrases.
- get_noun_phrases TAGGED_TEXT
- Similar to get_words, but requires a POS-tagged text as an argument.
- install
- Reads some included corpus data and saves it in a stored hash on the local
file system. This is called automatically if the tagger can't find the
stored lexicon.
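The methods above can be combined so that a text is tagged once and the tagged string is reused for each extraction call. The sketch below assumes that get_nouns, get_max_noun_phrases and get_noun_phrases return key/count lists that can be assigned to a hash, in the same way as get_words in the synopsis and get_proper_nouns as documented. The option values and the sample text are arbitrary.

```perl
use strict;
use warnings;
use Lingua::EN::Tagger;

# Options are those listed for new(); the values here are arbitrary.
my $tagger = Lingua::EN::Tagger->new(
    longest_noun_phrase => 3,
    weight_noun_phrases => 1,
);

my $text = 'The Linguistic Data Consortium distributes annotated corpora. '
         . 'Researchers use the corpora to train statistical taggers.';

# Tag once, then pass the tagged string to the TAGGED_TEXT methods.
my $tagged = $tagger->add_tags($text);

# Assumption: each call returns a list of phrase => count pairs.
my %proper  = $tagger->get_proper_nouns($tagged);
my %nouns   = $tagger->get_nouns($tagged);
my %max_nps = $tagger->get_max_noun_phrases($tagged);
my %all_nps = $tagger->get_noun_phrases($tagged);

for my $report (
    [ 'Proper nouns'         => \%proper ],
    [ 'Nouns'                => \%nouns ],
    [ 'Maximal noun phrases' => \%max_nps ],
    [ 'All noun phrases'     => \%all_nps ],
) {
    my ( $label, $counts ) = @$report;
    print "$label:\n";
    printf "  %-35s %s\n", $_, $counts->{$_} for sort keys %$counts;
}

# get_sentences takes untagged text and returns an array reference.
my $sentences = $tagger->get_sentences($text);
print "Sentence: $_\n" for @$sentences;
```

Calling add_tags once and reusing its result avoids tagging the same text repeatedly, which is what get_words would do on each call.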
Aaron Coburn <acoburn@apache.org>
Maciej Ceglowski <developer@ceglowski.com>
Eric Nichols, Nara Institute of Science and Technology
Copyright 2003-2010 Aaron Coburn <acoburn@apache.org>
This program is free software; you can redistribute it and/or modify
it under the terms of version 3 of the GNU General Public License as
published by the Free Software Foundation.