Lingua::Stem::En(3pm) | User Contributed Perl Documentation | Lingua::Stem::En(3pm) |
Lingua::Stem::En - Porter's stemming algorithm for 'generic' English
    use Lingua::Stem::En;

    my $stems = Lingua::Stem::En::stem({
        -words      => $word_list_reference,
        -locale     => 'en',
        -exceptions => $exceptions_hash,
    });
This routine applies the Porter Stemming Algorithm to its parameters, returning the stemmed words.
It is derived from the C program "stemmer.c" as found in freewais and elsewhere, which contains these notes:
    Purpose:    Implementation of the Porter stemming algorithm documented in: Porter, M.F., "An Algorithm For Suffix Stripping," Program 14 (3), July 1980, pp. 130-137.
    Provenance: Written by B. Frakes and C. Cox, 1986.
I have re-interpreted areas that use Frakes and Cox's "WordSize" function. My version may misbehave on short words starting with "y", but I can't think of any examples.
The step numbers correspond to Frakes and Cox, and are probably in Porter's article (which I've not seen). Porter's algorithm still has rough spots (e.g., current/currency, -ings words), which I've not attempted to cure, although I have added support for the British -ise suffix.
    1999.06.15 - Changed to '.pm' module, moved into Lingua::Stem namespace, optionalized the export of the 'stem' routine into the caller's namespace, added named parameters
    1999.06.24 - Switch core implementation of the Porter stemmer to the one written by Jim Richardson <jimr@maths.usyd.edu.au>
    2000.08.25 - 2.11 Added stemming cache
    2000.09.14 - 2.12 Fixed *major* :( implementation error of Porter's algorithm. Error was entirely my fault - I completely forgot to include rule sets 2, 3, and 4 starting with Lingua::Stem 0.30. -- Jerilyn Franz
    2003.09.28 - 2.13 Corrected documentation error pointed out by Simon Cozens.
    2005.11.20 - 2.14 Changed rule declarations to conform to Perl style convention for 'private' subroutines. Changed Exporter invocation to more portable 'require' vice 'use'.
    2006.02.14 - 2.15 Added ability to pass word list by 'handle' for in-place stemming.
    2009.07.27 - 2.16 Documentation Fix
    2020.06.20 - 2.30 Version renumber for module consistency.
    2020.09.26 - 2.31 Fix for Latin1/UTF8 issue in documentation
Example:
    my @words = ( 'wordy', 'another' );
    my $stemmed_words = Lingua::Stem::En::stem({
        -words      => \@words,
        -locale     => 'en',
        -exceptions => \%exceptions,
    });
If the first element of @words is a list reference, then the stemming is performed 'in place' on that list (modifying the passed list directly instead of copying it to a new array).
This is only useful if you do not need to keep the original list. If you do need to keep it, use the normal semantics of having 'stem' return a new list instead. That is faster than making your own copy and then stemming the copy 'in place', since the primary difference between 'in place' and 'by value' stemming is the creation of a copy of the original list. If you don't need the original list, 'in place' stemming is about 60% faster.
Example of 'in place' stemming:
    my $words = [ 'wordy', 'another' ];
    my $stemmed_words = Lingua::Stem::En::stem({
        -words      => [$words],
        -locale     => 'en',
        -exceptions => \%exceptions,
    });
The 'in place' mode returns a reference to the original list with the words stemmed.
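The difference between the two calling styles can be checked with a short script. This is a minimal sketch, assuming Lingua::Stem::En is installed; the word list is made-up sample data, and the empty -exceptions hash is only there to mirror the call shape shown above.

```perl
use strict;
use warnings;
use Lingua::Stem::En;

# Hypothetical sample data; any list of English words would do.
my @original = ('wordy', 'another', 'stemming');

# By-value call: a new list comes back and @original is left untouched.
my $copy = Lingua::Stem::En::stem({
    -words      => \@original,
    -locale     => 'en',
    -exceptions => {},
});

# In-place call: wrap the list reference in an outer array reference.
my $words   = [ @original ];
my $inplace = Lingua::Stem::En::stem({
    -words      => [$words],
    -locale     => 'en',
    -exceptions => {},
});

# The returned reference should be the very list that was passed in.
print "in-place call returned the original list\n" if $inplace == $words;
print "original: @original\n";
print "stemmed:  @$inplace\n";
```

Because the in-place form overwrites its input, the sketch stems a copy of @original so that the by-value and in-place results can be compared side by side.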
'0' means 'no caching'. This is the default level.
'1' means 'cache per run'. This caches stemming results during a single call to 'stem'.
'2' means 'cache indefinitely'. This caches stemming results until either the process exits or the 'clear_stem_cache' method is called.
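The caching levels above might be used as follows. This is a hedged sketch: the text here names only 'clear_stem_cache', so the 'stem_caching' routine and its -level parameter are an assumption taken from the module's published interface and should be checked against the installed version. The word list is hypothetical.

```perl
use strict;
use warnings;
use Lingua::Stem::En;

# Assumption: stem_caching({ -level => N }) selects one of the levels above.
# Level 2 keeps cached stems between calls to stem().
Lingua::Stem::En::stem_caching({ -level => 2 });

# Hypothetical sample data with a repeated word, where a cache can help.
my @batch = ('running', 'runs', 'running');

my $first = Lingua::Stem::En::stem({
    -words      => \@batch,
    -locale     => 'en',
    -exceptions => {},
});

# A second call on the same words can reuse the cached results.
my $second = Lingua::Stem::En::stem({
    -words      => \@batch,
    -locale     => 'en',
    -exceptions => {},
});

print "first:  @$first\n";
print "second: @$second\n";

# Release the cached stems and return to the default level (no caching).
Lingua::Stem::En::clear_stem_cache();
Lingua::Stem::En::stem_caching({ -level => 0 });
```

Level 2 trades memory for speed, so a long-running process that sees an unbounded vocabulary may want to call 'clear_stem_cache' periodically.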
This code is almost entirely derived from the Porter 2.1 module written by Jim Richardson.
Lingua::Stem
Jim Richardson, University of Sydney
jimr@maths.usyd.edu.au or http://www.maths.usyd.edu.au:8000/jimr.html

Integration in Lingua::Stem by Jerilyn Franz, FreeRun Technologies, <cpan@jerilyn.info>
Jim Richardson, University of Sydney
Jerilyn Franz, FreeRun Technologies
This code is freely available under the same terms as Perl.
2022-07-15 | perl v5.34.0 |