JELLYFISH(3) | jellyfish | JELLYFISH(3) |
jellyfish - jellyfish Documentation
jellyfish is a library of functions for approximate and phonetic matching of strings.
Source code is available on GitHub
The library provides implementations of the following algorithms:
These algorithms convert a string to a normalized phonetic encoding, converting a word to a representation of its pronunciation. Each takes a single string and returns a coded representation.
Soundex is an algorithm to convert a word (typically a name) to a four-character code in the form 'A123', where 'A' is the first letter of the name and the digits represent similar sounds.
For example, soundex('Ann') == soundex('Anne') == 'A500' and soundex('Rupert') == soundex('Robert') == 'R163'.
See the Soundex article at Wikipedia for more details.
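The encoding rules can be illustrated with a minimal pure-Python sketch of classic American Soundex. This is not the library's implementation; the name soundex_sketch is hypothetical, and the sketch assumes plain ASCII names.

```python
def soundex_sketch(name):
    """Illustrative American Soundex; not jellyfish's own code."""
    # Consonant groups that share a digit because they sound alike.
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit

    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""

    result = letters[0]                 # the first letter is kept as-is
    prev = codes.get(letters[0], "")    # its code still suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        digit = codes.get(ch, "")       # vowels (and Y) have no code
        if digit and digit != prev:
            result += digit
        prev = digit                    # a vowel resets prev, allowing a repeat
        if len(result) == 4:
            break
    return result.ljust(4, "0")         # pad short codes with zeros


print(soundex_sketch("Ann"), soundex_sketch("Anne"))
print(soundex_sketch("Rupert"), soundex_sketch("Robert"))
```

The two print calls reproduce the pairs from the example above: each pair of names maps to the same code.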
The metaphone algorithm was designed as an improvement on Soundex. It transforms a word into a string made up of the characters '0BFHJKLMNPRSTWXY', where '0' is pronounced 'th' and 'X' is a '[sc]h' sound.
For example, metaphone('Klumpz') == metaphone('Clumps') == 'KLMPS'.
See the Metaphone article at Wikipedia for more details.
The NYSIIS algorithm was developed by the New York State Identification and Intelligence System. It transforms a word into a phonetic code. Like soundex and metaphone, it is primarily intended for use on names (as they would be pronounced in English).
For example, nysiis('John') == nysiis('Jan') == 'JAN'.
See the NYSIIS article at Wikipedia for more details.
The Match Rating Approach is an algorithm for determining whether two names are pronounced similarly. It consists of an encoding function (similar to soundex or nysiis), implemented here as match_rating_codex(), and match_rating_comparison(), which does the actual comparison.
See the Match Rating Approach article at Wikipedia for more details.
Stemming is the process of reducing a word to its root form, for example 'stemmed' to 'stem'.
Martin Porter's algorithm is a widely used stemming algorithm that works well for many purposes.
See the official homepage for the Porter Stemming Algorithm for more details.
These methods are all measures of the difference (aka edit distance) between two strings.
Levenshtein distance represents the number of insertions, deletions, and substitutions required to change one word to another.
For example, levenshtein_distance('berne', 'born') == 2, representing the substitution of o for the first e and the deletion of the second e.
See the Levenshtein distance article at Wikipedia for more details.
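The definition in terms of insertions, deletions, and substitutions maps directly onto the standard dynamic-programming recurrence. A minimal pure-Python sketch follows; the name levenshtein_sketch is hypothetical, and this is not the library's C or Python implementation.

```python
def levenshtein_sketch(s1, s2):
    """Illustrative two-row dynamic-programming edit distance."""
    # previous[j] holds the distance between the prefix of s1 handled
    # so far and the first j characters of s2.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1
                current[j - 1] + 1,      # insert c2
                previous[j - 1] + cost,  # substitute c1 with c2 (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein_sketch("berne", "born"))
```

Keeping only two rows reduces memory to the length of the second string while producing the same result as the full table.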
A modification of Levenshtein distance, Damerau-Levenshtein distance counts transpositions (such as ifsh for fish) as a single edit.
For example, levenshtein_distance('fish', 'ifsh') == 2, since it requires a deletion and an insertion, whereas damerau_levenshtein_distance('fish', 'ifsh') == 1, since swapping the adjacent f and i counts as a single transposition.
See the Damerau-Levenshtein distance article at Wikipedia for more details.
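The effect of counting a transposition as one edit can be sketched with the optimal string alignment recurrence, a simpler restricted variant of Damerau-Levenshtein. This is an assumption made for brevity, not a description of how the library computes its result, and the name osa_distance_sketch is hypothetical. The restricted variant can return larger values than the unrestricted distance on some inputs.

```python
def osa_distance_sketch(s1, s2):
    """Illustrative optimal string alignment distance (restricted variant)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of s1[:i]
    for j in range(cols):
        d[0][j] = j  # insert every character of s2[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Two adjacent characters swapped: count as one edit.
            if (i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2]
                    and s1[i - 2] == s2[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance_sketch("fish", "ifsh"))
```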
Hamming distance is the measure of the number of characters that differ between two strings.
Typically Hamming distance is undefined when strings are of different length, but this implementation considers extra characters as differing. For example hamming_distance('abc', 'abcd') == 1.
See the Hamming distance article at Wikipedia for more details.
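The convention of treating extra characters as differing can be sketched in a few lines of pure Python. The name hamming_sketch is hypothetical, and this is an illustration rather than the library's code.

```python
def hamming_sketch(s1, s2):
    """Illustrative Hamming distance that tolerates unequal lengths."""
    # Positions present in both strings: count the mismatches.
    differing = sum(1 for a, b in zip(s1, s2) if a != b)
    # Every character beyond the shorter string counts as one difference.
    return differing + abs(len(s1) - len(s2))


print(hamming_sketch("abc", "abcd"))
```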
Jaro distance is a string-edit measure that gives a floating point response in [0,1], where 0 represents two completely dissimilar strings and 1 represents identical strings. Despite the name, higher values therefore mean more similar strings.
WARNING:
Jaro-Winkler is a modification of Jaro distance intended as an improvement. Like Jaro, it gives a floating point response in [0,1], where 0 represents two completely dissimilar strings and 1 represents identical strings.
WARNING:
See the Jaro-Winkler distance article at Wikipedia for more details.
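How the two scores relate can be sketched with the textbook formulas. The function names, the prefix weight of 0.1, and the four-character prefix cap below are assumptions taken from the usual description of the algorithm, not from this library, whose results may differ in edge cases.

```python
def jaro_sketch(s1, s2):
    """Illustrative Jaro score in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if equal and no further apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that appear in a different order, counted in pairs.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (matches / len1
            + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler_sketch(s1, s2, prefix_weight=0.1, max_prefix=4):
    """Illustrative Jaro-Winkler: boost the Jaro score for a shared prefix."""
    score = jaro_sketch(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_weight * (1 - score)


print(round(jaro_sketch("MARTHA", "MARHTA"), 4))
print(round(jaro_winkler_sketch("MARTHA", "MARHTA"), 4))
```

Because the boost is proportional to (1 - score), the Jaro-Winkler value never drops below the Jaro value and never exceeds 1.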
The Match Rating Approach is an algorithm for determining whether two names are pronounced similarly. Strings are first encoded using match_rating_codex(), then compared according to the MRA algorithm.
See the Match Rating Approach article at Wikipedia for more details.
Each algorithm has C and Python implementations.
On a typical CPython install the C implementation will be used. The Python versions are available for PyPy and systems where compiling the CPython extension is not possible.
To explicitly use a specific implementation, refer to the appropriate module:
import jellyfish._jellyfish as pyjellyfish
import jellyfish.cjellyfish as cjellyfish
If you've already imported jellyfish and are not sure what implementation you are using, you can check by querying jellyfish.library:
if jellyfish.library == 'Python':
    pass  # Python implementation
elif jellyfish.library == 'C':
    pass  # C implementation
James Turk
2021, James Turk
October 29, 2021 | 0.8 |