Fulltext analyzers¶
Overview¶
Analyzers are used for creating fulltext-indexes. They take the content of a field and split it into tokens, which are then searched. Analyzers filter, reorder and/or transform the content of a field before it becomes the final stream of tokens.
An analyzer consists of one tokenizer, zero or more token-filters, and zero or more char-filters.
When the content of a field is analyzed into a stream of tokens, the char-filters are applied first. They filter special characters out of the stream of characters that makes up the content.
Tokenizers take a possibly filtered stream of characters and split it into a stream of tokens.
Token-filters can add tokens, delete tokens or transform them.
With these elements in place, analyzers provide fine-grained control over building the token stream used for fulltext search. For example, you can use language-specific analyzers, tokenizers and token-filters to get proper search results for data provided in a certain language.
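The order of the three stages described above can be illustrated with a small Python sketch. It is not how CrateDB implements analysis; the names analyze, strip_tags, whitespace_tokenizer and lowercase are hypothetical and exist only to show char-filters running first, then the tokenizer, then the token-filters.

```python
import re
from typing import Callable, Iterable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def analyze(text: str,
            char_filters: Iterable[CharFilter],
            tokenizer: Tokenizer,
            token_filters: Iterable[TokenFilter]) -> List[str]:
    """Run the three analysis stages in order (illustrative only)."""
    # 1. Char-filters rewrite the raw character stream.
    for char_filter in char_filters:
        text = char_filter(text)
    # 2. Exactly one tokenizer splits the characters into tokens.
    tokens = tokenizer(text)
    # 3. Token-filters add, delete or transform tokens.
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


def strip_tags(text: str) -> str:
    # Hypothetical char-filter: replace anything that looks like a tag.
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text: str) -> List[str]:
    # Hypothetical tokenizer: split at whitespace.
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    # Hypothetical token-filter: lower-case every token.
    return [token.lower() for token in tokens]


print(analyze("<b>Quick</b> Brown FOX",
              [strip_tags], whitespace_tokenizer, [lowercase]))
# ['quick', 'brown', 'fox']
```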
The built-in analyzers, tokenizers, token-filters and char-filters are listed below. They can be used as is or extended.
See also
Fulltext indices for examples showing how to create tables which make use of analyzers.
Creating a custom analyzer for an example showing how to create a custom analyzer.
CREATE ANALYZER for the syntax reference.
Built-in analyzers¶
standard¶
type='standard'
An analyzer of type standard is built using the Standard tokenizer with the standard, lowercase, and stop token filters.
It lowercases all tokens, uses NO stopwords, and excludes tokens longer than 255 characters. This analyzer uses Unicode text segmentation, as defined by UAX#29.
For example, the standard analyzer converts the sentence
The quick brown fox jumps Over the lAzY DOG.
into the following tokens
quick, brown, fox, jumps, lazy, dog
Parameters
- stopwords
 A list of stopwords to initialize the stop filter with. Defaults to the english stop words.
- max_token_length
 The maximum token length. If a token exceeds this length it is split in max_token_length chunks. Defaults to 255.
plain¶
type='plain'
The plain analyzer is an alias for the keyword analyzer and cannot be extended. You must extend the keyword analyzer instead.
stop¶
type='stop'
Uses the Lowercase tokenizer with the stop token filter.
Parameters
- stopwords
 A list of stopwords to initialize the stop token filter with. Defaults to the english stop words.
- stopwords_path
 A path (either relative to configuration location, or absolute) to a stopwords file configuration.
pattern¶
type='pattern'
An analyzer of type pattern flexibly separates text into terms via a regular expression.
Parameters
- lowercase
 Whether terms should be lowercased. Defaults to true.
- pattern
 The regular expression pattern, defaults to \W+.
- flags
 The regular expression flags.
Note
The regular expression should match the token separators, not the tokens themselves.
Flags should be pipe-separated, e.g. CASE_INSENSITIVE|COMMENTS. See the Java
Pattern API for more details about the available flags.
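The point of the note above, that the pattern matches separators rather than tokens, can be seen in a rough Python sketch. It uses Python's re module, whose syntax and flags differ in places from the Java Pattern API the analyzer relies on, and pattern_analyze is a hypothetical helper, not a CrateDB function.

```python
import re


def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    """Split text wherever the separator pattern matches (illustrative only)."""
    # The pattern describes what lies BETWEEN tokens, so we split on it
    # and drop the empty strings left at the edges.
    tokens = [token for token in re.split(pattern, text) if token]
    return [token.lower() for token in tokens] if lowercase else tokens


print(pattern_analyze("The quick-brown fox."))
# ['the', 'quick', 'brown', 'fox']
print(pattern_analyze("a,b;c", pattern=r"[,;]", lowercase=False))
# ['a', 'b', 'c']
```

A pattern such as \w+ would do the opposite of what is usually intended here: it would treat the words themselves as separators and leave only the punctuation and spaces behind.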
language¶
type='<language-name>'
The following types are supported:
arabic, armenian, basque, brazilian, bengali,
bulgarian, catalan, chinese, cjk, czech, danish,
dutch, english, finnish, french, galician, german,
greek, hindi, hungarian, indonesian, italian,  latvian,
lithuanian, norwegian, persian, portuguese, romanian,
russian, sorani, spanish, swedish, turkish, thai.
Parameters
- stopwords
 A list of stopwords to initialize the stop filter with. Defaults to the english stop words.
- stopwords_path
 A path (either relative to configuration location, or absolute) to a stopwords file configuration.
- stem_exclusion
 The stem_exclusion parameter allows you to specify an array of lowercase words that should not be stemmed. The following analyzers support setting stem_exclusion:
arabic, armenian, basque, brazilian, bengali, bulgarian, catalan, czech, danish, dutch, english, finnish, french, galician, german, hindi, hungarian, indonesian, italian, latvian, lithuanian, norwegian, portuguese, romanian, russian, spanish, swedish, turkish.
snowball¶
type='snowball'
Uses the Standard tokenizer with the standard, lowercase, stop, and snowball token filters.
Parameters
- stopwords
 A list of stopwords to initialize the stop filter with. Defaults to the english stop words.
- language
 See the language parameter of the snowball token filter.
fingerprint¶
type='fingerprint'
The fingerprint analyzer implements a fingerprinting algorithm which is used by the OpenRefine project to assist in clustering. Input text is lowercased, normalized to remove extended characters, sorted, de-duplicated and concatenated into a single token. If a stopword list is configured, stop words will also be removed. It uses the Standard tokenizer and the following filters: lowercase, asciifolding, fingerprint and stop.
Parameters
- separator
 The character to use to concatenate the terms. Defaults to a space.
- max_output_size
 The maximum token size to emit, tokens larger than this size will be discarded. Defaults to 255.
- stopwords
 A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _none_.
- stopwords_path
 The path to a file containing stop words.
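The fingerprinting steps described above (lowercase, fold to ASCII, remove stop words, sort, de-duplicate, concatenate) can be approximated in a few lines of Python. This is only a sketch: the regular-expression split stands in for the Standard tokenizer, Unicode NFKD decomposition stands in for asciifolding, and the function name fingerprint is hypothetical.

```python
import re
import unicodedata


def fingerprint(text, separator=" ", max_output_size=255, stopwords=()):
    """Approximate the fingerprint analyzer (illustrative only)."""
    # Stand-in for the Standard tokenizer plus lowercase filter.
    tokens = [token for token in re.split(r"\W+", text.lower()) if token]
    # Stand-in for asciifolding: decompose and drop combining marks.
    folded = []
    for token in tokens:
        decomposed = unicodedata.normalize("NFKD", token)
        folded.append("".join(char for char in decomposed
                              if not unicodedata.combining(char)))
    # Remove stop words, de-duplicate and sort.
    kept = sorted({token for token in folded if token not in stopwords})
    result = separator.join(kept)
    # Output larger than max_output_size is discarded.
    return result if len(result) <= max_output_size else None


print(fingerprint("Yes yes, G\u00f6del said this sentence is consistent and."))
# and consistent godel is said sentence this yes
```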
Built-in tokenizers¶
Standard tokenizer¶
type='standard'
The standard tokenizer is a grammar-based tokenizer that works well
for most European-language documents. The tokenizer
implements the Unicode Text Segmentation algorithm, as specified in Unicode
Standard Annex #29.
Parameters
- max_token_length
 The maximum token length. If a token exceeds this length it is split in max_token_length chunks. Defaults to 255.
Classic tokenizer¶
type='classic'
The classic tokenizer is a grammar based tokenizer that is good for English
language documents. This tokenizer has heuristics for special treatment of
acronyms, company names, email addresses, and internet host names. However,
these rules don’t always work, and the tokenizer doesn’t work well for most
languages other than English.
Parameters
- max_token_length
 The maximum token length. If a token exceeds this length it is split in max_token_length chunks. Defaults to 255.
Thai tokenizer¶
type='thai'
The thai tokenizer splits Thai text correctly and treats all other languages
like the Standard tokenizer does.
Lowercase tokenizer¶
type='lowercase'
The lowercase tokenizer combines the functions of the Letter tokenizer
and the lowercase token filter. It divides text at non-letters and
converts the resulting tokens to lower case.
Whitespace tokenizer¶
type='whitespace'
The whitespace tokenizer splits text at whitespace.
Parameters
- max_token_length
 The maximum token length. If a token exceeds this length it is split in max_token_length chunks. Defaults to 255.
UAX URL email tokenizer¶
type='uax_url_email'
The uax_url_email tokenizer behaves like the Standard tokenizer, but
tokenizes emails and URLs as single tokens.
Parameters
- max_token_length
 The maximum token length. If a token exceeds this length it is split in max_token_length chunks. Defaults to 255.
N-gram tokenizer¶
type='ngram'
Parameters
- min_gram
 Minimum length of characters in a gram. default: 1.
- max_gram
 Maximum length of characters in a gram. default: 2.
- token_chars
 Character classes to keep in the tokens. The tokenizer splits on characters that don’t belong to any of these classes. default: [] (Keep all characters).
Classes: letter, digit, whitespace, punctuation, symbol
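As a rough illustration of how min_gram and max_gram interact, the following Python sketch emits every substring whose length lies between the two bounds. It assumes the default token_chars (keep all characters), so the whole input is treated as one unit, and the function name ngrams is hypothetical.

```python
def ngrams(text, min_gram=1, max_gram=2):
    """Emit all character n-grams with min_gram <= length <= max_gram."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            # Skip grams that would run past the end of the input.
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


print(ngrams("fox"))
# ['f', 'fo', 'o', 'ox', 'x']
```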
Edge n-gram tokenizer¶
type='edge_ngram'
The edge_ngram tokenizer is very similar to the N-gram tokenizer but only
keeps n-grams which start at the beginning of a token.
Parameters
- min_gram
 Minimum length of characters in a gram. default: 1
- max_gram
 Maximum length of characters in a gram. default: 2
- token_chars
 Character classes to keep in the tokens. The tokenizer splits on characters that don’t belong to any of these classes. default: [] (Keep all characters).
Classes: letter, digit, whitespace, punctuation, symbol
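The difference from plain n-grams can be sketched in Python: only prefixes of a token are kept. The helper name edge_ngrams is hypothetical, and the sketch works on a single token, which corresponds to the default token_chars setting where the input is not split further.

```python
def edge_ngrams(token, min_gram=1, max_gram=2):
    """Emit the prefixes of token with min_gram <= length <= max_gram."""
    # Never ask for a prefix longer than the token itself.
    longest = min(max_gram, len(token))
    return [token[:size] for size in range(min_gram, longest + 1)]


print(edge_ngrams("quick"))
# ['q', 'qu']
print(edge_ngrams("quick", min_gram=2, max_gram=4))
# ['qu', 'qui', 'quic']
```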
Keyword tokenizer¶
type='keyword'
The keyword tokenizer emits the entire input as a single token.
Parameters
- buffer_size
 The term buffer size. Defaults to 256.
Pattern tokenizer¶
type='pattern'
The pattern tokenizer separates text into terms via a regular
expression.
Parameters
- pattern
 The regular expression pattern, defaults to \W+.
- flags
 The regular expression flags.
- group
 Which group to extract into tokens. Defaults to -1 (split).
Note
The regular expression should match the token separators, not the tokens themselves.
Flags should be pipe-separated, e.g. CASE_INSENSITIVE|COMMENTS. See the Java
Pattern API for more details about the available flags.
Simple pattern tokenizer¶
type='simple_pattern'
Like the pattern tokenizer, this tokenizer uses a regular
expression, but it captures the matching text as terms
and supports only a limited, more restrictive subset of expressions. It is
generally faster than the pattern tokenizer, but does not support
splitting on the pattern.
Parameters
- pattern
 A Lucene regular expression, defaults to empty string.
Simple pattern split tokenizer¶
type='simple_pattern_split'
The simple_pattern_split tokenizer operates with the same restricted subset
of regular expressions as the
simple_pattern tokenizer, but it splits the input at pattern matches rather
than returning the matches as terms.
Parameters
- pattern
 A Lucene regular expression, defaults to empty string.
Path hierarchy tokenizer¶
type='path_hierarchy'
Takes something like this:
/something/something/else
And produces tokens:
/something
/something/something
/something/something/else
Parameters
- delimiter
 The character delimiter to use, defaults to /.
- replacement
 An optional replacement character to use. Defaults to the delimiter.
- buffer_size
 The buffer size to use, defaults to 1024.
- reverse
 Generates tokens in reverse order, defaults to false.
- skip
 Controls initial tokens to skip, defaults to 0.
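The example above can be reproduced with a short Python sketch. It is an approximation under stated assumptions: skip is interpreted as dropping that many leading path segments, the reverse and buffer_size parameters are left out, and path_hierarchy is a hypothetical function name.

```python
def path_hierarchy(path, delimiter="/", replacement=None, skip=0):
    """Emit one token per level of a delimited path (illustrative only)."""
    if replacement is None:
        replacement = delimiter
    # Cut the path into segments that keep their leading delimiter,
    # e.g. "/a/b" -> ["/a", "/b"].
    segments = []
    current = ""
    for char in path:
        if char == delimiter and current:
            segments.append(current)
            current = ""
        current += replacement if char == delimiter else char
    if current:
        segments.append(current)
    # Assumption: skip drops that many leading segments.
    segments = segments[skip:]
    # Each token is the concatenation of the first 1..n segments.
    return ["".join(segments[:end]) for end in range(1, len(segments) + 1)]


for token in path_hierarchy("/something/something/else"):
    print(token)
# /something
# /something/something
# /something/something/else
```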
Char group tokenizer¶
type='char_group'
Breaks text into terms whenever it encounters a character that is part of a predefined set.
Parameters
- tokenize_on_chars
 A list containing characters to tokenize on.
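A minimal Python sketch of this behaviour follows. It only handles literal characters in tokenize_on_chars, and char_group_tokenize is a hypothetical name chosen for the illustration.

```python
def char_group_tokenize(text, tokenize_on_chars):
    """Break text whenever a character from the given set is seen."""
    split_chars = set(tokenize_on_chars)
    tokens = []
    current = []
    for char in text:
        if char in split_chars:
            # A split character closes the current term (if any).
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(char)
    if current:
        tokens.append("".join(current))
    return tokens


print(char_group_tokenize("The QUICK brown-fox", [" ", "-"]))
# ['The', 'QUICK', 'brown', 'fox']
```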
Built-in token filters¶
classic¶
type='classic'
Does optional post-processing of terms that are generated by the classic tokenizer. It removes the english possessive from the end of words, and it removes dots from acronyms.
asciifolding¶
type='asciifolding'
Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the “Basic Latin” Unicode block) into their ASCII equivalents, if one exists.
length¶
type='length'
Removes words that are too long or too short for the stream.
Parameters
- min
 The minimum token length. Defaults to 0.
- max
 The maximum token length. Defaults to Integer.MAX_VALUE.
lowercase¶
type='lowercase'
Normalizes token text to lower case.
Parameters
- language
 For options, see language analyzer.
edge_ngram¶
type='edge_ngram'
Parameters
- min_gram
 Defaults to 1.
- max_gram
 Defaults to 2.
- side
 Either front or back. Defaults to front.
porter_stem¶
type='porter_stem'
Transforms the token stream as per the Porter stemming algorithm.
Note
The input to the stemming filter must already be in lower case, so use the lowercase token filter or the Lowercase tokenizer earlier in the analysis chain. For example, when using a custom analyzer, make sure the lowercase filter comes before the porter_stem filter in the list of filters.
shingle¶
type='shingle'
Constructs shingles (token n-grams), combinations of tokens as a single token, from a token stream.
Parameters
- max_shingle_size
 The maximum shingle size. Defaults to 2.
- min_shingle_size
 The minimum shingle size. Defaults to 2.
- output_unigrams
 If true the output will contain the input tokens (unigrams) as well as the shingles. Defaults to true.
- output_unigrams_if_no_shingles
 If output_unigrams is false the output will contain the input tokens (unigrams) if no shingles are available. Note if output_unigrams is set to true this setting has no effect. Defaults to false.
- token_separator
 The string to use when joining adjacent tokens to form a shingle. Defaults to " " (a single space).
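How the size bounds, output_unigrams and token_separator combine can be sketched in Python. This is an illustration rather than the actual filter: it ignores output_unigrams_if_no_shingles and token positions, and the function name shingles is hypothetical.

```python
def shingles(tokens, min_shingle_size=2, max_shingle_size=2,
             output_unigrams=True, token_separator=" "):
    """Build token n-grams (shingles) from a list of tokens."""
    out = []
    for start in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[start])
        for size in range(min_shingle_size, max_shingle_size + 1):
            # Only emit shingles that fit inside the token list.
            if start + size <= len(tokens):
                out.append(token_separator.join(tokens[start:start + size]))
    return out


print(shingles(["please", "divide", "this"]))
# ['please', 'please divide', 'divide', 'divide this', 'this']
print(shingles(["please", "divide", "this"], output_unigrams=False))
# ['please divide', 'divide this']
```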
stop¶
type='stop'
Removes stop words from token streams.
Parameters
- stopwords
 A list of stop words to use. Defaults to english stop words.
- stopwords_path
 A path (either relative to configuration location, or absolute) to a stopwords file configuration. Each stop word should be in its own “line” (separated by a line break). The file must be UTF-8 encoded.
- ignore_case
 Set to true to lower case all words first. Defaults to false.
- remove_trailing
 Set to false in order to not ignore the last term of a search if it is a stop word. Defaults to true.
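The effect of stopwords and ignore_case can be sketched in a few lines of Python. The sketch leaves out stopwords_path and remove_trailing, and stop_filter is a hypothetical name used only for this illustration.

```python
def stop_filter(tokens, stopwords, ignore_case=False):
    """Drop every token that appears in the stop word list."""
    if ignore_case:
        # Compare lower-cased forms, but keep surviving tokens unchanged.
        lowered = {word.lower() for word in stopwords}
        return [token for token in tokens if token.lower() not in lowered]
    exact = set(stopwords)
    return [token for token in tokens if token not in exact]


tokens = ["The", "quick", "fox", "and", "the", "dog"]
print(stop_filter(tokens, ["the", "and"]))
# ['The', 'quick', 'fox', 'dog']
print(stop_filter(tokens, ["the", "and"], ignore_case=True))
# ['quick', 'fox', 'dog']
```

With ignore_case left at false, the capitalized "The" survives, which is why a lowercase filter is usually placed before a stop filter.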
word_delimiter¶
type='word_delimiter'
Splits words into subwords and performs optional transformations on subword groups.
Parameters
- generate_word_parts
 If true causes parts of words to be generated: “PowerShot” ⇒ “Power” “Shot”. Defaults to true.
- generate_number_parts
 If true causes number subwords to be generated: “500-42” ⇒ “500” “42”. Defaults to true.
- catenate_words
 If true causes maximum runs of word parts to be catenated: “wi-fi” ⇒ “wifi”. Defaults to false.
- catenate_numbers
 If true causes maximum runs of number parts to be catenated: “500-42” ⇒ “50042”. Defaults to false.
- catenate_all
 If true causes all subword parts to be catenated: “wi-fi-4000” ⇒ “wifi4000”. Defaults to false.
- split_on_case_change
 If true causes “PowerShot” to be two tokens (“Power-Shot” remains two parts regardless). Defaults to true.
- preserve_original
 If true includes original words in subwords: “500-42” ⇒ “500-42” “500” “42”. Defaults to false.
- split_on_numerics
 If true causes “j2se” to be three tokens: “j” “2” “se”. Defaults to true.
- stem_english_possessive
 If true causes trailing “‘s” to be removed for each subword: “O’Neil’s” ⇒ “O”, “Neil”. Defaults to true.
- protected_words
 A list of words protected from being delimited.
- protected_words_path
 A relative or absolute path to a file configured with protected words (one on each line). If relative, automatically resolves to a config/ based location if it exists.
- type_table
 A custom type mapping table.
stemmer¶
type='stemmer'
A filter that stems words (similar to snowball, but with more options).
Parameters
- language/name
 arabic, armenian, basque, brazilian, bulgarian, catalan, czech, danish, dutch, english, finnish, french, german, german2, greek, hungarian, italian, kp, kstem, lovins, latvian, norwegian, minimal_norwegian, porter, portuguese, romanian, russian, spanish, swedish, turkish, minimal_english, possessive_english, light_finnish, light_french, minimal_french, light_german, minimal_german, hindi, light_hungarian, indonesian, light_italian, light_portuguese, minimal_portuguese, portuguese, light_russian, light_spanish, light_swedish.
keyword_marker¶
type='keyword_marker'
Protects words from being modified by stemmers. Must be placed before any stemming filters.
Parameters
- keywords
 A list of words to use.
- keywords_path
 A path (either relative to configuration location, or absolute) to a list of words.
- ignore_case
 Set to true to lower case all words first. Defaults to false.
kstem¶
type='kstem'
A high-performance stemming filter for English.
All terms must already be lowercased (use lowercase filter) for this filter to work correctly.
snowball¶
type='snowball'
A filter that stems words using a Snowball-generated stemmer.
Parameters
- language
 Possible values: Armenian, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, German2, Hungarian, Italian, Kp, Lovins, Norwegian, Porter, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
synonym¶
type='synonym'
Handles synonyms during the analysis process. Synonyms are configured using a file in the Solr/WordNet synonym format.
Parameters
- synonyms_path
 Path to synonyms configuration file, relative to the configuration directory.
- ignore_case
 Defaults to false.
- expand
 Defaults to true.
*_decompounder¶
type='dictionary_decompounder' or type='hyphenation_decompounder'
Decomposes compound words.
Parameters
- word_list
 A list of words to use.
- word_list_path
 A path (either relative to configuration location, or absolute) to a list of words.
- min_word_size
 Minimum word size (integer). Defaults to 5.
- min_subword_size
 Minimum subword size (integer). Defaults to 2.
- max_subword_size
 Maximum subword size (integer). Defaults to 15.
- only_longest_match
 Whether to match only the longest subword (boolean). Defaults to false.
elision¶
type='elision'
Removes elisions.
Parameters
- articles
 A set of stop word articles, for example ['j', 'l'] for content like J'aime l'odeur.
truncate¶
type='truncate'
Truncates tokens to a specific length.
Parameters
- length
 Number of characters to truncate to. Defaults to 10.
unique¶
type='unique'
Indexes only unique tokens during analysis. By default, it is applied to the whole token stream.
Parameters
- only_on_same_position
 If set to true, it will only remove duplicate tokens on the same position.
pattern_capture¶
type='pattern_capture'
Emits a token for every capture group in the regular expression.
Parameters
- preserve_original
 If set to true (the default), the original token is emitted as well.
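The idea of emitting one token per capture group can be sketched in Python. Note the assumptions: the patterns argument is part of this illustration and is not listed among the parameters above, Python's re syntax differs from Java's, and pattern_capture is a hypothetical function name.

```python
import re


def pattern_capture(token, patterns, preserve_original=True):
    """Emit one token per capture group matched inside token."""
    out = [token] if preserve_original else []
    for pattern in patterns:
        for match in re.finditer(pattern, token):
            # Every non-empty capture group becomes its own token.
            out.extend(group for group in match.groups() if group)
    return out


print(pattern_capture("abc123", [r"([a-z]+)", r"(\d+)"]))
# ['abc123', 'abc', '123']
print(pattern_capture("abc123", [r"([a-z]+)", r"(\d+)"],
                      preserve_original=False))
# ['abc', '123']
```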
pattern_replace¶
type='pattern_replace'
Handle string replacements based on a regular expression.
Parameters
- pattern
 Regular expression whose matches will be replaced.
- replacement
 The replacement, which can reference the original text with $1-like references (the first matched group).
limit¶
type='limit'
Limits the number of tokens that are indexed per document and field.
Parameters
- max_token_count
 The maximum number of tokens that should be indexed per document and field. The default is 1.
- consume_all_tokens
 If set to true, the filter exhausts the stream even if max_token_count tokens have been consumed already. The default is false.
hunspell¶
type='hunspell'
Basic support for Hunspell stemming. Hunspell dictionaries will be picked up
from the dedicated directory <path.conf>/hunspell. Each dictionary is
expected to have its own directory named after its associated locale
(language). This dictionary directory is expected to hold both the *.aff and
*.dic files (all of which will automatically be picked up).
Parameters
- ignore_case
 If true, dictionary matching will be case insensitive (defaults to false)
- strict_affix_parsing
 Determines whether errors while reading an affix rules file cause an exception or are simply ignored (defaults to true).
- locale
 A locale for this filter. If this is unset, the lang or language are used instead - so one of these has to be set.
- dictionary
 The name of a dictionary contained in <path.conf>/hunspell.
- dedup
 If only unique terms should be returned, this needs to be set to true. Defaults to true.
- recursion_level
 Configures the recursion level a stemmer can go into. Defaults to 2. Some languages (for example czech) give better results when set to 1 or 0, so you should test it out.
common_grams¶
type='common_grams'
Generates bigrams for frequently occurring terms. Single terms are still indexed. It can be used as an alternative to the stop Token filter when we don’t want to completely ignore common terms.
Parameters
- common_words
 A list of common words to use.
- common_words_path
 A path (either relative to configuration location, or absolute) to a list of common words. Each word should be in its own “line” (separated by a line break). The file must be UTF-8 encoded.
- ignore_case
 If true, common words matching will be case insensitive (defaults to false).
- query_mode
 Generates bigrams then removes common words and single terms followed by a common word (defaults to false).
Note
Either common_words or common_words_path must be given.
*_normalization¶
type='<language>_normalization'
Normalizes special characters of several languages.
Available languages:
arabic
bengali
german
hindi
indic
persian
scandinavian
serbian
sorani
delimited_payload¶
type='delimited_payload'
Split tokens up by delimiter (default |) into the real token being indexed
and the payload stored additionally into the index. For example
Trillian|65535 will be indexed as Trillian with 65535 as payload.
Parameters
- encoding
 How the payload should be interpreted. Possible values are real for float values, integer for integer values and identity for keeping the payload as byte array (string).
- delimiter
 The string used to separate the token and its payload.
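The Trillian|65535 example can be traced with a small Python sketch. It is illustrative only: split_payload is a hypothetical name, the encoding argument is made mandatory here because no default is stated above, and the sketch assumes the split happens at the first occurrence of the delimiter.

```python
def split_payload(token, encoding, delimiter="|"):
    """Split token into (indexed text, decoded payload)."""
    # Assumption: split at the first delimiter only.
    text, found, payload = token.partition(delimiter)
    if not found:
        # No delimiter: the token has no payload.
        return text, None
    if encoding == "real":
        return text, float(payload)
    if encoding == "integer":
        return text, int(payload)
    if encoding == "identity":
        # Keep the payload as raw bytes.
        return text, payload.encode("utf-8")
    raise ValueError("unknown encoding: " + encoding)


print(split_payload("Trillian|65535", "integer"))
# ('Trillian', 65535)
```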
keep¶
type='keep'
Keeps only the tokens defined in this filter's keep_words setting and its
variations.
All other tokens are filtered out. This filter works like an inverse stop token filter.
Parameters
- keep_words
 A list of words to keep and index as tokens.
- keep_words_path
 A path (either relative to configuration location, or absolute) to a list of words to keep and index.
Each word should be in its own “line” (separated by a line break). The file must be UTF-8 encoded.
stemmer_override¶
type='stemmer_override'
Override any previous stemmer that recognizes keywords with a custom mapping,
defined by rules or rules_path. One of these settings has to be set.
Parameters
- rules
 A list of rules for overriding, in the form of [<source>=><replacement>], e.g. "foo=>bar".
- rules_path
 A path to a file with one rule per line, like above.
cjk_bigram¶
type='cjk_bigram'
Handle Chinese, Japanese and Korean (CJK) bigrams.
Parameters
- output_bigrams
 Boolean flag to enable a combined unigram+bigram approach. Default is false, so single CJK characters that do not form a bigram are passed as unigrams. All non-CJK characters are output unmodified.
- ignored_scripts
 Scripts to ignore. Possible values: han, hiragana, katakana, hangul
*_stem¶
type='arabic_stem' or type='brazilian_stem' or type='czech_stem' or type='dutch_stem' or type='french_stem' or type='german_stem' or type='russian_stem'
A group of filters that applies language-specific stemmers to the token stream.
To prevent terms from being stemmed, put a keyword_marker token filter before
this filter in the token filter chain.
decimal_digit¶
A token filter that folds Unicode digits to 0-9.
remove_duplicates¶
A token filter that drops identical tokens at the same position.
phonetic¶
A token filter which converts tokens to their phonetic representation using
Soundex, Metaphone, and a variety of other algorithms.
Parameters
- encoder
 Which phonetic encoder to use. Accepts metaphone (default), double_metaphone, soundex, refined_soundex, caverphone1, caverphone2, cologne, nysiis, koelnerphonetik, haasephonetik, beider_morse, daitch_mokotoff.
- replace
 Whether or not the original token should be replaced by the phonetic token. Accepts true (default) and false. Not supported by beider_morse encoding.
Note
Be aware that replace: false can lead to unexpected behavior since the
original and the phonetically analyzed version are both kept at the same
token position. Some queries handle these stacked tokens in special ways. For
example, the fuzzy match query does not apply
fuzziness to stacked synonym tokens. This can lead to issues that are
difficult to diagnose and reason about. For this reason, it is often
beneficial to use separate fields for analysis with and without phonetic
filtering. That way searches can be run against both fields with differing
boosts and trade-offs (e.g. only run a fuzzy match query on the original text
field, but not on the phonetic version).
double_metaphone¶
If the double_metaphone encoder is used, then this additional parameter is supported:
Parameters
- max_code_len
 The maximum length of the emitted metaphone token. Defaults to 4.
beider_morse¶
If the beider_morse encoder is used, then these additional parameters are supported:
Parameters
- rule_type
 Whether matching should be exact or approx (default).
- name_type
 Whether names are ashkenazi, sephardic, or generic (default).
- languageset
 An array of languages to check. If not specified, then the language will be guessed. Accepts: any, common, cyrillic, english, french, german, hebrew, hungarian, polish, romanian, russian, spanish.
Built-in char filter¶
mapping¶
type='mapping'
Parameters
- mappings
 A list of mappings as strings of the form [<source>=><replacement>], e.g. "ph=>f".
- mappings_path
 A path to a file with one mapping per line, like above.
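A rough Python sketch of applying such mappings to the character stream is shown below. It assumes that the longest matching source wins at each position; apply_mappings and parse_mappings are hypothetical helper names, not part of CrateDB.

```python
def parse_mappings(mappings):
    """Turn ["ph=>f", ...] into a {"ph": "f", ...} lookup table."""
    table = {}
    for rule in mappings:
        source, _, replacement = rule.partition("=>")
        table[source.strip()] = replacement.strip()
    return table


def apply_mappings(text, mappings):
    """Replace every mapped source string before tokenization."""
    table = parse_mappings(mappings)
    # Assumption: try longer sources first so the longest match wins.
    sources = sorted((s for s in table if s), key=len, reverse=True)
    out = []
    position = 0
    while position < len(text):
        for source in sources:
            if text.startswith(source, position):
                out.append(table[source])
                position += len(source)
                break
        else:
            # No mapping applies here: copy the character unchanged.
            out.append(text[position])
            position += 1
    return "".join(out)


print(apply_mappings("philosophy phone", ["ph=>f"]))
# filosofy fone
```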
pattern_replace¶
type='pattern_replace'
Manipulates the characters in a string before analysis with a regex.
Parameters
- pattern
 Regex whose matches will be replaced.
- replacement
 Replacement string, which can reference replaced text with $1-like references (first matched element).
keep_types¶
type='keep_types'
Keeps only the tokens with a token type contained in a predefined set.
Parameters
- types
 A list of token types to keep.
min_hash¶
type='min_hash'
Hashes each token of the token stream and divides the resulting hashes into buckets, keeping the lowest-valued hashes per bucket. It then returns these hashes as tokens.
Parameters
- hash_count
 The number of hashes to hash the token stream with. Defaults to 1.
- bucket_count
 The number of buckets to divide the min hashes into. Defaults to 512.
- hash_set_size
 The number of min hashes to keep per bucket. Defaults to 1.
- with_rotation
 Whether or not to fill empty buckets with the value of the first non-empty bucket to its circular right. Only takes effect if hash_set_size is equal to one. Defaults to true if bucket_count is greater than 1, else false.
fingerprint¶
type='fingerprint'
Emits a single token which is useful for fingerprinting a body of text, and/or providing a token that can be clustered on. It does this by sorting the tokens, de-duplicating and then concatenating them back into a single token.
Parameters
- separator
 Separator which is used for concatenating the tokens. Defaults to a space.
- max_output_size
 If the concatenated fingerprint grows larger than max_output_size, the token filter will exit and will not emit a token. Defaults to 255.