How does search on AMO work?¶
High-level overview¶
AMO add-ons are indexed in our Elasticsearch cluster. For each search query someone makes on AMO, we run a custom set of full-text queries against that cluster.
Our autocomplete (that you can see when starting to type a few characters in the search field) uses the exact same implementation as a regular search underneath.
Rules¶
For each search query, we apply a number of rules that attempt to find the search terms in each add-on name, summary and description. Each rule generates a score that depends on:
The frequency of the terms in the field we’re looking at
The importance of each term in the overall index (the more common the term is across all add-ons, the less it impacts the score)
The length of the field (shorter fields give a higher score, as the search terms are considered more relevant if they make up a larger part of the field)
Each rule is also given a specific boost affecting its score, making matches against the add-on name more important than matches against the summary or description.
Add-on names receive special treatment: Partial or misspelled matches are accepted to some extent while exact matches receive a significantly higher score.
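The three factors listed above (term frequency, term rarity across the index, field length) are the ingredients of BM25, which is Elasticsearch's default similarity. The sketch below is only an illustration of how those factors interact, using the textbook BM25 formula with the usual defaults k1=1.2 and b=0.75. The function name, its parameters and any sample numbers are assumptions made for this example and are not taken from the AMO code.

```python
import math


def bm25_term_score(term_freq, field_len, avg_field_len, doc_count,
                    docs_with_term, k1=1.2, b=0.75, boost=1.0):
    """Score a single term in a single field, BM25-style (illustrative only)."""
    # Terms that are rare across the whole index get a larger weight.
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Fields longer than average dampen the term frequency contribution.
    length_norm = 1 - b + b * (field_len / avg_field_len)
    tf = (term_freq * (k1 + 1)) / (term_freq + k1 * length_norm)
    # The per-rule boost simply scales the result.
    return boost * idf * tf
```

With everything else equal, a match in a 3-token name scores higher than the same match in a 60-token description, and a term found in 5 add-ons scores higher than one found in 500.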
Scoring¶
The scores from all rules are combined into a final score, which we then modify depending on the add-on's popularity on a logarithmic scale. “Recommended” and “By Firefox” add-ons get an additional, significant boost to their score.
Finally, results are returned according to their score in descending order.
Technical overview¶
We store two kinds of data in the addons index: indexed fields that are used for search purposes, and non-indexed fields that are meant to be returned (often as-is with no transformations) by the search API (allowing us to return search results data without hitting the database). The latter is not relevant to this document.
Our search can be reached via the API through either /api/v5/addons/search/ or /api/v5/addons/autocomplete/, both of which are used by our frontend.
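As a rough illustration of how a client might address those two endpoints, the helper below only builds the request URL and performs no network call. The host name, the `q` parameter name and the helper itself are assumptions made for this example; check the API documentation for the parameters that are actually supported.

```python
from urllib.parse import urlencode

# Assumed host, used for illustration only.
BASE_URL = "https://addons.mozilla.org"


def search_url(query, autocomplete=False, **params):
    """Build a URL for the search or autocomplete endpoint (hypothetical helper)."""
    endpoint = "autocomplete" if autocomplete else "search"
    # `q` is assumed to carry the free-text search terms.
    query_string = urlencode({"q": query, **params})
    return f"{BASE_URL}/api/v5/addons/{endpoint}/?{query_string}"
```

Because both endpoints share the same implementation underneath, the only difference in this sketch is the path segment.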
Indexing¶
The key fields we search against are name, summary and description. Because all can be translated, we index them multiple times:
Once with the translation in the default locale of the add-on, under {field}, analyzed with just the snowball analyzer for description and summary, and a custom analyzer for name that applies the following filters: standard, word_delimiter (a custom version with preserve_original set to true), lowercase, stop, dictionary_decompounder (with a specific word list) and unique.
Once for every translation that exists for that field, using the Elasticsearch language-specific analyzer if supported, under {field}_l10n_{analyzer}.
In addition, for the name, we also have:
A raw subfield on each of the name fields described above, holding a non-analyzed variant for exact matches in the corresponding language (stored as a keyword, with a lowercase normalizer).
A name.trigram variant for the field in the default language, which uses a custom analyzer based on an ngram tokenizer (with min_gram=3, max_gram=3 and token_chars=["letter", "digit"]).
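To make the field layout for the name more concrete, here is a minimal sketch of what the corresponding Elasticsearch mapping and analysis settings could look like, written as Python dictionaries. The names `name_analyzer`, `trigram_analyzer`, `trigram_tokenizer` and `lowercase_keyword_normalizer` are hypothetical, as is the lowercase filter on the trigram analyzer; only the field names and the tokenizer parameters come from the description above.

```python
def name_field_mappings(language_analyzers):
    """Return illustrative mappings for `name` and its translated variants."""
    # Non-analyzed variant used for exact matches.
    raw = {"type": "keyword", "normalizer": "lowercase_keyword_normalizer"}
    mappings = {
        "name": {
            "type": "text",
            "analyzer": "name_analyzer",  # hypothetical name for the custom analyzer
            "fields": {
                "raw": dict(raw),
                "trigram": {"type": "text", "analyzer": "trigram_analyzer"},
            },
        }
    }
    # One extra field per supported language-specific analyzer.
    for analyzer in language_analyzers:
        mappings[f"name_l10n_{analyzer}"] = {
            "type": "text",
            "analyzer": analyzer,
            "fields": {"raw": dict(raw)},
        }
    return mappings


ANALYSIS_SETTINGS = {
    "tokenizer": {
        "trigram_tokenizer": {
            "type": "ngram",
            "min_gram": 3,
            "max_gram": 3,
            "token_chars": ["letter", "digit"],
        }
    },
    "analyzer": {
        "trigram_analyzer": {
            "type": "custom",
            "tokenizer": "trigram_tokenizer",
            "filter": ["lowercase"],  # assumption, not stated above
        }
    },
    "normalizer": {
        "lowercase_keyword_normalizer": {"type": "custom", "filter": ["lowercase"]}
    },
}
```

Calling `name_field_mappings(["french", "german"])` would yield `name`, `name_l10n_french` and `name_l10n_german`, each with a `raw` subfield, while only `name` gets the `trigram` subfield.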
Flow of a search query through AMO¶
Let’s assume we search on addons-frontend (not legacy). The search query hits the API and gets handled by AddonSearchView, which directly queries Elasticsearch and doesn’t involve the database at all.
There are a few filters that are described in the /api/v5/addons/search/ docs but most of them are not very relevant for text search queries. Examples are filters by guid, platform, category, add-on type or appversion (application version compatibility). Those filters are applied using a filter clause and shouldn’t affect scoring.
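The reason those filters do not affect scoring is that clauses placed in the filter context of an Elasticsearch bool query only include or exclude documents and contribute nothing to the score. A minimal sketch, assuming hypothetical index field names passed as keyword arguments and a placeholder text query:

```python
def filtered_query(text_query, **filters):
    """Wrap a scoring query with non-scoring term filters (illustrative only)."""
    return {
        "query": {
            "bool": {
                # Filter context: yes/no matching, no effect on the score.
                "filter": [{"term": {field: value}} for field, value in filters.items()],
                # Query context: this part produces the relevance score.
                "must": [text_query],
            }
        }
    }
```

For example, `filtered_query({"match": {"name": "tabs"}}, type="extension")` restricts the results to one add-on type while the ranking still comes entirely from the match on the name.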
Much more relevant for text searches (and this is primarily used when you use the search on the frontend) is SearchQueryFilter.
It composes various rules to define a more or less usable ranking:
Primary rules¶
These are the ones using the strongest boosts, so they are only applied to the add-on name.
Applied rules (merged via should):
A dis_max query with term matches on name_l10n_{analyzer}.raw and name.raw if the language of the request matches a known language-specific analyzer, or just a term query on name.raw (boost=100.0) otherwise. This is our attempt to implement exact matches.
If we have a matching language-specific analyzer, we add a match query to name_l10n_{analyzer} (boost=5.0, operator=and)
A phrase match on name that allows swapped terms (boost=8.0, slop=1)
A match on name, using the standard text analyzer (boost=6.0, analyzer=standard, operator=and)
A prefix match on name (boost=3.0)
If a query is < 20 characters long, a dis_max query (boost=4.0) composed of a fuzzy match on name (boost=4.0, prefix_length=2, fuzziness=AUTO, minimum_should_match=2<2 3<-25%) and a match query on name.trigram, with a minimum_should_match=66% to avoid noise
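The primary rules listed above can be sketched as Elasticsearch query DSL built from plain Python dictionaries. This is an approximation written from the description only, not the actual SearchQueryFilter code: the function name is hypothetical, and placing the 100.0 boost on the dis_max wrapper in the language-specific case is an assumption.

```python
def primary_should_rules(search_query, analyzer=None):
    """Build the `should` clauses applied to the add-on name (illustrative only)."""
    if analyzer:
        # Exact match in either the default-locale name or the translated name.
        exact = {"dis_max": {"boost": 100.0, "queries": [
            {"term": {"name.raw": search_query}},
            {"term": {f"name_l10n_{analyzer}.raw": search_query}},
        ]}}
    else:
        exact = {"term": {"name.raw": {"value": search_query, "boost": 100.0}}}
    rules = [exact]
    if analyzer:
        rules.append({"match": {f"name_l10n_{analyzer}": {
            "query": search_query, "boost": 5.0, "operator": "and"}}})
    rules += [
        {"match_phrase": {"name": {"query": search_query, "boost": 8.0, "slop": 1}}},
        {"match": {"name": {"query": search_query, "boost": 6.0,
                            "analyzer": "standard", "operator": "and"}}},
        {"prefix": {"name": {"value": search_query, "boost": 3.0}}},
    ]
    if len(search_query) < 20:
        # Fuzzy and trigram matching only for short queries.
        rules.append({"dis_max": {"boost": 4.0, "queries": [
            {"match": {"name": {"query": search_query, "boost": 4.0,
                                "prefix_length": 2, "fuzziness": "AUTO",
                                "minimum_should_match": "2<2 3<-25%"}}},
            {"match": {"name.trigram": {"query": search_query,
                                        "minimum_should_match": "66%"}}},
        ]}})
    return rules
```

The returned list is meant to be placed under the `should` key of a bool query, so that every matching rule adds to the score of an add-on.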
Secondary rules¶
These are the ones using the weakest boosts. They are applied to fields containing more text, like description, summary and tags.
Applied rules (merged via should):
Look for matches inside the summary (boost=3.0, operator=and)
Look for matches inside the description (boost=2.0, operator=and)
If the language of the request matches a known language-specific analyzer, those are made using a multi_match query using summary or description and the corresponding {field}_l10n_{analyzer}, similar to how exact name matches are performed above, in order to support potential translations.
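A minimal sketch of the secondary rules, again as plain Python dictionaries of query DSL. The function name is hypothetical, and the exact shape of the queries in the AMO code may differ; only the fields, boosts and operator come from the description above.

```python
def secondary_should_rules(search_query, analyzer=None):
    """Build the `should` clauses for summary and description (illustrative only)."""
    rules = []
    for field, boost in (("summary", 3.0), ("description", 2.0)):
        if analyzer:
            # Search the default-locale field and its translation together.
            rules.append({"multi_match": {
                "query": search_query,
                "fields": [field, f"{field}_l10n_{analyzer}"],
                "boost": boost,
                "operator": "and",
            }})
        else:
            rules.append({"match": {field: {
                "query": search_query, "boost": boost, "operator": "and"}}})
    return rules
```

With `analyzer="french"`, for instance, the summary rule would search both summary and summary_l10n_french.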
Scoring¶
We combine scores through a function_score query that multiplies the score by several factors:
A first multiplier is always applied through the field_value_factor function on average_daily_users with a log2p modifier
An additional 4.0 weight is applied if the add-on is public & non-experimental.
Finally, a 5.0 weight is applied to By Firefox and Recommended add-ons.
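To get a feel for the numbers, the multipliers described above can be reproduced in a few lines of Python. Elasticsearch documents the log2p modifier as the base-10 logarithm of the field value plus 2. The function and parameter names are made up for this example, and the sketch assumes all factors are simply multiplied together, as stated above.

```python
import math


def boosted_score(text_score, average_daily_users,
                  public_non_experimental=False, promoted=False):
    """Apply the popularity and status multipliers to a text score (illustrative only)."""
    # field_value_factor with the log2p modifier: log10(value + 2).
    score = text_score * math.log10(average_daily_users + 2)
    if public_non_experimental:
        score *= 4.0
    if promoted:  # stands for By Firefox or Recommended add-ons
        score *= 5.0
    return score
```

Because of the logarithm, going from 98 to 9,998 daily users only doubles the popularity factor (from 2 to 4), whereas the weights for public and promoted add-ons multiply the score by a fixed amount.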
On top of the two sets of rules above, a rescore query is applied with a window_size of 10. In production, we have 5 shards, so that should only re-adjust the score of the top 50 results returned. The rules used for rescoring are the same as those used in the secondary rules above, with just one difference: they use match_phrase instead of match, with a slop of 10.
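The rescore step could look roughly like the following request fragment. This is a sketch under assumptions: the function name is hypothetical, the boosts are carried over from the secondary rules on the assumption that "the same rules" includes them, and the language-specific case is expressed as a multi_match of type phrase.

```python
def rescore_clause(search_query, analyzer=None, window_size=10):
    """Build a `rescore` section using phrase matches (illustrative only)."""
    rules = []
    for field, boost in (("summary", 3.0), ("description", 2.0)):
        if analyzer:
            rules.append({"multi_match": {
                "query": search_query,
                "type": "phrase",
                "fields": [field, f"{field}_l10n_{analyzer}"],
                "boost": boost,
                "slop": 10,
            }})
        else:
            rules.append({"match_phrase": {field: {
                "query": search_query, "boost": boost, "slop": 10}}})
    return {
        # Applied per shard: 5 shards x 10 results = 50 rescored results overall.
        "window_size": window_size,
        "query": {"rescore_query": {"bool": {"should": rules}}},
    }
```

The returned dictionary would sit under the top-level `rescore` key of the search request, next to `query`.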
General query flow¶
Fetch current translation
Fetch locale specific analyzer (List of analyzers)
Apply primary and secondary should rules
Determine the score
Rescore the top 10 results per shard