# Advanced understanding of the Addok engine

## Foreword

## Importing

### What's a document?
By default, Addok imports documents as ndjson, where one line (reformatted here for readability) looks like this:
```json
{
  "id": "22003_0120",
  "name": "Rue des Lilas",
  "postcode": "22100",
  "citycode": ["22003"],
  "oldcitycode": null,
  "lon": -2.126067,
  "lat": 48.457051,
  "x": 321263.77,
  "y": 6829717.72,
  "city": ["Aucaleuc"],
  "oldcity": null,
  "context": "22, Côtes-d'Armor, Bretagne",
  "type": "street",
  "importance": 0.3562,
  "housenumbers": {
    "1": {
      "id": "22003_0120_00001",
      "x": 321239.59,
      "y": 6829718.52,
      "lon": -2.126394,
      "lat": 48.457044
    },
    "2": {
      "id": "22003_0120_00002",
      "x": 321242.31,
      "y": 6829714.77,
      "lon": -2.126354,
      "lat": 48.457012
    },
    "4": {
      "id": "22003_0120_00004",
      "x": 321309.47,
      "y": 6829719.77,
      "lon": -2.125452,
      "lat": 48.457096
    }
  }
}
```
This document must match the `FIELDS` in the configuration, which looks like:
```python
FIELDS = [
    {"key": "name", "boost": 4, "null": False},
    {"key": "street"},
    {"key": "city"},
    {"key": "housenumbers"},
    {"key": "context"},
]
```
Those are the fields that Addok will index, and that will be searchable. A document can contain many other fields, which will be returned in the final response.
Addok wants at least a name field, which can either be named `name`, have another name plus the `type="name"` attribute, or be declared through the `NAME_FIELD` configuration key.
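For instance, if the name lives under another key in your documents (the `label` key used here is purely hypothetical), either form could be used:

```python
# Hypothetical: documents carry their name under a "label" key.
# Option 1: mark the field with the type attribute.
FIELDS = [
    {"key": "label", "type": "name", "boost": 4, "null": False},
    {"key": "city"},
]
# Option 2: declare the name field through the dedicated configuration key.
NAME_FIELD = "label"
```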
If you want to control the internal document ids (useful for reindexing, for example), you need to either declare an `ID_FIELD`, have a field `_id` in the document, or have a field with the `type="id"` attribute.
See configuration for more about this setting.
Also, see `BATCH_FILE_LOADER_PYPATH` to load other file formats.
### Preparing
Before indexing, for each document, Addok will iterate over the `BATCH_PROCESSORS`, which are by default:
```python
BATCH_PROCESSORS_PYPATHS = [
    "addok.batch.to_json",
    # Apply the string preprocessor to the housenumbers values, so they can be
    # searched like any other string. Eg. with addok-fr, "1" will be turned
    # into "un".
    "addok.helpers.index.prepare_housenumbers",
    # Asks the document store to actually store this given document.
    "addok.ds.store_documents",
    # Actually indexes the document.
    "addok.helpers.index.index_documents",
]
```
These processors prepare the data before indexing. One may for example add a processor that computes the document importance automatically from another field (eg. population), or that maps a different JSON structure to the final structure Addok expects, etc.
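For instance, a custom processor computing importance from population could look like this (a sketch: the `population` field name is an assumption, and pipeline steps are assumed to receive an iterable of documents and yield them back, like the default processors):

```python
# Hypothetical batch processor: derive "importance" from a "population"
# field (both field names are assumptions for this example).
def compute_importance(docs):
    for doc in docs:
        population = doc.get("population") or 0
        # Clamp to [0, 1]: a city of 1M+ inhabitants gets full importance.
        doc["importance"] = min(population / 1_000_000, 1.0)
        yield doc
```

It would then be referenced in `BATCH_PROCESSORS_PYPATHS` before the storing and indexing steps.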
### Indexing
The indexing process is controlled by the configuration key `INDEXERS`, which looks like:
```python
INDEXERS_PYPATHS = [
    "addok.helpers.index.HousenumbersIndexer",
    "addok.helpers.index.FieldsIndexer",
    # Pairs indexer must be after `FieldsIndexer`.
    "addok.pairs.PairsIndexer",
    # Edge ngram indexer must be after `FieldsIndexer`.
    "addok.autocomplete.EdgeNgramIndexer",
    "addok.helpers.index.FiltersIndexer",
    "addok.helpers.index.GeohashIndexer",
]
```
What does "indexing" mean in detail? It creates keys and values in the Redis database.
The main indexer is the `FieldsIndexer`. It will break down each field into tokens (small pieces of text, see "String processing" below) and create sorted sets in Redis. Basically, each token becomes a sorted set key, whose values are document ids with a computed weight score (see "Computing weight" below).
For example, indexing the example document shown above will create an index structure like this (non exhaustive; dummy scores in parentheses):

| w\|ru | w\|de | w\|lila | w\|aukaleuk |
|---|---|---|---|
| 22003_0120 (1.008) | 22003_0120 (1.008) | 22003_0120 (1.008) | 22003_0120 (1.032) |
Now if we also index a document `Rue des Lauriers` in the same city, the index will look like this:

| w\|ru | w\|de | w\|lila | w\|aukaleuk | w\|laurier |
|---|---|---|---|---|
| 22003_0120 (1.008) | 22003_0120 (1.008) | 22003_0120 (1.008) | 22003_0120 (1.032) | |
| 22003_0100 (1.008) | 22003_0100 (1.008) | | 22003_0100 (1.032) | 22003_0100 (1.032) |
The `HousenumbersIndexer` and `GeohashIndexer` will create geohash index entries, in order to perform location bias and reverse search.
The `FiltersIndexer` will create sets with the values of the `FILTERS` configuration key. Example with the previous documents:

| f\|citycode\|22003 |
|---|
| 22003_0120 |
| 22003_0100 |
The `PairsIndexer` and the `EdgeNgramIndexer` will not create the same index structure: values will not be document ids. The `PairsIndexer` will index all tokens that have been seen with a given token. Still with the same example, it will look like this:

| p\|ru | p\|de | p\|lila | p\|aukaleuk | p\|laurier |
|---|---|---|---|---|
| de | lila | de | de | de |
| lila | ru | ru | ru | ru |
| laurier | laurier | laurier | laurier | aukaleuk |
| aukaleuk | aukaleuk | aukaleuk | lila | lila |
This index will be used to target relevant tokens when computing autocomplete and fuzzy matching.
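The pairs structure can be modeled in pure Python (an illustrative model only: in Redis, each `p|token` key is a set, which a dict of sets stands in for here):

```python
from itertools import permutations

# Illustrative model of the pairs index: for each token, record every other
# token seen in the same document.
def index_pairs(index, tokens):
    for a, b in permutations(set(tokens), 2):
        index.setdefault(a, set()).add(b)

index = {}
index_pairs(index, ["ru", "de", "lila", "aukaleuk"])     # Rue des Lilas
index_pairs(index, ["ru", "de", "laurier", "aukaleuk"])  # Rue des Lauriers
# index["lila"] now holds {"ru", "de", "aukaleuk"}
```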
The `EdgeNgramIndexer` will list all tokens that start with a given prefix. For example:

| n\|lil | n\|lau | n\|laur | n\|lauri |
|---|---|---|---|
| lila | laurier | laurier | laurier |
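Edge ngram generation can be sketched like this (the minimum prefix length of 3 is an assumption for this example):

```python
# Sketch of edge ngram generation: every strict prefix of a token, from an
# assumed minimum length of 3, becomes an "n|..." index key pointing to the
# full token.
def edge_ngrams(token, min_length=3):
    return [token[:i] for i in range(min_length, len(token))]

edge_ngrams("laurier")  # ['lau', 'laur', 'lauri', 'laurie']
```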
### Computing weight
When indexing, each document is associated with all the tokens (words) it contains. This relation is weighted. For example, in France, streets are often called `rue de Something`, but there is also a city called `Rue`. So if someone types `rue` without any other info (no location bias, no filter…), we want to find the city: the thousands of streets are not relevant in this case. This is why Addok uses sorted sets and not plain sets.
This "boost" value is computed this way:
- a given field (as in FIELDS
) can have a boost (which can also be a callable
that takes the document as argument); so one may boost differently the name
field (higher) than the state
(lower), for example. Also, using a callable
may allow to boost some field (eg. citycode
) only for some document types
(eg. city
)
- a given document can have an importance (which is usually a property given
in the dataset); this importance may for example be related to the size of the
city, so a same street rue des Lilas
(without any other context) will first
return the street of a big city instead of a random small town
- those two first scores are added, then divided by the number of tokens in the
indexed field. So for example lilas
with return first the city les lilas
than a street called rue des lilas bleux
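The steps above can be sketched as follows (a simplified model with made-up boost and importance values; Addok's actual formula may differ):

```python
# Simplified model of the weight described above: (field boost + document
# importance) divided by the number of tokens in the indexed field.
def token_weight(boost, importance, token_count):
    return (boost + importance) / token_count

# "les lilas" (city, 2 tokens, higher importance) vs
# "rue des lilas bleus" (street, 4 tokens, lower importance):
city_weight = token_weight(4, 0.7, 2)     # 2.35
street_weight = token_weight(4, 0.2, 4)   # 1.05
```

The shorter, more important document wins, as described above.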
### String processing
In Addok, each field value is split and reduced into small tokens, which are roughly cleaned words. Addok core does part of this work but, given it is often language dependent, most of it is done in plugins (for French, `addok-fr`).
This part is controlled by the `PROCESSORS`, which is by default:
```python
PROCESSORS_PYPATHS = [
    "addok.helpers.text.tokenize",
    "addok.helpers.text.normalize",
    "addok.helpers.text.flag_housenumber",
    "addok.helpers.text.synonymize",
]
```
The `tokenize` function will actually split the value into instances of `addok.helpers.text.Token`, which is roughly an extension of the Python `str`, with custom behaviour and properties (like the `position` in the initial string, the original string value before any cleaning, or many helpers to compute the frequency in the index…).
The `normalize` helper will remove any diacritics and lowercase all the tokens.
The `flag_housenumber` is a very minimal way of marking a token as a housenumber. It is supposed to be overridden with country specific rules (which is done for example in the `addok-france` plugin).
The `synonymize` will replace tokens by others, using mapping files set in the configuration. This is also usually custom and business specific.
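A minimal custom processor in this pipeline could look like this (a sketch: the iterable-in / iterable-out shape is an assumption for the example, and the real helpers operate on `addok.helpers.text.Token` instances rather than plain strings):

```python
import unicodedata

# Sketch of a "normalize"-style processor: strip diacritics and lowercase
# each token.
def my_normalize(tokens):
    for token in tokens:
        decomposed = unicodedata.normalize("NFKD", token)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        yield stripped.lower()

list(my_normalize(["Côtes-d'Armor"]))  # ["cotes-d'armor"]
```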
A classic France-based Addok instance will have those string processors:
```python
PROCESSORS_PYPATHS = [
    "addok.helpers.text.tokenize",
    "addok.helpers.text.normalize",
    "addok_france.glue_ordinal",
    "addok_france.fold_ordinal",
    "addok_france.flag_housenumber",
    "addok.helpers.text.synonymize",
    "addok_fr.phonemicize",
]
```
Where we can see:

- `glue_ordinal` and `fold_ordinal`, which will deal with `bis`, `ter` and so on
- an override of the default `flag_housenumber` helper
- a `phonemicize` helper, that will try to reduce a token to its phonemes, trying for instance to remove plural marks
## Searching

### Cleaning the input
Before trying to search the index, Addok will clean the human input. While the document values are supposed to be clean at import time (eg. no useless words), this may not be the case for a value coming from humans.
This part is controlled by the `QUERY_PROCESSORS`. Here again, the Addok core does very little, and lets plugins implement country and business specific rules.
Here is a classic configuration for a France instance:
```python
QUERY_PROCESSORS_PYPATHS = [
    "addok.helpers.text.check_query_length",
    "addok_france.extract_address",
    "addok_france.clean_query",
    "addok_france.remove_leading_zeros",
]
```
Where `extract_address` and `clean_query` will try to extract the relevant part of the address, excluding any useless information (like `bâtiment B` or `Cedex XX`).
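As an illustration, a custom query processor in the same spirit could strip a trailing "Cedex" mention (the signature here, a function receiving and returning the query string, is an assumption; check the plugin sources for the real one):

```python
import re

# Hypothetical query processor: drop a trailing "Cedex NN" mention from the
# human input before searching.
def remove_cedex(query):
    return re.sub(r"\bcedex ?\d*\b", "", query, flags=re.IGNORECASE).strip()

remove_cedex("12 rue des Lilas 75019 Paris CEDEX 19")
# '12 rue des Lilas 75019 Paris'
```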
### Preparing tokens
As a first step, Addok will qualify each token, in order to separate them into:

- housenumber tokens
- common tokens (meaning very common in the index)
- not found tokens (so not in the index at all)
- other tokens, which are called "meaningful tokens" in the code, and are considered the most relevant ones to identify the search
This first step is controlled by the configuration key `SEARCH_PREPROCESSORS`.
### Querying the index
Here is the core part of the "search".
Keep in mind that what we call "search" here (a.k.a. geocoding) consists in trying to guess a structured document from an unstructured string. The key word here is "guess".
This part is controlled by the `RESULTS_COLLECTORS`.
Here is what it looks like:
```python
RESULTS_COLLECTORS_PYPATHS = [
    "addok.autocomplete.only_commons_but_geohash_try_autocomplete_collector",
    "addok.helpers.collectors.no_tokens_but_housenumbers_and_geohash",
    "addok.helpers.collectors.no_available_tokens_abort",
    "addok.helpers.collectors.only_commons",
    "addok.autocomplete.no_meaningful_but_common_try_autocomplete_collector",
    "addok.autocomplete.only_commons_try_autocomplete_collector",
    "addok.helpers.collectors.bucket_with_meaningful",
    "addok.helpers.collectors.reduce_with_other_commons",
    "addok.helpers.collectors.ensure_geohash_results_are_included_if_center_is_given",  # noqa
    "addok.autocomplete.autocomplete_meaningful_collector",
    "addok.fuzzy.fuzzy_collector",
    "addok.helpers.collectors.extend_results_extrapoling_relations",
    "addok.helpers.collectors.extend_results_reducing_tokens",
]
```
The key concept to keep in mind is that Addok will select some tokens and ask Redis for sorted set intersections. Remember that each sorted set has a token (a word) as key, and document ids as values. So we basically ask Redis: do you have document ids that are in all those sorted sets? In other words: do you know documents that contain all those words?
All the results collection is just about trying and trying again with some tokens, adding or removing tokens, to adjust the collected results.
Addok calls the results collection a "bucket". It will try to fill in this bucket until it has enough results.
While it collects those results, it will compute a score (see "Scoring results" below), in order to decide whether to continue collecting, according to the "quality" of those results. The good quality results are internally called the "cream" (like in a bucket of fresh milk).
So let's take a very minimal example, still based on the same documents we've used before. Our index looks like this:

| w\|ru | w\|de | w\|lila | w\|aukaleuk | w\|laurier |
|---|---|---|---|---|
| 22003_0120 (1.008) | 22003_0120 (1.008) | 22003_0120 (1.008) | 22003_0120 (1.032) | |
| 22003_0100 (1.008) | 22003_0100 (1.008) | | 22003_0100 (1.032) | 22003_0100 (1.008) |
We certainly have other "rue des Lilas" in other cities.
Let's imagine someone searches "rue des lilas Aucaleuc". Addok will first qualify
the tokens, more or less like this:
- housenumber tokens: none
- common tokens: `ru`, `de` (those are very common in the index, which means that their sorted sets have a lot of values)
- not found tokens: none
- meaningful tokens: `lila`, `aukaleuk`
Then Addok will ask Redis for an intersection between those two sorted sets: `w|lila` and `w|aukaleuk`. It will find one document id.
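This intersection step can be simulated in pure Python (a model of what Redis' `ZINTERSTORE` does; the `93045_0001` id stands for a hypothetical "Rue des Lilas" in another city):

```python
# Pure-Python model of the sorted-set intersection delegated to Redis
# (ZINTERSTORE): keep only ids present in every set, summing their scores.
def zinter(*sorted_sets):
    common = set.intersection(*(set(s) for s in sorted_sets))
    return {doc_id: sum(s[doc_id] for s in sorted_sets) for doc_id in common}

w_lila = {"22003_0120": 1.008, "93045_0001": 1.008}  # extra id is hypothetical
w_aukaleuk = {"22003_0120": 1.032}
zinter(w_lila, w_aukaleuk)  # only 22003_0120 survives
```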
At this stage, the bucket contains one result (we know it's the "good" result, but Addok does not).
One result is not enough for Addok to be confident it has searched enough, so it will now try again with fewer tokens: in this case, with only one token, the less frequent of `w|lila` and `w|aukaleuk`.
It will certainly start with `aukaleuk`, because there are certainly fewer documents related to the city of `Aucaleuc` than documents related to `Lilas` in the whole country.
Those results will be added to the bucket. If the bucket contains fewer than `config.BUCKET_MIN` results, Addok will issue a new request with the other token.
Addok will again try to extract the "cream" from the bucket, and given it has one good result (see "Scoring results") and a bucket with enough results, it will return the best scored results and stop here.
This is a very simple version of the bucket heuristic. Basically, each collector from the `RESULTS_COLLECTORS` will:

- try to determine whether it should act or not according to the bucket state:
    - is it empty?
    - does it have few results (less than `config.BUCKET_MIN`)?
    - does it overflow (more results than `config.BUCKET_MAX`)?
    - does it have cream (results with score > `config.MATCH_THRESHOLD`)?
- compute a set of keys for an intersection in Redis
- add the results to the bucket
- let the next collector act
Each collector has its own heuristic, for example:

- `addok.autocomplete.only_commons_but_geohash_try_autocomplete_collector`: there are only common tokens (so a high intersection cost), but there is a location bias in the request, so let's intersect those common tokens PLUS the geohash set
- `addok.helpers.collectors.no_tokens_but_housenumbers_and_geohash`: very specific case for searches starting with a housenumber, like `8 rue des`, when autocomplete is active
- `addok.helpers.collectors.no_available_tokens_abort`: no usable token
- `addok.helpers.collectors.only_commons`: this is one of the main collectors, which deals with the case where we have only common tokens
- `addok.fuzzy.fuzzy_collector`: tries to extend tokens with fuzzy matching to find more results in case the bucket is empty
- `addok.helpers.collectors.extend_results_extrapoling_relations`: that one tries to be smart by finding the tokens that are most often seen together, so intersections have more chance of finding something
- `addok.helpers.collectors.extend_results_reducing_tokens`: that one is a bit of a broom wagon: we have found nothing until then, let's try with fewer tokens
- …
Example of a custom collector:

```python
from addok.helpers import keys


def target_city(helper):
    # Dumb way of targeting only cities when the search string is very small.
    if len(helper.query) <= 4 and not helper.autocomplete:
        helper.filters.append(keys.filter_key("municipality"))


RESULTS_COLLECTORS_PYPATHS = [
    "addok.autocomplete.only_commons_but_geohash_try_autocomplete_collector",
    ...,
    target_city,
    ...
]
```
### Filters
Filters are also indexed as plain `set`s in Redis, and are used in the intersections when collecting results. Filters are fields that have been listed in the `FILTERS` configuration key. Unlike indexed fields, the filters are not processed at all, and only perfect matches are considered.
### Scoring results
While collecting results, from time to time (when it needs to know whether to continue collecting or not), Addok will compute a score for each result.
This score is controlled by the `SEARCH_RESULT_PROCESSORS` configuration key.
This score is composed of:

- mainly, the string distance between the original searched string and some computed labels of the result
- a score based on the importance of the document
- if a location bias is included in the search, a score based on the geographical distance
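One possible reconstruction of how these components could combine is a weighted average (the 1.0 and 0.1 weights are inferred from the `/0.1` and `/1.0` caps visible in the EXPLAIN output further down; Addok's actual formula may differ):

```python
# Hypothetical weighted average of the partial scores. The weights are
# assumptions: string distance counted out of 1.0, importance out of 0.1.
def final_score(str_distance, importance):
    weights = {"str_distance": 1.0, "importance": 0.1}
    total = (str_distance * weights["str_distance"]
             + importance * weights["importance"])
    return total / sum(weights.values())

final_score(1.0, 0.739)  # ≈ 0.976, a perfect string match with importance 0.739
```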
## Hacking with the shell
Addok comes with a shell to debug its internals and better understand its behaviour.
Here is an example of running a simple search:
```
$ addok shell
Addok 1.1.0rc1
Loaded local config from addok/config/local.py
Loaded plugins:
addok.shell==1.1.0rc1, addok.http.base==1.1.0rc1, addok.batch==1.1.0rc1, addok.pairs==1.1.0rc1, addok.fuzzy==1.1.0rc1, addok.autocomplete==1.1.0rc1, csv==1.1.0, addok_fr_admin==0.0.1, fr==1.0.1, france==1.1.3
Welcome to the Addok shell o/
Type HELP or ? to list commands.
Type QUIT or ctrl-C or ctrl-D to quit.
> rue des lilas
Rue des Lilas 75019 Paris (73Ow | 0.976310909090909)
Rue des Lilas 77500 Chelles (3r99 | 0.9700127272727271)
Rue des Lilas 22190 Plérin (r16B | 0.9618445454545452)
Rue des Lilas 77330 Ozoir-la-Ferrière (OZMg | 0.960891818181818)
Rue des Lilas 22440 Ploufragan (6qBz | 0.9599363636363636)
Rue des Lilas 22960 Plédran (G0Gy | 0.9592181818181817)
Rue des Lilas 22950 Trégueux (xmvr | 0.959041818181818)
Rue des Lilas 77700 Magny-le-Hongre (lD8V | 0.9575345454545454)
Rue des Lilas 77360 Vaires-sur-Marne (919P | 0.9572699999999998)
Rue des Lilas 22400 Lamballe-Armor (0wRK | 0.9571363636363636)
12.7 ms — 1 run(s) — 10 results
--------------------------------------------------------------------------------
```
Same search, but in EXPLAIN mode:
```
> EXPLAIN rue des lilas
[0.959] Taken tokens: [<Token lila>]
[0.982] Common tokens: [<Token ru>, <Token de>]
[0.984] Housenumbers token: []
[0.987] Not found tokens: []
[1.005] Filters: []
[1.011] ** TARGET_CITY **
[1.014] ** ONLY_COMMONS_BUT_GEOHASH_TRY_AUTOCOMPLETE_COLLECTOR **
[1.017] ** NO_TOKENS_BUT_HOUSENUMBERS_AND_GEOHASH **
[1.020] ** NO_AVAILABLE_TOKENS_ABORT **
[1.023] ** ONLY_COMMONS **
[1.026] ** NO_MEANINGFUL_BUT_COMMON_TRY_AUTOCOMPLETE_COLLECTOR **
[1.028] ** ONLY_COMMONS_TRY_AUTOCOMPLETE_COLLECTOR **
[1.031] ** BUCKET_WITH_MEANINGFUL **
[1.039] New bucket with keys ['w|lila', 'w|ru'] and limit 10
[1.443] 10 ids in bucket so far
[1.453] New bucket with keys ['w|lila', 'w|ru'] and limit 0
[1.814] 51 ids in bucket so far
[1.821] ** REDUCE_WITH_OTHER_COMMONS **
[1.830] ** ENSURE_GEOHASH_RESULTS_ARE_INCLUDED_IF_CENTER_IS_GIVEN **
[1.833] ** FUZZY_COLLECTOR **
[1.840] ** AUTOCOMPLETE_MEANINGFUL_COLLECTOR **
[1.845] Autocompleting lila
[2.019] No candidates. Aborting.
[2.023] ** EXTEND_RESULTS_EXTRAPOLING_RELATIONS **
[2.026] ** EXTEND_RESULTS_REDUCING_TOKENS **
[2.032] Computing results
[2.041] Done getting results data
[7.994] Done computing results
Rue des Lilas 75019 Paris (73Ow | importance: 0.0739/0.1, str_distance: 1.0/1.0)
Rue des Lilas 77500 Chelles (3r99 | importance: 0.067/0.1, str_distance: 1.0/1.0)
Rue des Lilas 22190 Plérin (r16B | importance: 0.058/0.1, str_distance: 1.0/1.0)
Rue des Lilas 77330 Ozoir-la-Ferrière (OZMg | importance: 0.057/0.1, str_distance: 1.0/1.0)
Rue des Lilas 22440 Ploufragan (6qBz | importance: 0.0559/0.1, str_distance: 1.0/1.0)
Rue des Lilas 22960 Plédran (G0Gy | importance: 0.0551/0.1, str_distance: 1.0/1.0)
Rue des Lilas 22950 Trégueux (xmvr | importance: 0.0549/0.1, str_distance: 1.0/1.0)
Rue des Lilas 77700 Magny-le-Hongre (lD8V | importance: 0.0533/0.1, str_distance: 1.0/1.0)
Rue des Lilas 77360 Vaires-sur-Marne (919P | importance: 0.053/0.1, str_distance: 1.0/1.0)
Rue des Lilas 22400 Lamballe-Armor (0wRK | importance: 0.0529/0.1, str_distance: 1.0/1.0)
8.3 ms — 1 run(s) — 10 results
--------------------------------------------------------------------------------
```