Server side search
Read the Docs uses Elasticsearch instead of the built-in Sphinx search to provide better search results. Documents are indexed in an Elasticsearch index, and searches are made through the API. All the search code is open source and lives in the GitHub repository. We currently use Elasticsearch 6.3.
Local development configuration
Elasticsearch is installed and run as part of the development installation guide.
Indexing into Elasticsearch
To use search, you need to index data into the Elasticsearch index. Run the reindex_elasticsearch management command:
inv docker.manage reindex_elasticsearch
For performance reasons, we implemented our own management command rather than using the built-in one provided by the django-elasticsearch-dsl package.
Auto indexing
By default, auto indexing is turned off in development mode. To turn it on, change the ELASTICSEARCH_DSL_AUTOSYNC setting to True in the readthedocs/settings/dev.py file.
After that, whenever documentation builds successfully or a project is added, the search index is updated automatically.
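As a minimal sketch, the change described above amounts to a single assignment. Whether it belongs at module level or as an attribute of a settings class depends on how readthedocs/settings/dev.py is structured, which is not shown here, so treat the placement as an assumption.

```python
# Sketch of the setting change only; the placement inside
# readthedocs/settings/dev.py (module level vs. a settings class
# attribute) is an assumption and should follow the file's own layout.
ELASTICSEARCH_DSL_AUTOSYNC = True  # update the search index automatically
```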
Architecture
The search architecture is divided into two parts:
One part is responsible for indexing the documents and projects (documents.py).
The other part is responsible for querying the index to show the proper results to users (faceted_search.py).
We use the django-elasticsearch-dsl package for our Document abstraction. django-elasticsearch-dsl is a wrapper around elasticsearch-dsl for easy configuration with Django.
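To make the querying half more concrete, here is a rough sketch of what a faceted search request body can look like in the Elasticsearch query DSL. It is not taken from faceted_search.py: the function name build_faceted_query and the field names content, project, and version are hypothetical, and the real code may build its queries quite differently (for example through elasticsearch-dsl classes).

```python
from typing import Any, Dict, List, Optional


def build_faceted_query(
    text: str,
    search_fields: List[str],
    facet_fields: List[str],
    filters: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Build a search body: full-text match, term filters, one terms aggregation per facet.

    Hypothetical helper for illustration; not part of Read the Docs.
    """
    return {
        "query": {
            "bool": {
                # Full-text part: match the user's text against several fields.
                "must": [{"multi_match": {"query": text, "fields": search_fields}}],
                # Facet selections narrow the results without affecting scoring.
                "filter": [
                    {"term": {name: value}} for name, value in (filters or {}).items()
                ],
            }
        },
        # Each facet is a terms aggregation that counts hits per distinct value.
        "aggs": {name: {"terms": {"field": name}} for name in facet_fields},
    }


# Field names other than "title" are assumptions made for this example.
body = build_faceted_query(
    "install",
    search_fields=["title", "content"],
    facet_fields=["project", "version"],
    filters={"project": "docs"},
)
```

The aggregations are what make the search "faceted": alongside the hits, Elasticsearch returns a count per distinct value of each facet field, which the UI can show as filter options.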
Indexing
All Sphinx documents are indexed into Elasticsearch after a successful build. Currently, we do not index MkDocs documents into Elasticsearch, but any kind of help is welcome.
Troubleshooting
If you get an error like:
RequestError(400, 'search_phase_execution_exception', 'failed to create query: ...
You can fix this by deleting the page index and re-indexing:
inv docker.manage 'search_index --delete'
inv docker.manage reindex_elasticsearch
How we index documentation
After any build finishes successfully, HTMLFile objects are created for each of the HTML files, and the old version’s HTMLFile objects are deleted. By default, the django-elasticsearch-dsl package listens to the post_save/post_delete signals to index/delete documents, but this has performance drawbacks because it sends an HTTP request whenever any HTMLFile object is created or deleted. To optimize performance, bulk_post_create and bulk_post_delete signals are dispatched with a list of HTMLFile objects so that documents can be bulk indexed in Elasticsearch (bulk_post_create is dispatched for created objects and bulk_post_delete for deleted objects). Both signals are dispatched with the list of HTMLFile instances in the instance_list parameter.
We listen to the bulk_post_create and bulk_post_delete signals in our Search application and index/delete the documentation content from the HTMLFile instances.
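The difference between per-object and bulk signals can be illustrated with a small, self-contained sketch. Everything in it is a stand-in: the Signal class only mimics the connect/send shape of a Django signal, HTMLFile is a plain dataclass rather than the real model, and FakeSearchBackend counts requests instead of talking to Elasticsearch. Only the signal name bulk_post_create and the instance_list parameter come from the description above.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class HTMLFile:
    """Hypothetical stand-in for the real HTMLFile model."""
    path: str
    title: str


@dataclass
class FakeSearchBackend:
    """Counts simulated HTTP requests instead of calling Elasticsearch."""
    requests: int = 0
    indexed: Dict[str, str] = field(default_factory=dict)

    def index_one(self, html_file: HTMLFile) -> None:
        self.requests += 1  # one request per object
        self.indexed[html_file.path] = html_file.title

    def bulk_index(self, html_files: List[HTMLFile]) -> None:
        self.requests += 1  # one request for the whole batch
        for html_file in html_files:
            self.indexed[html_file.path] = html_file.title


class Signal:
    """Tiny stand-in for a Django signal (connect and send only)."""

    def __init__(self) -> None:
        self._receivers: List[Callable[..., Any]] = []

    def connect(self, receiver: Callable[..., Any]) -> None:
        self._receivers.append(receiver)

    def send(self, sender: Any, **kwargs: Any) -> list:
        return [(r, r(sender=sender, **kwargs)) for r in self._receivers]


bulk_post_create = Signal()


def make_bulk_receiver(backend: FakeSearchBackend) -> Callable[..., None]:
    def index_html_files(sender: Any, instance_list: List[HTMLFile], **kwargs: Any) -> None:
        # The receiver gets every created object at once via instance_list.
        backend.bulk_index(instance_list)
    return index_html_files


files = [HTMLFile(f"page{i}.html", f"Page {i}") for i in range(100)]

# Per-object handling: one simulated request for each HTMLFile.
per_object = FakeSearchBackend()
for html_file in files:
    per_object.index_one(html_file)

# Bulk handling: one signal carrying the whole list, one simulated request.
bulk = FakeSearchBackend()
bulk_post_create.connect(make_bulk_receiver(bulk))
bulk_post_create.send(sender=HTMLFile, instance_list=files)

print(per_object.requests, bulk.requests)  # prints: 100 1
```

Both approaches end up with the same indexed content; the bulk variant simply replaces one round trip per file with one round trip per build.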
How we index projects
We also index project information in our search index so that users can search for projects from the main site. We listen to the post_save and post_delete signals of the Project model and index/delete into Elasticsearch accordingly.
Elasticsearch document
elasticsearch-dsl provides a model-like wrapper for Elasticsearch documents. As required by django-elasticsearch-dsl, these are stored in the readthedocs/search/documents.py file.
ProjectDocument: It is used for indexing projects. The signal listener of django-elasticsearch-dsl listens to the post_save signal of the Project model and then indexes/deletes into Elasticsearch.
PageDocument: It is used for indexing the documentation of projects. As mentioned above, our Search app listens to the bulk_post_create and bulk_post_delete signals and indexes/deletes documentation in Elasticsearch. The signal listeners are in the readthedocs/search/signals.py file. Both signals are dispatched after a successful documentation build.
The fields and ES datatypes are specified in PageDocument. The indexable data is taken from the processed_json property of HTMLFile. This property provides a Python dictionary with document data such as title, sections, path, etc.
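As an illustration of how such a dictionary might feed the index, the sketch below turns a processed_json-like dictionary into a single bulk action of the general shape accepted by bulk helpers such as elasticsearch.helpers.bulk. The keys title, sections, and path come from the description above; the index name, the function name to_bulk_action, the structure of each section, and the sample data are all hypothetical. The exact metadata keys an action needs also depend on the client and server version.

```python
from typing import Any, Dict

INDEX_NAME = "page"  # hypothetical index name, chosen for this example


def to_bulk_action(doc_id: int, processed_json: Dict[str, Any]) -> Dict[str, Any]:
    """Map a processed_json-style dictionary to one bulk "index" action.

    Hypothetical helper for illustration; not part of Read the Docs.
    """
    return {
        "_op_type": "index",
        "_index": INDEX_NAME,
        "_id": doc_id,
        "_source": {
            # Missing keys fall back to empty values so one incomplete
            # page does not abort the whole batch.
            "title": processed_json.get("title", ""),
            "path": processed_json.get("path", ""),
            "sections": processed_json.get("sections", []),
        },
    }


# Made-up sample data; the per-section keys are an assumption.
example = {
    "title": "Installation",
    "path": "install",
    "sections": [
        {"id": "requirements", "title": "Requirements", "content": "Python 3 is required."}
    ],
}

action = to_bulk_action(1, example)
```

A list of such actions, one per HTMLFile, is what a bulk indexing call would send in a single request.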