Refactor RemoteRepository object
This document describes how RemoteRepository objects are currently used and proposes a new normalized model.
Goals
* De-duplicate data stored in our database.
* Save only one RemoteRepository per GitHub repository.
* Use an intermediate table between RemoteRepository and User to store the remote data associated with a specific user.
* Make this model usable from our SSO implementation (adding a remote_id field to Remote objects).
* Use a Postgres JSONField to store the associated json remote data.
* Make Project connect directly to RemoteRepository without being linked to a specific User.
* Do not disconnect Project and RemoteRepository when a user deletes or disconnects their account.
Non-goals
* Keep RemoteRepository in sync with GitHub repositories.
* Delete RemoteRepository objects deleted from GitHub.
* Listen to GitHub events to detect full_name changes and update our objects.
Note
We may need/want some of these non-goals in the future. They are just outside the scope of this document.
Current implementation
When a user connects their account to a social account, we create:

* allauth.socialaccount.models.SocialAccount
  * basic information (provider, last login, etc.)
  * provider-specific data saved as JSON under extra_data
* allauth.socialaccount.models.SocialToken
  * token used to hit the API on behalf of the user
We don’t create any RemoteRepository at this point.
They are created when the user opens the “Import Project” page and hits the circled arrows.
This triggers the sync_remote_repositories task in the background, which updates or creates RemoteRepository objects
but does not delete them (once #7183 and #7310 are merged, they will be deleted).
One RemoteRepository is created per repository the User has access to.
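The update-or-create behaviour described above, and the duplication it causes, can be sketched in plain Python. This is only an illustration: sync_for_user, the (user, full_name) key and the dictionary store are hypothetical stand-ins for the real task and table, not the actual implementation.

```python
def sync_for_user(existing, user, api_repos):
    """Update or create one record per (user, repository); never delete.

    `existing` is a dict standing in for the RemoteRepository table
    (an assumption made for this sketch). Returns (created, updated).
    """
    created = updated = 0
    for repo in api_repos:
        key = (user, repo["full_name"])
        if key in existing:
            updated += 1
        else:
            created += 1
        existing[key] = repo
    # Records missing from api_repos are left untouched (no deletion).
    return created, updated


if __name__ == "__main__":
    table = {}
    print(sync_for_user(table, "alice", [{"full_name": "org/repo"}]))
    print(sync_for_user(table, "bob", [{"full_name": "org/repo"}]))
    # The same GitHub repository is now stored twice, once per user.
    print(len(table))
```

Two users with access to the same repository end up with two records, which is the duplication the normalized model is meant to remove.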
Note
In corporate, we automatically sync RemoteRepository and RemoteOrganization at signup (foreground) and login (background) via a signal. We should eventually move this to community.
Where is RemoteRepository used?
* List of available repositories to import under “Import Project”
* Show a “+”, “External Arrow” or a “Lock” sign next to the element in the list
  * +: it’s available to be imported
  * External Arrow: the repository is already imported (see RemoteRepository.matches method)
  * Lock: user doesn’t have (admin) permissions to import this repository (uses RemoteRepository.private and RemoteRepository.admin)
* Avatar URL in the list of projects available to import
* Update webhook when user clicks “Resync webhook” from the Admin > Integrations tab
* Send build status when building Pull Requests
New normalized implementation
The ManyToMany relation RemoteRepository.users will be changed to ManyToMany(through='RemoteRelation') so that the relation can carry extra fields that are specific to the User.
This allows us to have only one RemoteRepository per GitHub repository, with multiple relationships to User.
With this model, when a user deletes or disconnects their account we only remove their RemoteRelation, so Project and RemoteRepository stay connected.
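As a rough illustration of the normalized layout, here is a minimal sketch using the standard sqlite3 module instead of Django models, so that it runs standalone. The table names and every column other than remote_id, full_name, admin and json are assumptions made for this example, not the actual schema.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only.
SCHEMA = """
CREATE TABLE auth_user (id INTEGER PRIMARY KEY, username TEXT NOT NULL);
CREATE TABLE remote_repository (
    id INTEGER PRIMARY KEY,
    remote_id TEXT NOT NULL UNIQUE,  -- one row per GitHub repository
    full_name TEXT NOT NULL
);
CREATE TABLE remote_relation (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES auth_user(id) ON DELETE CASCADE,
    remote_repository_id INTEGER NOT NULL REFERENCES remote_repository(id),
    admin INTEGER NOT NULL DEFAULT 0,  -- per-user permission
    json TEXT NOT NULL DEFAULT '{}',   -- per-user remote data
    UNIQUE (user_id, remote_repository_id)
);
CREATE TABLE project (
    id INTEGER PRIMARY KEY,
    slug TEXT NOT NULL,
    remote_repository_id INTEGER REFERENCES remote_repository(id)
);
"""


def build_demo():
    """Create an in-memory database with two users sharing one repository."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO auth_user (id, username) VALUES (?, ?)",
        [(1, "alice"), (2, "bob")],
    )
    conn.execute(
        "INSERT INTO remote_repository (id, remote_id, full_name) "
        "VALUES (1, '1234', 'org/repo')"
    )
    conn.executemany(
        "INSERT INTO remote_relation (user_id, remote_repository_id, admin) "
        "VALUES (?, 1, ?)",
        [(1, 1), (2, 0)],
    )
    conn.execute(
        "INSERT INTO project (id, slug, remote_repository_id) VALUES (1, 'docs', 1)"
    )
    conn.commit()
    return conn


def count(conn, table):
    """Return the number of rows in one of the demo tables."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

Deleting a user in this sketch cascades only to that user's remote_relation rows; the remote_repository row and the project pointing at it are left as they are.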
Note
All the points mentioned in the previous section may need to be adapted to the new normalized model. However, this may only involve renaming fields or making small query changes to use the new fields.
Use this modeling for SSO
We can get the list of Project objects a user has access to:
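    # Match the user and the admin flag on the same RemoteRelation row.
    # Assumes RemoteRelation has a `user` field and the default reverse
    # name `remoterelation` (no related_name set).
    admin_remote_repositories = RemoteRepository.objects.filter(
        remoterelation__user=request.user,
        remoterelation__admin=True,  # False for read-only access
    )
    Project.objects.filter(remote_repository__in=admin_remote_repositories)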
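For this lookup to be correct, the user and the admin flag have to come from the same relation row; otherwise a user who is admin on one repository could appear to be admin on all of them. The following plain-Python sketch shows that logic without the ORM. The dataclasses and the projects_for_user helper are hypothetical and only mirror the fields named in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteRelation:
    user: str
    remote_repository_id: int
    admin: bool


@dataclass(frozen=True)
class Project:
    slug: str
    remote_repository_id: int


def projects_for_user(user, relations, projects, admin=True):
    """Return slugs of projects whose repository the user can access.

    Pass admin=False to list projects with read-only access instead.
    """
    # Both conditions are checked on the same relation row.
    repo_ids = {
        rel.remote_repository_id
        for rel in relations
        if rel.user == user and rel.admin == admin
    }
    return [p.slug for p in projects if p.remote_repository_id in repo_ids]


if __name__ == "__main__":
    relations = [
        RemoteRelation("alice", 1, True),
        RemoteRelation("alice", 2, False),
        RemoteRelation("bob", 2, True),
    ]
    projects = [Project("docs", 1), Project("api", 2)]
    print(projects_for_user("alice", relations, projects))
    print(projects_for_user("alice", relations, projects, admin=False))
```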
Rollout plan
Due to the constraints we have on the RemoteRepository table and its size,
we can’t just run the data migration at the same time as the deploy.
Because of this, we need to be more creative and find a way to re-sync the data from VCS providers
while the site continues working.
To achieve this, we plan to follow these steps:
1. modify all the Python code to use the new modeling in .org and .com (this will help us find bugs locally more easily)
1. QA this locally with test data
1. enable the Django signal to re-sync RemoteRepository asynchronously on login (we already have this in .com); new active users will have updated data immediately
1. spin up a new instance with the refactored code
1. run migrations to create a new table for RemoteRepository
1. re-sync everything from VCS providers into the new table for a week or so
1. dump-n-load Project - RemoteRepository relations
1. create a migration to use the new table with synced data
1. deploy new code once the sync is finished
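The dump-n-load step has to translate each Project's link from a row in the old table to the matching row in the new one. A minimal sketch of that remapping follows, assuming rows are matched by full_name. The matching key, the dictionary shapes and the function name are all assumptions for illustration; the real step may use a different key.

```python
def remap_project_relations(old_links, old_repos, new_repos):
    """Translate Project -> RemoteRepository links to the new table.

    old_links: {project_slug: old_repo_id}
    old_repos: {old_repo_id: full_name}
    new_repos: {full_name: new_repo_id}
    Returns (new_links, unmatched_project_slugs).
    """
    new_links = {}
    unmatched = []
    for project, old_id in old_links.items():
        full_name = old_repos.get(old_id)
        new_id = new_repos.get(full_name)
        if new_id is None:
            # Repository was not re-synced (or was renamed); review by hand.
            unmatched.append(project)
        else:
            new_links[project] = new_id
    return new_links, unmatched
```

Collecting the unmatched projects separately, rather than failing on the first one, leaves a list to review before the migration that switches to the new table.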
See these issues for more context:

* https://github.com/readthedocs/readthedocs.org/pull/7536#issuecomment-724102640
* https://github.com/readthedocs/readthedocs.org/pull/7675#issuecomment-732756118