Primary log-based replication
Reads must return data written by any write which completed (where the
client could possibly have received a commit message). There are lots
of ways to handle this, but Ceph’s architecture makes it easy for
everyone at any map epoch to know who the primary is. Thus, the easy
answer is to route all writes for a particular PG through a single
ordering primary and then out to the replicas. Though we only
actually need to serialize writes on a single RADOS object (and even then,
the partial ordering only really needs to provide an ordering between
writes on overlapping regions), we might as well serialize writes on
the whole PG since it lets us represent the current state of the PG
using two numbers: the epoch of the map on the primary in which the
most recent write started (this is a bit stranger than it might seem
since map distribution itself is asynchronous – see Peering and the
concept of interval changes) and an increasing per-PG version number
– this is referred to in the code with type eversion_t and stored as
pg_info_t::last_update. Furthermore, we maintain a log of “recent”
operations extending back at least far enough to include any
unstable writes (writes which have been started but not committed)
and objects which aren’t up to date locally (see recovery and
backfill). In practice, the log will extend much further
(osd_min_pg_log_entries when clean and osd_max_pg_log_entries when
not clean) because it’s handy for quickly performing recovery.
Using this log, as long as we talk to a non-empty subset of the OSDs
which must have accepted any completed writes from the most recent
interval in which we accepted writes, we can determine a conservative
log which must contain any write which has been reported to a client
as committed. There is some freedom here: we can choose any log entry
between the oldest head remembered by an element of that set (any
newer write cannot have completed without that log containing it) and the
newest head remembered (clearly, all writes in the log were started,
so it’s fine for us to remember them) as the new head. This is the
main point of divergence between replicated pools and EC pools in
PG/PrimaryLogPG: replicated pools try to choose the newest valid
option to avoid the client needing to replay those operations and
instead recover the other copies. EC pools instead try to choose
the oldest option available to them.
The reason for this gets to the heart of the rest of the differences
in implementation: one copy will not generally be enough to
reconstruct an EC object. Indeed, there are encodings where some log
combinations would leave unrecoverable objects (as with a k=4,m=2
encoding where 3 of the shards remember a write, but the other 3 do
not – neither version has the k=4 shards needed to decode it). For
this reason, log entries
representing unstable writes (writes not yet committed to the
client) must be rollbackable using only local information on EC pools.
Log entries in general may therefore be rollbackable (in which case
either via delayed application or via a set of instructions for
rolling back an in-place update) or not. Replicated pool log entries
can never be rolled back.
For more details, see PGLog.h/cc, osd_types.h:pg_log_t,
osd_types.h:pg_log_entry_t, and peering in general.