Terminology
A Ceph cluster may have zero or more CephFS file systems. Each CephFS file
system has a human-readable name (set at creation time with fs new) and an
integer ID. The ID is called the file system cluster ID, or FSCID.
Each CephFS file system has a number of ranks, numbered beginning with zero. By default there is one rank per file system. A rank may be thought of as a metadata shard. Management of ranks is described in Configuring multiple active MDS daemons.
Each CephFS ceph-mds daemon starts without a rank. It may be assigned one
by the cluster’s monitors. A daemon may hold only one rank at a time, and it
gives up a rank only when the ceph-mds process stops.
If a rank is not associated with any daemon, that rank is considered failed.
Once a rank is assigned to a daemon, the rank is considered up.
Each ceph-mds daemon has a name that is assigned statically by the
administrator when the daemon is first configured. The name is typically the
hostname of the machine where the process runs.
A ceph-mds daemon may be assigned to a specific file system by setting its
mds_join_fs configuration option to the file system’s name.
When a ceph-mds daemon starts, it is also assigned an integer GID, which is
unique to the daemon’s current process. In other words, when a ceph-mds daemon
is restarted, it runs as a new process and is assigned a new GID that is
different from that of the previous process.
Referring to MDS daemons
Most administrative commands that refer to a ceph-mds daemon (MDS) accept a
flexible argument format that may specify a rank, a GID or a name.
Where a rank is used, it may optionally be qualified by a leading file system
name or FSCID. If a daemon is a standby (i.e. it is not currently assigned a
rank), then it may only be referred to by GID or name.
For example, suppose an MDS daemon named ‘myhost’ has GID 5446 and is assigned
rank 0 of the file system ‘myfs’, which has FSCID 3. Any of the following forms
of the fail command is suitable:
ceph mds fail 5446 # GID
ceph mds fail myhost # Daemon name
ceph mds fail 0 # Unqualified rank
ceph mds fail 3:0 # FSCID and rank
ceph mds fail myfs:0 # File System name and rank
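To make the five argument forms concrete, the following Python sketch resolves a specifier string against a small in-memory list of daemons. It is only an illustration of the rules described above, not the monitors' actual code: the Daemon class, the resolve function, and the order in which a bare number is tried (rank first, then GID) are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Daemon:
    """Hypothetical record for one ceph-mds daemon (illustration only)."""
    name: str
    gid: int
    fs_name: Optional[str] = None  # None for a standby
    fscid: Optional[int] = None    # None for a standby
    rank: Optional[int] = None     # None for a standby


def resolve(spec: str, daemons: List[Daemon]) -> Optional[Daemon]:
    """Return the daemon that spec refers to, or None if nothing matches."""
    if ":" in spec:
        # Qualified rank: "<file system name>:<rank>" or "<FSCID>:<rank>".
        fs, rank_text = spec.split(":", 1)
        if not rank_text.isdigit():
            return None
        rank = int(rank_text)
        for d in daemons:
            by_name = fs == d.fs_name
            by_fscid = d.fscid is not None and fs == str(d.fscid)
            if d.rank == rank and (by_name or by_fscid):
                return d
        return None
    if spec.isdigit():
        number = int(spec)
        # Assumption of this sketch: an unqualified rank is tried first and
        # must be unambiguous; otherwise the number is treated as a GID.
        holders = [d for d in daemons if d.rank == number]
        if len(holders) == 1:
            return holders[0]
        for d in daemons:
            if d.gid == number:
                return d
        return None
    # Anything else is treated as a daemon name.
    for d in daemons:
        if d.name == spec:
            return d
    return None


# The daemon from the example above, plus a hypothetical standby.
daemons = [
    Daemon(name="myhost", gid=5446, fs_name="myfs", fscid=3, rank=0),
    Daemon(name="spare", gid=6001),
]

for spec in ["5446", "myhost", "0", "3:0", "myfs:0"]:
    match = resolve(spec, daemons)
    print(spec, "->", match.name if match else None)
```

A standby such as 'spare' holds no rank, so in this sketch it can be reached only through its GID or its name, as the text above states.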
Managing failover
If an MDS daemon stops communicating with the cluster’s monitors, the monitors
will wait mds_beacon_grace seconds (default 15) before marking the daemon as
laggy. If a standby MDS is available, the monitor will immediately replace the
laggy daemon.
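The grace-period rule can be pictured as a simple timestamp comparison. The sketch below is illustrative only: the function name is hypothetical, and treating an interval of exactly mds_beacon_grace seconds as still healthy is an assumption, not something the text above specifies.

```python
MDS_BEACON_GRACE_DEFAULT = 15.0  # seconds; the default quoted above


def is_laggy(last_beacon: float, now: float,
             grace: float = MDS_BEACON_GRACE_DEFAULT) -> bool:
    """Return True once more than `grace` seconds have passed since the
    daemon last communicated with the monitors (hypothetical helper)."""
    # Assumption: exactly `grace` seconds of silence is not yet laggy.
    return (now - last_beacon) > grace


# A daemon last heard from at t=100s, checked at two later times.
print(is_laggy(last_beacon=100.0, now=110.0))  # within the grace period
print(is_laggy(last_beacon=100.0, now=116.0))  # past the grace period
```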
Each file system may specify a minimum number of standby daemons in order to be
considered healthy. This number includes daemons in the standby-replay state
waiting for a rank to fail. Note that a standby-replay daemon will not be
assigned to take over a failure for another rank or a failure in a different
CephFS file system. Standby daemons that are not in standby-replay count
towards the standby count of any file system.
The desired number of standby daemons for a file system is set with:
ceph fs set <fs name> standby_count_wanted <count>
Setting count to 0 will disable the health check.
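The counting rule described above can be sketched as a small function. This is a reading of the prose, not the monitors' implementation; the function and parameter names are hypothetical.

```python
def standby_shortfall(wanted: int,
                      standby_replay_for_fs: int,
                      shared_standbys: int) -> int:
    """Return how many standbys a file system is missing (0 means healthy).

    wanted                 -- the file system's standby_count_wanted
    standby_replay_for_fs  -- standby-replay daemons following this file system
    shared_standbys        -- standbys not in replay; they count for any file system
    """
    if wanted == 0:
        # A count of 0 disables the check.
        return 0
    available = standby_replay_for_fs + shared_standbys
    return max(0, wanted - available)


print(standby_shortfall(wanted=2, standby_replay_for_fs=1, shared_standbys=0))
print(standby_shortfall(wanted=0, standby_replay_for_fs=0, shared_standbys=0))
```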
Configuring standby-replay
Each CephFS file system may be configured to add standby-replay daemons.
These standby daemons follow the active MDS’s metadata journal in order to
reduce failover time in the event that the active MDS becomes unavailable. Each
active MDS may have only one standby-replay daemon following it.
Standby-replay is configured on a file system with the following command:
ceph fs set <fs name> allow_standby_replay <bool>
Once set, the monitors will assign available standby daemons to follow the active MDSs in that file system.
Once an MDS has entered the standby-replay state, it will only be used as a
standby for the rank that it is following. If another rank fails, this
standby-replay daemon will not be used as a replacement, even if no other
standbys are available. For this reason, it is advised that if standby-replay
is used then every active MDS should have a standby-replay daemon.
Configuring MDS file system affinity
You might elect to dedicate an MDS to a particular file system. Or, perhaps you
have MDSs that run on better hardware that should be preferred over a last-resort
standby on modest or over-provisioned systems. To configure this preference,
CephFS provides a configuration option for MDS called mds_join_fs which
enforces this affinity.
When failing over MDS daemons, a cluster’s monitors will prefer standby daemons
with mds_join_fs equal to the name of the file system with the failed rank. If
no standby exists with mds_join_fs equal to that file system name, the monitors
will choose an unqualified standby (no setting for mds_join_fs) for the
replacement, or any other available standby as a last resort. Note that this
does not change the behavior that standby-replay daemons are always selected
before other standbys.
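The preference order can be summarized as a ranking function. The Python sketch below only restates the prose above, together with the standby-replay restriction from the previous section; it is not the monitors' code. The Standby class, preference_tier, and choose_replacement are hypothetical names, and breaking ties by list position is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Standby:
    """Hypothetical view of one standby daemon (illustration only)."""
    name: str
    join_fs: Optional[str] = None                  # mds_join_fs, None if unset
    replay_for: Optional[Tuple[str, int]] = None   # (fs name, rank) being followed


def preference_tier(s: Standby, fs_name: str, rank: int) -> Optional[int]:
    """Lower is better; None means the standby must not be used."""
    if s.replay_for == (fs_name, rank):
        return 0          # standby-replay daemon following the failed rank
    if s.replay_for is not None:
        return None       # following a different rank: never a replacement
    if s.join_fs == fs_name:
        return 1          # mds_join_fs matches the file system
    if s.join_fs is None:
        return 2          # unqualified standby
    return 3              # any other available standby, as a last resort


def choose_replacement(fs_name: str, rank: int,
                       standbys: List[Standby]) -> Optional[Standby]:
    ranked = []
    for position, standby in enumerate(standbys):
        tier = preference_tier(standby, fs_name, rank)
        if tier is not None:
            ranked.append((tier, position, standby))
    if not ranked:
        return None
    # Assumption: ties within a tier are broken by list position.
    return min(ranked, key=lambda item: item[:2])[2]


standbys = [
    Standby("c", join_fs="otherfs"),
    Standby("d"),
    Standby("b", join_fs="cephfs"),
    Standby("r1", replay_for=("cephfs", 1)),
]
chosen = choose_replacement("cephfs", 0, standbys)
print(chosen.name if chosen else None)
```

Here rank 0 of 'cephfs' has no standby-replay follower, so the standby whose mds_join_fs names 'cephfs' is chosen ahead of the unqualified one, and 'r1' is never considered because it follows rank 1.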
Furthermore, the monitors regularly examine the CephFS file systems, even when
they are stable, to check whether a standby with stronger affinity is available
to replace an MDS with lower affinity. The same check applies to standby-replay
daemons: if a regular standby has stronger affinity than the standby-replay
MDS, it will replace the standby-replay MDS.
For example, given this stable and healthy file system:
$ ceph fs dump
dumped fsmap epoch 399
...
Filesystem 'cephfs' (27)
...
e399
max_mds 1
in 0
up {0=20384}
failed
damaged
stopped
...
[mds.a{0:20384} state up:active seq 239 addr [v2:127.0.0.1:6854/966242805,v1:127.0.0.1:6855/966242805]]
Standby daemons:
[mds.b{-1:10420} state up:standby seq 2 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]
You may set mds_join_fs on the standby to enforce your preference:
$ ceph config set mds.b mds_join_fs cephfs
After automatic failover:
$ ceph fs dump
dumped fsmap epoch 405
e405
...
Filesystem 'cephfs' (27)
...
max_mds 1
in 0
up {0=10420}
failed
damaged
stopped
...
[mds.b{0:10420} state up:active seq 274 join_fscid=27 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]
Standby daemons:
[mds.a{-1:10720} state up:standby seq 2 addr [v2:127.0.0.1:6854/1340357658,v1:127.0.0.1:6855/1340357658]]
Note in the above example that mds.b now has join_fscid=27. In this output, the
file system name from mds_join_fs is shown as the file system identifier (27).
If the file system is recreated with the same name, the standby will follow the
new file system as expected.
Finally, if the file system is degraded or undersized, no failover will occur
to enforce mds_join_fs.
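When checking the result of a failover, it can help to pull the name, rank, GID, and join_fscid out of the daemon lines shown in the ceph fs dump output above. The following Python sketch does this with a regular expression. The pattern is inferred only from the sample lines in this section, so it is an assumption that other fields or layouts do not occur; the function name is hypothetical.

```python
import re
from typing import Optional

# Pattern inferred from the sample lines above, e.g.
#   [mds.b{0:10420} state up:active seq 274 join_fscid=27 addr [...]]
DAEMON_LINE = re.compile(
    r"\[mds\.(?P<name>[^{]+)"
    r"\{(?P<rank>-?\d+):(?P<gid>\d+)\}"
    r" state (?P<state>\S+)"
    r" seq (?P<seq>\d+)"
    r"(?: join_fscid=(?P<join_fscid>\d+))?"
    r" addr "
)


def parse_daemon_line(line: str) -> Optional[dict]:
    """Parse one daemon line; return None if it does not match the pattern."""
    m = DAEMON_LINE.match(line.strip())
    if m is None:
        return None
    fields = m.groupdict()
    return {
        "name": fields["name"],
        # The standby lines in the sample show a rank of -1.
        "rank": int(fields["rank"]),
        "gid": int(fields["gid"]),
        "state": fields["state"],
        "join_fscid": int(fields["join_fscid"]) if fields["join_fscid"] else None,
    }


active = parse_daemon_line(
    "[mds.b{0:10420} state up:active seq 274 join_fscid=27 addr "
    "[v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]"
)
standby = parse_daemon_line(
    "[mds.a{-1:10720} state up:standby seq 2 addr "
    "[v2:127.0.0.1:6854/1340357658,v1:127.0.0.1:6855/1340357658]]"
)
print(active)
print(standby)
```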