BlueStore Migration
Each OSD must be formatted as either Filestore or BlueStore. However, a Ceph cluster can operate with a mixture of both Filestore OSDs and BlueStore OSDs. Because BlueStore is superior to Filestore in performance and robustness, and because Filestore is not supported by Ceph releases beginning with Reef, users deploying Filestore OSDs should transition to BlueStore. There are several strategies for making the transition to BlueStore.
BlueStore is so different from Filestore that an individual OSD cannot be converted in place. Instead, the conversion process must use either (1) the cluster’s normal replication and healing support, or (2) tools and strategies that copy OSD content from an old (Filestore) device to a new (BlueStore) one.
Deploying new OSDs with BlueStore
Use BlueStore when deploying new OSDs (for example, when the cluster is expanded). Because this is the default behavior, no specific change is needed.
Similarly, use BlueStore for any OSDs that have been reprovisioned after a failed drive was replaced.
Converting existing OSDs
“Mark-out” replacement
The simplest approach is to verify that the cluster is healthy and then follow these steps for each Filestore OSD in succession: mark the OSD out, wait for the data to replicate across the cluster, reprovision the OSD, mark the OSD back in, and wait for recovery to complete before proceeding to the next OSD. This approach is easy to automate, but it entails unnecessary data migration that carries costs in time and SSD wear.
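The "verify that the cluster is healthy" precondition can be scripted. The following is a minimal sketch, assuming that the `ceph health` command prints a status line beginning with `HEALTH_OK` when the cluster is healthy. The function name `wait_for_health_ok`, its timeout argument, and the `POLL_INTERVAL` variable are illustrative and are not part of Ceph.

```shell
# Poll `ceph health` until it reports HEALTH_OK or a deadline passes.
# Usage: wait_for_health_ok [timeout-seconds]
# POLL_INTERVAL (seconds between polls) is a hypothetical knob for this sketch.
wait_for_health_ok() {
    timeout_s=${1:-3600}
    interval_s=${POLL_INTERVAL:-60}
    waited=0
    while :; do
        status=$(ceph health 2>/dev/null)
        case "$status" in
            HEALTH_OK*) return 0 ;;
        esac
        # Give up once the deadline has been reached.
        if [ "$waited" -ge "$timeout_s" ]; then
            echo "cluster not healthy after ${waited}s: $status" >&2
            return 1
        fi
        sleep "$interval_s"
        waited=$((waited + interval_s))
    done
}
```

Calling `wait_for_health_ok 1800 || exit 1` before touching each OSD would stop an automated run if the cluster has not settled within 30 minutes.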
Identify a Filestore OSD to replace:
ID=<osd-id-number>
DEVICE=<disk-device>
Determine whether a given OSD is Filestore or BlueStore:
ceph osd metadata $ID | grep osd_objectstore
Get a current count of Filestore and BlueStore OSDs:
ceph osd count-metadata osd_objectstore
Mark the Filestore OSD out:
ceph osd out $ID
Wait for the data to migrate off this OSD:
while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done
Stop the OSD:
systemctl kill ceph-osd@$ID
Note which device the OSD is using:
mount | grep /var/lib/ceph/osd/ceph-$ID
Unmount the OSD:
umount /var/lib/ceph/osd/ceph-$ID
Destroy the OSD’s data. Be EXTREMELY CAREFUL! These commands will destroy the contents of the device; you must be certain that the data on the device is not needed (in other words, that the cluster is healthy) before proceeding:
ceph-volume lvm zap $DEVICE
Tell the cluster that the OSD has been destroyed (and that a new OSD can be reprovisioned with the same OSD ID):
ceph osd destroy $ID --yes-i-really-mean-it
Provision a BlueStore OSD in place by using the same OSD ID. This requires you to identify which device to wipe, and to make certain that you target the correct and intended device, using the information that was retrieved in the “Note which device the OSD is using” step. BE CAREFUL! Note that you may need to modify these commands when dealing with hybrid OSDs:
ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID
Repeat.
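Because the procedure above is easy to automate, it may help to see the per-OSD steps gathered into one function. This is a sketch under stated assumptions, not a supported tool: the names `run`, `DRY_RUN`, and `replace_filestore_osd` are hypothetical, the commands are the ones listed in the steps above, and the sketch does not check that the OSD is actually Filestore or that the device argument is the right one. It defaults to a dry run that only prints the commands.

```shell
# Print the command when DRY_RUN=1 (the default in this sketch);
# execute it only when DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: replace_filestore_osd <osd-id> <device>
# Runs the mark-out replacement steps for one OSD, in order.
replace_filestore_osd() {
    id=$1
    device=$2
    if [ -z "$id" ] || [ -z "$device" ]; then
        echo "usage: replace_filestore_osd <osd-id> <device>" >&2
        return 2
    fi
    run ceph osd out "$id" || return 1
    # Wait until the cluster reports that the OSD can be destroyed safely.
    until run ceph osd safe-to-destroy "$id"; do
        sleep 60
    done
    run systemctl kill "ceph-osd@$id" || return 1
    run umount "/var/lib/ceph/osd/ceph-$id" || return 1
    # Destructive from here on: the device contents are wiped.
    run ceph-volume lvm zap "$device" || return 1
    run ceph osd destroy "$id" --yes-i-really-mean-it || return 1
    run ceph-volume lvm create --bluestore --data "$device" --osd-id "$id"
}
```

Running `replace_filestore_osd 7 /dev/sdx` prints the seven commands for review; only `DRY_RUN=0 replace_filestore_osd 7 /dev/sdx` would execute them.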
You may opt to (1) have the balancing of the replacement BlueStore OSD take place concurrently with the draining of the next Filestore OSD, or instead (2) follow the same procedure for multiple OSDs in parallel. In either case, however, you must ensure that the cluster is fully clean (in other words, that all data has all replicas) before destroying any OSDs. If you opt to reprovision multiple OSDs in parallel, be very careful to destroy OSDs only within a single CRUSH failure domain (for example, host or rack). Failure to satisfy this requirement will reduce the redundancy and availability of your data and increase the risk of data loss (or even guarantee data loss).
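When reprovisioning several OSDs in parallel, the single-failure-domain rule can be checked before anything is destroyed. The sketch below assumes that `ceph osd ls-tree <bucket>` prints the IDs of the OSDs beneath a CRUSH bucket, separated by whitespace. The function name `all_osds_in_bucket` is hypothetical.

```shell
# Usage: all_osds_in_bucket <crush-bucket> <osd-id>...
# Succeeds only if every given OSD ID is listed under the named CRUSH bucket
# (for example a host or rack), as reported by `ceph osd ls-tree`.
all_osds_in_bucket() {
    bucket=$1
    shift
    members=$(ceph osd ls-tree "$bucket") || return 1
    for id in "$@"; do
        found=0
        for m in $members; do
            if [ "$m" = "$id" ]; then
                found=1
                break
            fi
        done
        if [ "$found" -ne 1 ]; then
            echo "osd.$id is not under CRUSH bucket $bucket" >&2
            return 1
        fi
    done
    return 0
}
```

For example, `all_osds_in_bucket rack1 10 11 12 || exit 1` would refuse to continue if any of the three OSDs lies outside `rack1`.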
Advantages:
Simple.
Can be done on a device-by-device basis.
No spare devices or hosts are required.
Disadvantages:
Data is copied over the network twice: once to another OSD in the cluster (to maintain the specified number of replicas), and again back to the reprovisioned BlueStore OSD.
“Whole host” replacement
If you have a spare host in the cluster, or sufficient free space to evacuate an entire host for use as a spare, then the conversion can be done on a host-by-host basis so that each stored copy of the data is migrated only once.
To use this approach, you need an empty host that has no OSDs provisioned. There are two ways to do this: either by using a new, empty host that is not yet part of the cluster, or by offloading data from an existing host that is already part of the cluster.
Using a new, empty host
Ideally the host will have roughly the same capacity as each of the other hosts you will be converting. Add the host to the CRUSH hierarchy, but do not attach it to the root:
NEWHOST=<empty-host-name>
ceph osd crush add-bucket $NEWHOST host
Make sure that Ceph packages are installed on the new host.
Using an existing host
If you would like to use an existing host that is already part of the cluster, and if there is sufficient free space on that host so that all of its data can be migrated off to other cluster hosts, you can do the following (instead of using a new, empty host):
OLDHOST=<existing-cluster-host-to-offload>
ceph osd crush unlink $OLDHOST default
where “default” is the immediate ancestor in the CRUSH map. (For smaller clusters with unmodified configurations this will normally be “default”, but it might instead be a rack name.) You should now see the host at the top of the OSD tree output with no parent:
bin/ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME          STATUS  REWEIGHT  PRI-AFF
-5         0        host oldhost
10  ssd    1.00000      osd.10         up      1.00000   1.00000
11  ssd    1.00000      osd.11         up      1.00000   1.00000
12  ssd    1.00000      osd.12         up      1.00000   1.00000
-1         3.00000  root default
-2         3.00000      host foo
0   ssd    1.00000          osd.0      up      1.00000   1.00000
1   ssd    1.00000          osd.1      up      1.00000   1.00000
2   ssd    1.00000          osd.2      up      1.00000   1.00000
...
If everything looks good, jump directly to the “Wait for the data migration to complete” step below and proceed from there to clean up the old OSDs.
Migration process
If you are using a new host, start at the first step below. If you are using an existing host, jump to the “Wait for the data migration to complete” step.
Provision new BlueStore OSDs for all devices:
ceph-volume lvm create --bluestore --data /dev/$DEVICE
Verify that the new OSDs have joined the cluster:
ceph osd tree
You should see the new host $NEWHOST with all of the OSDs beneath it, but the host should not be nested beneath any other node in the hierarchy (like root default). For example, if newhost is the empty host, you might see something like:
bin/ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME          STATUS  REWEIGHT  PRI-AFF
-5         0        host newhost
10  ssd    1.00000      osd.10         up      1.00000   1.00000
11  ssd    1.00000      osd.11         up      1.00000   1.00000
12  ssd    1.00000      osd.12         up      1.00000   1.00000
-1         3.00000  root default
-2         3.00000      host oldhost1
0   ssd    1.00000          osd.0      up      1.00000   1.00000
1   ssd    1.00000          osd.1      up      1.00000   1.00000
2   ssd    1.00000          osd.2      up      1.00000   1.00000
...
Identify the first target host to convert:
OLDHOST=<existing-cluster-host-to-convert>
Swap the new host into the old host’s position in the cluster:
ceph osd crush swap-bucket $NEWHOST $OLDHOST
At this point all data on $OLDHOST will begin migrating to the OSDs on $NEWHOST. If there is a difference between the total capacity of the old hosts and the total capacity of the new hosts, you may also see some data migrate to or from other nodes in the cluster. Provided that the hosts are similarly sized, however, this will be a relatively small amount of data.
Wait for the data migration to complete:
while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done
Stop all old OSDs on the now-empty $OLDHOST:
ssh $OLDHOST
systemctl kill ceph-osd.target
umount /var/lib/ceph/osd/ceph-*
Destroy and purge the old OSDs:
for osd in `ceph osd ls-tree $OLDHOST`; do
    ceph osd purge $osd --yes-i-really-mean-it
done
Wipe the old OSDs. This requires you to manually identify which devices are to be wiped. BE CAREFUL! For each device:
ceph-volume lvm zap $DEVICE
Use the now-empty host as the new host, and repeat:
NEWHOST=$OLDHOST
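The swap, wait, stop, and purge steps above can be collected into one function per host. The following is a sketch only, assuming passwordless `ssh` access to the old host and assuming that `ceph osd ls-tree` still lists the old host's OSDs after the swap, as the steps above imply. The names `run`, `DRY_RUN`, and `retire_old_host` are hypothetical. The sketch defaults to printing the commands, and it leaves provisioning and device wiping to the operator.

```shell
# Print the command when DRY_RUN=1 (the default in this sketch);
# execute it only when DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: retire_old_host <newhost> <oldhost>
retire_old_host() {
    newhost=$1
    oldhost=$2
    if [ -z "$newhost" ] || [ -z "$oldhost" ]; then
        echo "usage: retire_old_host <newhost> <oldhost>" >&2
        return 2
    fi
    run ceph osd crush swap-bucket "$newhost" "$oldhost" || return 1
    # Read-only query; it runs even in dry-run mode.
    osds=$(ceph osd ls-tree "$oldhost") || return 1
    # $osds is intentionally unquoted so that each ID becomes one argument.
    until run ceph osd safe-to-destroy $osds; do
        sleep 60
    done
    # The old OSDs must be stopped before they are purged.
    run ssh "$oldhost" systemctl kill ceph-osd.target || return 1
    run ssh "$oldhost" 'umount /var/lib/ceph/osd/ceph-*' || return 1
    for osd in $osds; do
        run ceph osd purge "$osd" --yes-i-really-mean-it || return 1
    done
}
```

Reviewing the output of `retire_old_host newhost oldhost1` before rerunning it with `DRY_RUN=0` gives a chance to confirm the host names and the list of OSD IDs.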
Advantages:
Data is copied over the network only once.
An entire host’s OSDs are converted at once.
Can be parallelized, to make possible the conversion of multiple hosts at the same time.
No host involved in this process needs to have a spare device.
Disadvantages:
A spare host is required.
An entire host’s worth of OSDs will be migrating data at a time. This is likely to impact overall cluster performance.
All migrated data still makes one full hop over the network.
Per-OSD device copy
A single logical OSD can be converted by using the copy function included in ceph-objectstore-tool. This requires that the host have one or more free devices on which to provision a new, empty BlueStore OSD. For example, if each host in your cluster has twelve OSDs, then you need a thirteenth, unused device: each OSD is converted onto a free device, and the device that it vacates is then reclaimed to convert the next OSD.
Caveats:
This approach requires that we prepare an empty BlueStore OSD but that we do not allocate a new OSD ID to it. The ceph-volume tool does not support such an operation. IMPORTANT: because the setup of dmcrypt is closely tied to the identity of the OSD, this approach does not work with encrypted OSDs.
The device must be manually partitioned.
An unsupported user-contributed script that demonstrates this process may be found here: https://github.com/ceph/ceph/blob/master/src/script/contrib/ceph-migrate-bluestore.bash
Advantages:
Provided that the ‘noout’ or the ‘norecover’/‘norebalance’ flags are set on the OSD or the cluster while the conversion process is underway, little or no data migrates over the network during the conversion.
Disadvantages:
Tooling is not fully implemented, supported, or documented.
Each host must have an appropriate spare or empty device for staging.
The OSD is offline during the conversion, which means new writes to PGs with the OSD in their acting set may not be ideally redundant until the subject OSD comes up and recovers. This increases the risk of data loss due to an overlapping failure. However, if another OSD fails before conversion and startup have completed, the original Filestore OSD can be started to provide access to its original data.
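The flag handling mentioned under Advantages can be wrapped so that the flag is cleared even when the conversion step fails. This is a minimal sketch, assuming the cluster-wide `ceph osd set noout` and `ceph osd unset noout` commands. The function name `with_noout` is hypothetical, and the conversion command itself is left to the caller.

```shell
# Usage: with_noout <command> [args...]
# Sets the cluster-wide noout flag, runs the command, then clears the flag
# and returns the command's exit status.
with_noout() {
    ceph osd set noout || return 1
    rc=0
    # `|| rc=$?` keeps going on failure so that the flag is always cleared.
    "$@" || rc=$?
    ceph osd unset noout || echo "warning: failed to unset noout" >&2
    return "$rc"
}
```

A call such as `with_noout ./convert-one-osd.sh 7` (where the script name is a placeholder) keeps the flag set only for the duration of that one conversion.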