BlueStore Migration

Each OSD must be formatted as either Filestore or BlueStore. However, a Ceph cluster can operate with a mixture of both Filestore OSDs and BlueStore OSDs. Because BlueStore is superior to Filestore in performance and robustness, and because Filestore is not supported by Ceph releases beginning with Reef, users deploying Filestore OSDs should transition to BlueStore. There are several strategies for making the transition to BlueStore.

BlueStore is so different from Filestore that an individual OSD cannot be converted in place. Instead, the conversion process must use either (1) the cluster’s normal replication and healing support, or (2) tools and strategies that copy OSD content from an old (Filestore) device to a new (BlueStore) one.

Deploying new OSDs with BlueStore

Use BlueStore when deploying new OSDs (for example, when the cluster is expanded). Because this is the default behavior, no specific change is needed.

Similarly, use BlueStore for any OSDs that have been reprovisioned after a failed drive was replaced.

Converting existing OSDs

“Mark-out” replacement

The simplest approach is to verify that the cluster is healthy and then follow these steps for each Filestore OSD in succession: mark the OSD out, wait for the data to replicate across the cluster, reprovision the OSD, mark the OSD back in, and wait for recovery to complete before proceeding to the next OSD. This approach is easy to automate, but it entails unnecessary data migration that carries costs in time and SSD wear.

  1. Identify a Filestore OSD to replace:

    ID=<osd-id-number>
    DEVICE=<disk-device>
    
    1. Determine whether a given OSD is Filestore or BlueStore:

      ceph osd metadata $ID | grep osd_objectstore
      
    2. Get a current count of Filestore and BlueStore OSDs:

      ceph osd count-metadata osd_objectstore
      
  2. Mark a Filestore OSD out:

    ceph osd out $ID
    
  3. Wait for the data to migrate off this OSD:

    while ! ceph osd safe-to-destroy $ID ; do sleep 60 ; done
    
  4. Stop the OSD:

    systemctl kill ceph-osd@$ID
    
  5. Note which device the OSD is using:

    mount | grep /var/lib/ceph/osd/ceph-$ID
    
  6. Unmount the OSD:

    umount /var/lib/ceph/osd/ceph-$ID
    
  7. Destroy the OSD’s data. Be EXTREMELY CAREFUL! These commands will destroy the contents of the device; you must be certain that the data on the device is not needed (in other words, that the cluster is healthy) before proceeding:

    ceph-volume lvm zap $DEVICE
    
  8. Tell the cluster that the OSD has been destroyed (and that a new OSD can be reprovisioned with the same OSD ID):

    ceph osd destroy $ID --yes-i-really-mean-it
    
  9. Provision a BlueStore OSD in place by using the same OSD ID. This requires you to identify which device to wipe, and to make certain that you target the correct and intended device, using the information that was retrieved in the “Note which device the OSD is using” step. BE CAREFUL! Note that you may need to modify these commands when dealing with hybrid OSDs:

    ceph-volume lvm create --bluestore --data $DEVICE --osd-id $ID
    
  10. Repeat.

You may opt to (1) have the balancing of the replacement BlueStore OSD take place concurrently with the draining of the next Filestore OSD, or instead (2) follow the same procedure for multiple OSDs in parallel. In either case, however, you must ensure that the cluster is fully clean (in other words, that all data has all replicas) before destroying any OSDs. If you opt to reprovision multiple OSDs in parallel, be very careful to destroy OSDs only within a single CRUSH failure domain (for example, host or rack). Failure to satisfy this requirement will reduce the redundancy and availability of your data and increase the risk of data loss (or even guarantee data loss).

Advantages:

  • Simple.

  • Can be done on a device-by-device basis.

  • No spare devices or hosts are required.

Disadvantages:

  • Data is copied over the network twice: once to another OSD in the cluster (to maintain the specified number of replicas), and again back to the reprovisioned BlueStore OSD.

“Whole host” replacement

If you have a spare host in the cluster, or sufficient free space to evacuate an entire host for use as a spare, then the conversion can be done on a host-by-host basis so that each stored copy of the data is migrated only once.

To use this approach, you need an empty host that has no OSDs provisioned. There are two ways to do this: either by using a new, empty host that is not yet part of the cluster, or by offloading data from an existing host that is already part of the cluster.

Using a new, empty host

Ideally the host will have roughly the same capacity as each of the other hosts you will be converting. Add the host to the CRUSH hierarchy, but do not attach it to the root:

NEWHOST=<empty-host-name>
ceph osd crush add-bucket $NEWHOST host

Make sure that Ceph packages are installed on the new host.

Using an existing host

If you would like to use an existing host that is already part of the cluster, and if there is sufficient free space on that host so that all of its data can be migrated off to other cluster hosts, you can do the following (instead of using a new, empty host):

OLDHOST=<existing-cluster-host-to-offload>
ceph osd crush unlink $OLDHOST default

where “default” is the immediate ancestor in the CRUSH map. (For smaller clusters with unmodified configurations this will normally be “default”, but it might instead be a rack name.) You should now see the host at the top of the OSD tree output with no parent:

bin/ceph osd tree
ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
-5             0 host oldhost
10   ssd 1.00000     osd.10        up  1.00000 1.00000
11   ssd 1.00000     osd.11        up  1.00000 1.00000
12   ssd 1.00000     osd.12        up  1.00000 1.00000
-1       3.00000 root default
-2       3.00000     host foo
 0   ssd 1.00000         osd.0     up  1.00000 1.00000
 1   ssd 1.00000         osd.1     up  1.00000 1.00000
 2   ssd 1.00000         osd.2     up  1.00000 1.00000
...

If everything looks good, jump directly to the “Wait for the data migration to complete” step below and proceed from there to clean up the old OSDs.

Migration process

If you’re using a new host, start at the first step. If you’re using an existing host, jump to this step.

  1. Provision new BlueStore OSDs for all devices:

    ceph-volume lvm create --bluestore --data /dev/$DEVICE
    
  2. Verify that the new OSDs have joined the cluster:

    ceph osd tree
    

    You should see the new host $NEWHOST with all of the OSDs beneath it, but the host should not be nested beneath any other node in the hierarchy (like root default). For example, if newhost is the empty host, you might see something like:

    $ bin/ceph osd tree
    ID CLASS WEIGHT  TYPE NAME     STATUS REWEIGHT PRI-AFF
    -5             0 host newhost
    10   ssd 1.00000     osd.10        up  1.00000 1.00000
    11   ssd 1.00000     osd.11        up  1.00000 1.00000
    12   ssd 1.00000     osd.12        up  1.00000 1.00000
    -1       3.00000 root default
    -2       3.00000     host oldhost1
     0   ssd 1.00000         osd.0     up  1.00000 1.00000
     1   ssd 1.00000         osd.1     up  1.00000 1.00000
     2   ssd 1.00000         osd.2     up  1.00000 1.00000
    ...
    
  3. Identify the first target host to convert :

    OLDHOST=<existing-cluster-host-to-convert>
    
  4. Swap the new host into the old host’s position in the cluster:

    ceph osd crush swap-bucket $NEWHOST $OLDHOST
    

    At this point all data on $OLDHOST will begin migrating to the OSDs on $NEWHOST. If there is a difference between the total capacity of the old hosts and the total capacity of the new hosts, you may also see some data migrate to or from other nodes in the cluster. Provided that the hosts are similarly sized, however, this will be a relatively small amount of data.

  5. Wait for the data migration to complete:

    while ! ceph osd safe-to-destroy $(ceph osd ls-tree $OLDHOST); do sleep 60 ; done
    
  6. Stop all old OSDs on the now-empty $OLDHOST:

    ssh $OLDHOST
    systemctl kill ceph-osd.target
    umount /var/lib/ceph/osd/ceph-*
    
  7. Destroy and purge the old OSDs:

    for osd in `ceph osd ls-tree $OLDHOST`; do
       ceph osd purge $osd --yes-i-really-mean-it
    done
    
  8. Wipe the old OSDs. This requires you to identify which devices are to be wiped manually. BE CAREFUL! For each device:

    ceph-volume lvm zap $DEVICE
    
  9. Use the now-empty host as the new host, and repeat:

    NEWHOST=$OLDHOST
    

Advantages:

  • Data is copied over the network only once.

  • An entire host’s OSDs are converted at once.

  • Can be parallelized, to make possible the conversion of multiple hosts at the same time.

  • No host involved in this process needs to have a spare device.

Disadvantages:

  • A spare host is required.

  • An entire host’s worth of OSDs will be migrating data at a time. This is likely to impact overall cluster performance.

  • All migrated data still makes one full hop over the network.

Per-OSD device copy

A single logical OSD can be converted by using the copy function included in ceph-objectstore-tool. This requires that the host have one or more free devices to provision a new, empty BlueStore OSD. For example, if each host in your cluster has twelve OSDs, then you need a thirteenth unused OSD so that each OSD can be converted before the previous OSD is reclaimed to convert the next OSD.

Caveats:

  • This approach requires that we prepare an empty BlueStore OSD but that we do not allocate a new OSD ID to it. The ceph-volume tool does not support such an operation. IMPORTANT: because the setup of dmcrypt is closely tied to the identity of the OSD, this approach does not work with encrypted OSDs.

  • The device must be manually partitioned.

  • An unsupported user-contributed script that demonstrates this process may be found here: https://github.com/ceph/ceph/blob/master/src/script/contrib/ceph-migrate-bluestore.bash

Advantages:

  • Provided that the ‘noout’ or the ‘norecover’/’norebalance’ flags are set on the OSD or the cluster while the conversion process is underway, little or no data migrates over the network during the conversion.

Disadvantages:

  • Tooling is not fully implemented, supported, or documented.

  • Each host must have an appropriate spare or empty device for staging.

  • The OSD is offline during the conversion, which means new writes to PGs with the OSD in their acting set may not be ideally redundant until the subject OSD comes up and recovers. This increases the risk of data loss due to an overlapping failure. However, if another OSD fails before conversion and startup have completed, the original Filestore OSD can be started to provide access to its original data.