Backfill Reservation

When a new OSD joins a cluster, all PGs with it in their acting sets must eventually backfill. If all of these backfills happen simultaneously, they place excessive load on the OSD: the “thundering herd” effect.

The osd_max_backfills tunable limits the number of outgoing or incoming backfills that are active on a given OSD. Note that this limit is applied separately to incoming and to outgoing backfill operations. Thus there can be as many as osd_max_backfills * 2 backfill operations in flight on each OSD. This subtlety is often missed, and Ceph operators can be puzzled as to why more ops are observed than expected.
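The arithmetic above can be spelled out in a few lines. This is only an illustration of the doubling; the function name is hypothetical and is not part of Ceph.

```python
def max_backfills_in_flight(osd_max_backfills: int) -> int:
    """Upper bound on backfills touching one OSD at the same time."""
    outgoing = osd_max_backfills  # backfills going from this OSD
    incoming = osd_max_backfills  # backfills going to this OSD
    # The limit is applied to each direction separately, so the bounds add up.
    return outgoing + incoming


print(max_backfills_in_flight(1))  # 2
print(max_backfills_in_flight(3))  # 6
```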

Each OSDService has two AsyncReserver instances: one for backfills going from the OSD (local_reserver) and one for backfills going to the OSD (remote_reserver). An AsyncReserver (common/AsyncReserver.h) keeps waiting items in per-priority queues and tracks the set of current reservation holders. When a slot frees up, the AsyncReserver takes the next item from the highest-priority non-empty queue and queues its associated Context* in the finisher that was provided to the constructor.
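The behaviour described above can be modelled with a small Python sketch. It is not a port of common/AsyncReserver.h: the class name, method names, and the use of a plain list as the "finisher" are assumptions made for illustration, and preemption is left out.

```python
from collections import deque


class ReserverSketch:
    """Toy model: per-priority wait queues plus a set of current holders."""

    def __init__(self, max_allowed, finisher):
        self.max_allowed = max_allowed  # number of reservation slots
        self.finisher = finisher        # list standing in for the finisher
        self.queues = {}                # priority -> deque of (item, callback)
        self.in_progress = set()        # current reservation holders

    def request_reservation(self, item, callback, priority):
        self.queues.setdefault(priority, deque()).append((item, callback))
        self._do_queues()

    def cancel_reservation(self, item):
        if item in self.in_progress:
            self.in_progress.discard(item)  # frees a slot
        else:
            # The item may still be waiting; drop it from its queue.
            for queue in self.queues.values():
                for entry in list(queue):
                    if entry[0] == item:
                        queue.remove(entry)
        self._do_queues()

    def _do_queues(self):
        # Grant free slots to waiters, highest priority first.
        while len(self.in_progress) < self.max_allowed:
            nonempty = [p for p, q in self.queues.items() if q]
            if not nonempty:
                return
            item, callback = self.queues[max(nonempty)].popleft()
            self.in_progress.add(item)
            self.finisher.append(callback)  # callback runs later, not inline


finisher = []
reserver = ReserverSketch(max_allowed=1, finisher=finisher)
reserver.request_reservation("pg1", lambda: "pg1 granted", priority=100)
reserver.request_reservation("pg2", lambda: "pg2 granted", priority=140)
reserver.request_reservation("pg3", lambda: "pg3 granted", priority=180)
reserver.cancel_reservation("pg1")  # slot frees up; pg3 outranks pg2
print([callback() for callback in finisher])  # ['pg1 granted', 'pg3 granted']
```

With a single slot, pg1 is granted immediately because nothing else is waiting. Once it releases the slot, pg3 is served ahead of pg2 even though pg2 asked first, because pg3 sits on a higher-priority queue.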

For a primary to initiate a backfill, it must first obtain a reservation from its own local_reserver. Then it must obtain a reservation from the backfill target’s remote_reserver via an MBackfillReserve message. This process is managed by sub-states of Active and ReplicaActive (see the sub-states of Active in PG.h). The reservations are dropped either on the Backfilled event (which is sent on the primary before calling recovery_complete, and on the replica on receipt of the BackfillComplete progress message) or upon leaving Active or ReplicaActive.

It’s important to always grab the local reservation before the remote reservation in order to prevent a circular dependency.
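The ordering rule can be illustrated with a simplified, synchronous model. Everything here is hypothetical (the Osd class, try_start_backfill, and the status strings); real reservations are asynchronous and travel over messages, and this sketch simply keeps the local slot while the remote one is unavailable.

```python
class Osd:
    """Toy OSD with separate slot sets for outgoing and incoming backfills."""

    def __init__(self, name, max_backfills=1):
        self.name = name
        self.max_backfills = max_backfills
        self.local_holders = set()   # backfills going from this OSD
        self.remote_holders = set()  # backfills going to this OSD

    def _try_reserve(self, holders, pg):
        if len(holders) >= self.max_backfills:
            return False
        holders.add(pg)
        return True

    def try_reserve_local(self, pg):
        return self._try_reserve(self.local_holders, pg)

    def try_reserve_remote(self, pg):
        return self._try_reserve(self.remote_holders, pg)


def try_start_backfill(pg, primary, target):
    # Step 1: always take the primary's own local slot first.
    if not primary.try_reserve_local(pg):
        return "waiting for local"
    # Step 2: only then ask the backfill target for a remote slot.
    if not target.try_reserve_remote(pg):
        return "waiting for remote"
    return "backfilling"


osd0, osd1 = Osd("osd.0"), Osd("osd.1")
print(try_start_backfill("1.a", osd0, osd1))  # backfilling
print(try_start_backfill("1.b", osd0, osd1))  # waiting for local
print(try_start_backfill("1.c", osd1, osd0))  # backfilling
```

Because every primary requests slots in the same order, a PG that holds a remote slot is never left waiting on a local one. The third call also shows that incoming and outgoing slots are counted separately: osd.0 is already sending one backfill and can still receive another.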

We minimize the risk of data loss by prioritizing the order in which PGs are recovered. Admins can override the default order by using force-recovery or force-backfill. A force-recovery op at priority 255 will start before a force-backfill op at priority 254.

If recovery is needed because a PG is below min_size, a base priority of 220 is used. This is incremented by the number of OSDs short of the pool’s min_size as well as a value relative to the pool’s recovery_priority. The resultant priority is capped at 253 so that it does not interfere with the forced ops described above. Under ordinary circumstances a recovery op is prioritized at 180 plus a value relative to the pool’s recovery_priority. The resultant priority is capped at 219.

If backfill is needed because the number of acting OSDs is less than the pool’s min_size, a priority of 220 is used. The number of OSDs short of the pool’s min_size is added as well as a value relative to the pool’s recovery_priority. The total priority is limited to 253.

If backfill is needed because a PG is undersized, a priority of 140 is used. The number of OSDs below the size of the pool is added as well as a value relative to the pool’s recovery_priority. The resultant priority is capped at 179. If a backfill op is needed because a PG is degraded, a priority of 140 is used. A value relative to the pool’s recovery_priority is added. The resultant priority is capped at 179. Under ordinary circumstances a backfill op priority of 100 is used. A value relative to the pool’s recovery_priority is added. The total priority is capped at 139.
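The rules above reduce to "base plus adjustments, capped at a maximum". The sketch below restates them in Python. The function and parameter names are hypothetical, and pool_adjustment stands in for the "value relative to the pool’s recovery_priority", whose exact derivation is not given here.

```python
FORCE_RECOVERY = 255  # forced ops sit above every computed priority
FORCE_BACKFILL = 254


def _capped(base, bump, cap):
    """Add the adjustments to the base and cap the result."""
    return min(base + bump, cap)


def recovery_priority(inactive, osds_short_of_min_size=0, pool_adjustment=0):
    if inactive:  # PG is below min_size
        return _capped(220, osds_short_of_min_size + pool_adjustment, 253)
    return _capped(180, pool_adjustment, 219)


def backfill_priority(state, osds_short=0, pool_adjustment=0):
    # state is one of "inactive", "undersized", "degraded", "normal".
    if state == "inactive":    # fewer acting OSDs than min_size
        return _capped(220, osds_short + pool_adjustment, 253)
    if state == "undersized":  # fewer OSDs than the pool size
        return _capped(140, osds_short + pool_adjustment, 179)
    if state == "degraded":
        return _capped(140, pool_adjustment, 179)
    return _capped(100, pool_adjustment, 139)


print(recovery_priority(True, osds_short_of_min_size=2, pool_adjustment=5))  # 227
print(backfill_priority("undersized", osds_short=1, pool_adjustment=50))     # 179
print(backfill_priority("normal", pool_adjustment=3))                        # 103
```

The second call shows the cap at work: 140 + 1 + 50 would be 191, but the result is held at 179 so that it stays below ordinary recovery.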

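Backfill and Recovery op priorities

==================  =============  ================
Description         Base priority  Maximum priority
==================  =============  ================
Backfill            100            139
Degraded Backfill   140            179
Recovery            180            219
Inactive Recovery   220            253
Inactive Backfill   220            253
force-backfill      254
force-recovery      255
==================  =============  ================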