CVE-2021-20288: Unauthorized global_id reuse in cephx

NIST information page

Summary

Ceph did not ensure that reconnecting or renewing clients presented an existing ticket when reclaiming their global_id value. An attacker who was able to authenticate could claim a global_id in use by a different client and potentially disrupt other cluster services.

Background

Each authenticated client or daemon in Ceph is assigned a numeric global_id identifier. That value is assumed to be unique across the cluster. When clients reconnect to the monitor (e.g., due to a network disconnection) or renew their ticket, they are supposed to present their old ticket to prove prior possession of their global_id so that it can be reclaimed and thus remain constant over the lifetime of that client instance.

Ceph was not correctly checking that the old ticket was valid, allowing an arbitrary global_id to be reclaimed, even if it was in use by another active client in the system.
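
The difference between the flawed and the intended behavior can be illustrated with a toy model. This is a minimal sketch, not Ceph code: the ticket table, the global_id values, and the function names `reclaim_unchecked` and `reclaim_checked` are all hypothetical, and real cephx tickets are validated cryptographically rather than by string comparison.
```shell
# Toy model, not Ceph code. The monitor's record of issued tickets,
# one "global_id:ticket" pair per line (all values are made up).
issued_tickets='4101:ticket-aaa
4102:ticket-bbb'

# Unpatched behavior: the old ticket is never checked, so any
# authenticated client can claim any global_id.
reclaim_unchecked() {
    return 0
}

# Patched behavior: the reclaim succeeds only when the presented
# ticket is the one on record for that global_id.
reclaim_checked() {
    printf '%s\n' "$issued_tickets" | grep -qxF "$1:$2"
}

# The client holding ticket-aaa tries to take over global_id 4102.
reclaim_unchecked 4102 ticket-aaa && echo "unchecked: 4102 reclaimed by the wrong client"
reclaim_checked 4102 ticket-aaa || echo "checked: reclaim of 4102 rejected"
reclaim_checked 4101 ticket-aaa && echo "checked: 4101 reclaimed by its owner"
```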

Attacker Requirements

Any potential attacker must:

have a valid authentication key for the cluster
know or guess the global_id of another client
run a modified version of the Ceph client code to reclaim another client’s global_id
construct appropriate client messages or requests to disrupt service or exploit Ceph daemon assumptions about global_id uniqueness

Impact

Confidentiality Impact

None

Integrity Impact

Partial. An attacker could potentially exploit assumptions around global_id uniqueness to disrupt other clients’ access or disrupt Ceph daemons.

Availability Impact

High. An attacker could potentially exploit assumptions around global_id uniqueness to disrupt other clients’ access or disrupt Ceph daemons.

Access Complexity

High. The attacker must use modified client code in order to exploit specific assumptions in the behavior of other Ceph daemons.

Authentication

Yes. The attacker must be authenticated and have access to the same services as the client it wishes to impersonate or disrupt.

Gained Access

Partial. An attacker can partially impersonate another client.

Affected versions

All Ceph monitor versions prior to the fixed versions listed below fail to ensure that global_id reclaim attempts are authentic.

In addition, all user-space daemons and clients starting from Luminous v12.2.0 were failing to securely reclaim their global_id following commit a2eb6ae3fb57 (“mon/monclient: hunt for multiple monitor in parallel”).

All versions of the Linux kernel client properly authenticate.

Fixed versions

Pacific v16.2.1 (and later)
Octopus v15.2.11 (and later)
Nautilus v14.2.20 (and later)
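
A version string can be compared against this list with a small helper. This is a sketch under stated assumptions: the function name `is_patched` is hypothetical, it treats every major release after Pacific as fixed (following "and later" above), it treats every series older than Nautilus as unfixed, and it relies on `sort -V`, which is available in GNU coreutils but not guaranteed on every platform.
```shell
# Succeeds if the given Ceph version is at or beyond the first fixed
# point release of its series. Accepts "16.2.1" or "v16.2.1".
is_patched() {
    ver=${1#v}
    major=${ver%%.*}
    case $major in
        14) min=14.2.20 ;;   # Nautilus
        15) min=15.2.11 ;;   # Octopus
        16) min=16.2.1 ;;    # Pacific
        # Assumption: anything newer than Pacific carries the fix,
        # anything older than Nautilus does not.
        *) [ "$major" -gt 16 ] 2>/dev/null; return ;;
    esac
    # With version-aware sorting, ver >= min exactly when min sorts first.
    [ "$(printf '%s\n%s\n' "$min" "$ver" | sort -V | head -n 1)" = "$min" ]
}
```
For example, `is_patched 14.2.9 || echo "upgrade needed"` reports that an upgrade is needed. Running `ceph versions` against the cluster lists the versions of the running daemons that would need to be checked this way.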

Fix details

Patched monitors now properly require that clients securely reclaim their global_id when auth_allow_insecure_global_id_reclaim is set to false. Initially, this option defaults to true so that existing clients can continue to function without disruption until all clients have been upgraded. When this option is set to false, an unpatched client will not be able to reconnect to the cluster after an intermittent network disruption breaks its connection to a monitor, nor will it be able to renew its authentication ticket when that ticket times out (by default, after 72 hours).

Patched monitors raise the AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED health alert if auth_allow_insecure_global_id_reclaim is enabled. This health alert can be muted with:
```
ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 1w
```
Although it is not recommended, the alert can also be disabled with:
```
ceph config set mon mon_warn_on_insecure_global_id_reclaim_allowed false
```
Patched monitors can disconnect new clients right after they have authenticated (forcing them to reconnect and reclaim) in order to determine whether they securely reclaim global_ids. This allows the cluster and users to discover quickly whether clients would be affected by requiring secure global_id reclaim: most clients will report an authentication error immediately. This behavior can be disabled by setting auth_expose_insecure_global_id_reclaim to false:
```
ceph config set mon auth_expose_insecure_global_id_reclaim false
```
Patched monitors will raise the AUTH_INSECURE_GLOBAL_ID_RECLAIM health alert for any clients or daemons that are not securely reclaiming their global_id. These clients should be upgraded before disabling the auth_allow_insecure_global_id_reclaim option to avoid disrupting client access.

By default (if auth_expose_insecure_global_id_reclaim has not been disabled), clients’ failure to securely reclaim global_id will immediately be exposed and raise this health alert. However, if auth_expose_insecure_global_id_reclaim has been disabled, this alert will not be triggered for a client until it is forced to reconnect to a monitor (e.g., due to a network disruption) or the client renews its authentication ticket (by default, after 72 hours).

The default time-to-live (TTL) for authentication tickets has been increased from 12 hours to 72 hours. Because monitors previously did not verify that a client's prior ticket was valid when it reclaimed its global_id, a client could tolerate a network outage that lasted longer than the ticket TTL and still reclaim its global_id. Once the cluster starts requiring secure global_id reclaim, a client that is disconnected for longer than the TTL may fail to reclaim its global_id, fail to reauthenticate, and be unable to continue communicating with the cluster until it is restarted. The default TTL was increased to minimize the impact of this change on users.
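
The effect of the longer TTL can be shown with simple arithmetic. This is a toy calculation, not Ceph code: the function name `ticket_expired_after_outage` is hypothetical, all values are whole hours, and it additionally assumes that the ticket may already have some age when the connection drops, so the remaining validity can be shorter than the full TTL.
```shell
# Toy arithmetic, not Ceph code. All arguments are whole hours.
#   $1 = ticket TTL
#   $2 = age of the ticket when the connection dropped
#   $3 = length of the outage
# Succeeds if the ticket has expired by the time the client reconnects.
ticket_expired_after_outage() {
    [ $(( $2 + $3 )) -gt "$1" ]
}

# A ticket that is 6 hours old, followed by a 10-hour outage:
if ticket_expired_after_outage 12 6 10; then echo "12h TTL: ticket expired"; fi
if ! ticket_expired_after_outage 72 6 10; then echo "72h TTL: ticket still valid"; fi
```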

Recommendations

Users should upgrade to a patched version of Ceph at their earliest convenience.
Users should upgrade any unpatched clients at their earliest convenience. By default, these clients can be easily identified by checking the ceph health detail output for the AUTH_INSECURE_GLOBAL_ID_RECLAIM alert.
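
One way to script that check is to filter the health output for the two alert codes named in this document. This is a minimal sketch: the function name `global_id_alerts` is hypothetical, it only looks for the alert codes as plain text, and it assumes a `grep` that supports `-o`.
```shell
# Reads `ceph health detail` text on stdin and prints each global_id
# alert code that appears in it, once per code.
global_id_alerts() {
    grep -oE 'AUTH_INSECURE_GLOBAL_ID_RECLAIM(_ALLOWED)?' | sort -u
}
```
Typical use would be `ceph health detail | global_id_alerts`. If AUTH_INSECURE_GLOBAL_ID_RECLAIM is printed, at least one client or daemon still needs to be upgraded.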

If not all clients can be upgraded immediately, the health alerts can be temporarily muted with:

```
ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM 1w  # 1 week
ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 1w  # 1 week
```

After all clients have been updated and the AUTH_INSECURE_GLOBAL_ID_RECLAIM alert is no longer present, the cluster should be set to prevent insecure global_id reclaim with:
```
ceph config set mon auth_allow_insecure_global_id_reclaim false
```
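The precondition can be folded into the command so that the setting is not changed while clients are still flagged. This is a sketch rather than an official procedure: the function name `enforce_secure_reclaim` is hypothetical, and it only inspects the text of `ceph health detail` for the AUTH_INSECURE_GLOBAL_ID_RECLAIM code.
```shell
# Disables insecure global_id reclaim only if no client or daemon is
# currently flagged by the AUTH_INSECURE_GLOBAL_ID_RECLAIM alert.
enforce_secure_reclaim() {
    # The trailing group keeps the _ALLOWED alert from matching.
    if ceph health detail | grep -qE 'AUTH_INSECURE_GLOBAL_ID_RECLAIM([^_]|$)'; then
        echo "clients still reclaim insecurely; setting left unchanged" >&2
        return 1
    fi
    ceph config set mon auth_allow_insecure_global_id_reclaim false
}
```
Calling `enforce_secure_reclaim` then either applies the setting or exits with a non-zero status. If auth_expose_insecure_global_id_reclaim has been disabled, the absence of the alert is only conclusive once every client has reconnected or renewed its ticket, which by default can take up to 72 hours.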