Troubleshooting PGs
Placement Groups Never Get Clean
If, after you have created your cluster, any Placement Groups (PGs) remain in
the active
status, the active+remapped
status or the
active+degraded
status and never achieves an active+clean
status, you
likely have a problem with your configuration.
In such a situation, it may be necessary to review the settings in the Pool, PG and CRUSH Config Reference and make appropriate adjustments.
As a general rule, run your cluster with more than one OSD and a pool size greater than two object replicas.
One Node Cluster
Ceph no longer provides documentation for operating on a single node. Systems designed for distributed computing by definition do not run on a single node. The mounting of client kernel modules on a single node that contains a Ceph daemon may cause a deadlock due to issues with the Linux kernel itself (unless VMs are used as clients). You can experiment with Ceph in a one-node configuration, in spite of the limitations as described herein.
To create a cluster on a single node, you must change the
osd_crush_chooseleaf_type
setting from the default of 1
(meaning
host
or node
) to 0
(meaning osd
) in your Ceph configuration
file before you create your monitors and OSDs. This tells Ceph that an OSD is
permitted to place another OSD on the same host. If you are trying to set up a
single-node cluster and osd_crush_chooseleaf_type
is greater than 0
,
Ceph will attempt to place the PGs of one OSD with the PGs of another OSD on
another node, chassis, rack, row, or datacenter depending on the setting.
Tip
DO NOT mount kernel clients directly on the same node as your Ceph Storage Cluster. Kernel conflicts can arise. However, you can mount kernel clients within virtual machines (VMs) on a single node.
If you are creating OSDs using a single disk, you must manually create directories for the data first.
Fewer OSDs than Replicas
If two OSDs are in an up
and in
state, but the placement groups are not
in an active + clean
state, you may have an osd_pool_default_size
set
to greater than 2
.
There are a few ways to address this situation. If you want to operate your
cluster in an active + degraded
state with two replicas, you can set the
osd_pool_default_min_size
to 2
so that you can write objects in an
active + degraded
state. You may also set the osd_pool_default_size
setting to 2
so that you have only two stored replicas (the original and
one replica). In such a case, the cluster should achieve an active + clean
state.
Note
You can make the changes while the cluster is running. If you make the changes in your Ceph configuration file, you might need to restart your cluster.
Pool Size = 1
If you have osd_pool_default_size
set to 1
, you will have only one copy
of the object. OSDs rely on other OSDs to tell them which objects they should
have. If one OSD has a copy of an object and there is no second copy, then
there is no second OSD to tell the first OSD that it should have that copy. For
each placement group mapped to the first OSD (see ceph pg dump
), you can
force the first OSD to notice the placement groups it needs by running a
command of the following form:
ceph osd force-create-pg <pgid>
CRUSH Map Errors
If any placement groups in your cluster are unclean, then there might be errors in your CRUSH map.
Stuck Placement Groups
It is normal for placement groups to enter “degraded” or “peering” states after a component failure. Normally, these states reflect the expected progression through the failure recovery process. However, a placement group that stays in one of these states for a long time might be an indication of a larger problem. For this reason, the Ceph Monitors will warn when placement groups get “stuck” in a non-optimal state. Specifically, we check for:
inactive
- The placement group has not beenactive
for too long (that is, it hasn’t been able to service read/write requests).unclean
- The placement group has not beenclean
for too long (that is, it hasn’t been able to completely recover from a previous failure).stale
- The placement group status has not been updated by aceph-osd
. This indicates that all nodes storing this placement group may bedown
.
List stuck placement groups by running one of the following commands:
ceph pg dump_stuck stale
ceph pg dump_stuck inactive
ceph pg dump_stuck unclean
Stuck
stale
placement groups usually indicate that keyceph-osd
daemons are not running.Stuck
inactive
placement groups usually indicate a peering problem (see Placement Group Down - Peering Failure).Stuck
unclean
placement groups usually indicate that something is preventing recovery from completing, possibly unfound objects (see Unfound Objects);
Placement Group Down - Peering Failure
In certain cases, the ceph-osd
peering process can run into problems,
which can prevent a PG from becoming active and usable. In such a case, running
the command ceph health detail
will report something similar to the following:
ceph health detail
HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
...
pg 0.5 is down+peering
pg 1.4 is down+peering
...
osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651
Query the cluster to determine exactly why the PG is marked down
by running a command of the following form:
ceph pg 0.5 query
{ "state": "down+peering",
...
"recovery_state": [
{ "name": "Started\/Primary\/Peering\/GetInfo",
"enter_time": "2012-03-06 14:40:16.169679",
"requested_info_from": []},
{ "name": "Started\/Primary\/Peering",
"enter_time": "2012-03-06 14:40:16.169659",
"probing_osds": [
0,
1],
"blocked": "peering is blocked due to down osds",
"down_osds_we_would_probe": [
1],
"peering_blocked_by": [
{ "osd": 1,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us proceed"}]},
{ "name": "Started",
"enter_time": "2012-03-06 14:40:16.169513"}
]
}
The recovery_state
section tells us that peering is blocked due to down
ceph-osd
daemons, specifically osd.1
. In this case, we can start that
particular ceph-osd
and recovery will proceed.
Alternatively, if there is a catastrophic failure of osd.1
(for example, if
there has been a disk failure), the cluster can be informed that the OSD is
lost
and the cluster can be instructed that it must cope as best it can.
Important
Informing the cluster that an OSD has been lost is dangerous because the cluster cannot guarantee that the other copies of the data are consistent and up to date.
To report an OSD lost
and to instruct Ceph to continue to attempt recovery
anyway, run a command of the following form:
ceph osd lost 1
Recovery will proceed.
Unfound Objects
Under certain combinations of failures, Ceph may complain about unfound
objects, as in this example:
ceph health detail
HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
pg 2.4 is active+degraded, 78 unfound
This means that the storage cluster knows that some objects (or newer copies of existing objects) exist, but it hasn’t found copies of them. Here is an example of how this might come about for a PG whose data is on two OSDS, which we will call “1” and “2”:
1 goes down
2 handles some writes, alone
1 comes up
1 and 2 re-peer, and the objects missing on 1 are queued for recovery.
Before the new objects are copied, 2 goes down.
At this point, 1 knows that these objects exist, but there is no live
ceph-osd
that has a copy of the objects. In this case, IO to those objects
will block, and the cluster will hope that the failed node comes back soon.
This is assumed to be preferable to returning an IO error to the user.
Note
The situation described immediately above is one reason that setting
size=2
on a replicated pool and m=1
on an erasure coded pool risks
data loss.
Identify which objects are unfound by running a command of the following form:
ceph pg 2.4 list_unfound [starting offset, in json]
{
"num_missing": 1,
"num_unfound": 1,
"objects": [
{
"oid": {
"oid": "object",
"key": "",
"snapid": -2,
"hash": 2249616407,
"max": 0,
"pool": 2,
"namespace": ""
},
"need": "43'251",
"have": "0'0",
"flags": "none",
"clean_regions": "clean_offsets: [], clean_omap: 0, new_object: 1",
"locations": [
"0(3)",
"4(2)"
]
}
],
"state": "NotRecovering",
"available_might_have_unfound": true,
"might_have_unfound": [
{
"osd": "2(4)",
"status": "osd is down"
}
],
"more": false
}
If there are too many objects to list in a single result, the more
field
will be true and you can query for more. (Eventually the command line tool
will hide this from you, but not yet.)
Now you can identify which OSDs have been probed or might contain data.
At the end of the listing (before more: false
), might_have_unfound
is
provided when available_might_have_unfound
is true. This is equivalent to
the output of ceph pg #.# query
. This eliminates the need to use query
directly. The might_have_unfound
information given behaves the same way as
that query
does, which is described below. The only difference is that
OSDs that have the status of already probed
are ignored.
Use of query
:
ceph pg 2.4 query
"recovery_state": [
{ "name": "Started\/Primary\/Active",
"enter_time": "2012-03-06 15:15:46.713212",
"might_have_unfound": [
{ "osd": 1,
"status": "osd is down"}]},
In this case, the cluster knows that osd.1
might have data, but it is
down
. Here is the full range of possible states:
already probed
querying
OSD is down
not queried (yet)
Sometimes it simply takes some time for the cluster to query possible locations.
It is possible that there are other locations where the object might exist that are not listed. For example: if an OSD is stopped and taken out of the cluster and then the cluster fully recovers, and then through a subsequent set of failures the cluster ends up with an unfound object, the cluster will ignore the removed OSD. (This scenario, however, is unlikely.)
If all possible locations have been queried and objects are still lost, you may have to give up on the lost objects. This, again, is possible only when unusual combinations of failures have occurred that allow the cluster to learn about writes that were performed before the writes themselves have been recovered. To mark the “unfound” objects as “lost”, run a command of the following form:
ceph pg 2.5 mark_unfound_lost revert|delete
Here the final argument (revert|delete
) specifies how the cluster should
deal with lost objects.
The delete
option will cause the cluster to forget about them entirely.
The revert
option (which is not available for erasure coded pools) will
either roll back to a previous version of the object or (if it was a new
object) forget about the object entirely. Use revert
with caution, as it
may confuse applications that expect the object to exist.
Homeless Placement Groups
It is possible that every OSD that has copies of a given placement group fails.
If this happens, then the subset of the object store that contains those
placement groups becomes unavailable and the monitor will receive no status
updates for those placement groups. The monitor marks as stale
any
placement group whose primary OSD has failed. For example:
ceph health
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
Identify which placement groups are stale
and which were the last OSDs to
store the stale
placement groups by running the following command:
ceph health detail
HEALTH_WARN 24 pgs stale; 3/300 in osds are down
...
pg 2.5 is stuck stale+active+remapped, last acting [2,0]
...
osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861
This output indicates that placement group 2.5 (pg 2.5
) was last managed by
osd.0
and osd.2
. Restart those OSDs to allow the cluster to recover
that placement group.
Only a Few OSDs Receive Data
If only a few of the nodes in the cluster are receiving data, check the number of placement groups in the pool as instructed in the Placement Groups documentation. Since placement groups get mapped to OSDs in an operation involving dividing the number of placement groups in the cluster by the number of OSDs in the cluster, a small number of placement groups (the remainder, in this operation) are sometimes not distributed across the cluster. In situations like this, create a pool with a placement group count that is a multiple of the number of OSDs. See Placement Groups for details. See the Pool, PG, and CRUSH Config Reference for instructions on changing the default values used to determine how many placement groups are assigned to each pool.
Can’t Write Data
If the cluster is up, but some OSDs are down and you cannot write data, make
sure that you have the minimum number of OSDs running for the placement group.
If you don’t have the minimum number of OSDs running, Ceph will not allow you
to write data because there is no guarantee that Ceph can replicate your data.
See osd_pool_default_min_size
in the Pool, PG, and CRUSH Config
Reference for details.
PGs Inconsistent
If the command ceph health detail
returns an active + clean +
inconsistent
state, this might indicate an error during scrubbing. Identify
the inconsistent placement group or placement groups by running the following
command:
$ ceph health detail
HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
pg 0.6 is active+clean+inconsistent, acting [0,1,2]
2 scrub errors
Alternatively, run this command if you prefer to inspect the output in a programmatic way:
$ rados list-inconsistent-pg rbd
["0.6"]
There is only one consistent state, but in the worst case, we could have
different inconsistencies in multiple perspectives found in more than one
object. If an object named foo
in PG 0.6
is truncated, the output of
rados list-inconsistent-pg rbd
will look something like this:
rados list-inconsistent-obj 0.6 --format=json-pretty
{
"epoch": 14,
"inconsistents": [
{
"object": {
"name": "foo",
"nspace": "",
"locator": "",
"snap": "head",
"version": 1
},
"errors": [
"data_digest_mismatch",
"size_mismatch"
],
"union_shard_errors": [
"data_digest_mismatch_info",
"size_mismatch_info"
],
"selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])",
"shards": [
{
"osd": 0,
"errors": [],
"size": 968,
"omap_digest": "0xffffffff",
"data_digest": "0xe978e67f"
},
{
"osd": 1,
"errors": [],
"size": 968,
"omap_digest": "0xffffffff",
"data_digest": "0xe978e67f"
},
{
"osd": 2,
"errors": [
"data_digest_mismatch_info",
"size_mismatch_info"
],
"size": 0,
"omap_digest": "0xffffffff",
"data_digest": "0xffffffff"
}
]
}
]
}
In this case, the output indicates the following:
The only inconsistent object is named
foo
, and its head has inconsistencies.The inconsistencies fall into two categories:
errors
: these errors indicate inconsistencies between shards, without an indication of which shard(s) are bad. Check for theerrors
in theshards
array, if available, to pinpoint the problem.data_digest_mismatch
: the digest of the replica read fromOSD.2
is different from the digests of the replica reads ofOSD.0
andOSD.1
size_mismatch
: the size of the replica read fromOSD.2
is0
, but the size reported byOSD.0
andOSD.1
is968
.
union_shard_errors
: the union of all shard-specificerrors
in theshards
array. Theerrors
are set for the shard with the problem. These errors includeread_error
and other similar errors. Theerrors
ending inoi
indicate a comparison withselected_object_info
. Examine theshards
array to determine which shard has which error or errors.data_digest_mismatch_info
: the digest stored in theobject-info
is not0xffffffff
, which is calculated from the shard read fromOSD.2
size_mismatch_info
: the size stored in theobject-info
is different from the size read fromOSD.2
. The latter is0
.
Warning
If read_error
is listed in a shard’s errors
attribute, the
inconsistency is likely due to physical storage errors. In cases like this,
check the storage used by that OSD.
Examine the output of dmesg
and smartctl
before attempting a drive
repair.
To repair the inconsistent placement group, run a command of the following form:
ceph pg repair {placement-group-ID}
If you receive active + clean + inconsistent
states periodically due to
clock skew, consider configuring the NTP daemons on your monitor
hosts to act as peers. See The Network Time Protocol
and Ceph Clock Settings for more information.
Erasure Coded PGs are not active+clean
If CRUSH fails to find enough OSDs to map to a PG, it will show as a
2147483647
which is ITEM_NONE
or no OSD found
. For example:
[2,1,6,0,5,8,2147483647,7,4]
Not enough OSDs
If the Ceph cluster has only eight OSDs and an erasure coded pool needs nine OSDs, the cluster will show “Not enough OSDs”. In this case, you either create another erasure coded pool that requires fewer OSDs, by running commands of the following form:
ceph osd erasure-code-profile set myprofile k=5 m=3
ceph osd pool create erasurepool erasure myprofile
or add new OSDs, and the PG will automatically use them.
CRUSH constraints cannot be satisfied
If the cluster has enough OSDs, it is possible that the CRUSH rule is imposing constraints that cannot be satisfied. If there are ten OSDs on two hosts and the CRUSH rule requires that no two OSDs from the same host are used in the same PG, the mapping may fail because only two OSDs will be found. Check the constraint by displaying (“dumping”) the rule, as shown here:
ceph osd crush rule ls
[
"replicated_rule",
"erasurepool"]
$ ceph osd crush rule dump erasurepool
{ "rule_id": 1,
"rule_name": "erasurepool",
"type": 3,
"steps": [
{ "op": "take",
"item": -1,
"item_name": "default"},
{ "op": "chooseleaf_indep",
"num": 0,
"type": "host"},
{ "op": "emit"}]}
Resolve this problem by creating a new pool in which PGs are allowed to have OSDs residing on the same host by running the following commands:
ceph osd erasure-code-profile set myprofile crush-failure-domain=osd
ceph osd pool create erasurepool erasure myprofile
CRUSH gives up too soon
If the Ceph cluster has just enough OSDs to map the PG (for instance a cluster with a total of nine OSDs and an erasure coded pool that requires nine OSDs per PG), it is possible that CRUSH gives up before finding a mapping. This problem can be resolved by:
lowering the erasure coded pool requirements to use fewer OSDs per PG (this requires the creation of another pool, because erasure code profiles cannot be modified dynamically).
adding more OSDs to the cluster (this does not require the erasure coded pool to be modified, because it will become clean automatically)
using a handmade CRUSH rule that tries more times to find a good mapping. This can be modified for an existing CRUSH rule by setting
set_choose_tries
to a value greater than the default.
First, verify the problem by using crushtool
after extracting the crushmap
from the cluster. This ensures that your experiments do not modify the Ceph
cluster and that they operate only on local files:
ceph osd crush rule dump erasurepool
{ "rule_id": 1,
"rule_name": "erasurepool",
"type": 3,
"steps": [
{ "op": "take",
"item": -1,
"item_name": "default"},
{ "op": "chooseleaf_indep",
"num": 0,
"type": "host"},
{ "op": "emit"}]}
$ ceph osd getcrushmap > crush.map
got crush map from osdmap epoch 13
$ crushtool -i crush.map --test --show-bad-mappings \
--rule 1 \
--num-rep 9 \
--min-x 1 --max-x $((1024 * 1024))
bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]
Here, --num-rep
is the number of OSDs that the erasure code CRUSH rule
needs, --rule
is the value of the rule_id
field that was displayed by
ceph osd crush rule dump
. This test will attempt to map one million values
(in this example, the range defined by [--min-x,--max-x]
) and must display
at least one bad mapping. If this test outputs nothing, all mappings have been
successful and you can be assured that the problem with your cluster is not
caused by bad mappings.
Changing the value of set_choose_tries
Decompile the CRUSH map to edit the CRUSH rule by running the following command:
crushtool --decompile crush.map > crush.txt
Add the following line to the rule:
step set_choose_tries 100
The relevant part of the
crush.txt
file will resemble this:rule erasurepool { id 1 type erasure step set_chooseleaf_tries 5 step set_choose_tries 100 step take default step chooseleaf indep 0 type host step emit }
Recompile and retest the CRUSH rule:
crushtool --compile crush.txt -o better-crush.map
When all mappings succeed, display a histogram of the number of tries that were necessary to find all of the mapping by using the
--show-choose-tries
option of thecrushtool
command, as in the following example:crushtool -i better-crush.map --test --show-bad-mappings \ --show-choose-tries \ --rule 1 \ --num-rep 9 \ --min-x 1 --max-x $((1024 * 1024)) ... 11: 42 12: 44 13: 54 14: 45 15: 35 16: 34 17: 30 18: 25 19: 19 20: 22 21: 20 22: 17 23: 13 24: 16 25: 13 26: 11 27: 11 28: 13 29: 11 30: 10 31: 6 32: 5 33: 10 34: 3 35: 7 36: 5 37: 2 38: 5 39: 5 40: 2 41: 5 42: 4 43: 1 44: 2 45: 2 46: 3 47: 1 48: 0 ... 102: 0 103: 1 104: 0 ...
This output indicates that it took eleven tries to map forty-two PGs, twelve tries to map forty-four PGs etc. The highest number of tries is the minimum value of
set_choose_tries
that prevents bad mappings (for example,103
in the above output, because it did not take more than 103 tries for any PG to be mapped).