ZPOOLCONCEPTS(8) | System Manager's Manual | ZPOOLCONCEPTS(8) |
zpoolconcepts - overview of ZFS storage pools
A "virtual device" describes a single device or a collection of devices organized according to certain performance and fault characteristics. The following virtual devices are supported:
A raidz group can have single-, double-, or triple-parity, meaning that the raidz group can sustain one, two, or three failures, respectively, without losing any data. The raidz1 vdev type specifies a single-parity raidz group; the raidz2 vdev type specifies a double-parity raidz group; and the raidz3 vdev type specifies a triple-parity raidz group. The raidz vdev type is an alias for raidz1.
A raidz group with N disks of size X with P parity disks can hold approximately (N-P)*X bytes and can withstand P device(s) failing without compromising data integrity. The minimum number of devices in a raidz group is one more than the number of parity disks. The recommended number is between 3 and 9 to help increase performance.
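The capacity estimate above can be checked with a few lines of shell arithmetic. This is only a sketch: the function name raidz_capacity is hypothetical, the result is in whatever unit X is given in, and real pools give up some additional space to metadata and padding.

```shell
# Approximate usable capacity of a raidz group: (N - P) * X.
# n: number of disks, p: number of parity disks (1, 2, or 3), x: size of each disk.
raidz_capacity() {
    n=$1; p=$2; x=$3
    # A raidz group needs at least one more device than its parity level.
    if [ "$n" -le "$p" ]; then
        echo "raidz with parity $p needs at least $((p + 1)) devices" >&2
        return 1
    fi
    echo $(( (n - p) * x ))
}

raidz_capacity 6 2 4   # six 4 TB disks in raidz2: prints 16
```

With six 4 TB disks in a raidz2 group, roughly 16 TB is usable and any two disks can fail without data loss.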
For more information on special allocations, see the Special Allocation Class section.
Virtual devices cannot be nested, so a mirror or raidz virtual device can only contain files or disks. Mirrors of mirrors (or other combinations) are not allowed.
A pool can have any number of virtual devices at the top of the configuration (known as "root vdevs"). Data is dynamically distributed across all top-level devices to balance data among devices. As new virtual devices are added, ZFS automatically places data on the newly available devices.
Virtual devices are specified one at a time on the command line, separated by whitespace. The keywords mirror and raidz are used to distinguish where a group ends and another begins. For example, the following creates two root vdevs, each a mirror of two disks:
# zpool create mypool mirror sda sdb mirror sdc sdd
ZFS supports a rich set of mechanisms for handling device failure and data corruption. All metadata and data are checksummed, and ZFS automatically repairs bad data from a good copy when corruption is detected.
In order to take advantage of these features, a pool must make use of some form of redundancy, using either mirrored or raidz groups. While ZFS supports running in a non-redundant configuration, where each root vdev is simply a disk or file, this is strongly discouraged. A single case of bit corruption can render some or all of your data unavailable.
A pool's health status is described by one of three states: online, degraded, or faulted. An online pool has all devices operating normally. A degraded pool is one in which one or more devices have failed, but the data is still available due to a redundant configuration. A faulted pool has corrupted metadata, or one or more faulted devices, and insufficient replicas to continue functioning.
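For monitoring, the pool states described above can be filtered with a small script. The sketch below assumes the tab-separated "name, health" lines that zpool list -H -o name,health prints; the function name unhealthy_pools and the sample pool names are made up for illustration.

```shell
# Print the name and state of every pool whose health is not ONLINE.
# Expects "name<TAB>health" lines on standard input, as printed by:
#   zpool list -H -o name,health
unhealthy_pools() {
    awk -F '\t' '$2 != "ONLINE" { print $1 ": " $2 }'
}

# Sample input standing in for real zpool output (hypothetical pool names).
printf 'tank\tONLINE\nbackup\tDEGRADED\nscratch\tFAULTED\n' | unhealthy_pools
```

On a live system the same filter would be fed directly: zpool list -H -o name,health | unhealthy_pools. An empty result means every pool is online.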
The health of a top-level vdev, such as a mirror or raidz device, is potentially impacted by the state of its associated vdevs, or component devices. A top-level vdev or component device is in one of the following states:
One or more component devices is in the degraded or faulted state, but sufficient replicas exist to continue functioning. The underlying conditions are as follows:
One or more component devices is in the faulted state, and insufficient replicas exist to continue functioning. The underlying conditions are as follows:
zpool offline command.
If a device is removed and later re-attached to the system, ZFS attempts to put the device online automatically. Device attach detection is hardware-dependent and might not be supported on all platforms.
ZFS allows devices to be associated with pools as "hot spares". These devices are not actively used in the pool, but when an active device fails, it is automatically replaced by a hot spare. To create a pool with hot spares, specify a spare vdev with any number of devices. For example:
# zpool create pool mirror sda sdb spare sdc sdd
Spares can be shared across multiple pools, and can be added with the zpool add command and removed with the zpool remove command. Once a spare replacement is initiated, a new spare vdev is created within the configuration that will remain there until the original device is replaced. At this point, the hot spare becomes available again if another device fails.
If a pool has a shared spare that is currently being used, the pool cannot be exported, since other pools may use this shared spare, which may lead to data corruption.
Shared spares add some risk. If the pools are imported on different hosts, and both pools suffer a device failure at the same time, both could attempt to use the spare at the same time. This may not be detected, resulting in data corruption.
An in-progress spare replacement can be cancelled by detaching the hot spare. If the original faulted device is detached, then the hot spare assumes its place in the configuration, and is removed from the spare list of all active pools.
Spares cannot replace log devices.
The ZFS Intent Log (ZIL) satisfies POSIX requirements for synchronous transactions. For instance, databases often require their transactions to be on stable storage devices when returning from a system call. NFS and other applications can also use fsync(2) to ensure data stability. By default, the intent log is allocated from blocks within the main pool. However, it might be possible to get better performance using separate intent log devices such as NVRAM or a dedicated disk. For example:
# zpool create pool sda sdb log sdc
Multiple log devices can also be specified, and they can be mirrored. See the EXAMPLES section for an example of mirroring multiple log devices.
Log devices can be added, replaced, attached, detached and removed. In addition, log devices are imported and exported as part of the pool that contains them. Mirrored devices can be removed by specifying the top-level mirror vdev.
Devices can be added to a storage pool as "cache devices". These devices provide an additional layer of caching between main memory and disk. For read-heavy workloads, where the working set size is much larger than what can be cached in main memory, using cache devices allows much more of this working set to be served from low-latency media. Using cache devices provides the greatest performance improvement for random-read workloads of mostly static content.
To create a pool with cache devices, specify a cache vdev with any number of devices. For example:
# zpool create pool sda sdb cache sdc sdd
Cache devices cannot be mirrored or part of a raidz configuration. If a read error is encountered on a cache device, that read I/O is reissued to the original storage pool device, which might be part of a mirrored or raidz configuration.
The content of the cache devices is persistent across reboots and restored asynchronously when importing the pool in L2ARC (persistent L2ARC). This can be disabled by setting l2arc_rebuild_enabled=0. For cache devices smaller than 1 GB we do not write the metadata structures required for rebuilding the L2ARC in order not to waste space. This can be changed with l2arc_rebuild_blocks_min_l2size. The cache device header (512 bytes) is updated even if no metadata structures are written. Setting l2arc_headroom=0 will result in scanning the full-length ARC lists for cacheable content to be written in L2ARC (persistent ARC). If a cache device is added with zpool add, its label and header will be overwritten and its contents are not going to be restored in L2ARC, even if the device was previously part of the pool. If a cache device is onlined with zpool online, its contents will be restored in L2ARC. This is useful in case of memory pressure where the contents of the cache device are not fully restored in L2ARC. The user can off/online the cache device when there is less memory pressure in order to fully restore its contents to L2ARC.
Before starting critical procedures that include destructive actions (e.g., zfs destroy), an administrator can checkpoint the pool's state and in the case of a mistake or failure, rewind the entire pool back to the checkpoint. Otherwise, the checkpoint can be discarded when the procedure has completed successfully.
A pool checkpoint can be thought of as a pool-wide snapshot and should be used with care as it contains every part of the pool's state, from properties to vdev configuration. Thus, while a pool has a checkpoint, certain operations are not allowed: vdev removal, attach, and detach; mirror splitting; and changing the pool's GUID. Adding a new vdev is supported, but in the case of a rewind it will have to be added again. Finally, users of this feature should keep in mind that scrubs in a pool that has a checkpoint do not repair checkpointed data.
To create a checkpoint for a pool:
# zpool checkpoint pool
To later rewind the pool to its checkpointed state, first export it and then rewind it during import:
# zpool export pool
# zpool import --rewind-to-checkpoint pool
To discard the checkpoint from a pool:
# zpool checkpoint -d pool
Dataset reservations (controlled by the reservation or refreservation zfs properties) may be unenforceable while a checkpoint exists, because the checkpoint is allowed to consume the dataset's reservation. Finally, data that is part of the checkpoint but has been freed in the current state of the pool won't be scanned during a scrub.
The allocations in the special class are dedicated to specific block types. By default this includes all metadata, the indirect blocks of user data, and any deduplication tables. The class can also be provisioned to accept small file blocks.
A pool must always have at least one normal (non-dedup/special) vdev before other devices can be assigned to the special class. If the special class becomes full, then allocations intended for it will spill back into the normal class.
Deduplication tables can be excluded from the special class by setting the zfs_ddt_data_is_special zfs module parameter to false (0).
Inclusion of small file blocks in the special class is opt-in. Each dataset can control the size of small file blocks allowed in the special class by setting the special_small_blocks dataset property. It defaults to zero, so you must opt in by setting it to a non-zero value. See zfs(8) for more information on setting this property.
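The opt-in threshold can be modeled in a few lines of shell. This sketch assumes the behavior described for special_small_blocks in the ZFS property documentation, namely that file blocks smaller than or equal to the threshold go to the special class. It also assumes the pool has a special vdev with free space. The function name block_class is hypothetical and sizes are in bytes.

```shell
# Report which allocation class a file block would use, given its size and
# the dataset's special_small_blocks threshold (both in bytes).
# Assumption: blocks smaller than or equal to the threshold are "special".
block_class() {
    block_size=$1
    threshold=$2
    if [ "$threshold" -gt 0 ] && [ "$block_size" -le "$threshold" ]; then
        echo special
    else
        echo normal
    fi
}

block_class 16384 32768    # 16K block, 32K threshold: prints "special"
block_class 131072 32768   # 128K block, 32K threshold: prints "normal"
block_class 16384 0        # default threshold of zero: prints "normal"
```

A threshold of 32K would be set on a dataset with a command such as zfs set special_small_blocks=32K pool/dataset, where pool/dataset is a placeholder name.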
August 9, 2019 | Linux |