GEOM(4) | Device Drivers Manual | GEOM(4) |
GEOM - modular disk I/O request transformation framework
options GEOM_BDE
options GEOM_CACHE
options GEOM_CONCAT
options GEOM_ELI
options GEOM_GATE
options GEOM_JOURNAL
options GEOM_LABEL
options GEOM_LINUX_LVM
options GEOM_MAP
options GEOM_MIRROR
options GEOM_MOUNTVER
options GEOM_MULTIPATH
options GEOM_NOP
options GEOM_PART_APM
options GEOM_PART_BSD
options GEOM_PART_BSD64
options GEOM_PART_EBR
options GEOM_PART_EBR_COMPAT
options GEOM_PART_GPT
options GEOM_PART_LDM
options GEOM_PART_MBR
options GEOM_PART_VTOC8
options GEOM_RAID
options GEOM_RAID3
options GEOM_SHSEC
options GEOM_STRIPE
options GEOM_UZIP
options GEOM_VIRSTOR
options GEOM_ZERO
The GEOM framework provides an infrastructure in which “classes” can perform transformations on disk I/O requests on their path from the upper kernel to the device drivers and back.
Transformations in a GEOM context range from the simple geometric displacement performed in typical disk partitioning modules, through RAID algorithms and device multipath resolution, to full-blown cryptographic protection of the stored data.
Compared to traditional “volume management”, GEOM differs from most, and in some cases all, previous implementations in the following ways:
GEOM is extensible. It is trivially simple to write a new class of transformation and it will not be given stepchild treatment. If someone for some reason wanted to mount IBM MVS diskpacks, a class recognizing and configuring their VTOC information would be a trivial matter.
GEOM is topologically agnostic. Most volume management implementations have very strict notions of how classes can fit together, very often one fixed hierarchy is provided, for instance, subdisk - plex - volume.
Being extensible means that new transformations are treated no differently than existing transformations.
Fixed hierarchies are bad because they make it impossible to express the intent efficiently. In the fixed hierarchy above, it is not possible to mirror two physical disks and then partition the mirror into subdisks; instead one is forced to make subdisks on the physical volumes and to mirror these pairwise, resulting in a much more complex configuration. GEOM, on the other hand, does not care in which order things are done; the only restriction is that cycles in the graph are not allowed.
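The no-cycles restriction can be illustrated with a small userland model. This is a minimal sketch, not the kernel's implementation: the struct topo type, the topo_attach() function and the dependency matrix are hypothetical names introduced here, and a plain depth-first search stands in for whatever mechanism the kernel uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_GEOMS 16

/*
 * Hypothetical model: depends[a][b] is true when geom a has a consumer
 * attached to a provider of geom b.
 */
struct topo {
	bool depends[MAX_GEOMS][MAX_GEOMS];
};

static void
topo_init(struct topo *t)
{
	memset(t, 0, sizeof(*t));
}

/* Depth-first search: is 'target' reachable from 'from' via depends edges? */
static bool
reaches(const struct topo *t, int from, int target, bool *seen)
{
	if (from == target)
		return (true);
	seen[from] = true;
	for (int next = 0; next < MAX_GEOMS; next++) {
		if (t->depends[from][next] && !seen[next] &&
		    reaches(t, next, target, seen))
			return (true);
	}
	return (false);
}

/*
 * Attach a consumer of 'consumer_geom' to a provider of 'provider_geom'.
 * Returns false and leaves the topology unchanged if the new edge would
 * close a cycle.
 */
static bool
topo_attach(struct topo *t, int consumer_geom, int provider_geom)
{
	bool seen[MAX_GEOMS] = { false };

	assert(consumer_geom >= 0 && consumer_geom < MAX_GEOMS);
	assert(provider_geom >= 0 && provider_geom < MAX_GEOMS);
	if (reaches(t, provider_geom, consumer_geom, seen))
		return (false);
	t->depends[consumer_geom][provider_geom] = true;
	return (true);
}
```

In this model, a mirror geom can attach to two disk geoms and a partitioning geom can then attach to the mirror, in any order; only an attach that would make a geom depend on itself, directly or indirectly, is refused.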
GEOM is quite object oriented and consequently the terminology borrows a lot of context and semantics from the OO vocabulary:
A “class”, represented by the data structure g_class, implements one particular kind of transformation. Typical examples are MBR disk partition, BSD disklabel, and RAID5 classes.
An instance of a class is called a “geom” and represented by the data structure g_geom. In a typical i386 FreeBSD system, there will be one geom of class MBR for each disk.
A “provider”, represented by the data structure g_provider, is the front gate at which a geom offers service. A provider is “a disk-like thing which appears in /dev” - a logical disk in other words. All providers have three main properties: “name”, “sectorsize” and “size”.
A “consumer” is the backdoor through which a geom connects to another geom's provider and through which I/O requests are sent.
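The four entities can be pictured with a deliberately simplified model. The sk_ prefixed types and functions below are hypothetical and are not the kernel definitions of g_class, g_geom, g_provider or g_consumer; only the three provider properties (name, sectorsize, size) are taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sk_geom;

/* A class: one particular kind of transformation. */
struct sk_class {
	const char *name;
};

/* A geom: an instance of a class. */
struct sk_geom {
	const char *name;
	const struct sk_class *cls;
};

/* A provider: the front gate at which a geom offers service. */
struct sk_provider {
	const char *name;
	uint32_t sectorsize;		/* bytes per sector */
	uint64_t size;			/* total size in bytes */
	struct sk_geom *geom;		/* geom offering this provider */
};

/* A consumer: the backdoor through which a geom reaches a provider. */
struct sk_consumer {
	struct sk_geom *geom;		/* geom owning this consumer */
	struct sk_provider *provider;	/* attached provider, or NULL */
};

/* Attach a consumer to a provider; returns 0 on success, -1 on refusal. */
static int
sk_attach(struct sk_consumer *cp, struct sk_provider *pp)
{
	if (cp->provider != NULL)
		return (-1);		/* already attached */
	if (cp->geom == pp->geom)
		return (-1);		/* would form a one-geom loop */
	cp->provider = pp;
	return (0);
}

/* Detach breaks the bond again. */
static void
sk_detach(struct sk_consumer *cp)
{
	cp->provider = NULL;
}

/* Number of whole sectors a provider offers. */
static uint64_t
sk_sectors(const struct sk_provider *pp)
{
	assert(pp->sectorsize != 0);
	return (pp->size / pp->sectorsize);
}
```

The model keeps one provider per consumer and says nothing about how many providers or consumers a geom may own; it is only meant to make the vocabulary concrete.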
The topological relationships between these entities are as follows:
All geoms have a rank-number assigned, which is used to detect and prevent loops in the acyclic directed graph. This rank number is assigned as follows:
In addition to the straightforward attach, which attaches a consumer to a provider, and detach, which breaks the bond, a number of special topological maneuvers exist to facilitate configuration and to improve the overall flexibility.
A new class will be offered to all existing providers in turn and a new provider will be offered to all classes in turn.
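The two offering rules can be sketched as a pair of loops. Everything below is a hypothetical userland model: the sk_registry type, the taste callback signature and the sk_taste_da_prefix() example are assumptions made for illustration. In particular, matching on a name prefix is only a stand-in for whatever a class actually inspects when it tastes a provider.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_MAX 8

struct sk_provider {
	const char *name;
};

struct sk_class {
	const char *name;
	/* Returns nonzero if the class accepts the offered provider. */
	int (*taste)(const struct sk_provider *pp);
};

struct sk_registry {
	const struct sk_class *classes[SK_MAX];
	size_t nclasses;
	const struct sk_provider *providers[SK_MAX];
	size_t nproviders;
	size_t ngeoms;		/* geoms created by successful tastes */
};

/* A new class is offered every existing provider in turn. */
static void
sk_add_class(struct sk_registry *r, const struct sk_class *c)
{
	assert(r->nclasses < SK_MAX);
	r->classes[r->nclasses++] = c;
	for (size_t i = 0; i < r->nproviders; i++) {
		if (c->taste(r->providers[i]))
			r->ngeoms++;
	}
}

/* A new provider is offered to every existing class in turn. */
static void
sk_add_provider(struct sk_registry *r, const struct sk_provider *pp)
{
	assert(r->nproviders < SK_MAX);
	r->providers[r->nproviders++] = pp;
	for (size_t i = 0; i < r->nclasses; i++) {
		if (r->classes[i]->taste(pp))
			r->ngeoms++;
	}
}

/* Illustrative taste routine: accept providers whose name starts with "da". */
static int
sk_taste_da_prefix(const struct sk_provider *pp)
{
	return (strncmp(pp->name, "da", 2) == 0);
}
```

Because both entry points run the same taste callback, the order in which classes and providers appear does not change which geoms end up being created in this model.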
Exactly what a class does to recognize if it should accept the offered provider is not defined by GEOM, but the sensible options are:
When a geom orphans a provider, all future I/O requests will “bounce” on the provider with an error code set by the geom. Any consumers attached to the provider will receive notification about the orphanization when the event loop gets around to it, and they can take appropriate action at that time.
A geom which came into being as a result of a normal taste operation should self-destruct unless it has a way to keep functioning whilst lacking the orphaned provider. Geoms like disk slicers should therefore self-destruct whereas RAID5 or mirror geoms will be able to continue as long as they do not lose quorum.
When a provider is orphaned, this does not necessarily result in any immediate change in the topology: any attached consumers are still attached, any opened paths are still open, any outstanding I/O requests are still outstanding.
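The behaviour described above can be modelled in a few lines. This is a minimal sketch under stated assumptions: the sk_prov structure and the sk_orphan(), sk_io_request() and sk_run_events() functions are hypothetical names, and a single pending flag stands in for the event loop.

```c
#include <assert.h>
#include <errno.h>

struct sk_prov {
	int error;		/* 0 while healthy, error code once orphaned */
	int nconsumers;		/* consumers currently attached */
	int nopen;		/* paths currently open through the provider */
	int orphan_pending;	/* notification not yet delivered */
};

/* The geom orphans its provider with an error code of its choosing. */
static void
sk_orphan(struct sk_prov *pp, int error)
{
	assert(error != 0);
	pp->error = error;
	pp->orphan_pending = 1;
	/* Note: nconsumers and nopen are deliberately left untouched. */
}

/* New I/O requests bounce with the stored error once orphaned. */
static int
sk_io_request(const struct sk_prov *pp)
{
	return (pp->error);
}

/*
 * Event loop stand-in: deliver the orphan notification once and
 * return the number of consumers that were notified.
 */
static int
sk_run_events(struct sk_prov *pp)
{
	if (!pp->orphan_pending)
		return (0);
	pp->orphan_pending = 0;
	return (pp->nconsumers);
}
```

The point of the model is the separation of steps: orphaning changes only how future requests are answered, while attachments and open counts stay as they were until the notified consumers act.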
The typical scenario is:
While this approach seems byzantine, it does provide the maximum flexibility and robustness in handling disappearing devices.
The one absolutely crucial detail to be aware of is that if the device driver does not return all I/O requests, the tree will not unravel.
Imagine a disk, da0, on top of which an MBR geom provides da0s1 and da0s2, and on top of da0s1 a BSD geom provides da0s1a through da0s1e, and that both the MBR and BSD geoms have autoconfigured based on data structures on the disk media. Now imagine the case where da0 is opened for writing and those data structures are modified or overwritten: now the geoms would be operating on stale metadata unless some notification system can inform them otherwise.
To avoid this situation, when the open of da0 for write happens, all attached consumers are told about this and geoms like MBR and BSD will self-destruct as a result. When da0 is closed, it will be offered for tasting again and, if the data structures for MBR and BSD are still there, new geoms will instantiate themselves anew.
Now for the fine print:
If any of the paths through the MBR or BSD module were open, they would have opened downwards with an exclusive bit thus rendering it impossible to open da0 for writing in that case. Conversely, the requested exclusive bit would render it impossible to open a path through the MBR geom while da0 is open for writing.
From this it also follows that changing the size of open geoms can only be done with their cooperation.
Finally: the spoiling only happens when the write count goes from zero to non-zero and the retasting happens only when the write count goes from non-zero to zero.
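The edge-triggered rule in the last paragraph can be written down directly. The following is a minimal sketch with hypothetical names (sk_access, sk_adjust_writers() and the sk_event values); it models only the write count, not the exclusive bit.

```c
#include <assert.h>

enum sk_event {
	SK_NONE,	/* nothing happens */
	SK_SPOIL,	/* attached consumers are told; MBR/BSD-like geoms go away */
	SK_RETASTE	/* provider is offered for tasting again */
};

struct sk_access {
	int writers;	/* current write count on the provider */
};

/*
 * Apply a change to the write count and report which event, if any,
 * the transition triggers.
 */
static enum sk_event
sk_adjust_writers(struct sk_access *a, int delta)
{
	int before = a->writers;
	int after = before + delta;

	assert(after >= 0);
	a->writers = after;
	if (before == 0 && after > 0)
		return (SK_SPOIL);	/* zero to non-zero */
	if (before > 0 && after == 0)
		return (SK_RETASTE);	/* non-zero to zero */
	return (SK_NONE);
}
```

A second writer opening an already write-opened provider therefore causes no additional spoiling, and retasting waits until the last writer has closed.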
Finally, I/O is the reason we even do this: it concerns itself with sending I/O requests through the graph.
In total, four different I/O requests exist in GEOM: read, write, delete, and “get attribute”.
Read and write are self-explanatory.
Delete indicates that a certain range of data is no longer used and that it can be erased or freed as the underlying technology supports. Technologies like flash adaptation layers can arrange to erase the relevant blocks before they will become reassigned and cryptographic devices may want to fill random bits into the range to reduce the amount of data available for attack.
It is important to recognize that a delete indication is not a request and consequently there is no guarantee that the data actually will be erased or made unavailable unless guaranteed by specific geoms in the graph. If “secure delete” semantics are required, a geom should be pushed which converts delete indications into (a sequence of) write requests.
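A geom that turns delete indications into writes might look roughly like the following userland model. This is a sketch under stated assumptions, not kernel code: the sk_req and sk_disk types, the SK_CHUNK size and both start routines are hypothetical, and zero-filling is just one possible choice of overwrite pattern.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_CHUNK 4	/* illustrative maximum size of one write */

enum sk_cmd { SK_READ, SK_WRITE, SK_DELETE };

struct sk_req {
	enum sk_cmd cmd;
	size_t offset;
	size_t length;
	uint8_t *data;	/* unused for SK_DELETE */
};

struct sk_disk {
	uint8_t *media;
	size_t size;
};

/* Lower layer: honours read and write, and is free to ignore delete. */
static int
sk_disk_start(struct sk_disk *d, struct sk_req *r)
{
	if (r->offset > d->size || r->length > d->size - r->offset)
		return (-1);
	switch (r->cmd) {
	case SK_READ:
		memcpy(r->data, d->media + r->offset, r->length);
		break;
	case SK_WRITE:
		memcpy(d->media + r->offset, r->data, r->length);
		break;
	case SK_DELETE:
		break;	/* an indication only, nothing is guaranteed */
	}
	return (0);
}

/* Upper layer: converts a delete into a sequence of zero-filled writes. */
static int
sk_secure_start(struct sk_disk *d, struct sk_req *r)
{
	static uint8_t zeros[SK_CHUNK];	/* static storage is zero-initialized */
	size_t done = 0;

	if (r->cmd != SK_DELETE)
		return (sk_disk_start(d, r));
	while (done < r->length) {
		size_t n = r->length - done;
		struct sk_req w;
		int error;

		if (n > SK_CHUNK)
			n = SK_CHUNK;
		w.cmd = SK_WRITE;
		w.offset = r->offset + done;
		w.length = n;
		w.data = zeros;
		error = sk_disk_start(d, &w);
		if (error != 0)
			return (error);
		done += n;
	}
	return (0);
}
```

Sending the same delete through sk_disk_start() leaves the media untouched, while sending it through sk_secure_start() overwrites the range, which is the difference between an indication and a guarantee.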
“Get attribute” supports inspection and manipulation of out-of-band attributes on a particular provider or path. Attributes are named by ASCII strings and they will be discussed in a separate section below.
(Stay tuned while the author rests his brain and fingers: more to come.)
Several flags are provided for tracing GEOM operations and unlocking protection mechanisms via the kern.geom.debugflags sysctl. All of these flags are off by default, and great care should be taken in turning them on.
G_T_TOPOLOGY
G_T_BIO
G_T_ACCESS
G_F_DISKIOCTL
G_F_CTLDUMP
The following options have been deprecated and will be removed in FreeBSD 12: GEOM_BSD, GEOM_FOX, GEOM_MBR, GEOM_SUNLABEL, and GEOM_VOL. Use the GEOM_PART_BSD, GEOM_MULTIPATH, GEOM_PART_MBR, GEOM_PART_VTOC8, and GEOM_LABEL options, respectively, instead.
libgeom(3), DECLARE_GEOM_CLASS(9), disk(9), g_access(9), g_attach(9), g_bio(9), g_consumer(9), g_data(9), g_event(9), g_geom(9), g_provider(9), g_provider_by_name(9)
This software was developed for the FreeBSD Project by Poul-Henning Kamp and NAI Labs, the Security Research Division of Network Associates, Inc. under DARPA/SPAWAR contract N66001-01-C-8035 (“CBOSS”), as part of the DARPA CHATS research program.
The first precursor for GEOM was a gruesome hack to Minix 1.2 and was never distributed. An earlier attempt to implement a less general scheme in FreeBSD never succeeded.
Poul-Henning Kamp <phk@FreeBSD.org>
April 9, 2018 | Debian |