Zoned Storage Support
Zoned Storage is a class of storage devices that enables the host and the storage device to cooperate to achieve higher storage capacities, increased throughput, and lower latencies. The zoned storage interface is available today on Shingled Magnetic Recording (SMR) hard disks through the SCSI Zoned Block Commands (ZBC) and Zoned Device ATA Command Set (ZAC) standards, and it is also being adopted for NVMe Solid State Disks with the upcoming NVMe Zoned Namespaces (ZNS) standard.
This project aims to enable Ceph to work on zoned storage drives and, at the same time, to explore research problems related to adopting this new interface. The first target is to enable non-overwrite workloads (e.g. RGW) on host-managed SMR (HM-SMR) drives and to explore cleaning (garbage collection) policies. HM-SMR drives are high-capacity hard drives with the ZBC/ZAC interface. The longer-term goal is to support ZNS SSDs, as they become available, as well as overwrite workloads.
The first patch in this series enabled writing data to HM-SMR drives. This patch introduces ZonedFreelistManager, a FreelistManager implementation that passes enough information to ZonedAllocator to correctly initialize the state of zones by tracking the write pointer and the number of dead bytes per zone. A new FreelistManager implementation is needed because on zoned devices a region of disk can be in one of three states (empty, used, and dead), whereas the current BitmapFreelistManager tracks only two states (empty and used). It is not possible to accurately initialize the state of zones in ZonedAllocator by tracking only two states. The third planned patch will introduce a rudimentary cleaner to form a baseline for further research.
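The three-state bookkeeping described above can be illustrated with a small Python sketch. This is not Ceph's implementation: the class name ZoneState and its methods append, release, breakdown, and reset are hypothetical, chosen only to show how a write pointer plus a dead-byte count per zone is enough to tell empty, used, and dead space apart.

```python
from dataclasses import dataclass


@dataclass
class ZoneState:
    """Hypothetical per-zone record; an illustration, not Ceph's data structure."""
    size: int               # zone capacity in bytes
    write_pointer: int = 0  # offset of the next sequential write
    dead_bytes: int = 0     # bytes written earlier and since invalidated

    def append(self, length: int) -> int:
        """Write sequentially at the write pointer; return the offset used."""
        if length <= 0 or self.write_pointer + length > self.size:
            raise ValueError("write does not fit in this zone")
        offset = self.write_pointer
        self.write_pointer += length
        return offset

    def release(self, length: int) -> None:
        """Mark previously written bytes as dead; they are not reusable yet."""
        if length <= 0 or self.dead_bytes + length > self.write_pointer:
            raise ValueError("cannot release more than was written")
        self.dead_bytes += length

    def breakdown(self) -> dict:
        """Split the zone into the three states: empty, used, and dead."""
        return {
            "empty": self.size - self.write_pointer,
            "used": self.write_pointer - self.dead_bytes,
            "dead": self.dead_bytes,
        }

    def reset(self) -> None:
        """Model a zone reset, as a cleaner might issue after relocating live data."""
        self.write_pointer = 0
        self.dead_bytes = 0


ZONE_SIZE = 256 * 1024 * 1024  # 256 MiB, assumed to match the zone size of the drive

zone = ZoneState(size=ZONE_SIZE)
zone.append(4 * 1024 * 1024)   # first 4 MiB object
zone.append(4 * 1024 * 1024)   # second 4 MiB object
zone.release(4 * 1024 * 1024)  # one object deleted: its bytes become dead
print(zone.breakdown())
```

In this model, a two-state bitmap would report the released 4 MiB as empty, yet on a zoned device that space cannot be written again until the whole zone is reset. Keeping the dead-byte count separate from the write pointer preserves that distinction.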
Currently we can run basic RADOS benchmarks on an OSD backed by an HM-SMR drive, restart the OSD, read back the written data, and write new data, as can be seen below.
Please contact Abutalib Aghayev <agayev@psu.edu> for questions.
$ sudo zbd report -i -n /dev/sdc
Device /dev/sdc:
Vendor ID: ATA HGST HSH721414AL T240
Zone model: host-managed
Capacity: 14000.520 GB (27344764928 512-bytes sectors)
Logical blocks: 3418095616 blocks of 4096 B
Physical blocks: 3418095616 blocks of 4096 B
Zones: 52156 zones of 256.0 MB
Maximum number of open zones: no limit
Maximum number of active zones: no limit
52156 / 52156 zones
$ MON=1 OSD=1 MDS=0 sudo ../src/vstart.sh --new --localhost --bluestore --bluestore-devs /dev/sdc --bluestore-zoned
<snipped verbose output>
$ sudo ./bin/ceph osd pool create bench 32 32
pool 'bench' created
$ sudo ./bin/rados bench -p bench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_h0.cc.journaling712.narwhal.p_29846
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        45        29   115.943       116    0.384175    0.407806
    2      16        86        70   139.949       164    0.259845    0.391488
    3      16       125       109   145.286       156     0.31727    0.404727
    4      16       162       146   145.953       148    0.826671    0.409003
    5      16       203       187   149.553       164     0.44815    0.404303
    6      16       242       226   150.621       156    0.227488    0.409872
    7      16       281       265   151.384       156    0.411896    0.408686
    8      16       320       304   151.956       156    0.435135    0.411473
    9      16       359       343   152.401       156    0.463699    0.408658
   10      15       396       381   152.356       152    0.409554    0.410851
Total time run: 10.3305
Total writes made: 396
Write size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 153.333
Stddev Bandwidth: 13.6561
Max bandwidth (MB/sec): 164
Min bandwidth (MB/sec): 116
Average IOPS: 38
Stddev IOPS: 3.41402
Max IOPS: 41
Min IOPS: 29
Average Latency(s): 0.411226
Stddev Latency(s): 0.180238
Max latency(s): 1.00844
Min latency(s): 0.108616
$ sudo ../src/stop.sh
$ # Notice the lack of "--new" parameter to vstart.sh
$ MON=1 OSD=1 MDS=0 sudo ../src/vstart.sh --localhost --bluestore --bluestore-devs /dev/sdc --bluestore-zoned
<snipped verbose output>
$ sudo ./bin/rados bench -p bench 10 rand
hints = 1
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        61        45   179.903       180    0.117329    0.244067
    2      16       116       100   199.918       220    0.144162    0.292305
    3      16       174       158   210.589       232    0.170941    0.285481
    4      16       251       235   234.918       308    0.241175    0.256543
    5      16       316       300   239.914       260    0.206044    0.255882
    6      15       392       377   251.206       308    0.137972    0.247426
    7      15       458       443   252.984       264   0.0800146    0.245138
    8      16       529       513   256.346       280    0.103529    0.239888
    9      16       587       571   253.634       232    0.145535      0.2453
   10      15       646       631   252.254       240    0.837727    0.246019
Total time run: 10.272
Total reads made: 646
Read size: 4194304
Object size: 4194304
Bandwidth (MB/sec): 251.558
Average IOPS: 62
Stddev IOPS: 10.005
Max IOPS: 77
Min IOPS: 45
Average Latency(s): 0.249385
Max latency(s): 0.888654
Min latency(s): 0.0103208
$ sudo ./bin/rados bench -p bench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_h0.aa.journaling712.narwhal.p_64416
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16        46        30   119.949       120     0.52627    0.396166
    2      16        82        66   131.955       144     0.48087    0.427311
    3      16       123       107   142.627       164      0.3287    0.420614
    4      16       158       142   141.964       140    0.405177    0.425993
    5      16       192       176   140.766       136    0.514565    0.425175
    6      16       224       208   138.635       128     0.69184    0.436672
    7      16       261       245   139.967       148    0.459929    0.439502
    8      16       301       285   142.468       160    0.250846    0.434799
    9      16       336       320   142.189       140    0.621686    0.435457
   10      16       374       358   143.166       152    0.460593    0.436384