Moving dm-integrity tags off the HDD to speed up RAID
Putting dm-integrity under RAID gives reasonable protection against silent data corruption, but on HDDs it slashes the read-write speeds in half. The slowdown is caused by the nature of dm-integrity. It reserves a portion of the disk for “tags”, which are checksums of data sectors. Whenever a sector is written, the corresponding tag has to be updated; whenever a sector is read, the tag has to be read as well. On HDDs, all that seeking tanks performance.
To combat the slowdown, dm-integrity allows storing tags on a separate disk. Setting it up is trivial, but fitting it with the rest of the storage puzzle is another matter. Here’s the problems I ran into, along with the solutions I settled upon.
Note: use “flexible I/O tester”, fio
You’ll probably want to benchmark your disks before and after applying my
advice. For this, fio
is indispensable. It’s a versatile tool written by
kernel block layer maintainers, and seem to be the de-facto standard for disk
benchmarking on Linux. Here’s how to check 4K write performance of
a dm-integrity device:
$ sudo fio --name=test --filename=/dev/mapper/integrity_29dd2d87 \
--direct=1 --ioengine=sync --rw=write --bs=4k --numjobs=1 \
--runtime=300s --time_based
The same command can benchmark the raw disk when given --filename=/dev/sda
.
Pitfall: provided_data_sectors
always uses 512-byte sectors
You can find out more about your dm-integrity device with the following command:
$ sudo integritysetup dump /dev/mapper/nvme0n1p3_lvm-integrity_tags_701637e5
Info for integrity device /dev/mapper/nvme0n1p3_lvm-integrity_tags_701637e5.
superblock_version 5
log2_interleave_sectors 0
integrity_tag_size 20
journal_sections 190
provided_data_sectors 1953521664
sector_size 4096
log2_blocks_per_bitmap 12
flags fix_hmac
Here, \(provided\_data\_sectors \times sector\_size\) hints at 7.27TiB, but in
fact this is a 1TiB disk. Don’t get alarmed that provided_data_sectors
doesn’t change when you specify different --sector-size
; it always uses
512-byte sectors (I believe that’s what the Linux storage layer always does).
Problem: it’s hard to precisely size the tags partition
Storing dm-integrity tags on a separate device is simple:
$ sudo integritysetup format TAG_DEVICE --data-device=DATA_DEVICE
DATA_DEVICE
contains the data, and is probably an entire disk. Tags are
shorter than the data sectors they’re generated from, so TAG_DEVICE
can be
smaller, but how small exactly?
It definitely can’t be smaller than \(tag\_size \times \frac{data\_disk\_size}{data\_sector\_size}\). For example, SHA-1 produces 20 bytes, so for 3TiB disk with 4K sectors dm-integrity tags will require \(20 \times \frac{3 * 1024 * 1024 * 1024}{4}\) bytes, or 15GiB.
On top of the tags, dm-integrity needs space for its journal. (It also supports
bitmap mode, but journal is more crash-resistant, so that’s the only thing
I tried.) In my experiments, it seems like 40MiB is enough for it; additional
storage doesn’t affect journal_sections
in integritysetup dump
output.
I might be leaving some performance on the table by not tuning the journal further, but the loss must be negligible compared to the 2x boost I get from moving the tags onto a separate disk. If you do dig into this, please write a blog post and I’ll link to it from here ;)
Problem: not all integrity algorithms are supported by /etc/integritytab
This might not be a problem for you if you’re running a newer distribution, don’t use /etc/integritytab, or don’t rely on systemd at all.
One way to open dm-integrity devices upon boot is to describe them in
/etc/integritytab. This is a systemd config, and your version of systemd might
not support all the options that integritysetup
supports. In my case, systemd
252 shipped with RaspberryOS doesn’t provide xxhash64 hashing algorithm, so
I had to go with a larger, heavier SHA-1.
Problem: mdraid assembles RAID straight from data devices
integritysetup TAG_DEVICE --data-device=DATA_DEVICE
only puts its magic
numbers onto the TAG_DEVICE
. DATA_DEVICE
remains unchanged and looks like
a pristine disk. When dm-integrity device is added into a RAID array, mdraid
magic numbers end up on DATA_DEVICE
at exactly the place where mdraid would
look for them, so upon next boot mdraid will add DATA_DEVICE
into RAID without
waiting for dm-integrity to come up.
To work around that, let’s make a couple tweaks to /etc/mdadm/mdadm.conf:
specify where to look for mdraid magic numbers:
DEVICE /dev/mapper/integrity_*
Assuming all your dm-integrity devices are named integrity_SOMETHING, this line means that mdraid only scans devices with these names. In other words, /dev/sda and others will be ignored.
Theoretically this should be enough. mdraid would wait for dm-integrity devices to appear, read its magic numbers from them, and add them to the RAID. In practice though, this didn’t work for me on Linux 6.12 with mdadm 4.2.
This line is still needed to prevent mdraid from assembling a new array out of dm-integrity data disks. Keep it.
to help mdraid with building the array from the correct devices, let’s specify them directly in the config:
ARRAY /dev/md0 metadata=1.2 name=raspberrypi5:0 UUID=0c778aa8:8edea278:52c7517a:33131878 devices=/dev/mapper/integrity_29dd2d87,/dev/mapper/integrity_701637e5
Everything up to
devices=...
part can be obtained frommdadm --detail --scan
. Then you have to manually adddevices=
, enumerating dm-integrity disks that actually constitute the array.
After updating the config, run update-initramfs -u
to add a copy into your
initramfs. (The command might be different if you’re running something other
than Debian.)
Problem: XFS won’t mount after changing the sector size
In the process of moving the tags off the HDD, I changed dm-integrity sector size from 512 to 4096 bytes to better match the underlying disks. To my surprise, this prevented XFS from mounting. Here’s the error from dmesg:
XFS (dm-3): device supports 4096 byte sectors (not 512)
Thanks to the helpful thread on the Unraid forum, I found that
I could mount the partition with -o offset=0
. This creates a loop device and
makes a mess of lsblk
, but works well.
Your thoughts are welcome by email
(here’s why my blog doesn’t have a comments form)