This week I wrote a minimal NVME storage driver for
Zircon. As usual, I used
Gerrit as a place to back up my work-in-progress from
time to time. The end result is a (possibly interesting) window into how I go from
zero to a functional driver. The first section presents the very first shell of a driver
I checked in. Each following section shows the diffs from the version preceding
it to that version of the driver. These focus on
nvme.c, the guts of the driver, and
nvme-hw.h (the header file with registers and structures and such);
the Zircon build system file is also present.
Where to start?
I like to read the documentation with an editor open to a header file, where I type up constants, structures, macros, and such for register access, data structures, and whathaveyou. It helps me wrap my head around the hardware.
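For NVME, that header starts out looking something like this (a sketch with my own names; the register offsets and queue entry layouts are straight from the spec):

```c
#include <stdint.h>

// Controller register offsets (from the NVME spec's register map)
#define NVME_REG_CAP  0x00  // Controller Capabilities (64-bit)
#define NVME_REG_VS   0x08  // Version
#define NVME_REG_CC   0x14  // Controller Configuration
#define NVME_REG_CSTS 0x1C  // Controller Status
#define NVME_REG_AQA  0x24  // Admin Queue Attributes
#define NVME_REG_ASQ  0x28  // Admin Submission Queue base (64-bit)
#define NVME_REG_ACQ  0x30  // Admin Completion Queue base (64-bit)

#define NVME_CC_EN    (1u << 0)  // enable controller
#define NVME_CSTS_RDY (1u << 0)  // controller ready

// Submission queue entry: always 64 bytes
typedef struct {
    uint32_t cdw0;      // opcode, flags, command id
    uint32_t nsid;      // namespace id
    uint64_t reserved;
    uint64_t mptr;      // metadata pointer
    uint64_t prp1;      // first PRP entry
    uint64_t prp2;      // second PRP entry or pointer to a PRP list
    uint32_t cdw10, cdw11, cdw12, cdw13, cdw14, cdw15;
} nvme_cmd_t;

// Completion queue entry: always 16 bytes
typedef struct {
    uint32_t result;
    uint32_t reserved;
    uint16_t sq_head;   // controller's view of the SQ head pointer
    uint16_t sq_id;
    uint16_t cmd_id;    // matches the id of the command completed
    uint16_t status;    // bit 0 is the phase tag
} nvme_cpl_t;
```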
A minimal shell of a driver that simply dumps parameters from the device and resets it.
Useful to start getting some data from real hardware (since NVME has a lot of
controller-specific parameters to look at):
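The reset itself is just clearing the enable bit and waiting for the controller to report not-ready. Sketched here against a plain register array for illustration; the function and names are mine, not the driver's actual code:

```c
#include <stdint.h>

#define NVME_REG_CC_IDX   (0x14 / 4)  // Controller Configuration, as a dword index
#define NVME_REG_CSTS_IDX (0x1C / 4)  // Controller Status
#define NVME_CC_EN        (1u << 0)
#define NVME_CSTS_RDY     (1u << 0)

// Clear CC.EN and wait for CSTS.RDY to drop; returns 0 on success, -1 on timeout.
static int nvme_reset(volatile uint32_t* regs) {
    regs[NVME_REG_CC_IDX] &= ~NVME_CC_EN;
    for (int i = 0; i < 1000000; i++) {
        if ((regs[NVME_REG_CSTS_IDX] & NVME_CSTS_RDY) == 0) {
            return 0;
        }
    }
    return -1;
}
```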
Make it do something, anything
First interaction with the hardware! Submit an IDENTIFY command to the Admin Submission
Queue and observe a reply from the Admin Completion Queue. Hexdump the results for inspection:
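Submitting IDENTIFY boils down to filling in a 64-byte submission queue entry. Something like this (a hedged sketch; the helper name and arguments are mine, the opcode and CNS encoding are from the spec):

```c
#include <stdint.h>
#include <string.h>

#define NVME_ADMIN_OP_IDENTIFY 0x06u

// 64-byte submission queue entry, laid out as the spec defines
typedef struct {
    uint32_t cdw0;
    uint32_t nsid;
    uint64_t reserved;
    uint64_t mptr;
    uint64_t prp1;
    uint64_t prp2;
    uint32_t cdw10, cdw11, cdw12, cdw13, cdw14, cdw15;
} nvme_sqe_t;

// Build an IDENTIFY CONTROLLER command. The physical address of a 4KB result
// buffer goes in PRP1; CDW10.CNS = 1 selects the controller data structure.
static void nvme_cmd_identify_ctlr(nvme_sqe_t* cmd, uint16_t cid, uint64_t buf_pa) {
    memset(cmd, 0, sizeof(*cmd));
    cmd->cdw0 = NVME_ADMIN_OP_IDENTIFY | ((uint32_t)cid << 16);
    cmd->prp1 = buf_pa;
    cmd->cdw10 = 1;  // CNS = 1: identify controller
}
```

After writing the entry you advance the SQ tail and write it to the queue's doorbell register, then watch the completion queue for an entry with a matching command id.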
Start making it a little more real
Factor Admin Queue processing out into dedicated functions, provide a convenience function
for transactions, wire up interrupts so we don’t have to spin on the completion status.
Decode some of the information from the IDENTIFY command and display it. Issue an IDENTIFY
NAMESPACE command as well. Actually publish a device instead of just failing.
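The interesting strings in the IDENTIFY data (serial at byte 4, model at 24, firmware revision at 64) are fixed-width, space-padded ASCII, so decoding them for display is mostly trimming. An illustrative helper, not the driver's actual code:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Copy a space-padded ASCII field out of the IDENTIFY data and trim the
// trailing padding. `out` must have room for len + 1 bytes.
static void nvme_ascii_field(const uint8_t* data, size_t off, size_t len, char* out) {
    memcpy(out, data + off, len);
    out[len] = 0;
    while (len > 0 && out[len - 1] == ' ') {
        out[--len] = 0;
    }
}
```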
Time to stop polling and actually use interrupts
Set up an IO submission and completion queue as well (preparation for doing actual disk IO)
and fiddle with IRQ setup a bit while trying to figure out why IRQs work on HW but not in Qemu.
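IO queues are created by issuing admin commands: CREATE IO COMPLETION QUEUE first, then CREATE IO SUBMISSION QUEUE pointing at it. A sketch of building those commands, with my own helper names; the opcodes and dword encodings are from the spec:

```c
#include <stdint.h>
#include <string.h>

#define NVME_ADMIN_OP_CREATE_IOSQ 0x01u
#define NVME_ADMIN_OP_CREATE_IOCQ 0x05u

typedef struct {
    uint32_t cdw0;
    uint32_t nsid;
    uint64_t reserved;
    uint64_t mptr;
    uint64_t prp1;
    uint64_t prp2;
    uint32_t cdw10, cdw11, cdw12, cdw13, cdw14, cdw15;
} nvme_sqe_t;

static void nvme_cmd_create_iocq(nvme_sqe_t* cmd, uint16_t cid, uint16_t qid,
                                 uint16_t qsize, uint16_t irq_vec, uint64_t buf_pa) {
    memset(cmd, 0, sizeof(*cmd));
    cmd->cdw0 = NVME_ADMIN_OP_CREATE_IOCQ | ((uint32_t)cid << 16);
    cmd->prp1 = buf_pa;                                // physically contiguous queue
    cmd->cdw10 = ((uint32_t)(qsize - 1) << 16) | qid;  // 0-based size, queue id
    cmd->cdw11 = ((uint32_t)irq_vec << 16) | (1u << 1) | 1u;  // IEN | PC
}

static void nvme_cmd_create_iosq(nvme_sqe_t* cmd, uint16_t cid, uint16_t qid,
                                 uint16_t qsize, uint16_t cqid, uint64_t buf_pa) {
    memset(cmd, 0, sizeof(*cmd));
    cmd->cdw0 = NVME_ADMIN_OP_CREATE_IOSQ | ((uint32_t)cid << 16);
    cmd->prp1 = buf_pa;
    cmd->cdw10 = ((uint32_t)(qsize - 1) << 16) | qid;
    cmd->cdw11 = ((uint32_t)cqid << 16) | 1u;          // PC: physically contiguous
}
```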
Now things get a bit more complicated
Added some #if 0'd code down in
nvme_init() where I experimented with IO READ ops to verify that I
understood how the command structure and PRP list worked. Added a QEMU_IRQ_HACK to use
polling instead of IRQs so I could test with Qemu as well. Start sketching out IO operation
processing, with the concept of breaking iotxns down into utxns that are 1:1 with nvme io
operations. Introduce #defines for a bunch of magic numbers, some more comments, and an IO
processing thread. Wire up the device ops
which will be needed for this to act as a real block device.
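The iotxn-to-utxn split is just carving a transfer into chunks no bigger than what one nvme command can move. Roughly like this (MAX_XFER_BYTES and all the names here are placeholders for illustration, not the driver's actual code):

```c
#include <stdint.h>

#define MAX_XFER_BYTES 65536u  // assumed per-command transfer cap, for illustration

typedef struct {
    uint64_t offset;  // byte offset of this chunk within the parent iotxn
    uint32_t length;  // bytes this one nvme command will move
} utxn_t;

// Break an iotxn of `total` bytes into utxns; returns how many chunks were written.
static uint32_t iotxn_split(uint64_t total, utxn_t* out, uint32_t max_out) {
    uint32_t n = 0;
    uint64_t off = 0;
    while (off < total && n < max_out) {
        uint32_t len = (total - off > MAX_XFER_BYTES) ? MAX_XFER_BYTES
                                                      : (uint32_t)(total - off);
        out[n].offset = off;
        out[n].length = len;
        off += len;
        n++;
    }
    return n;
}
```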
Make it actually work
Until now, anything trying to open the driver or interact with it would fail or hang. It was a bunch of code that poked at the hardware when loaded but didn’t do anything beyond that.
Build out the IO processing with
io_process_cpls() to handle completion messages from the HW and
io_process_txns() to handle subdividing iotxns into utxns and issuing IO commands to the
hardware. Not done yet, and not code reviewed, but the iochk multithreaded disk io exerciser
runs against devices published by this driver without failing or causing the driver to crash.
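Completion processing hinges on the phase tag: the controller flips bit 0 of the status field on each pass through the queue, so the driver can tell fresh entries from stale ones without any other handshake. A sketch of the idea (not the driver's actual io_process_cpls()):

```c
#include <stdint.h>

typedef struct {
    uint32_t result;
    uint32_t reserved;
    uint16_t sq_head;
    uint16_t sq_id;
    uint16_t cmd_id;
    uint16_t status;  // bit 0: phase tag
} nvme_cpl_t;

typedef struct {
    nvme_cpl_t* entries;
    uint16_t size;
    uint16_t head;
    uint16_t phase;   // expected phase tag; starts at 1, flips when head wraps
} nvme_cq_t;

// Consume completions whose phase tag matches; returns how many were handled.
static uint32_t io_process_cpls(nvme_cq_t* cq, void (*handle)(const nvme_cpl_t*)) {
    uint32_t n = 0;
    while ((cq->entries[cq->head].status & 1) == cq->phase) {
        if (handle) {
            handle(&cq->entries[cq->head]);
        }
        if (++cq->head == cq->size) {
            cq->head = 0;
            cq->phase ^= 1;  // controller flips the tag on each pass through the queue
        }
        n++;
    }
    return n;
}
```

After consuming entries you write the new head index to the CQ doorbell so the controller can reuse those slots.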
Some clean up
Fix a bug where the io thread would spin instead of wait when there was no io pending. Add
some simple stat counters (which helped detect this bug).
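The fix is the usual predicate-plus-wait pattern: sleep until something is actually pending instead of looping on the queue state. Sketched here with pthreads purely for illustration (the real driver uses Zircon primitives, and these names are mine):

```c
#include <pthread.h>
#include <stdint.h>

// Hypothetical io-thread state: pending work guarded by a mutex/condvar so the
// thread sleeps instead of spinning when there is nothing to do.
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t work;
    uint32_t pending;     // queued iotxns
    uint64_t stat_waits;  // simple counters like these are what exposed the spin
} io_state_t;

static void io_queue_work(io_state_t* s) {
    pthread_mutex_lock(&s->lock);
    s->pending++;
    pthread_cond_signal(&s->work);
    pthread_mutex_unlock(&s->lock);
}

// Called by the io thread: block until at least one txn is pending.
static uint32_t io_wait_for_work(io_state_t* s) {
    pthread_mutex_lock(&s->lock);
    while (s->pending == 0) {
        s->stat_waits++;
        pthread_cond_wait(&s->work, &s->lock);
    }
    uint32_t n = s->pending;
    s->pending = 0;
    pthread_mutex_unlock(&s->lock);
    return n;
}
```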
The specifications do not necessarily reflect the truth on the ground
Especially in the case where the peripheral is complex and has a bunch of optional features, tunables, etc., it’s worth exploring what actual hardware is capable of before depending on a feature nobody supports. For example, NVME allows the queues for submitting commands to be physically discontiguous, but no hardware I’ve seen so far supports that. Similarly, it supports a (required) simple scatter/gather page-list (PRP) and an (optional) fancier scatter/gather format that’s much more flexible (SGL). Turns out no hardware I’ve seen supports SGLs either.
I’ve been collecting various parameters that different NVME controllers report, which are useful to see since if you just go by what the spec says is possible you get a very different picture of what hardware might be like…
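The CAP register is where a controller declares several of these parameters: whether it requires physically contiguous queues (the CQR bit, set on everything I've seen), the maximum queue depth, the doorbell stride, and supported page sizes. Decoding the fields is straightforward (macros follow the spec's bit assignments; the sample value in the usage below is made up):

```c
#include <stdint.h>

// A few fields of the 64-bit CAP (Controller Capabilities) register:
#define CAP_MQES(cap)   ((uint32_t)((cap) & 0xFFFF) + 1)  // max queue entries (0-based)
#define CAP_CQR(cap)    (((cap) >> 16) & 1)   // contiguous queues required
#define CAP_DSTRD(cap)  (((cap) >> 32) & 0xF) // doorbell stride = 4 << DSTRD bytes
#define CAP_MPSMIN(cap) (((cap) >> 48) & 0xF) // min page size = 4KB << MPSMIN
```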
If you want to learn more…
The NVME specs themselves live over here: http://nvmexpress.org/resources/specifications/