Zero to Driver

This week I wrote a minimal NVME storage driver for Zircon. As usual, I used Gerrit as a place to back up my work-in-progress from time to time. The end result is a (possibly interesting) window into how I go from zero to a functional driver. The first section presents the very first shell of a driver I checked in. Each following section shows the diffs from the preceding version to that version of the driver. These focus on nvme.c, the guts of the driver, but nvme-hw.h (the header file with registers and structures and such) and the Zircon build system file are also present.

NVME M.2 modules

Where to start?

I like to read the documentation with an editor open to a header file where I type up constants, structures, and macros for register access, data structures, and what have you. It helps me wrap my head around the hardware.
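To give a flavor of what ends up in such a header, here is a small sketch. It is not the actual nvme-hw.h: the macro and function names are made up for illustration, and only the register offsets and CAP bit positions come from the NVME specification.

```c
#include <stdint.h>

// Controller register offsets from the start of BAR0 (per the NVME spec).
// All names here are illustrative, not the ones used in the real driver.
#define NVME_REG_CAP   0x00  // Controller Capabilities (64 bit)
#define NVME_REG_VS    0x08  // Version
#define NVME_REG_INTMS 0x0C  // Interrupt Mask Set
#define NVME_REG_INTMC 0x10  // Interrupt Mask Clear
#define NVME_REG_CC    0x14  // Controller Configuration
#define NVME_REG_CSTS  0x1C  // Controller Status
#define NVME_REG_AQA   0x24  // Admin Queue Attributes
#define NVME_REG_ASQ   0x28  // Admin Submission Queue base (64 bit)
#define NVME_REG_ACQ   0x30  // Admin Completion Queue base (64 bit)

// Field accessors for the 64 bit CAP register.
#define NVME_CAP_MQES(cap)   ((uint32_t)((cap) & 0xFFFF))        // max queue entries, zero-based
#define NVME_CAP_CQR(cap)    ((uint32_t)(((cap) >> 16) & 0x1))   // contiguous queues required
#define NVME_CAP_TO(cap)     ((uint32_t)(((cap) >> 24) & 0xFF))  // ready timeout, 500ms units
#define NVME_CAP_DSTRD(cap)  ((uint32_t)(((cap) >> 32) & 0xF))   // doorbell stride
#define NVME_CAP_MPSMIN(cap) ((uint32_t)(((cap) >> 48) & 0xF))   // min page size, 2^(12+n)
#define NVME_CAP_MPSMAX(cap) ((uint32_t)(((cap) >> 52) & 0xF))   // max page size, 2^(12+n)

// Smallest memory page size the controller supports, in bytes.
static inline uint32_t nvme_min_page_size(uint64_t cap) {
    return 1u << (12 + NVME_CAP_MPSMIN(cap));
}

// Doorbells start at offset 0x1000 and are spaced (4 << DSTRD) bytes apart,
// alternating submission queue tail / completion queue head per queue id.
static inline uint32_t nvme_sq_tail_doorbell(uint64_t cap, uint32_t qid) {
    return 0x1000 + (2 * qid) * (4u << NVME_CAP_DSTRD(cap));
}

static inline uint32_t nvme_cq_head_doorbell(uint64_t cap, uint32_t qid) {
    return 0x1000 + (2 * qid + 1) * (4u << NVME_CAP_DSTRD(cap));
}
```

Writing the accessors out like this is also a cheap way to catch a misread field width before any hardware is involved.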

A minimal shell of a driver that simply dumps parameters from the device and resets it. Useful to start getting some data from real hardware (since NVME has a lot of controller-specific parameters to look at):

Make it do something, anything

First interaction with the hardware! Submit an IDENTIFY command to the Admin Submission Queue and observe a reply from the Admin Completion Queue. Hexdump the results for inspection:
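For reference, an IDENTIFY submission looks roughly like the sketch below. The struct and function names are hypothetical (not lifted from nvme.c), a little-endian host is assumed, and buf_phys stands in for the physical address of a page-sized buffer the driver would have allocated.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// 64 byte submission queue entry (illustrative layout, little-endian host assumed).
typedef struct {
    uint32_t cdw0;   // opcode in bits 7:0, command id in bits 31:16
    uint32_t nsid;   // namespace id
    uint32_t cdw2;
    uint32_t cdw3;
    uint64_t mptr;   // metadata pointer
    uint64_t prp1;   // first data page
    uint64_t prp2;   // second data page, or pointer to a PRP list
    uint32_t cdw10;
    uint32_t cdw11;
    uint32_t cdw12;
    uint32_t cdw13;
    uint32_t cdw14;
    uint32_t cdw15;
} nvme_cmd_t;

static_assert(sizeof(nvme_cmd_t) == 64, "submission entries are 64 bytes");

#define NVME_ADMIN_OP_IDENTIFY 0x06
#define NVME_CNS_NAMESPACE     0   // CDW10 value: identify a namespace
#define NVME_CNS_CONTROLLER    1   // CDW10 value: identify the controller

// Fill in an IDENTIFY command that asks the controller to write a 4K
// data structure to the page at physical address buf_phys.
static void nvme_build_identify(nvme_cmd_t* cmd, uint16_t cmd_id, uint32_t nsid,
                                uint32_t cns, uint64_t buf_phys) {
    memset(cmd, 0, sizeof(*cmd));
    cmd->cdw0 = NVME_ADMIN_OP_IDENTIFY | ((uint32_t)cmd_id << 16);
    cmd->nsid = nsid;
    cmd->prp1 = buf_phys;
    cmd->cdw10 = cns;
}
```

A real driver would then copy the entry into the next Admin Submission Queue slot and write the new tail index to the queue's doorbell register.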

Start making it a little more real

Factor Admin Queue processing out into dedicated functions, provide a convenience function for transactions, wire up interrupts so we don’t have to spin on the completion status. Decode some of the information from the IDENTIFY command and display it. Issue an IDENTIFY NAMESPACE command as well. Actually publish a device instead of just failing.
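Decoding the IDENTIFY CONTROLLER data is mostly a matter of pulling fixed-offset fields out of a 4K buffer. A minimal sketch, with made-up names and only a handful of fields (the offsets are the ones given in the NVME spec; the strings are space-padded ASCII without a terminator):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// A few decoded IDENTIFY CONTROLLER fields (hypothetical struct for illustration).
typedef struct {
    char serial[21];              // bytes 4..23
    char model[41];               // bytes 24..63
    char firmware[9];             // bytes 64..71
    uint64_t max_transfer_bytes;  // derived from MDTS (byte 77), 0 = no limit reported
} nvme_id_info_t;

// Copy a fixed-width, space-padded field and trim the trailing spaces.
// dst must have room for len + 1 bytes.
static void nvme_copy_str(char* dst, const uint8_t* src, size_t len) {
    memcpy(dst, src, len);
    dst[len] = 0;
    while (len > 0 && dst[len - 1] == ' ') {
        dst[--len] = 0;
    }
}

// id points at the 4096 byte IDENTIFY CONTROLLER result;
// min_page_size is 2^(12 + CAP.MPSMIN).
static void nvme_decode_identify(const uint8_t* id, uint32_t min_page_size,
                                 nvme_id_info_t* out) {
    nvme_copy_str(out->serial, id + 4, 20);
    nvme_copy_str(out->model, id + 24, 40);
    nvme_copy_str(out->firmware, id + 64, 8);
    // MDTS is a power-of-two multiple of the minimum page size; 0 means unlimited.
    uint8_t mdts = id[77];
    out->max_transfer_bytes = mdts ? ((uint64_t)min_page_size << mdts) : 0;
}
```

The maximum transfer size matters later: it bounds how much data a single IO command may move.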

Time to stop polling and actually use interrupts

Set up an IO submission and completion queue as well (in preparation for doing actual disk IO) and fiddle with IRQ setup a bit while trying to figure out why IRQs work on HW but not in Qemu.
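IO queues are created with a pair of admin commands, completion queue first. The sketch below shows how those commands might be filled in. The names are hypothetical; the opcodes and CDW10/CDW11 layouts follow the NVME spec, and the queue memory is assumed to be physically contiguous.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Same illustrative 64 byte submission entry layout as an IDENTIFY would use.
typedef struct {
    uint32_t cdw0, nsid, cdw2, cdw3;
    uint64_t mptr, prp1, prp2;
    uint32_t cdw10, cdw11, cdw12, cdw13, cdw14, cdw15;
} nvme_cmd_t;

static_assert(sizeof(nvme_cmd_t) == 64, "submission entries are 64 bytes");

#define NVME_ADMIN_OP_CREATE_IOSQ 0x01
#define NVME_ADMIN_OP_CREATE_IOCQ 0x05
#define NVME_Q_PHYS_CONTIG (1u << 0)  // CDW11 PC bit: queue memory is contiguous
#define NVME_CQ_IRQ_ENABLE (1u << 1)  // CDW11 IEN bit (completion queues only)

// Create IO Completion Queue: qid and entry count go in CDW10 (size is zero-based),
// interrupt vector and flags in CDW11, queue base address in PRP1.
static void nvme_build_create_io_cq(nvme_cmd_t* cmd, uint16_t cmd_id, uint16_t qid,
                                    uint16_t entries, uint16_t irq_vector,
                                    uint64_t queue_phys) {
    memset(cmd, 0, sizeof(*cmd));
    cmd->cdw0 = NVME_ADMIN_OP_CREATE_IOCQ | ((uint32_t)cmd_id << 16);
    cmd->prp1 = queue_phys;
    cmd->cdw10 = ((uint32_t)(entries - 1) << 16) | qid;
    cmd->cdw11 = ((uint32_t)irq_vector << 16) | NVME_CQ_IRQ_ENABLE | NVME_Q_PHYS_CONTIG;
}

// Create IO Submission Queue: same CDW10 layout, CDW11 names the completion
// queue that will receive this queue's completions.
static void nvme_build_create_io_sq(nvme_cmd_t* cmd, uint16_t cmd_id, uint16_t qid,
                                    uint16_t entries, uint16_t cq_id,
                                    uint64_t queue_phys) {
    memset(cmd, 0, sizeof(*cmd));
    cmd->cdw0 = NVME_ADMIN_OP_CREATE_IOSQ | ((uint32_t)cmd_id << 16);
    cmd->prp1 = queue_phys;
    cmd->cdw10 = ((uint32_t)(entries - 1) << 16) | qid;
    cmd->cdw11 = ((uint32_t)cq_id << 16) | NVME_Q_PHYS_CONTIG;
}
```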

Now things get a bit more complicated

Some #if 0'd code down in nvme_init() where I experimented with IO READ ops to verify that I understood how the command structure and PRP list worked. Add a QEMU_IRQ_HACK to use polling instead of IRQs so I could test with Qemu as well. Start sketching out IO operation processing, with the concept of breaking iotxns down into utxns that are 1:1 with NVME IO operations. Introduce #defines for a bunch of magic numbers, some more comments, and an IO processing thread. Wire up the device ops nvme_get_size(), nvme_ioctl(), and nvme_queue_iotxn(), which will be needed for this to act as a real block device.
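The two ideas in play here, splitting a large request into hardware-sized pieces and describing each piece's pages with PRP entries, can be sketched as follows. This is an illustration, not the driver's code: utxn_t, split_iotxn() and nvme_set_prps() are invented names, transfers are assumed to be page-aligned, and the PRP list is assumed to fit in a single page.

```c
#include <stddef.h>
#include <stdint.h>

// One hardware-sized piece of a larger request (hypothetical struct).
typedef struct {
    uint64_t lba;     // starting block
    uint32_t blocks;  // number of blocks in this piece
} utxn_t;

// Split [lba, lba + blocks) into pieces of at most max_blocks each.
// Returns the number of pieces written to out, or 0 if the request is
// empty, max_blocks is 0, or out is too small to hold every piece.
static size_t split_iotxn(uint64_t lba, uint64_t blocks, uint32_t max_blocks,
                          utxn_t* out, size_t out_max) {
    if (max_blocks == 0) {
        return 0;
    }
    size_t n = 0;
    while (blocks > 0) {
        if (n == out_max) {
            return 0;
        }
        uint32_t chunk = (blocks > max_blocks) ? max_blocks : (uint32_t)blocks;
        out[n].lba = lba;
        out[n].blocks = chunk;
        n++;
        lba += chunk;
        blocks -= chunk;
    }
    return n;
}

// Fill in PRP1/PRP2 for a page-aligned transfer covering count pages.
// pages[] holds the physical address of each page. For more than two pages,
// entries 1..count-1 go into list (a driver-owned page whose physical address
// is list_phys) and PRP2 points at that list. Assumes count >= 1 and that
// count - 1 entries fit in one list page (no list chaining).
static void nvme_set_prps(uint64_t* prp1, uint64_t* prp2, const uint64_t* pages,
                          size_t count, uint64_t* list, uint64_t list_phys) {
    *prp1 = pages[0];
    if (count == 1) {
        *prp2 = 0;
    } else if (count == 2) {
        *prp2 = pages[1];
    } else {
        for (size_t i = 1; i < count; i++) {
            list[i - 1] = pages[i];
        }
        *prp2 = list_phys;
    }
}
```

In this sketch max_blocks would be derived from the controller's maximum transfer size and the namespace's block size, so that every piece fits in exactly one IO command.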

Make it actually work

Until now, anything trying to open the driver or interact with it would fail or hang. It was a bunch of code that poked at the hardware when loaded but didn’t do anything beyond that.

Build out the IO processing with io_process_cpls() to handle completion messages from the HW and io_process_txns() to handle subdividing iotxns into utxns and issuing IO commands to the hardware. Not done yet, and not code reviewed, but the iochk multithreaded disk io exerciser runs against devices published by this driver without failing or causing the driver to crash, so yay!
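The heart of completion handling is the phase bit: the controller flips it each time it wraps around the completion ring, so the driver can tell new entries from stale ones without a shared producer index. A minimal sketch of that loop follows. The names are hypothetical and it is not the actual io_process_cpls(); a little-endian host is assumed, and the ring is assumed to start out zeroed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// 16 byte completion queue entry (illustrative layout, little-endian host assumed).
typedef struct {
    uint32_t dw0;      // command specific result
    uint32_t dw1;      // reserved
    uint16_t sq_head;  // how far the controller has consumed the submission queue
    uint16_t sq_id;
    uint16_t cmd_id;   // matches the id placed in the command's CDW0
    uint16_t status;   // bit 0 is the phase tag, bits 15:1 the status field
} nvme_cpl_t;

static_assert(sizeof(nvme_cpl_t) == 16, "completion entries are 16 bytes");

typedef struct {
    nvme_cpl_t* ring;  // zero-initialized completion ring
    uint16_t size;     // number of entries in the ring
    uint16_t head;     // next entry the driver will look at
    uint8_t phase;     // phase value that marks an entry as new, starts at 1
} nvme_cq_t;

// If a new completion is available, report its command id and status,
// advance the head, and return true. Otherwise return false.
static bool nvme_cq_next(nvme_cq_t* cq, uint16_t* cmd_id, uint16_t* status) {
    const volatile nvme_cpl_t* cpl = &cq->ring[cq->head];
    uint16_t st = cpl->status;
    if ((st & 1) != cq->phase) {
        return false;  // entry still carries the phase from the previous pass
    }
    *cmd_id = cpl->cmd_id;
    *status = st >> 1;
    if (++cq->head == cq->size) {
        cq->head = 0;
        cq->phase ^= 1;  // the controller flips its phase on every wrap
    }
    return true;
}
```

A real driver would call this in a loop until it returns false and then write the new head index to the completion queue's doorbell so the controller can reuse the slots.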

Some clean up

Fix a bug where the io thread would spin instead of wait when there was no io pending. Add some simple stat counters (which helped detect this bug).

The specifications do not necessarily reflect the truth on the ground

Especially in the case where the peripheral is complex and has a bunch of optional features, tunables, etc., it’s worth exploring what actual hardware is capable of before depending on a feature nobody supports. For example, NVME allows the queues for submitting commands to be physically discontiguous, but no hardware I’ve seen so far supports that. Similarly, it supports a (required) simple scatter/gather page-list (PRP) and an (optional) fancier scatter/gather format that’s much more flexible (SGL). Turns out no hardware I’ve seen supports SGLs either.
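Both of these are things the controller reports about itself, so a driver can check rather than assume. A rough sketch, with made-up function names: contiguous queues are flagged by CAP.CQR (bit 16), and SGL support shows up in the SGLS field of the IDENTIFY CONTROLLER data (bytes 536..539). The exact meaning of the low SGLS bits has shifted between spec revisions, so treating "either of the low two bits set" as "supported" is an assumption here.

```c
#include <stdbool.h>
#include <stdint.h>

// CAP.CQR (bit 16): when set, IO queues must be physically contiguous.
static bool nvme_requires_contiguous_queues(uint64_t cap) {
    return ((cap >> 16) & 1) != 0;
}

// SGLS lives at bytes 536..539 of the IDENTIFY CONTROLLER data (little-endian).
// Assumption: any of the low two bits set means SGLs are usable for IO.
static bool nvme_supports_sgl(const uint8_t* id) {
    uint32_t sgls = (uint32_t)id[536] |
                    ((uint32_t)id[537] << 8) |
                    ((uint32_t)id[538] << 16) |
                    ((uint32_t)id[539] << 24);
    return (sgls & 0x3) != 0;
}
```

Logging both values for every controller you can get your hands on is an easy way to build up the kind of picture described here.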

I’ve been collecting the parameters that different NVME controllers report. They’re useful to see, because if you just go by what the spec says is possible you get a very different picture of what hardware might be like…

NVME device features

If you want to learn more…

The NVME specs themselves live over here: