Okay, this thing didn’t get a whole lot of use after those NVME posts.
won’t unexpectedly consume 100% cpu in peoples’ browsers (sorry, people!).
It’s a year later and I still blame Joe.
Over the weekend I kept tinkering with this NVME driver,
because sometimes I am not good at work/life balance.
Tidying up and an optimization
Removed the no-longer-used
nvme_io_txn() and dependencies on the hexdump utility library,
leftover from early tests. Adjusted completion queue processing to ring the doorbell
that tells the hardware that the queue head has advanced only once, after draining the queue (thank you,
Doug Gale, for pointing out this inefficiency in a comment on the CL).
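To make the doorbell change concrete, here is a minimal sketch of draining a completion ring and writing the doorbell once at the end. It is not the driver's actual code: cpl_t, cq_t and cq_drain() are made-up names, the doorbell is modeled as a plain pointer, and the memory barriers a real driver needs are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical 16-byte completion entry, following the NVMe layout.
typedef struct {
    uint32_t cmd_specific;
    uint32_t reserved;
    uint16_t sq_head;
    uint16_t sq_id;
    uint16_t cmd_id;
    uint16_t status;            // bit 0 is the phase tag
} cpl_t;

// Hypothetical host-side view of one completion queue.
typedef struct {
    cpl_t* entries;
    uint16_t size;              // number of entries in the ring
    uint16_t head;              // next entry the host will consume
    uint16_t phase;             // phase tag value that marks a new entry
    volatile uint32_t* head_doorbell;
} cq_t;

// Consume every pending completion, then tell the hardware once.
// Returns the number of entries consumed.
static unsigned cq_drain(cq_t* cq, void (*handle)(const cpl_t*)) {
    assert(cq->size >= 2);
    unsigned n = 0;
    while ((cq->entries[cq->head].status & 1u) == cq->phase) {
        if (handle != NULL) {
            handle(&cq->entries[cq->head]);
        }
        if (++cq->head == cq->size) {
            cq->head = 0;
            cq->phase ^= 1u;    // expected phase flips on every wrap
        }
        n++;
    }
    if (n > 0) {
        *cq->head_doorbell = cq->head;  // one doorbell write per drain
    }
    return n;
}
```

The only point of the sketch is where the doorbell write sits: outside the loop, and skipped entirely when nothing was consumed.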
Display more information, tidy the code that does it
NVME controllers are very flexible, and the driver dumps a bunch of information about how whatever
controller it is talking to is configured. Now it dumps even more. I also tidied up the code
that does it, made the formatting a bit more consistent, and prepared to stash
some of that information in the driver’s
nvme_device_t structure so we can reference it
from code that may want to take advantage of optional features in the future:
Here’s some typical output from the driver, for a Corsair M.2 NVME SSD:
nvme: version 1.2.0
nvme: page size: (MPSMIN): 4096 (MPSMAX): 4096
nvme: doorbell stride: 4
nvme: timeout: 65536 ms
nvme: boot partition support (BPS): N
nvme: supports NVM command set (CSS:NVM): Y
nvme: subsystem reset supported (NSSRS): N
nvme: weighted-round-robin (AMS:WRR): Y
nvme: vendor-specific arbitration (AMS:VS): N
nvme: contiquous queues required (CQR): Y
nvme: maximum queue entries supported (MQES): 65536
nvme: model: 'Force MP500'
nvme: serial number: '17457994000122410152'
nvme: firmware: 'E7FM04.5'
nvme: max outstanding commands: 0
nvme: max namespaces: 1
nvme: scatter gather lists (SGL): N 00000000
nvme: max data transfer: 2097152 bytes
nvme: sanitize caps: 0
nvme: abort command limit (ACL): 4
nvme: asynch event req limit (AERL): 4
nvme: firmware: slots: 1 reset: Y slot1ro: N
nvme: host buffer: min/preferred: 0/0 pages
nvme: capacity: total/unalloc: 0/0
nvme: volatile write cache (VWC): Y
nvme: atomic write unit (AWUN)/(AWUPF): 256/1 blks
nvme: feature: FIRMWARE_DOWNLOAD_COMMIT
nvme: feature: FORMAT_NVM
nvme: feature: SECURITY_SEND_RECV
nvme: feature: SAVE_SELECT_NONZERO
nvme: feature: WRITE_UNCORRECTABLE
nvme: ns: atomic write unit (AWUN)/(AWUPF): 256/1 blks
nvme: ns: NABSN/NABO/NABSPF/NOIOB: 255/0/0/0
nvme: ns: LBA FMT 00: RP=1 LBADS=2^9b MS=0b
nvme: ns: LBA FMT 01: RP=0 LBADS=2^12b MS=0b
nvme: ns: LBA FMT #0 active
nvme: ns: data protection: caps/set: 0x00/0
nvme: ns: size/cap/util: 234441648/234441648/234441648 blks
Later this will be much reduced and only displayed if verbose debug chatter is requested.
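As a reading aid for the "ns:" lines, here is a small sketch of how an LBA format entry from the IDENTIFY NAMESPACE data can be unpacked and combined with the block count to get a size in bytes. The bit positions follow the NVMe spec's LBA format layout. lba_format_t, decode_lbaf() and namespace_bytes() are made-up names, not functions from the driver.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical decoded form of one 32-bit LBA format entry.
typedef struct {
    uint16_t metadata_bytes;    // MS: metadata bytes per block
    uint8_t lba_shift;          // LBADS: block size is 2^LBADS bytes
    uint8_t rel_perf;           // RP: relative performance, 0 is best
} lba_format_t;

static lba_format_t decode_lbaf(uint32_t raw) {
    lba_format_t f;
    f.metadata_bytes = (uint16_t)(raw & 0xFFFFu);
    f.lba_shift = (uint8_t)((raw >> 16) & 0xFFu);
    f.rel_perf = (uint8_t)((raw >> 24) & 0x3u);
    return f;
}

// Convert a block count into bytes for the given format.
static uint64_t namespace_bytes(uint64_t blocks, lba_format_t f) {
    assert(f.lba_shift >= 9 && f.lba_shift < 64);
    return blocks << f.lba_shift;
}
```

With the active format above (LBADS=2^9) and 234441648 blocks, this works out to 120034123776 bytes, which is the 120 GB you would expect.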
Some cleanup, make Plextor devices work
Added code to cancel in-flight transactions when the driver shuts down in
Added some (disabled) code to do a SHUTDOWN operation before RESET – a Plextor SSD was erroring
out when configuring the IO completion queue and my initial theory was maybe it didn’t like
the abrupt reset. Turned out I was erroneously setting the namespace ID in the CREATE QUEUE
command. Neither Qemu nor six other NVME controllers cared about this, but the Plextor
controller is more particular about spec adherence here!
So this change also tidies up the various setup commands and leaves the namespace ID as 0 for the
several commands where it should be zero.
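For illustration, here is roughly what building a Create I/O Completion Queue admin command with the namespace ID left at zero could look like. The opcode and dword layouts are from the NVMe spec. The nvme_cmd_t struct, the helper name and its parameters are assumptions made for this sketch and do not come from the driver.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Hypothetical 64-byte submission queue entry.
typedef struct {
    uint32_t cdw0;              // opcode in bits 7:0, command id in bits 31:16
    uint32_t nsid;              // namespace id, zero for queue management
    uint32_t cdw2;
    uint32_t cdw3;
    uint64_t mptr;
    uint64_t prp1;
    uint64_t prp2;
    uint32_t cdw10;
    uint32_t cdw11;
    uint32_t cdw12;
    uint32_t cdw13;
    uint32_t cdw14;
    uint32_t cdw15;
} nvme_cmd_t;

#define ADMIN_OP_CREATE_IOCQ 0x05u

// Fill in a Create I/O Completion Queue command for a physically
// contiguous ring of `entries` slots at physical address `ring_paddr`.
static void make_create_iocq(nvme_cmd_t* cmd, uint16_t cid, uint16_t qid,
                             uint16_t entries, uint64_t ring_paddr,
                             uint16_t irq_vector) {
    assert(qid != 0 && entries >= 2);
    memset(cmd, 0, sizeof(*cmd));   // this also leaves nsid at 0
    cmd->cdw0 = ADMIN_OP_CREATE_IOCQ | ((uint32_t)cid << 16);
    cmd->prp1 = ring_paddr;
    // CDW10: queue id in bits 15:0, zero-based queue size in bits 31:16.
    cmd->cdw10 = (uint32_t)qid | ((uint32_t)(entries - 1) << 16);
    // CDW11: physically contiguous (bit 0), interrupts enabled (bit 1),
    // interrupt vector in bits 31:16.
    cmd->cdw11 = 0x1u | 0x2u | ((uint32_t)irq_vector << 16);
}
```

Zeroing the whole entry first and then setting only the fields the command defines is one way to avoid leaving a stale namespace ID in a reused command slot.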
I still haven’t sorted out why the legacy PCI IRQs are not working properly with Qemu. I put
together a patch for Qemu to support PCI MSI interrupts which works around that problem from
the other side, removed the polling hack I had in the driver for use on Qemu, and filed a bug
against the owner of our PCI subsystem so he can investigate the legacy IRQ interaction when
The patch to Qemu (which we should tidy up and send upstream as we’ve done with other Qemu patches)
is currently applied to our local Qemu tree over here:
This week I wrote a minimal NVME storage driver for
Zircon. As usual, I used
Gerrit as a place to back up my work-in-progress from
time to time. The end result is a (possibly interesting) window into how I go from
zero to a functional driver. The first section presents the very first shell of a driver
I checked in. Each following section shows the diffs from the version preceding
it to that version of the driver. These focus on
nvme.c, the guts of the driver, though
nvme-hw.h (the header file with registers and structures and such) and
(the Zircon build system file) are also present.
Where to start?
I like to read the documentation with an editor open to a header file where I type
up constants and structures and macros and such for register access, data structures,
and what have you. It helps me wrap my head around the hardware.
A minimal shell of a driver that simply dumps parameters from the device and resets it.
Useful to start getting some data from real hardware (since NVME has a lot of
controller-specific parameters to look at):
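Many of those controller-specific parameters come from the 64-bit CAP register. Below is a sketch of decoding a few of its fields, with bit positions taken from the NVMe spec. The names (bits(), cap_info_t, decode_cap()) are invented for this example, and it follows the spec's field definitions rather than whatever the driver does internally.

```c
#include <assert.h>
#include <stdint.h>

// Extract `width` bits starting at bit `shift` of a 64-bit register value.
static uint64_t bits(uint64_t v, unsigned shift, unsigned width) {
    assert(width > 0 && width < 64 && shift + width <= 64);
    return (v >> shift) & ((1ULL << width) - 1);
}

// Hypothetical decoded view of a few CAP fields.
typedef struct {
    uint32_t max_queue_entries;     // MQES is zero-based, so add one
    int contiguous_queues_req;      // CQR
    uint32_t timeout_ms;            // TO is in units of 500 ms
    uint32_t doorbell_stride;       // bytes between doorbell registers
    uint32_t page_size_min;         // 2^(12 + MPSMIN) bytes
    uint32_t page_size_max;         // 2^(12 + MPSMAX) bytes
} cap_info_t;

static cap_info_t decode_cap(uint64_t cap) {
    cap_info_t ci;
    ci.max_queue_entries = (uint32_t)bits(cap, 0, 16) + 1;
    ci.contiguous_queues_req = (int)bits(cap, 16, 1);
    ci.timeout_ms = (uint32_t)bits(cap, 24, 8) * 500;
    ci.doorbell_stride = 4u << bits(cap, 32, 4);
    ci.page_size_min = 1u << (12 + bits(cap, 48, 4));
    ci.page_size_max = 1u << (12 + bits(cap, 52, 4));
    return ci;
}
```

Keeping the shift/width pairs next to the field names like this is the kind of thing that ends up in the header file while reading the documentation.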
Make it do something, anything
First interaction with the hardware! Submit an IDENTIFY command to the Admin Submission
Queue and observe a reply from the Admin Completion Queue. Hexdump the results for inspection:
Start making it a little more real
Factor Admin Queue processing out into dedicated functions, provide a convenience function
for transactions, wire up interrupts so we don’t have to spin on the completion status.
Decode some of the information from the IDENTIFY command and display it. Issue an IDENTIFY
NAMESPACE command as well. Actually publish a device instead of just failing.
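One small piece of that decoding is pulling the strings out of the IDENTIFY CONTROLLER data. The spec stores the serial number, model and firmware revision as space-padded ASCII with no terminator (bytes 4-23, 24-63 and 64-71). A sketch of turning them into C strings follows, with made-up helper names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Copy a space-padded, unterminated ASCII field into a C string,
// dropping trailing spaces and NULs.
static void copy_id_string(char* dst, size_t dst_len,
                           const uint8_t* src, size_t src_len) {
    assert(dst_len > src_len);
    while (src_len > 0 &&
           (src[src_len - 1] == ' ' || src[src_len - 1] == 0)) {
        src_len--;
    }
    memcpy(dst, src, src_len);
    dst[src_len] = 0;
}

// Hypothetical holder for the three identification strings.
typedef struct {
    char serial[21];
    char model[41];
    char firmware[9];
} ctrl_strings_t;

// `id` points at the 4096-byte IDENTIFY CONTROLLER response.
static void decode_ctrl_strings(const uint8_t* id, ctrl_strings_t* out) {
    copy_id_string(out->serial, sizeof(out->serial), id + 4, 20);
    copy_id_string(out->model, sizeof(out->model), id + 24, 40);
    copy_id_string(out->firmware, sizeof(out->firmware), id + 64, 8);
}
```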
Time to stop polling and actually use interrupts
Set up an IO submission and completion queue as well (in preparation for doing actual disk IO)
and fiddle with IRQ setup a bit while trying to figure out why IRQs work on HW but not in Qemu.
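Each queue pair gets its own pair of doorbell registers, and their offsets depend on the doorbell stride the controller reports. A sketch of the offset calculation as the NVMe spec defines it (the function names here are invented, not the driver's):

```c
#include <assert.h>
#include <stdint.h>

#define NVME_DOORBELL_BASE 0x1000u

// Register offset of the tail doorbell for submission queue `qid`.
// `dstrd` is the raw CAP.DSTRD field; the stride is (4 << dstrd) bytes.
static uint64_t sq_tail_doorbell(uint16_t qid, uint8_t dstrd) {
    assert(dstrd < 16);
    return NVME_DOORBELL_BASE + (2ull * qid) * (4ull << dstrd);
}

// Register offset of the head doorbell for completion queue `qid`.
static uint64_t cq_head_doorbell(uint16_t qid, uint8_t dstrd) {
    assert(dstrd < 16);
    return NVME_DOORBELL_BASE + (2ull * qid + 1) * (4ull << dstrd);
}
```

Queue 0 is the admin queue pair, so the first IO queue pair (qid 1) lands at 0x1008 and 0x100C when the stride is 4 bytes.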
Now things get a bit more complicated
#if 0'd code down in
nvme_init() where I experimented with IO READ ops to verify that I
understood how the command structure and prp list worked. Added a QEMU_IRQ_HACK to use
polling instead of IRQs so I could test with Qemu as well. Start sketching out IO operation
processing, with the concept of breaking iotxns down into utxns that are 1:1 with nvme io
operations. Introduce #defines for a bunch of magic numbers, some more comments, and an IO
processing thread. Wire up the device ops
which will be needed for this to act as a real block device.
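The iotxn-to-utxn idea can be sketched as chopping one large request into pieces that each fit in a single NVME IO command. This is only an illustration of the concept: utxn_t here is a simplified stand-in (the driver's real structures carry much more state), and split_txn() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

// Simplified stand-in for one hardware-sized IO operation.
typedef struct {
    uint64_t lba;       // starting block
    uint32_t blocks;    // number of blocks in this piece
} utxn_t;

// Split a request of `blocks` blocks starting at `lba` into pieces of at
// most `max_blocks` each. Writes up to `max_out` pieces into `out` and
// returns how many were written.
static unsigned split_txn(uint64_t lba, uint64_t blocks, uint32_t max_blocks,
                          utxn_t* out, unsigned max_out) {
    assert(max_blocks > 0);
    unsigned n = 0;
    while (blocks > 0 && n < max_out) {
        uint32_t chunk = blocks > max_blocks ? max_blocks : (uint32_t)blocks;
        out[n].lba = lba;
        out[n].blocks = chunk;
        lba += chunk;
        blocks -= chunk;
        n++;
    }
    return n;
}
```

For a controller that reports a 2097152-byte maximum transfer and uses 512-byte blocks, max_blocks would be 4096, so a 10000-block request becomes three operations.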
Make it actually work
Until now, anything trying to open the driver or interact with it would fail or hang. It was
a bunch of code that poked at the hardware when loaded but didn’t do anything beyond that.
Build out the IO processing with
io_process_cpls() to handle completion messages from the HW and
io_process_txns() to handle subdividing iotxns into utxns and issuing IO commands to the
hardware. Not done yet, and not code reviewed, but the iochk multithreaded disk io exerciser
runs against devices published by this driver without failing or causing the driver to crash,
Some clean up
Fix a bug where the io thread would spin instead of waiting when there was no io pending. Add
some simple stat counters (which helped detect this bug).
The specifications do not necessarily reflect the truth on the ground
Especially in the case where the peripheral is complex and has a bunch of optional features,
tunables, etc, it’s worth exploring what actual hardware is capable of before depending on
a feature nobody supports. For example, NVME allows the queues for submitting commands to
be physically discontiguous, but no hardware I’ve seen so far supports that. Similarly it
supports a (required) simple scatter/gather page-list (PRP) and an (optional) fancier
scatter/gather format that’s much more flexible (SGL). Turns out no hardware I’ve seen
supports SGLs either.
I’ve been collecting various parameters that different NVME controllers report, which are
useful to see since if you just go by what the spec says is possible you get a very different
picture of what hardware might be like…
If you want to learn more…
The NVME specs themselves live over here: http://nvmexpress.org/resources/specifications/
Well, first post, I guess!
I’m giving Hugo a try as a way to host my occasional ramblings.
It’s not perfect, but the basic static site generator thing makes sense to me and is pretty
straightforward. I edit markdown locally, preview with its built-in web server, and then
push the generated static content to my machine in the cloud.
which I’m not thrilled about, but will sort out how to fix that later.
Why this? Why now? I blame my friend Joe:
<nebkor> this is totally selfish, and I understand, but I wish you had a regular blog instead of g+
<swetland> me too!
<swetland> g+ is terrible for this
<swetland> but everything else is terrible too. and then I go down the rathole of writing a CMS
<nebkor> I mean
<geist> geocities man