path: root/sys/dev/nvme
Commit message, Author, Age, Files, Lines
* nvme(4): Add MSI and single MSI-X support.Alexander Motin2021-09-075-48/+73
| | | | | | | | | | | | | | If we can't allocate more MSI-X vectors, accept using single shared. If we can't allocate any MSI-X, try to allocate 2 MSI vectors, but accept single shared. If still no luck, fall back to shared INTx. This provides maximal flexibility in some limited scenarios. For example, vmd(4) does not support INTx and can handle only limited number of MSI/MSI-X vectors without sharing. MFC after: 1 week (cherry picked from commit e3bdf3da769a55f0944d9c337bb4d91b6435f02c)
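The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the driver's code: `enum intr_mode`, `struct intr_choice`, the `alloc_fn` callback, and both example allocators are hypothetical names invented for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interrupt kinds; illustrative names, not the driver's. */
enum intr_mode { INTR_MSIX, INTR_MSI, INTR_INTX };

struct intr_choice {
	enum intr_mode mode;
	int vectors;	/* 1 means a single shared vector */
};

/*
 * Hypothetical allocator callback: asked for 'want' vectors of a kind,
 * returns the number granted (0 on failure).
 */
typedef int (*alloc_fn)(enum intr_mode mode, int want);

static struct intr_choice
choose_interrupts(alloc_fn alloc, int want_msix)
{
	struct intr_choice c = { INTR_INTX, 1 };
	int got;

	assert(alloc != NULL && want_msix >= 1);

	/* 1. Full MSI-X allocation, else a single shared MSI-X vector. */
	if ((got = alloc(INTR_MSIX, want_msix)) == want_msix ||
	    (got = alloc(INTR_MSIX, 1)) == 1) {
		c.mode = INTR_MSIX;
		c.vectors = got;
		return (c);
	}
	/* 2. Two MSI vectors, else a single shared MSI vector. */
	if ((got = alloc(INTR_MSI, 2)) == 2 ||
	    (got = alloc(INTR_MSI, 1)) == 1) {
		c.mode = INTR_MSI;
		c.vectors = got;
		return (c);
	}
	/* 3. Still no luck: shared INTx. */
	return (c);
}

/* Example policy: no MSI-X at all and only one MSI vector to spare. */
static int
alloc_single_msi_only(enum intr_mode mode, int want)
{
	return (mode == INTR_MSI && want == 1 ? 1 : 0);
}

/* Example policy: every MSI-X and MSI request fails. */
static int
alloc_nothing(enum intr_mode mode, int want)
{
	(void)mode;
	(void)want;
	return (0);
}
```

With `alloc_single_msi_only` the sketch settles on one shared MSI vector; with `alloc_nothing` it ends at INTx.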
* nvme(4): Do not panic on admin queue construct error.Alexander Motin2021-09-071-0/+3
| | | | | | MFC after: 1 week (cherry picked from commit 31111372e6bad7212dbee36dd312e3b53fdfd3f6)
* nvme: coherently read status of completion recordsWarner Losh2021-07-311-4/+21
| | | | | | | | | | | | | | | | | | | | | | | | Coherently read the phase bit of the status completion record. We loop over the completion record array, looking for all the transactions in the same phase that have been completed. In doing that, we have to be careful to read the status field first, and if it indicates a complete record, we need to read and process that record. Otherwise, the host might be overtaken by the device when reading this completion record, leading to a mistaken belief that the record is in phase. This leads to the code using old values and looking at an already completed entry, which has no current tracker. To work around this problem, we read the status and make sure it is in phase, then re-read the entire completion record, guaranteeing it's complete, valid, and consistent. In addition, we resync the dmatag to reflect changes since the prior loop for the bouncing dma case. Reviewed by: jrtc27@, chuck@ Found by: jrtc27 (this fix is based in part on her D30995 fix) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D31002 (cherry picked from commit aa0ab681ae755e01cd69435fab50f6852f248c42)
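The read ordering described above might look like the sketch below. It assumes a simplified 16-byte completion entry whose phase tag is bit 0 of the 16-bit status word; `struct cpl`, `CPL_PHASE`, and `poll_completion` are hypothetical names, and a C11 acquire fence stands in for whatever barrier the driver actually uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified completion entry (layout assumed for illustration). */
struct cpl {
	uint32_t cdw0;
	uint32_t rsvd;
	uint16_t sqhd;
	uint16_t sqid;
	uint16_t cid;
	uint16_t status;	/* bit 0 is the phase tag */
};

#define CPL_PHASE(status)	((status) & 1u)

/*
 * Returns true and fills *out only if the slot is in the expected phase.
 * The status word is read first; the full record is copied afterwards,
 * so the copy cannot be older than the status that qualified it.
 */
static bool
poll_completion(const volatile struct cpl *slot, unsigned phase,
    struct cpl *out)
{
	uint16_t status;

	assert(phase <= 1);

	status = slot->status;
	if (CPL_PHASE(status) != phase)
		return (false);

	/* Keep the record copy from being ordered before the status read. */
	atomic_thread_fence(memory_order_acquire);
	*out = *slot;
	return (true);
}
```

A caller would invoke this per slot while walking the completion ring, flipping `phase` each time the ring wraps.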
* nvme: Fix alignment on nvme structuresWarner Losh2021-07-311-5/+5
| | | | | | | | | | | | | | | Remove __packed from nvme_command, nvme_completion and nvme_dsm_trim. Add super-alignment to nvme_completion since it's always at least that aligned in hardware (and in our existing uses of it embedded in structures). It generates better code in nvme_qpair_process_completions on riscv64 because otherwise the ABI assumes a 4-byte alignment, and the same on all other platforms. Reviewed by: jrtc27@, mav@, chuck@ Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D31001 (cherry picked from commit fea3cf1d6da0acf40bc1d3dadeeea7eeccbc10dd)
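The effect of dropping `__packed` and adding alignment can be checked with a small compile-time sketch. The field layout is a simplified assumption, the GCC/Clang attributes stand in for FreeBSD's `__packed` and `__aligned` macros, and the value 8 is an assumed alignment chosen for illustration, not taken from the commit text.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Simplified 16-byte completion layout shared by all three variants. */
#define CPL_FIELDS							\
	uint32_t cdw0; uint32_t rsvd;					\
	uint16_t sqhd; uint16_t sqid; uint16_t cid; uint16_t status

/* Packed: the compiler must assume byte alignment for every access. */
struct cpl_packed { CPL_FIELDS; } __attribute__((packed));

/* Natural: alignment follows the widest member, here 4 bytes. */
struct cpl_natural { CPL_FIELDS; };

/* Super-aligned: the compiler may use wider loads (8 is an assumption). */
struct cpl_aligned { CPL_FIELDS; } __attribute__((aligned(8)));

static_assert(sizeof(struct cpl_packed) == 16, "size unchanged");
static_assert(sizeof(struct cpl_natural) == 16, "size unchanged");
static_assert(sizeof(struct cpl_aligned) == 16, "size unchanged");
static_assert(alignof(struct cpl_packed) == 1, "packed is byte aligned");
static_assert(alignof(struct cpl_natural) == 4, "natural is 4-byte aligned");
static_assert(alignof(struct cpl_aligned) == 8, "explicitly over-aligned");
```

All three variants have the same size, so only the alignment the compiler may assume changes.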
* nvme: style nitWarner Losh2021-07-311-14/+7
| | | | | | | | | Put the { on the same line as the struct nvme_foo when we define these structures. It's FreeBSD standard and these were inconsistent. Sponsored by: Netflix (cherry picked from commit 80a75155e1601bddc2c595c06ab6ea916c603071)
* nvme: fix a race between failing the controller and failing requestsWarner Losh2021-07-311-1/+12
| | | | | | | | | | | | | | | | | | | | | | | | | Part of the nvme recovery process for errors is to reset the card. Sometimes, this results in failing the entire controller. When nda is in use, we free the sim, which will sleep until all the I/O has completed. However, with only one thread, the request fail task never runs once the reset thread sleeps here. Create two threads to allow I/O to fail until it's all processed and the reset task can proceed. This is a temporary kludge until I can work out questions that arose during the review, not least of which is what race queueing to a failure task solved. The original commit is vague and other error paths in the same context do a direct failure. I'll investigate that more completely before committing a change to a direct failure. mav@ raised this issue during the review, but didn't otherwise object. Multiple threads, though, solve the problem in the meantime until other such means can be perfected. Reviewed by: jhb@ Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D30366 (cherry picked from commit f0f47121653e88197d8537572294b90f5aef7f17)
* nvme: use config_intrhook_drain to avoid removable card racesWarner Losh2021-07-312-6/+1
| | | | | | | | | | | | | | | | | | | | nvme drives are configured early in boot. However, a number of the configuration steps take a while, so we defer those to a config intrhook that runs before the root filesystem is mounted. At the same time, the PCI hot plug wakes up and tests the status of the card. It may decide that the card has gone away and delete the child. As part of that process nvme_detach is called. If this call happens after the config_intrhook starts to run, but before it is finished, there's a race where we can tear down the device's soft state while the config_intrhook is still using it. Use the new config_intrhook_drain to disestablish the hook. Either it will be removed w/o running, or the routine will wait for it to finish. This closes the race and allows safe hotplug at any time, even very early in boot. Sponsored by: Netflix, Inc Reviewed by: jhb, mav Differential Revision: https://reviews.freebsd.org/D29006 (cherry picked from commit 8423f5d4c127f18e7500bc455bc7b6b1691385ef)
* nvme: Make nvme_ctrlr_hw_reset staticWarner Losh2021-07-312-2/+1
| | | | | | | nvme_ctrlr_hw_reset is no longer used outside of nvme_ctrlr.c, so make it static. If we need to change this in the future we can. (cherry picked from commit dd2516fc078f15633ad5aedaad6de140cb491f80)
* nvme: use NVME_GONE rather than hard-coded 0xffffffffWarner Losh2021-07-313-4/+6
| | | | | | | Make it clearer that the value 0xffffffff is being used to detect the device is gone. We use it in other places in the driver for other meanings. (cherry picked from commit 9600aa31aa633bbb9e8a56d91a781d5a7ce2bff6)
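A minimal sketch of the idea, assuming the check is applied to a 32-bit register value read from the controller (reads from an absent PCI device return all ones). `NVME_GONE` is the name from the commit subject; `ctrlr_is_gone` is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All-ones value a 32-bit read returns when the device has gone away. */
#define NVME_GONE	0xffffffffu

static_assert(NVME_GONE == UINT32_MAX, "NVME_GONE is an all-ones read");

/* Hypothetical helper: true if a register read suggests a removed device. */
static bool
ctrlr_is_gone(uint32_t reg_value)
{
	return (reg_value == NVME_GONE);
}
```

Naming the constant keeps this use of 0xffffffff distinct from the other places where the same bit pattern carries a different meaning.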
* fix big-endian platforms after 6733401935f8Chuck Tuffli2021-07-311-5/+9
| | | | | | | | | | | The NVMe byte-swap routines for big-endian platforms used memcpy() to move the unaligned 64-bit value into a temp register to byte swap it. Instead of introducing a dependency, manually byte-swap the values in place. Point hat: me (cherry picked from commit e83fdf8bb391579fa422d34663cd8c1f82a00dc0)
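One way to byte-swap a possibly unaligned 64-bit value in place without `memcpy()` is to exchange its bytes pairwise. The sketch below assumes nothing about the driver's helper names; `swap64_in_place` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Reverse the 8 bytes at p. Byte-wise access carries no alignment
 * requirement and needs no temporary 64-bit copy.
 */
static void
swap64_in_place(void *p)
{
	uint8_t *b = p;
	uint8_t t;
	int i;

	assert(p != NULL);

	for (i = 0; i < 4; i++) {
		t = b[i];
		b[i] = b[7 - i];
		b[7 - i] = t;
	}
}
```

Calling it on `buf + 1` of a byte array swaps bytes 1 through 8 and leaves the neighbouring bytes untouched.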
* nvmecontrol: add device self-test op and log pageChuck Tuffli2021-07-311-0/+39
| | | | | | | | | | | | Add decoding of the Device Self-test log page and the ability to start or abort a test. Reviewed by: imp, mav Tested by: Muhammad Ahmad <muhammad.ahmad@seagate.com> MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D27517 (cherry picked from commit 6733401935f83754b4b2744bc3d33ef84b1271e0)
* nvme: Remove a wmb() that's not necessary.Warner Losh2021-07-311-8/+0
| | | | | | | | | | | | | | | | | | | | | | | | | bus_dmamap_sync() ensures that memory that's prepared for PREWRITE can be DMA'd immediately after it returns. The details differ, but this mirrors atomic thread release semantics, at least for the buffers synced. For non-x86 platforms, bus_dmamap_sync() has the right syncing and fences. So in the past, wmb() had been omitted for them. For x86 platforms, the memory ordering is already strong enough to ensure DMA to the device sees the current contents. As such, we don't need the wmb() here. It translates to an sfence which is only needed for writes to regions that have the write combining attribute set or when some exotic opcodes are used. The nvme driver does neither of these. Since bus_dmamap_sync() includes atomic_thread_fence_rel, we can be assured any optimizer won't reorder the bus_dmamap_sync and the bus_space_write operations. The wmb() was a vestige of the pre-busdma version initially committed to the tree. Reviewed by: kib@, gallatin@, chuck@, mav@ Differential Revision: https://reviews.freebsd.org/D27448 (cherry picked from commit 082905cad121bf6721606b6b9ba20a09bc6e56d0)
* NVME: Multiple busdma related fixes.Michal Meloun2021-07-311-4/+4
| | | | | | | | | | | | | | | | | | - in nvme_qpair_process_completions() do dma sync before completion buffer is used. - in nvme_qpair_submit_tracker(), don't do explicit wmb() also for arm and arm64. Bus_dmamap_sync() on these architectures is sufficient to ensure that all CPU stores are visible to external (including DMA) observers. - Allocate completion buffer as BUS_DMA_COHERENT. On not-DMA coherent systems, buffers continuously owned (and accessed) by DMA must be allocated with this flag. Note that BUS_DMA_COHERENT flag is no-op on DMA coherent systems (or coherent buses in mixed systems). MFC after: 4 weeks Reviewed by: mav, imp Differential Revision: https://reviews.freebsd.org/D27446 (cherry picked from commit 8f9d5a8dbf4ea69c5f9a1e3a36e23732ffaa5c75)
* Always use the __unused attribute even for potentially unused parameters.Michal Meloun2021-07-311-24/+17
| | | | | | | Requested by: ian, imp MFC with: r368167 (cherry picked from commit cf7c06293236710cc33de029fccd1185cb38c5fb)
* Unbreak r368167 in userland. Decorate unused arguments.Michal Meloun2021-07-311-17/+24
| | | | | | | Reported by: kp, tuexen, jenkins, and many others MFC with: r368167 (cherry picked from commit b2e9e573a392a973bea0ff180932913b7aa0eb66)
* nvme: change namei_request_zone into a malloc typeMateusz Guzik2021-07-312-8/+2
| | | | | | | | | | | | Both the size (128 bytes) and ephemeral nature of allocations make it a great fit for malloc. A dedicated zone unnecessarily avoids sharing buckets with 128-byte objects. Reviewed by: imp Differential Revision: https://reviews.freebsd.org/D27103 (cherry picked from commit 71460dfcb275f0a2a20b39a332b0e1149c6e7e3f)
* nvme: Remove compat code for older kernelsWarner Losh2021-07-311-10/+0
| | | | | | | | Remove code that supported pre-2011 kernels. CTLTYPE_S64 was defined in rev 217616. All supported branches have it, so remove its compat definition as OBE. (cherry picked from commit 0fc1d2088169456d469b53ecbe7832349917c29d)
* Use symbolic names for async eventsWarner Losh2021-07-312-1/+4
| | | | | | | Rather than |= 0x300, define and use async event names for the name space changes and the firmware activations that we're asking for. (cherry picked from commit 881534f09cccbf4bc749be22eb34ad57b5c13563)
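As a rough sketch of the change: 0x300 is bits 8 and 9 of the asynchronous event configuration, which correspond to namespace attribute notices and firmware activation notices. The macro and function names below are hypothetical, not necessarily the ones the driver defines.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical symbolic names for the two notice-enable bits. */
#define ASYNC_EVENT_NS_ATTRIBUTE	0x100u	/* bit 8: namespace changes */
#define ASYNC_EVENT_FW_ACTIVATE		0x200u	/* bit 9: firmware activation */

static_assert((ASYNC_EVENT_NS_ATTRIBUTE | ASYNC_EVENT_FW_ACTIVATE) == 0x300u,
    "the two names together replace the magic 0x300");

/* Enable both notices on top of whatever is already configured. */
static uint32_t
enable_notices(uint32_t config)
{
	return (config | ASYNC_EVENT_NS_ATTRIBUTE | ASYNC_EVENT_FW_ACTIVATE);
}
```

The resulting value is identical to the old `|= 0x300`; only the readability changes.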
* Report cpi->hba_* for nda(4) because why not.Alexander Motin2021-07-311-1/+5
| | | | MFC after: 1 week
* Add KASSERT to ensure sane nsid.Warner Losh2021-07-311-1/+6
| | | | | | | | All callers are currently filtering bad nsid to this function, however, we'll have undefined behavior if that's not true. Add the KASSERT to prevent that. (cherry picked from commit d5cc572ce6009993fb3c4f6c887194b9ec3c9815)
* Rename ns notification function...Warner Losh2021-07-311-3/+3
| | | | | | | This function is called whenever the namespace is added, deleted or changes. Update the name to reflect that. No functional change. (cherry picked from commit 950475ca2062b5d95efcf4d758cb5f33d7710aed)
* Make sure that we get the sbuf resources we need.Warner Losh2021-07-311-1/+2
| | | | | | | | | Since we're calling sbuf_new with NOWAIT, make sure it can allocate a buffer to use. Don't print anything if we can't get it. Noticed by: rpokala (cherry picked from commit 4e6a434b6bb81a7ae80911ec6730ff79b9352a88)
* Generate a devctl event for interesting eventsWarner Losh2021-07-311-8/+43
| | | | | | | When we reset the controller, and when the controller tells us about a critical warning, send an event. (cherry picked from commit 244b805397208842e4d8bbf1ad5b1b83dbcd4c91)
* nvme: Enable interrupts after qpair fully constructedWarner Losh2021-07-211-25/+25
| | | | | | | | | | | | | | | | | | | | To guard against the ill effects of a spurious interrupt during construction (or one that was bogusly pending), enable interrupts after the qpair is completely constructed. Otherwise, we can die with null pointer dereferences in nvme_qpair_process_completions. This has been observed in at least one pre-release NVMe drive where the MSIX interrupt fired while the queue was being created, before we'd started the NVMe controller card. The alternative of only turning on the interrupts after the rest was tried, but was insufficient to work around this bug and made the code more complicated w/o benefit. Reviewed by: mav, chuck Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D31182 (cherry picked from commit fc9a0840231770bc7e7dcfe4616babdc6d4389a6)
* nvme(4): Report NPWA before NPWG as stripesize.Alexander Motin2021-07-131-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | New Samsung 980 SSDs report Namespace Preferred Write Alignment of 8 (4KB) and Namespace Preferred Write Granularity of 32 (16KB). My quick tests show that 16KB is a minimal sequential write size when the SSD reaches peak IOPS, so writing much less is very slow. But writing slightly less or slightly more does not change much, so it seems not so much a size granularity as minimum I/O size. Thinking about different stripesize consumers: - Partition alignment should be based on NPWA by definition. - ZFS ashift in part of forcing alignment of all I/Os should also be based on NPWA. In part of forcing size granularity, if really needed, it may be set to NPWG, but too big value can make ZFS too space-inefficient, and the 16KB is actually the biggest supported value there now. - ZFS recordsize/volblocksize could potentially be tuned up toward NPWG to work as I/O size granularity, but enabled compression makes it too fuzzy. And those are normally user-configurable things. - ZFS I/O aggregation code could definitely use Optimal Write Size value and may be NPWG, but we don't have fields in GEOM now to report the minimal and optimal I/O sizes, and even maximal is not reported outside GEOM DISK to be used by ZFS. MFC after: 1 week (cherry picked from commit e3bcd07d834def94dcf570ac7350ca2c454ebf10)
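The preference order in the subject line (NPWA before NPWG) can be sketched as below. The sketch assumes both values are already expressed as counts of logical blocks and that 0 means "not reported"; `pick_stripesize` is a hypothetical name. With 512-byte blocks it reproduces the figures quoted above: 8 blocks is 4KB and 32 blocks is 16KB.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Prefer the write alignment (NPWA) as the stripe size; fall back to the
 * write granularity (NPWG) only when no alignment is reported.
 * Returns 0 when neither value is available.
 */
static uint32_t
pick_stripesize(uint32_t npwa_blocks, uint32_t npwg_blocks,
    uint32_t sector_size)
{
	assert(sector_size != 0);

	if (npwa_blocks != 0)
		return (npwa_blocks * sector_size);
	if (npwg_blocks != 0)
		return (npwg_blocks * sector_size);
	return (0);
}
```

For the drive described above, `pick_stripesize(8, 32, 512)` yields 4096 rather than 16384, which matches the reasoning that partition alignment should follow NPWA.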
* Partially revert r248770.Dmitry Chagin2021-04-161-1/+1
| | | | | | | | | | | Under geom(4) nvme_ns_bio_process() is on the path where sleep is prohibited as g_io_schedule_down() calls THREAD_NO_SLEEPING() before geom->start(). Reviewed By: imp Differential Revision: https://reviews.freebsd.org/D29539 (cherry picked from commit a78109d5db87b08785a822770e2e4fdb15f921b6)
* nvme: Replace potentially long DELAY() with pause().Alexander Motin2021-03-241-13/+11
| | | | | | | | | | | | | In some cases, such as broken hardware, nvme(4) may wait minutes for a controller response before timing out. Doing so in a tight spin loop made the whole system unresponsive. Reviewed by: imp MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29309 Sponsored by: iXsystems, Inc. (cherry picked from commit 4fbbe523653b6d2a0186aca38224efcab941deaa)
* MFC r368167,r368187,r368203:Michal Meloun2020-12-171-17/+50
| | | | | | | | | | | | | | | r368167: NVME: Don't try to swap data on little endian machines. These swapping functions violate BUSDMA contract - we cannot write to armed (by bus_dmamap_sync(PRE_..)) buffers. Remove them at least from little endian machines until a better solution will be developed. r368187: Unbreak r368167 in userland. Decorate unused arguments. r368203: Always use the __unused attribute even for potentially unused parameters. Notes: svn path=/stable/12/; revision=368717
* MFC r368275: nvme: Fix typo in definitionChuck Tuffli2020-12-142-2/+2
| | | | Notes: svn path=/stable/12/; revision=368638
* MFC r367955:Michal Meloun2020-12-141-0/+1
| | | | | | | | Ensure that the buffer is in nvme_single_map() mapped to single segment. Not a functional change. Notes: svn path=/stable/12/; revision=368630
* MFC r368132: Increase nvme(4) maximum transfer size from 1MB to 2MB.Alexander Motin2020-12-134-19/+16
| | | | | | | | | | | | | | | | With 4KB page size the 2MB is the maximum we can address with one page PRP. Going further would require chaining, that would add some more complexity. On the other side, to reduce memory consumption, allocate the PRP memory respecting maximum transfer size reported in the controller identify data. Many of NVMe devices support much smaller values, starting from 128KB. To do that we have to change the initialization sequence to pull the data earlier, before setting up the I/O queue pairs. The admin queue pair is still allocated for full MIN(maxphys, 2MB) size, but it is not a big deal, since there is only one such queue with only 16 trackers. Notes: svn path=/stable/12/; revision=368602
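The 2MB figure follows from simple arithmetic: a PRP entry is an 8-byte address, so one 4KB page holds 512 entries, and each entry maps one 4KB page. The sketch below works that out and also shows the sizing rule of respecting a smaller controller-reported limit; both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes addressable through a single page filled with 8-byte PRP entries. */
static uint64_t
max_xfer_one_prp_page(uint64_t page_size)
{
	assert(page_size >= sizeof(uint64_t) &&
	    page_size % sizeof(uint64_t) == 0);

	return ((page_size / sizeof(uint64_t)) * page_size);
}

/* Size PRP memory for the smaller of the controller and driver limits. */
static uint64_t
clamp_xfer(uint64_t ctrlr_max, uint64_t driver_max)
{
	return (ctrlr_max < driver_max ? ctrlr_max : driver_max);
}
```

For a controller that reports 128KB, `clamp_xfer(128 * 1024, max_xfer_one_prp_page(4096))` stays at 128KB, so far less PRP memory is needed per tracker.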
* MFC r367625: Fix panic if NVMe is detached before the intrhook call.Alexander Motin2020-11-192-1/+8
| | | | Notes: svn path=/stable/12/; revision=367825
* MFC r367659: Add PMRCAP printing and fix earlier CAP_HI.Alexander Motin2020-11-172-6/+48
| | | | Notes: svn path=/stable/12/; revision=367739
* MFC r367109, r367113: Print NVMe controller capabilities in verbose dmesg.Alexander Motin2020-11-042-2/+41
| | | | | | | | Those values are not reported in controller identification, while sometimes interesting for development and debugging. Notes: svn path=/stable/12/; revision=367329
* MFC r366911:Brooks Davis2020-10-291-3/+1
| | | | | | | | | | | | | | | | | | | | | vmapbuf: don't smuggle address or length in buf Instead, add arguments to vmapbuf. Since this argument is always a pointer use a type of void * and cast to vm_offset_t in vmapbuf. (In CheriBSD we've altered vm_fault_quick_hold_pages to take a pointer and check its bounds.) In no other situation does b_data contain a user pointer and vmapbuf replaces b_data with the actual mapping. Suggested by: jhb Reviewed by: imp, jhb Obtained from: CheriBSD Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D26784 Notes: svn path=/stable/12/; revision=367140
* MFC r366707: Use RTD3 Entry Latency value as shutdown timeout.Alexander Motin2020-10-211-3/+5
| | | | | | | | | This field was not in specs when the driver was written, but now there are SSDs with the reported latency of 10s, where hardcoded value of 5s seems to be not enough sometimes, causing shutdown timeout messages. Notes: svn path=/stable/12/; revision=366905
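A sketch of deriving the timeout from the reported latency, assuming RTD3 Entry Latency is given in microseconds, that a value of 0 means "not reported" and falls back to the old 5 seconds, and that the result is wanted in scheduler ticks. `shutdown_timeout_ticks` and its `hz` parameter are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert an RTD3 Entry Latency in microseconds to a timeout in ticks,
 * rounding up. Falls back to 5 seconds when no latency is reported.
 */
static uint64_t
shutdown_timeout_ticks(uint32_t rtd3e_us, uint32_t hz)
{
	assert(hz > 0);

	if (rtd3e_us == 0)
		return (5 * (uint64_t)hz);
	return (((uint64_t)rtd3e_us * hz + 999999) / 1000000);
}
```

With `hz` at 1000, a drive reporting 10s (10000000 microseconds) gets a 10000-tick timeout instead of the previous fixed 5000.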
* MFC r365946:David Bright2020-09-292-0/+4
| | | | | | | | | Add an ioctl to get an NVMe device's maximum transfer size Sponsored by: Dell EMC Isilon Notes: svn path=/stable/12/; revision=366255
* MFC r360483,360484: Make nvmecontrol work with nda like it does withColin Percival2020-09-261-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | nvd, and associated bits. This commit changes the size of 'struct ccb_pathinq_settings_nvme', which would normally risk breaking kernel ABI; however, that structure is only ever used as part of a union with larger structures -- so nothing really changes size. r360483: Return the nvmeX device associated with the ndaX device. Add the nvmeX device to the XPT_PATH_INQ nvme specific information. While one could figure this out by looking up the domain:bus:slot:function, it's a lot easier to have the SIM set it directly since the sim knows this. r360484: Implement the NVME_GET_NSID and NVME_PASSTHROUGH_CMD ioctls With these two ioctls implemented in the nda driver, nvmecontrol now works with nda just like it does with nvd. It eliminates the need to jump through odd hoops to get this data. Discussed with: imp Notes: svn path=/stable/12/; revision=366179
* MFC r362630: Fix few panics on NVMe's timing out initialization requests.Alexander Motin2020-07-021-13/+19
| | | | Notes: svn path=/stable/12/; revision=362881
* MFC r362337: Make polled request timeout less invasive.Alexander Motin2020-06-243-9/+17
| | | | | | | | | | | | | Instead of panic after one second of polling, make the normal timeout handler to activate, reset the controller and abort the outstanding requests. If all of it won't happen within 10 seconds then something in the driver is likely stuck bad and panic is the only way out. In particular this fixed device hot unplug during execution of those polled commands, allowing clean device detach instead of panic. Notes: svn path=/stable/12/; revision=362579
* MFC r362282: Fix admin qpair leak if detached during initial reset.Alexander Motin2020-06-242-17/+30
| | | | Notes: svn path=/stable/12/; revision=362578
* MFC r362100: Fix config_intrhook leak on initial reset failure.Alexander Motin2020-06-191-0/+2
| | | | Notes: svn path=/stable/12/; revision=362357
* MFC r360504 (by imp): Style(9) nit: put function name at start of line.Alexander Motin2020-06-181-1/+2
| | | | Notes: svn path=/stable/12/; revision=362340
* MFC r360503 (by imp): Move / reword a comment.Alexander Motin2020-06-181-7/+5
| | | | | | | | Explain what we're doing with mapping CAM's notion of a LUN to NVMe's notion of a namespace. Notes: svn path=/stable/12/; revision=362339
* MFC r360568:David Bright2020-05-147-17/+24
| | | | | | | | | | | | | | Fix various Coverity-detected errors in nvme driver This fixes several Coverity-detected errors in the nvme driver. CIDs addressed: 1008344, 1009377, 1009380, 1193740, 1305470, 1403975, 1403980 Sponsored by: Dell EMC Isilon Notes: svn path=/stable/12/; revision=361030
* MFC r356474, r356480, r356482, r356506:Alexander Motin2020-01-224-9/+210
| | | | | | | | | | | | | | Add Host Memory Buffer support to nvme(4). This allows cheapest DRAM-less NVMe SSDs to use some of host RAM (about 1MB per 1GB on the devices I have) for its metadata cache, significantly improving random I/O performance. Device reports minimal and preferable size of the buffer. The code limits it to 5% of physical RAM by default. If the buffer can not be allocated or below minimal size, the device will just have to work without it. Notes: svn path=/stable/12/; revision=356961
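The sizing policy described above can be sketched as: take the device's preferred size, cap it at 5% of physical RAM, and give up if the result falls below the device's minimum. `hmb_size` is a hypothetical helper and all sizes are in bytes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick a Host Memory Buffer size. Returns 0 when the capped size would be
 * below the device's minimum, meaning the device runs without a buffer.
 */
static uint64_t
hmb_size(uint64_t min_bytes, uint64_t pref_bytes, uint64_t physmem_bytes)
{
	uint64_t limit = physmem_bytes / 20;	/* 5% of physical RAM */
	uint64_t size = pref_bytes < limit ? pref_bytes : limit;

	assert(min_bytes <= pref_bytes);

	return (size >= min_bytes ? size : 0);
}
```

On a machine with plenty of RAM the preferred size is granted in full; on a small machine the buffer shrinks toward the minimum and is dropped entirely below it.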
* MFC r355774 (by mmel): Properly synchronize completion DMA buffers.Alexander Motin2020-01-221-5/+9
| | | | | | | | | Within command completion processing the callback function may access the DMAed data buffer. Synchronize it before use, not after. This allows an NVMe disk to be used on a non-DMA-coherent arm64 system. Notes: svn path=/stable/12/; revision=356957
* MFC r355721 (by imp): Move to using bool instead of boolean_tAlexander Motin2020-01-223-15/+15
| | | | | | | | | | | | While there are subtle semantic differences between bool and boolean_t, none of them matter in these cases. Prefer true/false when dealing with bool type. Preserve a couple of TRUEs since they are passed into int args into CAM. Preserve a couple of FALSEs when used for status.done, an int. Differential Revision: https://reviews.freebsd.org/D20999 Notes: svn path=/stable/12/; revision=356956
* MFC r355631 (by imp): Move reset to the interrupt processing stageAlexander Motin2020-01-222-19/+19
| | | | | | | | | This trims the boot time a bit more for AWS and other platforms that have nvme drives. There's no reason to do this inline. This has been in my tree a while, but IIRC I talked to Jim Harris about this at one of our face to face meetings. Notes: svn path=/stable/12/; revision=356955
* MFC r355465 (by imp): trackers always know what qpair they are onAlexander Motin2020-01-221-11/+14
| | | | | | | | | | | Don't needlessly pass around qpair pointers when the tracker knows what qpair it's on. This will simplify code and make it easier to split submission and completion queues in the future. Signed-off-by: John Meneghini <johnm@netapp.com> Notes: svn path=/stable/12/; revision=356954