path: root/sys/ofed/drivers/infiniband/core
Commit message (Author, Age, Files, Lines)
* ofed: jiffies is unsigned long (Konstantin Belousov, 2025-04-29, 2 files, -4/+4)
| | | | | Sponsored by: NVidia networking Differential revision: https://reviews.freebsd.org/D48878
* LinuxKPI: Remove owner argument from class_create function on KBI layer (Vladimir Kondratyev, 2024-07-21, 1 file, -1/+0)
| | | | | | | To chase Linux 6.4 Sponsored by: Serenity Cyber Security, LLC Differential Revision: https://reviews.freebsd.org/D45849
* ibcore: Mark write-only variables (Andrew Turner, 2024-06-12, 2 files, -10/+10)
| | | | | | | | | | | | Some LinuxKPI lock macros need a flags field passed in. This is written to but never read from, so gcc complains. Fix this by marking the flags variables as unused to quieten the compiler. Reviewed by: brooks (earlier version), kib Sponsored by: Arm Ltd Differential Revision: https://reviews.freebsd.org/D45303
* ibcore: Remove the use of NULL_IB_OBJECT (Ka Ho Ng, 2024-04-12, 1 file, -5/+3)
| | | | | | | | | | | LinuxKPI's XArray implementation accepts NULL as an input as of the following commit: - linuxkpi: Accept NULL as a value in linux_xarray (3102ea3b15b6) Sponsored by: Juniper Networks, Inc. MFC after: 1 week Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D44533
* ofed: mask seq_num identifier to occupy only 3 bytes (Bartosz Sobczak, 2023-08-22, 1 file, -0/+1)
| | | | | | | | | | | | | | | | | | The seq_num among other things is used to assign rq_psn value, which is a 24-bit identifier. When the seq_num is full 4-byte value, we are usually receiving: '_ib_modify_qp rq_psn overflow, masking to 24 bits' warning. This is burdensome for running rdma traffic with large number of connections, because the number of logs is growing fast. Signed-off-by: Bartosz Sobczak <bartosz.sobczak@intel.com> Signed-off-by: Eric Joyner <erj@FreeBSD.org> Reviewed by: kib@, erj@ MFC after: 3 days Sponsored by: Intel Corporation Differential Revision: https://reviews.freebsd.org/D41531
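The masking described above can be sketched in a few lines of C. This is only an illustration of keeping a value within a 24-bit PSN: the names `PSN_MASK_24`, `mask_seq_num()` and `psn_next()` are hypothetical and are not taken from the ofed sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant: a PSN occupies 24 bits (3 bytes). */
#define PSN_MASK_24 0xFFFFFFu

/* Keep only the low 3 bytes of a 32-bit sequence number. */
static inline uint32_t
mask_seq_num(uint32_t seq_num)
{
	return (seq_num & PSN_MASK_24);
}

/* Illustration only: advance a PSN with 24-bit wraparound. */
static inline uint32_t
psn_next(uint32_t psn)
{
	assert(psn <= PSN_MASK_24);	/* caller passes an already masked value */
	return ((psn + 1) & PSN_MASK_24);
}
```

Masking at the point where the sequence number is generated means later consumers never see a value wider than 24 bits, so the overflow warning has nothing to report.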
* sys: Remove $FreeBSD$: one-line .c pattern (Warner Losh, 2023-08-16, 31 files, -62/+0)
| | | | Remove /^[\s*]*__FBSDID\("\$FreeBSD\$"\);?\s*\n/
* sys: Remove $FreeBSD$: two-line .h pattern (Warner Losh, 2023-08-16, 11 files, -22/+0)
| | | | Remove /^\s*\*\n \*\s+\$FreeBSD\$$\n/
* ofed: fix roce gid insertion for vlan interfaces (Bartosz Sobczak, 2023-08-14, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | When attempting to use a vlan interface, the correct GID wasn't created due to incorrect ifp validation. The problem was introduced in 3e142e07675b ('ofed: Mechanically convert to IfAPI'). Signed-off-by: Bartosz Sobczak <bartosz.sobczak@intel.com> Signed-off-by: Eric Joyner <erj@FreeBSD.org> PR: 273043 Sponsored by: Intel Corporation Reviewed by: jhibbits@, erj@ MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D41426
* ofed: Mechanically convert to IfAPI (Justin Hibbits, 2023-03-24, 9 files, -182/+194)
| | | | | | | | | | Summary: Because of the intricacies of this code it wasn't purely scripted, but instead hand-mechanical. Reviewed by: hselasky Sponsored by: Juniper Networks, Inc. Differential Revision: https://reviews.freebsd.org/D38560
* ibcore: The use of IN_LOOPBACK() now requires a valid VNET context. (Hans Petter Selasky, 2022-09-23, 1 file, -27/+54)
| | | | | | | | | Make sure the VNET is set before using this macro. Fixes: efe58855f3ea2cfc24cb705aabce3bc0fe1fb6d5 PR: 266054 MFC after: 1 week Sponsored by: NVIDIA Networking
* ibcore: Add support for RDMA/RoCE using VLAN(4) devices. (Hans Petter Selasky, 2022-08-22, 1 file, -1/+1)
| | | | | | | | | | | | Classify VLAN devices as ethernet in rdma_copy_addr(). This fixes the following error message: rdma_bind_addr: No such file or directory Submitted by: bartosz.sobczak@intel.com (Bartosz Sobczak) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D36120 Sponsored by: NVIDIA Networking
* ofed/infiniband: fix ifdefs for new INET changes, fixing LINT-NOIP (Mike Karels, 2022-07-18, 1 file, -0/+11)
| | | | | | | | | | | | | | Some of the ofed/infiniband code has INET and INET6 address handling code without using ifdefs. This failed with a recent change to INET, in which IN_LOOPBACK() started using a VNET variable, and which is not present if INET is not configured. Add #ifdef INET, and INET6 for good measure, in cma_loopback_addr(), along with inclusion of the options headers in ib_cma.c. Reviewed by: hselasky rgrimes bz Differential Revision: https://reviews.freebsd.org/D35835 (cherry picked from commit 752b7632776237f9c071783acdd1136ebf5f287d)
* ibcore: Fix a race with disassociate and exit_mmap() (Hans Petter Selasky, 2022-06-21, 1 file, -0/+4)
| | | | | | | | | | | | | | | | | | | | If uverbs_user_mmap_disassociate() is called while the mmap is concurrently doing exit_mmap then the ordering of the rdma_user_mmap_entry_put() is not reliable. The put must be done before uverbs_user_mmap_disassociate() returns, otherwise there can be a use after free on the ucontext, and a left over entry in the xarray. If the put is not done here then it is done during rdma_umap_close() later. Add the missing put to the error exit path. Linux commit: 39c011a538272589b9eb02ff1228af528522a22c PR: 264473 MFC after: 3 days Sponsored by: NVIDIA Networking
* ibcore: Fix sysfs registration error flow (Hans Petter Selasky, 2022-06-21, 1 file, -1/+1)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kernel commit cited below restructured ib device management so that the device kobject is initialized in ib_alloc_device. As part of the restructuring, the kobject is now initialized in procedure ib_alloc_device, and is later added to the device hierarchy in the ib_register_device call stack, in procedure ib_device_register_sysfs (which calls device_add). However, in the ib_device_register_sysfs error flow, if an error occurs following the call to device_add, the cleanup procedure device_unregister is called. This call results in the device object being deleted -- which results in various use-after-free crashes. The correct cleanup call is device_del -- which undoes device_add without deleting the device object. The device object will then (correctly) be deleted in the ib_register_device caller's error cleanup flow, when the caller invokes ib_dealloc_device. Linux commit: b312be3d87e4c80872cbea869e569175c5eb0f9a PR: 264472 MFC after: 3 days Sponsored by: NVIDIA Networking
* ibcore: Fix use-after-free access in ucma_close() (Hans Petter Selasky, 2022-06-13, 1 file, -0/+3)
| | | | | | | | | | | | | The error in ucma_create_id() left ctx in the list of contexts belonging to the ucma file descriptor. An attempt to close this file descriptor causes use-after-free accesses while iterating over that list. Linux commit: ed65a4dc22083e73bac599ded6a262318cad7baf PR: 264650 MFC after: 1 week Sponsored by: NVIDIA Networking
* ibcore: Fix missing ib_cm_destroy_id() in ib_cm_insert_listen() (Hans Petter Selasky, 2022-05-30, 1 file, -0/+1)
| | | | | | | | | | | | | The algorithm pre-allocates a cm_id since allocation cannot be done while holding the cm.lock spinlock, however it doesn't free it on one error path, leading to a memory leak. Linux commit: c14dfddbd869bf0c2bafb7ef260c41d9cebbcfec PR: 264248 MFC after: 1 week Sponsored by: NVIDIA Networking
* ibcore: Fix possible memory leak in ib_mad_post_receive_mads() (Hans Petter Selasky, 2022-05-19, 1 file, -0/+1)
| | | | | | | | | | | | | | If ib_dma_mapping_error() returns non-zero value, ib_mad_post_receive_mads() will jump out of loops and return -ENOMEM without freeing mad_priv. Fix this memory-leak problem by freeing mad_priv in this case. Linux commit: a17f4bed811c60712d8131883cdba11a105d0161 PR: 264057 MFC after: 1 week Sponsored by: NVIDIA Networking
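The leak pattern described above (an allocation that is not released when a later step fails) can be sketched in user-space C. This is a simplified stand-in, not the ibcore code: `struct recv_buf`, `map_buffer()`, `post_one_receive()` and the `live_bufs` counter are all hypothetical names used only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct recv_buf {
	char data[256];
};

static int live_bufs;	/* outstanding allocations, for demonstration */

/* Hypothetical stand-in for a DMA mapping step that can fail. */
static int
map_buffer(struct recv_buf *buf, int should_fail)
{
	(void)buf;
	return (should_fail ? -1 : 0);
}

/* Allocate and "map" one receive buffer; free it again on mapping failure. */
static int
post_one_receive(int should_fail, struct recv_buf **out)
{
	struct recv_buf *buf;

	assert(out != NULL);
	buf = malloc(sizeof(*buf));
	if (buf == NULL)
		return (-ENOMEM);
	live_bufs++;

	if (map_buffer(buf, should_fail) != 0) {
		/* Without this free the buffer would leak on the error path. */
		free(buf);
		live_bufs--;
		return (-ENOMEM);
	}
	*out = buf;
	return (0);
}
```

The point of the sketch is that every early return after a successful allocation must release what was allocated before returning.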
* ibcore: Remove set, but not used variable. (Hans Petter Selasky, 2022-05-05, 1 file, -3/+0)
| | | | | MFC after: 1 week Sponsored by: NVIDIA Networking
* ibcore: Allow passing NULL-pointers to ib_umem_release() (Hans Petter Selasky, 2022-05-02, 1 file, -2/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | | FreeBSD commit b633e08c705fe43180567eae26923d6f6f98c8d9 removed the NULL-checks from the mlx4ib-driver. This fixes the following NULL-pointer panic when unloading mlx4ib: ib_umem_release() mlx4_ib_destroy_qp() ib_destroy_qp_user() ipoib_transport_dev_cleanup() ipoib_dev_cleanup() ipoib_remove_one() ib_unregister_client() ipoib_cleanup_module() linker_file_sysuninit() linker_file_unload() kern_kldunload() amd64_syscall() Linux commit: 836a0fbb3e76f704ad65ddfb57f00725245e509b MFC after: 1 week Submitted by: dandan@lysator.liu.se Sponsored by: Lysator ACS Sponsored by: NVIDIA Networking
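A release function that tolerates NULL, in the spirit of the change above, might look like the following user-space sketch. The structure and function names (`struct umem_sketch`, `umem_alloc_sketch()`, `umem_release_sketch()`) are hypothetical; the real ib_umem_release() operates on kernel objects and is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct umem_sketch {
	void *pages;
};

static int live_objects;	/* objects allocated and not yet released */

static struct umem_sketch *
umem_alloc_sketch(size_t len)
{
	struct umem_sketch *umem = calloc(1, sizeof(*umem));

	if (umem == NULL)
		return (NULL);
	umem->pages = malloc(len);
	if (umem->pages == NULL) {
		free(umem);
		return (NULL);
	}
	live_objects++;
	return (umem);
}

/* Like free(3): passing NULL is a harmless no-op. */
static void
umem_release_sketch(struct umem_sketch *umem)
{
	if (umem == NULL)
		return;
	assert(live_objects > 0);
	free(umem->pages);
	free(umem);
	live_objects--;
}
```

Putting the NULL check inside the release function lets callers drop their own checks, which is the situation the commit message describes for the mlx4ib driver.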
* ofed: Remove a double word in a source code comment (Gordon Bergling, 2022-04-09, 1 file, -1/+1)
| | | | | | - s/is is/is/ MFC after: 3 days
* ibcore: Fix multiple includes of same header file. (Hans Petter Selasky, 2022-03-03, 1 file, -1/+0)
| | | | | MFC after: 1 week Sponsored by: NVIDIA Networking
* ibcore: Add support for NDR link speed. (Hans Petter Selasky, 2022-02-21, 1 file, -0/+4)
| | | | | | | | | | Add new IBTA speed NDR, supporting signaling rate of 100Gb. Linux commit: c7adf7717301558e8852949d8e3dc3748d1a4a97 MFC after: 1 week Sponsored by: NVIDIA Networking
* routing: Allow using IPv6 next-hops for IPv4 routes (RFC 5549). (Zhenlei Huang, 2021-08-22, 1 file, -3/+11)
| | | | | | | | | | | | | | | | | | | | | | | Implement kernel support for RFC 5549/8950. * Relax control plane restrictions and allow specifying IPv6 gateways for IPv4 routes. This behavior is controlled by the net.route.rib_route_ipv6_nexthop sysctl (on by default). * Always pass final destination in ro->ro_dst in ip_forward(). * Use ro->ro_dst to extract packet family inside if_output() routines. Consistently use RO_GET_FAMILY() macro to handle ro=NULL case. * Pass extracted family to nd6_resolve() to get the LLE with proper encap. It leverages recent lltable changes committed in c541bd368f86. Presence of the functionality can be checked using ipv4_rfc5549_support feature(3). Example usage: route add -net 192.0.0.0/24 -inet6 fe80::5054:ff:fe14:e319%vtnet0 Differential Revision: https://reviews.freebsd.org/D30398 MFC after: 2 weeks
* lltable: Add support for "child" LLEs holding encap for IPv4oIPv6 entries. (Alexander V. Chernikov, 2021-08-21, 1 file, -2/+3)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we use pre-calculated headers inside LLE entries as prepend data for `if_output` functions. Using these headers allows saving some CPU cycles/memory accesses on the fast path. However, this approach makes adding L2 header for IPv4 traffic with IPv6 nexthops more complex, as it is not possible to store multiple pre-calculated headers inside lle. Additionally, the solution space is limited by the fact that PCB caching saves LLEs in addition to the nexthop. Thus, add support for creating special "child" LLEs for the purpose of holding custom family encaps and store mbufs pending resolution. To simplify handling of those LLEs, store them in a linked-list inside a "parent" (e.g. normal) LLE. Such LLEs are not visible when iterating LLE table. Their lifecycle is bound to the "parent" LLE - it is not possible to delete "child" when parent is alive. Furthermore, "child" LLEs are static (RTF_STATIC), avoiding the complex state machine used by the standard LLEs. nd6_lookup() and nd6_resolve() now accept an additional argument, family, allowing them to return such child LLEs. This change uses `LLE_SF()` macro which packs family and flags in a single int field. This is done to simplify merging back to stable/. Once this code lands, most of the cases will be converted to use a dedicated `family` parameter. Differential Revision: https://reviews.freebsd.org/D31379 MFC after: 2 weeks
* ibcore: Kernel space update based on Linux 5.7-rc1. (Hans Petter Selasky, 2021-07-28, 24 files, -3159/+7828)
Overview:

This is the first stage of an RDMA stack upgrade introducing kernel changes only, based on Linux 5.7-rc1.

This patch is based on about four main areas of work:
- Update of the IB uobjects system:
  - The memory holding so-called AH, CQ, PD, SRQ and UCONTEXT objects is now managed by ibcore. This also requires some changes in the kernel verbs API. The updated verbs changes are typically about initializing and deinitializing objects, and removing allocation and freeing of memory.
- Update of the uverbs IOCTL framework:
  - The parsing and handling of user-space commands has been completely refactored to integrate with the updated IB uobjects system.
- Various changes and updates to the generic uverbs interfaces in device drivers including the new uAPI surface.
- The mlx5_ib_devx.c in mlx5ib and related mlx5 core changes.

Dependencies:
- The mlx4ib driver code has been updated with the minimum changes needed.
- The mlx5ib driver code has been updated with the minimum changes needed including DV support.

Compatibility:
- All user-space facing APIs are backwards compatible after this change.
- All kernel-space facing RDMA APIs are backwards compatible after this change, with the exception of ib_create_ah() and ib_destroy_ah(), which take a new flag.
- The "ib_device_ops" structure exists, but only contains the driver ID and some structure sizes.

Differences from Linux:
- Infiniband drivers must use the INIT_IB_DEVICE_OPS() macro to set the sizes needed for allocating various IB objects, when adding IB device instances.

Security:
- PRIV_NET_RAW is needed to use raw ethernet transmit features.
- PRIV_DRIVER is needed to use other privileged operations.
Based on upstream Linux, Torvalds (5.7-rc1): 8632e9b5645bbc2331d21d892b0d6961c1a08429 MFC after: 1 week Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D31149 Sponsored by: NVIDIA Networking
* ibcore: Add some functions and definitions for selecting and querying retryable ucontext cleanup. (Hans Petter Selasky, 2021-07-12, 1 file, -0/+1)
| | | | | | | | | Linux commit: 1c77483e4c50339b0306572167ccbff6b55d051b MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Declare ib_post_send() and ib_post_recv() arguments const (Hans Petter Selasky, 2021-07-12, 3 files, -9/+14)
| | | | | | | | | | | | | | | | | | | Since neither ib_post_send() nor ib_post_recv() modify the data structure their second argument points at, declare that argument const. This change makes it necessary to declare the 'bad_wr' argument const too and also to modify all ULPs that call ib_post_send(), ib_post_recv() or ib_post_srq_recv(). This patch does not change any functionality but makes it possible for the compiler to verify whether the ib_post_(send|recv|srq_recv) really do not modify the posted work request. Linux commit: f696bf6d64b195b83ca1bdb7cd33c999c9dcf514 7bb1fafc2f163ad03a2007295bb2f57cfdbfb630 d34ac5cd3a73aacd11009c4fc3ba15d7ea62c411 MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Implement ib_uverbs_get_ucontext_file(). (Hans Petter Selasky, 2021-07-12, 1 file, -0/+24)
| | | | | | | | | | | | | | | Expose ib_ucontext from a given ib_uverbs_file. Drivers that use the ioctl(9) API may have the ib_uverbs_file and need a way to get the related ib_ucontext from it, this is enabled by this patch. Downstream patches from this series will use it. Linux commit: 7dc08dcfc8c86cb4457e383734ff6844ddaff876 MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Clean up INIT_UDATA() and INIT_UDATA_BUF_OR_NULL() macro usage. (Hans Petter Selasky, 2021-07-12, 3 files, -50/+57)
| | | | | | | | | | | | | | | | | | | | | | | | | | We get a harmless warning about the fact that we use the result of a multiplication as a condition in INIT_UDATA_BUF_OR_NULL(): uverbs_main.c: In function 'ib_uverbs_write': error: '*' in boolean context, suggest '&&' instead [-Werror=int-in-bool-context] This avoids the problem by using an inline function in place of the macro. After changing INIT_UDATA_BUF_OR_NULL() to an inline function, do the same change to INIT_UDATA() for consistency. Using an inline function gives us better type safety here among other issues with macros. I'm using u64_to_user_ptr() to convert the user pointer to simplify the logic rather than adding lots of new type casts. Linux commit: 12f727721eee61b3d19dedb95cb893b2baa9fe41 40a203396cc1c239f2e71c47c66ed03097123d2c MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
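The macro-to-inline-function change described above can be illustrated with a reduced sketch. The structure layout and the function name `init_udata_buf_or_null()` are assumptions made for this example; they are not the ibcore definitions, and the user-pointer conversion via u64_to_user_ptr() is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for the udata structure. */
struct udata_sketch {
	const void *inbuf;
	void *outbuf;
	size_t inlen;
	size_t outlen;
};

/*
 * Inline replacement for a macro of the form
 *     (udata)->inbuf = (ilen) ? (ibuf) : NULL;
 * With a macro, passing "a * b" as ilen places a multiplication directly
 * in a boolean context. Here the length is an ordinary typed parameter
 * and the comparison against zero is explicit.
 */
static inline void
init_udata_buf_or_null(struct udata_sketch *udata, const void *ibuf,
    void *obuf, size_t ilen, size_t olen)
{
	assert(udata != NULL);
	udata->inbuf = (ilen != 0) ? ibuf : NULL;
	udata->outbuf = (olen != 0) ? obuf : NULL;
	udata->inlen = ilen;
	udata->outlen = olen;
}
```

Besides silencing the warning, the function form evaluates each argument exactly once and lets the compiler check argument types.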
* ibcore: Simplify ib_modify_qp_is_ok(). (Hans Petter Selasky, 2021-07-12, 1 file, -12/+7)
| | | | | | | | | | | | | | | | All callers of ib_modify_qp_is_ok() provide an enum ib_qp_state, which makes the out-of-range checks redundant. Remove them and update the function signature to return a boolean result. While at it, remove the unused "ll" parameter from ib_modify_qp_is_ok(). Linux commit: 19b1f54099b6ee334acbfbcfbdffd1d1f057216d d31131bba5a1630304c55ea775c48cc84912ab59 MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Support rate limit for packet pacing (Hans Petter Selasky, 2021-07-12, 1 file, -0/+2)
| | | | | | | | | | | | | | | Add new member rate_limit to ib_qp_attr which holds the packet pacing rate in kbps, 0 means unlimited. IB_QP_RATE_LIMIT is added to ib_attr_mask and could be used by RAW QPs when changing QP state from RTR to RTS, RTS to RTS. Linux commit: 528e5a1bd3f0e9b760cb3a1062fce7513712a15d MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Add new IB rates. (Hans Petter Selasky, 2021-07-12, 1 file, -20/+28)
| | | | | | | | | | | | Add the new rates that were added to Infiniband spec as part of HDR and 2x support. Linux commit: a5a5d1993696419e7d5357fc3128e53d219d382e MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Don't allocate method table, if already present. (Hans Petter Selasky, 2021-07-12, 1 file, -2/+5)
| | | | | | | | | | | This commit aligns the code in question with upstream Linux. Linux commit: 2468b82d69e3a53d024f28d79ba0fdb8bf43dfbf MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Fix a use-after-free in ucma_resolve_ip(). (Hans Petter Selasky, 2021-07-12, 1 file, -0/+2)
There is a race condition between ucma_close() and ucma_resolve_ip():

CPU0                                CPU1
ucma_resolve_ip():                  ucma_close():

ctx = ucma_get_ctx(file, cmd.id);

                                    list_for_each_entry_safe(ctx, tmp,
                                        &file->ctx_list, list) {
                                      mutex_lock(&mut);
                                      idr_remove(&ctx_idr, ctx->id);
                                      mutex_unlock(&mut);
                                      ...
                                      mutex_lock(&mut);
                                      if (!ctx->closing) {
                                        mutex_unlock(&mut);
                                        rdma_destroy_id(ctx->cm_id);
                                      ...
                                      ucma_free_ctx(ctx);
                                    }

ret = rdma_resolve_addr();
ucma_put_ctx(ctx);

Before idr_remove(), ucma_get_ctx() could still find the ctx and after rdma_destroy_id(), rdma_resolve_addr() may still access id_priv pointer. Also, ucma_put_ctx() may use ctx after ucma_free_ctx() too.

ucma_close() should call ucma_put_ctx() too which tests the refcnt and waits for the last one releasing it. The similar pattern is already used by ucma_destroy_id().

Linux commit: 5fe23f262e0548ca7f19fb79f89059a60d087d22
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Define option to set ack timeout. (Hans Petter Selasky, 2021-07-12, 2 files, -0/+41)
| | | | | | | | | | | | | | Define new option in 'rdma_set_option' to override calculated QP timeout when requested to provide QP attributes to modify a QP. At the same time, pack tos_set to be bitfield. Linux commit: 2c1619edef61a03cb516efaa81750784c3071d10 MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Do not overreact to SM LID change event. (Hans Petter Selasky, 2021-07-12, 1 file, -1/+0)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When IPoIB receives an SM LID change event, it reacts by flushing its path record cache and rejoining multicast groups. This is the same behavior it performs when it receives a reregistration event. This behavior is unnecessary as an SM may have database backup or synchronization mechanisms which permit the SM location or LID to change without loss of multicast membership and without impact to path records. Both opensm and the OPA FM issue reregistration events if a new SM is started (or restarted with a new config) or an SM event occurs which results in loss of multicast membership records by the SM (such as opensm failover) or the SM encounters new nodes with Active ports (such as after joining 2 fabrics by connecting switches via ISLs). Hence this event can be depended on as the trigger for IPoIB cache and multicast flushing. It appears that some drivers, such as qib, and hfi1 issue the IB_EVENT_SM_CHANGE but other drivers such as mlx4 and mlx5 do not. Empirical testing on Mellanox EDR using ibv_asyncwatch has confirmed that Mellanox EDR HCAs do not generate SM change events and that opensm does generate reregistration. An SM LID change event is generated by the mentioned drivers to reflect that sm_lid and/or sm_sl in the local port info has changed. The intent of this event is to permit applications and ULPs which have a local copy of this information (or an address handle using it) to update their information. The intent is that the reregistration event (caused by the SM via a bit in Set(PortInfo)) be used to inform nodes that they need to rejoin multicast groups, resubscribe for notices and potentially update path records. When an SM migrates or fails over, a SM LID change event can occur. 
In response IPoIB discards path records and multicast membership and loses connectivity until these records are restored via SA requests. In very large fabrics, it may take minutes for the SM to be ready and for the SA responses to be supplied. This can result in undesirable and unnecessary IPoIB connectivity impacts. It also can result in an unnecessary storm of SA queries from all nodes in a cluster potentially followed by yet another storm if the SM issues the reregistration request. The fact that Mellanox HCAs do not even generate this event is further evidence that on modern IB fabrics there will be no ill side effects from the proposed changes below to reduce the reaction by 3 kernel components to this event. So these changes should be benign for Mellanox IB fabrics and will benefit OPA fabrics while also making ib_core and ULP behavior "correct" as intended by the IBTA spec and kernel RDMA event APIs. Address these issues by removing IB_EVENT_SM_CHANGE handling from ipoib. IPoIB does not locally store sm_lid nor sm_sl, so it does not need to do anything on SM LID change. IPoIB makes use of other ib_core components to issue SA requests for it and those components correctly track SM LID and SM LID changes. Also in ib_core multicast handling, remove the test for IB_EVENT_SM_CHANGE. This code is moving all multicast groups to the error state, which will trigger rejoins. This code is used by IPoIB as well as the connection manager and other clients of multicast groups. This kernel module centralizes group membership status and joins since a node can only join a given group once but multiple ULPs or applications may want to join the same group. It makes use of the sa_query.c component in ib_core, which correctly tracks SM LID and SL. This component does not track SM LID nor SL itself and hence need not react to their changes. Similarly, in the ib_core cache code, remove the handling for IB_EVENT_SM_CHANGE.
The ib_cache_update function, which is ultimately called, updates local copies of the pkey table, gid table and lmc. It does not update nor retain sm_lid nor sm_sl. As such it does not need to be called on an SM LID change. It technically also does not need to be called on a reregistration. The LID_CHANGE, PKEY_CHANGE, GID_CHANGE and port state change events (PORT_ERR, PORT_ACTIVE) should be sufficient triggers. It is worth noting that the alternative of simply having the hfi1 and qib drivers not generate the SM LID change event was explored. While this would duplicate what Mellanox drivers do now, it is not the correct behavior and removes the ability for an SM to migrate without requiring reregistration. Since both opensm and OPA SM have mechanisms to back up or synchronize registration information, it is desirable to let them perform SM migrations (with LID or SL changes) without requiring reregistration when they deem it appropriate. Linux commit: ba7d8117f3cca8eb70d579fde3f9ec8cd6a28f39 MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Remove debug prints after allocation failure. (Hans Petter Selasky, 2021-07-12, 1 file, -33/+7)
| | | | | | | | | | | | | The prints after [k|v][m|z|c]alloc() functions are not needed, because in case of failure the allocator prints its own error messages anyway. Linux commit: 2716243212241855cd9070883779f6e58967dec5 MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Fix use-after-free in IB mad completion handling. (Hans Petter Selasky, 2021-07-12, 1 file, -13/+11)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We encountered a use-after-free bug when unloading the driver: BUG: KASAN: use-after-free in ib_mad_post_receive_mads+0xddc/0xed0 [ib_core] Read of size 4 at addr ffff8882ca5aa868 by task kworker/u13:2/23862 Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core] Call Trace: dump_stack+0x9a/0xeb print_address_description+0xe3/0x2e0 ib_mad_post_receive_mads+0xddc/0xed0 [ib_core] __kasan_report+0x15c/0x1df ib_mad_post_receive_mads+0xddc/0xed0 [ib_core] kasan_report+0xe/0x20 ib_mad_post_receive_mads+0xddc/0xed0 [ib_core] find_mad_agent+0xa00/0xa00 [ib_core] qlist_free_all+0x51/0xb0 mlx4_ib_sqp_comp_worker+0x1970/0x1970 [mlx4_ib] quarantine_reduce+0x1fa/0x270 kasan_unpoison_shadow+0x30/0x40 ib_mad_recv_done+0xdf6/0x3000 [ib_core] _raw_spin_unlock_irqrestore+0x46/0x70 ib_mad_send_done+0x1810/0x1810 [ib_core] mlx4_ib_destroy_cq+0x2a0/0x2a0 [mlx4_ib] _raw_spin_unlock_irqrestore+0x46/0x70 debug_object_deactivate+0x2b9/0x4a0 __ib_process_cq+0xe2/0x1d0 [ib_core] ib_cq_poll_work+0x45/0xf0 [ib_core] process_one_work+0x90c/0x1860 pwq_dec_nr_in_flight+0x320/0x320 worker_thread+0x87/0xbb0 __kthread_parkme+0xb6/0x180 process_one_work+0x1860/0x1860 kthread+0x320/0x3e0 kthread_park+0x120/0x120 ret_from_fork+0x24/0x30 ... 
Freed by task 31682: save_stack+0x19/0x80 __kasan_slab_free+0x11d/0x160 kfree+0xf5/0x2f0 ib_mad_port_close+0x200/0x380 [ib_core] ib_mad_remove_device+0xf0/0x230 [ib_core] remove_client_context+0xa6/0xe0 [ib_core] disable_device+0x14e/0x260 [ib_core] __ib_unregister_device+0x79/0x150 [ib_core] ib_unregister_device+0x21/0x30 [ib_core] mlx4_ib_remove+0x162/0x690 [mlx4_ib] mlx4_remove_device+0x204/0x2c0 [mlx4_core] mlx4_unregister_interface+0x49/0x1d0 [mlx4_core] mlx4_ib_cleanup+0xc/0x1d [mlx4_ib] __x64_sys_delete_module+0x2d2/0x400 do_syscall_64+0x95/0x470 entry_SYSCALL_64_after_hwframe+0x49/0xbe The problem was that the MAD PD was deallocated before the MAD CQ. There was completion work pending for the CQ when the PD got deallocated. When the mad completion handling reached procedure ib_mad_post_receive_mads(), we got a use-after-free bug in the following line of code in that procedure: sg_list.lkey = qp_info->port_priv->pd->local_dma_lkey; (the pd pointer in the above line is no longer valid, because the pd has been deallocated). We fix this by allocating the PD before the CQ in procedure ib_mad_port_open(), and deallocating the PD after freeing the CQ in procedure ib_mad_port_close(). Since the CQ completion work queue is flushed during ib_free_cq(), no completions will be pending for that CQ when the PD is later deallocated. Note that freeing the CQ before deallocating the PD is the practice in the ULPs. Linux commit: 770b7d96cfff6a8bf6c9f261ba6f135dc9edf484 MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Fail early if unsupported QP is provided. (Hans Petter Selasky, 2021-07-12, 1 file, -0/+4)
| | | | | | | | | | | | | When requested QP type is not supported for a {device, port}, return the error right away before validating all parameters during mad agent registration time. Linux commit: 798bba01b44b0ddf8cd6e542635b37cc9a9b739c MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Use inline function to validate port (Hans Petter Selasky, 2021-07-12, 3 files, -17/+15)
| | | | | | | | | Linux commit: 24dc831b77eca9361cf835be59fa69ea0e471afc MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Validate port number in query_pkey verb. (Hans Petter Selasky, 2021-07-12, 1 file, -0/+3)
| | | | | | | | | | | Before calling the driver's function let's make sure port is valid. Linux commit: 9af3f5cf9d64a056eca53bc643f6288ad28bbbb5 MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
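Validating the port before handing the request to the driver can be sketched as follows. This is a simplified model under stated assumptions: `struct dev_sketch`, `port_is_valid()` and `query_pkey_sketch()` are hypothetical names, and the real verb calls into a driver callback rather than returning 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device description: the valid port range is inclusive. */
struct dev_sketch {
	unsigned int first_port;
	unsigned int last_port;
};

/* Inline range check, in place of open-coded comparisons at each call site. */
static inline bool
port_is_valid(const struct dev_sketch *dev, unsigned int port)
{
	assert(dev != NULL);
	return (port >= dev->first_port && port <= dev->last_port);
}

/* Reject an invalid port before any driver-specific code would run. */
static int
query_pkey_sketch(const struct dev_sketch *dev, unsigned int port)
{
	if (!port_is_valid(dev, port))
		return (-EINVAL);
	/* A real implementation would call the driver's query function here. */
	return (0);
}
```

Keeping the range check in one inline helper means every verb applies the same rule, and drivers never see a port number outside their range.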
* ibcore: Protect against concurrent access to hardware stats. (Hans Petter Selasky, 2021-07-12, 1 file, -6/+28)
| | | | | | | | | | | | | | Currently access to hardware stats buffer isn't protected, this can result in multiple writes and reads at the same time to the same memory location. This can lead to providing an incorrect value to the user. Add a mutex to protect against it. Linux commit: e945130b52bea65d15f9bdf54949d4cb7a88db7f MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
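The idea of guarding a shared stats buffer with a mutex can be shown with a small user-space sketch using POSIX threads. It is an analogy only: the kernel change uses the kernel's own mutex primitives, and `struct hw_stats_sketch`, `stats_update()` and `stats_read()` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define STATS_COUNT 4

struct hw_stats_sketch {
	pthread_mutex_t lock;		/* serializes access to value[] */
	uint64_t value[STATS_COUNT];
};

/* Writer: update one counter while holding the lock. */
static void
stats_update(struct hw_stats_sketch *s, int idx, uint64_t v)
{
	assert(idx >= 0 && idx < STATS_COUNT);
	pthread_mutex_lock(&s->lock);
	s->value[idx] = v;
	pthread_mutex_unlock(&s->lock);
}

/* Reader: copy one counter out while holding the same lock. */
static uint64_t
stats_read(struct hw_stats_sketch *s, int idx)
{
	uint64_t v;

	assert(idx >= 0 && idx < STATS_COUNT);
	pthread_mutex_lock(&s->lock);
	v = s->value[idx];
	pthread_mutex_unlock(&s->lock);
	return (v);
}
```

Because readers and writers take the same lock, a reader cannot observe the buffer in the middle of an update.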
* ibcore: Do not expose unsupported counters. (Hans Petter Selasky, 2021-07-12, 1 file, -7/+12)
| | | | | | | | | | | | | | If the provider driver (such as rdma_rxe) doesn't support PMA counters, avoid exposing its directory similar to optional hw_counters directory. If core fails to read the PMA counter, return an error so that user can retry later if needed. Linux commit: 0f6ef65d1c6ec8deb5d0f11f86631ec4cfe8f22e MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Introduce ib_port_phys_state enum. (Hans Petter Selasky, 2021-07-12, 1 file, -10/+20)
| | | | | | | | | | | | In order to improve readability, add ib_port_phys_state enum to replace the use of magic numbers. Linux commit: 72a7720fca37fec0daf295923f17ac5d88a613e1 MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Fix unable to change lifespan entry for hw_counters. (Hans Petter Selasky, 2021-07-12, 1 file, -1/+15)
| | | | | | | | | | | | | | | | | | | | | This patch fixes the case where the 'lifespan' entry of hw_counters is not writable. Currently the write callback is not exposed for the hw_counters sysfs operation. Because of this, modifying the lifespan value results in a permission denied error, as in the example below. echo 10 > /sys/class/infiniband/mlx5_0/ports/1/hw_counters/lifespan -bash: /sys/class/infiniband/mlx5_0/ports/1/hw_counters/lifespan: Permission denied This patch adds the hook to modify any attribute which implements the store() operation. Linux commit: 79c4d80b43b8e43684894574a508a871f0c196bf MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Issue DREQ when receiving REQ/REP for stale QP. (Hans Petter Selasky, 2021-07-12, 1 file, -1/+23)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | From "InfiniBand Architecture Specifications Volume 1": A QP is said to have a stale connection when only one side has connection information. A stale connection may result if the remote CM had dropped the connection and sent a DREQ but the DREQ was never received by the local CM. Alternatively the remote CM may have lost all record of past connections because its node crashed and rebooted, while the local CM did not become aware of the remote node's reboot and therefore did not clean up stale connections. And: A local CM may receive a REQ/REP for a stale connection. It shall abort the connection issuing REJ to the REQ/REP. It shall then issue DREQ with "DREQ:remote QPN" set to the remote QPN from the REQ/REP. This patch solves a problem with reuse of QPN. Current codebase, that is IPoIB, relies on a REAP-mechanism to do cleanup of the structures in CM. A problem with this is the time constants governing this mechanism; they are up to 768 seconds and the interface may look unresponsive in that period. Issuing a DREQ (and receiving a DREP) does the necessary cleanup and the interface comes up. Linux commit: 9315bc9a133011fdb084f2626b86db3ebb64661f MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Fix memory leak in cm_req_handler error flows.Hans Petter Selasky2021-07-121-2/+3
| | | | | | | | | | | | In the cm_req_handler() error flows, cm_id_priv->timewait_info is sometimes not freed. Linux commit: 8b00914654ef56ff5473f4fe1f1168254dbb8a17 MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Move debug counters to be under relevant IB deviceHans Petter Selasky2021-07-123-38/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The sysfs layout created by CM incorrectly presented RDMA devices with an InfiniBand link layer. The layout of such devices represents the device tree of connections. By moving CM statistics to be under relevant port of IB device, we will fix the following issues: * Symlink name - It used device name instead of specific identifier. * Target location - It was supposed to point to PCI-ID/infiniband_cm/ instead of PCI-ID/infiniband/ * Target name - It created extra device file under already existing device folder, e.g. mlx5_0/mlx5_0 * Crash during boot with RDMA persistent naming patches. sysfs: cannot create duplicate filename '/class/infiniband_cm/mlx5_0' CPU: 29 PID: 433 Comm: modprobe Not tainted 5.0.0-rc5+ #178 Call Trace: dump_stack+0xcc/0x180 sysfs_warn_dup.cold.3+0x17/0x2d sysfs_do_create_link_sd.isra.2+0xd0/0xf0 device_add+0x7cb/0x1450 device_create_groups_vargs+0x1ae/0x220 device_create+0x93/0xc0 cm_add_one+0x38f/0xf60 [ib_cm] add_client_context+0x167/0x210 [ib_core] enable_device_and_get+0x230/0x3f0 [ib_core] ib_register_device+0x823/0xbf0 [ib_core] __mlx5_ib_add+0x45/0x150 [mlx5_ib] mlx5_ib_add+0x1b3/0x5e0 [mlx5_ib] mlx5_add_device+0x130/0x3a0 [mlx5_core] mlx5_register_interface+0x1a9/0x270 [mlx5_core] do_one_initcall+0x14f/0x5de do_init_module+0x247/0x7c0 load_module+0x4c2f/0x60d0 entry_SYSCALL_64_after_hwframe+0x49/0xbe After this change: [leonro@server ~]$ ls -al /sys/class/infiniband/ibp0s12f0/ports/1/ drwxr-xr-x 2 root root 0 Mar 11 11:17 cm_rx_duplicates drwxr-xr-x 2 root root 0 Mar 11 11:17 cm_rx_msgs drwxr-xr-x 2 root root 0 Mar 11 11:17 cm_tx_msgs drwxr-xr-x 2 root root 0 Mar 11 11:17 cm_tx_retries Linux commit: c87e65cfb97c7f325132a68288ed76ba7bdcd2c6 MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Fix memory leak in cm_add/remove_one.Hans Petter Selasky2021-07-121-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | In the process of moving the debug counters sysfs entries, the commit mentioned below eliminated the cm_infiniband sysfs directory. This sysfs directory was tied to the cm_port object allocated in procedure cm_add_one(). Before the commit below, this cm_port object was freed via a call to kobject_put(port->kobj) in procedure cm_remove_port_fs(). Since port no longer uses its kobj, kobject_put(port->kobj) was eliminated. This, however, meant that kfree was never called for the cm_port buffers. Fix this by adding explicit kfree(port) calls to functions cm_add_one() and cm_remove_one(). Note that the kfree call in the first chunk below, in the cm_add_one error flow, fixes an old, undetected memory leak. Linux commit: 94635c36f3854934a46d9e812e028d4721bbb0e6 MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking
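The ownership change described above (the port buffer is no longer released through a kobject, so it has to be freed explicitly in both the error flow and the removal path) can be modelled in user space. This is a minimal sketch, not the code from the commit: `demo_add_one()`, `demo_remove_one()`, the `fail_at` parameter and the `demo_live_ports` counter are all hypothetical, and `calloc()`/`free()` stand in for the kernel allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_MAX_PORTS 4

struct demo_port {
	int port_num;
};

struct demo_dev {
	struct demo_port *port[DEMO_MAX_PORTS];
};

int demo_live_ports;	/* outstanding allocations, used to spot leaks */

/*
 * Allocate one object per port.  fail_at simulates a setup failure on
 * that port number (0 means no failure).
 */
int
demo_add_one(struct demo_dev *dev, int nports, int fail_at)
{
	struct demo_port *port;
	int i;

	assert(dev != NULL && nports <= DEMO_MAX_PORTS);
	for (i = 1; i <= nports; i++) {
		port = calloc(1, sizeof(*port));
		if (port == NULL)
			goto unwind;
		demo_live_ports++;
		port->port_num = i;
		if (i == fail_at) {
			/* Not yet stored in dev->port[], so free it here. */
			free(port);
			demo_live_ports--;
			goto unwind;
		}
		dev->port[i - 1] = port;
	}
	return (0);

unwind:
	/* Explicitly free every port that was set up before the failure. */
	while (--i > 0) {
		free(dev->port[i - 1]);
		dev->port[i - 1] = NULL;
		demo_live_ports--;
	}
	return (-1);
}

/* Removal path: nothing else releases the buffers, so free them here. */
void
demo_remove_one(struct demo_dev *dev, int nports)
{
	int i;

	assert(dev != NULL && nports <= DEMO_MAX_PORTS);
	for (i = 1; i <= nports; i++) {
		if (dev->port[i - 1] == NULL)
			continue;
		free(dev->port[i - 1]);
		dev->port[i - 1] = NULL;
		demo_live_ports--;
	}
}
```

Checking that `demo_live_ports` returns to zero after a failed add, and again after a successful add followed by a remove, is the user-space analogue of confirming that no port buffer leaks.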
* ibcore: Block processing of alternate path handling in RoCE RX CM messages.Hans Petter Selasky2021-07-121-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | For the reasons below, it is better not to support alternate path receive messages for RoCE in the near term. 1. Alternate path for RoCE is not supported at the rdmacm layer. 2. It is not supported in the uverbs/core layer for RoCE. 3. Alternate path for IPv6 with a link-local address cannot resolve the route deterministically without a valid incoming interface ID, whose use case makes sense only in dual port mode. 4. init_av_from_path, while processing LAP messages for IB and RoCE, can add a duplicate AV entry to the port list, leading to list corruption. 5. rdma-core, a well-known userspace implementation, has removed support for libucm, which uses the ucm.ko module, the only module that can trigger alternate path related messages. 6. The ucm kernel module is requested to be removed from the IB core in the following patch, https://patchwork.kernel.org/patch/10268503/ . Linux commit: 97c45c2c28cd291e06778d9d36a0f60ee74726bc MFC after: 1 week Reviewed by: kib Sponsored by: Mellanox Technologies // NVIDIA Networking