path: root/sys/ofed/drivers
Commit message (Collapse)AuthorAgeFilesLines
...
* ibcore: Simplify ib_modify_qp_is_ok().Hans Petter Selasky2021-07-121-12/+7
All callers of ib_modify_qp_is_ok() provide a valid enum ib_qp_state, which makes the out-of-range checks redundant. Remove them and update the function signature to return a boolean result. While at it, remove the unused "ll" parameter from ib_modify_qp_is_ok().
Linux commit: 19b1f54099b6ee334acbfbcfbdffd1d1f057216d d31131bba5a1630304c55ea775c48cc84912ab59
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Support rate limit for packet pacingHans Petter Selasky2021-07-121-0/+2
Add a new member, rate_limit, to ib_qp_attr. It holds the packet pacing rate in kbps; 0 means unlimited. IB_QP_RATE_LIMIT is added to ib_attr_mask and can be used by RAW QPs when changing the QP state from RTR to RTS or from RTS to RTS.
Linux commit: 528e5a1bd3f0e9b760cb3a1062fce7513712a15d
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
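The transition rule described above can be sketched as follows. This is a minimal model, not the ibcore code: every `sk_` name, the state enum, and the mask bit values are hypothetical stand-ins for the real definitions in ib_verbs.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real types and mask bits live in ib_verbs.h. */
enum sk_qp_state { SK_QPS_RESET, SK_QPS_INIT, SK_QPS_RTR, SK_QPS_RTS };

#define SK_QP_STATE      (1u << 0)
#define SK_QP_RATE_LIMIT (1u << 1)  /* illustrative bit, not the real value */

struct sk_qp_attr {
	enum sk_qp_state qp_state;  /* requested next state */
	uint32_t rate_limit;        /* packet pacing rate in kbps, 0 = unlimited */
};

/* The rate limit may only accompany RTR->RTS and RTS->RTS transitions. */
static bool
sk_rate_limit_allowed(enum sk_qp_state cur, enum sk_qp_state next)
{
	return (next == SK_QPS_RTS &&
	    (cur == SK_QPS_RTR || cur == SK_QPS_RTS));
}

/* Returns 0 if the mask is acceptable for this transition, -1 otherwise. */
static int
sk_check_modify(enum sk_qp_state cur, const struct sk_qp_attr *attr,
    uint32_t mask)
{
	assert(attr != NULL);
	if ((mask & SK_QP_RATE_LIMIT) != 0 &&
	    !sk_rate_limit_allowed(cur, attr->qp_state))
		return (-1);
	return (0);
}
```

A caller would set `rate_limit`, request `SK_QPS_RTS`, and pass both mask bits; the same mask on an INIT to RTS request is rejected in this model.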
* ibcore: Add new IB rates.Hans Petter Selasky2021-07-121-20/+28
Add the new rates that were added to the InfiniBand spec as part of HDR and 2x support.
Linux commit: a5a5d1993696419e7d5357fc3128e53d219d382e
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Don't allocate method table, if already present.Hans Petter Selasky2021-07-121-2/+5
This commit aligns the code in question with upstream Linux.
Linux commit: 2468b82d69e3a53d024f28d79ba0fdb8bf43dfbf
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Fix a use-after-free in ucma_resolve_ip().Hans Petter Selasky2021-07-121-0/+2
There is a race condition between ucma_close() and ucma_resolve_ip():

  CPU0                                CPU1
  ucma_resolve_ip():                  ucma_close():

  ctx = ucma_get_ctx(file, cmd.id);

                                      list_for_each_entry_safe(ctx, tmp, &file->ctx_list, list) {
                                        mutex_lock(&mut);
                                        idr_remove(&ctx_idr, ctx->id);
                                        mutex_unlock(&mut);
                                        ...
                                        mutex_lock(&mut);
                                        if (!ctx->closing) {
                                          mutex_unlock(&mut);
                                          rdma_destroy_id(ctx->cm_id);
                                        ...
                                        ucma_free_ctx(ctx);
                                      }

  ret = rdma_resolve_addr();
  ucma_put_ctx(ctx);

Before idr_remove(), ucma_get_ctx() could still find the ctx, and after rdma_destroy_id(), rdma_resolve_addr() may still access the id_priv pointer. Also, ucma_put_ctx() may use ctx after ucma_free_ctx(). ucma_close() should call ucma_put_ctx() as well, which tests the refcnt and waits for the last holder to release it. A similar pattern is already used by ucma_destroy_id().
Linux commit: 5fe23f262e0548ca7f19fb79f89059a60d087d22
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
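The reference-counting idea behind the fix can be modeled in a few lines. This is a simplified, single-threaded sketch under stated assumptions: all `sk_` names are hypothetical, the `freed` flag stands in for the real free, and the real ucma code waits on a completion in the close path rather than freeing from the put path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a context shared by a resolver and the close path. */
struct sk_ctx {
	atomic_int ref;   /* starts at 1: the reference owned by the file */
	bool closing;     /* set once close has begun */
	bool freed;       /* stands in for the real free */
};

static void
sk_ctx_init(struct sk_ctx *c)
{
	atomic_init(&c->ref, 1);
	c->closing = false;
	c->freed = false;
}

/* Take a reference; refuse once the context is being closed. */
static bool
sk_get(struct sk_ctx *c)
{
	if (c->closing)
		return (false);
	atomic_fetch_add(&c->ref, 1);
	return (true);
}

/* Drop a reference; only the last holder releases the context. */
static void
sk_put(struct sk_ctx *c)
{
	int old = atomic_fetch_sub(&c->ref, 1);

	assert(old > 0);
	if (old == 1)
		c->freed = true;
}

/* Close drops the file's reference instead of freeing unconditionally. */
static void
sk_close(struct sk_ctx *c)
{
	c->closing = true;
	sk_put(c);
}
```

With this shape, a resolver that took its reference before close began keeps the context alive until its own `sk_put()`.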
* ibcore: Define option to set ack timeout.Hans Petter Selasky2021-07-122-0/+41
Define a new option in 'rdma_set_option' to override the calculated QP timeout when requested to provide QP attributes to modify a QP. At the same time, pack tos_set into a bitfield.
Linux commit: 2c1619edef61a03cb516efaa81750784c3071d10
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Do not overreact to SM LID change event.Hans Petter Selasky2021-07-121-1/+0
When IPoIB receives an SM LID change event, it reacts by flushing its path record cache and rejoining multicast groups. This is the same behavior it performs when it receives a reregistration event. This behavior is unnecessary as an SM may have database backup or synchronization mechanisms which permit the SM location or LID to change without loss of multicast membership and without impact to path records.

Both opensm and the OPA FM issue reregistration events if a new SM is started (or restarted with a new config) or an SM event occurs which results in loss of multicast membership records by the SM (such as opensm failover) or the SM encounters new nodes with Active ports (such as after joining 2 fabrics by connecting switches via ISLs). Hence this event can be depended on as the trigger for IPoIB cache and multicast flushing.

It appears that some drivers, such as qib and hfi1, issue the IB_EVENT_SM_CHANGE but other drivers such as mlx4 and mlx5 do not. Empirical testing on Mellanox EDR using ibv_asyncwatch has confirmed that Mellanox EDR HCAs do not generate SM change events and that opensm does generate reregistration.

An SM LID change event is generated by the mentioned drivers to reflect that sm_lid and/or sm_sl in the local port info has changed. The intent of this event is to permit applications and ULPs which have a local copy of this information (or an address handle using it) to update their information.

The intent is that the reregistration event (caused by the SM via a bit in Set(PortInfo)) be used to inform nodes that they need to rejoin multicast groups, resubscribe for notices and potentially update path records.

When an SM migrates or fails over, an SM LID change event can occur. In response IPoIB discards path records and multicast membership and loses connectivity until these records are restored via SA requests. In very large fabrics, it may take minutes for the SM to be ready and for the SA responses to be supplied. This can result in undesirable and unnecessary IPoIB connectivity impacts. It also can result in an unnecessary storm of SA queries from all nodes in a cluster, potentially followed by yet another storm if the SM issues the reregistration request.

The fact that the Mellanox HCAs do not even generate this event is further evidence that on modern IB fabrics there will be no ill side effects from the proposed changes below to reduce the reaction by 3 kernel components to this event. So these changes should be benign for Mellanox IB fabrics and will benefit OPA fabrics while also making ib_core and ULP behavior "correct" as intended by the IBTA spec and kernel RDMA event APIs.

Address these issues by removing IB_EVENT_SM_CHANGE handling from ipoib. IPoIB does not locally store sm_lid nor sm_sl, so it does not need to do anything on SM LID change. IPoIB makes use of other ib_core components to issue SA requests for it and those components correctly track SM LID and SM LID changes.

Also in ib_core multicast handling, remove the test for IB_EVENT_SM_CHANGE. This code is moving all multicast groups to the error state, which will trigger rejoins. This code is used by IPoIB as well as the connection manager and other clients of multicast groups. This kernel module centralizes group membership status and joins since a node can only join a given group once but multiple ULPs or applications may want to join the same group. It makes use of the sa_query.c component in ib_core, which correctly tracks SM LID and SL. This component does not track SM LID nor SL itself and hence need not react to their changes.

Similarly, in the ib_core cache code, remove the handling for IB_EVENT_SM_CHANGE. The ib_cache_update function, which is ultimately called, updates local copies of the pkey table, gid table and lmc. It does not update nor retain sm_lid nor sm_sl. As such it does not need to be called on an SM LID change. It technically also does not need to be called on a reregistration. The LID_CHANGE, PKEY_CHANGE, GID_CHANGE and port state change events (PORT_ERR, PORT_ACTIVE) should be sufficient triggers.

It is worth noting that the alternative of simply having the hfi1 and qib drivers not generate the SM LID change event was explored. While this would duplicate what Mellanox drivers do now, it is not the correct behavior and removes the ability for an SM to migrate without requiring reregistration. Since both opensm and OPA SM have mechanisms to back up or synchronize registration information, it is desirable to let them perform SM migrations (with LID or SL changes) without requiring reregistration when they deem it appropriate.
Linux commit: ba7d8117f3cca8eb70d579fde3f9ec8cd6a28f39
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Remove debug prints after allocation failure.Hans Petter Selasky2021-07-121-33/+7
The prints after [k|v][m|z|c]alloc() calls are not needed, because on failure the allocator prints its own error messages anyway.
Linux commit: 2716243212241855cd9070883779f6e58967dec5
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Fix use-after-free in IB mad completion handling.Hans Petter Selasky2021-07-121-13/+11
We encountered a use-after-free bug when unloading the driver:

  BUG: KASAN: use-after-free in ib_mad_post_receive_mads+0xddc/0xed0 [ib_core]
  Read of size 4 at addr ffff8882ca5aa868 by task kworker/u13:2/23862
  Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
  Call Trace:
    dump_stack+0x9a/0xeb
    print_address_description+0xe3/0x2e0
    ib_mad_post_receive_mads+0xddc/0xed0 [ib_core]
    __kasan_report+0x15c/0x1df
    ib_mad_post_receive_mads+0xddc/0xed0 [ib_core]
    kasan_report+0xe/0x20
    ib_mad_post_receive_mads+0xddc/0xed0 [ib_core]
    find_mad_agent+0xa00/0xa00 [ib_core]
    qlist_free_all+0x51/0xb0
    mlx4_ib_sqp_comp_worker+0x1970/0x1970 [mlx4_ib]
    quarantine_reduce+0x1fa/0x270
    kasan_unpoison_shadow+0x30/0x40
    ib_mad_recv_done+0xdf6/0x3000 [ib_core]
    _raw_spin_unlock_irqrestore+0x46/0x70
    ib_mad_send_done+0x1810/0x1810 [ib_core]
    mlx4_ib_destroy_cq+0x2a0/0x2a0 [mlx4_ib]
    _raw_spin_unlock_irqrestore+0x46/0x70
    debug_object_deactivate+0x2b9/0x4a0
    __ib_process_cq+0xe2/0x1d0 [ib_core]
    ib_cq_poll_work+0x45/0xf0 [ib_core]
    process_one_work+0x90c/0x1860
    pwq_dec_nr_in_flight+0x320/0x320
    worker_thread+0x87/0xbb0
    __kthread_parkme+0xb6/0x180
    process_one_work+0x1860/0x1860
    kthread+0x320/0x3e0
    kthread_park+0x120/0x120
    ret_from_fork+0x24/0x30
  ...
  Freed by task 31682:
    save_stack+0x19/0x80
    __kasan_slab_free+0x11d/0x160
    kfree+0xf5/0x2f0
    ib_mad_port_close+0x200/0x380 [ib_core]
    ib_mad_remove_device+0xf0/0x230 [ib_core]
    remove_client_context+0xa6/0xe0 [ib_core]
    disable_device+0x14e/0x260 [ib_core]
    __ib_unregister_device+0x79/0x150 [ib_core]
    ib_unregister_device+0x21/0x30 [ib_core]
    mlx4_ib_remove+0x162/0x690 [mlx4_ib]
    mlx4_remove_device+0x204/0x2c0 [mlx4_core]
    mlx4_unregister_interface+0x49/0x1d0 [mlx4_core]
    mlx4_ib_cleanup+0xc/0x1d [mlx4_ib]
    __x64_sys_delete_module+0x2d2/0x400
    do_syscall_64+0x95/0x470
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

The problem was that the MAD PD was deallocated before the MAD CQ. There was completion work pending for the CQ when the PD got deallocated. When the mad completion handling reached procedure ib_mad_post_receive_mads(), we got a use-after-free bug in the following line of code in that procedure:

  sg_list.lkey = qp_info->port_priv->pd->local_dma_lkey;

(the pd pointer in the above line is no longer valid, because the pd has been deallocated).

We fix this by allocating the PD before the CQ in procedure ib_mad_port_open(), and deallocating the PD after freeing the CQ in procedure ib_mad_port_close(). Since the CQ completion work queue is flushed during ib_free_cq(), no completions will be pending for that CQ when the PD is later deallocated. Note that freeing the CQ before deallocating the PD is the practice in the ULPs.
Linux commit: 770b7d96cfff6a8bf6c9f261ba6f135dc9edf484
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
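The ordering rule in the fix (allocate the PD first, free the CQ first) can be illustrated with a toy model. Everything here is hypothetical: the `sk_` names, the boolean liveness flags, and the `pending` counter merely stand in for the real PD, CQ, and queued completion work.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical port state: a PD, a CQ, and completions queued on the CQ. */
struct sk_port {
	bool pd_alive;
	bool cq_alive;
	int pending;
};

/* Open: the PD is created before the CQ. */
static void
sk_port_open(struct sk_port *p)
{
	p->pd_alive = true;
	p->cq_alive = true;
	p->pending = 0;
}

/* A completion handler dereferences the PD, so the PD must still exist. */
static void
sk_complete(struct sk_port *p)
{
	assert(p->pd_alive);
	p->pending--;
}

/* Freeing the CQ flushes all pending completion work first. */
static void
sk_free_cq(struct sk_port *p)
{
	while (p->pending > 0)
		sk_complete(p);
	p->cq_alive = false;
}

/* The PD may only go away once no CQ can still run handlers against it. */
static void
sk_dealloc_pd(struct sk_port *p)
{
	assert(!p->cq_alive);
	p->pd_alive = false;
}

/* Close: tear down in reverse order of creation. */
static void
sk_port_close(struct sk_port *p)
{
	sk_free_cq(p);
	sk_dealloc_pd(p);
}
```

Swapping the two calls in `sk_port_close()` trips the assertion in `sk_dealloc_pd()`, which is the model's analogue of the KASAN report.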
* ibcore: Fail early if unsupported QP is provided.Hans Petter Selasky2021-07-121-0/+4
When the requested QP type is not supported for a {device, port}, return the error right away, before validating all parameters at mad agent registration time.
Linux commit: 798bba01b44b0ddf8cd6e542635b37cc9a9b739c
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Use inline function to validate portHans Petter Selasky2021-07-123-17/+15
Linux commit: 24dc831b77eca9361cf835be59fa69ea0e471afc
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Validate port number in query_pkey verb.Hans Petter Selasky2021-07-121-0/+3
Before calling the driver's function, make sure the port is valid.
Linux commit: 9af3f5cf9d64a056eca53bc643f6288ad28bbbb5
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
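A port check of this kind might look like the sketch below. It assumes the usual convention that a switch exposes only port 0 while other devices number their ports from 1 to the physical port count; the `sk_` names and the device structure are hypothetical, not the ibcore definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device description. */
struct sk_device {
	bool is_switch;              /* assumed: switches expose only port 0 */
	unsigned int phys_port_cnt;  /* number of physical ports otherwise */
};

static unsigned int
sk_start_port(const struct sk_device *d)
{
	return (d->is_switch ? 0 : 1);
}

static unsigned int
sk_end_port(const struct sk_device *d)
{
	return (d->is_switch ? 0 : d->phys_port_cnt);
}

/* Inline-style helper shared by every verb that takes a port number. */
static bool
sk_port_valid(const struct sk_device *d, unsigned int port)
{
	return (port >= sk_start_port(d) && port <= sk_end_port(d));
}

/* Validate the port before handing the request to the driver. */
static int
sk_query_pkey(const struct sk_device *d, unsigned int port)
{
	assert(d != NULL);
	if (!sk_port_valid(d, port))
		return (-EINVAL);
	return (0);  /* the driver callback would run here */
}
```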
* ibcore: Protect against concurrent access to hardware stats.Hans Petter Selasky2021-07-121-6/+28
Currently, access to the hardware stats buffer isn't protected. This can result in multiple writes and reads hitting the same memory location at the same time, which can lead to an incorrect value being reported to the user. Add a mutex to protect against it.
Linux commit: e945130b52bea65d15f9bdf54949d4cb7a88db7f
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Do not expose unsupported counters.Hans Petter Selasky2021-07-121-7/+12
If the provider driver (such as rdma_rxe) doesn't support PMA counters, avoid exposing their directory, similar to the optional hw_counters directory. If the core fails to read a PMA counter, return an error so that the user can retry later if needed.
Linux commit: 0f6ef65d1c6ec8deb5d0f11f86631ec4cfe8f22e
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Introduce ib_port_phys_state enum.Hans Petter Selasky2021-07-121-10/+20
In order to improve readability, add ib_port_phys_state enum to replace the use of magic numbers.
Linux commit: 72a7720fca37fec0daf295923f17ac5d88a613e1
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Fix unable to change lifespan entry for hw_counters.Hans Petter Selasky2021-07-121-1/+15
This patch fixes the case where the 'lifespan' entry of the hw_counters is not writable. Currently the write callback is not exposed for the hw_counters sysfs operation. Because of this, modifying the lifespan value results in a permission denied error, as in the example below:

  echo 10 > /sys/class/infiniband/mlx5_0/ports/1/hw_counters/lifespan
  -bash: /sys/class/infiniband/mlx5_0/ports/1/hw_counters/lifespan: Permission denied

This patch adds the hook to modify any attribute which implements the store() operation.
Linux commit: 79c4d80b43b8e43684894574a508a871f0c196bf
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Issue DREQ when receiving REQ/REP for stale QP.Hans Petter Selasky2021-07-121-1/+23
From "InfiniBand Architecture Specifications Volume 1":

  A QP is said to have a stale connection when only one side has connection information. A stale connection may result if the remote CM had dropped the connection and sent a DREQ but the DREQ was never received by the local CM. Alternatively the remote CM may have lost all record of past connections because its node crashed and rebooted, while the local CM did not become aware of the remote node's reboot and therefore did not clean up stale connections.

And:

  A local CM may receive a REQ/REP for a stale connection. It shall abort the connection issuing REJ to the REQ/REP. It shall then issue DREQ with "DREQ:remote QPN" set to the remote QPN from the REQ/REP.

This patch solves a problem with reuse of QPN. The current codebase, that is IPoIB, relies on a REAP mechanism to clean up the structures in CM. A problem with this is the time constants governing this mechanism; they are up to 768 seconds and the interface may look unresponsive in that period. Issuing a DREQ (and receiving a DREP) does the necessary cleanup and the interface comes up.
Linux commit: 9315bc9a133011fdb084f2626b86db3ebb64661f
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Fix memory leak in cm_req_handler error flows.Hans Petter Selasky2021-07-121-2/+3
In the cm_req_handler() error flows, cm_id_priv->timewait_info is sometimes not freed.
Linux commit: 8b00914654ef56ff5473f4fe1f1168254dbb8a17
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Move debug counters to be under relevant IB deviceHans Petter Selasky2021-07-123-38/+58
The sysfs layout created by CM incorrectly presented RDMA devices with an InfiniBand link layer. The layout of such devices represents a device tree of connections.

By moving the CM statistics under the relevant port of the IB device, we fix the following issues:
  * Symlink name - It used device name instead of specific identifier.
  * Target location - It was supposed to point to PCI-ID/infiniband_cm/ instead of PCI-ID/infiniband/
  * Target name - It created extra device file under already existing device folder, e.g. mlx5_0/mlx5_0
  * Crash during boot with RDMA persistent naming patches.

  sysfs: cannot create duplicate filename '/class/infiniband_cm/mlx5_0'
  CPU: 29 PID: 433 Comm: modprobe Not tainted 5.0.0-rc5+ #178
  Call Trace:
    dump_stack+0xcc/0x180
    sysfs_warn_dup.cold.3+0x17/0x2d
    sysfs_do_create_link_sd.isra.2+0xd0/0xf0
    device_add+0x7cb/0x1450
    device_create_groups_vargs+0x1ae/0x220
    device_create+0x93/0xc0
    cm_add_one+0x38f/0xf60 [ib_cm]
    add_client_context+0x167/0x210 [ib_core]
    enable_device_and_get+0x230/0x3f0 [ib_core]
    ib_register_device+0x823/0xbf0 [ib_core]
    __mlx5_ib_add+0x45/0x150 [mlx5_ib]
    mlx5_ib_add+0x1b3/0x5e0 [mlx5_ib]
    mlx5_add_device+0x130/0x3a0 [mlx5_core]
    mlx5_register_interface+0x1a9/0x270 [mlx5_core]
    do_one_initcall+0x14f/0x5de
    do_init_module+0x247/0x7c0
    load_module+0x4c2f/0x60d0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

After this change:

  [leonro@server ~]$ ls -al /sys/class/infiniband/ibp0s12f0/ports/1/
  drwxr-xr-x 2 root root 0 Mar 11 11:17 cm_rx_duplicates
  drwxr-xr-x 2 root root 0 Mar 11 11:17 cm_rx_msgs
  drwxr-xr-x 2 root root 0 Mar 11 11:17 cm_tx_msgs
  drwxr-xr-x 2 root root 0 Mar 11 11:17 cm_tx_retries
Linux commit: c87e65cfb97c7f325132a68288ed76ba7bdcd2c6
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Fix memory leak in cm_add/remove_one.Hans Petter Selasky2021-07-121-0/+3
In the process of moving the debug counters sysfs entries, the commit mentioned below eliminated the cm_infiniband sysfs directory. This sysfs directory was tied to the cm_port object allocated in procedure cm_add_one(). Before the commit below, this cm_port object was freed via a call to kobject_put(port->kobj) in procedure cm_remove_port_fs(). Since port no longer uses its kobj, kobject_put(port->kobj) was eliminated. This, however, meant that kfree was never called for the cm_port buffers. Fix this by adding explicit kfree(port) calls to functions cm_add_one() and cm_remove_one(). Note that the kfree call in the first chunk below, in the cm_add_one error flow, fixes an old, undetected memory leak.
Linux commit: 94635c36f3854934a46d9e812e028d4721bbb0e6
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Block processing of alternate path handling in RoCE RX CM messages.Hans Petter Selasky2021-07-121-0/+14
For the reasons below, it is better not to support alternate path receive messages for RoCE in the near term.
  1. Alternate path for RoCE is not supported at the rdmacm layer.
  2. It is not supported in the uverbs/core layer for RoCE.
  3. Alternate path for IPv6 with a link local address cannot resolve the route deterministically without a valid incoming interface ID, whose use case makes sense only in dual port mode.
  4. init_av_from_path, while processing LAP messages for IB and RoCE, can add a duplicate AV entry to the port list, which leads to list corruption.
  5. rdma-core, a well known userspace implementation, has removed support for libucm, which uses the ucm.ko module, the only module that can trigger alternate path related messages.
  6. The ucm kernel module is requested to be removed from the IB core in the following patch: https://patchwork.kernel.org/patch/10268503/
Linux commit: 97c45c2c28cd291e06778d9d36a0f60ee74726bc
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
* ibcore: Store and restore ah_attr during LAP msg processing.Hans Petter Selasky2021-07-121-3/+29
During CM LAP processing, ah_attr is reinitialized on receiving a LAP request; it was most likely first initialized during CM request processing. ah_attr might get zeroed out if LAP processing fails. Therefore, try to create a new ah_attr for the LAP message. If the initialization fails, continue with the older ah_attr. If the initialization succeeds, adopt the new ah_attr by overwriting the older one.
Linux commit: 0e225dcb7681c0a8e52fb9dc68bd8ab973de4ca2
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
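The store-and-restore approach amounts to building the new attribute in a temporary and committing it only on success. The following is a minimal sketch; the `sk_` names, the two-field structure, and the failure condition of the initializer are hypothetical and far simpler than the real ah_attr handling.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, heavily reduced address handle attribute. */
struct sk_ah_attr {
	int dlid;
	int sl;
};

/* Hypothetical initializer: fails on an invalid destination LID. */
static int
sk_init_ah_attr(struct sk_ah_attr *out, int dlid, int sl)
{
	if (dlid <= 0)
		return (-EINVAL);
	out->dlid = dlid;
	out->sl = sl;
	return (0);
}

/*
 * Build the new attribute in a temporary. On failure the current
 * attribute is left untouched; on success it is overwritten.
 */
static int
sk_lap_update(struct sk_ah_attr *cur, int dlid, int sl)
{
	struct sk_ah_attr tmp;
	int ret;

	assert(cur != NULL);
	ret = sk_init_ah_attr(&tmp, dlid, sl);
	if (ret != 0)
		return (ret);  /* keep the older attribute */
	*cur = tmp;
	return (0);
}
```

The point of the temporary is that a failed LAP request can never leave the live attribute zeroed or half-written.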
* ibcore: Add rdma_reject_msg() helper function.Hans Petter Selasky2021-07-123-0/+83
rdma_reject_msg() returns a pointer to a string message associated with the transport reject reason codes.
Linux commit: 77a5db13153906a7e00740b10b2730e53385c5a8
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
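A helper of this shape is typically a bounds-checked table lookup. The sketch below shows the general pattern only: the reason codes, their numeric values, and the strings are invented for illustration and are not the ones used by rdma_reject_msg().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reason codes; the values are chosen only for illustration. */
enum sk_rej_reason {
	SK_REJ_NO_QP = 1,
	SK_REJ_STALE_CONN = 3,
	SK_REJ_TIMEOUT = 4
};

/* Sparse table indexed by reason code; unset slots stay NULL. */
static const char *const sk_rej_strs[] = {
	[SK_REJ_NO_QP]      = "no QP",
	[SK_REJ_STALE_CONN] = "stale connection",
	[SK_REJ_TIMEOUT]    = "timeout"
};

/* Map a reason code to a message, with a fallback for unknown codes. */
static const char *
sk_reject_msg(int reason)
{
	size_t n = sizeof(sk_rej_strs) / sizeof(sk_rej_strs[0]);

	assert(n > 0);
	if (reason >= 0 && (size_t)reason < n && sk_rej_strs[reason] != NULL)
		return (sk_rej_strs[reason]);
	return ("unrecognized reason");
}
```

Returning a fixed fallback string rather than NULL lets callers pass the result straight to a log statement.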
* ibcore: Remove unused and erroneous msg sequence encoding.Hans Petter Selasky2021-07-122-15/+6
In cm_form_tid(), a two-bit message sequence number is OR'ed into bits 31-30 of the lower TID value. After Linux commit f06d26537559 ("IB/cm: Randomize starting comm ID"), the local_id is XOR'ed with a 32-bit random value. Hence, bits 31-30 in the lower TID now have an arbitrary value and it makes no sense to OR in the message sequence number. Adding to that, the evolution in the use of IDR routines in cm_alloc_id() has always had the possibility of returning a value with bit 30 set. In addition, said bits are never checked. Hence, remove the encoding and the corresponding enum.
Linux commit: 87a37ce9e400e40daee537ff95343e3c94743c6d
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
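The bit arithmetic behind this argument can be checked directly. The sketch below uses hypothetical `sk_` helpers that mirror only the OR and XOR steps described above, not the full cm_form_tid() logic.

```c
#include <assert.h>
#include <stdint.h>

/* Old scheme: OR a two-bit sequence number into bits 31-30 of the low TID. */
static uint32_t
sk_low_tid_old(uint32_t local_id, uint32_t msg_seq)
{
	assert(msg_seq < 4);
	return (local_id | (msg_seq << 30));
}

/* Read bits 31-30 back out of the low TID. */
static uint32_t
sk_extract_seq(uint32_t low_tid)
{
	return (low_tid >> 30);
}

/* The local ID after being XOR'ed with a 32-bit random operand. */
static uint32_t
sk_randomize(uint32_t id, uint32_t random_operand)
{
	return (id ^ random_operand);
}
```

With an ID of 5 and a random operand of 0xC0000000, the randomized ID already has both top bits set, so OR'ing in sequence number 1 yields a TID whose top bits read back as 3. Without randomization the same call reads back as 1.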
* ipoib: Destroying a CQ should never fail.Hans Petter Selasky2021-07-121-4/+2
Remove unneeded error handling when destroying a CQ. The function in question will later be updated to return "void".
MFC after: 1 week
Reviewed by: kib
Sponsored by: Mellanox Technologies // NVIDIA Networking
* mlx4/OFED: replace the struct net_device with struct ifnetBjoern A. Zeeb2021-06-1810-68/+68
Given that all the code operates on struct ifnet, the last step in this longer series of changes is to rename struct net_device to struct ifnet (which is what it was defined to in the LinuxKPI code). While mlx4 and OFED are "shared" code, the decision was made years ago to write most of it against the native ifnet KPI rather than the netdevice KPI. This commit simply spells this out and with that frees "struct netdevice" to be re-done on LinuxKPI to become a more native/mixed implementation over time as needed by, e.g., wireless drivers.
Sponsored by: The FreeBSD Foundation
MFC after: 10 days
Reviewed by: hselasky
Differential Revision: https://reviews.freebsd.org/D30515
* OFED: migrate LinuxKPI net_device/ifnet macros into ofedBjoern A. Zeeb2021-05-274-0/+4
The LinuxKPI net_device actually is an ifnet. To clean that up further so that "net_device" can be extended, migrate the few remaining macros into ofed and make sure the header is included in all files which need access to the macros.
Sponsored by: The FreeBSD Foundation
MFC after: 12 days
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D30477
* LinuxKPI/OFED/mlx4: cleanup netdevice.h some moreBjoern A. Zeeb2021-05-262-2/+0
This removes all unused bits from linux/netdevice.h and migrates two inline functions into the mlx4 and ofed code respectively. This gets the mlx4/ofed (struct ifnet) specific bits down to 7 lines in netdevice.h.
Sponsored by: The FreeBSD Foundation
MFC after: 13 days
Reviewed by: hselasky, kib
Differential Revision: https://reviews.freebsd.org/D30461
* Add missing sockaddr length and family validation to various protocolsMark Johnston2021-05-031-4/+19
Several protocol methods take a sockaddr as input. In some cases the sockaddr lengths were not being validated, or were validated after some out-of-bounds accesses could occur. Add requisite checking to various protocol entry points, and convert some existing checks to assertions where appropriate.
Reported by: syzkaller+KASAN
Reviewed by: tuexen, melifaro
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D29519
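The kind of check described here validates the length before touching any field beyond it. Below is a userland sketch for the IPv4 case. The function name `sk_check_sockaddr_in` is hypothetical, and the length is passed as a separate parameter as an assumption for portability, whereas kernel code on FreeBSD would normally consult the `sa_len` field.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Validate a caller-supplied address before a protocol method uses it
 * as a struct sockaddr_in. Returns 0 or a positive errno value.
 */
static int
sk_check_sockaddr_in(const struct sockaddr *sa, socklen_t len)
{
	assert(sa != NULL);
	/* Length first: fields past 'len' must not be read. */
	if (len < sizeof(struct sockaddr_in))
		return (EINVAL);
	if (sa->sa_family != AF_INET)
		return (EAFNOSUPPORT);
	return (0);
}
```

Doing the length test first is the ordering the commit message calls out: a family or port check that runs before it could already be reading out of bounds.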
* LinuxKPI/OFED: (re)move inetdevice.h implementationBjoern A. Zeeb2021-03-302-8/+8
The two functions in linux/inetdevice.h are highly FreeBSD/ifnet specific. This is a result of struct net_device being mapped to struct ifnet. The only known consumers of these functions are two files in the ofed/infiniband code. As a first step of cleaning up, copy linux/inetdevice.h to rdma/ib_addr_freebsd.h. (It stays a separate file to preserve the copyright and license of the original file; otherwise it could be merged into ib_addr.h, where more EPOCH/vnet/.. are already used.) Slightly rename the functions so they do not conflict with LinuxKPI in the future. Remove the last three, now unneeded, includes of inetdevice.h and reduce linux/inetdevice.h to an empty header file with only the forward include of netdevice.h remaining.
Sponsored-by: The FreeBSD Foundation
MFC-after: 2 weeks
Reviewed-by: hselasky, kib
X-D-R: D29366 (extracted as further cleanup)
Differential Revision: https://reviews.freebsd.org/D29434
* ipoib: Fix incorrectly computed IPOIB_CM_RX_SG value.Hans Petter Selasky2021-03-253-8/+8
The computed IPOIB_CM_RX_SG is too small. It doesn't account for fallback to mbuf clusters when jumbo frames are not available and it also doesn't account for the packet header and trailer mbuf. This causes a memory overwrite situation when IPOIB_CM is configured. While at it add a kernel assert to ensure the mapping array is not overwritten.
PR: 254474
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
* LinuxKPI: remove < 5.0 version supportBjoern A. Zeeb2021-03-241-2/+1
We are no longer aware of any out-of-tree consumers which would need KPI support for Linux versions before 5. Update the two in-tree consumers to use the new KPI. This allows us to remove the extra version check and will also give access to {lower,upper}_32_bits() unconditionally.
Sponsored-by: The FreeBSD Foundation
Reviewed-by: hselasky, rlibby, rstone
MFC-after: 2 weeks
X-MFC: to 13 only
Differential Revision: https://reviews.freebsd.org/D29391
* ofed/linuxkpi: use proper accessor functionBjoern A. Zeeb2021-03-241-1/+1
In the notifier event callback function, use the proper accessor function rather than casting directly to the expected type, as the mlx drivers already do. This is preparatory work to allow us to improve the "struct net_device is struct ifnet" compat code shortcut in the future.
Obtained-from: bz_iwlwifi
Sponsored-by: The FreeBSD Foundation
MFC-after: 2 weeks
Reviewed-by: hselasky
Differential Revision: https://reviews.freebsd.org/D29364
* ofed: quiet gcc -Wint-in-bool-contextRyan Libby2021-02-241-2/+4
The int in the argument to the ternary triggered -Wint-in-bool-context from gcc. Upstream linux has a larger and more entangled patch, 12f727721eee61b3d19dedb95cb893b2baa9fe41, which doesn't apply cleanly. When we eventually sync that, we can just drop this change.
Reviewed by: hselasky, imp, kib
MFC after: 3 days
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D28762
* kern: net: remove TCP_LINGERTIMEKyle Evans2021-02-191-2/+0
TCP_LINGERTIME can be traced back to BSD 4.4 Lite and perhaps beyond, in exactly the same form that it appears here modulo slightly different context. It used to be the case that there was a single pr_usrreq method with requests dispatched to it; these exact two lines appeared in tcp_usrreq's PRU_ATTACH handling.

The only purpose of this that I can find is to cause surprising behavior on accepted connections. Newly-created sockets will never hit these paths as one cannot set SO_LINGER prior to socket(2). If SO_LINGER is set on a listening socket and inherited, one would expect the timeout to be inherited rather than changed arbitrarily like this -- noting that SO_LINGER is nonsense on a listening socket beyond inheritance, since they cannot be 'connected' by definition.

Neither Illumos nor Linux reset the timer like this based on testing and inspection of Illumos, and testing of Linux.
Reviewed by: rscheff, tuexen
Differential Revision: https://reviews.freebsd.org/D28265
* Fix mismerge in OFED updateRyan Stone2021-02-041-0/+2
When OFED was upgraded to Linux v4.9, a bunch of Linux-specific netlink changes were dropped. Unfortunately, there was a mismerge in this process and as a result ib_sa_cancel_query() would fail to cancel an outstanding MAD. This was causing rdma_destroy_id() to hang indefinitely waiting for the MAD to complete and release the final reference.
Sponsored by: Dell Inc.
Differential Revision: https://reviews.freebsd.org/D28421
Reviewed by: hselasky, kib
MFC after: 2 months
* Fix for referencing file via its vnode in ibcore.Hans Petter Selasky2020-11-021-43/+39
Use the native vnode lookup functions, instead of going via the LinuxKPI, because the file referenced is typically created outside the LinuxKPI, and the LinuxKPI's fdget() can only resolve file descriptor numbers which it created itself. The vnode pointer is used as an identifier for XRCD handles which share resources. This patch fixes the so-called XRCD support in ibcore for FreeBSD. Refer to ibv_open_xrcd(3) for more information on how the file descriptor argument is used.
Reviewed by: kib@
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
Notes: svn path=/head/; revision=367269
* Factor out generic IP over infiniband, IPoIB, definitions and codeHans Petter Selasky2020-10-224-375/+33
into net/if_infiniband.c and net/infiniband.h. No functional change intended.
Differential Revision: https://reviews.freebsd.org/D26254
Reviewed by: melifaro@
MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
Notes: svn path=/head/; revision=366930
* Allow IP over IB to work with multiple FIBs.Ravi Pokala2020-10-131-0/+2
Call M_SETFIB() to make sure the IPoIB packet is directed to the correct interface-specific FIB. This was sufficient to allow general-purpose routing using the default FIB, and a separate FIB for routing between IPoIB on ib0 and IPoEthernet on mce0.
Reviewed by: hselasky
Obtained from: Anmol Kumar <anmolk at panasas dot com>
MFC after: 1 week
Sponsored by: Panasas
Differential Revision: https://reviews.freebsd.org/D25239
Notes: svn path=/head/; revision=366686
* infiniband: Appease CoverityEric van Gyzen2020-08-313-17/+6
  Coverity claims the call to rdma_gid2ip in cma_igmp_send overwrites addr. Use a consistent definition of sockaddr to prevent detections and code changes in the future.

  Submitted by: bret_ketchum@dell.com
  Reported by: Coverity
  Reviewed by: hselasky, kib
  MFC after: 2 weeks
  Sponsored by: Dell EMC Isilon
  Differential Revision: https://reviews.freebsd.org/D26229
  Notes: svn path=/head/; revision=364997
* Infiniband clients must be attached and detached in a specific order in ibcore.Hans Petter Selasky2020-07-0610-19/+36
  Currently the linking order of the infiniband, IB, modules decides in which order the clients are attached and detached. For example, one IB client may use resources from another IB client. This can lead to a potential deadlock at shutdown. For example, if ipoib is unregistered after the ib_multicast client is detached, and ipoib is using multicast addresses, a deadlock may happen, because ib_multicast will wait for all its resources to be freed before returning from the remove method.

  Fix this by using module_xxx_order() instead of module_xxx().

  Differential Revision: https://reviews.freebsd.org/D23973
  MFC after: 1 week
  Sponsored by: Mellanox Technologies
  Notes: svn path=/head/; revision=362953
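
  The ordering requirement in the commit above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct ib_client_model`, `attach_sequence()` and `detach_sequence()` are hypothetical names invented for illustration and are not part of ibcore or the LinuxKPI. The model gives each client an explicit order value, attaches in ascending order, and detaches in the reverse order, so that a consumer such as ipoib is always detached before the provider it depends on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace model of an IB client; not the ibcore API. */
struct ib_client_model {
	const char *name;
	int order;	/* smaller value: attached earlier, detached later */
};

/* Compare by explicit order; fall back to the name so ties are deterministic. */
static int
cmp_order(const void *a, const void *b)
{
	const struct ib_client_model *x = a;
	const struct ib_client_model *y = b;
	int d = (x->order > y->order) - (x->order < y->order);

	return (d != 0 ? d : strcmp(x->name, y->name));
}

/* Sort the clients by order and record the attach sequence in "out". */
static void
attach_sequence(struct ib_client_model *c, size_t n, const char **out)
{
	assert(c != NULL && out != NULL);
	qsort(c, n, sizeof(*c), cmp_order);
	for (size_t i = 0; i < n; i++)
		out[i] = c[i].name;
}

/* Record the detach sequence, which is the exact reverse of attach. */
static void
detach_sequence(struct ib_client_model *c, size_t n, const char **out)
{
	assert(c != NULL && out != NULL);
	qsort(c, n, sizeof(*c), cmp_order);
	for (size_t i = 0; i < n; i++)
		out[i] = c[n - 1 - i].name;
}
```

  In this model the result no longer depends on the order in which the clients appear in the input array, which plays the role of the link order in the commit message.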
* Convert OFED rtable interactions to the new routing KPI.Alexander V. Chernikov2020-04-152-82/+59
  Reviewed by: hselasky
  Differential Revision: https://reviews.freebsd.org/D24387
  Notes: svn path=/head/; revision=359966
* Fix for double unlock in ipoib.Hans Petter Selasky2020-03-161-1/+0
  The ipoib_unicast_send() function is not supposed to unlock the priv lock.

  MFC after: 3 days
  Sponsored by: Mellanox Technologies
  Notes: svn path=/head/; revision=359014
* Fix some whitespace issues in ipoib.Hans Petter Selasky2020-03-061-3/+3
  MFC after: 1 week
  Sponsored by: Mellanox Technologies
  Notes: svn path=/head/; revision=358694
* Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)Pawel Biernacki2020-02-261-2/+4
  r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes.

  This is a non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags.

  Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT.

  Approved by: kib (mentor, blanket)
  Commented by: kib, gallatin, melifaro
  Differential Revision: https://reviews.freebsd.org/D23718
  Notes: svn path=/head/; revision=358333
* Make sure the VNET is properly set when reaping mbufs in ipoib.Hans Petter Selasky2020-01-111-0/+4
  Else the following panic may happen:

  panic()
  icmp_error()
  ipoib_cm_mb_reap()
  linux_work_fn()
  taskqueue_run_locked()
  taskqueue_thread_loop()
  fork_exit()
  fork_trampoline()

  Submitted by: Andreas Kempe <kempe@lysator.liu.se>
  MFC after: 1 week
  Sponsored by: Mellanox Technologies
  Notes: svn path=/head/; revision=356633
* Replace rdma_is_upper_dev_rcu() with rdma_vlan_dev_real_dev() in ibcore.Hans Petter Selasky2019-10-162-13/+1
  This reduces the number of references to VLAN_TRUNKDEV() in ibcore. Currently only VLAN is supported as a child interface in FreeBSD. Remove superfluous RCU locking.

  Sponsored by: Mellanox Technologies
  Notes: svn path=/head/; revision=353632
* VLAN_DEVAT() requires epochification in ipoib after r353292.Hans Petter Selasky2019-10-161-0/+6
  Sponsored by: Mellanox Technologies
  Notes: svn path=/head/; revision=353631
* Fix missing epochification of the ibcore code after r353292.Hans Petter Selasky2019-10-151-1/+4
  Sponsored by: Mellanox Technologies
  Notes: svn path=/head/; revision=353547
* Fix missing epochification of the ipoib code after r353292.Hans Petter Selasky2019-10-153-0/+12
  Sponsored by: Mellanox Technologies
  Notes: svn path=/head/; revision=353546