aboutsummaryrefslogtreecommitdiff
path: root/sys/sys/mbuf.h
Commit message (Collapse)AuthorAgeFilesLines
* Update the LRO processing code so that we can supportRandall Stewart2021-02-171-0/+1
| | | | | | | | | | | | | | | | a further CPU enhancements for compressed acks. These are acks that are compressed into an mbuf. The transport has to be aware of how to process these, and an upcoming update to rack will do so. You need the rack changes to actually test and validate these since if the transport does not support mbuf compression, then the old code paths stay in place. We do in this commit take out the concept of logging if you don't have a lock (which was quite dangerous and was only for some early debugging but has been left in the code). Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D28374
* Add m_snd_tag_alloc() as a wrapper around if_snd_tag_alloc().John Baldwin2020-10-291-0/+3
| | | | | | | | | | | This gives a more uniform API for send tag life cycle management. Reviewed by: gallatin, hselasky Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D27000 Notes: svn path=/head/; revision=367151
* Implement mbuf hashing routines for IP over infiniband, IPoIB.Hans Petter Selasky2020-10-221-2/+4
| | | | | | | | | | | | No functional change intended. Differential Revision: https://reviews.freebsd.org/D26254 Reviewed by: melifaro@ MFC after: 1 week Sponsored by: Mellanox Technologies // NVIDIA Networking Notes: svn path=/head/; revision=366931
* Store the send tag type in the common send tag header.John Baldwin2020-10-061-1/+2
| | | | | | | | | | | | | | | | | | | | Both cxgbe(4) and mlx5(4) wrapped the existing send tag header with their own identical headers that stored the type that the type-specific tag structures inherited from, so in practice it seems drivers need this in the tag anyway. This permits removing these extra header indirections (struct cxgbe_snd_tag and struct mlx5e_snd_tag). In addition, this permits driver-independent code to query the type of a tag, e.g. to know what type of tag is being queried via if_snd_query. Reviewed by: gallatin, hselasky, np, kib Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26689 Notes: svn path=/head/; revision=366491
* mbuf checksum flags and fields to support tunneling protocols.Navdeep Parhar2020-09-181-7/+47
| | | | | | | | | | | | | | | | These are being added to support VXLAN but will work for GENEVE as well. ENCAP_RSVD1 will likely become ENCAP_GENEVE in the future. The size of struct mbuf does not change and that means this change can be MFC'd. If size wasn't a constraint a cleaner way may have been to add inner_csum_flags and inner_csum_data to go with csum_flags and csum_data. Reviewed by: kib@ Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D25873 Notes: svn path=/head/; revision=365867
* Add m__getjcl SDT probe.Andrey V. Elsukov2020-08-051-0/+1
| | | | | | | | | Obtained from: Yandex LLC MFC after: 1 week Sponsored by: Yandex LLC Notes: svn path=/head/; revision=363906
* Add two functions that create M_EXTPG mbufs with anonymous pages.Rick Macklem2020-06-101-0/+3
| | | | | | | | | | | | | | | | These two functions are needed by nfs-over-tls, but could also be useful for other purposes. mb_alloc_ext_plus_pages() - Allocates a M_EXTPG mbuf and enough anonymous pages to store "len" data bytes. mb_mapped_to_unmapped() - Copies the data from a list of mapped (non-M_EXTPG) mbufs into a list of M_EXTPG mbufs allocated with anonymous pages. This is roughly the inverse of mb_unmapped_to_ext(). Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D25182 Notes: svn path=/head/; revision=361998
* Step 4.2: start divorce of M_EXT and M_EXTPGGleb Smirnoff2020-05-031-3/+5
| | | | | | | | | | | | | They have more differencies than similarities. For now there is lots of code that would check for M_EXT only and work correctly on M_EXTPG buffers, so still carry M_EXT bit together with M_EXTPG. However, prepare some code for explicit check for M_EXTPG. Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598 Notes: svn path=/head/; revision=360583
* Mechanically rename MBUF_EXT_PGS_ASSERT() to M_ASSERTEXTPG() to matchGleb Smirnoff2020-05-031-6/+5
| | | | | | | | | | classical M_ASSERTPKTHDR. Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598 Notes: svn path=/head/; revision=360582
* Step 4.1: mechanically rename M_NOMAP to M_EXTPGGleb Smirnoff2020-05-031-5/+5
| | | | | | | | Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598 Notes: svn path=/head/; revision=360581
* Step 3: anonymize struct mbuf_ext_pgs and move all its fields into mbufGleb Smirnoff2020-05-031-33/+35
| | | | | | | | | | | | | | | | | | | | | | | | | within m_epg namespace. All edits except the 'struct mbuf' declaration and mb_dupcl() were done mechanically with sed: s/->m_ext_pgs.nrdy/->m_epg_nrdy/g s/->m_ext_pgs.hdr_len/->m_epg_hdrlen/g s/->m_ext_pgs.trail_len/->m_epg_trllen/g s/->m_ext_pgs.first_pg_off/->m_epg_1st_off/g s/->m_ext_pgs.last_pg_len/->m_epg_last_len/g s/->m_ext_pgs.flags/->m_epg_flags/g s/->m_ext_pgs.record_type/->m_epg_record_type/g s/->m_ext_pgs.enc_cnt/->m_epg_enc_cnt/g s/->m_ext_pgs.tls/->m_epg_tls/g s/->m_ext_pgs.so/->m_epg_so/g s/->m_ext_pgs.seqno/->m_epg_seqno/g s/->m_ext_pgs.stailq/->m_epg_stailq/g Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598 Notes: svn path=/head/; revision=360579
* Make MBUF_EXT_PGS_ASSERT_SANITY() a macro, so that it prints file:line.Gleb Smirnoff2020-05-031-5/+28
| | | | | | | | | | While here, stop using struct mbuf_ext_pgs. Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598 Notes: svn path=/head/; revision=360577
* Step 2.3: Rename mbuf_ext_pg_len() to m_epg_pagelen() thatGleb Smirnoff2020-05-021-4/+6
| | | | | | | | | | uses mbuf argument. Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598 Notes: svn path=/head/; revision=360575
* Step 2.1: Build TLS workqueue from mbufs, not struct mbuf_ext_pgs.Gleb Smirnoff2020-05-021-1/+1
| | | | | | | | Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598 Notes: svn path=/head/; revision=360573
* Get rid of the mbuf self-pointing pointer.Gleb Smirnoff2020-05-021-1/+1
| | | | | | | | Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598 Notes: svn path=/head/; revision=360572
* Start moving into EPG_/epg_ namespace. There is only one flag, butGleb Smirnoff2020-05-021-2/+1
| | | | | | | | | | | next commit brings in second flag, so let them already be in the future namespace. Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598 Notes: svn path=/head/; revision=360571
* Continuation of multi page mbuf redesign from r359919.Gleb Smirnoff2020-05-021-57/+83
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The following series of patches addresses three things: Now that array of pages is embedded into mbuf, we no longer need separate structure to pass around, so struct mbuf_ext_pgs is an artifact of the first implementation. And struct mbuf_ext_pgs_data is a crutch to accomodate the main idea r359919 with minimal churn. Also, M_EXT of type EXT_PGS are just a synonym of M_NOMAP. The namespace for the newfeature is somewhat inconsistent and sometimes has a lengthy prefixes. In these patches we will gradually bring the namespace to "m_epg" prefix for all mbuf fields and most functions. Step 1 of 4: o Anonymize mbuf_ext_pgs_data, embed in m_ext o Embed mbuf_ext_pgs o Start documenting all this entanglement Reviewed by: gallatin Differential Revision: https://reviews.freebsd.org/D24598 Notes: svn path=/head/; revision=360569
* KTLS: Re-work unmapped mbufs to carry ext_pgs in the mbuf itself.Andrew Gallatin2020-04-141-86/+80
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While the original implementation of unmapped mbufs was a large step forward in terms of reducing cache misses by enabling mbufs to carry more than a single page for sendfile, they are rather cache unfriendly when accessing the ext_pgs metadata and data. This is because the ext_pgs part of the mbuf is allocated separately, and almost guaranteed to be cold in cache. This change takes advantage of the fact that unmapped mbufs are never used at the same time as pkthdr mbufs. Given this fact, we can overlap the ext_pgs metadata with the mbuf pkthdr, and carry the ext_pgs meta directly in the mbuf itself. Similarly, we can carry the ext_pgs data (TLS hdr/trailer/array of pages) directly after the existing m_ext. In order to be able to carry 5 pages (which is the minimum required for a 16K TLS record which is not perfectly aligned) on LP64, I've had to steal ext_arg2. The only user of this in the xmit path is sendfile, and I've adjusted it to use arg1 when using unmapped mbufs. This change is almost entirely mechanical, except that we change mb_alloc_ext_pgs() to no longer allow allocating pkthdrs, the change to avoid ext_arg2 as mentioned above, and the removal of the ext_pgs zone, This change saves roughly 2% "raw" CPU (~59% -> 57%), or over 3% "scaled" CPU on a Netflix 100% software kTLS workload at 90+ Gb/s on Broadwell Xeons. In a follow-on commit, I plan to remove some hacks to avoid access ext_pgs fields of mbufs, since they will now be in cache. Many thanks to glebius for helping to make this better in the Netflix tree. Reviewed by: hselasky, jhb, rrs, glebius (early version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D24213 Notes: svn path=/head/; revision=359919
* Split out a more generic debugnet(4) from netdump(4)Conrad Meyer2019-10-171-5/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Debugnet is a simplistic and specialized panic- or debug-time reliable datagram transport. It can drive a single connection at a time and is currently unidirectional (debug/panic machine transmit to remote server only). It is mostly a verbatim code lift from netdump(4). Netdump(4) remains the only consumer (until the rest of this patch series lands). The INET-specific logic has been extracted somewhat more thoroughly than previously in netdump(4), into debugnet_inet.c. UDP-layer logic and up, as much as possible as is protocol-independent, remains in debugnet.c. The separation is not perfect and future improvement is welcome. Supporting INET6 is a long-term goal. Much of the diff is "gratuitous" renaming from 'netdump_' or 'nd_' to 'debugnet_' or 'dn_' -- sorry. I thought keeping the netdump name on the generic module would be more confusing than the refactoring. The only functional change here is the mbuf allocation / tracking. Instead of initiating solely on netdump-configured interface(s) at dumpon(8) configuration time, we watch for any debugnet-enabled NIC for link activation and query it for mbuf parameters at that time. If they exceed the existing high-water mark allocation, we re-allocate and track the new high-water mark. Otherwise, we leave the pre-panic mbuf allocation alone. In a future patch in this series, this will allow initiating netdump from panic ddb(4) without pre-panic configuration. No other functional change intended. Reviewed by: markj (earlier version) Some discussion with: emaste, jhb Objection from: marius Differential Revision: https://reviews.freebsd.org/D21421 Notes: svn path=/head/; revision=353685
* Brad Davis identified a problem with the new LRO code, VLAN'sRandall Stewart2019-10-061-12/+7
| | | | | | | | | | | | | | | | | no longer worked. The problem was that the defines used the same space as the VLAN id. This commit does three things. 1) Move the LRO used fields to the PH_per fields. This is safe since the entire PH_per is used for IP reassembly which LRO code will not hit. 2) Remove old unused pace fields that are not used in mbuf.h 3) The VLAN processing is not in the mbuf queueing code. Consequently if a VLAN submits to Rack or BBR we need to bypass the mbuf queueing for now until rack_bbr_common is updated to handle the VLAN properly. Reported by: Brad Davis Notes: svn path=/head/; revision=353156
* kTLS: Fix a bug where we would not encrypt anon data inplace.Andrew Gallatin2019-09-271-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Software Kernel TLS needs to allocate a new destination crypto buffer when encrypting data from the page cache, so as to avoid overwriting shared clear-text file data with encrypted data specific to a single socket. When the data is anonymous, eg, not tied to a file, then we can encrypt in place and avoid allocating a new page. This fixes a bug where the existing code always assumes the data is private, and never encrypts in place. This results in unneeded page allocations and potentially more memory bandwidth consumption when doing socket writes. When the code was written at Netflix, ktls_encrypt() looked at private sendfile flags to determine if the pages being encrypted where part of the page cache (coming from sendfile) or anonymous (coming from sosend). This was broken internally at Netflix when the sendfile flags were made private, and the M_WRITABLE() check was added. Unfortunately, M_WRITABLE() will always be false for M_NOMAP mbufs, since one cannot just mtod() them. This change introduces a new flags field to the mbuf_ext_pgs struct by stealing a byte from the tls hdr. Note that the current header is still 2 bytes larger than the largest header we support: AES-CBC with explicit IV. We set MBUF_PEXT_FLAG_ANON when creating an unmapped mbuf in m_uiotombuf_nomap() (which is the path that socket writes take), and we check for that flag in ktls_encrypt() when looking for anon pages. Reviewed by: jhb Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21796 Notes: svn path=/head/; revision=352816
* kTLS support for TLS 1.3Andrew Gallatin2019-09-271-0/+1
| | | | | | | | | | | | | | | | TLS 1.3 requires a few changes because 1.3 pretends to be 1.2 with a record type of application data. The "real" record type is then included at the end of the user-supplied plaintext data. This required adding a field to the mbuf_ext_pgs struct to save the record type, and passing the real record type to the sw_encrypt() ktls backend functions. Reviewed by: jhb, hselasky Sponsored by: Netflix Differential Revision: D21801 Notes: svn path=/head/; revision=352814
* This commit adds BBR (Bottleneck Bandwidth and RTT) congestion control. ThisRandall Stewart2019-09-241-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | is a completely separate TCP stack (tcp_bbr.ko) that will be built only if you add the make options WITH_EXTRA_TCP_STACKS=1 and also include the option TCPHPTS. You can also include the RATELIMIT option if you have a NIC interface that supports hardware pacing, BBR understands how to use such a feature. Note that this commit also adds in a general purpose time-filter which allows you to have a min-filter or max-filter. A filter allows you to have a low (or high) value for some period of time and degrade slowly to another value has time passes. You can find out the details of BBR by looking at the original paper at: https://queue.acm.org/detail.cfm?id=3022184 or consult many other web resources you can find on the web referenced by "BBR congestion control". It should be noted that BBRv1 (which this is) does tend to unfairness in cases of small buffered paths, and it will usually get less bandwidth in the case of large BDP paths(when competing with new-reno or cubic flows). BBR is still an active research area and we do plan on implementing V2 of BBR to see if it is an improvement over V1. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D21582 Notes: svn path=/head/; revision=352657
* This adds the final tweaks to LRO that will now allow meRandall Stewart2019-09-061-15/+16
| | | | | | | | | | | | | | | | to add BBR. These changes make it so you can get an array of timestamps instead of a compressed ack/data segment. BBR uses this to aid with its delivery estimates. We also now (via Drew's suggestions) will not go to the expense of the tcb lookup if no stack registers to want this feature. If HPTS is not present the feature is not present either and you just get the compressed behavior. Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D21127 Notes: svn path=/head/; revision=351934
* Allow mbuf queues to be unlimited.Gleb Smirnoff2019-08-301-1/+1
| | | | | | | | | There is number of legacy code that uses ifqueue without setting a limit on it first. Easier to allow for that rather than improve legacy drivers. Notes: svn path=/head/; revision=351615
* Add kernel-side support for in-kernel TLS.John Baldwin2019-08-271-2/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for transmitted data. Key negotation must still be performed in userland. Once completed, transmit session keys for a connection are provided to the kernel via a new TCP_TXTLS_ENABLE socket option. All subsequent data transmitted on the socket is placed into TLS frames and encrypted using the supplied keys. Any data written to a KTLS-enabled socket via write(2), aio_write(2), or sendfile(2) is assumed to be application data and is encoded in TLS frames with an application data type. Individual records can be sent with a custom type (e.g. handshake messages) via sendmsg(2) with a new control message (TLS_SET_RECORD_TYPE) specifying the record type. At present, rekeying is not supported though the in-kernel framework should support rekeying. KTLS makes use of the recently added unmapped mbufs to store TLS frames in the socket buffer. Each TLS frame is described by a single ext_pgs mbuf. The ext_pgs structure contains the header of the TLS record (and trailer for encrypted records) as well as references to the associated TLS session. KTLS supports two primary methods of encrypting TLS frames: software TLS and ifnet TLS. Software TLS marks mbufs holding socket data as not ready via M_NOTREADY similar to sendfile(2) when TLS framing information is added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then called to schedule TLS frames for encryption. In the case of sendfile_iodone() calls ktls_enqueue() instead of pru_ready() leaving the mbufs marked M_NOTREADY until encryption is completed. For other writes (vn_sendfile when pages are available, write(2), etc.), the PRUS_NOTREADY is set when invoking pru_send() along with invoking ktls_enqueue(). A pool of worker threads (the "KTLS" kernel process) encrypts TLS frames queued via ktls_enqueue(). Each TLS frame is temporarily mapped using the direct map and passed to a software encryption backend to perform the actual encryption. (Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if someone wished to make this work on architectures without a direct map.) KTLS supports pluggable software encryption backends. Internally, Netflix uses proprietary pure-software backends. This commit includes a simple backend in a new ktls_ocf.ko module that uses the kernel's OpenCrypto framework to provide AES-GCM encryption of TLS frames. As a result, software TLS is now a bit of a misnomer as it can make use of hardware crypto accelerators. Once software encryption has finished, the TLS frame mbufs are marked ready via pru_ready(). At this point, the encrypted data appears as regular payload to the TCP stack stored in unmapped mbufs. ifnet TLS permits a NIC to offload the TLS encryption and TCP segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS) is allocated on the interface a socket is routed over and associated with a TLS session. TLS records for a TLS session using ifnet TLS are not marked M_NOTREADY but are passed down the stack unencrypted. The ip_output_send() and ip6_output_send() helper functions that apply send tags to outbound IP packets verify that the send tag of the TLS record matches the outbound interface. If so, the packet is tagged with the TLS send tag and sent to the interface. The NIC device driver must recognize packets with the TLS send tag and schedule them for TLS encryption and TCP segmentation. If the the outbound interface does not match the interface in the TLS send tag, the packet is dropped. In addition, a task is scheduled to refresh the TLS send tag for the TLS session. If a new TLS send tag cannot be allocated, the connection is dropped. If a new TLS send tag is allocated, however, subsequent packets will be tagged with the correct TLS send tag. (This latter case has been tested by configuring both ports of a Chelsio T6 in a lagg and failing over from one port to another. As the connections migrated to the new port, new TLS send tags were allocated for the new port and connections resumed without being dropped.) ifnet TLS can be enabled and disabled on supported network interfaces via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported across both vlan devices and lagg interfaces using failover, lacp with flowid enabled, or lacp with flowid enabled. Applications may request the current KTLS mode of a connection via a new TCP_TXTLS_MODE socket option. They can also use this socket option to toggle between software and ifnet TLS modes. In addition, a testing tool is available in tools/tools/switch_tls. This is modeled on tcpdrop and uses similar syntax. However, instead of dropping connections, -s is used to force KTLS connections to switch to software TLS and -i is used to switch to ifnet TLS. Various sysctls and counters are available under the kern.ipc.tls sysctl node. The kern.ipc.tls.enable node must be set to true to enable KTLS (it is off by default). The use of unmapped mbufs must also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS. KTLS is enabled via the KERN_TLS kernel option. This patch is the culmination of years of work by several folks including Scott Long and Randall Stewart for the original design and implementation; Drew Gallatin for several optimizations including the use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records awaiting software encryption, and pluggable software crypto backends; and John Baldwin for modifications to support hardware TLS offload. Reviewed by: gallatin, hselasky, rrs Obtained from: Netflix Sponsored by: Netflix, Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21277 Notes: svn path=/head/; revision=351522
* This commit updates rack to what is basically being used at NF asRandall Stewart2019-07-101-0/+1
| | | | | | | | | | | | | | well as sets in some of the groundwork for committing BBR. The hpts system is updated as well as some other needed utilities for the entrance of BBR. This is actually part 1 of 3 more needed commits which will finally complete with BBRv1 being added as a new tcp stack. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D20834 Notes: svn path=/head/; revision=349893
* Add support for using unmapped mbufs with sendfile(2).John Baldwin2019-06-291-0/+1
| | | | | | | | | | | | | | This can be enabled at runtime via the kern.ipc.mb_use_ext_pgs sysctl. It is disabled by default. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616 Notes: svn path=/head/; revision=349530
* Add an external mbuf buffer type that holds multiple unmapped pages.John Baldwin2019-06-291-5/+114
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unmapped mbufs allow sendfile to carry multiple pages of data in a single mbuf, without mapping those pages. It is a requirement for Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web serving workloads when used by sendfile, due to effectively compressing socket buffers by an order of magnitude, and hence reducing cache misses. For this new external mbuf buffer type (EXT_PGS), the ext_buf pointer now points to a struct mbuf_ext_pgs structure instead of a data buffer. This structure contains an array of physical addresses (this reduces cache misses compared to an earlier version that stored an array of vm_page_t pointers). It also stores additional fields needed for in-kernel TLS such as the TLS header and trailer data that are currently unused. To more easily detect these mbufs, the M_NOMAP flag is set in m_flags in addition to M_EXT. Various functions like m_copydata() have been updated to safely access packet contents (using uiomove_fromphys()), to make things like BPF safe. NIC drivers advertise support for unmapped mbufs on transmit via a new IFCAP_NOMAP capability. This capability can be toggled via the new 'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only transmit packet contents via DMA and use bus_dma, adding the capability to if_capabilities and if_capenable should be all that is required. If a NIC does not support unmapped mbufs, they are converted to a chain of mapped mbufs (using sf_bufs to provide the mapping) in ip_output or ip6_output. If an unmapped mbuf requires software checksums, it is also converted to a chain of mapped mbufs before computing the checksum. Submitted by: gallatin (earlier version) Reviewed by: gallatin, hselasky, rrs Discussed with: ae, kp (firewalls) Relnotes: yes Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20616 Notes: svn path=/head/; revision=349529
* Add M_NOFREE to M_FLAG_BITS.John Baldwin2019-06-111-1/+1
| | | | Notes: svn path=/head/; revision=348965
* Restructure mbuf send tags to provide stronger guarantees.John Baldwin2019-05-241-1/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Perform ifp mismatch checks (to determine if a send tag is allocated for a different ifp than the one the packet is being output on), in ip_output() and ip6_output(). This avoids sending packets with send tags to ifnet drivers that don't support send tags. Since we are now checking for ifp mismatches before invoking if_output, we can now try to allocate a new tag before invoking if_output sending the original packet on the new tag if allocation succeeds. To avoid code duplication for the fragment and unfragmented cases, add ip_output_send() and ip6_output_send() as wrappers around if_output and nd6_output_ifp, respectively. All of the logic for setting send tags and dealing with send tag-related errors is done in these wrapper functions. For pseudo interfaces that wrap other network interfaces (vlan and lagg), wrapper send tags are now allocated so that ip*_output see the wrapper ifp as the ifp in the send tag. The if_transmit routines rewrite the send tags after performing an ifp mismatch check. If an ifp mismatch is detected, the transmit routines fail with EAGAIN. - To provide clearer life cycle management of send tags, especially in the presence of vlan and lagg wrapper tags, add a reference count to send tags managed via m_snd_tag_ref() and m_snd_tag_rele(). Provide a helper function (m_snd_tag_init()) for use by drivers supporting send tags. m_snd_tag_init() takes care of the if_ref on the ifp meaning that code alloating send tags via if_snd_tag_alloc no longer has to manage that manually. Similarly, m_snd_tag_rele drops the refcount on the ifp after invoking if_snd_tag_free when the last reference to a send tag is dropped. This also closes use after free races if there are pending packets in driver tx rings after the socket is closed (e.g. from tcpdrop). In order for m_free to work reliably, add a new CSUM_SND_TAG flag in csum_flags to indicate 'snd_tag' is set (rather than 'rcvif'). Drivers now also check this flag instead of checking snd_tag against NULL. This avoids false positive matches when a forwarded packet has a non-NULL rcvif that was treated as a send tag. - cxgbe was relying on snd_tag_free being called when the inp was detached so that it could kick the firmware to flush any pending work on the flow. This is because the driver doesn't require ACK messages from the firmware for every request, but instead does a kind of manual interrupt coalescing by only setting a flag to request a completion on a subset of requests. If all of the in-flight requests don't have the flag when the tag is detached from the inp, the flow might never return the credits. The current snd_tag_free command issues a flush command to force the credits to return. However, the credit return is what also frees the mbufs, and since those mbufs now hold references on the tag, this meant that snd_tag_free would never be called. To fix, explicitly drop the mbuf's reference on the snd tag when the mbuf is queued in the firmware work queue. This means that once the inp's reference on the tag goes away and all in-flight mbufs have been queued to the firmware, tag's refcount will drop to zero and snd_tag_free will kick in and send the flush request. Note that we need to avoid doing this in the middle of ethofld_tx(), so the driver grabs a temporary reference on the tag around that loop to defer the free to the end of the function in case it sends the last mbuf to the queue after the inp has dropped its reference on the tag. - mlx5 preallocates send tags and was using the ifp pointer even when the send tag wasn't in use. Explicitly use the ifp from other data structures instead. - Sprinkle some assertions in various places to assert that received packets don't have a send tag, and that other places that overwrite rcvif (e.g. 802.11 transmit) don't clobber a send tag pointer. Reviewed by: gallatin, hselasky, rgrimes, ae Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20117 Notes: svn path=/head/; revision=348254
* Replace cosqos with numa_domain in mbuf pkthdrAndrew Gallatin2019-04-161-28/+2
| | | | | | | | | | | | | | | | | | | | The cosqos field was added nearly 6 years ago in r254804, and it is still unused by any in-tree consumers. I have a patchset that I'm working on which aligns many network resources by NUMA domain, including inps, inpcb lb group, tcp pacing, lagg output link selection, backing pages for sendfile, and more. It reduces cross-domain traffic by roughly 50% for a real web workload. This patchset relies on being able to store the numa domain in the mbuf, and grabbing the unused cosqos field for this purpose is the first step in starting to usptream it. Reviewed by: kib, markj Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19862 Notes: svn path=/head/; revision=346281
* Add new m_ext type for data for M_NOFREE mbufs, which doesn't actually doGleb Smirnoff2019-01-311-0/+1
| | | | | | | | | | anything except several assertions. This type is going to be used for temporary on stack mbufs, that point into data in receive ring of a NIC, that shall not be freed. Such mbuf can not be stored or reallocated, its life time is current context. Notes: svn path=/head/; revision=343627
* Improve handling of control message truncation.Mark Johnston2018-08-071-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a recvmsg(2) or recvmmsg(2) caller doesn't provide sufficient space for all control messages, the kernel sets MSG_CTRUNC in the message flags to indicate truncation of the control messages. In the case of SCM_RIGHTS messages, however, we were failing to dispose of the rights that had already been externalized into the recipient's file descriptor table. Add a new function and mbuf type to handle this cleanup task, and use it any time we fail to copy control messages out to the recipient. To simplify cleanup, control message truncation is now only performed at control message boundaries. The change also fixes a few related bugs: - Rights could be leaked to the recipient process if an error occurred while copying out a message's contents. - We failed to set MSG_CTRUNC if the truncation occurred on a control message boundary, e.g., if the caller received two control messages and provided only the exact amount of buffer space needed for the first. PR: 131876 Reviewed by: ed (previous version) MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16561 Notes: svn path=/head/; revision=337423
* Remove MT_NTYPES.Mark Johnston2018-08-011-1/+0
| | | | | | | Its last reference was removed in r253361. Notes: svn path=/head/; revision=337032
* This commit brings in a new refactored TCP stack called Rack.Randall Stewart2018-06-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rack includes the following features: - A different SACK processing scheme (the old sack structures are not used). - RACK (Recent acknowledgment) where counting dup-acks is no longer done instead time is used to knwo when to retransmit. (see the I-D) - TLP (Tail Loss Probe) where we will probe for tail-losses to attempt to try not to take a retransmit time-out. (see the I-D) - Burst mitigation using TCPHTPS - PRR (partial rate reduction) see the RFC. Once built into your kernel, you can select this stack by either socket option with the name of the stack is "rack" or by setting the global sysctl so the default is rack. Note that any connection that does not support SACK will be kicked back to the "default" base FreeBSD stack (currently known as "default"). To build this into your kernel you will need to enable in your kernel: makeoptions WITH_EXTRA_TCP_STACKS=1 options TCPHPTS Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15525 Notes: svn path=/head/; revision=334804
* Add an mbuf allocator for netdump.Mark Johnston2018-05-061-0/+7
| | | | | | | | | | | | | | | | | | | | | | The aim is to permit mbuf allocations after a panic without calling into the page allocator, without imposing any runtime overhead during regular operation of the system, and without modifying driver code. The approach taken is to preallocate a number of mbufs and clusters, storing them in linked lists, and using the lists to back some UMA cache zones. At panic time, the mbuf and cluster zone pointers are overwritten with those of the cache zones so that the mbuf allocator returns preallocated items. Using this scheme, drivers which cache mbuf zone pointers from m_getzone() require special handling when implementing netdump support. Reviewed by: cem (earlier version), julian, sbruno MFC after: 1 month Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D15251 Notes: svn path=/head/; revision=333281
* This commit brings in the TCP high precision timer system (tcp_hpts).Randall Stewart2018-04-191-0/+5
| | | | | | | | | | | | | | | It is the forerunner/foundational work of bringing in both Rack and BBR which use hpts for pacing out packets. The feature is optional and requires the TCPHPTS option to be enabled before the feature will be active. TCP modules that use it must assure that the base component is compile in the kernel in which they are loaded. MFC after: Never Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D15020 Notes: svn path=/head/; revision=332770
* sys: further adoption of SPDX licensing ID tags.Pedro F. Giffuni2017-11-201-0/+2
| | | | | | | | | | | | | | | | | Mainly focus on files that use BSD 3-Clause license. The Software Package Data Exchange (SPDX) group provides a specification to make it easier for automated tools to detect and summarize well known opensource licenses. We are gradually adopting the specification, noting that the tags are considered only advisory and do not, in any way, superceed or replace the license texts. Special thanks to Wind River for providing access to "The Duke of Highlander" tool: an older (2014) run over FreeBSD tree was useful as a starting point. Notes: svn path=/head/; revision=326023
* Add a place for a driver to report rx timestamps in nanoseconds fromKonstantin Belousov2017-11-071-8/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | boot for the received packets. The rcv_tstmp field overlaps the place of Ln header length indicators, not used by received packets. The basic pkthdr rearrangement change in sys/mbuf.h was provided by gallatin. There are two accompanying M_ flags: M_TSTMP means that there is the timestamp (and it was generated by hardware). Another flag M_TSTMP_HPREC indicates that the timestamp is high-precision. Practically M_TSTMP_HPREC means that hardware provided additional precision comparing with the stamps when the flag is not set. E.g., for ConnectX all packets are stamped by hardware when PCIe transaction to write out the completion descriptor is performed, but PTP packet are stamped on port. For Intel cards, when PTP assist is enabled, only PTP packets are stamped in the limited number of registers, so if Intel cards ever start support this mechanism, they would always set M_TSTMP | M_TSTMP_HPREC if hardware timestamp is present for the given packet. Add IFCAP_HWRXTSTMP interface capability to indicate the support for hardware rx timestamping, and ifconfig(8) command to toggle it. Based on the patch by: gallatin Reviewed by: gallatin (previous version), hselasky Sponsored by: Mellanox Technologies MFC after: 2 weeks (? mbuf KBI issue) X-Differential revision: https://reviews.freebsd.org/D12638 Notes: svn path=/head/; revision=325506
* Improvements to sendfile(2) mbuf free routine.Gleb Smirnoff2017-10-091-7/+0
| | | | | | | | | | | | | | | | | | o Fall back to default m_ext free mech, using function pointer in m_ext_free, and remove sf_ext_free() called directly from mbuf code. Testing on modern CPUs showed no regression. o Provide internally used flag EXT_FLAG_SYNC, to mark that I/O uses SF_SYNC flag. Lack of the flag allows us not to dereference ext_arg2, saving from a cache line miss. o Create function sendfile_free_page() that later will be used, for multi-page mbufs. For now compiler will inline it into sendfile_free_mext(). In collaboration with: gallatin Differential Revision: https://reviews.freebsd.org/D12615 Notes: svn path=/head/; revision=324448
* In mb_dupcl() don't copy full m_ext, to avoid cache miss. Respectively,Gleb Smirnoff2017-10-091-2/+18
| | | | | | | | | | | | | | | | | in mb_free_ext() always use fields from the original refcount holding mbuf (see. r296242) mbuf. Cuts another cache miss from mb_free_ext(). However, treat EXT_EXTREF mbufs differently, since they are different - they don't have a refcount holding mbuf. Provide longer comments in m_ext declaration to explain this change and change from r296242. In collaboration with: gallatin Differential Revision: https://reviews.freebsd.org/D12615 Notes: svn path=/head/; revision=324447
* Shorten list of arguments to mbuf external storage freeing function.Gleb Smirnoff2017-10-091-13/+12
| | | | | | | | | | | | | | | | | | | | All of these arguments are stored in m_ext, so there is no reason to pass them in the argument list. Not all functions need the second argument, some don't even need the first one. The second argument lives in next cache line, so not dereferencing it is a performance gain. This was discovered in sendfile(2), which will be covered by next commits. The second goal of this commit is to bring even more flexibility to m_ext mbufs, allowing to create more fields in m_ext, opaque to the generic mbuf code, and potentially set and dereferenced by subsystems. Reviewed by: gallatin, kbowling Differential Revision: https://reviews.freebsd.org/D12615 Notes: svn path=/head/; revision=324446
* mbuf: Remove UDP_IPV4_EX, which was never defined.Sepherosa Ziehau2017-09-271-3/+10
| | | | | | | | | | | | | Add comment to explain the IPV6_EX suffix. The confusion about these RSS hash type probably stems from the facts that they were never widely implemented by hardwares. Reviewed by: rwatson Sponsored by: Microsoft Differential Revision: https://reviews.freebsd.org/D12453 Notes: svn path=/head/; revision=324052
* Fixed typo.Patrick Kelsey2017-04-081-2/+2
| | | | | | | | | CSUM_COALESED -> CSUM_COALESCED MFC after: 3 days Notes: svn path=/head/; revision=316632
* [mbufq] add a concat method.Adrian Chadd2017-03-301-0/+13
| | | | | | | | | Reviewed by: gnn, ae, glebius Approved by: ae, glebius Differential Revision: https://reviews.freebsd.org/D10158 Notes: svn path=/head/; revision=316206
* m_mbuftouio() doesn't modify the mbuf.Gleb Smirnoff2017-03-071-1/+1
| | | | Notes: svn path=/head/; revision=314876
* Implement kernel support for hardware rate limited sockets.Hans Petter Selasky2017-01-181-1/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Add RATELIMIT kernel configuration keyword which must be set to enable the new functionality. - Add support for hardware driven, Receive Side Scaling, RSS aware, rate limited sendqueues and expose the functionality through the already established SO_MAX_PACING_RATE setsockopt(). The API support rates in the range from 1 to 4Gbytes/s which are suitable for regular TCP and UDP streams. The setsockopt(2) manual page has been updated. - Add rate limit function callback API to "struct ifnet" which supports the following operations: if_snd_tag_alloc(), if_snd_tag_modify(), if_snd_tag_query() and if_snd_tag_free(). - Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT flag, which tells if a network driver supports rate limiting or not. - This patch also adds support for rate limiting through VLAN and LAGG intermediate network devices. - How rate limiting works: 1) The userspace application calls setsockopt() after accepting or making a new connection to set the rate which is then stored in the socket structure in the kernel. Later on when packets are transmitted a check is made in the transmit path for rate changes. A rate change implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the destination network interface, which then sets up a custom sendqueue with the given rate limitation parameter. A "struct m_snd_tag" pointer is returned which serves as a "snd_tag" hint in the m_pkthdr for the subsequently transmitted mbufs. 2) When the network driver sees the "m->m_pkthdr.snd_tag" different from NULL, it will move the packets into a designated rate limited sendqueue given by the snd_tag pointer. It is up to the individual drivers how the rate limited traffic will be rate limited. 3) Route changes are detected by the NIC drivers in the ifp->if_transmit() routine when the ifnet pointer in the incoming snd_tag mismatches the one of the network interface. The network adapter frees the mbuf and returns EAGAIN which causes the ip_output() to release and clear the send tag. Upon next ip_output() a new "snd_tag" will be tried allocated. 4) When the PCB is detached the custom sendqueue will be released by a non-blocking ifp->if_snd_tag_free() call to the currently bound network interface. Reviewed by: wblock (manpages), adrian, gallatin, scottl (network) Differential Revision: https://reviews.freebsd.org/D3687 Sponsored by: Mellanox Technologies MFC after: 3 months Notes: svn path=/head/; revision=312379
* Fix typo in comments.Navdeep Parhar2016-10-151-1/+1
| | | | Notes: svn path=/head/; revision=307380
* Remove the 4.3BSD compatible macro m_copy(), use m_copym() instead.Kevin Lo2016-09-151-3/+0
| | | | | | | | Reviewed by: gnn Differential Revision: https://reviews.freebsd.org/D7878 Notes: svn path=/head/; revision=305824