src - FreeBSD source tree

	Commit message	Author	Age	Files	Lines
*	tcp: Make dsack stats available in netstat and also make sure its aware of ↵	Randall Stewart	2021-10-01	1	-0/+27
	TLPs. DSACK accounting has been under a NETFLIX_STATS ifdef for quite some time. Statistics on DSACKs, however, are very useful in figuring out how many bad retransmissions you are doing. This is further complicated, however, by stacks that do TLP. A TLP, when discovering a lost ack in the reverse path, will cause the generation of a DSACK. For this situation we introduce a new dsack-tlp-bytes as well as the more traditional dsack-bytes and dsack-packets. These will now all display in netstat -p tcp -s. This also updates all stacks that are currently built to keep track of these stats. Reviewed by: tuexen Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D32158
*	tcp: TCP_LRO getting bad checksums and sending it in to TCP incorrectly.	Randall Stewart	2021-07-13	1	-0/+1
	In reviewing tcp_lro.c we have a possibility that some drivers may send an mbuf into LRO without making sure that the checksum passes. Some drivers actually are aware of this and do not call lro when the csum failed, others do not do this and thus could end up sending data up that we think has a checksum passing when it does not. This change will fix that situation by properly verifying that the mbuf has the correct markings (CSUM VALID bits as well as csum in mbuf header is set to 0xffff). Reviewed by: tuexen, hselasky, gallatin Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D31155
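The check described in this commit message can be modelled in a few lines. This is only an illustrative Python sketch, not the kernel code: the flag names echo FreeBSD's mbuf checksum flags, but the bit values and the function name `lro_csum_ok` are assumptions made for the example.

```python
# Illustrative bit values only; the real constants live in the kernel headers.
CSUM_DATA_VALID = 0x1   # driver/hardware verified the L4 checksum
CSUM_PSEUDO_HDR = 0x2   # the pseudo header was included in that check


def lro_csum_ok(csum_flags: int, csum_data: int) -> bool:
    """Return True only if the packet is marked as fully checksum-verified."""
    required = CSUM_DATA_VALID | CSUM_PSEUDO_HDR
    # Both "valid" bits must be set and the checksum field must read 0xffff.
    return (csum_flags & required) == required and csum_data == 0xFFFF
```

A packet that fails this test would simply not be handed to LRO as "checksum passing" in this model.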
*	tcp: tolerate missing timestamps	Michael Tuexen	2021-06-27	1	-1/+8
	Some TCP stacks negotiate TS support, but do not send TS at all or not for keep-alive segments. Since this includes modern widely deployed stacks, tolerate the violation of RFC 7323 by default. Reviewed by: rgrimes, rrs, rscheff MFC after: 3 days Differential Revision: https://reviews.freebsd.org/D30740 Sponsored by: Netflix, Inc.
*	Consistently use the SOLISTENING() macro	Mark Johnston	2021-06-14	1	-2/+2
	Some code was using it already, but in many places we were testing SO_ACCEPTCONN directly. As a small step towards fixing some bugs involving synchronization with listen(2), make the kernel consistently use SOLISTENING(). No functional change intended. MFC after: 1 week Sponsored by: The FreeBSD Foundation
*	tcp: Fix an issue with the PUSH bit as well as fill in the missing mtu ↵	Randall Stewart	2021-05-24	1	-0/+9
	change for fsb's. The push bit itself was also not actually being properly moved to the right edge. The FIN bit was incorrectly on the left edge. We fix these two issues as well as plumb in the mtu_change for alternate stacks. Reviewed by: mtuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30413
*	tcp: SACK Lost Retransmission Detection (LRD)	Richard Scheffenegger	2021-05-10	1	-0/+2
	Recover from excessive losses without reverting to a retransmission timeout (RTO). Disabled by default; enable with sysctl net.inet.tcp.do_lrd=1. Reviewed By: #transport, rrs, tuexen, #manpages Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D28931
*	tcp:Host cache and rack ending up with incorrect values.	Randall Stewart	2021-05-10	1	-56/+62
	The hostcache up to now has been updated in the discard callback but without checking if we are all done (the race where there is more than one call and the counter has not yet reached zero). This means that when the race occurs, we end up calling the hc_update more than once. Also alternate stacks can keep their srtt/rttvar in different formats (for example, rack keeps its values in microseconds). Since we call the hc_update before the stack fini(), the values will be in the wrong format. Rack, on the other hand, needs to convert items pulled from the hostcache into its internal format, else it may end up with very incorrect values from the hostcache. In the process let's commonize the update mechanism for srtt/rttvar since we now have more than one place that needs to call it. Reviewed by: Michael Tuexen Sponsored by: Netflix Inc Differential Revision: https://reviews.freebsd.org/D30172
*	This brings into sync FreeBSD with the netflix versions of rack and bbr.	Randall Stewart	2021-05-06	1	-0/+105
	This fixes several breakages (panics) since the tcp_lro code was committed that have been reported. Quite a few new features are now in rack (perfecting of DGP -- Dynamic Goodput Pacing among the largest). There is also support for ack-war prevention. Documents coming soon on rack. Sponsored by: Netflix Reviewed by: rscheff, mtuexen Differential Revision: https://reviews.freebsd.org/D30036
*	Path MTU discovery hooks for offloaded TCP connections.	Navdeep Parhar	2021-04-21	1	-26/+54
	Notify the TOE driver when an ICMP type 3 code 4 (Fragmentation needed and DF set) message is received for an offloaded connection. This gives the driver an opportunity to lower the path MTU for the connection and resume transmission, much like what the kernel does for the connections that it handles. Reviewed by: glebius@ Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D29755
*	Add TCP LRO support for VLAN and VxLAN.	Hans Petter Selasky	2021-04-20	1	-5/+0
	This change makes the TCP LRO code more generic and flexible with regard to supporting multiple different TCP encapsulation protocols and in general lays the ground for broader TCP LRO support. The main job of the TCP LRO code is to merge TCP packets for the same flow, to reduce the number of calls to upper layers. This reduces CPU usage and increases performance, due to being able to send larger TSO offloaded data chunks at a time. Basically the TCP LRO makes it possible to avoid per-packet interaction by the host CPU. Because the current TCP LRO code was tightly bound and optimized for TCP/IP over ethernet only, several larger changes were needed. Also a minor bug was fixed in the flushing mechanism for inactive entries, where the expire time, "le->mtime" was not always properly set. To avoid having to re-run time consuming regression tests for every change, it was chosen to squash the following list of changes into a single commit:
	- Refactor parsing of all address information into the "lro_parser" structure. This makes it easy to reuse the parsing code for inner headers.
	- Speedup header data comparison. Don't compare field by field, but instead use an unsigned long array, where the fields get packed.
	- Refactor the IPv4/TCP/UDP checksum computations, so that they may be computed recursively, only applying deltas as the result of updating payload data.
	- Make smaller inline functions doing one operation at a time instead of big functions having repeated code.
	- Refactor the TCP ACK compression code to only execute once per TCP LRO flush. This gives a minor performance improvement and keeps the code simple.
	- Use sbintime() for all time-keeping. This change also fixes flushing of inactive entries.
	- Try to shrink the size of the LRO entry, because it is frequently zeroed.
	- Removed unused TCP LRO macros.
	- Cleanup unused TCP LRO statistics counters while at it.
	- Try to use __predict_true() and __predict_false() to optimise CPU branch predictions.
	Bump the __FreeBSD_version due to changing the "lro_ctrl" structure. Tested by: Netflix Reviewed by: rrs (transport) Differential Revision: https://reviews.freebsd.org/D29564 MFC after: 2 weeks Sponsored by: Mellanox Technologies // NVIDIA Networking
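The idea of "only applying deltas" to a checksum can be illustrated with the standard incremental update for the 16-bit ones' complement Internet checksum (RFC 1624). The Python sketch below is a minimal model of that technique, not the LRO implementation; the function names `csum16` and `csum_update` are hypothetical.

```python
def _fold(total: int) -> int:
    """Fold carries back into the low 16 bits (ones' complement addition)."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def csum16(words) -> int:
    """Full Internet checksum over a sequence of 16-bit words."""
    return ~_fold(sum(words)) & 0xFFFF


def csum_update(old_csum: int, old_word: int, new_word: int) -> int:
    """Apply only the delta for one changed 16-bit word (RFC 1624, eqn. 3)."""
    total = (~old_csum & 0xFFFF) + (~old_word & 0xFFFF) + new_word
    return ~_fold(total) & 0xFFFF
```

When a header field such as a length changes after merging payload, `csum_update` adjusts the existing checksum without summing the whole header again.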
*	tcp: add support for TCP over UDP	Michael Tuexen	2021-04-18	1	-20/+442
	Adding support for TCP over UDP allows communication with TCP stacks which can be implemented in userspace without requiring special privileges or specific support by the OS. This is joint work with rrs. Reviewed by: rrs Sponsored by: Netflix, Inc. MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D29469
*	tcp_respond(): fix assertion, should have been done in 08d9c920275.	Gleb Smirnoff	2021-04-16	1	-1/+1
*	tcp_input/syncache: acquire only read lock on PCB for SYN,!ACK packets	Gleb Smirnoff	2021-04-12	1	-1/+1
	When a packet is a SYN packet, we don't need to modify any existing PCB. Normally SYN arrives on a listening socket, we either create a syncache entry or generate syncookie, but we don't modify anything with the listening socket or associated PCB. Thus create a new PCB lookup mode - rlock if listening. This removes the primary contention point under SYN flood - the listening socket PCB. Sidenote: when SYN arrives on a synchronized connection, we still don't need write access to PCB to send a challenge ACK or just to drop. There is only one exception - tcptw recycling. However, existing entanglement of tcp_input + stacks doesn't allow making this change small. Consider this patch a first approach to the problem. Reviewed by: rrs Differential revision: https://reviews.freebsd.org/D29576
*	Update the LRO processing code so that we can support	Randall Stewart	2021-02-17	1	-0/+7
	further CPU enhancements for compressed acks. These are acks that are compressed into an mbuf. The transport has to be aware of how to process these, and an upcoming update to rack will do so. You need the rack changes to actually test and validate these since if the transport does not support mbuf compression, then the old code paths stay in place. We do in this commit take out the concept of logging if you don't have a lock (which was quite dangerous and was only for some early debugging but has been left in the code). Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D28374
*	Address panic with PRR due to missed initialization of recover_fs	Richard Scheffenegger	2021-01-20	1	-0/+10
	Summary: When using the base stack in conjunction with RACK, it appears that infrequently, ++tp->t_dupacks is instantly larger than tcprexmtthresh. This leaves the recover flightsize (sackhint.recover_fs) uninitialized, leading to a div/0 panic. Address this by properly initializing the variable just prior to first use, if it is not properly initialized. In order to prevent stale information from a prior recovery from negatively impacting the PRR calculations in this event, also clear recover_fs once loss recovery is finished. Finally, improve the readability of the initialization of recover_fs when t_dupacks == tcprexmtthresh by adjusting the indentation and using the max(1, snd_nxt - snd_una) macro. Reviewers: rrs, kbowling, tuexen, jtl, #transport, gnn!, jmg, manu, #manpages Reviewed By: rrs, kbowling, #transport Subscribers: bdrewery, andrew, rpokala, ae, emaste, bz, bcran, #linuxkpi, imp, melifaro Differential Revision: https://reviews.freebsd.org/D28114
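Why the max(1, snd_nxt - snd_una) guard matters can be seen from the proportional part of the PRR formula in RFC 6937, which divides by the recovery flight size. The following Python sketch is a simplified model under stated assumptions: the function names are hypothetical, sequence-number wraparound is ignored, and only the proportional branch of PRR is shown.

```python
def init_recover_fs(snd_nxt: int, snd_una: int) -> int:
    """Flight size at the start of recovery, never allowed to be zero."""
    return max(1, snd_nxt - snd_una)


def prr_sndcnt(prr_delivered: int, ssthresh: int,
               recover_fs: int, prr_out: int) -> int:
    """Proportional part of PRR: CEIL(prr_delivered * ssthresh / RecoverFS) - prr_out."""
    # Integer ceiling division; a zero recover_fs would raise ZeroDivisionError,
    # which is the userland analogue of the div/0 panic described above.
    return -(-(prr_delivered * ssthresh) // recover_fs) - prr_out
```

With the guard in place, even a degenerate case where snd_nxt equals snd_una yields a divisor of 1 rather than 0.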
*	tcp: add sysctl to tolerate TCP segments missing timestamps	Michael Tuexen	2021-01-14	1	-0/+5
	When timestamp support has been negotiated, TCP segments received without a timestamp should be discarded. However, there are broken TCP implementations (for example, stacks used by Omniswitch 63xx and 64xx models), which send TCP segments without timestamps although they negotiated timestamp support. This patch adds a sysctl variable which tolerates such TCP segments and allows interoperation with broken stacks. Reviewed by: jtl@, rscheff@ Differential Revision: https://reviews.freebsd.org/D28142 Sponsored by: Netflix, Inc. PR: 252449 MFC after: 1 week
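The accept/discard decision described here reduces to a small predicate. The Python sketch below is a deliberately simplified model: the function and parameter names are hypothetical, and any special-casing the real stack applies to particular segment types is left out.

```python
def accept_segment(ts_negotiated: bool, has_ts_option: bool,
                   tolerate_missing_ts: bool) -> bool:
    """Decide whether a received segment passes the timestamp check."""
    if ts_negotiated and not has_ts_option:
        # Strict behaviour is to discard; the tolerance knob relaxes this.
        return tolerate_missing_ts
    return True
```

Segments on connections that never negotiated timestamps are unaffected by the knob in this model.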
*	Save the current TCP pacing rate in t_pacing_rate.	John Baldwin	2020-10-29	1	-0/+1
	Reviewed by: gallatin, gnn Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D26875 Notes: svn path=/head/; revision=367122
*	Extend netstat to display TCP stack and detailed congestion state (2)	Richard Scheffenegger	2020-10-09	1	-0/+7
	Add the "-c" option, used to show detailed per-connection congestion control state for TCP sessions. This is one summary patch, which adds the relevant variables into xtcpcb. As previous "spare" space is used, these changes are ABI compatible. Reviewed by: tuexen MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D26518 Notes: svn path=/head/; revision=366567
*	TCP: send full initial window when timestamps are in use	Richard Scheffenegger	2020-09-25	1	-7/+6
	The fastpath in tcp_output tries to send out full segments, and avoid sending partial segments by comparing against the static t_maxseg variable. That value does not consider TCP options like timestamps, while the initial window calculation is using the correct dynamic tcp_maxseg() function. Due to this interaction, the last, full size segment is considered too short and not sent out immediately. Reviewed by: tuexen MFC after: 2 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D26478 Notes: svn path=/head/; revision=366150
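The mismatch between the two segment-size values can be made concrete with a small model. This Python sketch assumes an MSS of 1460 bytes and a 12-byte padded timestamp option; the helper names other than tcp_maxseg are hypothetical and the logic is far simpler than tcp_output.

```python
TS_OPTION_LEN = 12  # timestamp option padded to a 4-byte boundary


def tcp_maxseg(t_maxseg: int, timestamps: bool) -> int:
    """Payload that actually fits in a segment once options are accounted for."""
    return t_maxseg - (TS_OPTION_LEN if timestamps else 0)


def is_full_segment_old(length: int, t_maxseg: int) -> bool:
    """Comparison against the static value, ignoring options."""
    return length >= t_maxseg


def is_full_segment_new(length: int, t_maxseg: int, timestamps: bool) -> bool:
    """Comparison against the option-aware value."""
    return length >= tcp_maxseg(t_maxseg, timestamps)
```

With timestamps on, a 1448-byte segment is as large as a segment can be, yet the static comparison against 1460 treats it as partial.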
*	Export the name of the congestion control. This will be used by sockstat	Michael Tuexen	2020-09-13	1	-0/+2
	and netstat. Reviewed by: rscheff MFC after: 1 week Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D26412 Notes: svn path=/head/; revision=365686
*	net: clean up empty lines in .c and .h files	Mateusz Guzik	2020-09-01	1	-3/+0
	Notes: svn path=/head/; revision=365071
*	The recent changes to move the ref count increment	Randall Stewart	2020-07-31	1	-1/+8
	back from the end of the function created an issue. If one of the routines returns NULL during setup we have inp's with extra references (which is why the increment was at the end). Also the stack switch return code was being ignored and actually has meaning: if the stack cannot take over, it should return NULL. Fix both of these situations by being sure to test the return code and of course in any case of return NULL (there are 3) make sure we properly reduce the ref count. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D25903 Notes: svn path=/head/; revision=363725
*	Fix KASSERT during tcp_newtcpcb when low on memory	Richard Scheffenegger	2020-07-07	1	-6/+6
	While testing with system default cc set to cubic, and running a memory exhaustion validation, FreeBSD panics for a missing inpcb reference / lock. Reviewed by: rgrimes (mentor), tuexen (mentor) Approved by: rgrimes (mentor), tuexen (mentor) MFC after: 3 weeks Sponsored by: NetApp, Inc. Differential Revision: https://reviews.freebsd.org/D25583 Notes: svn path=/head/; revision=362988
*	Use fib[46]_lookup() in mtu calculations.	Alexander V. Chernikov	2020-05-28	1	-12/+10
	fib[46]_lookup_nh_ represents the pre-epoch generation of the fib api, providing fewer guarantees about pointer validity and requiring on-stack data copying. Conversion is straightforward, as the only 2 differences are the requirement of running in network epoch and the need to handle the RTF_GATEWAY case in the caller code. Differential Revision: https://reviews.freebsd.org/D24974 Notes: svn path=/head/; revision=361576
*	This change does a small prepratory step in getting the	Randall Stewart	2020-04-27	1	-0/+29
	latest rack and bbr in from the NF repo. When those come in, the OOB data handling will be fixed where syzkaller crashes. Differential Revision: https://reviews.freebsd.org/D24575 Notes: svn path=/head/; revision=360385
*	Convert route caching to nexthop caching.	Alexander V. Chernikov	2020-04-25	1	-3/+4
	This change is built on top of nexthop objects introduced in r359823. Nexthops are separate datastructures, containing all necessary information to perform packet forwarding such as gateway interface and mtu. Nexthops are shared among the routes, providing more pre-computed cache-efficient data while requiring less memory. Splitting the LPM code and the attached data solves multiple long-standing problems in the routing layer, drastically reduces the coupling with other parts of the stack and allows faster lookup algorithms to be introduced transparently. Route caching was (re)introduced to minimise (slow) routing lookups, allowing for notably better performance for large TCP senders. Caching works by acquiring rtentry reference, which is protected by per-rtentry mutex. If the routing table is changed (checked by comparing the rtable generation id) or link goes down, cache record gets withdrawn. Nexthops have the same reference counting interface, backed by refcount(9). This change merely replaces rtentry with the actual forwarding nexthop as a cached object, which is mostly mechanical. Other moving parts like cache cleanup on rtable change remain the same. Differential Revision: https://reviews.freebsd.org/D24340 Notes: svn path=/head/; revision=360292
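The generation-id check that invalidates a cached forwarding entry can be sketched as follows. This is a toy Python model with hypothetical class and method names; reference counting, locking, and link-state handling from the description are omitted.

```python
class RoutingTable:
    """Toy routing table that bumps a generation id on every change."""

    def __init__(self):
        self.gen = 0
        self.nexthops = {}
        self.lookups = 0  # counts slow-path lookups, for illustration

    def set_route(self, dst, nexthop):
        self.nexthops[dst] = nexthop
        self.gen += 1

    def lookup(self, dst):
        self.lookups += 1
        return self.nexthops.get(dst)


class CachedNexthop:
    """Per-connection cache of the nexthop, validated by generation id."""

    def __init__(self, table, dst):
        self.table = table
        self.dst = dst
        self.nexthop = None
        self.gen = -1  # never matches a real generation, forces first lookup

    def get(self):
        if self.gen != self.table.gen:
            # Table changed since we cached: withdraw and look up again.
            self.nexthop = self.table.lookup(self.dst)
            self.gen = self.table.gen
        return self.nexthop
```

Repeated calls to `get()` hit the slow lookup only once per routing-table change, which is the benefit the commit message attributes to caching for large TCP senders.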
*	Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)	Pawel Biernacki	2020-02-26	1	-24/+31
	r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are still not MPSAFE (or already are but aren’t properly marked). Use it in preparation for a general review of all nodes. This is a non-functional change that adds annotations to SYSCTL_NODE and SYSCTL_PROC nodes using one of the soon-to-be-required flags. Mark all obvious cases as MPSAFE. All entries that haven't been marked as MPSAFE before are by default marked as NEEDGIANT. Approved by: kib (mentor, blanket) Commented by: kib, gallatin, melifaro Differential Revision: https://reviews.freebsd.org/D23718 Notes: svn path=/head/; revision=358333
*	White space cleanup -- remove trailing tab's or spaces	Randall Stewart	2020-02-12	1	-23/+23
	from any line. Sponsored by: Netflix Inc. Notes: svn path=/head/; revision=357818
*	Add documenting NET_EPOCH_ASSERT() to tcp_drop().	Gleb Smirnoff	2020-01-22	1	-0/+1
	Notes: svn path=/head/; revision=356970
*	Add some documenting NET_EPOCH_ASSERTs.	Gleb Smirnoff	2020-01-22	1	-0/+1
	Notes: svn path=/head/; revision=356969
*	Fix yet another regression from r354484. Error code from cr_cansee()	Gleb Smirnoff	2020-01-13	1	-4/+6
	aliases with hard error from other operations. Reported by: flo Notes: svn path=/head/; revision=356702
*	vnet: virtualise more network stack sysctls.	Bjoern A. Zeeb	2020-01-08	1	-1/+1
	Virtualise tcp_always_keepalive, TCP and UDP log_in_vain. All three are set in the netoptions startup script, which we would love to run for VNETs as well [1]. While virtualising the log_in_vain sysctls seems pointless at first for as long as the kernel message buffer is not virtualised, it at least allows an administrator to debug the base system or an individual jail if needed without turning the logging on for all jails running on a system. PR: 243193 [1] MFC after: 2 weeks Notes: svn path=/head/; revision=356527
*	This commit is a bit of a re-arrange of deck chairs. It	Randall Stewart	2019-12-17	1	-0/+80
	gets both rack and bbr ready for the completion of the STATs framework in FreeBSD. For now, if you don't have both NF_stats and stats on, it disables them. As soon as the rest of the stats framework lands we can remove that restriction and then just use stats when defined. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D22479 Notes: svn path=/head/; revision=355859
*	Fix regression from r354484. Don't leak pcb lock if cr_canseeinpcb()	Gleb Smirnoff	2019-12-04	1	-2/+4
	returns non-zero. PR: 242415 Notes: svn path=/head/; revision=355405
*	Make use of the stats(3) framework in the TCP stack.	Edward Tomasz Napierala	2019-12-02	1	-0/+15
	This makes it possible to retrieve per-connection statistical information such as the receive window size, RTT, or goodput, using a newly added TCP_STATS getsockopt(2) option, and extract them using the stats_voistat_fetch(3) API. See the net/tcprtt port for an example consumer of this API. Compared to the existing TCP_INFO system, the main differences are that this mechanism is easy to extend without breaking ABI, and provides statistical information instead of raw "snapshots" of values at a given point in time. stats(3) is more generic and can be used in both userland and the kernel. Reviewed by: thj Tested by: thj Obtained from: Netflix Relnotes: yes Sponsored by: Klara Inc, Netflix Differential Revision: https://reviews.freebsd.org/D20655 Notes: svn path=/head/; revision=355304
*	Now that there is no R/W lock on PCB list the pcblist sysctls	Gleb Smirnoff	2019-11-07	1	-64/+30
	handlers can be greatly simplified. All the previous double cycling and complex locking was added to avoid these functions holding global PCB locks for an extended period of time, preventing addition of new entries. Notes: svn path=/head/; revision=354484
*	Since r353292 on input path we are always in network epoch, when	Gleb Smirnoff	2019-11-07	1	-6/+6
	we lookup PCBs. Thus, do not enter epoch recursively in in_pcblookup_hash() and in6_pcblookup_hash(). Same applies to tcp_ctlinput() and tcp6_ctlinput(). This leaves several sysctl(9) handlers that return PCB credentials unprotected. Add epoch enter/exit to all of them. Differential Revision: https://reviews.freebsd.org/D22197 Notes: svn path=/head/; revision=354477
*	Mechanically convert INP_INFO_RLOCK() to NET_EPOCH_ENTER().	Gleb Smirnoff	2019-11-07	1	-8/+8
	Remove a few outdated comments and extraneous assertions. No functional change here. Notes: svn path=/head/; revision=354421
*	Replacing MD5 by SipHash improves the performance of the TCP time stamp	Michael Tuexen	2019-09-28	1	-17/+19
	initialisation, which is important when the host is dealing with a SYN flood. This affects the computation of the initial TCP sequence number for the client side. This has been discussed with secteam@. Reviewed by: gallatin@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D21616 Notes: svn path=/head/; revision=352843
*	This adds in the missing counter initialization which	Randall Stewart	2019-09-06	1	-0/+7
	I had forgotten to bring over. Oops. Differential Revision: https://reviews.freebsd.org/D21127 Notes: svn path=/head/; revision=351951
*	Add kernel-side support for in-kernel TLS.	John Baldwin	2019-08-27	1	-0/+118
	KTLS adds support for in-kernel framing and encryption of Transport Layer Security (1.0-1.2) data on TCP sockets. KTLS only supports offload of TLS for transmitted data. Key negotiation must still be performed in userland. Once completed, transmit session keys for a connection are provided to the kernel via a new TCP_TXTLS_ENABLE socket option. All subsequent data transmitted on the socket is placed into TLS frames and encrypted using the supplied keys. Any data written to a KTLS-enabled socket via write(2), aio_write(2), or sendfile(2) is assumed to be application data and is encoded in TLS frames with an application data type. Individual records can be sent with a custom type (e.g. handshake messages) via sendmsg(2) with a new control message (TLS_SET_RECORD_TYPE) specifying the record type. At present, rekeying is not supported though the in-kernel framework should support rekeying. KTLS makes use of the recently added unmapped mbufs to store TLS frames in the socket buffer. Each TLS frame is described by a single ext_pgs mbuf. The ext_pgs structure contains the header of the TLS record (and trailer for encrypted records) as well as references to the associated TLS session. KTLS supports two primary methods of encrypting TLS frames: software TLS and ifnet TLS. Software TLS marks mbufs holding socket data as not ready via M_NOTREADY similar to sendfile(2) when TLS framing information is added to an unmapped mbuf in ktls_frame(). ktls_enqueue() is then called to schedule TLS frames for encryption.
In the case of sendfile(2), sendfile_iodone() calls ktls_enqueue() instead of pru_ready(), leaving the mbufs marked M_NOTREADY until encryption is completed. For other writes (vn_sendfile when pages are available, write(2), etc.), the PRUS_NOTREADY is set when invoking pru_send() along with invoking ktls_enqueue(). A pool of worker threads (the "KTLS" kernel process) encrypts TLS frames queued via ktls_enqueue(). Each TLS frame is temporarily mapped using the direct map and passed to a software encryption backend to perform the actual encryption. (Note: The use of PHYS_TO_DMAP could be replaced with sf_bufs if someone wished to make this work on architectures without a direct map.) KTLS supports pluggable software encryption backends. Internally, Netflix uses proprietary pure-software backends. This commit includes a simple backend in a new ktls_ocf.ko module that uses the kernel's OpenCrypto framework to provide AES-GCM encryption of TLS frames. As a result, software TLS is now a bit of a misnomer as it can make use of hardware crypto accelerators. Once software encryption has finished, the TLS frame mbufs are marked ready via pru_ready(). At this point, the encrypted data appears as regular payload to the TCP stack stored in unmapped mbufs. ifnet TLS permits a NIC to offload the TLS encryption and TCP segmentation. In this mode, a new send tag type (IF_SND_TAG_TYPE_TLS) is allocated on the interface a socket is routed over and associated with a TLS session. TLS records for a TLS session using ifnet TLS are not marked M_NOTREADY but are passed down the stack unencrypted. The ip_output_send() and ip6_output_send() helper functions that apply send tags to outbound IP packets verify that the send tag of the TLS record matches the outbound interface. If so, the packet is tagged with the TLS send tag and sent to the interface. The NIC device driver must recognize packets with the TLS send tag and schedule them for TLS encryption and TCP segmentation.
If the outbound interface does not match the interface in the TLS send tag, the packet is dropped. In addition, a task is scheduled to refresh the TLS send tag for the TLS session. If a new TLS send tag cannot be allocated, the connection is dropped. If a new TLS send tag is allocated, however, subsequent packets will be tagged with the correct TLS send tag. (This latter case has been tested by configuring both ports of a Chelsio T6 in a lagg and failing over from one port to another. As the connections migrated to the new port, new TLS send tags were allocated for the new port and connections resumed without being dropped.) ifnet TLS can be enabled and disabled on supported network interfaces via new '[-]txtls[46]' options to ifconfig(8). ifnet TLS is supported across both vlan devices and lagg interfaces using failover or lacp with flowid enabled. Applications may request the current KTLS mode of a connection via a new TCP_TXTLS_MODE socket option. They can also use this socket option to toggle between software and ifnet TLS modes. In addition, a testing tool is available in tools/tools/switch_tls. This is modeled on tcpdrop and uses similar syntax. However, instead of dropping connections, -s is used to force KTLS connections to switch to software TLS and -i is used to switch to ifnet TLS. Various sysctls and counters are available under the kern.ipc.tls sysctl node. The kern.ipc.tls.enable node must be set to true to enable KTLS (it is off by default). The use of unmapped mbufs must also be enabled via kern.ipc.mb_use_ext_pgs to enable KTLS. KTLS is enabled via the KERN_TLS kernel option.
This patch is the culmination of years of work by several folks including Scott Long and Randall Stewart for the original design and implementation; Drew Gallatin for several optimizations including the use of ext_pgs mbufs, the M_NOTREADY mechanism for TLS records awaiting software encryption, and pluggable software crypto backends; and John Baldwin for modifications to support hardware TLS offload. Reviewed by: gallatin, hselasky, rrs Obtained from: Netflix Sponsored by: Netflix, Chelsio Communications Differential Revision: https://reviews.freebsd.org/D21277 Notes: svn path=/head/; revision=351522
*	Add a sysctl variable ts_offset_per_conn to change the computation	Michael Tuexen	2019-07-23	1	-1/+16
	of the TCP TS offset from taking the IP addresses and the TCP port numbers into account to a version just taking only the IP addresses into account. This works around broken middleboxes or endpoints. The default is to keep the behaviour, which is also the behaviour recommended in RFC 7323. Reported by: devgs@ukr.net Reviewed by: rrs@ MFC after: 2 weeks Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D20980 Notes: svn path=/head/; revision=350265
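The difference between the two offset computations is just which fields feed the keyed hash. The Python sketch below is an illustration only: it uses keyed BLAKE2b from the standard library as a stand-in for the kernel's keyed hash, and the function name `ts_offset` and its parameters are hypothetical.

```python
import hashlib
import struct


def ts_offset(secret: bytes, laddr: bytes, faddr: bytes,
              lport: int, fport: int, per_conn: bool = True) -> int:
    """Derive a 32-bit timestamp offset from a secret and connection fields.

    Assumption: keyed BLAKE2b stands in for the kernel's keyed hash.
    """
    h = hashlib.blake2b(key=secret, digest_size=4)
    h.update(laddr + faddr)
    if per_conn:
        # Default behaviour: ports are included, so each connection differs.
        h.update(struct.pack("!HH", lport, fport))
    return int.from_bytes(h.digest(), "big")
```

With `per_conn=False`, every connection between the same pair of addresses shares one offset, which is the behaviour the workaround relies on for broken middleboxes or endpoints.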
*	Reject attempts to register a TCP stack being unloaded.	John Baldwin	2019-06-27	1	-1/+5
	Reviewed by: gallatin MFC after: 2 weeks Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D20617 Notes: svn path=/head/; revision=349474
*	Add sysctl variable net.inet.tcp.rexmit_initial for setting RTO.Initial	Michael Tuexen	2019-03-23	1	-2/+5
	used by TCP. Reviewed by: rrs@, 0mp@ Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D19355 Notes: svn path=/head/; revision=345458
*	Various cleanups to the management of multiple TCP stacks.	John Baldwin	2019-02-27	1	-29/+14
	- Use strlcpy() with sizeof() instead of strncpy().
	- Simplify initialization of TCP functions structures. init_tcp_functions() was already called before the first call to register a stack. Just inline the work in the SYSINIT and remove the racy helper variable. Instead, KASSERT that the rw lock is initialized when registering a stack.
	- Protect the default stack via a direct pointer comparison. The default stack uses the name "freebsd" instead of "default" so this protection wasn't working for the default stack anyway.
	Reviewed by: rrs Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19152 Notes: svn path=/head/; revision=344632
*	Plug some networking sysctl leaks.	Mark Johnston	2018-11-22	1	-2/+3
	Various network protocol sysctl handlers were not zero-filling their output buffers and thus would export uninitialized stack memory to userland. Fix a number of such handlers. Reported by: Thomas Barabosch, Fraunhofer FKIE Reviewed by: tuexen MFC after: 3 days Security: kernel memory disclosure Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18301 Notes: svn path=/head/; revision=340783
*	Use arc4rand() instead of read_random() in the SCTP and TCP code.	Michael Tuexen	2018-08-23	1	-2/+2
	This was suggested by jmg@. Reviewed by: delphij@, jmg@, jtl@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16860 Notes: svn path=/head/; revision=338273
*	Don't use the explicit number 32 for the length of the secrets,	Michael Tuexen	2018-08-23	1	-6/+10
	use sizeof() or explicit #defines instead. No functional change. This was suggested by jmg@. MFC after: 1 month XMFC with: r338053 Sponsored by: Netflix, Inc. Notes: svn path=/head/; revision=338241
*	Enabling the IPPROTO_IPV6 level socket option IPV6_USE_MIN_MTU on a TCP	Michael Tuexen	2018-08-21	1	-0/+3
	socket resulted in sending fragmented IPv6 packets. This is fixed by reducing the MSS to the appropriate value. In addition, if the socket option is set before the handshake happens, announce this MSS to the peer. This is not strictly required, but done since TCP is conservative. PR: 173444 Reviewed by: bz@, rrs@ MFC after: 1 month Sponsored by: Netflix, Inc. Differential Revision: https://reviews.freebsd.org/D16796 Notes: svn path=/head/; revision=338138
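The "appropriate value" follows from simple arithmetic on the IPv6 minimum MTU of 1280 bytes. The Python sketch below assumes a 40-byte IPv6 header and a 20-byte TCP header with no extension headers or TCP options; the function name `mss_for` is hypothetical.

```python
IPV6_MMTU = 1280     # minimum MTU every IPv6 link must support
IPV6_HDR_LEN = 40    # fixed IPv6 header, no extension headers assumed
TCP_HDR_LEN = 20     # TCP header without options


def mss_for(path_mtu: int, use_min_mtu: bool) -> int:
    """MSS that avoids fragmentation for the MTU actually used for sending."""
    mtu = min(path_mtu, IPV6_MMTU) if use_min_mtu else path_mtu
    return mtu - IPV6_HDR_LEN - TCP_HDR_LEN
```

On a 1500-byte link this gives an MSS of 1440 normally and 1220 when the minimum MTU is forced, so segments sized for 1440 would have to be fragmented in the latter case.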
*	This change represents a substantial restructure of the way we	Randall Stewart	2018-08-20	1	-1/+1
	reassemble inbound TCP segments. The old algorithm just blindly dropped in segments without coalescing. This meant that every segment could take up greater and greater room on the linked list of segments. This of course is now subject to a tighter limit (100) of segments which in a high BDP situation will cause us to be a lot more inefficient as we drop segments beyond 100 entries that we receive. What this restructure does is cause the reassembly buffer to coalesce segments putting an emphasis on the two common cases (which avoid walking the list of segments) i.e. where we add to the back of the queue of segments and where we add to the front. We also have the reassembly buffer supporting a couple of debug options (black box logging as well as counters for code coverage). These are compiled out by default but can be added by uncommenting the defines. Sponsored by: Netflix Inc. Differential Revision: https://reviews.freebsd.org/D16626 Notes: svn path=/head/; revision=338102
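The coalescing strategy, with its two fast paths for the back and the front of the queue, can be sketched on plain byte ranges. This Python model is an assumption-laden illustration: segments are half-open `[start, end)` ranges, sequence-number wraparound and mbuf handling are ignored, and `insert_segment` is a hypothetical name.

```python
def insert_segment(queue, start, end):
    """Insert [start, end) into a sorted queue of disjoint [start, end) ranges.

    Ranges that touch or overlap are coalesced into a single entry.
    """
    # Fast path 1: new data directly extends the last entry.
    if queue and start == queue[-1][1]:
        queue[-1][1] = end
        return
    # Fast path 2: new data ends exactly where the first entry begins.
    if queue and end == queue[0][0]:
        queue[0][0] = start
        return
    # General path: walk the queue and merge anything touching the new range.
    merged = []
    placed = False
    for s, e in queue:
        if e < start:
            merged.append([s, e])          # entirely before the new range
        elif end < s:
            if not placed:
                merged.append([start, end])
                placed = True
            merged.append([s, e])          # entirely after the new range
        else:
            start, end = min(s, start), max(e, end)  # overlap or adjacency
    if not placed:
        merged.append([start, end])
    queue[:] = merged
```

In-order arrivals and the common hole-filling-at-the-front case never walk the list, so the queue length stays small even when many segments are in flight.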