aboutsummaryrefslogtreecommitdiff
path: root/sys/netinet/ip_divert.c
Commit message (Collapse)AuthorAgeFilesLines
* Switch the entire IPv4 stack to keep the IP packet headerGleb Smirnoff2012-10-221-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | in network byte order. Any host byte order processing is done in local variables and host byte order values are never[1] written to a packet. After this change a packet processed by the stack isn't modified at all[2] except for TTL. After this change a network stack hacker doesn't need to scratch his head trying to figure out what is the byte order at the given place in the stack. [1] One exception still remains. The raw sockets convert host byte order before pass a packet to an application. Probably this would remain for ages for compatibility. [2] The ip_input() still subtructs header len from ip->ip_len, but this is planned to be fixed soon. Reviewed by: luigi, Maxim Dounin <mdounin mdounin.ru> Tested by: ray, Olivier Cochard-Labbe <olivier cochard.me> Notes: svn path=/head/; revision=241913
* We don't need to convert ip6_len to host byte order beforeGleb Smirnoff2012-10-151-2/+0
| | | | | | | | | | | ip6_output(), the IPv6 stack is working in net byte order. The reason this code worked before is that ip6_output() doesn't look at ip6_plen at all and recalculates it based on mbuf length. Notes: svn path=/head/; revision=241575
* Revert previous commit...Kevin Lo2012-10-101-1/+1
| | | | | | | Pointyhat to: kevlo (myself) Notes: svn path=/head/; revision=241394
* Prefer NULL over 0 for pointersKevin Lo2012-10-091-1/+1
| | | | Notes: svn path=/head/; revision=241370
* After r241245 it appeared that in_delayed_cksum(), which still expectsGleb Smirnoff2012-10-081-2/+0
| | | | | | | | | | | | | | | | | host byte order, was sometimes called with net byte order. Since we are moving towards net byte order throughout the stack, the function was converted to expect net byte order, and its consumers fixed appropriately: - ip_output(), ipfilter(4) not changed, since already call in_delayed_cksum() with header in net byte order. - divert(4), ng_nat(4), ipfw_nat(4) now don't need to swap byte order there and back. - mrouting code and IPv6 ipsec now need to switch byte order there and back, but I hope, this is temporary solution. - In ipsec(4) shifted switch to net byte order prior to in_delayed_cksum(). - pf_route() catches up on r241245 changes to ip_output(). Notes: svn path=/head/; revision=241344
* No reason to play with IP header before calling sctp_delayed_cksum()Gleb Smirnoff2012-10-081-2/+0
| | | | | | | with offset beyond the IP header. Notes: svn path=/head/; revision=241342
* Make #error messages string-literals and remove punctuation.Bjoern A. Zeeb2012-01-221-1/+1
| | | | | | | | | Reported by: bde (for ip_divert) Reviewed by: bde MFC after: 3 days Notes: svn path=/head/; revision=230452
* Fix ip_divert handling of inet and inet6 and module building some more.Bjoern A. Zeeb2012-01-221-3/+1
| | | | | | | | | | | Properly sort the "carp" case in modules/Makefile after it was renamed. Reported by: bde (most) Reviewed by: bde MFC after: 3 days Notes: svn path=/head/; revision=230443
* Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs.Ed Schouten2011-11-071-1/+2
| | | | | | | | | The SYSCTL_NODE macro defines a list that stores all child-elements of that node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why it shouldn't be static. Notes: svn path=/head/; revision=227309
* Add missing break; in r223593.Gleb Smirnoff2011-08-011-0/+1
| | | | | | | | | Submitted by: sem Pointy hat to: glebius Approved by: re (kib) Notes: svn path=/head/; revision=224575
* Add possibility to pass IPv6 packets to a divert(4) socket.Gleb Smirnoff2011-06-271-54/+103
| | | | | | | Submitted by: sem Notes: svn path=/head/; revision=223593
* Implement a CPU-affine TCP and UDP connection lookup data structure,Robert Watson2011-06-061-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | struct inpcbgroup. pcbgroups, or "connection groups", supplement the existing inpcbinfo connection hash table, which when pcbgroups are enabled, might now be thought of more usefully as a per-protocol 4-tuple reservation table. Connections are assigned to connection groups base on a hash of their 4-tuple; wildcard sockets require special handling, and are members of all connection groups. During a connection lookup, a per-connection group lock is employed rather than the global pcbinfo lock. By aligning connection groups with input path processing, connection groups take on an effective CPU affinity, especially when aligned with RSS work placement (see a forthcoming commit for details). This eliminates cache line migration associated with global, protocol-layer data structures in steady state TCP and UDP processing (with the exception of protocol-layer statistics; further commit to follow). Elements of this approach were inspired by Willman, Rixner, and Cox's 2006 USENIX paper, "An Evaluation of Network Stack Parallelization Strategies in Modern Operating Systems". However, there are also significant differences: we maintain the inpcb lock, rather than using the connection group lock for per-connection state. Likewise, the focus of this implementation is alignment with NIC packet distribution strategies such as RSS, rather than pure software strategies. Despite that focus, software distribution is supported through the parallel netisr implementation, and works well in configurations where the number of hardware threads is greater than the number of NIC input queues, such as in the RMI XLR threaded MIPS architecture. Another important difference is the continued maintenance of existing hash tables as "reservation tables" -- these are useful both to distinguish the resource allocation aspect of protocol name management and the more common-case lookup aspect. In configurations where connection tables are aligned with hardware hashes, it is desirable to use the traditional lookup tables for loopback or encapsulated traffic rather than take the expense of hardware hashes that are hard to implement efficiently in software (such as RSS Toeplitz). Connection group support is enabled by compiling "options PCBGROUP" into your kernel configuration; for the time being, this is an experimental feature, and hence is not enabled by default. Subject to the limited MFCability of change dependencies in inpcb, and its change to the inpcbinfo init function signature, this change in principle could be merged to FreeBSD 8.x. Reviewed by: bz Sponsored by: Juniper Networks, Inc. Notes: svn path=/head/; revision=222748
* IP divert sockets use their inpcbinfo for port reservation, although notRobert Watson2011-06-041-0/+2
| | | | | | | | | | | | | | | | | | | | | | for lookup. I missed its call to in_pcbbind() when preparing previous patches, which would lead to a lock assertion failure (although problem not an actual race condition due to global pcbinfo locks providing required synchronisation -- in this particular case only). This change adds the missing locking of the pcbhash lock. (Existing comments in the ipdivert code question the need for using the global hash to manage the namespace, as really it's a simple port namespace and not an address/port namespace. Also, although in_pcbbind is used to manage reservations, the hash tables aren't used for lookup. It might be a good idea to make them use hashed lookup, or to use a different reservation scheme.) Reviewed by: bz Reported by: Kristof Provost <kristof at sigsegv.be> Sponsored by: Juniper Networks Notes: svn path=/head/; revision=222690
* Decompose the current single inpcbinfo lock into two locks:Robert Watson2011-05-301-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - The existing ipi_lock continues to protect the global inpcb list and inpcb counter. This lock is now relegated to a small number of allocation and free operations, and occasional operations that walk all connections (including, awkwardly, certain UDP multicast receive operations -- something to revisit). - A new ipi_hash_lock protects the two inpcbinfo hash tables for looking up connections and bound sockets, manipulated using new INP_HASH_*() macros. This lock, combined with inpcb locks, protects the 4-tuple address space. Unlike the current ipi_lock, ipi_hash_lock follows the individual inpcb connection locks, so may be acquired while manipulating a connection on which a lock is already held, avoiding the need to acquire the inpcbinfo lock preemptively when a binding change might later be required. As a result, however, lookup operations necessarily go through a reference acquire while holding the lookup lock, later acquiring an inpcb lock -- if required. A new function in_pcblookup() looks up connections, and accepts flags indicating how to return the inpcb. Due to lock order changes, callers no longer need acquire locks before performing a lookup: the lookup routine will acquire the ipi_hash_lock as needed. In the future, it will also be able to use alternative lookup and locking strategies transparently to callers, such as pcbgroup lookup. New lookup flags are, supplementing the existing INPLOOKUP_WILDCARD flag: INPLOOKUP_RLOCKPCB - Acquire a read lock on the returned inpcb INPLOOKUP_WLOCKPCB - Acquire a write lock on the returned inpcb Callers must pass exactly one of these flags (for the time being). Some notes: - All protocols are updated to work within the new regime; especially, TCP, UDPv4, and UDPv6. pcbinfo ipi_lock acquisitions are largely eliminated, and global hash lock hold times are dramatically reduced compared to previous locking. - The TCP syncache still relies on the pcbinfo lock, something that we may want to revisit. - Support for reverting to the FreeBSD 7.x locking strategy in TCP input is no longer available -- hash lookup locks are now held only very briefly during inpcb lookup, rather than for potentially extended periods. However, the pcbinfo ipi_lock will still be acquired if a connection state might change such that a connection is added or removed. - Raw IP sockets continue to use the pcbinfo ipi_lock for protection, due to maintaining their own hash tables. - The interface in6_pcblookup_hash_locked() is maintained, which allows callers to acquire hash locks and perform one or more lookups atomically with 4-tuple allocation: this is required only for TCPv6, as there is no in6_pcbconnect_setup(), which there should be. - UDPv6 locking remains significantly more conservative than UDPv4 locking, which relates to source address selection. This needs attention, as it likely significantly reduces parallelism in this code for multithreaded socket use (such as in BIND). - In the UDPv4 and UDPv6 multicast cases, we need to revisit locking somewhat, as they relied on ipi_lock to stablise 4-tuple matches, which is no longer sufficient. A second check once the inpcb lock is held should do the trick, keeping the general case from requiring the inpcb lock for every inpcb visited. - This work reminds us that we need to revisit locking of the v4/v6 flags, which may be accessed lock-free both before and after this change. - Right now, a single lock name is used for the pcbhash lock -- this is undesirable, and probably another argument is required to take care of this (or a char array name field in the pcbinfo?). This is not an MFC candidate for 8.x due to its impact on lookup and locking semantics. It's possible some of these issues could be worked around with compatibility wrappers, if necessary. Reviewed by: bz Sponsored by: Juniper Networks, Inc. Notes: svn path=/head/; revision=222488
* Specify a CTLTYPE_FOO so that a future sysctl(8) change does not needMatthew D Fleming2011-01-181-2/+2
| | | | | | | | | to rely on the format string. For SYSCTL_PROC instances that I noticed a discrepancy between the CTLTYPE and the format specifier, fix the CTLTYPE. Notes: svn path=/head/; revision=217554
* After some off-list discussion, revert a number of changes to theDimitry Andric2010-11-221-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DPCPU_DEFINE and VNET_DEFINE macros, as these cause problems for various people working on the affected files. A better long-term solution is still being considered. This reversal may give some modules empty set_pcpu or set_vnet sections, but these are harmless. Changes reverted: ------------------------------------------------------------------------ r215318 | dim | 2010-11-14 21:40:55 +0100 (Sun, 14 Nov 2010) | 4 lines Instead of unconditionally emitting .globl's for the __start_set_xxx and __stop_set_xxx symbols, only emit them when the set_vnet or set_pcpu sections are actually defined. ------------------------------------------------------------------------ r215317 | dim | 2010-11-14 21:38:11 +0100 (Sun, 14 Nov 2010) | 3 lines Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughout the tree. ------------------------------------------------------------------------ r215316 | dim | 2010-11-14 21:23:02 +0100 (Sun, 14 Nov 2010) | 2 lines Add macros to define static instances of VNET_DEFINE and DPCPU_DEFINE. Notes: svn path=/head/; revision=215701
* Apply the STATIC_VNET_DEFINE and STATIC_DPCPU_DEFINE macros throughoutDimitry Andric2010-11-141-2/+2
| | | | | | | the tree. Notes: svn path=/head/; revision=215317
* Ensure a minimum "slop" of 10 extra pcb structures when providing aJohn Baldwin2010-08-171-2/+2
| | | | | | | | | | | | memory size estimate to userland for pcb list sysctls. The previous behavior of a "slop" of n/8 does not work well for small values of n (e.g. no slop at all if you have less than 8 open UDP connections). Reviewed by: bz MFC after: 1 week Notes: svn path=/head/; revision=211433
* Add pcb reference counting to the pcblist sysctl handler functionsBjoern A. Zeeb2010-03-171-3/+14
| | | | | | | | | | | to ensure type stability while caching the pcb pointers for the copyout. Reviewed by: rwatson MFC after: 7 days Notes: svn path=/head/; revision=205251
* Abstract out initialization of most aspects of struct inpcbinfo fromRobert Watson2010-03-141-21/+6
| | | | | | | | | | | | | | | their calling contexts in {IP divert, raw IP sockets, TCP, UDP} and create new helper functions: in_pcbinfo_init() and in_pcbinfo_destroy() to do this work in a central spot. As inpcbinfo becomes more complex due to ongoing work to add connection groups, this will reduce code duplication. MFC after: 1 month Reviewed by: bz Sponsored by: Juniper Networks Notes: svn path=/head/; revision=205157
* The proper fix for the delayed SCTP checksum is toRandall Stewart2010-03-121-1/+1
| | | | | | | | | | | | | have the delayed function take an argument as to the offset to the SCTP header. This allows it to work for V4 and V6. This of course means changing all callers of the function to either pass the header len, if they have it, or create it (ip_hl << 2 or sizeof(ip6_hdr)). PR: 144529 MFC after: 2 weeks Notes: svn path=/head/; revision=205104
* Remove unnecessary locking of divcbinfo lock from div_output(): this has notRobert Watson2010-03-061-3/+0
| | | | | | | | | | | | | been required since FreeBSD 7.0 when the so_pcb pointer leading to inp was guaranteed to be stable when a valid socket reference is held (as it is in the output path). MFC after: 1 week Reviewed by: bz Sponsored by: Juniper Networks Notes: svn path=/head/; revision=204810
* Following up on a request from Ermal Luci to makeLuigi Rizzo2010-01-071-24/+20
| | | | | | | | | | | | | | | | | | | | ip_divert work as a client of pf(4), make ip_divert not depend on ipfw. This is achieved by moving to ip_var.h the struct ipfw_rule_ref (which is part of the mtag for all reinjected packets) and other declarations of global variables, and moving to raw_ip.c global variables for filter and divert hooks. Note that names and locations could be made more generic (ipfw_rule_ref is really a generic reference robust to reconfigurations; the packet filter is not necessarily ipfw; filters and their clients are not necessarily limited to ipv4), but _right now_ most of this stuff works on ipfw and ipv4, so i don't feel like doing a gratuitous renaming, at least for the time being. Notes: svn path=/head/; revision=201735
* Various cleanup done in ipfw3-head branch including:Luigi Rizzo2010-01-041-28/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | - use a uniform mtag format for all packets that exit and re-enter the firewall in the middle of a rulechain. On reentry, all tags containing reinject info are renamed to MTAG_IPFW_RULE so the processing is simpler. - make ipfw and dummynet use ip_len and ip_off in network format everywhere. Conversion is done only once instead of tracking the format in every place. - use a macro FREE_PKT to dispose of mbufs. This eases portability. On passing i also removed a few typos, staticise or localise variables, remove useless declarations and other minor things. Overall the code shrinks a bit and is hopefully more readable. I have tested functionality for all but ng_ipfw and if_bridge/if_ethersubr. For ng_ipfw i am actually waiting for feedback from glebius@ because we might have some small changes to make. For if_bridge and if_ethersubr feedback would be welcome (there are still some redundant parts in these two modules that I would like to remove, but first i need to check functionality). Notes: svn path=/head/; revision=201527
* Start splitting ip_fw2.c and ip_fw.h into smaller components.Luigi Rizzo2009-12-151-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At this time we pull out from ip_fw2.c the logging functions, and support for dynamic rules, and move kernel-only stuff into netinet/ipfw/ip_fw_private.h No ABI change involved in this commit, unless I made some mistake. ip_fw.h has changed, though not in the userland-visible part. Files touched by this commit: conf/files now references the two new source files netinet/ip_fw.h remove kernel-only definitions gone into netinet/ipfw/ip_fw_private.h. netinet/ipfw/ip_fw_private.h new file with kernel-specific ipfw definitions netinet/ipfw/ip_fw_log.c ipfw_log and related functions netinet/ipfw/ip_fw_dynamic.c code related to dynamic rules netinet/ipfw/ip_fw2.c removed the pieces that goes in the new files netinet/ipfw/ip_fw_nat.c minor rearrangement to remove LOOKUP_NAT from the main headers. This require a new function pointer. A bunch of other kernel files that included netinet/ip_fw.h now require netinet/ipfw/ip_fw_private.h as well. Not 100% sure i caught all of them. MFC after: 1 month Notes: svn path=/head/; revision=200580
* Introduce a div_destroy() function which takes over per-vnet cleanup tasksMarko Zec2009-08-241-5/+30
| | | | | | | | | | | | | | | | | | | | | | | | | from the existing modevent / MOD_UNLOAD handler, and register div_destroy() in protosw as per-vnet .pr_destroy() handler for options VIMAGE builds. In nooptions VIMAGE builds, div_destroy() will be invoked from the modevent handler, resulting in effectively identical operation as it was prior this change. div_destroy() also tears down hashtables used by ipdivert, which were previously left behind on ipdivert kldunloads. For options VIMAGE builds only, temporarily disable kldunloading of ipdivert, because without introducing additional locking logic it is impossible to atomically check whether all ipdivert instances in all vnets are idle, and proceed with cleanup without opening a race window for a vnet to open an ipdivert socket while ipdivert tear-down is in progress. While here, staticize div_init(), because it is not used outside of ip_divert.c. In cooperation with: julian Approved by: re (rwatson), julian (mentor) MFC after: 3 days Notes: svn path=/head/; revision=196502
* Many network stack subsystems use a single global data structure to holdRobert Watson2009-08-021-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | all pertinent statatistics for the subsystem. These structures are sometimes "borrowed" by kernel modules that require a place to store statistics for similar events. Add KPI accessor functions for statistics structures referenced by kernel modules so that they no longer encode certain specifics of how the data structures are named and stored. This change is intended to make it easier to move to per-CPU network stats following 8.0-RELEASE. The following modules are affected by this change: if_bridge if_cxgb if_gif ip_mroute ipdivert pf In practice, most of these statistics consumers should, in fact, maintain their own statistics data structures rather than borrowing structures from the base network stack. However, that change is too agressive for this point in the release cycle. Reviewed by: bz Approved by: re (kib) Notes: svn path=/head/; revision=196039
* Merge the remainder of kern_vimage.c and vimage.h into vnet.c andRobert Watson2009-08-011-1/+1
| | | | | | | | | | | | | vnet.h, we now use jails (rather than vimages) as the abstraction for virtualization management, and what remained was specific to virtual network stacks. Minor cleanups are done in the process, and comments updated to reflect these changes. Reviewed by: bz Approved by: re (vimage blanket) Notes: svn path=/head/; revision=196019
* Remove unused VNET_SET() and related macros; only VNET_GET() isRobert Watson2009-07-161-2/+2
| | | | | | | | | | | | ever actually used. Rename VNET_GET() to VNET() to shorten variable references. Discussed with: bz, julian Reviewed by: bz Approved by: re (kensmith, kib) Notes: svn path=/head/; revision=195727
* Build on Jeff Roberson's linker-set based dynamic per-CPU allocatorRobert Watson2009-07-141-16/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | (DPCPU), as suggested by Peter Wemm, and implement a new per-virtual network stack memory allocator. Modify vnet to use the allocator instead of monolithic global container structures (vinet, ...). This change solves many binary compatibility problems associated with VIMAGE, and restores ELF symbols for virtualized global variables. Each virtualized global variable exists as a "reference copy", and also once per virtual network stack. Virtualized global variables are tagged at compile-time, placing the in a special linker set, which is loaded into a contiguous region of kernel memory. Virtualized global variables in the base kernel are linked as normal, but those in modules are copied and relocated to a reserved portion of the kernel's vnet region with the help of a the kernel linker. Virtualized global variables exist in per-vnet memory set up when the network stack instance is created, and are initialized statically from the reference copy. Run-time access occurs via an accessor macro, which converts from the current vnet and requested symbol to a per-vnet address. When "options VIMAGE" is not compiled into the kernel, normal global ELF symbols will be used instead and indirection is avoided. This change restores static initialization for network stack global variables, restores support for non-global symbols and types, eliminates the need for many subsystem constructors, eliminates large per-subsystem structures that caused many binary compatibility issues both for monitoring applications (netstat) and kernel modules, removes the per-function INIT_VNET_*() macros throughout the stack, eliminates the need for vnet_symmap ksym(2) munging, and eliminates duplicate definitions of virtualized globals under VIMAGE_GLOBALS. Bump __FreeBSD_version and update UPDATING. Portions submitted by: bz Reviewed by: bz, zec Discussed with: gnn, jamie, jeff, jhb, julian, sam Suggested by: peter Approved by: re (kensmith) Notes: svn path=/head/; revision=195699
* Update various IPFW-related modules to use if_addr_rlock()/Robert Watson2009-06-261-2/+2
| | | | | | | | | if_addr_runlock() rather than IF_ADDR_LOCK()/IF_ADDR_UNLOCK(). MFC after: 6 weeks Notes: svn path=/head/; revision=195023
* Modify most routines returning 'struct ifaddr *' to return referencesRobert Watson2009-06-231-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rather than pointers, requiring callers to properly dispose of those references. The following routines now return references: ifaddr_byindex ifa_ifwithaddr ifa_ifwithbroadaddr ifa_ifwithdstaddr ifa_ifwithnet ifaof_ifpforaddr ifa_ifwithroute ifa_ifwithroute_fib rt_getifa rt_getifa_fib IFP_TO_IA ip_rtaddr in6_ifawithifp in6ifa_ifpforlinklocal in6ifa_ifpwithaddr in6_ifadd carp_iamatch6 ip6_getdstifaddr Remove unused macro which didn't have required referencing: IFP_TO_IA6 This closes many small races in which changes to interface or address lists while an ifaddr was in use could lead to use of freed memory (etc). In a few cases, add missing if_addr_list locking required to safely acquire references. Because of a lack of deep copying support, we accept a race in which an in6_ifaddr pointed to by mbuf tags and extracted with ip6_getdstifaddr() doesn't hold a reference while in transmit. Once we have mbuf tag deep copy support, this can be fixed. Reviewed by: bz Obtained from: Apple, Inc. (portions) MFC after: 6 weeks (portions) Notes: svn path=/head/; revision=194760
* Move "options MAC" from opt_mac.h to opt_global.h, as it's now in GENERICRobert Watson2009-06-051-1/+0
| | | | | | | | | | | and used in a large number of files, but also because an increasing number of incorrect uses of MAC calls were sneaking in due to copy-and-paste of MAC-aware code without the associated opt_mac.h include. Discussed with: pjd Notes: svn path=/head/; revision=193511
* Add internal 'mac_policy_count' counter to the MAC Framework, which is aRobert Watson2009-06-021-2/+0
| | | | | | | | | | | | | | | | | | | | | count of the number of registered policies. Rather than unconditionally locking sockets before passing them into MAC, lock them in the MAC entry points only if mac_policy_count is non-zero. This avoids locking overhead for a number of socket system calls when no policies are registered, eliminating measurable overhead for the MAC Framework for the socket subsystem when there are no active policies. Possibly socket locks should be acquired by policies if they are required for socket labels, which would further avoid locking overhead when there are policies but they don't require labeling of sockets, or possibly don't even implement socket controls. Obtained from: TrustedBSD Project Notes: svn path=/head/; revision=193332
* Reimplement the netisr framework in order to support parallel netisrRobert Watson2009-06-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | threads: - Support up to one netisr thread per CPU, each processings its own workstream, or set of per-protocol queues. Threads may be bound to specific CPUs, or allowed to migrate, based on a global policy. In the future it would be desirable to support topology-centric policies, such as "one netisr per package". - Allow each protocol to advertise an ordering policy, which can currently be one of: NETISR_POLICY_SOURCE: packets must maintain ordering with respect to an implicit or explicit source (such as an interface or socket). NETISR_POLICY_FLOW: make use of mbuf flow identifiers to place work, as well as allowing protocols to provide a flow generation function for mbufs without flow identifers (m2flow). Falls back on NETISR_POLICY_SOURCE if now flow ID is available. NETISR_POLICY_CPU: allow protocols to inspect and assign a CPU for each packet handled by netisr (m2cpuid). - Provide utility functions for querying the number of workstreams being used, as well as a mapping function from workstream to CPU ID, which protocols may use in work placement decisions. - Add explicit interfaces to get and set per-protocol queue limits, and get and clear drop counters, which query data or apply changes across all workstreams. - Add a more extensible netisr registration interface, in which protocols declare 'struct netisr_handler' structures for each registered NETISR_ type. These include name, handler function, optional mbuf to flow ID function, optional mbuf to CPU ID function, queue limit, and ordering policy. Padding is present to allow these to be expanded in the future. If no queue limit is declared, then a default is used. - Queue limits are now per-workstream, and raised from the previous IFQ_MAXLEN default of 50 to 256. - All protocols are updated to use the new registration interface, and with the exception of netnatm, default queue limits. Most protocols register as NETISR_POLICY_SOURCE, except IPv4 and IPv6, which use NETISR_POLICY_FLOW, and will therefore take advantage of driver- generated flow IDs if present. - Formalize a non-packet based interface between interface polling and the netisr, rather than having polling pretend to be two protocols. Provide two explicit hooks in the netisr worker for start and end events for runs: netisr_poll() and netisr_pollmore(), as well as a function, netisr_sched_poll(), to allow the polling code to schedule netisr execution. DEVICE_POLLING still embeds single-netisr assumptions in its implementation, so for now if it is compiled into the kernel, a single and un-bound netisr thread is enforced regardless of tunable configuration. In the default configuration, the new netisr implementation maintains the same basic assumptions as the previous implementation: a single, un-bound worker thread processes all deferred work, and direct dispatch is enabled by default wherever possible. Performance measurement shows a marginal performance improvement over the old implementation due to the use of batched dequeue. An rmlock is used to synchronize use and registration/unregistration using the framework; currently, synchronized use is disabled (replicating current netisr policy) due to a measurable 3%-6% hit in ping-pong micro-benchmarking. It will be enabled once further rmlock optimization has taken place. However, in practice, netisrs are rarely registered or unregistered at runtime. A new man page for netisr will follow, but since one doesn't currently exist, it hasn't been updated. This change is not appropriate for MFC, although the polling shutdown handler should be merged to 7-STABLE. Bump __FreeBSD_version. Reviewed by: bz Notes: svn path=/head/; revision=193219
* Permit buiding kernels with options VIMAGE, restricted to only a singleMarko Zec2009-04-301-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | active network stack instance. Turning on options VIMAGE at compile time yields the following changes relative to default kernel build: 1) V_ accessor macros for virtualized variables resolve to structure fields via base pointers, instead of being resolved as fields in global structs or plain global variables. As an example, V_ifnet becomes: options VIMAGE: ((struct vnet_net *) vnet_net)->_ifnet default build: vnet_net_0._ifnet options VIMAGE_GLOBALS: ifnet 2) INIT_VNET_* macros will declare and set up base pointers to be used by V_ accessor macros, instead of resolving to whitespace: INIT_VNET_NET(ifp->if_vnet); becomes struct vnet_net *vnet_net = (ifp->if_vnet)->mod_data[VNET_MOD_NET]; 3) Memory for vnet modules registered via vnet_mod_register() is now allocated at run time in sys/kern/kern_vimage.c, instead of per vnet module structs being declared as globals. If required, vnet modules can now request the framework to provide them with allocated bzeroed memory by filling in the vmi_size field in their vmi_modinfo structures. 4) structs socket, ifnet, inpcbinfo, tcpcb and syncache_head are extended to hold a pointer to the parent vnet. options VIMAGE builds will fill in those fields as required. 5) curvnet is introduced as a new global variable in options VIMAGE builds, always pointing to the default and only struct vnet. 6) struct sysctl_oid has been extended with additional two fields to store major and minor virtualization module identifiers, oid_v_subs and oid_v_mod. SYSCTL_V_* family of macros will fill in those fields accordingly, and store the offset in the appropriate vnet container struct in oid_arg1. In sysctl handlers dealing with virtualized sysctls, the SYSCTL_RESOLVE_V_ARG1() macro will compute the address of the target variable and make it available in arg1 variable for further processing. Unused fields in structs vnet_inet, vnet_inet6 and vnet_ipfw have been deleted. Reviewed by: bz, rwatson Approved by: julian (mentor) Notes: svn path=/head/; revision=191688
* In preparation for turning on options VIMAGE in next commits,Marko Zec2009-04-261-0/+2
| | | | | | | | | | | rearrange / replace / adjust several INIT_VNET_* initializer macros, all of which currently resolve to whitespace. Reviewed by: bz (an older version of the patch) Approved by: julian (mentor) Notes: svn path=/head/; revision=191548
* In divert_packet(), lock the interface address list before iterating overRobert Watson2009-04-191-1/+5
| | | | | | | | | it in search of an address. MFC after: 2 weeks Notes: svn path=/head/; revision=191287
* Update stats in struct ipstat using four new macros, IPSTAT_ADD(),Robert Watson2009-04-111-5/+5
| | | | | | | | | | | | IPSTAT_INC(), IPSTAT_SUB(), and IPSTAT_DEC(), rather than directly manipulating the fields across the kernel. This will make it easier to change the implementation of these statistics, such as using per-CPU versions of the data structures. MFC after: 3 days Notes: svn path=/head/; revision=190951
* Adds support for SCTP checksum offload. This meansRandall Stewart2009-02-031-1/+12
| | | | | | | | | | | | | | | | we, like TCP and UDP, move the checksum calculation into the IP routines when there is no hardware support we call into the normal SCTP checksum routine. The next round of SCTP updates will use this functionality. Of course the IGB driver needs a few updates to support the new intel controller set that actually does SCTP csum offload too. Reviewed by: gnn, rwatson, kmacy Notes: svn path=/head/; revision=188066
* Conditionally compile out V_ globals while instantiating the appropriateMarko Zec2008-12-101-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | container structures, depending on VIMAGE_GLOBALS compile time option. Make VIMAGE_GLOBALS a new compile-time option, which by default will not be defined, resulting in instatiations of global variables selected for V_irtualization (enclosed in #ifdef VIMAGE_GLOBALS blocks) to be effectively compiled out. Instantiate new global container structures to hold V_irtualized variables: vnet_net_0, vnet_inet_0, vnet_inet6_0, vnet_ipsec_0, vnet_netgraph_0, and vnet_gif_0. Update the VSYM() macro so that depending on VIMAGE_GLOBALS the V_ macros resolve either to the original globals, or to fields inside container structures, i.e. effectively #ifdef VIMAGE_GLOBALS #define V_rt_tables rt_tables #else #define V_rt_tables vnet_net_0._rt_tables #endif Update SYSCTL_V_*() macros to operate either on globals or on fields inside container structs. Extend the internal kldsym() lookups with the ability to resolve selected fields inside the virtualization container structs. This applies only to the fields which are explicitly registered for kldsym() visibility via VNET_MOD_DECLARE() and vnet_mod_register(), currently this is done only in sys/net/if.c. Fix a few broken instances of MODULE_GLOBAL() macro use in SCTP code, and modify the MODULE_GLOBAL() macro to resolve to V_ macros, which in turn result in proper code being generated depending on VIMAGE_GLOBALS. De-virtualize local static variables in sys/contrib/pf/net/pf_subr.c which were prematurely V_irtualized by automated V_ prepending scripts during earlier merging steps. PF virtualization will be done separately, most probably after next PF import. Convert a few variable initializations at instantiation to initialization in init functions, most notably in ipfw. Also convert TUNABLE_INT() initializers for V_ variables to TUNABLE_FETCH_INT() in initializer functions. Discussed at: devsummit Strassburg Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation Notes: svn path=/head/; revision=185895
* Rather than using hidden includes (with cicular dependencies),Bjoern A. Zeeb2008-12-021-0/+1
| | | | | | | | | | | | | | directly include only the header files needed. This reduces the unneeded spamming of various headers into lots of files. For now, this leaves us with very few modules including vnet.h and thus needing to depend on opt_route.h. Reviewed by: brooks, gnn, des, zec, imp Sponsored by: The FreeBSD Foundation Notes: svn path=/head/; revision=185571
* Merge more of currently non-functional (i.e. resolving toMarko Zec2008-11-261-0/+1
| | | | | | | | | | | | | | | | | | | | whitespace) macros from p4/vimage branch. Do a better job at enclosing all instantiations of globals scheduled for virtualization in #ifdef VIMAGE_GLOBALS blocks. De-virtualize and mark as const saorder_state_alive and saorder_state_any arrays from ipsec code, given that they are never updated at runtime, so virtualizing them would be pointless. Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation Notes: svn path=/head/; revision=185348
* Fix a scope problem in the multiple routing table code that stopped theJulian Elischer2008-11-191-0/+1
| | | | | | | | | | SO_SETFIB socket option from working correctly. Obtained from: Ironport MFC after: 3 days Notes: svn path=/head/; revision=185101
* Change the initialization methodology for global variables scheduledMarko Zec2008-11-191-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | for virtualization. Instead of initializing the affected global variables at instatiation, assign initial values to them in initializer functions. As a rule, initialization at instatiation for such variables should never be introduced again from now on. Furthermore, enclose all instantiations of such global variables in #ifdef VIMAGE_GLOBALS blocks. Essentialy, this change should have zero functional impact. In the next phase of merging network stack virtualization infrastructure from p4/vimage branch, the new initialization methology will allow us to switch between using global variables and their counterparts residing in virtualization containers with minimum code churn, and in the long run allow us to intialize multiple instances of such container structures. Discussed at: devsummit Strassburg Reviewed by: bz, julian Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation Notes: svn path=/head/; revision=185088
* Add cr_canseeinpcb() doing checks using the cached socketBjoern A. Zeeb2008-10-171-1/+1
| | | | | | | | | | | | | credentials from inp_cred which is also available after the socket is gone. Switch cr_canseesocket consumers to cr_canseeinpcb. This removes an extra acquisition of the socket lock. Reviewed by: rwatson MFC after: 3 months (set timer; decide then) Notes: svn path=/head/; revision=183982
* Step 1.5 of importing the network stack virtualization infrastructureMarko Zec2008-10-021-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | from the vimage project, as per plan established at devsummit 08/08: http://wiki.freebsd.org/Image/Notes200808DevSummit Introduce INIT_VNET_*() initializer macros, VNET_FOREACH() iterator macros, and CURVNET_SET() context setting macros, all currently resolving to NOPs. Prepare for virtualization of selected SYSCTL objects by introducing a family of SYSCTL_V_*() macros, currently resolving to their global counterparts, i.e. SYSCTL_V_INT() == SYSCTL_INT(). Move selected #defines from sys/sys/vimage.h to newly introduced header files specific to virtualized subsystems (sys/net/vnet.h, sys/netinet/vinet.h etc.). All the changes are verified to have zero functional impact at this point in time by doing MD5 comparision between pre- and post-change object files(*). (*) netipsec/keysock.c did not validate depending on compile time options. Implemented by: julian, bz, brooks, zec Reviewed by: julian, bz, brooks, kris, rwatson, ... Approved by: julian (mentor) Obtained from: //depot/projects/vimage-commit2/... X-MFC after: never Sponsored by: NLnet Foundation, The FreeBSD Foundation Notes: svn path=/head/; revision=183550
* Commit step 1 of the vimage project, (network stack)Bjoern A. Zeeb2008-08-171-45/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | virtualization work done by Marko Zec (zec@). This is the first in a series of commits over the course of the next few weeks. Mark all uses of global variables to be virtualized with a V_ prefix. Use macros to map them back to their global names for now, so this is a NOP change only. We hope to have caught at least 85-90% of what is needed so we do not invalidate a lot of outstanding patches again. Obtained from: //depot/projects/vimage-commit2/... Reviewed by: brooks, des, ed, mav, julian, jamie, kris, rwatson, zec, ... (various people I forgot, different versions) md5 (with a bit of help) Sponsored by: NLnet Foundation, The FreeBSD Foundation X-MFC after: never V_Commit_Message_Reviewed_By: more people than the patch Notes: svn path=/head/; revision=181803
* According to in_pcb.h protocol binding information has double locking.Alexander Motin2008-07-271-2/+1
| | | | | | | It allows access it while list travercing holding only global pcbinfo lock. Notes: svn path=/head/; revision=180851
* Read lock, rather than write lock, the inpcb when transmitting with orRobert Watson2008-04-211-11/+11
| | | | | | | | | delivering to an IP divert socket. MFC after: 3 months Notes: svn path=/head/; revision=178376