<!DOCTYPE HTML PUBLIC "-//FreeBSD//DTD HTML 4.01 Transitional-Based Extension//EN" [
<!ENTITY base CDATA "../..">
<!ENTITY date "$FreeBSD: www/en/projects/netperf/index.sgml,v 1.20 2006/08/19 21:20:43 hrs Exp $">
<!ENTITY title "FreeBSD Network Performance Project (netperf)">
<!ENTITY email 'mux'>
<!ENTITY % navinclude.developers "INCLUDE">

<!ENTITY status.na "<font color=green>N/A</font>">
<!ENTITY status.done "<font color=green>Done</font>">
<!ENTITY status.prototyped "<font color=blue>Prototyped</font>">
<!ENTITY status.head "<font color=orange>Merged to HEAD; RELENG_5 candidate</font>">
<!ENTITY status.new "<font color=red>New task</font>">
<!ENTITY status.unknown "<font color=red>Unknown</font>">

<!ENTITY % developers SYSTEM "../../developers.sgml"> %developers;

]>

<html>
  &header;

  <h2>Contents</h2>
  <ul>
    <li><a href="#goal">Project Goal</a></li>
    <li><a href="#strategies">Project Strategies</a></li>
    <li><a href="#tasks">Project Tasks</a></li>
    <li><a href="#cluster">Netperf Cluster</a></li>
    <li><a href="#papers">Papers and Reports</a></li>
    <li><a href="#links">Links</a></li>
  </ul>

  <a name="goal"></a>
  <h2>Project Goal</h2>

  <p>The netperf project is working to enhance the performance of the
    FreeBSD network stack.  This work grew out of the
    <a href="../../smp">SMPng Project</a>, which moved the FreeBSD kernel from
    a "Giant Lock" to more fine-grained locking and multi-threading.  SMPng
    brought the network stack both performance improvements and
    degradations: it improved parallelism and preemption, but substantially
    increased per-packet processing costs.  The netperf project is
    primarily focused on further improving parallelism in network
    processing while reducing the SMP synchronization overhead.  This in
    turn will lead to higher processing throughput and lower processing
    latency.</p>

  <a name="strategies"></a>
  <h2>Project Strategies</h2>
  <p>Robert Watson</p>

  <p>The two primary focuses of this work are increasing parallelism and
    decreasing locking overhead.  Several activities are being performed
    that work toward these goals:</p>

  <ul>
    <li><p>The Netperf project has completed locking work for all components
      of the network stack; as of FreeBSD 7.0 we have removed non-MPSAFE
      protocol shims, and as of FreeBSD 8.0 we have removed non-MPSAFE
      device driver shims.</p></li>

    <li><p>Optimize locking strategies to find better balances between
      locking granularity and locking overhead.  In the first cut at locking
      for the kernel, the goal was to adopt a medium-grained locking
      approach based on data locking.  This approach identifies critical
      data structures, and inserts new locks and locking operations to
      protect those data structures.  Depending on the data model of the
      code being protected, this may lead to the introduction of a
      substantial number of locks offering unnecessary granularity, where
      the overhead of locking overwhelms the benefits of available
      parallelism and preemption.  By selectively reducing granularity, it
      is possible to improve performance by decreasing locking overhead.
      </p></li>

    <li><p>Amortize the cost of locking by processing queues of packets or
      events.  While the cost of individual synchronization operations may
      be high, it is possible to amortize that cost by grouping the
      processing of similar data (packets, events) under the same
      protection.  This approach focuses on identifying places where
      similar locking occurs frequently in succession, and introducing
      queueing or coalescing of lock operations across the body of the
      work.  For example, when a series of packets is inserted into an
      outgoing interface queue, a basic locking approach locks the queue
      for each insert operation, unlocks it, and hands off to the
      interface driver to begin the send, repeating this sequence as
      required.  With a coalesced approach, the caller passes off a whole
      queue of packets, reducing the locking overhead and eliminating
      unnecessary synchronization on the thread-local queue.  (A sketch of
      this pattern appears after this list.)  This approach can be applied
      at several levels in the stack, and is particularly applicable at
      lower levels of the stack, where streams of packets require almost
      identical processing.</p></li>

    <li><p>Introduce new synchronization strategies with reduced overhead
      relative to traditional strategies.  Most traditional strategies
      employ a combination of interrupt disabling and atomic operations to
      achieve mutual exclusion and non-preemption guarantees.  However,
      these operations are expensive on modern CPUs, leading to the desire
      for cheaper primitives with weaker semantics.  Examples include
      applying uniprocessor primitives where synchronization is required
      only on a single processor, and optimizing critical section
      primitives to avoid the need for interrupt disabling.</p></li>

    <li><p>Modify synchronization strategies to take advantage of
      additional, non-locking synchronization primitives.  This approach
      might take the form of making increased use of per-CPU or per-thread
      data structures, which require little or no synchronization.  For
      example, through the use of critical sections, it is possible to
      synchronize access to per-CPU caches and queues; through the use of
      per-thread queues, data can be handed off between stack layers
      without the use of synchronization.</p></li>

    <li><p>Increase the opportunities for parallelism through increased
      threading in the network stack.  The current network stack model
      offers the opportunity for substantial parallelism, with outbound
      processing typically taking place in the context of the sending
      thread in the kernel, crypto occurring in crypto worker threads, and
      receive processing taking place in a combination of the receiving
      ithread and a dispatched netisr thread.  While handoffs between
      threads introduce overhead (synchronization, context switching),
      there is the opportunity to increase parallelism in some workloads
      by introducing additional worker threads.  Identifying work that may
      be relocated to new threads must be done carefully, to balance
      overhead and latency concerns, but can pay off by increasing
      effective CPU utilization and hence throughput.  For example,
      introducing additional netisr threads capable of running on more
      than one CPU at a time can increase input parallelism, subject to
      maintaining desirable packet ordering (present in FreeBSD
      8.0).</p></li>
  </ul>
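  <p>As a concrete illustration of the amortization strategy above, the
    following minimal sketch compares per-packet locking with batched
    handoff.  It is written as plain userland C with POSIX threads rather
    than the kernel's locking primitives, and all of the names here
    (struct pkt, struct pktq, send_one(), send_batch()) are hypothetical,
    not actual FreeBSD interfaces:</p>

<pre>
/*
 * Sketch only: amortizing lock cost by handing off a whole batch of
 * packets under a single lock acquisition.
 */
#include &lt;pthread.h&gt;
#include &lt;stddef.h&gt;

struct pkt {
        struct pkt      *next;
        /* ... packet payload ... */
};

struct pktq {
        struct pkt      *head;
        struct pkt      *tail;
};

/* Shared outbound interface queue and the mutex protecting it. */
static struct pktq ifq = { NULL, NULL };
static pthread_mutex_t ifq_lock = PTHREAD_MUTEX_INITIALIZER;

/* Naive approach: one lock/unlock cycle for every packet sent. */
void
send_one(struct pkt *p)
{
        pthread_mutex_lock(&amp;ifq_lock);
        p-&gt;next = NULL;
        if (ifq.tail != NULL)
                ifq.tail-&gt;next = p;
        else
                ifq.head = p;
        ifq.tail = p;
        pthread_mutex_unlock(&amp;ifq_lock);
}

/*
 * Amortized approach: the caller accumulates packets in a thread-local
 * queue with no locking at all, then appends the entire batch to the
 * shared queue under one lock/unlock cycle.
 */
void
send_batch(struct pktq *local)
{
        if (local-&gt;head == NULL)
                return;
        pthread_mutex_lock(&amp;ifq_lock);
        if (ifq.tail != NULL)
                ifq.tail-&gt;next = local-&gt;head;
        else
                ifq.head = local-&gt;head;
        ifq.tail = local-&gt;tail;
        pthread_mutex_unlock(&amp;ifq_lock);
        local-&gt;head = local-&gt;tail = NULL;
}
</pre>

  <p>The fewer times a shared lock must be taken per packet, the less
    synchronization overhead each packet carries; the same pattern applies
    at each layer boundary where streams of packets receive nearly
    identical treatment.</p>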
  <a name="tasks"></a>
  <h2>Project Tasks</h2>

  <table class="tblbasic">
    <tr>
      <th> Task </th>
      <th> Responsible </th>
      <th> Last updated </th>
      <th> Status </th>
      <th> Notes </th>
    </tr>

    <tr>
      <td> Prefer file descriptor reference counts to socket reference
        counts for system calls. </td>
      <td> &a.rwatson; </td>
      <td> 20041124 </td>
      <td> &status.done; </td>
      <td> Sockets and file descriptors both have reference counts in order
        to prevent these objects from being freed while in use.  However,
        when a file descriptor is used to reach the socket, the two
        reference counts are somewhat interchangeable, as either will
        prevent undesired garbage collection.  For socket system calls,
        overhead can be reduced by relying on the file descriptor reference
        count, thus avoiding the synchronized operations necessary to
        modify the socket reference count; the VFS code takes the same
        approach.  This change has been made for most socket system calls
        and has been committed to HEAD (6.x).  It has also been merged to
        RELENG_5 for inclusion in 5.4. </td>
    </tr>

    <tr>
      <td> Mbuf queue library </td>
      <td> &a.rwatson; </td>
      <td> 20041124 </td>
      <td> &status.prototyped; </td>
      <td> In order to facilitate passing queues of packets between
        network stack components, create an mbuf queue primitive, struct
        mbufqueue.  The initial implementation is complete, and the
        primitive is now being applied in several sample cases to determine
        whether it offers the desired semantics and benefits.  The
        implementation can be found in the rwatson_dispatch Perforce
        branch.  Additional work must also be done to explore the
        performance impact of "queues" vs. arrays of mbuf pointers, which
        are likely to behave better from a caching perspective. </td>
    </tr>

    <tr>
      <td> Employ queued dispatch in the interface send API </td>
      <td> &a.rwatson; </td>
      <td> 20041106 </td>
      <td> &status.prototyped; </td>
      <td> An experimental if_start_mbufqueue() interface to struct ifnet
        has been added, which passes an mbuf queue to the device driver for
        processing, avoiding redundant synchronization against the
        interface queue even when additional queueing is required.  This
        has not yet been benchmarked.  A subset change to dispatch a single
        mbuf to a driver has also been prototyped, and benchmarked at a
        several percentage point improvement in packet send rates from user
        space. </td>
    </tr>

    <tr>
      <td> Employ queued dispatch in the interface receive API </td>
      <td> &a.rwatson; </td>
      <td> 20041106 </td>
      <td> &status.new; </td>
      <td> Similar to if_start_mbufqueue(), allow input of a queue of mbufs
        from the device driver into the lowest protocol layers, such as an
        ether_input_mbufqueue(). </td>
    </tr>
    <tr>
      <td> Employ queued dispatch across the netisr dispatch API </td>
      <td> &a.rwatson; </td>
      <td> 20090601 </td>
      <td> &status.done; </td>
      <td> Pull all of the mbufs in the netisr queue into a thread-local
        mbuf queue to avoid repeated lock operations when accessing the
        queue.  This work was completed as part of the netisr2 project, and
        will ship with 8.0-RELEASE. </td>
    </tr>

    <tr>
      <td> Modify the UMA allocator to use critical sections rather than
        mutexes for per-CPU caches. </td>
      <td> &a.rwatson; </td>
      <td> 20050429 </td>
      <td> &status.done; </td>
      <td> The mutexes protecting per-CPU caches require atomic operations
        on SMP systems; since the caches are per-CPU objects, the cost of
        synchronizing access to them can be reduced by using CPU pinning
        and/or critical sections instead.  This change has now been
        committed and will appear in 6.0-RELEASE; it results in a several
        percentage point performance improvement in UDP send from user
        space, and there have been reports of 20%+ improvements in
        allocation-intensive code within the kernel.  In micro-benchmarks,
        the cost of allocation on SMP is dramatically reduced.  (A sketch
        of this technique appears after this table.) </td>
    </tr>

    <tr>
      <td> Modify the malloc(9) allocator to use per-CPU statistics with
        critical sections to protect malloc_type statistics, rather than
        global statistics with a mutex. </td>
      <td> &a.rwatson; </td>
      <td> 20050529 </td>
      <td> &status.done; </td>
      <td> Previously, malloc(9) used a single statistics structure
        protected by a mutex to hold global malloc statistics for each
        malloc type.  This change moves to per-CPU statistics structures,
        which are coalesced when reporting memory allocation statistics to
        the user, and protects them using critical sections.  This reduces
        cache line contention for common allocation types by avoiding
        shared lines, and also reduces synchronization costs by using
        critical sections to synchronize access instead of a mutex.  While
        malloc(9) is less frequently used in the network stack than uma(9),
        it is used for socket address data, so it is on performance-critical
        paths for datagram operations.  This has been committed and
        appeared in 6.0-RELEASE. </td>
    </tr>
    <tr>
      <td> Optimize critical section performance </td>
      <td> &a.jhb; </td>
      <td> 20050404 </td>
      <td> &status.done; </td>
      <td> Critical sections prevent preemption of a thread on a CPU, as
        well as migration of that thread to another CPU, and may be used
        both for synchronizing access to per-CPU data structures and for
        preventing recursion in interrupt processing.  Currently, critical
        sections disable interrupts on the CPU.  In previous versions of
        FreeBSD (4.x and before), optimizations were present that allowed
        for software interrupt disabling, which lowers the cost of critical
        sections in the common case by avoiding expensive microcode
        operations on the CPU.  By restoring this model, or a variation on
        it, critical sections can be made substantially cheaper to enter.
        In particular, this change lowers the cost of critical sections on
        UP to approximately that of a mutex, meaning that optimizations on
        SMP to use critical sections instead of mutexes will not harm UP
        performance.  This change has now been committed, and appeared in
        6.0-RELEASE. </td>
    </tr>

    <tr>
      <td> Normalize the socket and protocol control block reference model </td>
      <td> &a.rwatson; </td>
      <td> 20060401 </td>
      <td> &status.done; </td>
      <td> The socket/protocol boundary is characterized by a set of data
        structures and API interfaces, where the socket code acts as both
        a consumer and a service library for protocols.  This task is to
        normalize the reference model by which protocol state is attached
        to and detached from socket state in order to strengthen
        invariants, allowing the removal of countless unused code paths
        (especially error handling), the removal of unnecessary locking in
        TCP, and a general improvement in the structure of the code.  This
        serves the immediate purpose of improving the quality and
        performance of this code, and is also necessary for future
        optimization work.  These changes have been prototyped in Perforce
        and have now been merged to 7-CURRENT.  They will be merged into
        RELENG_6 once they have been thoroughly tested. </td>
    </tr>

    <tr>
      <td> Add true inpcb reference count support </td>
      <td> &a.mohans;, &a.rwatson;, &a.peter; </td>
      <td> 20081208 </td>
      <td> &status.done; </td>
      <td> Historically, the inbound TCP and UDP socket paths relied on the
        global pcbinfo locks to prevent PCBs from being garbage collected
        by another thread while in use.  This set of changes introduces a
        true reference model for PCBs so that the global lock can be
        released during inbound processing; it appears in
        8.0-RELEASE. </td>
    </tr>

    <tr>
      <td> Fine-grained locking for UNIX domain sockets </td>
      <td> &a.rwatson; </td>
      <td> 20070226 </td>
      <td> &status.done; </td>
      <td> UNIX domain sockets in FreeBSD 5.x and 6.x use a single global
        subsystem lock.  This is sufficient to allow them to run without
        Giant, but results in contention when large numbers of processors
        simultaneously operate on UNIX domain sockets.  This task
        introduced per-protocol control block locks in order to reduce
        contention on the larger subsystem lock; the results appeared in
        7.0-RELEASE. </td>
    </tr>

    <tr>
      <td> Multiple netisr threads </td>
      <td> &a.rwatson; </td>
      <td> 20090601 </td>
      <td> &status.done; </td>
      <td> Historically, the BSD network stack has used a single network
        software interrupt context for deferred network processing.  With
        the introduction of multiprocessing, this became a single software
        interrupt thread.  In FreeBSD 8.0, multiple netisr threads are now
        supported, up to the number of CPUs present in the system. </td>
    </tr>

  </table>
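  <p>As a rough illustration of the critical-section technique used in the
    UMA and malloc(9) tasks above, the following schematic, kernel-style
    sketch shows a strictly per-CPU free list protected by a critical
    section alone.  This is not the actual uma(9) implementation: struct
    item, pcpu_cache, cache_alloc(), and cache_free() are hypothetical
    names, and the slow path and error handling are omitted.</p>

<pre>
/*
 * Sketch only: a per-CPU cache synchronized with critical sections.
 * Because each list is touched only by its owning CPU, no atomic
 * operations or mutexes are needed on this path; the critical section
 * just prevents preemption and migration during the update.
 */
#include &lt;sys/param.h&gt;
#include &lt;sys/systm.h&gt;
#include &lt;sys/pcpu.h&gt;

struct item {
        struct item     *next;
};

/* One private free list per CPU; never shared across CPUs. */
static struct item *pcpu_cache[MAXCPU];

static struct item *
cache_alloc(void)
{
        struct item *it;

        critical_enter();               /* no preemption, no migration */
        it = pcpu_cache[curcpu];
        if (it != NULL)
                pcpu_cache[curcpu] = it-&gt;next;
        critical_exit();
        return (it);                    /* NULL: fall back to a slow path */
}

static void
cache_free(struct item *it)
{
        critical_enter();
        it-&gt;next = pcpu_cache[curcpu];
        pcpu_cache[curcpu] = it;
        critical_exit();
}
</pre>

  <p>A mutex would provide mutual exclusion against other CPUs, which
    strictly per-CPU data does not need; the critical section only has to
    keep the current thread on its CPU for the duration of the update,
    which is why the change reduced allocation cost so dramatically in
    micro-benchmarks.</p>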
  <a name="cluster"></a>
  <h2>Netperf Cluster</h2>

  <p>Through the generous donations and investment of Sentex Data
    Communications, FreeBSD Systems, IronPort Systems, and the FreeBSD
    Foundation, a network performance testbed has been created in Ontario,
    Canada, for use by FreeBSD developers working in the area of network
    performance.  A similar cluster, made possible through the generous
    donation of Verio, is being prepared in Virginia, US, for use in more
    general SMP performance work.  Each cluster consists of several SMP
    systems interconnected with Gigabit Ethernet such that relatively
    arbitrary topologies can be constructed in order to test host-to-host,
    IP forwarding, and bridging performance scenarios.  Systems are network
    booted and have serial consoles and remote power control, in order to
    maximize availability and minimize configuration overhead.  These
    systems are available on a check-out basis to FreeBSD developers
    working on the Netperf project and in related areas for
    experimentation and performance measurement.</p>

  <p><a href="cluster.html">More detailed information on the netperf
    cluster can be found by following this link.</a></p>
  <a name="papers"></a>
  <h2>Papers and Reports</h2>

  <p>The following paper(s) have been produced by or are related to the
    Netperf Project:</p>

  <ul>
    <li><p><a href="http://www.watson.org/~robert/freebsd/netperf/20051027-eurobsdcon2005-netperf.pdf">"Introduction to Multithreading and Multiprocessing in the FreeBSD SMPng Network Stack", EuroBSDCon 2005, Basel, Switzerland</a>.</p></li>
  </ul>

  <p>Additional papers can be found on the <a href="../../smp/">SMPng
    Project</a> web page.</p>

  <a name="links"></a>
  <h2>Links</h2>

  <p>Some useful links relating to the netperf work:</p>

  <ul>
    <li><p><a href="../../smp/">SMPng Project</a> -- Project to introduce
      finer-grained locking in the FreeBSD kernel.</p></li>

    <li><p><a href="http://www.watson.org/~robert/freebsd/netperf/">Robert
      Watson's netperf web page</a> -- Web page that includes a change log
      and performance measurement/debugging information.</p></li>
  </ul>

  &footer;
</body>
</html>