<!DOCTYPE HTML PUBLIC "-//FreeBSD//DTD HTML 4.01 Transitional-Based Extension//EN" [
<!ENTITY base CDATA "../..">
<!ENTITY date "$FreeBSD: www/en/projects/netperf/index.sgml,v 1.20 2006/08/19 21:20:43 hrs Exp $">
<!ENTITY title "FreeBSD Network Performance Project (netperf)">
<!ENTITY email 'mux'>
<!ENTITY % navinclude.developers "INCLUDE">

<!ENTITY status.na "<font color=green>N/A</font>">
<!ENTITY status.done "<font color=green>Done</font>">
<!ENTITY status.prototyped "<font color=blue>Prototyped</font>">
<!ENTITY status.head "<font color=orange>Merged to HEAD; RELENG_5 candidate</font>">
<!ENTITY status.new "<font color=red>New task</font>">
<!ENTITY status.unknown "<font color=red>Unknown</font>">

<!ENTITY % developers SYSTEM "../../developers.sgml"> %developers;

]>

<html>
  &header;

  <h2>Contents</h2>
  <ul>
    <li><a href="#goal">Project Goal</a></li>
    <li><a href="#strategies">Project Strategies</a></li>
    <li><a href="#tasks">Project Tasks</a></li>
    <li><a href="#cluster">Netperf Cluster</a></li>
    <li><a href="#papers">Papers and Reports</a></li>
    <li><a href="#links">Links</a></li>
  </ul>

  <a name="goal"></a>
  <h2>Project Goal</h2>

  <p>The netperf project is working to enhance the performance of the
    FreeBSD network stack.  This work grew out of the
    <a href="../../smp">SMPng Project</a>, which moved the FreeBSD kernel from
    a "Giant Lock" to more fine-grained locking and multi-threading.  SMPng
    brought the network stack both performance improvements and
    degradations: it improved parallelism and preemption, but substantially
    increased per-packet processing costs.  The netperf project is
    primarily focused on further improving parallelism in network
    processing while reducing the SMP synchronization overhead.  This in
    turn will lead to higher processing throughput and lower processing
    latency.</p>

  <a name="strategies"></a>
  <h2>Project Strategies</h2>
  <p>Robert Watson</p>

  <p>The two primary focuses of this work are increasing parallelism and
    decreasing locking overhead.  Several activities are being performed
    that work toward these goals:</p>

  <ul>
    <li><p>The Netperf project has completed locking work for all components
      of the network stack; as of FreeBSD 7.0 we have removed non-MPSAFE
      protocol shims, and as of FreeBSD 8.0 we have removed non-MPSAFE
      device driver shims.</p></li>

    <li><p>Optimize locking strategies to find better balances between
      locking granularity and locking overhead.  In the first cut at locking
      for the kernel, the goal was to adopt a medium-grained locking
      approach based on data locking.  This approach identifies critical
      data structures, and inserts new locks and locking operations to
      protect those data structures.  Depending on the data model of the
      code being protected, this may lead to the introduction of a
      substantial number of locks offering unnecessary granularity, where
      the overhead of locking overwhelms the benefits of available
      parallelism and preemption.  By selectively reducing granularity, it
      is possible to improve performance by decreasing locking overhead.
      </p></li>

    <li><p>Amortize the cost of locking by processing queues of packets or
      events.  While the cost of individual synchronization operations may
      be high, it is possible to amortize that cost by grouping the
      processing of similar data (packets, events) under the same
      protection.  This approach focuses on identifying places where
      similar locking occurs frequently in succession, and introducing
      queueing or coalescing of lock operations across the body of the
      work.  For example, when a series of packets is inserted into an
      outgoing interface queue, a basic locking approach locks the queue
      for each insert operation, unlocks it, and hands off to the
      interface driver to begin the send, repeating this sequence as
      required.  With a coalesced approach, the caller passes off a whole
      queue of packets, reducing the locking overhead and eliminating
      unnecessary synchronization on the thread-local queue.  (A sketch of
      this pattern appears after this list.)  This approach can be applied
      at several levels in the stack, and is particularly applicable at
      lower levels of the stack, where streams of packets require almost
      identical processing.</p></li>

    <li><p>Introduce new synchronization strategies with reduced overhead
      relative to traditional strategies.  Most traditional strategies
      employ a combination of interrupt disabling and atomic operations to
      achieve mutual exclusion and non-preemption guarantees.  However,
      these operations are expensive on modern CPUs, leading to the desire
      for cheaper primitives with weaker semantics.  Examples include
      applying uniprocessor primitives where synchronization is required
      only on a single processor, and optimizing critical section
      primitives to avoid the need for interrupt disabling.</p></li>

    <li><p>Modify synchronization strategies to take advantage of
      additional, non-locking synchronization primitives.  This approach
      might take the form of making increased use of per-CPU or per-thread
      data structures, which require little or no synchronization.  For
      example, through the use of critical sections, it is possible to
      synchronize access to per-CPU caches and queues; through the use of
      per-thread queues, data can be handed off between stack layers
      without the use of synchronization.</p></li>

    <li><p>Increase the opportunities for parallelism through increased
      threading in the network stack.  The current network stack model
      offers the opportunity for substantial parallelism, with outbound
      processing typically taking place in the context of the sending
      thread in the kernel, crypto occurring in crypto worker threads, and
      receive processing taking place in a combination of the receiving
      ithread and a dispatched netisr thread.  While handoffs between
      threads introduce overhead (synchronization, context switching),
      there is the opportunity to increase parallelism in some workloads
      by introducing additional worker threads.  Identifying work that may
      be relocated to new threads must be done carefully, to balance
      overhead and latency concerns, but can pay off by increasing
      effective CPU utilization and hence throughput.  For example,
      introducing additional netisr threads capable of running on more
      than one CPU at a time can increase input parallelism, subject to
      maintaining desirable packet ordering (present in FreeBSD
      8.0).</p></li>
  </ul>
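  <p>As a concrete illustration of the amortization strategy above, the
    following minimal sketch compares per-packet locking with batched
    handoff.  It is written as plain userland C with POSIX threads rather
    than the kernel's locking primitives, and all of the names here
    (struct pkt, struct pktq, send_one(), send_batch()) are hypothetical,
    not actual FreeBSD interfaces:</p>

<pre>
/*
 * Sketch only: amortizing lock cost by handing off a whole batch of
 * packets under a single lock acquisition.
 */
#include &lt;pthread.h&gt;
#include &lt;stddef.h&gt;

struct pkt {
        struct pkt      *next;
        /* ... packet payload ... */
};

struct pktq {
        struct pkt      *head;
        struct pkt      *tail;
};

/* Shared outbound interface queue and the mutex protecting it. */
static struct pktq ifq = { NULL, NULL };
static pthread_mutex_t ifq_lock = PTHREAD_MUTEX_INITIALIZER;

/* Naive approach: one lock/unlock cycle for every packet sent. */
void
send_one(struct pkt *p)
{
        pthread_mutex_lock(&amp;ifq_lock);
        p-&gt;next = NULL;
        if (ifq.tail != NULL)
                ifq.tail-&gt;next = p;
        else
                ifq.head = p;
        ifq.tail = p;
        pthread_mutex_unlock(&amp;ifq_lock);
}

/*
 * Amortized approach: the caller accumulates packets in a thread-local
 * queue with no locking at all, then appends the entire batch to the
 * shared queue under one lock/unlock cycle.
 */
void
send_batch(struct pktq *local)
{
        if (local-&gt;head == NULL)
                return;
        pthread_mutex_lock(&amp;ifq_lock);
        if (ifq.tail != NULL)
                ifq.tail-&gt;next = local-&gt;head;
        else
                ifq.head = local-&gt;head;
        ifq.tail = local-&gt;tail;
        pthread_mutex_unlock(&amp;ifq_lock);
        local-&gt;head = local-&gt;tail = NULL;
}
</pre>

  <p>The fewer times a shared lock must be taken per packet, the less
    synchronization overhead each packet carries; the same pattern applies
    at each layer boundary where streams of packets receive nearly
    identical treatment.</p>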
  <a name="tasks"></a>
  <h2>Project Tasks</h2>

  <table class="tblbasic">
    <tr>
      <th> Task </th>
      <th> Responsible </th>
      <th> Last updated </th>
      <th> Status </th>
      <th> Notes </th>
    </tr>

    <tr>
      <td> Prefer file descriptor reference counts to socket reference
        counts for system calls. </td>
      <td> &a.rwatson; </td>
      <td> 20041124 </td>
      <td> &status.done; </td>
      <td> Sockets and file descriptors both have reference counts in order
        to prevent these objects from being freed while in use.  However,
        when a file descriptor is used to reach the socket, the two
        reference counts are somewhat interchangeable, as either will
        prevent undesired garbage collection.  For socket system calls,
        overhead can be reduced by relying on the file descriptor reference
        count, thus avoiding the synchronized operations necessary to
        modify the socket reference count; the VFS code takes the same
        approach.  This change has been made for most socket system calls
        and has been committed to HEAD (6.x).  It has also been merged to
        RELENG_5 for inclusion in 5.4. </td>
    </tr>

    <tr>
      <td> Mbuf queue library </td>
      <td> &a.rwatson; </td>
      <td> 20041124 </td>
      <td> &status.prototyped; </td>
      <td> In order to facilitate passing queues of packets between
        network stack components, create an mbuf queue primitive, struct
        mbufqueue.  The initial implementation is complete, and the
        primitive is now being applied in several sample cases to determine
        whether it offers the desired semantics and benefits.  The
        implementation can be found in the rwatson_dispatch Perforce
        branch.  Additional work must also be done to explore the
        performance impact of "queues" vs. arrays of mbuf pointers, which
        are likely to behave better from a caching perspective. </td>
    </tr>

    <tr>
      <td> Employ queued dispatch in the interface send API </td>
      <td> &a.rwatson; </td>
      <td> 20041106 </td>
      <td> &status.prototyped; </td>
      <td> An experimental if_start_mbufqueue() interface to struct ifnet
        has been added, which passes an mbuf queue to the device driver for
        processing, avoiding redundant synchronization against the
        interface queue even when additional queueing is required.  This
        has not yet been benchmarked.  A subset change to dispatch a single
        mbuf to a driver has also been prototyped, and benchmarked at a
        several percentage point improvement in packet send rates from user
        space. </td>
    </tr>

    <tr>
      <td> Employ queued dispatch in the interface receive API </td>
      <td> &a.rwatson; </td>
      <td> 20041106 </td>
      <td> &status.new; </td>
      <td> Similar to if_start_mbufqueue(), allow input of a queue of mbufs
        from the device driver into the lowest protocol layers, such as an
        ether_input_mbufqueue(). </td>
    </tr>
    <tr>
      <td> Employ queued dispatch across the netisr dispatch API </td>
      <td> &a.rwatson; </td>
      <td> 20090601 </td>
      <td> &status.done; </td>
      <td> Pull all of the mbufs in the netisr queue into a thread-local
        mbuf queue to avoid repeated lock operations when accessing the
        queue.  This work was completed as part of the netisr2 project, and
        will ship with 8.0-RELEASE. </td>
    </tr>

    <tr>
      <td> Modify the UMA allocator to use critical sections rather than
        mutexes for per-CPU caches. </td>
      <td> &a.rwatson; </td>
      <td> 20050429 </td>
      <td> &status.done; </td>
      <td> The mutexes protecting per-CPU caches require atomic operations
        on SMP systems; since the caches are per-CPU objects, the cost of
        synchronizing access to them can be reduced by using CPU pinning
        and/or critical sections instead.  This change has now been
        committed and will appear in 6.0-RELEASE; it results in a several
        percentage point performance improvement in UDP send from user
        space, and there have been reports of 20%+ improvements in
        allocation-intensive code within the kernel.  In micro-benchmarks,
        the cost of allocation on SMP is dramatically reduced.  (A sketch
        of this technique appears after this table.) </td>
    </tr>

    <tr>
      <td> Modify the malloc(9) allocator to use per-CPU statistics with
        critical sections to protect malloc_type statistics, rather than
        global statistics with a mutex. </td>
      <td> &a.rwatson; </td>
      <td> 20050529 </td>
      <td> &status.done; </td>
      <td> Previously, malloc(9) used a single statistics structure
        protected by a mutex to hold global malloc statistics for each
        malloc type.  This change moves to per-CPU statistics structures,
        which are coalesced when reporting memory allocation statistics to
        the user, and protects them using critical sections.  This reduces
        cache line contention for common allocation types by avoiding
        shared lines, and also reduces synchronization costs by using
        critical sections to synchronize access instead of a mutex.  While
        malloc(9) is less frequently used in the network stack than uma(9),
        it is used for socket address data, so it is on performance-critical
        paths for datagram operations.  This has been committed and
        appeared in 6.0-RELEASE. </td>
    </tr>
    <tr>
      <td> Optimize critical section performance </td>
      <td> &a.jhb; </td>
      <td> 20050404 </td>
      <td> &status.done; </td>
      <td> Critical sections prevent preemption of a thread on a CPU, as
        well as migration of that thread to another CPU, and may be used
        both for synchronizing access to per-CPU data structures and for
        preventing recursion in interrupt processing.  Currently, critical
        sections disable interrupts on the CPU.  In previous versions of
        FreeBSD (4.x and before), optimizations were present that allowed
        for software interrupt disabling, which lowers the cost of critical
        sections in the common case by avoiding expensive microcode
        operations on the CPU.  By restoring this model, or a variation on
        it, critical sections can be made substantially cheaper to enter.
        In particular, this change lowers the cost of critical sections on
        UP to approximately that of a mutex, meaning that optimizations on
        SMP to use critical sections instead of mutexes will not harm UP
        performance.  This change has now been committed, and appeared in
        6.0-RELEASE. </td>
    </tr>

    <tr>
      <td> Normalize the socket and protocol control block reference model </td>
      <td> &a.rwatson; </td>
      <td> 20060401 </td>
      <td> &status.done; </td>
      <td> The socket/protocol boundary is characterized by a set of data
        structures and API interfaces, where the socket code acts as both
        a consumer and a service library for protocols.  This task is to
        normalize the reference model by which protocol state is attached
        to and detached from socket state in order to strengthen
        invariants, allowing the removal of countless unused code paths
        (especially error handling), the removal of unnecessary locking in
        TCP, and a general improvement in the structure of the code.  This
        serves the immediate purpose of improving the quality and
        performance of this code, and is also necessary for future
        optimization work.  These changes have been prototyped in Perforce
        and have now been merged to 7-CURRENT.  They will be merged into
        RELENG_6 once they have been thoroughly tested. </td>
    </tr>

    <tr>
      <td> Add true inpcb reference count support </td>
      <td> &a.mohans;, &a.rwatson;, &a.peter; </td>
      <td> 20081208 </td>
      <td> &status.done; </td>
      <td> Historically, the inbound TCP and UDP socket paths relied on the
        global pcbinfo locks to prevent PCBs from being garbage collected
        by another thread while in use.  This set of changes introduces a
        true reference model for PCBs so that the global lock can be
        released during inbound processing; it appears in
        8.0-RELEASE. </td>
    </tr>

    <tr>
      <td> Fine-grained locking for UNIX domain sockets </td>
      <td> &a.rwatson; </td>
      <td> 20070226 </td>
      <td> &status.done; </td>
      <td> UNIX domain sockets in FreeBSD 5.x and 6.x use a single global
        subsystem lock.  This is sufficient to allow them to run without
        Giant, but results in contention when large numbers of processors
        simultaneously operate on UNIX domain sockets.  This task
        introduced per-protocol control block locks in order to reduce
        contention on the larger subsystem lock; the results appeared in
        7.0-RELEASE. </td>
    </tr>

    <tr>
      <td> Multiple netisr threads </td>
      <td> &a.rwatson; </td>
      <td> 20090601 </td>
      <td> &status.done; </td>
      <td> Historically, the BSD network stack has used a single network
        software interrupt context for deferred network processing.  With
        the introduction of multiprocessing, this became a single software
        interrupt thread.  In FreeBSD 8.0, multiple netisr threads are now
        supported, up to the number of CPUs present in the system. </td>
    </tr>

  </table>
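  <p>As a rough illustration of the critical-section technique used in the
    UMA and malloc(9) tasks above, the following schematic, kernel-style
    sketch shows a strictly per-CPU free list protected by a critical
    section alone.  This is not the actual uma(9) implementation: struct
    item, pcpu_cache, cache_alloc(), and cache_free() are hypothetical
    names, and the slow path and error handling are omitted.</p>

<pre>
/*
 * Sketch only: a per-CPU cache synchronized with critical sections.
 * Because each list is touched only by its owning CPU, no atomic
 * operations or mutexes are needed on this path; the critical section
 * just prevents preemption and migration during the update.
 */
#include &lt;sys/param.h&gt;
#include &lt;sys/systm.h&gt;
#include &lt;sys/pcpu.h&gt;

struct item {
        struct item     *next;
};

/* One private free list per CPU; never shared across CPUs. */
static struct item *pcpu_cache[MAXCPU];

static struct item *
cache_alloc(void)
{
        struct item *it;

        critical_enter();               /* no preemption, no migration */
        it = pcpu_cache[curcpu];
        if (it != NULL)
                pcpu_cache[curcpu] = it-&gt;next;
        critical_exit();
        return (it);                    /* NULL: fall back to a slow path */
}

static void
cache_free(struct item *it)
{
        critical_enter();
        it-&gt;next = pcpu_cache[curcpu];
        pcpu_cache[curcpu] = it;
        critical_exit();
}
</pre>

  <p>A mutex would provide mutual exclusion against other CPUs, which
    strictly per-CPU data does not need; the critical section only has to
    keep the current thread on its CPU for the duration of the update,
    which is why the change reduced allocation cost so dramatically in
    micro-benchmarks.</p>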
  <a name="cluster"></a>
  <h2>Netperf Cluster</h2>

  <p>Through the generous donations and investment of Sentex Data
    Communications, FreeBSD Systems, IronPort Systems, and the FreeBSD
    Foundation, a network performance testbed has been created in Ontario,
    Canada, for use by FreeBSD developers working in the area of network
    performance.  A similar cluster, made possible through the generous
    donation of Verio, is being prepared in Virginia, US, for use in more
    general SMP performance work.  Each cluster consists of several SMP
    systems interconnected with Gigabit Ethernet such that relatively
    arbitrary topologies can be constructed in order to test host-to-host,
    IP forwarding, and bridging performance scenarios.  Systems are network
    booted and have serial consoles and remote power control, in order to
    maximize availability and minimize configuration overhead.  These
    systems are available on a check-out basis to FreeBSD developers
    working on the Netperf project and in related areas for
    experimentation and performance measurement.</p>

  <p><a href="cluster.html">More detailed information on the netperf
    cluster can be found by following this link.</a></p>
  <a name="papers"></a>
  <h2>Papers and Reports</h2>

  <p>The following paper(s) have been produced by or are related to the
    Netperf Project:</p>

  <ul>
    <li><p><a href="http://www.watson.org/~robert/freebsd/netperf/20051027-eurobsdcon2005-netperf.pdf">"Introduction to Multithreading and Multiprocessing in the FreeBSD SMPng Network Stack", EuroBSDCon 2005, Basel, Switzerland</a>.</p></li>
  </ul>

  <p>Additional papers can be found on the <a href="../../smp/">SMPng
    Project</a> web page.</p>

  <a name="links"></a>
  <h2>Links</h2>

  <p>Some useful links relating to the netperf work:</p>

  <ul>
    <li><p><a href="../../smp/">SMPng Project</a> -- Project to introduce
      finer-grained locking in the FreeBSD kernel.</p></li>

    <li><p><a href="http://www.watson.org/~robert/freebsd/netperf/">Robert
      Watson's netperf web page</a> -- Web page that includes a change log
      and performance measurement/debugging information.</p></li>
  </ul>

  &footer;
</body>
</html>