1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
|
<!DOCTYPE HTML PUBLIC "-//FreeBSD//DTD HTML 4.01 Transitional-Based Extension//EN" [
<!ENTITY base CDATA "../..">
<!ENTITY date "$FreeBSD: www/en/projects/netperf/index.sgml,v 1.20 2006/08/19 21:20:43 hrs Exp $">
<!ENTITY title "FreeBSD Network Performance Project (netperf)">
<!ENTITY email 'mux'>
<!ENTITY % navinclude.developers "INCLUDE">
<!ENTITY status.na "<font color=green>N/A</font>">
<!ENTITY status.done "<font color=green>Done</font>">
<!ENTITY status.prototyped "<font color=blue>Prototyped</font>">
<!ENTITY status.head "<font color=orange>Merged to HEAD; RELENG_5 candidate</font>">
<!ENTITY status.new "<font color=red>New task</font>">
<!ENTITY status.unknown "<font color=red>Unknown</font>">
<!ENTITY % developers PUBLIC "-//FreeBSD//ENTITIES FreeBSD Developers Entities//EN"> %developers;
]>
<html>
&header;
<h2>Contents</h2>
<ul>
<li><a href="#goal">Project Goal</a></li>
<li><a href="#strategies">Project Strategies</a></li>
<li><a href="#tasks">Project Tasks</a></li>
<li><a href="#cluster">Netperf Cluster</a></li>
<li><a href="#papers">Papers and Reports</a><li>
<li><a href="#links">Links</a></li>
</ul>
<a name="goal"></a>
<h2>Project Goal</h2>
<p>The netperf project is working to enhance the performance of the
FreeBSD network stack. This work grew out of the
<a href="../../smp">SMPng Project</a>, which moved the FreeBSD kernel from
a "Giant Lock" to more fine-grained locking and multi-threading. SMPng
offered both performance improvement and degradation for the network
stack, improving parallelism and preemption, but substantially
increasing per-packet processing costs. The netperf project is
primarily focussed on further improving parallelism in network
processing while reducing the SMP synchronization overhead. This in
turn will lead to higher processing throughput and lower processing
latency.</p>
<a name="strategies"></a>
<h2>Project Strategies</h2>
<p>Robert Watson</p>
<p>The two primary focuses of this work are to increase parallelism
while decreasing overhead. Several activities are being performed that
will work toward these goals:</p>
<ul>
<li><p>The Netperf project has completed locking work for all components
of the network stack; as of FreeBSD 7.0 we have removed non-MPSAFE
protocol shims, and as of FreeBSD 8.0 we have removed non-MPSAFE
device driver shims.</p></li>
<li><p>Optimize locking strategies to find better balances between
locking granularity and locking overhead. In the first cut at locking
for the kernel, the goal was to adopt a medium-grained locking
approach based on data locking. This approach identifies critical
data structures, and inserts new locks and locking operations to
protect those data structures. Depending on the data model of the
code being protected, this may lead to the introduction of a
substantial number of locks offering unnecessary granularity, where
the overhead of locking overwhelms the benefits of available
parallelism and preemption. By selectively reducing granularity, it
is possible to improve performance by decreasing locking overhead.
</p></li>
<li><p>Amortize the cost of locking by processing queues of packets or
events. While the cost of individual synchronization operations may
be high, it is possible to amortize the cost of synchronization
operations by grouping processing of similar data (packets, events)
under the same protection. This approach focuses on identifying
places where similar locking occurs frequently in succession, and
introducing queueing or coalescing of lock operations across the
body of the work. For example, when a series of packets is inserted
into an outgoing interface queue, a basic locking approach would
lock the queue for each insert operation, unlock it, and hand off to
the interface driver to begin the send, repeating this sequence as
required. With a coalesced approach, the caller would pass off a
queue of packets in order to reduce the locking overhead, as well as
eliminate unnecessary synchronization due to the queue being
thread-local. This approach can be applied at several levels in the
stack, and is particularly applicable at lower levels of the stack
where streams of packets require almost identical processing.
</p></li>
<li><p>Introduce new synchronization strategies with reduced overhead
relative to traditional strategies. Most traditional strategies
employ a combination of interrupt disabling and atomic operations to
achieve mutual exclusion and non-preemption guarantees. However,
these operations are expensive on modern CPUs, leading to the desire
for cheaper primitives with weaker semantics. For example, the
application of uni-processor primitives where synchronization is
required only on a single processor, and optimizations to critical
section primitives to avoid the need for interrupt disabling.
</p></li>
<li><p>Modify synchronization strategies to take advantage of
additional, non-locking, synchronization primitives. This approach
might take the form of making increased use of per-CPU or per-thread
data structures, which require little or no synchronization. For
example, through the use of critical sections, it is possible to
synchronize access to per-CPU caches and queues. Through the use of
per-thread queues, data can be handed off between stack layers
without the use of synchronization.</p></li>
<li><p>Increase the opportunities for parallelism through increased
threading in the network stack. The current network stack model
offers the opportunity for substantial parallelism, with outbound
processing typically taking place in the context of the sending
thread in kernel, crypto occurring in crypto worker threads, and
receive processing taking place in a combination of the receiving
ithread and dispatched netisr thread. While handoffs between
threads introduces overhead (synchronization, context switching),
there is the opportunity to increase parallelism in some workloads
through introducing additional worker threads. Identifying work
that may be relocated to new threads must be done carefully to
balance overhead, and latency concerns, but can pay off by
increasing effective CPU utilization and hence throughput. For
example, introducing additional netisr threads capable of running on
more than one CPU at a time can increase input parallelism, subject
to maintaining desirable packet ordering (present in FreeBSD
8.0).</p></li>
</ul>
<a name="tasks"></a>
<h2>Project Tasks</h2>
<table class="tblbasic">
<tr>
<th> Task </th>
<th> Responsible </th>
<th> Last updated </th>
<th> Status </th>
<th> Notes </th>
</tr>
<tr>
<td> Prefer file descriptor reference counts to socket reference
counts for system calls. </td>
<td> &a.rwatson; </td>
<td> 20041124 </td>
<td> &status.done; </td>
<td> Sockets and file descriptors both have reference counts in order
to prevent these objects from being free'd while in use. However,
if a file descriptor is used to reach the socket, the reference
counts are somewhat interchangeable, as either will prevent
undesired garbage collection. For socket system calls, overhead
can be reduced by relying on the file descriptor reference count,
thus avoiding the synchronized operations necessary to modify the
socket reference count, an approach also taken in the VFS code.
This change has been made for most socket system calls, and has
been committed to HEAD (6.x). It has also been merged to RELENG_5
for inclusion in 5.4.</td>
</tr>
<tr>
<td> Mbuf queue library </td>
<td> &a.rwatson; </td>
<td> 20041124 </td>
<td> &status.prototyped; </td>
<td> In order to facilitate passing off queues of packets between
network stack components, create an mbuf queue primitive, struct
mbufqueue. The initial implementation is complete, and the
primitive is now being applied in several sample cases to determine
whether it offers the desired semantics and benefits. The
implementation can be found in the rwatson_dispatch Perforce
branch. Additional work must also be done to explore the
performance impact of "queues" vs arrays of mbuf pointers, which
are likely to behave better from a caching perspective. </td>
</tr>
<tr>
<td> Employ queued dispatch in interface send API </td>
<td> &a.rwatson; </td>
<td> 20041106 </td>
<td> &status.prototyped; </td>
<td> An experimental if_start_mbufqueue() interface to struct ifnet
has been added, which passes an mbuf queue to the device driver for
processing, avoiding redundant synchronization against the
interface queue, even in the event that additional queueing is
required. This has not yet been benchmarked. A subset change to
dispatch a single mbuf to a driver has also been prototyped, and
benchmarked at a several percentage point improvement in packet send
rates from user space. </td>
</tr>
<tr>
<td> Employ queued dispatch in the interface receive API </td>
<td> &a.rwatson; </td>
<td> 20041106 </td>
<td> &status.new; </td>
<td> Similar to if_start_mbufqueue, allow input of a queue of mbufs
from the device driver into the lowest protocol layers, such as
ether_input_mbufqueue. </td>
</tr>
<tr>
<td> Employ queued dispatch across netisr dispatch API </td>
<td> &a.rwatson; </td>
<td> 20090601 </td>
<td> &status.done </td>
<td> Pull all of the mbufs in the netisr queue into a thread-local
mbuf queue to avoid repeated lock operations to access the queue.
This work was completed as part of the netisr2 project, and will
ship with 8.0-RELEASE. </td>
</tr>
<tr>
<td> Modify UMA allocator to use critical sections not mutexes for
per-CPU caches. </td>
<td> &a.rwatson; </td>
<td> 20050429 </td>
<td> &status.done; </td>
<td> The mutexes protecting per-CPU caches require atomic operations
on SMP systems; as they are per-CPU objects, the cost of
synchronizing access to the caches can be reduced by combining
CPU pinning and/or critical sections instead. This change has now
been committed and will appear in 6.0-RELEASE; it results in a
several percentage performance in UDP send from user space, and
there have been reports of 20%+ improvements in allocation
intensive code within the kernel. In micro-benchmarks, the cost
of allocation on SMP is dramatically reduced. </td>
</tr>
<tr>
<td> Modify malloc(9) allocator to use per-CPU statistics with
critical sections to protect malloc_type statistics rather than
global statistics with a mutex. </td>
<td> &a.rwatson; </td>
<td> 20050529 </td>
<td> &status.done; </td>
<td> Previously, malloc(9) used a single statistics structure
protected by a mutex to hold global malloc statistics for each
malloc type. This change moves to per-CPU statistics structures,
which are coalesced when reporting memory allocation statistics to
the user, and protects them using critical sections. This reduces
cache line contention for common allocation types by avoiding
shared lines, and also reduces synchronization costs by using
critical sections to synchronize access instead of a mutex. While
malloc(9) is less frequently used in the network stack than uma(9),
it is used for socket address data, so is on performance critical
paths for datagram operations. This has been committed and appeared
6.0-RELEASE. </td>
</tr>
<tr>
<td> Optimize critical section performance </td>
<td> &a.jhb; </td>
<td> 20050404 </td>
<td> &status.done; </td>
<td> Critical sections prevent preemption of a thread on a CPU, as
well as preventing migration of that thread to another CPU, and
maybe used for synchronizing access to per-CPU data structures, as
well as preventing recursion in interrupt processing. Currently,
critical sections disable interrupts on the CPU. In previous
versions of FreeBSD (4.x and before), optimizations were present
that allowed for software interrupt disabling, which lowers the
cost of critical sections in the common case by avoiding expensive
microcode operations on the CPU. By restoring this model, or a
variation on it, critical sections can be made substantially
cheaper to enter. In particular, this change lowers the cost
of critical sections on UP such that it is approximately the same
cost as a mutex, meaning that optimizations on SMP to use critical
sections instead of mutexes will not harm UP performance. This
change has now been committed, and appeared in 6.0-RELEASE. </td>
</tr>
<tr>
<td> Normalize socket and protocol control block reference model </td>
<td> &a.rwatson; </td>
<td> 20060401 </td>
<td> &status.done; </td>
<td> The socket/protocol boundary is characterized by a set of data
structures and API interfaces, where the socket code acts as both
a consumer and a service library for protocols. This task is to
normalize the reference model by which protocol state is attached
to and detached from socket state in order to strengthen
invariants, allowing the removal of countless unused code paths
(especially error handling), the removal of unnecessary locking
in TCP, and a general improve the structure of the code. This
serves both the immediate purpose of improving the quality and
performance of this code, as well as being necessary for future
optimization work. These changes have been prototyped in
Perforce, and now merged to 7-CURRENT. They will be merged into
RELENG_6 once they have been thoroughly tested.</td>
</tr>
<tr>
<td> Add true inpcb reference count support </td>
<td> &a.mohans;, &a.rwatson;, &a.peter; </td>
<td> 20081208 </td>
<td> &status.done; </td>
<td> Historically, the in-bound TCP and UDP socket paths relied on
global pcbinfo info locks to prevent PCBs being delivered to from
being garbage collected by another thread while in use. This set
of changes introduces a true reference model for PCBs so that the
global lock can be released during in-bound process, and appear
in 8.0-RELEASE./td>
</tr>
<tr>
<td> Fine-grained locking for UNIX domain sockets </td>
<td> &a.rwatson; </td>
<td> 20070226 </td>
<td> &status.done; </td>
<td> UNIX domain sockets in FreeBSD 5.x and 6.x use a single global
subsystem lock. This is sufficient to allow it to run without
Giant, but results in contention with large numbers of processors
simultaneously operating on UNIX domain sockets. This task
introduced per-protocol control block locks in order to reduce
contention on a larger subsystem lock, and the results appeared in
7.0-RELEASE. </td>
</tr>
<tr>
<td> Multiple netisr threads </td>
<td> &a.rwatson; </td>
<td> 20090601 </td>
<td> &status.done; </td>
<td> Historically, the BSD network stack has used a single network
software interrupt context, for deferred network processing. With
the introduction of multi-processing, this became a single
software interrupt thread. In FreeBSD 8.0, multiple netisr
threads are now supported, up to the number of CPUs present in the
system.</td>
</tr>
</table>
<a name="cluster"></a>
<h2>Netperf Cluster</h2>
<p>Through the generous donations and investment of Sentex Data
Communications, FreeBSD Systems, IronPort Systems, and the FreeBSD
Foundation, a network performance testbed has been created in Ontario,
Canada for use by FreeBSD developers working in the area of network
performance. A similar cluster, made possible through the generous
donation of Verio, is being prepared for use in more general SMP
performance work in Virginia, US. Each cluster consists of several SMP
systems inter-connected with giga-bit ethernet such that relatively
arbitrary topologies can be constructed in order to test host-host, IP
forwarding, and bridging performance scenarios. Systems are network
booted, have serial console, and remote power, in order to maximize
availability and minimize configuration overhead. These systems are
available on a check-out basis for experimentation and performance
measurement to FreeBSD developers working on the Netperf project, and
in related areas.</p>
<p><a href="cluster.html">More detailed information on the netperf
cluster can be found by following this link.</a></p>
<a name="papers"></a>
<h2>Papers and Reports</h2>
<p>The following paper(s) have been produced by or are related to the
Netperf Project:</p>
<ul>
<li><p><a href="http://www.watson.org/~robert/freebsd/netperf/20051027-eurobsdcon2005-netperf.pdf">"Introduction to Multithreading and Multiprocessing in the FreeBSD SMPng Network Stack", EuroBSDCon 2005, Basel, Switzerland</a>.</p></li>
</ul>
<p>Additional papers can be found on the <a href="../../smp/">SMPng
Project</a> web page.</p>
<a name="links"></a>
<h2>Links</h2>
<p>Some useful links relating to the netperf work:</p>
<ul>
<li><p><a href="../../smp/">SMPng Project</a> -- Project to introduce
finer grained locking in the FreeBSD kernel.</p></li>
<li><p><a href="http://www.watson.org/~robert/freebsd/netperf/">Robert
Watson's netperf web page</a> -- Web page that includes a change log
and performance measurement/debugging information.</p></li>
</ul>
&footer;
</body>
</html>
|