Diffstat (limited to 'share/doc')
-rw-r--r--share/doc/IPv6/IMPLEMENTATION2392
-rw-r--r--share/doc/IPv6/Makefile7
-rw-r--r--share/doc/Makefile25
-rw-r--r--share/doc/bind9/Makefile31
-rw-r--r--share/doc/legal/Makefile8
-rw-r--r--share/doc/legal/intel_ipw/Makefile7
-rw-r--r--share/doc/legal/intel_iwi/Makefile7
-rw-r--r--share/doc/legal/intel_iwn/Makefile7
-rw-r--r--share/doc/legal/intel_wpi/Makefile8
-rw-r--r--share/doc/llvm/Makefile15
-rw-r--r--share/doc/llvm/clang/Makefile13
-rw-r--r--share/doc/papers/Makefile19
-rw-r--r--share/doc/papers/beyond4.3/Makefile9
-rw-r--r--share/doc/papers/beyond4.3/beyond43.ms519
-rw-r--r--share/doc/papers/bufbio/Makefile14
-rw-r--r--share/doc/papers/bufbio/bio.ms830
-rw-r--r--share/doc/papers/bufbio/bufsize.eps479
-rw-r--r--share/doc/papers/contents/Makefile8
-rw-r--r--share/doc/papers/contents/contents.ms218
-rw-r--r--share/doc/papers/devfs/Makefile9
-rw-r--r--share/doc/papers/devfs/paper.me1277
-rw-r--r--share/doc/papers/diskperf/Makefile11
-rw-r--r--share/doc/papers/diskperf/abs.ms176
-rw-r--r--share/doc/papers/diskperf/appendix.ms102
-rw-r--r--share/doc/papers/diskperf/conclusions.ms128
-rw-r--r--share/doc/papers/diskperf/equip.ms177
-rw-r--r--share/doc/papers/diskperf/methodology.ms111
-rw-r--r--share/doc/papers/diskperf/motivation.ms95
-rw-r--r--share/doc/papers/diskperf/results.ms337
-rw-r--r--share/doc/papers/diskperf/tests.ms109
-rw-r--r--share/doc/papers/fsinterface/Makefile9
-rw-r--r--share/doc/papers/fsinterface/abstract.ms73
-rw-r--r--share/doc/papers/fsinterface/fsinterface.ms1176
-rw-r--r--share/doc/papers/fsinterface/slides.t318
-rw-r--r--share/doc/papers/hwpmc/Makefile8
-rw-r--r--share/doc/papers/hwpmc/hwpmc.ms34
-rw-r--r--share/doc/papers/jail/Makefile14
-rw-r--r--share/doc/papers/jail/future.ms104
-rw-r--r--share/doc/papers/jail/implementation.ms126
-rw-r--r--share/doc/papers/jail/jail01.eps234
-rw-r--r--share/doc/papers/jail/jail01.fig86
-rw-r--r--share/doc/papers/jail/mgt.ms216
-rw-r--r--share/doc/papers/jail/paper.ms438
-rw-r--r--share/doc/papers/kernmalloc/Makefile14
-rw-r--r--share/doc/papers/kernmalloc/alloc.fig115
-rw-r--r--share/doc/papers/kernmalloc/appendix.ms275
-rw-r--r--share/doc/papers/kernmalloc/appendix.t137
-rw-r--r--share/doc/papers/kernmalloc/kernmalloc.t653
-rw-r--r--share/doc/papers/kernmalloc/spell.ok57
-rw-r--r--share/doc/papers/kernmalloc/usage.tbl75
-rw-r--r--share/doc/papers/kerntune/0.t129
-rw-r--r--share/doc/papers/kerntune/1.t49
-rw-r--r--share/doc/papers/kerntune/2.t234
-rw-r--r--share/doc/papers/kerntune/3.t290
-rw-r--r--share/doc/papers/kerntune/4.t99
-rw-r--r--share/doc/papers/kerntune/Makefile14
-rw-r--r--share/doc/papers/kerntune/fig2.pic57
-rw-r--r--share/doc/papers/malloc/Makefile10
-rw-r--r--share/doc/papers/malloc/abs.ms35
-rw-r--r--share/doc/papers/malloc/alternatives.ms45
-rw-r--r--share/doc/papers/malloc/conclusion.ms48
-rw-r--r--share/doc/papers/malloc/implementation.ms225
-rw-r--r--share/doc/papers/malloc/intro.ms74
-rw-r--r--share/doc/papers/malloc/kernel.ms56
-rw-r--r--share/doc/papers/malloc/malloc.ms72
-rw-r--r--share/doc/papers/malloc/performance.ms113
-rw-r--r--share/doc/papers/malloc/problems.ms54
-rw-r--r--share/doc/papers/newvm/0.t86
-rw-r--r--share/doc/papers/newvm/1.t378
-rw-r--r--share/doc/papers/newvm/Makefile9
-rw-r--r--share/doc/papers/newvm/a.t240
-rw-r--r--share/doc/papers/newvm/spell.ok56
-rw-r--r--share/doc/papers/relengr/0.t92
-rw-r--r--share/doc/papers/relengr/1.t69
-rw-r--r--share/doc/papers/relengr/2.t146
-rw-r--r--share/doc/papers/relengr/3.t390
-rw-r--r--share/doc/papers/relengr/Makefile15
-rw-r--r--share/doc/papers/relengr/ref.bib26
-rw-r--r--share/doc/papers/relengr/spell.ok15
-rw-r--r--share/doc/papers/sysperf/0.t247
-rw-r--r--share/doc/papers/sysperf/1.t81
-rw-r--r--share/doc/papers/sysperf/2.t258
-rw-r--r--share/doc/papers/sysperf/3.t694
-rw-r--r--share/doc/papers/sysperf/4.t776
-rw-r--r--share/doc/papers/sysperf/5.t287
-rw-r--r--share/doc/papers/sysperf/6.t70
-rw-r--r--share/doc/papers/sysperf/7.t164
-rw-r--r--share/doc/papers/sysperf/Makefile12
-rw-r--r--share/doc/papers/sysperf/a1.t668
-rw-r--r--share/doc/papers/sysperf/a2.t117
-rw-r--r--share/doc/papers/sysperf/appendix.ms1040
-rw-r--r--share/doc/papers/timecounter/Makefile20
-rw-r--r--share/doc/papers/timecounter/fig1.eps227
-rw-r--r--share/doc/papers/timecounter/fig2.eps150
-rw-r--r--share/doc/papers/timecounter/fig3.eps126
-rw-r--r--share/doc/papers/timecounter/fig4.eps259
-rw-r--r--share/doc/papers/timecounter/fig5.eps211
-rw-r--r--share/doc/papers/timecounter/gps.ps1488
-rw-r--r--share/doc/papers/timecounter/intr.ps1501
-rw-r--r--share/doc/papers/timecounter/timecounter.ms1076
-rw-r--r--share/doc/papers/timecounter/tmac.usenix953
-rw-r--r--share/doc/psd/01.cacm/Makefile16
-rw-r--r--share/doc/psd/01.cacm/p.mac31
-rw-r--r--share/doc/psd/01.cacm/p1567
-rw-r--r--share/doc/psd/01.cacm/p2448
-rw-r--r--share/doc/psd/01.cacm/p3190
-rw-r--r--share/doc/psd/01.cacm/p4524
-rw-r--r--share/doc/psd/01.cacm/p5235
-rw-r--r--share/doc/psd/01.cacm/p672
-rw-r--r--share/doc/psd/01.cacm/ref.bib113
-rw-r--r--share/doc/psd/02.implement/Makefile17
-rw-r--r--share/doc/psd/02.implement/fig1.pic100
-rw-r--r--share/doc/psd/02.implement/fig2.pic110
-rw-r--r--share/doc/psd/02.implement/implement1282
-rw-r--r--share/doc/psd/02.implement/ref.bib54
-rw-r--r--share/doc/psd/03.iosys/Makefile8
-rw-r--r--share/doc/psd/03.iosys/iosys1086
-rw-r--r--share/doc/psd/04.uprog/Makefile8
-rw-r--r--share/doc/psd/04.uprog/p.mac71
-rw-r--r--share/doc/psd/04.uprog/p082
-rw-r--r--share/doc/psd/04.uprog/p188
-rw-r--r--share/doc/psd/04.uprog/p2275
-rw-r--r--share/doc/psd/04.uprog/p3469
-rw-r--r--share/doc/psd/04.uprog/p4600
-rw-r--r--share/doc/psd/04.uprog/p5577
-rw-r--r--share/doc/psd/04.uprog/p6361
-rw-r--r--share/doc/psd/04.uprog/p862
-rw-r--r--share/doc/psd/04.uprog/p9680
-rw-r--r--share/doc/psd/05.sysman/0.t292
-rw-r--r--share/doc/psd/05.sysman/1.0.t56
-rw-r--r--share/doc/psd/05.sysman/1.1.t216
-rw-r--r--share/doc/psd/05.sysman/1.2.t273
-rw-r--r--share/doc/psd/05.sysman/1.3.t254
-rw-r--r--share/doc/psd/05.sysman/1.4.t137
-rw-r--r--share/doc/psd/05.sysman/1.5.t225
-rw-r--r--share/doc/psd/05.sysman/1.6.t135
-rw-r--r--share/doc/psd/05.sysman/1.7.t100
-rw-r--r--share/doc/psd/05.sysman/2.0.t83
-rw-r--r--share/doc/psd/05.sysman/2.1.t138
-rw-r--r--share/doc/psd/05.sysman/2.2.t470
-rw-r--r--share/doc/psd/05.sysman/2.3.t413
-rw-r--r--share/doc/psd/05.sysman/2.4.t174
-rw-r--r--share/doc/psd/05.sysman/2.5.t39
-rw-r--r--share/doc/psd/05.sysman/Makefile10
-rw-r--r--share/doc/psd/05.sysman/a.t235
-rw-r--r--share/doc/psd/05.sysman/spell.ok332
-rw-r--r--share/doc/psd/06.Clang/Clang.ms4575
-rw-r--r--share/doc/psd/06.Clang/Makefile9
-rw-r--r--share/doc/psd/12.make/Makefile8
-rw-r--r--share/doc/psd/12.make/stubs9
-rw-r--r--share/doc/psd/12.make/tutorial.ms3747
-rw-r--r--share/doc/psd/13.rcs/Makefile6
-rw-r--r--share/doc/psd/13.rcs/Makefile.inc5
-rw-r--r--share/doc/psd/13.rcs/rcs/Makefile7
-rw-r--r--share/doc/psd/13.rcs/rcs_func/Makefile6
-rw-r--r--share/doc/psd/15.yacc/Makefile16
-rw-r--r--share/doc/psd/15.yacc/ref.bib71
-rw-r--r--share/doc/psd/15.yacc/ss0238
-rw-r--r--share/doc/psd/15.yacc/ss1175
-rw-r--r--share/doc/psd/15.yacc/ss10221
-rw-r--r--share/doc/psd/15.yacc/ss1163
-rw-r--r--share/doc/psd/15.yacc/ss2190
-rw-r--r--share/doc/psd/15.yacc/ss3141
-rw-r--r--share/doc/psd/15.yacc/ss4367
-rw-r--r--share/doc/psd/15.yacc/ss5339
-rw-r--r--share/doc/psd/15.yacc/ss6183
-rw-r--r--share/doc/psd/15.yacc/ss7161
-rw-r--r--share/doc/psd/15.yacc/ss8130
-rw-r--r--share/doc/psd/15.yacc/ss9206
-rw-r--r--share/doc/psd/15.yacc/ss_94
-rw-r--r--share/doc/psd/15.yacc/ssa150
-rw-r--r--share/doc/psd/15.yacc/ssb147
-rw-r--r--share/doc/psd/15.yacc/ssc347
-rw-r--r--share/doc/psd/15.yacc/ssd76
-rw-r--r--share/doc/psd/16.lex/Makefile9
-rw-r--r--share/doc/psd/16.lex/lex.ms2345
-rw-r--r--share/doc/psd/17.m4/Makefile8
-rw-r--r--share/doc/psd/17.m4/m4.ms973
-rw-r--r--share/doc/psd/18.gprof/Makefile14
-rw-r--r--share/doc/psd/18.gprof/abstract.me66
-rw-r--r--share/doc/psd/18.gprof/gathering.me231
-rw-r--r--share/doc/psd/18.gprof/header.me38
-rw-r--r--share/doc/psd/18.gprof/intro.me81
-rw-r--r--share/doc/psd/18.gprof/postp.me190
-rw-r--r--share/doc/psd/18.gprof/postp1.pic54
-rw-r--r--share/doc/psd/18.gprof/postp2.pic56
-rw-r--r--share/doc/psd/18.gprof/postp3.pic51
-rw-r--r--share/doc/psd/18.gprof/pres1.pic56
-rw-r--r--share/doc/psd/18.gprof/pres2.pic52
-rw-r--r--share/doc/psd/18.gprof/present.me306
-rw-r--r--share/doc/psd/18.gprof/profiling.me115
-rw-r--r--share/doc/psd/18.gprof/refs.me63
-rw-r--r--share/doc/psd/20.ipctut/Makefile14
-rw-r--r--share/doc/psd/20.ipctut/dgramread.c83
-rw-r--r--share/doc/psd/20.ipctut/dgramsend.c80
-rw-r--r--share/doc/psd/20.ipctut/fig2.pic77
-rw-r--r--share/doc/psd/20.ipctut/fig2.xfig100
-rw-r--r--share/doc/psd/20.ipctut/fig3.pic69
-rw-r--r--share/doc/psd/20.ipctut/fig3.xfig100
-rw-r--r--share/doc/psd/20.ipctut/fig8.pic79
-rw-r--r--share/doc/psd/20.ipctut/fig8.xfig116
-rw-r--r--share/doc/psd/20.ipctut/pipe.c74
-rw-r--r--share/doc/psd/20.ipctut/socketpair.c77
-rw-r--r--share/doc/psd/20.ipctut/strchkread.c106
-rw-r--r--share/doc/psd/20.ipctut/streamread.c102
-rw-r--r--share/doc/psd/20.ipctut/streamwrite.c81
-rw-r--r--share/doc/psd/20.ipctut/tutor.me939
-rw-r--r--share/doc/psd/20.ipctut/udgramread.c80
-rw-r--r--share/doc/psd/20.ipctut/udgramsend.c68
-rw-r--r--share/doc/psd/20.ipctut/ustreamread.c96
-rw-r--r--share/doc/psd/20.ipctut/ustreamwrite.c71
-rw-r--r--share/doc/psd/21.ipc/0.t93
-rw-r--r--share/doc/psd/21.ipc/1.t106
-rw-r--r--share/doc/psd/21.ipc/2.t714
-rw-r--r--share/doc/psd/21.ipc/3.t411
-rw-r--r--share/doc/psd/21.ipc/4.t515
-rw-r--r--share/doc/psd/21.ipc/5.t1668
-rw-r--r--share/doc/psd/21.ipc/Makefile9
-rw-r--r--share/doc/psd/21.ipc/spell.ok347
-rw-r--r--share/doc/psd/22.rpcgen/Makefile8
-rw-r--r--share/doc/psd/22.rpcgen/rpcgen.ms1301
-rw-r--r--share/doc/psd/22.rpcgen/stubs3
-rw-r--r--share/doc/psd/23.rpc/Makefile9
-rw-r--r--share/doc/psd/23.rpc/rpc.prog.ms2686
-rw-r--r--share/doc/psd/23.rpc/stubs3
-rw-r--r--share/doc/psd/24.xdr/Makefile8
-rw-r--r--share/doc/psd/24.xdr/stubs3
-rw-r--r--share/doc/psd/24.xdr/xdr.nts.ms1968
-rw-r--r--share/doc/psd/25.xdrrfc/Makefile8
-rw-r--r--share/doc/psd/25.xdrrfc/stubs3
-rw-r--r--share/doc/psd/25.xdrrfc/xdr.rfc.ms1060
-rw-r--r--share/doc/psd/26.rpcrfc/Makefile8
-rw-r--r--share/doc/psd/26.rpcrfc/rpc.rfc.ms1304
-rw-r--r--share/doc/psd/26.rpcrfc/stubs3
-rw-r--r--share/doc/psd/27.nfsrpc/Makefile8
-rw-r--r--share/doc/psd/27.nfsrpc/nfs.rfc.ms1374
-rw-r--r--share/doc/psd/27.nfsrpc/stubs3
-rw-r--r--share/doc/psd/28.cvs/Makefile10
-rw-r--r--share/doc/psd/Makefile41
-rw-r--r--share/doc/psd/contents/Makefile8
-rw-r--r--share/doc/psd/contents/contents.ms289
-rw-r--r--share/doc/psd/title/Makefile7
-rw-r--r--share/doc/psd/title/Title132
-rw-r--r--share/doc/smm/01.setup/0.t133
-rw-r--r--share/doc/smm/01.setup/1.t172
-rw-r--r--share/doc/smm/01.setup/2.t1649
-rw-r--r--share/doc/smm/01.setup/3.t1996
-rw-r--r--share/doc/smm/01.setup/4.t684
-rw-r--r--share/doc/smm/01.setup/5.t558
-rw-r--r--share/doc/smm/01.setup/6.t663
-rw-r--r--share/doc/smm/01.setup/Makefile9
-rw-r--r--share/doc/smm/01.setup/spell.ok618
-rw-r--r--share/doc/smm/01.setup/stubs6
-rw-r--r--share/doc/smm/02.config/0.t88
-rw-r--r--share/doc/smm/02.config/1.t61
-rw-r--r--share/doc/smm/02.config/2.t188
-rw-r--r--share/doc/smm/02.config/3.t299
-rw-r--r--share/doc/smm/02.config/4.t442
-rw-r--r--share/doc/smm/02.config/5.t271
-rw-r--r--share/doc/smm/02.config/6.t233
-rw-r--r--share/doc/smm/02.config/Makefile9
-rw-r--r--share/doc/smm/02.config/a.t162
-rw-r--r--share/doc/smm/02.config/b.t137
-rw-r--r--share/doc/smm/02.config/c.t109
-rw-r--r--share/doc/smm/02.config/d.t272
-rw-r--r--share/doc/smm/02.config/e.t114
-rw-r--r--share/doc/smm/02.config/spell.ok305
-rw-r--r--share/doc/smm/03.fsck/0.t147
-rw-r--r--share/doc/smm/03.fsck/1.t80
-rw-r--r--share/doc/smm/03.fsck/2.t262
-rw-r--r--share/doc/smm/03.fsck/3.t449
-rw-r--r--share/doc/smm/03.fsck/4.t1421
-rw-r--r--share/doc/smm/03.fsck/Makefile8
-rw-r--r--share/doc/smm/04.quotas/Makefile8
-rw-r--r--share/doc/smm/04.quotas/quotas.ms318
-rw-r--r--share/doc/smm/05.fastfs/0.t159
-rw-r--r--share/doc/smm/05.fastfs/1.t112
-rw-r--r--share/doc/smm/05.fastfs/2.t143
-rw-r--r--share/doc/smm/05.fastfs/3.t598
-rw-r--r--share/doc/smm/05.fastfs/4.t252
-rw-r--r--share/doc/smm/05.fastfs/5.t293
-rw-r--r--share/doc/smm/05.fastfs/6.t159
-rw-r--r--share/doc/smm/05.fastfs/Makefile10
-rw-r--r--share/doc/smm/06.nfs/0.t75
-rw-r--r--share/doc/smm/06.nfs/1.t555
-rw-r--r--share/doc/smm/06.nfs/2.t532
-rw-r--r--share/doc/smm/06.nfs/Makefile8
-rw-r--r--share/doc/smm/06.nfs/ref.t123
-rw-r--r--share/doc/smm/07.lpd/0.t68
-rw-r--r--share/doc/smm/07.lpd/1.t77
-rw-r--r--share/doc/smm/07.lpd/2.t141
-rw-r--r--share/doc/smm/07.lpd/3.t73
-rw-r--r--share/doc/smm/07.lpd/4.t206
-rw-r--r--share/doc/smm/07.lpd/5.t116
-rw-r--r--share/doc/smm/07.lpd/6.t94
-rw-r--r--share/doc/smm/07.lpd/7.t226
-rw-r--r--share/doc/smm/07.lpd/Makefile9
-rw-r--r--share/doc/smm/07.lpd/spell.ok70
-rw-r--r--share/doc/smm/08.sendmailop/Makefile11
-rw-r--r--share/doc/smm/11.timedop/Makefile8
-rw-r--r--share/doc/smm/11.timedop/timed.ms279
-rw-r--r--share/doc/smm/12.timed/Makefile11
-rw-r--r--share/doc/smm/12.timed/date53
-rw-r--r--share/doc/smm/12.timed/loop54
-rw-r--r--share/doc/smm/12.timed/spell.ok34
-rw-r--r--share/doc/smm/12.timed/time53
-rw-r--r--share/doc/smm/12.timed/timed.ms462
-rw-r--r--share/doc/smm/12.timed/unused53
-rw-r--r--share/doc/smm/18.net/0.t184
-rw-r--r--share/doc/smm/18.net/1.t66
-rw-r--r--share/doc/smm/18.net/2.t85
-rw-r--r--share/doc/smm/18.net/3.t59
-rw-r--r--share/doc/smm/18.net/4.t67
-rw-r--r--share/doc/smm/18.net/5.t184
-rw-r--r--share/doc/smm/18.net/6.t664
-rw-r--r--share/doc/smm/18.net/7.t258
-rw-r--r--share/doc/smm/18.net/8.t166
-rw-r--r--share/doc/smm/18.net/9.t124
-rw-r--r--share/doc/smm/18.net/Makefile8
-rw-r--r--share/doc/smm/18.net/a.t219
-rw-r--r--share/doc/smm/18.net/b.t145
-rw-r--r--share/doc/smm/18.net/c.t151
-rw-r--r--share/doc/smm/18.net/d.t73
-rw-r--r--share/doc/smm/18.net/e.t129
-rw-r--r--share/doc/smm/18.net/f.t117
-rw-r--r--share/doc/smm/18.net/spell.ok307
-rw-r--r--share/doc/smm/Makefile31
-rw-r--r--share/doc/smm/contents/Makefile8
-rw-r--r--share/doc/smm/contents/contents.ms195
-rw-r--r--share/doc/smm/title/Makefile7
-rw-r--r--share/doc/smm/title/Title146
-rw-r--r--share/doc/usd/04.csh/Makefile9
-rw-r--r--share/doc/usd/04.csh/csh.11012
-rw-r--r--share/doc/usd/04.csh/csh.21304
-rw-r--r--share/doc/usd/04.csh/csh.3649
-rw-r--r--share/doc/usd/04.csh/csh.4176
-rw-r--r--share/doc/usd/04.csh/csh.a93
-rw-r--r--share/doc/usd/04.csh/csh.g1719
-rw-r--r--share/doc/usd/04.csh/tabs32
-rw-r--r--share/doc/usd/05.dc/Makefile8
-rw-r--r--share/doc/usd/05.dc/dc753
-rw-r--r--share/doc/usd/06.bc/Makefile8
-rw-r--r--share/doc/usd/06.bc/bc1241
-rw-r--r--share/doc/usd/07.mail/Makefile10
-rw-r--r--share/doc/usd/07.mail/mail0.nr72
-rw-r--r--share/doc/usd/07.mail/mail1.nr92
-rw-r--r--share/doc/usd/07.mail/mail2.nr617
-rw-r--r--share/doc/usd/07.mail/mail3.nr133
-rw-r--r--share/doc/usd/07.mail/mail4.nr437
-rw-r--r--share/doc/usd/07.mail/mail5.nr1042
-rw-r--r--share/doc/usd/07.mail/mail6.nr125
-rw-r--r--share/doc/usd/07.mail/mail7.nr107
-rw-r--r--share/doc/usd/07.mail/mail8.nr75
-rw-r--r--share/doc/usd/07.mail/mail9.nr203
-rw-r--r--share/doc/usd/07.mail/maila.nr33
-rw-r--r--share/doc/usd/10.exref/Makefile6
-rw-r--r--share/doc/usd/10.exref/Makefile.inc5
-rw-r--r--share/doc/usd/10.exref/exref/Makefile5
-rw-r--r--share/doc/usd/10.exref/summary/Makefile7
-rw-r--r--share/doc/usd/11.vitut/Makefile18
-rw-r--r--share/doc/usd/12.vi/Makefile6
-rw-r--r--share/doc/usd/12.vi/Makefile.inc5
-rw-r--r--share/doc/usd/12.vi/summary/Makefile7
-rw-r--r--share/doc/usd/12.vi/vi/Makefile6
-rw-r--r--share/doc/usd/12.vi/viapwh/Makefile6
-rw-r--r--share/doc/usd/13.viref/Makefile35
-rw-r--r--share/doc/usd/18.msdiffs/Makefile8
-rw-r--r--share/doc/usd/18.msdiffs/ms.diffs288
-rw-r--r--share/doc/usd/19.memacros/Makefile18
-rw-r--r--share/doc/usd/20.meref/Makefile18
-rw-r--r--share/doc/usd/21.troff/Makefile8
-rw-r--r--share/doc/usd/21.troff/m.mac288
-rw-r--r--share/doc/usd/21.troff/m0290
-rw-r--r--share/doc/usd/21.troff/m0a607
-rw-r--r--share/doc/usd/21.troff/m1746
-rw-r--r--share/doc/usd/21.troff/m2400
-rw-r--r--share/doc/usd/21.troff/m3521
-rw-r--r--share/doc/usd/21.troff/m4416
-rw-r--r--share/doc/usd/21.troff/m5462
-rw-r--r--share/doc/usd/21.troff/table1129
-rw-r--r--share/doc/usd/21.troff/table2253
-rw-r--r--share/doc/usd/22.trofftut/Makefile45
-rw-r--r--share/doc/usd/22.trofftut/tt.mac111
-rw-r--r--share/doc/usd/22.trofftut/tt00122
-rw-r--r--share/doc/usd/22.trofftut/tt01223
-rw-r--r--share/doc/usd/22.trofftut/tt02244
-rw-r--r--share/doc/usd/22.trofftut/tt03240
-rw-r--r--share/doc/usd/22.trofftut/tt04189
-rw-r--r--share/doc/usd/22.trofftut/tt05130
-rw-r--r--share/doc/usd/22.trofftut/tt06351
-rw-r--r--share/doc/usd/22.trofftut/tt07124
-rw-r--r--share/doc/usd/22.trofftut/tt08199
-rw-r--r--share/doc/usd/22.trofftut/tt09322
-rw-r--r--share/doc/usd/22.trofftut/tt10256
-rw-r--r--share/doc/usd/22.trofftut/tt11233
-rw-r--r--share/doc/usd/22.trofftut/tt12164
-rw-r--r--share/doc/usd/22.trofftut/tt1399
-rw-r--r--share/doc/usd/22.trofftut/tt14155
-rw-r--r--share/doc/usd/22.trofftut/ttack100
-rw-r--r--share/doc/usd/22.trofftut/ttcharset135
-rw-r--r--share/doc/usd/22.trofftut/ttindex200
-rw-r--r--share/doc/usd/Makefile23
-rw-r--r--share/doc/usd/contents/Makefile8
-rw-r--r--share/doc/usd/contents/contents.ms312
-rw-r--r--share/doc/usd/title/Makefile7
-rw-r--r--share/doc/usd/title/Title121
406 files changed, 108488 insertions, 0 deletions
diff --git a/share/doc/IPv6/IMPLEMENTATION b/share/doc/IPv6/IMPLEMENTATION
new file mode 100644
index 000000000000..95cff2c000ce
--- /dev/null
+++ b/share/doc/IPv6/IMPLEMENTATION
@@ -0,0 +1,2392 @@
+ Implementation Note
+
+ KAME Project
+ http://www.kame.net/
+ $KAME: IMPLEMENTATION,v 1.216 2001/05/25 07:43:01 jinmei Exp $
+ $FreeBSD$
+
+NOTE: This document tries to describe the behavior and implementation choices
+of the latest KAME/*BSD stack. The description here may not be
+applicable to KAME-integrated *BSD releases, as there is a certain amount
+of change between them. Still, some of the content can be useful for
+KAME-integrated *BSD releases.
+
+Table of Contents
+
+ 1. IPv6
+ 1.1 Conformance
+ 1.2 Neighbor Discovery
+ 1.3 Scope Zone Index
+ 1.3.1 Kernel internal
+ 1.3.2 Interaction with API
+ 1.3.3 Interaction with users (command line)
+ 1.4 Plug and Play
+ 1.4.1 Assignment of link-local, and special addresses
+ 1.4.2 Stateless address autoconfiguration on hosts
+ 1.4.3 DHCPv6
+ 1.5 Generic tunnel interface
+ 1.6 Address Selection
+ 1.6.1 Source Address Selection
+ 1.6.2 Destination Address Ordering
+ 1.7 Jumbo Payload
+ 1.8 Loop prevention in header processing
+ 1.9 ICMPv6
+ 1.10 Applications
+ 1.11 Kernel Internals
+ 1.12 IPv4 mapped address and IPv6 wildcard socket
+ 1.12.1 KAME/BSDI3 and KAME/FreeBSD228
+ 1.12.2 KAME/FreeBSD[34]x
+ 1.12.2.1 KAME/FreeBSD[34]x, listening side
+ 1.12.2.2 KAME/FreeBSD[34]x, initiating side
+ 1.12.3 KAME/NetBSD
+ 1.12.3.1 KAME/NetBSD, listening side
+ 1.12.3.2 KAME/NetBSD, initiating side
+ 1.12.4 KAME/BSDI4
+ 1.12.4.1 KAME/BSDI4, listening side
+ 1.12.4.2 KAME/BSDI4, initiating side
+ 1.12.5 KAME/OpenBSD
+ 1.12.5.1 KAME/OpenBSD, listening side
+ 1.12.5.2 KAME/OpenBSD, initiating side
+ 1.12.6 More issues
+ 1.12.7 Interaction with SIIT translator
+ 1.13 sockaddr_storage
+ 1.14 Invalid addresses on the wire
+ 1.15 Node's required addresses
+ 1.15.1 Host case
+ 1.15.2 Router case
+ 1.16 Advanced API
+ 1.17 DNS resolver
+ 2. Network Drivers
+ 2.1 FreeBSD 2.2.x-RELEASE
+ 2.2 BSD/OS 3.x
+ 2.3 NetBSD
+ 2.4 FreeBSD 3.x-RELEASE
+ 2.5 FreeBSD 4.x-RELEASE
+ 2.6 OpenBSD 2.x
+ 2.7 BSD/OS 4.x
+ 3. Translator
+ 3.1 FAITH TCP relay translator
+ 3.2 IPv6-to-IPv4 header translator
+ 4. IPsec
+ 4.1 Policy Management
+ 4.2 Key Management
+ 4.3 AH and ESP handling
+ 4.4 IPComp handling
+ 4.5 Conformance to RFCs and IDs
+ 4.6 ECN consideration on IPsec tunnels
+ 4.7 Interoperability
+ 4.8 Operations with IPsec tunnel mode
+ 4.8.1 RFC2401 IPsec tunnel mode approach
+ 4.8.2 draft-touch-ipsec-vpn approach
+ 5. ALTQ
+ 6. Mobile IPv6
+ 6.1 KAME node as correspondent node
+ 6.2 KAME node as home agent/mobile node
+ 6.3 Old Mobile IPv6 code
+ 7. Coding style
+ 8. Policy on technology with intellectual property right restriction
+
+1. IPv6
+
+1.1 Conformance
+
+The KAME kit conforms, or tries to conform, to the latest set of IPv6
+specifications. For future reference we list some of the relevant documents
+below (NOTE: this is not a complete list - this is too hard to maintain...).
+For details please refer to the specific chapter in this document, the RFCs,
+the manpages that come with KAME, or comments in the source code.
+
+Conformance tests have been performed on past and latest KAME STABLE kits
+by the TAHI project. Results can be viewed at http://www.tahi.org/report/KAME/.
+We also attended Univ. of New Hampshire IOL tests (http://www.iol.unh.edu/)
+in the past, with our past snapshots.
+
+RFC1639: FTP Operation Over Big Address Records (FOOBAR)
+ * RFC2428 is preferred over RFC1639. ftp clients will first try RFC2428,
+ then RFC1639 if that fails.
+RFC1886: DNS Extensions to support IPv6
+RFC1933: (see RFC2893)
+RFC1981: Path MTU Discovery for IPv6
+RFC2080: RIPng for IPv6
+ * KAME-supplied route6d, bgpd and hroute6d support this.
+RFC2283: Multiprotocol Extensions for BGP-4
+ * so-called "BGP4+".
+ * KAME-supplied bgpd supports this.
+RFC2292: Advanced Sockets API for IPv6
+ * see RFC3542
+RFC2362: Protocol Independent Multicast-Sparse Mode (PIM-SM)
+ * RFC2362 defines the packet formats and the protocol of PIM-SM.
+RFC2373: IPv6 Addressing Architecture
+ * KAME supports node required addresses, and conforms to the scope
+ requirement.
+RFC2374: An IPv6 Aggregatable Global Unicast Address Format
+ * KAME supports 64-bit length of Interface ID.
+RFC2375: IPv6 Multicast Address Assignments
+ * Userland applications use the well-known addresses assigned in the RFC.
+RFC2428: FTP Extensions for IPv6 and NATs
+ * RFC2428 is preferred over RFC1639. ftp clients will first try RFC2428,
+ then RFC1639 if that fails.
+RFC2460: IPv6 specification
+RFC2461: Neighbor discovery for IPv6
+ * See 1.2 in this document for details.
+RFC2462: IPv6 Stateless Address Autoconfiguration
+ * See 1.4 in this document for details.
+RFC2463: ICMPv6 for IPv6 specification
+ * See 1.9 in this document for details.
+RFC2464: Transmission of IPv6 Packets over Ethernet Networks
+RFC2465: MIB for IPv6: Textual Conventions and General Group
+ * Necessary statistics are gathered by the kernel. Actual IPv6 MIB
+ support is provided as patchkit for ucd-snmp.
+RFC2466: MIB for IPv6: ICMPv6 group
+ * Necessary statistics are gathered by the kernel. Actual IPv6 MIB
+ support is provided as patchkit for ucd-snmp.
+RFC2467: Transmission of IPv6 Packets over FDDI Networks
+RFC2472: IPv6 over PPP
+RFC2492: IPv6 over ATM Networks
+ * only PVC is supported.
+RFC2497: Transmission of IPv6 packet over ARCnet Networks
+RFC2545: Use of BGP-4 Multiprotocol Extensions for IPv6 Inter-Domain Routing
+RFC2553: (see RFC3493)
+RFC2671: Extension Mechanisms for DNS (EDNS0)
+ * see USAGE for how to use it.
+ * not supported on kame/freebsd4 and kame/bsdi4.
+RFC2673: Binary Labels in the Domain Name System
+ * KAME/bsdi4 supports A6, DNAME and binary label to some extent.
+ * KAME apps/bind8 repository has resolver library with partial A6, DNAME
+ and binary label support.
+RFC2675: IPv6 Jumbograms
+ * See 1.7 in this document for details.
+RFC2710: Multicast Listener Discovery for IPv6
+RFC2711: IPv6 router alert option
+RFC2732: Format for Literal IPv6 Addresses in URL's
+ * The spec is implemented in programs that handle URLs
+ (like freebsd ftpio(3) and fetch(1), or netbsd ftp(1))
+RFC2874: DNS Extensions to Support IPv6 Address Aggregation and Renumbering
+ * KAME/bsdi4 supports A6, DNAME and binary label to some extent.
+ * KAME apps/bind8 repository has resolver library with partial A6, DNAME
+ and binary label support.
+RFC2893: Transition Mechanisms for IPv6 Hosts and Routers
+ * IPv4 compatible address is not supported.
+ * automatic tunneling (4.3) is not supported.
+ * "gif" interface implements IPv[46]-over-IPv[46] tunnel in a generic way,
+ and it covers "configured tunnel" described in the spec.
+ See 1.5 in this document for details.
+RFC2894: Router renumbering for IPv6
+RFC3041: Privacy Extensions for Stateless Address Autoconfiguration in IPv6
+RFC3056: Connection of IPv6 Domains via IPv4 Clouds
+ * So-called "6to4".
+ * "stf" interface implements it. Be sure to read
+ draft-itojun-ipv6-transition-abuse-01.txt
+ below before configuring it, as there can be security issues.
+RFC3142: An IPv6-to-IPv4 transport relay translator
+ * FAITH tcp relay translator (faithd) implements this. See 3.1 for more
+ details.
+RFC3152: Delegation of IP6.ARPA
+ * libinet6 resolvers contained in the KAME snaps support the use of
+ the ip6.arpa domain (with the nibble format) for IPv6 reverse
+ lookups.
+RFC3484: Default Address Selection for IPv6
+ * the selection algorithm for both source and destination addresses
+ is implemented based on the RFC, though some rules are still omitted.
+RFC3493: Basic Socket Interface Extensions for IPv6
+ * IPv4 mapped address (3.7) and special behavior of IPv6 wildcard bind
+ socket (3.8) are,
+ - supported and turned on by default on KAME/FreeBSD[34]
+ and KAME/BSDI4,
+ - supported but turned off by default on KAME/NetBSD and KAME/FreeBSD5,
+ - not supported on KAME/FreeBSD228, KAME/OpenBSD and KAME/BSDI3.
+ see 1.12 in this document for details.
+ * The AI_ALL and AI_V4MAPPED flags are not supported.
+RFC3542: Advanced Sockets API for IPv6 (revised)
+ * For supported library functions/kernel APIs, see sys/netinet6/ADVAPI.
+ * Some of the updates in the draft are not implemented yet. See
+ TODO.2292bis for more details.
+RFC4007: IPv6 Scoped Address Architecture
+ * some part of the documentation (especially about the routing
+ model) is not supported yet.
+ * zone indices that contain scope types have not been supported yet.
+
+draft-ietf-ipngwg-icmp-name-lookups-09: IPv6 Name Lookups Through ICMP
+draft-ietf-ipv6-router-selection-07.txt:
+ Default Router Preferences and More-Specific Routes
+ * router-side: both router preference and specific routes are supported.
+ * host-side: only router preference is supported.
+draft-ietf-pim-sm-v2-new-02.txt
+ A revised version of RFC2362, which includes the IPv6 specific
+ packet format and protocol descriptions.
+draft-ietf-dnsext-mdns-00.txt: Multicast DNS
+ * kame/mdnsd has a test implementation, which is not built in the
+ default compilation. The draft will experience a major change in the
+ near future, so don't rely upon it.
+draft-ietf-ipngwg-icmp-v3-02.txt: ICMPv6 for IPv6 specification (revised)
+ * See 1.9 in this document for details.
+draft-itojun-ipv6-tcp-to-anycast-01.txt:
+ Disconnecting TCP connection toward IPv6 anycast address
+draft-ietf-ipv6-rfc2462bis-06.txt: IPv6 Stateless Address
+ Autoconfiguration (revised)
+draft-itojun-ipv6-transition-abuse-01.txt:
+ Possible abuse against IPv6 transition technologies (expired)
+ * KAME does not implement RFC1933/2893 automatic tunnel.
+ * "stf" interface implements some address filters. Refer to stf(4)
+ for details. Since there's no way to make 6to4 interface 100% secure,
+ we do not include "stf" interface into GENERIC.v6 compilation.
+ * kame/openbsd completely disables IPv4 mapped address support.
+ * kame/netbsd makes IPv4 mapped address support off by default.
+ * See section 1.12.6 and 1.14 for more details.
+draft-itojun-ipv6-flowlabel-api-01.txt: Socket API for IPv6 flow label field
+ * no consideration is made against the use of routing headers and such.
+
+1.2 Neighbor Discovery
+
+Our implementation of Neighbor Discovery is fairly stable. Currently
+Address Resolution, Duplicated Address Detection, and Neighbor
+Unreachability Detection are supported. In the near future we will be
+adding an Unsolicited Neighbor Advertisement transmission command as
+an administration tool.
+
+Duplicated Address Detection (DAD) will be performed when an IPv6 address
+is assigned to a network interface, or the network interface is enabled
+(ifconfig up). It is documented in RFC2462 5.4.
+If DAD fails, the address will be marked "duplicated" and a message will be
+logged to syslog (and usually to the console). The "duplicated" mark
+can be checked with ifconfig. It is the administrator's responsibility to
+check for and recover from DAD failures. We may try to improve failure
+recovery in future KAME code.
+
+A successor version of RFC2462 (called rfc2462bis) clarifies the
+behavior when DAD fails (i.e., duplicate is detected): if the
+duplicate address is a link-local address formed from an interface
+identifier based on the hardware address which is supposed to be
+uniquely assigned (e.g., EUI-64 for an Ethernet interface), IPv6
+operation on the interface should be disabled. The KAME
+implementation supports this as follows: if this type of duplicate is
+detected, the kernel marks "disabled" in the ND specific data
+structure for the interface. Every IPv6 I/O operation in the kernel
+checks this mark, and the kernel will drop packets received on or
+being sent to the "disabled" interface. Whether the IPv6 operation is
+disabled or not can be confirmed by the ndp(8) command. See the man
+page for more details.
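The link-local address discussed above is, on Ethernet, derived from the
interface's hardware address. The following is a minimal sketch of that
derivation (the modified EUI-64 format from the IPv6 addressing architecture),
assuming a 48-bit MAC address. The function name mac_to_linklocal is
hypothetical and is not a KAME kernel symbol.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not a KAME symbol): build the link-local address
 * fe80::/64 + modified EUI-64 interface identifier from a 48-bit MAC.
 */
static void
mac_to_linklocal(const uint8_t mac[6], uint8_t addr[16])
{
	assert(mac != NULL && addr != NULL);

	memset(addr, 0, 16);
	addr[0] = 0xfe;			/* fe80::/64 link-local prefix */
	addr[1] = 0x80;

	addr[8] = mac[0] ^ 0x02;	/* invert the universal/local bit */
	addr[9] = mac[1];
	addr[10] = mac[2];
	addr[11] = 0xff;		/* ff:fe is inserted in the middle */
	addr[12] = 0xfe;
	addr[13] = mac[3];
	addr[14] = mac[4];
	addr[15] = mac[5];
}
```

With the MAC address 00:01:02:03:04:05 this yields fe80::201:2ff:fe03:405.
Two nodes arriving at the same address this way most likely share a hardware
address, which is presumably why rfc2462bis treats this kind of duplicate as
grounds for disabling IPv6 on the interface.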
+
+The DAD procedure may not be effective on certain network interfaces/drivers.
+If a network driver needs a long initialization time (this is common with
+wireless network interfaces), and the driver mistakenly raises
+IFF_RUNNING before it becomes ready, the DAD code will try to transmit
+DAD probes to a driver that is not yet ready, and the packets will not go
+out from the interface. In such cases, network drivers should be corrected.
+
+Some network drivers loop multicast packets back to themselves,
+even if instructed not to do so (especially in promiscuous mode). In
+such cases DAD may fail, because the DAD engine sees an inbound NS packet
+(actually from the node itself) and considers it a sign of a
+duplicate. In this case, drivers should be corrected to honor
+IFF_SIMPLEX behavior. For example, you may need to check source MAC
+address on an inbound packet, and reject it if it is from the node
+itself.
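The source MAC check suggested above can be sketched as follows. This is an
illustration only, assuming a plain Ethernet frame with no VLAN tag; the
function name frame_is_own_echo is hypothetical and does not come from any
particular driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical driver-side filter: returns true if an inbound Ethernet
 * frame carries our own MAC address as its source, i.e. it is a looped
 * back copy of something this node transmitted.
 */
static bool
frame_is_own_echo(const uint8_t *frame, size_t frame_len,
    const uint8_t own_mac[6])
{
	assert(frame != NULL && own_mac != NULL);

	/* Ethernet header: 6 bytes destination, 6 bytes source, 2 bytes type */
	if (frame_len < 14)
		return false;
	return memcmp(frame + 6, own_mac, 6) == 0;
}
```

A driver that cannot suppress the loopback in hardware could drop frames for
which this returns true before handing them to the IPv6 input path, so that
the DAD engine never sees its own NS probe.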
+
+The Neighbor Discovery specification (RFC2461) does not talk about neighbor
+cache handling in the following cases:
+(1) when there was no neighbor cache entry, and the node received an
+ unsolicited RS/NS/NA/redirect packet without a link-layer address
+(2) neighbor cache handling on a medium without link-layer addresses
+ (we need a neighbor cache entry for the IsRouter bit)
+For (1), we implemented a workaround based on discussions on the IETF ipngwg
+mailing list. For more details, see the comments in the source code and the
+email thread starting from (IPng 7155), dated Feb 6 1999.
+
+The IPv6 on-link determination rule (RFC2461) is quite different from the
+assumptions in the BSD IPv4 network code. To implement the behavior in
+RFC2461 section 6.3.6 (3), the kernel needs to know the default
+outgoing interface. To configure the default outgoing interface, use
+commands like "ndp -I de0" as root. Then the kernel will have a
+"default" route to the interface with the cloning "C" bit being on.
+This default route will cause a neighbor cache entry to be created for every
+destination that does not match an explicit route entry.
+
+Note that we intentionally disable configuring the default interface
+by default. This is because we found it sometimes caused inconvenient
+situations while it was rarely useful in practice. For example,
+consider a destination that has both IPv4 and IPv6 addresses but is
+only reachable via IPv4. Since our getaddrinfo(3) prefers IPv6 by
+default, a (TCP) application using the library with PF_UNSPEC first
+tries to connect to the IPv6 address. If we turn on RFC 2461 6.3.6
+(3), we have to wait for quite a long period before the first attempt
+to make a connection fails. If we turn it off, the first attempt will
+immediately fail with EHOSTUNREACH, and then the application can try
+the next, reachable address.
+
+The notion of the default interface is also disabled when the node is
+acting as a router. The reason is that routers tend to control all
+routes stored in the kernel, and an automatically installed default
+route would only confuse them. Note that the spec misuses
+the words "host" and "node" in several places in Section 5.2 of RFC
+2461. We basically read the word "node" in this section as "host,"
+and thus believe the implementation policy does not break the
+specification.
+
+To avoid possible DoS attacks and infinite loops, the KAME stack will accept
+only 10 options on an ND packet. Therefore, if you have 20 prefix options
+attached to an RA, only the first 10 prefixes will be recognized.
+If this troubles you, please contact the KAME team and/or modify
+nd6_maxndopt in sys/netinet6/nd6.c. If there is high demand we may
+provide a sysctl knob for the variable.
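+
+The effect of the option limit can be sketched as follows. This is a
+simplified userland illustration, not the kernel's option parser: the
+function name and MAX_ND_OPTIONS are hypothetical, and only the standard
+ND option layout (type octet, then length octet in units of 8 octets)
+is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ND_OPTIONS 10   /* mirrors the limit of 10 described above */

/*
 * Count the ND options that would be recognized in the option area
 * 'opts' of 'len' bytes.  Options beyond 'maxopt' are ignored.
 * Returns -1 if a malformed option (zero or overlong length) is found
 * before the limit is reached.
 */
static int
count_nd_options(const uint8_t *opts, size_t len, int maxopt)
{
    size_t off = 0;
    int n = 0;

    assert(opts != NULL || len == 0);
    while (off + 2 <= len) {
        size_t optlen;

        if (n == maxopt)
            break;              /* limit reached: ignore the rest */
        optlen = (size_t)opts[off + 1] * 8;
        if (optlen == 0 || optlen > len - off)
            return -1;          /* would loop forever or overrun */
        n++;
        off += optlen;
    }
    return n;
}
```

+Rejecting a zero length is what prevents the infinite loop; the counter
+bounds the work done per packet.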
+
+Proxy Neighbor Advertisement support is implemented in the kernel.
+For instance, you can configure it by using the following command:
+ # ndp -s fe80::1234%ne0 0:1:2:3:4:5 proxy
+where ne0 is the interface attached to the same link as the
+proxy target.
+There are certain limitations, though:
+- It does not send an unsolicited multicast NA on configuration. This is MAY
+ behavior in RFC2461.
+- It does not add a random delay before transmission of a solicited NA. This
+ is SHOULD behavior in RFC2461.
+- We cannot configure proxy NDP for an off-link address. The target address
+ for proxying must be a link-local address, or must be in a prefix
+ configured on the node which does proxy NDP.
+- RFC2461 is unclear about whether it is legal for a host to perform proxy ND.
+ We do not prohibit hosts from doing proxy ND, but it is of very limited
+ use.
+
+Starting mid March 2000, we support Neighbor Unreachability Detection
+(NUD) on p2p interfaces, including tunnel interfaces (gif). NUD is
+turned on by default. Before March 2000 the KAME stack did not
+perform NUD on p2p interfaces. If the change raises any
+interoperability issues, you can turn NUD off or on for each
+interface. Use "ndp -i interface -nud" to turn it off. Consult ndp(8)
+for details.
+
+RFC2461 specifies an upper-layer reachability confirmation hint. Whenever
+such a hint arrives, the ND process can use it to optimize neighbor
+discovery: it can omit the real ND exchange
+and keep the neighbor cache state in REACHABLE.
+We currently have two sources for hints: (1) setsockopt(IPV6_REACHCONF)
+defined by the RFC3542 API, and (2) hints from tcp(6)_input.
+
+It is questionable whether they are really trustworthy. For example, a
+rogue userland program can use IPV6_REACHCONF to confuse the ND
+process. The neighbor cache is a system-wide information pool, and it is
+bad to allow a single process to affect others. Also, tcp(6)_input
+can be fooled by hijack attempts. It is wrong to allow hijack attempts
+to affect the ND process.
+
+Starting June 2000, the ND code has a protection mechanism against
+incorrect upper-layer reachability confirmation. The ND code counts
+consecutive upper-layer hints. If the number of hints reaches the
+maximum, the ND code will ignore further upper-layer hints and run
+the real ND process to confirm reachability to the peer. The sysctl
+net.inet6.icmp6.nd6_maxnudhint defines the maximum number of consecutive
+upper-layer hints to be accepted.
+(From April 2000 to June 2000, we rejected setsockopt(IPV6_REACHCONF) from
+non-root processes. After a local discussion, it appears that hints are not
+that trustworthy even if they are from privileged processes.)
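+
+The counting scheme can be sketched as below. This is an illustration
+under stated assumptions, not the kernel's data structures: struct
+nbr_state and both function names are hypothetical, and resetting the
+counter when a real ND exchange succeeds is our reading of how hints
+are counted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-neighbor state; not the kernel's actual structure. */
struct nbr_state {
    int hints_used;     /* hints accepted since the last real ND */
};

/*
 * Returns true if an upper-layer hint may keep the entry REACHABLE,
 * false if the limit is reached and a real ND exchange must run.
 */
static bool
nud_hint_accept(struct nbr_state *n, int maxnudhint)
{
    assert(n != NULL);
    if (n->hints_used >= maxnudhint)
        return false;
    n->hints_used++;
    return true;
}

/* Assumed to be called when a real ND exchange confirms reachability. */
static void
nud_confirmed_by_nd(struct nbr_state *n)
{
    assert(n != NULL);
    n->hints_used = 0;
}
```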
+
+If inbound ND packets carry invalid values, the KAME kernel will
+drop these packets and increment a statistics variable. See the icmp6
+section of "netstat -sn". For a detailed debugging session, you can
+turn on syslog output from the kernel on errors by turning on the sysctl MIB
+net.inet6.icmp6.nd6_debug. nd6_debug can be turned on at bootstrap
+time by defining the ND6_DEBUG kernel compilation option (so you can
+debug behavior during bootstrap). The nd6_debug configuration should
+only be used for test/debug purposes - for a production environment,
+nd6_debug must be set to 0. If you leave it at 1, malicious parties
+can inject broken packets and fill up the /var/log partition.
+
+1.3 Scope Zone Index
+
+IPv6 uses scoped addresses. It is therefore very important to
+specify the scope zone index (link index for a link-local address, or
+site index for a site-local address) with an IPv6 address. Without a
+zone index, a scoped IPv6 address is ambiguous to the kernel, and
+the kernel would not be able to determine the outbound zone for a
+packet to the scoped address. KAME code tries to address the issue in
+several ways.
+
+The entire architecture of scoped addresses is documented in RFC4007.
+One non-trivial point of the architecture is that the link scope is
+(theoretically) larger than the interface scope. That is, two
+different interfaces can belong to the same link. However, in
+normal operation, we can assume that there is a 1-to-1 relationship
+between links and interfaces. In other words, we can usually put
+links and interfaces in the same scope type. The current KAME
+implementation assumes the 1-to-1 relationship. In particular, we use
+interface names such as "ne1" as unique link identifiers. This is
+much more human-readable and intuitive than numeric identifiers,
+but please keep in mind the theoretical difference between links
+and interfaces.
+
+Site-local addresses are very vaguely defined in the specs, and both
+the specification and the KAME code need many improvements to
+enable their actual use. For example, it is still very unclear how we
+define a site, or how we resolve host names in a site. There is work
+underway to define the behavior of routers at a site border, but we have
+almost no code for site boundary node support (neither forwarding nor
+routing) and we bet almost no one has. For now, we recommend
+that you use global addresses for experiments - there are way too many
+pitfalls if you use site-local addresses.
+
+1.3.1 Kernel internal
+
+In the kernel, the link index for a link-local scope address is
+embedded into the 2nd 16bit-word (the 3rd and 4th bytes) in the IPv6
+address.
+For example, you may see something like:
+ fe80:1::200:f8ff:fe01:6317
+in the routing table and the interface address structure (struct
+in6_ifaddr). The address above is a link-local unicast address which
+belongs to a network link whose link identifier is 1 (note that it
+equals the interface index by the assumption of our
+implementation). The embedded index enables us to identify IPv6
+link-local addresses over multiple links effectively and with only a
+little code change.
+
+The use of the internal format must be limited inside the kernel. In
+particular, addresses sent by an application should not contain the
+embedded index (except via some very special APIs such as routing
+sockets). Instead, the index should be specified in the sin6_scope_id
+field of a sockaddr_in6 structure. Obviously, packets sent to or
+received from the wire must not contain the embedded index either, since the
+index is meaningful only within the sending/receiving node.
+
+In order to deal with the differences, several kernel routines are
+provided. These are available by including <netinet6/scope_var.h>.
+Typically, the following functions will be most generally used:
+
+- int sa6_embedscope(struct sockaddr_in6 *sa6, int defaultok);
+ Embed sa6->sin6_scope_id into sa6->sin6_addr. If sin6_scope_id is
+ 0, defaultok is non-0, and the default zone ID (see RFC4007) is
+ configured, the default ID will be used instead of the value of the
+ sin6_scope_id field. On success, sa6->sin6_scope_id will be reset
+ to 0.
+
+ This function returns 0 on success, or a non-0 error code otherwise.
+
+- int sa6_recoverscope(struct sockaddr_in6 *sa6);
+ Extract embedded zone ID in sa6->sin6_addr and set
+ sa6->sin6_scope_id to that ID. The embedded ID will be cleared with
+ 0.
+
+ This function returns 0 on success, or a non-0 error code otherwise.
+
+- int in6_clearscope(struct in6_addr *in6);
+ Reset the embedded zone ID in 'in6' to 0. This function never fails, and
+ returns 0 if the original address is intact or non 0 if the address is
+ modified. The return value doesn't matter in most cases; currently, the
+ only point where we care about the return value is ip6_input() for checking
+ whether the source or destination addresses of the incoming packet is in
+ the embedded form.
+
+- int in6_setscope(struct in6_addr *in6, struct ifnet *ifp,
+ u_int32_t *zoneidp);
+ Embed zone ID determined by the address scope type for 'in6' and the
+ interface 'ifp' into 'in6'. If zoneidp is non NULL, *zoneidp will
+ also have the zone ID.
+
+ This function returns 0 on success, or a non-0 error code otherwise.
+
+The typical usage of these functions is as follows:
+
+sa6_embedscope() will be used at the socket or transport layer to
+convert a sockaddr_in6 structure passed by an application into the
+kernel-internal form. In this usage, the second argument is often the
+'ip6_use_defzone' global variable.
+
+sa6_recoverscope() will also be used at the socket or transport layer
+to convert an in6_addr structure with the embedded zone ID into a
+sockaddr_in6 structure with the corresponding ID in the sin6_scope_id
+field (and without the embedded ID in sin6_addr).
+
+in6_clearscope() will be used just before sending a packet to the wire
+to remove the embedded ID. In general, this must be done at the last
+stage of an output path, since otherwise the address would lose the ID
+and could be ambiguous with regard to scope.
+
+in6_setscope() will be used when the kernel receives a packet from the
+wire to construct the kernel internal form for each address field in
+the packet (typical examples are the source and destination addresses
+of the packet). In the typical usage, the third argument 'zoneidp'
+will be NULL. A non-NULL value will be used when the validity of the
+zone ID must be checked, e.g., when forwarding a packet to another
+link (see ip6_forward() for this usage).
+
+An application, when sending a packet, is basically assumed to specify
+the appropriate scope zone of the destination address by the
+sin6_scope_id field (this might be done transparently from the
+application with getaddrinfo() and the extended textual format - see
+below), or at least the default scope zone(s) must be configured as a
+last resort. In some cases, however, an application could specify an
+ambiguous address with regard to scope, expecting it is disambiguated
+in the kernel by some other means. A typical usage is to specify the
+outgoing interface through another API, which can disambiguate the
+unspecified scope zone. Such a usage is not recommended, but the
+kernel implements some trick to deal with even this case.
+
+A rough sketch of the trick can be summarized as the following
+sequence.
+
+ sa6_embedscope(dst, ip6_use_defzone);
+ in6_selectsrc(dst, ..., &ifp, ...);
+ in6_setscope(&dst->sin6_addr, ifp, NULL);
+
+sa6_embedscope() first tries to convert sin6_scope_id (or the default
+zone ID) into the kernel-internal form. This can fail with an
+ambiguous destination, but it still tries to get the outgoing
+interface (ifp) in the attempt of determining the source address of
+the outgoing packet using in6_selectsrc(). If the interface is
+detected, and the scope zone was originally ambiguous, in6_setscope()
+can finally determine the appropriate ID with the address itself and
+the interface, and construct the kernel-internal form. See, for
+example, the comments in udp6_output() for a more concrete example.
+
+In any case, kernel routines except ones in netinet6/scope6.c MUST NOT
+directly refer to the embedded form. They MUST use the above
+interface functions. In particular, kernel routines MUST NOT have the
+following code fragment:
+
+ /* This is a bad practice. Don't do this */
+ if (IN6_IS_ADDR_LINKLOCAL(&sin6->sin6_addr))
+ sin6->sin6_addr.s6_addr16[1] = htons(ifp->if_index);
+
+This is bad for several reasons. First, address ambiguity is not
+specific to link-local addresses (any non-global multicast addresses
+are inherently ambiguous, and this is particularly true for
+interface-local addresses). Secondly, this is vulnerable to future
+changes of the embedded form (the embedded position may change, or the
+zone ID may not actually be the interface index). Only scope6.c
+routines should know the details.
+
+The above code fragment should thus actually be as follows:
+
+ /* This is correct. */
+ in6_setscope(&sin6->sin6_addr, ifp, NULL);
+ (and catch errors if possible and necessary)
+
+1.3.2 Interaction with API
+
+There are several candidate APIs for dealing with scoped addresses
+without ambiguity.
+
+The IPV6_PKTINFO ancillary data type or socket option defined in the
+advanced API (RFC2292 or RFC3542) can specify
+the outgoing interface of a packet. Similarly, the IPV6_PKTINFO or
+IPV6_RECVPKTINFO socket options tell the kernel to pass the incoming
+interface to user applications.
+
+These options are enough to disambiguate scoped addresses of an
+incoming packet, because we can uniquely identify the corresponding
+zone of the scoped address(es) by the incoming interface. However,
+they are too strong for outgoing packets. For example, consider a
+multi-sited node and suppose that more than one interface of the node
+belongs to the same site. When we want to send a packet to the site,
+we can only specify one of the interfaces for the outgoing packet with
+these options; we cannot just say "send the packet to (one of the
+interfaces of) the site."
+
+Another candidate is the sin6_scope_id member in the
+sockaddr_in6 structure, defined in RFC2553. The KAME kernel
+interprets the sin6_scope_id field properly in order to disambiguate scoped
+addresses. For example, if an application passes a sockaddr_in6
+structure that has a non-zero sin6_scope_id value to the sendto(2)
+system call, the kernel should send the packet to the appropriate zone
+according to the sin6_scope_id field. Similarly, when the source or
+the destination address of an incoming packet is a scoped one, the
+kernel should detect the correct zone identifier based on the address
+and the receiving interface, fill the identifier in the sin6_scope_id
+field of a sockaddr_in6 structure, and then pass the packet to an
+application via the recvfrom(2) system call, etc.
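+
+From the application side, filling in sin6_scope_id could look like the
+sketch below. make_scoped_dst() is a hypothetical helper written for this
+example; it assumes the 1-to-1 link/interface relationship described
+earlier, so that the interface index returned by if_nametoindex(3) can
+serve as the link zone ID.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <net/if.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Hypothetical helper: build a destination for a link-local address
 * on the link identified by interface 'ifname'.
 * Returns 0 on success, -1 on a bad address or unknown interface.
 */
static int
make_scoped_dst(const char *addr, const char *ifname, unsigned short port,
    struct sockaddr_in6 *dst)
{
    unsigned int zone;

    assert(addr != NULL && ifname != NULL && dst != NULL);
    memset(dst, 0, sizeof(*dst));
#ifdef SIN6_LEN
    dst->sin6_len = sizeof(*dst);   /* 4.4BSD-derived systems */
#endif
    dst->sin6_family = AF_INET6;
    dst->sin6_port = htons(port);
    if (inet_pton(AF_INET6, addr, &dst->sin6_addr) != 1)
        return -1;
    zone = if_nametoindex(ifname);
    if (zone == 0)
        return -1;                  /* no such interface */
    dst->sin6_scope_id = zone;      /* never embed it in sin6_addr */
    return 0;
}
```

+The resulting structure can then be passed to sendto(2) as
+(struct sockaddr *)&dst with sizeof(dst) as the address length.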
+
+However, the semantics of sin6_scope_id are still vague and on the
+way to standardization. Additionally, not many operating systems
+support the behavior above at this moment.
+
+In summary,
+- If your target system is limited to KAME based ones (i.e. BSD
+ variants and KAME snaps), use the sin6_scope_id field assuming the
+ kernel behavior described above.
+- Otherwise (i.e. if your program should be portable to systems
+ other than BSDs)
+ + Use the advanced API to disambiguate scoped addresses of incoming
+ packets.
+ + To disambiguate scoped addresses of outgoing packets,
+ * if it is okay to just specify the outgoing interface, use the
+ advanced API. This would be the case, for example, when you
+ should only consider link-local addresses and your system
+ assumes 1-to-1 relationship between links and interfaces.
+ * otherwise, sorry but you lose. Please rush the IETF IPv6
+ community into standardizing the semantics of the sin6_scope_id
+ field.
+
+Routing daemons and configuration programs, like route6d and ifconfig,
+will need to manipulate the "embedded" zone index. These programs use
+routing sockets and ioctls (like SIOCGIFADDR_IN6), and the kernel API
+will return IPv6 addresses with the 2nd 16bit-word filled in. The
+APIs are for manipulating kernel internal structures. Programs that
+use these APIs have to be prepared for differences between kernels
+anyway.
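+
+A program that reads addresses through such kernel-internal interfaces
+could separate the embedded index along the following lines. This is a
+sketch only: split_embedded_zone() is a hypothetical userland helper,
+and it hard-codes the current layout (index in the 3rd and 4th bytes of
+link-local addresses), which is exactly the kind of kernel-specific
+detail such programs must be ready to see change.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Hypothetical helper: move a zone index embedded in a link-local
 * address (unicast or multicast) into sin6_scope_id and clear the
 * embedded bytes.  Addresses without an embedded index are untouched.
 */
static void
split_embedded_zone(struct sockaddr_in6 *sin6)
{
    struct in6_addr *a;
    uint32_t zone;

    assert(sin6 != NULL);
    a = &sin6->sin6_addr;
    if (!IN6_IS_ADDR_LINKLOCAL(a) && !IN6_IS_ADDR_MC_LINKLOCAL(a))
        return;
    zone = ((uint32_t)a->s6_addr[2] << 8) | a->s6_addr[3];
    if (zone == 0)
        return;                 /* nothing embedded */
    sin6->sin6_scope_id = zone;
    a->s6_addr[2] = 0;
    a->s6_addr[3] = 0;
}
```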
+
+getaddrinfo(3) and getnameinfo(3) support an extended numeric IPv6
+syntax, as documented in RFC4007. You can specify the outgoing link,
+by using the name of the outgoing interface as the link, like
+"fe80::1%ne0" (again, note that we assume there is 1-to-1 relationship
+between links and interfaces.) This way you will be able to specify a
+link-local scoped address without much trouble.
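+
+As an illustration of the extended syntax, the following sketch parses
+a string such as "fe80::1%ne0" with getaddrinfo(3) and returns the
+resulting sockaddr_in6, whose sin6_scope_id carries the zone. The
+function name parse_scoped_numeric() is hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <net/if.h>
#include <netinet/in.h>
#include <netdb.h>

/*
 * Hypothetical helper: parse a numeric IPv6 address, optionally with a
 * "%zone" suffix, without any name resolution.
 * Returns 0 on success, -1 on failure.
 */
static int
parse_scoped_numeric(const char *text, struct sockaddr_in6 *out)
{
    struct addrinfo hints, *res;

    assert(text != NULL && out != NULL);
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET6;
    hints.ai_socktype = SOCK_DGRAM;
    hints.ai_flags = AI_NUMERICHOST;    /* no DNS lookup */
    if (getaddrinfo(text, NULL, &hints, &res) != 0)
        return -1;
    if (res->ai_addrlen < sizeof(*out)) {
        freeaddrinfo(res);
        return -1;
    }
    memcpy(out, res->ai_addr, sizeof(*out));
    freeaddrinfo(res);
    return 0;
}
```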
+
+Other APIs like inet_pton(3) and inet_ntop(3) are inherently
+unfriendly to scoped addresses, since they are unable to annotate
+addresses with a zone identifier.
+
+1.3.3 Interaction with users (command line)
+
+Most user applications now support the extended numeric IPv6
+syntax. In this case, you can specify the outgoing link by using the name
+of the outgoing interface, like "fe80::1%ne0" (sorry for the duplicated
+notice, but please recall again that we assume a 1-to-1 relationship
+between links and interfaces). This is even the case for some
+management tools such as route(8) or ndp(8). For example, to install
+the IPv6 default route by hand, you can type
+ # route add -inet6 default fe80::9876:5432:1234:abcd%ne0
+(although we suggest that you run dynamic routing instead of static
+routes, in order to avoid configuration mistakes).
+
+Some applications have command line options for specifying an
+appropriate zone of a scoped address (like "ping6 -I ne0 ff02::1" to
+specify the outgoing interface). However, you can't always expect such
+options. Additionally, specifying the outgoing "interface" is in
+theory an overspecification as a way to specify the outgoing "link"
+(see above). Thus, we recommend that you use the extended format
+described above. This should apply to the case where the outgoing
+interface is specified.
+
+In any case, when you specify a scoped address on the command line,
+NEVER write the embedded form (such as ff02:1::1 or fe80:2::fedc),
+which should only be used inside the kernel (see Section 1.3.1), and
+is not supposed to work.
+
+1.4 Plug and Play
+
+The KAME kit implements most of the IPv6 stateless address
+autoconfiguration in the kernel.
+Neighbor Discovery functions are implemented in the kernel as a whole.
+Router Advertisement (RA) input for hosts is implemented in the
+kernel. Router Solicitation (RS) output for endhosts, RS input
+for routers, and RA output for routers are implemented in the
+userland.
+
+1.4.1 Assignment of link-local, and special addresses
+
+An IPv6 link-local address is generated from the IEEE802 address (ethernet
+MAC address). Each interface is assigned an IPv6 link-local address
+automatically when the interface becomes up (IFF_UP). Also, a direct route
+for the link-local address is added to the routing table.
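+
+For reference, the usual derivation of a link-local address from a
+48-bit MAC address (the modified EUI-64 format used for IPv6 over
+Ethernet) can be sketched as follows. linklocal_from_mac() is a
+hypothetical userland helper, not the kernel routine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Hypothetical helper: build fe80::/64 plus a modified EUI-64
 * interface identifier from a 48-bit IEEE802 address.
 */
static void
linklocal_from_mac(const uint8_t mac[6], struct in6_addr *out)
{
    assert(mac != NULL && out != NULL);
    memset(out, 0, sizeof(*out));
    out->s6_addr[0] = 0xfe;
    out->s6_addr[1] = 0x80;
    out->s6_addr[8] = mac[0] ^ 0x02;    /* flip the universal/local bit */
    out->s6_addr[9] = mac[1];
    out->s6_addr[10] = mac[2];
    out->s6_addr[11] = 0xff;            /* ff:fe is inserted in the middle */
    out->s6_addr[12] = 0xfe;
    out->s6_addr[13] = mac[3];
    out->s6_addr[14] = mac[4];
    out->s6_addr[15] = mac[5];
}
```

+With the MAC address 00:00:f8:01:63:17 this yields
+fe80::200:f8ff:fe01:6317.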
+
+Here is sample output of the netstat command:
+
+Internet6:
+Destination      Gateway   Flags  Netif  Expire
+fe80::%ed0/64    link#1    UC     ed0
+fe80::%ep0/64    link#2    UC     ep0
+
+Interfaces that have no IEEE802 address (pseudo interfaces like tunnel
+interfaces, or ppp interfaces) will borrow an IEEE802 address from other
+interfaces, such as ethernet interfaces, whenever possible.
+If there is no IEEE802 hardware attached, a last-resort pseudorandom value,
+derived from MD5(hostname), will be used as the source of the link-local
+address. If it is not suitable for your usage, you will need to configure the
+link-local address manually.
+
+If an interface is not capable of handling IPv6 (for example, because it lacks
+multicast support), a link-local address will not be assigned to that interface.
+See section 2 for details.
+
+Each interface joins the solicited-node multicast address and the
+link-local all-nodes multicast address (e.g. ff02::1:ff01:6317
+and ff02::1, respectively, on the link the interface is attached to).
+In addition to a link-local address, the loopback address (::1) will be
+assigned to the loopback interface. Also, ::1/128 and ff01::/32 are
+automatically added to the routing table, and the loopback interface joins
+the node-local multicast group ff01::1.
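+
+The solicited-node multicast address is formed by appending the low 24
+bits of the unicast address to the prefix ff02::1:ff00:0/104. A minimal
+sketch (solicited_node_mcast() is a hypothetical helper name):

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Hypothetical helper: compute the solicited-node multicast address
 * ff02::1:ffXX:XXXX for the unicast address 'addr'.
 */
static void
solicited_node_mcast(const struct in6_addr *addr, struct in6_addr *out)
{
    assert(addr != NULL && out != NULL);
    memset(out, 0, sizeof(*out));
    out->s6_addr[0] = 0xff;
    out->s6_addr[1] = 0x02;             /* link-local scope */
    out->s6_addr[11] = 0x01;
    out->s6_addr[12] = 0xff;
    out->s6_addr[13] = addr->s6_addr[13];   /* low 24 bits of 'addr' */
    out->s6_addr[14] = addr->s6_addr[14];
    out->s6_addr[15] = addr->s6_addr[15];
}
```

+For fe80::200:f8ff:fe01:6317 the result is ff02::1:ff01:6317.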
+
+1.4.2 Stateless address autoconfiguration on hosts
+
+In the IPv6 specification, nodes are separated into two categories:
+routers and hosts. Routers forward packets addressed to others; hosts do
+not forward packets. net.inet6.ip6.forwarding defines whether this
+node is a router or a host (router if it is 1, host if it is 0).
+
+It is NOT recommended to change net.inet6.ip6.forwarding while the node
+is in operation. The IPv6 specification defines behavior for "host" and
+"router" quite differently, and switching from one to another can cause
+serious trouble. It is recommended to configure the variable at bootstrap
+time only.
+
+The first step in stateless address configuration is Duplicate Address
+Detection (DAD). See 1.2 for more detail on DAD.
+
+When a host hears a Router Advertisement from a router, it may
+autoconfigure itself by stateless address autoconfiguration. This
+behavior can be controlled by the net.inet6.ip6.accept_rtadv sysctl
+variable and a per-interface flag managed in the kernel. The latter,
+which we call "if_accept_rtadv" here, can be changed by the ndp(8)
+command (see the manpage for more details). When the sysctl variable
+is set to 1, and the flag is set, the host autoconfigures itself. By
+autoconfiguration, network address prefixes for the receiving
+interface (usually global address prefix) are added. The default
+route is also configured.
+
+Routers periodically generate Router Advertisement packets. To
+request an adjacent router to generate an RA packet, a host can transmit
+a Router Solicitation. To generate an RS packet at any time, use the
+"rtsol" command. The "rtsold" daemon is also available. "rtsold"
+generates Router Solicitations whenever necessary, and it works well
+for nomadic usage (notebooks/laptops). If you wish to ignore Router
+Advertisements, use sysctl to set net.inet6.ip6.accept_rtadv to 0.
+Additionally, the ndp(8) command can be used to control the behavior
+on a per-interface basis.
+
+To generate Router Advertisement from a router, use the "rtadvd" daemon.
+
+Note that the IPv6 specification assumes the following items and that
+nonconforming cases are left unspecified:
+- Only hosts will listen to router advertisements
+- Hosts have a single network interface (except loopback)
+It is therefore unwise to enable net.inet6.ip6.accept_rtadv on routers
+or multi-interface hosts. A misconfigured node can behave strangely
+(KAME code allows nonconforming configurations, for those who would like
+to do some experiments).
+
+To summarize the sysctl knobs:
+    accept_rtadv  forwarding  role of the node
+    ------------  ----------  ----------------
+    0             0           host (to be manually configured)
+    0             1           router
+    1             0           autoconfigured host
+                              (the spec assumes that hosts have a
+                              single interface only; autoconfigured
+                              hosts with multiple interfaces are
+                              out of scope)
+    1             1           invalid, or experimental
+                              (out of scope of the spec)
+
+The if_accept_rtadv flag is consulted only when accept_rtadv is 1 (the
+latter two cases). The flag does not have any effect when the sysctl
+variable is 0.
+
+See 1.2 in this document for the relationship between DAD and autoconfiguration.
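+
+The combination of the two sysctl variables and the per-interface flag
+can be restated as a small decision function. This is only a paraphrase
+of the table above in code form; the enum, the function names, and the
+0/1 precondition are ours, not kernel identifiers.

```c
#include <assert.h>
#include <stdbool.h>

enum node_role {
    ROLE_MANUAL_HOST,       /* accept_rtadv=0, forwarding=0 */
    ROLE_ROUTER,            /* accept_rtadv=0, forwarding=1 */
    ROLE_AUTOCONF_HOST,     /* accept_rtadv=1, forwarding=0 */
    ROLE_EXPERIMENTAL       /* accept_rtadv=1, forwarding=1 */
};

/* Classify the node from the two sysctl values (each 0 or 1). */
static enum node_role
classify_node(int accept_rtadv, int forwarding)
{
    assert(accept_rtadv == 0 || accept_rtadv == 1);
    assert(forwarding == 0 || forwarding == 1);
    if (accept_rtadv)
        return forwarding ? ROLE_EXPERIMENTAL : ROLE_AUTOCONF_HOST;
    return forwarding ? ROLE_ROUTER : ROLE_MANUAL_HOST;
}

/* The per-interface flag matters only when the sysctl is 1. */
static bool
ra_is_processed(int accept_rtadv, bool if_accept_rtadv)
{
    assert(accept_rtadv == 0 || accept_rtadv == 1);
    return accept_rtadv == 1 && if_accept_rtadv;
}
```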
+
+1.4.3 DHCPv6
+
+We supply a tiny DHCPv6 server/client in kame/dhcp6. However, the
+implementation is immature (for example, it does NOT implement
+address lease/release), and it is not in the default compilation tree on
+some platforms. If you want to experiment with it, compile it on your
+own.
+
+DHCPv6 and autoconfiguration also need more work. The "Managed" and "Other"
+bits in RA have no special effect on the stateful autoconfiguration procedure
+in the DHCPv6 client program (the "Managed" bit actually prevents stateless
+autoconfiguration, but no special action will be taken for the DHCPv6 client).
+
+1.5 Generic tunnel interface
+
+GIF (Generic InterFace) is a pseudo interface for configured tunnels.
+Details are described in the gif(4) manpage.
+Currently
+ v6 in v6
+ v6 in v4
+ v4 in v6
+ v4 in v4
+are available. Use "gifconfig" to assign physical (outer) source
+and destination addresses to gif interfaces.
+A configuration that uses the same address family for the inner and outer IP
+headers (v4 in v4, or v6 in v6) is dangerous. It is very easy to
+configure interfaces and routing tables to perform an infinite level
+of tunneling. Please be warned.
+
+gif can be configured to be ECN-friendly. See 4.5 for ECN-friendliness
+of tunnels, and the gif(4) manpage for how to configure it.
+
+If you would like to configure an IPv4-in-IPv6 tunnel with a gif interface,
+read gif(4) carefully. You may need to remove the IPv6 link-local address
+automatically assigned to the gif interface.
+
+1.6 Address Selection
+
+1.6.1 Source Address Selection
+
+The KAME kernel chooses the source address for an outgoing packet
+sent from a user application as follows:
+
+1. if the source address is explicitly specified via an IPV6_PKTINFO
+ ancillary data item or the socket option of that name, just use it.
+ Note that this item/option overrides the bound address of the
+ corresponding (datagram) socket.
+
+2. if the corresponding socket is bound, use the bound address.
+
+3. otherwise, the kernel first tries to find the outgoing interface of
+ the packet. If it fails, the source address selection also fails.
+ If the kernel can find an interface, choose the most appropriate
+ address based on the algorithm described in RFC3484.
+
+ The policy table used in this algorithm is stored in the kernel.
+ To install or view the policy, use the ip6addrctl(8) command. The
+ kernel does not have a pre-installed policy. It is expected that the
+ default policy described in RFC3484 will be installed at
+ bootstrap time using this command.
+
+ RFC3484 allows an implementation to add implementation-specific
+ rules with higher precedence than the rule "Use longest matching
+ prefix." KAME's implementation has the following additional rules
+ (which are applied in the order listed):
+
+ - prefer addresses on alive interfaces, that is, interfaces with
+ the UP flag being on. This rule is particularly useful for
+ routers, since some routing daemons stop advertising prefixes
+ (addresses) on interfaces that have become down.
+
+ - prefer addresses on "preferred" interfaces. "Preferred"
+ interfaces can be specified by the ndp(8) command. By default,
+ no interface is preferred, that is, this rule does not apply.
+ Again, this rule is particularly useful for routers, since there
+ is a convention, among router administrators, of assigning
+ "stable" addresses on a particular interface (typically a
+ loopback interface).
+
+ In any case, addresses that break the scope zone of the
+ destination, or addresses whose zone does not contain the outgoing
+ interface, are never chosen.
+
+When the procedure above fails, the kernel usually returns
+EADDRNOTAVAIL to the application.
+
+In some cases, the specification explicitly requires the
+implementation to choose a particular source address. The source
+address for a Neighbor Advertisement (NA) message is an example.
+Under the spec (RFC2461 7.2.2) the NA's source should be the target
+address of the corresponding NS. In this case we follow the
+spec rather than the above rule.
+
+If you would like to prohibit the use of deprecated addresses for some
+reason, set net.inet6.ip6.use_deprecated to 0. The issue
+related to deprecated addresses is described in RFC2462 5.5.4 (NOTE:
+there is some debate underway in IETF ipngwg on how to use
+"deprecated" addresses).
+
+As documented in the source address selection document, temporary
+addresses for privacy extension are less preferred than public addresses
+by default. However, for administrators who are particularly concerned
+about privacy, there is a system-wide sysctl(3) variable
+"net.inet6.ip6.prefer_tempaddr". When the variable is set to
+non-zero, the kernel will prefer temporary addresses instead. The
+default value of this variable is 0.
+
+1.6.2 Destination Address Ordering
+
+KAME's getaddrinfo(3) supports the destination address ordering
+algorithm described in RFC3484. getaddrinfo(3) needs to know the
+source address for each destination address and the policy entries
+(described in the previous section) for the source and destination
+addresses. To get the source address, the library function opens a
+UDP socket and tries to connect(2) to the destination. To get the
+policy entry, the function issues sysctl(3).
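+
+The connect(2) technique can be sketched as follows. This is not the
+library's code: source_for_destination() is a hypothetical name, and
+reading the chosen address back with getsockname(2) is our assumption
+about how the result is obtained. Connecting a UDP socket sends no
+packet; it only makes the kernel select a route and a source address.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/*
 * Hypothetical helper: ask the kernel which source address it would
 * use for 'dst'.  Returns 0 and fills 'src' on success, -1 otherwise
 * (for example, when the destination is unreachable).
 */
static int
source_for_destination(const struct sockaddr_in6 *dst,
    struct sockaddr_in6 *src)
{
    socklen_t len = sizeof(*src);
    int s, ret = -1;

    assert(dst != NULL && src != NULL);
    s = socket(AF_INET6, SOCK_DGRAM, 0);
    if (s < 0)
        return -1;
    memset(src, 0, sizeof(*src));
    if (connect(s, (const struct sockaddr *)dst, sizeof(*dst)) == 0 &&
        getsockname(s, (struct sockaddr *)src, &len) == 0)
        ret = 0;
    close(s);
    return ret;
}
```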
+
+1.7 Jumbo Payload
+
+KAME supports the Jumbo Payload hop-by-hop option used to send IPv6
+packets with payloads longer than 65,535 octets. But since KAME
+currently does not support any physical interface whose MTU is more than
+65,535, such payloads can be seen only on the loopback interface (i.e.
+lo0).
+
+If you want to try jumbo payloads, you first have to reconfigure the
+kernel so that the MTU of the loopback interface is more than 65,535
+bytes; add the following to the kernel configuration file:
+ options "LARGE_LOMTU" #To test jumbo payload
+and recompile the new kernel.
+
+Then you can test jumbo payloads with the ping6 command and its -b and -s
+options. The -b option must be specified to enlarge the size of the
+socket buffer, and the -s option specifies the length of the packet,
+which should be more than 65,535. For example, type the following:
+ % ping6 -b 70000 -s 68000 ::1
+
+The IPv6 specification requires that the Jumbo Payload option must not
+be used in a packet that carries a fragment header. If this condition
+is broken, an ICMPv6 Parameter Problem message must be sent to the
+sender. The KAME kernel follows the specification, but you cannot usually
+see an ICMPv6 error caused by this requirement.
+
+When the KAME kernel receives an IPv6 packet, it checks the frame length of
+the packet and compares it to the length specified in the payload
+length field of the IPv6 header or in the value of the Jumbo Payload
+option, if any. If the former is shorter than the latter, the KAME kernel
+discards the packet and increments the statistics. You can see the
+statistics in the output of the netstat command with the `-s -p ip6' option:
+ % netstat -s -p ip6
+ ip6:
+ (snip)
+ 1 with data size < data length
+
+So, the KAME kernel does not send an ICMPv6 error unless the erroneous
+packet is an actual Jumbo Payload, that is, its packet size is more
+than 65,535 bytes. As described above, the KAME kernel currently does not
+support physical interfaces with such a huge MTU, so it rarely returns an
+ICMPv6 error.
+
+TCP/UDP over jumbograms is not supported at this moment. This is because
+we have no medium (other than loopback) to test this. Contact us if you
+need this.
+
+IPsec does not work on jumbograms. This is due to some specification twists
+in supporting AH with jumbograms (AH header size influences payload length,
+and this makes it really hard to authenticate an inbound packet that carries
+both the jumbo payload option and AH).
+
+There are fundamental issues in *BSD support for jumbograms. We would like to
+address those, but we need more time to finalize the task. To name a few:
+- The mbuf pkthdr.len field is typed as "int" in 4.4BSD, so it cannot hold
+ a jumbogram with len > 2G on 32bit architecture CPUs. If we would like to
+ support jumbograms properly, the field must be expanded to hold 4G +
+ IPv6 header + link-layer header. Therefore, it must be expanded to at least
+ int64_t (u_int32_t is NOT enough).
+- We mistakenly use "int" to hold the packet length in many places. We need
+ to convert them into a larger numeric type. This needs great care, as we may
+ experience overflow during packet length computation.
+- We mistakenly check the ip6_plen field of the IPv6 header for the packet
+ payload length in various places. We should be checking mbuf pkthdr.len
+ instead. ip6_input() will perform a sanity check on the jumbo payload option
+ on input, and we can safely use mbuf pkthdr.len afterwards.
+- The TCP code needs careful updates in a number of places, of course.
+
+1.8 Loop prevention in header processing
+
+The IPv6 specification allows an arbitrary number of extension headers to
+be placed onto packets. If we implemented the IPv6 packet processing
+code in the way the BSD IPv4 code is implemented, the kernel stack might
+overflow due to a long function call chain. The KAME sys/netinet6 code
+is carefully designed to avoid kernel stack overflow. Because of
+this, the KAME sys/netinet6 code defines its own protocol switch
+structure, "struct ip6protosw" (see netinet6/ip6protosw.h).
+
+In addition to this, we restrict the number of extension headers
+(including the IPv6 header) in each incoming packet, in order to
+prevent a DoS attack that tries to send packets with a massive number
+of extension headers. The upper limit can be configured by the sysctl
+value net.inet6.ip6.hdrnestlimit. In particular, if the value is 0,
+the node will allow an arbitrary number of headers. As of writing this
+document, the default value is 50.
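+
+The nesting limit can be pictured with the following userland sketch. It is
+not the kernel code: count_headers() is a hypothetical helper that walks a
+packet stored in one flat buffer, handles only hop-by-hop, routing and
+destination options headers, and counts the IPv6 header plus each extension
+header. As with the sysctl value, a limit of 0 means "no limit".
+

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IP6_HDR_LEN	40

/* Next Header values of the extension headers walked by this sketch. */
enum { NH_HOPOPTS = 0, NH_ROUTING = 43, NH_DSTOPTS = 60 };

static int
is_ext_header(uint8_t nh)
{
	return nh == NH_HOPOPTS || nh == NH_ROUTING || nh == NH_DSTOPTS;
}

/*
 * Returns the number of headers seen (IPv6 header included), or -1 if
 * the chain is truncated or has more than nestlimit headers.
 */
static int
count_headers(const uint8_t *pkt, size_t len, int nestlimit)
{
	size_t off = IP6_HDR_LEN;
	uint8_t nh;
	int nest = 1;			/* the IPv6 header itself */

	assert(pkt != NULL);
	if (len < IP6_HDR_LEN)
		return -1;
	nh = pkt[6];			/* Next Header of the IPv6 header */

	while (is_ext_header(nh)) {
		size_t hlen;

		if (len - off < 8)
			return -1;	/* truncated header */
		nest++;
		if (nestlimit != 0 && nest > nestlimit)
			return -1;	/* too many headers, drop */
		/* Hdr Ext Len counts 8-octet units beyond the first 8. */
		hlen = ((size_t)pkt[off + 1] + 1) * 8;
		if (len - off < hlen)
			return -1;
		nh = pkt[off];
		off += hlen;
	}
	return nest;
}
```
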
+
+IPv4 part (sys/netinet) remains untouched for compatibility.
+Because of this, if you receive IPsec-over-IPv4 packet with massive
+number of IPsec headers, kernel stack may blow up. IPsec-over-IPv6 is okay.
+
+1.9 ICMPv6
+
+After RFC2463 was published, IETF ipngwg decided to disallow ICMPv6 error
+packets against ICMPv6 redirects, to prevent an ICMPv6 storm on a network
+medium. KAME already implements this in the kernel.
+
+RFC2463 requires rate limitation for ICMPv6 error packets generated by a
+node, to avoid possible DoS attacks. KAME kernel implements two rate-
+limitation mechanisms, tunable via sysctl:
+- Minimum time interval between ICMPv6 error packets
+ KAME kernel will generate no more than one ICMPv6 error packet
+ during the configured time interval. net.inet6.icmp6.errratelimit
+ controls the interval (default: disabled).
+- Maximum ICMPv6 error packet-per-second
+ KAME kernel will generate no more than the configured number of
+ packets in one second. net.inet6.icmp6.errppslimit controls the
+ maximum packet-per-second value (default: 200pps)
+Basically, we need to pick values that are suitable against the bandwidth
+of link layer devices directly attached to the node. In some cases the
+default values may not fit well. We are still unsure if the default value
+is sane or not. Comments are welcome.
+
+1.10 Applications
+
+For userland programming, we support IPv6 socket API as specified in
+RFC2553/3493, RFC3542 and upcoming internet drafts.
+
+TCP/UDP over IPv6 is available and quite stable. You can enjoy "telnet",
+"ftp", "rlogin", "rsh", "ssh", etc. These applications are protocol
+independent. That is, they automatically choose IPv4 or IPv6
+according to DNS.
+
+1.11 Kernel Internals
+
+ (*) TCP/UDP part is handled differently between operating system platforms.
+ See 1.12 for details.
+
+The current KAME has escaped from the IPv4 netinet logic. While
+ip_forward() calls ip_output(), ip6_forward() directly calls
+if_output() since routers must not divide IPv6 packets into fragments.
+
+An ICMPv6 error should contain as much of the original packet as
+possible, up to 1280 bytes. UDP6/IP6 port unreach, for instance, should
+contain all extension headers and the *unchanged* UDP6 and IP6 headers.
+So, all IP6 functions except TCP6 never convert network byte
+order into host byte order, to save the original packet.
+
+tcp6_input(), udp6_input() and icmp6_input() can't assume that the IP6
+header immediately precedes the transport header, due to extension
+headers. So, in6_cksum() was implemented to handle packets whose IP6
+header and transport header are not contiguous. Neither a TCP/IP6 nor a
+UDP/IP6 header structure exists for checksum calculation.
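+
+The idea of summing the pseudo-header fields separately from the transport
+header can be sketched in userland as follows. This is not KAME's
+in6_cksum(), which works on mbuf chains: in6_cksum_sketch() and sum_bytes()
+are hypothetical helpers that take the addresses and the upper-layer data
+as separate flat buffers, and the sketch is limited to non-jumbo payloads.
+

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add len bytes, taken as big-endian 16-bit words, to a running sum. */
static uint32_t
sum_bytes(uint32_t sum, const uint8_t *p, size_t len)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	if (len & 1)
		sum += (uint32_t)p[len - 1] << 8;	/* pad last odd byte */
	return sum;
}

/*
 * Checksum over the IPv6 pseudo-header (source, destination, 32-bit
 * upper-layer length, three zero bytes, next header) and the
 * upper-layer data.  The addresses need not be adjacent to the data.
 */
static uint16_t
in6_cksum_sketch(const uint8_t src[16], const uint8_t dst[16],
    uint8_t next_header, const uint8_t *upper, uint32_t upper_len)
{
	uint8_t tail[8] = {0};
	uint32_t sum = 0;

	assert(src != NULL && dst != NULL && upper != NULL);
	assert(upper_len <= 65535);	/* keeps the 32-bit sum from wrapping */

	tail[0] = (uint8_t)(upper_len >> 24);
	tail[1] = (uint8_t)(upper_len >> 16);
	tail[2] = (uint8_t)(upper_len >> 8);
	tail[3] = (uint8_t)upper_len;
	tail[7] = next_header;

	sum = sum_bytes(sum, src, 16);
	sum = sum_bytes(sum, dst, 16);
	sum = sum_bytes(sum, tail, sizeof(tail));
	sum = sum_bytes(sum, upper, upper_len);
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);	/* fold carries */
	return (uint16_t)~sum;
}
```

+
+A receiver can verify a packet by running the same function over the data
+with the checksum field filled in; the result is 0 for a valid packet.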
+
+To process IP6 header, extension headers and transport headers easily,
+KAME requires network drivers to store packets in one internal mbuf or
+one or more external mbufs. A typical old driver prepares two
+internal mbufs for 100 - 208 bytes data, however, KAME's reference
+implementation stores it in one external mbuf.
+
+"netstat -s -p ip6" tells you whether or not your driver conforms to
+KAME's requirement. In the following example, "cce0" violates the
+requirement. (For more information, refer to Section 2.)
+
+ Mbuf statistics:
+ 317 one mbuf
+ two or more mbuf::
+ lo0 = 8
+ cce0 = 10
+ 3282 one ext mbuf
+ 0 two or more ext mbuf
+
+Each input function calls IP6_EXTHDR_CHECK in the beginning to check
+if the region between the IP6 header and its own header is
+contiguous. IP6_EXTHDR_CHECK calls m_pullup() only if the mbuf has
+M_LOOP flag, that is, the packet comes from the loopback
+interface. m_pullup() is never called for packets coming from physical
+network interfaces.
+
+TCP6 reassembly makes use of the IP6 header to store reassembly
+information. IP6 is not supposed to be just before TCP6, so the
+ip6tcpreass structure has a pointer to the TCP6 header. Of course, it
+also has a pointer back to the mbuf to avoid m_pullup().
+
+Like TCP6, both IP and IP6 reassemble functions never call m_pullup().
+
+xxx_ctlinput() calls in_mrejoin() on PRC_IFNEWADDR. We think this is
+one of the 4.4BSD implementation flaws. Since 4.4BSD keeps ia_multiaddrs
+in in_ifaddr{}, it can't use the multicast feature if the interface has no
+unicast address. So, if an application joins a multicast group on an
+interface and then all unicast addresses are removed from the interface,
+the application can't send/receive any multicast packets. Moreover, if a
+new unicast address is assigned to the interface, in_mrejoin() must be
+called. KAME's interfaces, however, ALWAYS have one link-local unicast
+address. These extensions have thus not been implemented in KAME.
+
+1.12 IPv4 mapped address and IPv6 wildcard socket
+
+RFC2553/3493 describes IPv4 mapped address (3.7) and special behavior
+of IPv6 wildcard bind socket (3.8). The spec allows you to:
+- Accept IPv4 connections by AF_INET6 wildcard bind socket.
+- Transmit IPv4 packet over AF_INET6 socket by using special form of
+ the address like ::ffff:10.1.1.1.
+but the spec itself is very complicated and does not specify how the
+socket layer should behave.
+Here we call the former one "listening side" and the latter one "initiating
+side", for reference purposes.
+
+Almost all KAME implementations treat tcp/udp port number space separately
+between IPv4 and IPv6. You can perform wildcard bind on both of the address
+families, on the same port.
+
+There are some OS-platform differences in KAME code, as we use tcp/udp
+code from different origin. The following table summarizes the behavior.
+
+                    listening side          initiating side
+                    (AF_INET6 wildcard      (connection to ::ffff:10.1.1.1)
+                    socket gets IPv4 conn.)
+                    ---                     ---
+KAME/BSDI3          not supported           not supported
+KAME/FreeBSD228     not supported           not supported
+KAME/FreeBSD3x      configurable            supported
+                    default: enabled
+KAME/FreeBSD4x      configurable            supported
+                    default: enabled
+KAME/NetBSD         configurable            supported
+                    default: disabled
+KAME/BSDI4          enabled                 supported
+KAME/OpenBSD        not supported           not supported
+
+The following sections will give you more details, and how you can
+configure the behavior.
+
+Comments on listening side:
+
+It appears that RFC2553/3493 says too little on the wildcard bind issue,
+specifically on (1) port space issue, (2) failure mode, (3) relationship
+between AF_INET/INET6 wildcard bind like ordering constraint, and (4) behavior
+when conflicting socket is opened/closed. There can be several separate
+interpretations of this RFC which conform to it but behave differently.
+So, to implement a portable application you should assume nothing
+about the behavior in the kernel. Using getaddrinfo() is the safest way.
+Port number space and wildcard bind issues were discussed in detail
+on the ipv6imp mailing list in mid March 1999, and it appears that there is
+no concrete consensus (meaning it is up to implementers). You may want to
+check the mailing list archives.
+We supply a tool called "bindtest" that explores the behavior of
+kernel bind(2). The tool will not be compiled by default.
+
+If a server application would like to accept IPv4 and IPv6 connections,
+it should use AF_INET and AF_INET6 sockets (you'll need two sockets).
+Use getaddrinfo() with AI_PASSIVE in ai_flags, and socket(2) and bind(2)
+to all the addresses returned.
+By opening multiple sockets, you can accept connections onto the socket with
+proper address family. IPv4 connections will be accepted by AF_INET socket,
+and IPv6 connections will be accepted by AF_INET6 socket (NOTE: KAME/BSDI4
+kernel sometimes violates this - we will fix it).
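+
+The multiple-socket approach can be sketched as follows. open_listeners()
+and MAXSOCK are hypothetical names chosen for this example; the library
+calls are the standard getaddrinfo(3) and socket API. Sockets that cannot be
+created or bound are skipped, since the behavior differs between kernels as
+described above.
+

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define MAXSOCK	8

/*
 * Open one listening TCP socket for each wildcard address returned by
 * getaddrinfo().  Descriptors are stored in socks[]; the return value
 * is the number of sockets opened (0 on failure).
 */
static int
open_listeners(const char *service, int socks[], int maxsock)
{
	struct addrinfo hints, *res, *ai;
	int n = 0;

	assert(service != NULL && socks != NULL);
	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;	/* no hardcoded address family */
	hints.ai_socktype = SOCK_STREAM;
	hints.ai_flags = AI_PASSIVE;
	if (getaddrinfo(NULL, service, &hints, &res) != 0)
		return 0;

	for (ai = res; ai != NULL && n < maxsock; ai = ai->ai_next) {
		int s = socket(ai->ai_family, ai->ai_socktype,
		    ai->ai_protocol);

		if (s < 0)
			continue;	/* family not supported here */
		if (bind(s, ai->ai_addr, ai->ai_addrlen) < 0 ||
		    listen(s, 5) < 0) {
			close(s);
			continue;
		}
		socks[n++] = s;
	}
	freeaddrinfo(res);
	return n;
}
```

+
+The caller would then wait on all returned descriptors with select(2) or
+poll(2) and call accept(2) on whichever becomes readable.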
+
+If you try to support IPv6 traffic only and would like to reject IPv4
+traffic, always check the peer address when a connection is made toward
+AF_INET6 listening socket. If the address is IPv4 mapped address, you may
+want to reject the connection. You can check the condition by using
+IN6_IS_ADDR_V4MAPPED() macro. This is one of the reasons the author of
+the section (itojun) dislikes special behavior of AF_INET6 wildcard bind.
+
+Comments on initiating side:
+
+Advice to application implementers: to implement a portable IPv6 application
+(which works on multiple IPv6 kernels), we believe that the following
+is the key to success:
+- NEVER hardcode AF_INET nor AF_INET6.
+- Use getaddrinfo() and getnameinfo() throughout the system.
+ Never use gethostby*(), getaddrby*(), inet_*() or getipnodeby*().
+- If you would like to connect to a destination, use getaddrinfo() and try
+ all the destinations returned, like telnet does.
+- Some IPv6 stacks ship with a buggy getaddrinfo(). Ship a minimal
+ working version with your application and use that as a last resort.
+
+If you would like to use AF_INET6 socket for both IPv4 and IPv6 outgoing
+connection, you will need tweaked implementation in DNS support libraries,
+as documented in RFC2553/3493 6.1. KAME libinet6 includes the tweak in
+getipnodebyname(). Note that getipnodebyname() itself is not recommended as
+it does not handle scoped IPv6 addresses at all. For IPv6 name resolution
+getaddrinfo() is the preferred API. getaddrinfo() does not implement the
+tweak.
+
+When writing applications that make outgoing connections, the story is much
+simpler if you treat AF_INET and AF_INET6 as totally separate address
+families. The {set,get}sockopt issue becomes simpler, and the DNS issue
+becomes simpler. We do not recommend that you rely upon IPv4 mapped address.
+
+1.12.1 KAME/BSDI3 and KAME/FreeBSD228
+
+The platforms do not support IPv4 mapped address at all (both listening side
+and initiating side). AF_INET6 and AF_INET sockets are totally separated.
+
+Port number space is totally separate between AF_INET and AF_INET6 sockets.
+
+It should be noted that KAME/BSDI3 and KAME/FreeBSD228 are not conformant
+to RFC2553/3493 section 3.7 and 3.8. It is due to code sharing reasons.
+
+1.12.2 KAME/FreeBSD[34]x
+
+KAME/FreeBSD3x and KAME/FreeBSD4x use shared tcp4/6 code (from
+sys/netinet/tcp*) and shared udp4/6 code (from sys/netinet/udp*).
+They use unified inpcb/in6pcb structure.
+
+1.12.2.1 KAME/FreeBSD[34]x, listening side
+
+The platform can be configured to support IPv4 mapped address/special
+AF_INET6 wildcard bind (enabled by default). There is no kernel compilation
+option to disable it. You can enable/disable the behavior with sysctl
+(per-node), or setsockopt (per-socket).
+
+Wildcard AF_INET6 socket grabs IPv4 connection if and only if the following
+conditions are satisfied:
+- there's no AF_INET socket that matches the IPv4 connection
+- the AF_INET6 socket is configured to accept IPv4 traffic, i.e.
+ getsockopt(IPV6_V6ONLY) returns 0.
+
+(XXX need checking)
+
+1.12.2.2 KAME/FreeBSD[34]x, initiating side
+
+KAME/FreeBSD3x supports outgoing connection to IPv4 mapped address
+(::ffff:10.1.1.1), if the node is configured to accept IPv4 connections
+by AF_INET6 socket.
+
+(XXX need checking)
+
+1.12.3 KAME/NetBSD
+
+KAME/NetBSD uses shared tcp4/6 code (from sys/netinet/tcp*) and shared
+udp4/6 code (from sys/netinet/udp*). The implementation is made differently
+from KAME/FreeBSD[34]x. KAME/NetBSD uses separate inpcb/in6pcb structures,
+while KAME/FreeBSD[34]x uses merged inpcb structure.
+
+It should be noted that the default configuration of KAME/NetBSD is not
+conformant to RFC2553/3493 section 3.8. It is intentionally turned off by
+default for security reasons.
+
+The platform can be configured to support IPv4 mapped address/special AF_INET6
+wildcard bind (disabled by default). Kernel behavior can be summarized as
+follows:
+- default: special support code will be compiled in, but is disabled by
+ default. It can be controlled by sysctl (net.inet6.ip6.v6only),
+ or setsockopt(IPV6_V6ONLY).
+- add "INET6_BINDV6ONLY": No special support code for AF_INET6 wildcard socket
+ will be compiled in. AF_INET6 sockets and AF_INET sockets are totally
+ separate. The behavior is similar to what is described in 1.12.1.
+
+sysctl setting will affect per-socket configuration at in6pcb creation time
+only. In other words, per-socket configuration will be copied from sysctl
+configuration at in6pcb creation time. To change per-socket behavior, you
+must perform setsockopt or reopen the socket. Change in sysctl configuration
+will not change the behavior of sockets that are already opened.
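+
+Per-socket configuration can be sketched as follows. set_v6only() is a
+hypothetical helper; it uses the standard IPV6_V6ONLY socket option and
+reads the value back, so the caller can see what the kernel actually
+accepted (some kernels may refuse to turn the option off).
+

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>

/*
 * Set IPV6_V6ONLY on AF_INET6 socket s.  Returns the value read back
 * from the kernel (0 or 1), or -1 if either call fails.
 */
static int
set_v6only(int s, int on)
{
	int val = on ? 1 : 0;
	socklen_t len = sizeof(val);

	assert(s >= 0);
	if (setsockopt(s, IPPROTO_IPV6, IPV6_V6ONLY, &val, sizeof(val)) < 0)
		return -1;
	if (getsockopt(s, IPPROTO_IPV6, IPV6_V6ONLY, &val, &len) < 0)
		return -1;
	return val != 0;
}
```

+
+The option should be set before bind(2), since it changes which incoming
+connections the socket will match.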
+
+1.12.3.1 KAME/NetBSD, listening side
+
+Wildcard AF_INET6 socket grabs IPv4 connection if and only if the following
+conditions are satisfied:
+- there's no AF_INET socket that matches the IPv4 connection
+- the AF_INET6 socket is configured to accept IPv4 traffic, i.e.
+ getsockopt(IPV6_V6ONLY) returns 0.
+
+You cannot bind(2) with IPv4 mapped address. This is a workaround for port
+number duplicate and other twists.
+
+1.12.3.2 KAME/NetBSD, initiating side
+
+When getsockopt(IPV6_V6ONLY) is 0 for a socket, you can make an outgoing
+traffic to IPv4 destination over AF_INET6 socket, using IPv4 mapped
+address destination (::ffff:10.1.1.1).
+
+When getsockopt(IPV6_V6ONLY) is 1 for a socket, you cannot use IPv4 mapped
+address for outgoing traffic.
+
+1.12.4 KAME/BSDI4
+
+KAME/BSDI4 uses NRL-based TCP/UDP stack and inpcb source code,
+which was derived from NRL IPv6/IPsec stack. We guess it supports IPv4 mapped
+address and special AF_INET6 wildcard bind. The implementation is, again,
+different from other KAME/*BSDs.
+
+1.12.4.1 KAME/BSDI4, listening side
+
+NRL inpcb layer supports special behavior of AF_INET6 wildcard socket.
+There is no way to disable the behavior.
+
+Wildcard AF_INET6 socket grabs IPv4 connection if and only if the following
+condition is satisfied:
+- there's no AF_INET socket that matches the IPv4 connection
+
+1.12.4.2 KAME/BSDI4, initiating side
+
+KAME/BSDI4 supports connection initiation to IPv4 mapped address
+(like ::ffff:10.1.1.1).
+
+1.12.5 KAME/OpenBSD
+
+KAME/OpenBSD uses NRL-based TCP/UDP stack and inpcb source code,
+which was derived from NRL IPv6/IPsec stack.
+
+It should be noted that KAME/OpenBSD is not conformant to RFC2553/3493 section
+3.7 and 3.8. It is intentionally omitted for security reasons.
+
+1.12.5.1 KAME/OpenBSD, listening side
+
+KAME/OpenBSD disables special behavior on AF_INET6 wildcard bind for
+security reasons (if IPv4 traffic toward AF_INET6 wildcard bind is allowed,
+access control will become much harder). KAME/BSDI4 uses NRL-based TCP/UDP
+stack as well, however, the behavior is different due to OpenBSD's security
+policy.
+
+As a result the behavior of KAME/OpenBSD is similar to KAME/BSDI3 and
+KAME/FreeBSD228 (see 1.12.1 for more detail).
+
+1.12.5.2 KAME/OpenBSD, initiating side
+
+KAME/OpenBSD does not support connection initiation to IPv4 mapped address
+(like ::ffff:10.1.1.1).
+
+1.12.6 More issues
+
+IPv4 mapped address support adds a big requirement to EVERY userland codebase.
+Every userland code should check if an AF_INET6 sockaddr contains IPv4
+mapped address or not. This adds many twists:
+
+- Access controls code becomes harder to write.
+ For example, if you would like to reject packets from 10.0.0.0/8,
+ you need to reject packets to AF_INET socket from 10.0.0.0/8,
+ and to AF_INET6 socket from ::ffff:10.0.0.0/104.
+- If a protocol on top of IPv4 is defined differently with IPv6, we need to be
+ really careful when we determine which protocol to use.
+ For example, with FTP protocol, we can not simply use sa_family to determine
+ FTP command sets. The following example is incorrect:
+    if (sa_family == AF_INET)
+        use EPSV/EPRT or PASV/PORT; /*IPv4*/
+    else if (sa_family == AF_INET6)
+        use EPSV/EPRT or LPSV/LPRT; /*IPv6*/
+    else
+        error;
+ The correct code, with consideration to IPv4 mapped address, would be:
+    if (sa_family == AF_INET)
+        use EPSV/EPRT or PASV/PORT; /*IPv4*/
+    else if (sa_family == AF_INET6 && IPv4 mapped address)
+        use EPSV/EPRT or PASV/PORT; /*IPv4 command set on AF_INET6*/
+    else if (sa_family == AF_INET6 && !IPv4 mapped address)
+        use EPSV/EPRT or LPSV/LPRT; /*IPv6*/
+    else
+        error;
+ It is too much to ask everybody to be careful like this.
+ The problem is, we are not sure if the above code fragment is perfect for
+ all situations.
+- By enabling kernel support for IPv4 mapped address (outgoing direction),
+ servers on the kernel can be hosed by IPv6 native packet that has IPv4
+ mapped address in IPv6 header source, and can generate unwanted IPv4 packets.
+ draft-itojun-ipv6-transition-abuse-01.txt,
+ draft-cmetz-v6ops-v4mapped-api-harmful-00.txt, and
+ draft-itojun-v6ops-v4mapped-harmful-01.txt have more on this scenario.
+
+Due to the above twists, some KAME userland programs have restrictions on
+the use of IPv4 mapped addresses:
+- rshd/rlogind do not accept connections from IPv4 mapped address.
+ This is to avoid malicious use of IPv4 mapped address in IPv6 native
+ packet, to bypass source-address based authentication.
+- ftp/ftpd assume that you are on dual stack network. IPv4 mapped address
+ will be decoded in userland, and will be passed to AF_INET sockets
+ (in other words, ftp/ftpd do not support SIIT environment).
+
+1.12.7 Interaction with SIIT translator
+
+SIIT translator is specified in RFC2765. KAME node cannot become a SIIT
+translator box, nor SIIT end node (a node in SIIT cloud).
+
+To become a SIIT translator box, we need to put additional code for that.
+We do not have the code in our tree at this moment.
+
+There are multiple reasons that we are unable to become SIIT end node.
+(1) SIIT translators require end nodes in the SIIT cloud to be IPv6-only.
+Since we are unable to compile INET-less kernel, we are unable to become
+SIIT end node. (2) As presented in 1.12.6, some of our userland code assumes
+dual stack network. (3) KAME stack filters out IPv6 packets with IPv4
+mapped address in the header, to secure non-SIIT case (which is much more
+common). Effectively KAME node will reject any packets via SIIT translator
+box. See section 1.14 for more detail about the last item.
+
+There are documentation issues too - SIIT document requires very strange
+things. For example, SIIT document asks IPv6-only (meaning no IPv4 code)
+node to be able to construct IPv4 IPsec headers. If a node knows how to
+construct IPv4 IPsec headers, that is not an IPv6-only node, it is a dual-stack
+node. The requirements imposed in SIIT document contradict with the other
+part of the document itself.
+
+1.13 sockaddr_storage
+
+When RFC2553 was about to be finalized, there was discussion on how struct
+sockaddr_storage members are named. One proposal was to prepend "__" to the
+members (like "__ss_len") as they should not be touched. The other proposal
+was not to prepend it (like "ss_len") as we need to touch those members
+directly. There was no clear consensus on it.
+
+As a result, RFC2553 defines struct sockaddr_storage as follows:
+    struct sockaddr_storage {
+        u_char __ss_len;    /* address length */
+        u_char __ss_family; /* address family */
+        /* and bunch of padding */
+    };
+On the contrary, XNET draft defines as follows:
+    struct sockaddr_storage {
+        u_char ss_len;      /* address length */
+        u_char ss_family;   /* address family */
+        /* and bunch of padding */
+    };
+
+In December 1999, it was agreed that RFC2553bis (RFC3493) should pick the
+latter (XNET) definition.
+
+KAME kit prior to December 1999 used RFC2553 definition. KAME kit after
+December 1999 (including December) will conform to XNET definition,
+based on RFC3493 discussion.
+
+If you look at multiple IPv6 implementations, you will be able to see
+both definitions. As a userland programmer, the most portable way of
+dealing with it is to:
+(1) ensure ss_family and/or ss_len are available on the platform, by using
+ GNU autoconf,
+(2) have -Dss_family=__ss_family to unify all occurrences (including header
+ file) into __ss_family, or
+(3) never touch __ss_family. Cast to sockaddr * and use sa_family like:
+    struct sockaddr_storage ss;
+    family = ((struct sockaddr *)&ss)->sa_family;
+
+1.14 Invalid addresses on the wire
+
+Some IPv6 transition technologies embed an IPv4 address into an IPv6 address.
+These specifications themselves are fine, however, there can be a certain
+set of attacks enabled by these specifications. Recent specification
+documents cover those issues, however, there are already-published RFCs
+that do not have protection against those (like using source address of
+::ffff:127.0.0.1 to bypass "reject packet from remote" filter).
+
+To name a few, these address ranges can be used to hose an IPv6 implementation,
+or bypass security controls:
+- IPv4 mapped address that embeds unspecified/multicast/loopback/broadcast
+ IPv4 address (if they are in IPv6 native packet header, they are malicious)
+ ::ffff:0.0.0.0/104 ::ffff:127.0.0.0/104
+ ::ffff:224.0.0.0/100 ::ffff:255.0.0.0/104
+- 6to4 (RFC3056) prefix generated from unspecified/multicast/loopback/
+ broadcast/private IPv4 address
+ 2002:0000::/24 2002:7f00::/24 2002:e000::/24
+ 2002:ff00::/24 2002:0a00::/24 2002:ac10::/28
+ 2002:c0a8::/32
+- IPv4 compatible address that embeds unspecified/multicast/loopback/broadcast
+ IPv4 address (if they are in IPv6 native packet header, they are malicious).
+ Note that, since KAME does not support RFC1933/2893 auto tunnels, KAME nodes
+ are not vulnerable to these packets.
+ ::0.0.0.0/104 ::127.0.0.0/104 ::224.0.0.0/100 ::255.0.0.0/104
+
+Also, since KAME does not support RFC1933/2893 auto tunnels, seeing IPv4
+compatible addresses is very rare. You should take caution if you see those
+on the wire.
+
+If we see IPv6 packets with IPv4 mapped address (::ffff:0.0.0.0/96) in the
+header in dual-stack environment (not in SIIT environment), they indicate
+that someone is trying to impersonate IPv4 peer. The packet should be dropped.
+
+IPv6 specifications do not talk very much about IPv6 unspecified address (::)
+in the IPv6 source address field. Clarification is in progress.
+Here are a couple of comments:
+- IPv6 unspecified address can be used in IPv6 source address field, if and
+ only if we have no legal source address for the node. The legal situations
+ include, but may not be limited to, (1) MLD while no IPv6 address is assigned
+ to the node and (2) DAD.
+- If IPv6 TCP packet has IPv6 unspecified address, it is an attack attempt.
+ The form can be used as a trigger for TCP DoS attack. KAME code already
+ filters them out.
+- The following examples are seemingly illegal. It seems that there's general
+ consensus among ipngwg for those. (1) Mobile IPv6 home address option,
+ (2) offlink packets (so routers should not forward them).
+ KAME implements (2) already.
+
+KAME code is carefully written to avoid such incidents. More specifically,
+KAME kernel will reject packets with certain source/destination address in IPv6
+base header, or IPv6 routing header. Also, KAME default configuration file
+is written carefully, to avoid those attacks.
+
+draft-itojun-ipv6-transition-abuse-01.txt,
+draft-cmetz-v6ops-v4mapped-api-harmful-00.txt and
+draft-itojun-v6ops-v4mapped-harmful-01.txt have more on this issue.
+
+1.15 Node's required addresses
+
+RFC2373 section 2.8 talks about required addresses for an IPv6
+node. This section talks about how KAME stack manages those required
+addresses.
+
+1.15.1 Host case
+
+The following items are automatically assigned to the node (or the node will
+automatically join the group), at bootstrap time:
+- Loopback address
+- All-nodes multicast addresses (ff01::1)
+
+The following items will be automatically handled when the interface becomes
+IFF_UP:
+- Its link-local address for each interface
+- Solicited-node multicast address for link-local addresses
+- Link-local allnodes multicast address (ff02::1)
+
+The following items need to be configured manually by ifconfig(8) or prefix(8).
+Alternatively, these can be autoconfigured by using stateless address
+autoconfiguration.
+- Assigned unicast/anycast addresses
+- Solicited-Node multicast address for assigned unicast address
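+
+For reference, the solicited-node multicast address is derived from the
+low-order 24 bits of the unicast address, in the form ff02::1:ffXX:XXXX.
+The sketch below shows the derivation; solicited_node() is a hypothetical
+helper name and not a KAME function.
+

```c
#include <assert.h>
#include <netinet/in.h>
#include <string.h>

/* Build ff02::1:ffXX:XXXX from the low 24 bits of ucast. */
static void
solicited_node(const struct in6_addr *ucast, struct in6_addr *mcast)
{
	assert(ucast != NULL && mcast != NULL);
	memset(mcast, 0, sizeof(*mcast));
	mcast->s6_addr[0] = 0xff;	/* ff02: link-local multicast */
	mcast->s6_addr[1] = 0x02;
	mcast->s6_addr[11] = 0x01;
	mcast->s6_addr[12] = 0xff;
	mcast->s6_addr[13] = ucast->s6_addr[13];
	mcast->s6_addr[14] = ucast->s6_addr[14];
	mcast->s6_addr[15] = ucast->s6_addr[15];
}
```
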
+
+Users can join groups by using appropriate system calls like setsockopt(2).
+
+1.15.2 Router case
+
+In addition to the above, routers need to handle the following items.
+
+The following items need to be configured manually by using ifconfig(8).
+o The subnet-router anycast addresses for the interfaces it is configured
+ to act as a router on (prefix::/64)
+o All other anycast addresses with which the router has been configured
+
+The router will join the following multicast group when rtadvd(8) is available
+for the interface.
+o All-Routers Multicast Addresses (ff02::2)
+
+Routing daemons will join appropriate multicast groups, as necessary,
+like ff02::9 for RIPng.
+
+Users can join groups by using appropriate system calls like setsockopt(2).
+
+1.16 Advanced API
+
+Current KAME kernel implements RFC3542 API. It also implements RFC2292 API,
+for backward compatibility purposes with *BSD-integrated codebase.
+KAME tree ships with RFC3542 headers.
+*BSD-integrated codebase implements either RFC2292, or RFC3542, API.
+See "COVERAGE" document for detailed implementation status.
+
+Here are a couple of issues to mention:
+- *BSD-integrated binaries, compiled for RFC2292, will work on KAME kernel.
+ For example, OpenBSD 2.7 /sbin/rtsol will work on KAME/openbsd kernel.
+- KAME binaries, compiled using RFC3542, will not work on *BSD-integrated
+ kernel. For example, KAME /usr/local/v6/sbin/rtsol will not work on
+ OpenBSD 2.7 kernel.
+- RFC3542 API is not compatible with RFC2292 API. RFC3542 #define symbols
+ conflict with RFC2292 symbols. Therefore, if you compile programs that
+ assume RFC2292 API, the compilation itself goes fine, however, the compiled
+ binary will not work correctly. The problem is not a KAME issue, but an API
+ issue. For example, Solaris 8 implements RFC3542 API. If you compile
+ RFC2292-based code on Solaris 8, the binary can behave strangely.
+
+There are a few incompatible behaviors in RFC2292 binary backward
+compatibility support in KAME tree. To enumerate:
+- Type 0 routing header lacks support for strict/loose bitmap.
+ Even if we see packets with "strict" bit set, those bits will not be made
+ visible to the userland.
+ Background: RFC2292 document is based on RFC1883 IPv6, and it uses
+ strict/loose bitmap. RFC3542 document is based on RFC2460 IPv6, and it has
+ no strict/loose bitmap (it was removed from RFC2460). KAME tree obeys
+ RFC2460 IPv6, and lacks support for strict/loose bitmap.
+
+The RFC3542 document leaves some particular cases unspecified. The
+KAME implementation treats them as follows:
+- The IPV6_DONTFRAG and IPV6_RECVPATHMTU socket options for TCP
+ sockets are ignored. That is, the setsockopt() call will succeed
+ but the specified value will have no effect.
+
+1.17 DNS resolver
+
+KAME ships with a modified DNS resolver, in libinet6.a.
+libinet6.a has a couple of extensions over the libc DNS resolver:
+- Can take "options insecure1" and "options insecure2" in /etc/resolv.conf,
+ which toggle the RES_INSECURE[12] option flag bits.
+- EDNS0 receive buffer size notification support. It can be enabled by
+ "options edns0" in /etc/resolv.conf. See USAGE for details.
+- IPv6 transport support (queries/responses over IPv6). Most BSD official
+ releases now have it already.
+- Partial A6 chain chasing/DNAME/bit string label support (KAME/BSDI4).
+
+
+2. Network Drivers
+
+KAME requires two items to be added into the standard drivers:
+
+(1) (freebsd[234] and bsdi[34] only) mbuf clustering requirement.
+ In this stable release, we changed MINCLSIZE into MHLEN+1 for all the
+ operating systems in order to make all the drivers behave as we expect.
+
+(2) multicast. If "ifmcstat" yields no multicast group for an
+ interface, that interface has to be patched.
+
+To avoid troubles, we suggest that you comment out the device drivers
+for unsupported/unnecessary cards in the kernel configuration file.
+If you accidentally enable unsupported drivers, some of the userland
+tools may not work correctly (routing daemons are a typical example).
+
+In the following sections, "official support" means that KAME developers
+are using that ethernet card/driver frequently.
+
+(NOTE: In the past we required all pcmcia drivers to have a call to
+in6_ifattach(). We have no such requirement any more)
+
+2.1 FreeBSD 2.2.x-RELEASE
+
+Here is a list of FreeBSD 2.2.x-RELEASE drivers and their conditions:
+
+    driver      mbuf(1)     multicast(2)    official support?
+    ---         ---         ---             ---
+    (Ethernet)
+    ar          looks ok    -               -
+    cnw         ok          ok              yes (*)
+    ed          ok          ok              yes
+    ep          ok          ok              yes
+    fe          ok          ok              yes
+    sn          looks ok    -               - (*)
+    vx          looks ok    -               -
+    wlp         ok          ok              - (*)
+    xl          ok          ok              yes
+    zp          ok          ok              -
+    (FDDI)
+    fpa         looks ok    ?               -
+    (ATM)
+    en          ok          ok              yes
+    (Serial)
+    lp          ?           -               not work
+    sl          ?           -               not work
+    sr          looks ok    ok              - (**)
+
+You may want to add an invocation of "rtsol" in "/etc/pccard_ether",
+if you are using notebook computers and PCMCIA ethernet card.
+
+(*) These drivers are distributed with PAO (http://www.jp.freebsd.org/PAO/).
+
+(**) There was a report that, if you bring the sr driver up, down and
+then up again, the kernel may hang. We have disabled frame-relay support in
+the sr driver, and after that it appears to be working fine. If you need
+frame-relay support to come back, please contact KAME developers.
+
+2.2 BSD/OS 3.x
+
+The following lists BSD/OS 3.x device drivers and their conditions:
+
+    driver      mbuf(1)     multicast(2)    official support?
+    ---         ---         ---             ---
+    (Ethernet)
+    cnw         ok          ok              yes
+    de          ok          ok              -
+    df          ok          ok              -
+    eb          ok          ok              -
+    ef          ok          ok              yes
+    exp         ok          ok              -
+    mz          ok          ok              yes
+    ne          ok          ok              yes
+    we          ok          ok              -
+    (FDDI)
+    fpa         ok          ok              -
+    (ATM)
+    en          maybe       ok              -
+    (Serial)
+    ntwo        ok          ok              yes
+    sl          ?           -               not work
+    appp        ?           -               not work
+
+You may want to use "@insert" directive in /etc/pccard.conf to invoke
+"rtsol" command right after dynamic insertion of PCMCIA ethernet cards.
+
+2.3 NetBSD
+
+The following table lists the network drivers we have tried so far.
+
+ driver mbuf(1) multicast(2) official support?
+ --- --- --- ---
+ (Ethernet)
+ awi pcmcia/i386 ok ok -
+ bah zbus/amiga NG(*)
+ cnw pcmcia/i386 ok ok yes
+ ep pcmcia/i386 ok ok -
+ fxp pci/i386 ok(*2) ok -
+ tlp pci/i386 ok ok -
+ le sbus/sparc ok ok yes
+ ne pci/i386 ok ok yes
+ ne pcmcia/i386 ok ok yes
+ rtk pci/i386 ok ok -
+ wi pcmcia/i386 ok ok yes
+ (ATM)
+ en pci/i386 ok ok -
+
+(*) This may need some fix, but I'm not sure what arcnet interfaces assume...
+
+2.4 FreeBSD 3.x-RELEASE
+
+Here is a list of FreeBSD 3.x-RELEASE drivers and their conditions:
+
+ driver mbuf(1) multicast(2) official support?
+ --- --- --- ---
+ (Ethernet)
+ cnw ok ok -(*)
+ ed ? ok -
+ ep ok ok -
+ fe ok ok yes
+ fxp ?(**)
+ lnc ? ok -
+ sn ? ? -(*)
+ wi ok ok yes
+ xl ? ok -
+
+(*) These drivers are distributed with PAO as PAO3
+ (http://www.jp.freebsd.org/PAO/).
+(**) There were trouble reports with multicast filter initialization.
+
+More drivers will just simply work on KAME FreeBSD 3.x-RELEASE but have not
+been checked yet.
+
+2.5 FreeBSD 4.x-RELEASE
+
+Here is a list of FreeBSD 4.x-RELEASE drivers and their conditions:
+
+ driver multicast
+ --- ---
+ (Ethernet)
+ lnc/vmware ok
+
+2.6 OpenBSD 2.x
+
+Here is a list of OpenBSD 2.x drivers and their conditions:
+
+ driver            mbuf(1)   multicast(2)  official support?
+ ---               ---       ---           ---
+ (Ethernet)
+ de pci/i386       ok        ok            yes
+ fxp pci/i386      ?(*)
+ le sbus/sparc     ok        ok            yes
+ ne pci/i386       ok        ok            yes
+ ne pcmcia/i386    ok        ok            yes
+ wi pcmcia/i386    ok        ok            yes
+
+(*) There seems to be a problem in the driver with multicast filter
+configuration. This happens with certain chipset revisions on the card.
+It should be fixed by now by a workaround in sys/net/if.c, but we are still
+not sure.
+
+2.7 BSD/OS 4.x
+
+The following lists BSD/OS 4.x device drivers and their status:
+
+ driver            mbuf(1)   multicast(2)  official support?
+ ---               ---       ---           ---
+ (Ethernet)
+ de                ok        ok            yes
+ exp               (*)
+
+You may want to use the "@insert" directive in /etc/pccard.conf to invoke
+the "rtsol" command right after dynamic insertion of PCMCIA Ethernet cards.
+
+(*) The exp driver has a serious conflict with the KAME initialization
+sequence. A workaround has been committed to sys/i386/pci/if_exp.c, and it
+should be okay by now.
+
+
+3. Translator
+
+We categorize IPv4/IPv6 translators into four types.
+
+Translator A --- It is used in the early stage of transition to make
+it possible to establish a connection from an IPv6 host in an IPv6
+island to an IPv4 host in the IPv4 ocean.
+
+Translator B --- It is used in the early stage of transition to make
+it possible to establish a connection from an IPv4 host in the IPv4
+ocean to an IPv6 host in an IPv6 island.
+
+Translator C --- It is used in the late stage of transition to make it
+possible to establish a connection from an IPv4 host in an IPv4 island
+to an IPv6 host in the IPv6 ocean.
+
+Translator D --- It is used in the late stage of transition to make it
+possible to establish a connection from an IPv6 host in the IPv6 ocean
+to an IPv4 host in an IPv4 island.
+
+KAME provides a TCP relay translator for category A. This is called
+"FAITH". We also provide an IP header translator for category A.
+
+3.1 FAITH TCP relay translator
+
+The FAITH system uses a TCP relay daemon called "faithd", helped by the KAME
+kernel. FAITH reserves an IPv6 address prefix, and relays TCP connections
+toward that prefix to the IPv4 destination.
+
+For example, if the reserved IPv6 prefix is 3ffe:0501:0200:ffff::, and
+the IPv6 destination for TCP connection is 3ffe:0501:0200:ffff::163.221.202.12,
+the connection will be relayed toward IPv4 destination 163.221.202.12.
+
+    destination IPv4 node (163.221.202.12)
+      ^
+      | IPv4 TCP toward 163.221.202.12
+    FAITH-relay dual stack node
+      ^
+      | IPv6 TCP toward 3ffe:0501:0200:ffff::163.221.202.12
+    source IPv6 node
+
+faithd must be invoked on the FAITH-relay dual stack node.
+
+For more details, consult kame/kame/faithd/README and RFC3142.
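+
+The address mapping in the example above can be illustrated with a short C
+sketch. This is not faithd code: faith_embedded_v4() is a hypothetical helper
+that only assumes the IPv4 destination occupies the low-order 32 bits of the
+IPv6 destination, as in the example. It does not check that the address
+falls under the reserved prefix.
+

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of faithd): recover the IPv4 destination
 * embedded in the low-order 32 bits of an IPv6 address.
 * Returns 0 on success, -1 if the text cannot be parsed or formatted.
 */
int
faith_embedded_v4(const char *v6text, char *v4text, size_t v4len)
{
	struct in6_addr a6;
	struct in_addr a4;

	if (inet_pton(AF_INET6, v6text, &a6) != 1)
		return (-1);
	/* The last four bytes of the IPv6 address carry the IPv4 address. */
	memcpy(&a4, &a6.s6_addr[12], sizeof(a4));
	if (inet_ntop(AF_INET, &a4, v4text, (socklen_t)v4len) == NULL)
		return (-1);
	return (0);
}
```

+
+Given "3ffe:0501:0200:ffff::163.221.202.12" and a buffer of at least
+INET_ADDRSTRLEN bytes, the helper fills the buffer with "163.221.202.12".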
+
+3.2 IPv6-to-IPv4 header translator
+
+(to be written)
+
+
+4. IPsec
+
+IPsec is implemented as the following three components.
+
+(1) Policy Management
+(2) Key Management
+(3) AH, ESP and IPComp handling in kernel
+
+Note that KAME/OpenBSD does NOT include support for KAME IPsec code,
+as OpenBSD team has their home-brew IPsec stack and they have no plan
+to replace it. IPv6 support for IPsec is, therefore, lacking on KAME/OpenBSD.
+
+http://www.netbsd.org/Documentation/network/ipsec/ has more information
+including usage examples.
+
+4.1 Policy Management
+
+The kernel implements experimental policy management code. There are two ways
+to manage security policy. One is to configure a per-socket policy using
+setsockopt(2). In this case, the policy configuration is described in
+ipsec_set_policy(3). The other is to configure a kernel packet filter-based
+policy using the PF_KEY interface, via setkey(8).
+
+Policy entries are matched in order, so the order of entries makes a
+difference in behavior.
+
+4.2 Key Management
+
+The key management code implemented in this kit (sys/netkey) is a
+home-brew PFKEY v2 implementation. This conforms to RFC2367.
+
+The home-brew IKE daemon, "racoon", is included in the kit (kame/kame/racoon,
+or usr.sbin/racoon).
+Basically you'll need to run racoon as a daemon, then set up a policy
+to require keys (like ping -P 'out ipsec esp/transport//use').
+The kernel will contact the racoon daemon as necessary to exchange keys.
+
+In the IKE spec, there's ambiguity about the interpretation of a "tunnel"
+proposal. For example, if we would like to propose the use of the following
+packet:
+    IP AH ESP IP payload
+some implementations propose it as "AH transport and ESP tunnel", since
+this is more logical from the packet construction point of view. Other
+implementations propose it as "AH tunnel and ESP tunnel".
+Racoon follows the latter route (previously it followed the former; the
+latter interpretation seems to be the popular one/consensus).
+This raises a real interoperability issue. We hope it will be resolved
+quickly.
+
+racoon does not implement byte lifetime for either phase 1 or phase 2
+(RFC2409 page 35, Life Type = kilobytes).
+
+4.3 AH and ESP handling
+
+The IPsec module is implemented as "hooks" into the standard IPv4/IPv6
+processing. When sending a packet, ip{,6}_output() checks if ESP/AH
+processing is required by checking if a matching SPD (Security
+Policy Database) entry is found. If ESP/AH is needed,
+{esp,ah}{4,6}_output() will be called and the mbuf will be updated
+accordingly. When a packet is received, {esp,ah}4_input() will be
+called based on the protocol number, i.e. (*inetsw[proto])().
+{esp,ah}4_input() will decrypt/check the authenticity of the packet,
+and strip off the daisy-chained header and padding for ESP/AH. It is
+safe to strip off the ESP/AH header on packet reception, since we
+will never use the received packet in "as is" form.
+
+By using ESP/AH, TCP4/6 effective data segment size will be affected by
+extra daisy-chained headers inserted by ESP/AH. Our code takes care of
+the case.
+
+Basic crypto functions can be found in the directory "sys/crypto". ESP/AH
+transforms are listed in {esp,ah}_core.c with wrapper functions. If you
+wish to add an algorithm, add a wrapper function in {esp,ah}_core.c, and
+add your crypto algorithm code to sys/crypto.
+
+Tunnel mode works basically fine, but comes with the following restrictions:
+- You cannot run routing daemon across IPsec tunnel, since we do not model
+ IPsec tunnel as pseudo interfaces.
+- Authentication model for AH tunnel must be revisited. We'll need to
+ improve the policy management engine, eventually.
+- Path MTU discovery does not work across IPv6 IPsec tunnel gateway due to
+ insufficient code.
+
+The AH specification does not say much about the "multiple AH on a packet"
+case. We compute AH checksums incrementally, from inside to outside. Also,
+we treat the inner AH as immutable.
+For example, if we are to create the following packet:
+    IP AH1 AH2 AH3 payload
+we do it incrementally. As a result, we get crypto checksums like below:
+    AH3 has checksum against "IP AH3' payload".
+        where AH3' = AH3 with checksum field filled with 0.
+    AH2 has checksum against "IP AH2' AH3 payload".
+    AH1 has checksum against "IP AH1' AH2 AH3 payload".
+Also note that AH3 has the smallest sequence number, and AH1 has the largest
+sequence number.
+
+To avoid traffic analysis on shorter packets, ESP output logic supports
+random length padding. By setting net.inet.ipsec.esp_randpad (or
+net.inet6.ipsec6.esp_randpad) to positive value N, you can ask the kernel
+to randomly pad packets shorter than N bytes, to random length smaller than
+or equal to N. Note that N does not include ESP authentication data length.
+Also note that the random padding is not included in TCP segment
+size computation. Negative value will turn off the functionality.
+Recommended values for N are around 128 or 256. If you use too big a number
+as N, you may experience inefficiency due to fragmented packets.
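+
+The padding rule described above can be sketched in C as follows. This is
+only an illustration of the stated rule, not the kernel's esp output code:
+randpad_target() and its parameters are hypothetical names, "rnd" stands in
+for whatever random source is used, and cipher block alignment and the ESP
+authentication data are ignored.
+

```c
#include <stddef.h>

/*
 * Hypothetical sketch of the esp_randpad rule (not the kernel code).
 * plen: length of the packet before random padding, in bytes
 * n:    value of the esp_randpad sysctl
 * rnd:  a random number supplied by the caller
 * Returns the length after padding.
 */
size_t
randpad_target(size_t plen, int n, unsigned int rnd)
{
	size_t limit;

	if (n < 0)
		return (plen);		/* negative value: feature is off */
	limit = (size_t)n;
	if (plen >= limit)
		return (plen);		/* only packets shorter than N */
	/* Pick a length in the range [plen, N]. */
	return (plen + rnd % (limit - plen + 1));
}
```

+
+The result never exceeds N when padding applies, and a packet that is
+already N bytes or longer is returned unchanged.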
+
+4.4 IPComp handling
+
+IPComp stands for IP payload compression protocol. It is aimed at
+payload compression, not header compression like PPP VJ compression.
+This may be useful when you are using a slow serial link (say, a cell phone)
+with a powerful CPU (well, recent notebook PCs are really powerful...).
+The protocol design of IPComp is very similar to IPsec, though it was
+defined separately from IPsec itself.
+
+Here are some points to be noted:
+- IPComp is treated as part of the IPsec protocol suite, and the SPI and
+  CPI spaces are unified. The spec says that there's no relationship
+  between the two, so they are assumed to be separate in the specs.
+- IPComp association (IPCA) is kept in SAD.
+- It is possible to use well-known CPI (CPI=2 for DEFLATE for example),
+ for outbound/inbound packet, but for indexing purposes one element from
+ SPI/CPI space will be occupied anyway.
+- pfkey is modified to support IPComp. However, there's no official
+  SA type number assignment yet. Portability with other IPComp
+  stacks is questionable (anyway, who else implements IPComp on UN*X?).
+- Spec says that IPComp output processing must be performed before AH/ESP
+ output processing, to achieve better compression ratio and "stir" data
+ stream before encryption. The most meaningful processing order is:
+ (1) compress payload by IPComp, (2) encrypt payload by ESP, then (3) attach
+ authentication data by AH.
+ However, with manual SPD setting, you are able to violate the ordering
+ (KAME code is too generic, maybe). Also, it is just okay to use IPComp
+ alone, without AH/ESP.
+- Though the packet size can be significantly decreased by using IPComp, no
+ special consideration is made about path MTU (spec talks nothing about MTU
+ consideration). IPComp is designed for serial links, not ethernet-like
+ medium, it seems.
+- You can change compression ratio on outbound packet, by changing
+ deflate_policy in sys/netinet6/ipcomp_core.c. You can also change outbound
+ history buffer size by changing deflate_window_out in the same source code.
+ (should it be sysctl accessible, or per-SAD configurable?)
+- Tunnel mode IPComp is not working right. A KAME box can generate tunneled
+  IPComp packets, but cannot accept them.
+- You can negotiate IPComp association with racoon IKE daemon.
+- KAME code does not attach Adler32 checksum to compressed data.
+  See the ipsec wg mailing list discussion in Jan 2000 for details.
+
+4.5 Conformance to RFCs and IDs
+
+The IPsec code in the kernel conforms (or, tries to conform) to the
+following standards:
+    "old IPsec" specification documented in rfc182[5-9].txt
+    "new IPsec" specification documented in:
+        rfc240[1-6].txt rfc241[01].txt rfc2451.txt rfc3602.txt
+    IPComp:
+        RFC2393: IP Payload Compression Protocol (IPComp)
+IKE specifications (rfc240[7-9].txt) are implemented in userland
+as "racoon" IKE daemon.
+
+Currently supported algorithms are:
+    old IPsec AH
+        null crypto checksum (no document, just for debugging)
+        keyed MD5 with 128bit crypto checksum (rfc1828.txt)
+        keyed SHA1 with 128bit crypto checksum (no document)
+        HMAC MD5 with 128bit crypto checksum (rfc2085.txt)
+        HMAC SHA1 with 128bit crypto checksum (no document)
+        HMAC RIPEMD160 with 128bit crypto checksum (no document)
+    old IPsec ESP
+        null encryption (no document, similar to rfc2410.txt)
+        DES-CBC mode (rfc1829.txt)
+    new IPsec AH
+        null crypto checksum (no document, just for debugging)
+        keyed MD5 with 96bit crypto checksum (no document)
+        keyed SHA1 with 96bit crypto checksum (no document)
+        HMAC MD5 with 96bit crypto checksum (rfc2403.txt)
+        HMAC SHA1 with 96bit crypto checksum (rfc2404.txt)
+        HMAC SHA2-256 with 96bit crypto checksum
+            (draft-ietf-ipsec-ciph-sha-256-00.txt)
+        HMAC SHA2-384 with 96bit crypto checksum (no document)
+        HMAC SHA2-512 with 96bit crypto checksum (no document)
+        HMAC RIPEMD160 with 96bit crypto checksum (RFC2857)
+        AES XCBC MAC with 96bit crypto checksum (RFC3566)
+    new IPsec ESP
+        null encryption (rfc2410.txt)
+        DES-CBC with derived IV
+            (draft-ietf-ipsec-ciph-des-derived-01.txt, draft expired)
+        DES-CBC with explicit IV (rfc2405.txt)
+        3DES-CBC with explicit IV (rfc2451.txt)
+        BLOWFISH CBC (rfc2451.txt)
+        CAST128 CBC (rfc2451.txt)
+        RIJNDAEL/AES CBC (rfc3602.txt)
+        AES counter mode (rfc3686.txt)
+
+        each of the above can be combined with new IPsec AH schemes for
+        ESP authentication.
+    IPComp
+        RFC2394: IP Payload Compression Using DEFLATE
+
+The following algorithms are NOT supported:
+    old IPsec AH
+        HMAC MD5 with 128bit crypto checksum + 64bit replay prevention
+            (rfc2085.txt)
+        keyed SHA1 with 160bit crypto checksum + 32bit padding (rfc1852.txt)
+
+The key/policy management API is based on the following document, with fair
+amount of extensions:
+ RFC2367: PF_KEY key management API
+
+4.6 ECN consideration on IPsec tunnels
+
+KAME IPsec implements ECN-friendly IPsec tunnel, described in
+draft-ietf-ipsec-ecn-02.txt.
+Normal IPsec tunnel is described in RFC2401. On encapsulation,
+IPv4 TOS field (or, IPv6 traffic class field) will be copied from inner
+IP header to outer IP header. On decapsulation outer IP header
+will be simply dropped. The decapsulation rule is not compatible
+with ECN, since ECN bit on the outer IP TOS/traffic class field will be
+lost.
+To make IPsec tunnel ECN-friendly, we should modify encapsulation
+and decapsulation procedure. This is described in
+draft-ietf-ipsec-ecn-02.txt, chapter 3.3.
+
+KAME IPsec tunnel implementation can give you three behaviors, by setting
+net.inet.ipsec.ecn (or net.inet6.ipsec6.ecn) to some value:
+- RFC2401: no consideration for ECN (sysctl value -1)
+- ECN forbidden (sysctl value 0)
+- ECN allowed (sysctl value 1)
+Note that the behavior is configurable in per-node manner, not per-SA manner
+(draft-ietf-ipsec-ecn-02 wants per-SA configuration, but it looks too much
+for me).
+
+The behavior is summarized as follows (see source code for more detail):
+
+                encapsulate                     decapsulate
+                ---                             ---
+RFC2401         copy all TOS bits               drop TOS bits on outer
+                from inner to outer.            (use inner TOS bits as is)
+
+ECN forbidden   copy TOS bits except for ECN    drop TOS bits on outer
+                (masked with 0xfc) from inner   (use inner TOS bits as is)
+                to outer.  set ECN bits to 0.
+
+ECN allowed     copy TOS bits except for ECN    use inner TOS bits with some
+                CE (masked with 0xfe) from      change.  if outer ECN CE bit
+                inner to outer.                 is 1, enable ECN CE bit on
+                set ECN CE bit to 0.            the inner.
+
+General strategy for configuration is as follows:
+- if both IPsec tunnel endpoints are capable of ECN-friendly behavior,
+  you'd better configure both ends to "ECN allowed" (sysctl value 1).
+- if the other end is very strict about TOS bits, use "RFC2401"
+  (sysctl value -1).
+- in other cases, use "ECN forbidden" (sysctl value 0).
+The default behavior is "ECN forbidden" (sysctl value 0).
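+
+The three behaviors summarized in the table above can be written down as a
+small C sketch. It is a restatement of the table, not a copy of the KAME
+source: the enum, macro and function names are hypothetical, and only the
+masks 0xfc and 0xfe and the sysctl values -1/0/1 come from the text.
+

```c
#include <stdint.h>

/* Hypothetical names for the three sysctl values. */
enum ecn_mode {
	ECN_RFC2401 = -1,	/* no consideration for ECN */
	ECN_FORBIDDEN = 0,
	ECN_ALLOWED = 1
};

#define	TOS_ECN_CE	0x01	/* the bit cleared by the 0xfe mask */

/* Encapsulation: derive the outer TOS/traffic class from the inner one. */
uint8_t
ecn_encap(enum ecn_mode mode, uint8_t inner)
{
	switch (mode) {
	case ECN_ALLOWED:
		return ((uint8_t)(inner & 0xfe));	/* clear ECN CE */
	case ECN_FORBIDDEN:
		return ((uint8_t)(inner & 0xfc));	/* clear both ECN bits */
	case ECN_RFC2401:
	default:
		return (inner);				/* copy all bits */
	}
}

/* Decapsulation: derive the TOS that the inner packet ends up with. */
uint8_t
ecn_decap(enum ecn_mode mode, uint8_t outer, uint8_t inner)
{
	if (mode == ECN_ALLOWED && (outer & TOS_ECN_CE) != 0)
		return ((uint8_t)(inner | TOS_ECN_CE));
	return (inner);		/* outer TOS bits are dropped */
}
```

+
+For an inner TOS of 0xb7, the outer TOS becomes 0xb7 under "RFC2401", 0xb4
+under "ECN forbidden" and 0xb6 under "ECN allowed". Only "ECN allowed"
+propagates an outer CE mark back to the inner header on decapsulation.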
+
+For more information, please refer to:
+ draft-ietf-ipsec-ecn-02.txt
+ RFC2481 (Explicit Congestion Notification)
+ KAME sys/netinet6/{ah,esp}_input.c
+
+(Thanks go to Kenjiro Cho <kjc@csl.sony.co.jp> for detailed analysis)
+
+4.7 Interoperability
+
+IPsec, IPComp (in kernel) and IKE (in userland as "racoon") have been tested
+at several interoperability test events, and they are known to interoperate
+well with many other implementations. Also, KAME IPsec has quite wide
+coverage of the IPsec crypto algorithms documented in RFCs (we do not cover
+algorithms with intellectual property issues, though).
+
+Here are (some of) the platforms we have tested IPsec/IKE interoperability
+with in the past, in no particular order. Note that both ends (KAME and
+others) may have modified their implementations, so use the following
+list just for reference purposes.
+ 6WIND, ACC, Allied-telesis, Altiga, Ashley-laurent (vpcom.com),
+ BlueSteel, CISCO IOS, Checkpoint FW-1, Compaq Tru64 UNIX
+ X5.1B-BL4, Cryptek, Data Fellows (F-Secure), Ericsson,
+ F-Secure VPN+ 5.40, Fitec, Fitel, FreeS/WAN, HITACHI, HiFn,
+ IBM AIX 5.1, III, IIJ (fujie stack), Intel Canada, Intel
+ Packet Protect, MEW NetCocoon, MGCS, Microsoft WinNT/2000/XP,
+ NAI PGPnet, NEC IX5000, NIST (linux IPsec + plutoplus),
+ NetLock, Netoctave, Netopia, Netscreen, Nokia EPOC, Nortel
+ GatewayController/CallServer 2000 (not released yet),
+ NxNetworks, OpenBSD isakmpd on OpenBSD, Oullim information
+ technologies SECUREWORKS VPN gateway 3.0, Pivotal, RSA,
+ Radguard, RapidStream, RedCreek, Routerware, SSH, SecGo
+ CryptoIP v3, Secure Computing, Soliton, Sun Solaris 8,
+ TIS/NAI Gauntlet, Toshiba, Trilogy AdmitOne 2.6, Trustworks
+ TrustedClient v3.2, USAGI linux, VPNet, Yamaha RT series,
+ ZyXEL
+
+Here are (some of) platforms we have tested IPComp/IKE interoperability
+in the past, in no particular order.
+ Compaq, IRE, SSH, NetLock, FreeS/WAN, F-Secure VPN+ 5.40
+
+VPNC (vpnc.org) provides IPsec conformance tests, using KAME and OpenBSD
+IPsec/IKE implementations. Their test results are available at
+http://www.vpnc.org/conformance.html, and they may give you a better idea
+of which implementations interoperate with the KAME IPsec/IKE implementation.
+
+4.8 Operations with IPsec tunnel mode
+
+First of all, an IPsec tunnel is a very hairy thing. It seems to do neat
+things like VPN configuration or secure remote access; however, it comes
+with lots of architectural twists.
+
+RFC2401 defines IPsec tunnel mode, within the context of IPsec. RFC2401
+defines tunnel mode packet encapsulation/decapsulation on its own, and
+does not refer to other tunnelling specifications. Since RFC2401 advocates
+filter-based SPD database matches, it would be natural for us to implement
+IPsec tunnel mode as filters - not as pseudo interfaces.
+
+There are some people who are trying to separate IPsec "tunnel mode" from
+the IPsec itself. They would like to implement IPsec transport mode only,
+and combine it with tunneling pseudo devices. The prime example is found
+in draft-touch-ipsec-vpn-01.txt. However, if you really define pseudo
+interfaces separately from IPsec, IKE daemons would need to negotiate
+transport mode SAs, instead of tunnel mode SAs. Therefore, we cannot
+really mix RFC2401-based interpretation and draft-touch-ipsec-vpn-01.txt
+interpretation.
+
+The KAME stack can be configured in two ways. You may need
+to recompile your kernel to switch the behavior.
+- RFC2401 IPsec tunnel mode approach (4.8.1)
+- draft-touch-ipsec-vpn approach (4.8.2)
+  Works in all kernel configurations, but racoon(8) may not interoperate.
+
+There are pros and cons on these approaches:
+
+RFC2401 IPsec tunnel mode (filter-like) approach
+ PRO: SPD lookup fits nicely with packet filters (if you integrate them)
+ CON: cannot run routing daemons across IPsec tunnels
+ CON: it is very hard to control source address selection on originating
+ cases
+ ???: IPv6 scope zone is kept the same
+draft-touch-ipsec-vpn (transport mode + pseudo-interface) approach
+ PRO: run routing daemons across IPsec tunnels
+ PRO: source address selection can be done normally, by looking at
+ IPsec tunnel pseudo devices
+ CON: on outbound, possibility of infinite loops if routing setup
+ is wrong
+ CON: due to differences in encap/decap logic from RFC2401, it may not
+ interoperate with very picky RFC2401 implementations
+ (those who check TOS bits, for example)
+ CON: cannot negotiate IKE with other IPsec tunnel-mode devices
+ (the other end has to implement the draft-touch-ipsec-vpn approach too)
+ ???: IPv6 scope zone is likely to be different from the real ethernet
+ interface
+
+The recommendation is different depending on the situation you have:
+- use draft-touch-ipsec-vpn if you have the control over the other end.
+ this one is the best in terms of simplicity.
+- if the other end is normal IPsec device with RFC2401 implementation,
+ you need to use RFC2401, otherwise you won't be able to run IKE.
+- use RFC2401 approach if you just want to forward packets back and forth
+ and there's no plan to use IPsec gateway itself as an originating device.
+
+4.8.1 RFC2401 IPsec tunnel mode approach
+
+To configure your device as RFC2401 IPsec tunnel mode endpoint, you will
+use "tunnel" keyword in setkey(8) "spdadd" directives. Let us assume the
+following topology (A and B could be a network, like prefix/length):
+
+    ((((((((((((The internet))))))))))))
+      |                     |
+      |C (global)           |D
+    your device           peer's device
+      |A (private)          |B
+    ==+===== VPN net      ==+===== VPN net
+
+The policy configuration directive is like this. You will need manual
+SAs, or IKE daemon, for actual encryption:
+
+ # setkey -c <<EOF
+ spdadd A B any -P out ipsec esp/tunnel/C-D/use;
+ spdadd B A any -P in ipsec esp/tunnel/D-C/use;
+ EOF
+
+The inbound/outbound traffic is monitored/captured by SPD engine, which works
+just like packet filters.
+
+With this, forwarding case should work flawlessly. However, troubles arise
+when you have one of the following requirements:
+- When you originate traffic from your VPN gateway device to VPN net on the
+ other end (like B), you want your source address to be A (private side)
+ so that the traffic would be protected by the policy.
+ With this approach, however, the source address selection logic follows
+ normal routing table, and C (global side) will be picked for any outgoing
+ traffic, even if the destination is B. The resulting packet will be like
+ this:
+ IP[C -> B] payload
+ and will not match the policy (= sent in clear).
+- When you want to run routing protocols on top of the IPsec tunnel, it is
+ not possible. As there is no pseudo device that identifies the IPsec tunnel,
+ you cannot identify where the routing information came from. As a result,
+ you can't run routing daemons.
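+
+The first problem above (traffic originated by the gateway missing the
+policy) can be reproduced with a small selector-matching sketch in C. The
+structure, macro and function names are hypothetical, and matching is
+reduced to IPv4 source/destination prefixes, which is less than the real
+SPD checks. The concrete addresses used in the note below are made up.
+

```c
#include <stdint.h>

/* Build an IPv4 address in host byte order (illustrative helper). */
#define	IP4(a, b, c, d)							\
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) |		\
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Hypothetical, simplified policy selector: source and destination prefix. */
struct selector {
	uint32_t src, srcmask;
	uint32_t dst, dstmask;
};

/* Return 1 if a packet with these addresses matches the selector, else 0. */
int
selector_match(const struct selector *sp, uint32_t src, uint32_t dst)
{
	return ((src & sp->srcmask) == (sp->src & sp->srcmask) &&
	    (dst & sp->dstmask) == (sp->dst & sp->dstmask));
}
```

+
+With A = 10.1.0.0/16, B = 10.2.0.0/16 and C = 192.0.2.1, a forwarded packet
+from 10.1.0.5 to 10.2.0.9 matches the "spdadd A B" selector, while a packet
+originated by the gateway with source C and destination 10.2.0.9 does not,
+so it would leave unprotected.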
+
+4.8.2 draft-touch-ipsec-vpn approach
+
+With this approach, you will configure gif(4) tunnel interfaces, as well as
+IPsec transport mode SAs.
+
+ # gifconfig gif0 C D
+ # ifconfig gif0 A B
+ # setkey -c <<EOF
+ spdadd C D any -P out ipsec esp/transport//use;
+ spdadd D C any -P in ipsec esp/transport//use;
+ EOF
+
+Since we have a pseudo-interface "gif0", and it affects the routes and
+the source address selection logic, we can have source address A, for
+packets originated by the VPN gateway to B (and the VPN cloud).
+We can also exchange routing information over the tunnel (gif0), as the tunnel
+is represented as a pseudo interface (dynamic routes points to the
+pseudo interface).
+
+There is a big drawback, however: with this, you can use IKE if and only if
+the other end is using the draft-touch-ipsec-vpn approach too. Since racoon(8)
+grabs phase 2 IKE proposals from the kernel SPD database, you will be
+negotiating IPsec transport-mode SAs with the other end, not tunnel-mode SAs.
+Also, since the encapsulation mechanism is different from RFC2401, you may not
+be able to interoperate with a picky RFC2401 implementation - if the other
+end checks certain outer IP header fields (like TOS), you will not be able to
+interoperate.
+
+
+5. ALTQ
+
+The KAME kit includes ALTQ, which supports FreeBSD3, FreeBSD4, FreeBSD5 and
+NetBSD. OpenBSD has ALTQ merged into pf, and its ALTQ code is not
+compatible with other platforms, so KAME's ALTQ is not used for
+OpenBSD. For BSD/OS, ALTQ does not work.
+ALTQ in KAME supports IPv6.
+(actually, ALTQ has been developed in the KAME repository since ALTQ 2.1 -
+Jan 2000)
+
+ALTQ occupies a single character device number. For FreeBSD, it is officially
+allocated. For OpenBSD and NetBSD, we use a number which is not
+currently allocated (it will eventually get an official number).
+The character device is enabled for i386 architecture only. To enable and
+compile ALTQ-ready kernel for other architectures, take the following steps:
+- assume that your architecture is FOOBAA.
+- modify sys/arch/FOOBAA/FOOBAA/conf.c (or somewhere that defines cdevsw),
+ to include a line for ALTQ. look at sys/arch/i386/i386/conf.c for
+ example. The major number must be same as i386 case.
+- copy kernel configuration file (like ALTQ.v6 or GENERIC.v6) from i386,
+ and modify accordingly.
+- build a kernel.
+- before building userland, change netbsd/{lib,usr.sbin,usr.bin}/Makefile
+ (or openbsd/foobaa) so that it will visit altq-related sub directories.
+
+
+6. Mobile IPv6
+
+6.1 KAME node as correspondent node
+
+Default installation recognizes home address option (in destination
+options header). No sub-options are supported. Interaction with
+IPsec, and/or 2292bis API, needs further study.
+
+6.2 KAME node as home agent/mobile node
+
+KAME kit includes Ericsson mobile-ip6 code. The integration is just started
+(in Feb 2000), and we will need some more time to integrate it better.
+
+See kame/mip6config/{QUICKSTART,README_MIP6.txt} for more details.
+
+The Ericsson code implements revision 09 of the mobile-ip6 draft. There
+are other implementations available:
+ NEC: http://www.6bone.nec.co.jp/mipv6/internal-dist/ (-13 draft)
+ SFC: http://neo.sfc.wide.ad.jp/~mip6/ (-13 draft)
+
+7. Coding style
+
+The KAME developers basically do not bother much about coding
+style. However, there is still some agreement on the style, in order
+to make the distributed development smooth.
+
+- follow *BSD KNF where possible. note: there are multiple KNF standards.
+- the tab character should be 8 columns wide (tabstops are at 8, 16, 24, ...
+ column). With vi, use ":set ts=8 sw=8".
+ With GNU Emacs 20 and later, the easiest way is to use the "bsd" style of
+ cc-mode with the variable "c-basic-offset" being 8;
+        (add-hook 'c-mode-common-hook
+                  (function
+                   (lambda ()
+                     (c-set-style "bsd")
+                     (setq c-basic-offset 8) ; XXX for Emacs 20 only
+                     )))
+ The "bsd" style in GNU Emacs 21 sets the variable to 8 by default,
+ so the line marked by "XXX" is not necessary if you only use GNU
+ Emacs 21.
+- each line should be within 80 characters.
+- keep a single open/close bracket in a comment such as in the following
+ line:
+ putchar('('); /* ) */
+ without this, some vi users would have a hard time to match a pair of
+ brackets. Although this type of bracket seems clumsy and is even
+ harmful for some other type of vi users and Emacs users, the
+ agreement in the KAME developers is to allow it.
+- add the following line to the head of every KAME-derived file:
+ /* (dollar)KAME(dollar) */
+ where "(dollar)" is the dollar character ($), and around "$" are tabs.
+ (this is for C. For other language, you should use its own comment
+ line.)
+ Once committed to the CVS repository, this line will contain its
+ version number (see, for example, at the top of this file). This
+ would make it easy to report a bug.
+- when creating a new file with the WIDE copyright, type "make copyright.c" at
+  the top-level, and use copyright.c as a template. KAME RCS tag will be
+  included automatically.
+- when editing a third-party package, keep its own coding style as
+ much as possible, even if the style does not follow the items above.
+- it is recommended to always wrap an expression containing
+ bitwise operators by parentheses, especially when the expression is
+ combined with relational operators, in order to avoid unintentional
+ mismatch of operators. Thus, we should write
+ if ((a & b) == 0) /* (A) */
+ or
+ if (a & (b == 0)) /* (B) */
+ instead of
+ if (a & b == 0) /* (C) */
+ even if the programmer's intention was (C), which is equivalent to
+ (B) according to the grammar of the language C.
+ Thus, we should write a code to test if a bit-flag is set for a
+ given variable as follows:
+ if ((flag & FLAG_A) == 0) /* (D) the FLAG_A is NOT set */
+ if ((flag & FLAG_A) != 0) /* (E) the FLAG_A is set */
+ Some developers in the KAME project rather prefer the following style:
+ if (!(flag & FLAG_A)) /* (F) the FLAG_A is NOT set */
+ if ((flag & FLAG_A)) /* (G) the FLAG_A is set */
+ because it would be more intuitive in terms of the relationship
+ between the negation operator (!) and the semantics of the
+ condition. The KAME developers have discussed the style, and have
+ agreed that all the styles from (D) to (G) are valid. So, when you
+ see styles like (D) and (E) in the KAME code and feel a bit strange,
+ please just keep them. They are intentional.
+- When inserting a separate block just to define some intra-block
+ variables, add the level of indentation as if the block was in a
+ control statement such as if-else, for, or while. For example,
+        foo ()
+        {
+                int a;
+
+                {
+                        int internal_a;
+                        ...
+                }
+        }
+  should be used, instead of
+        foo ()
+        {
+                int a;
+
+            {
+                int internal_a;
+                ...
+            }
+        }
+- Do not use printf() or log() in the packet input path of the kernel code.
+ They can make the system vulnerable to packet flooding attacks (results in
+ /var overflow).
+- (not a style issue)
+ To disable a module that is mistakenly imported (by CVS), just
+ remove the source tree in the repository. Note, however, that the
+ removal might annoy other developers who have already checked the
+ module out, so you should announce the removal as soon as possible.
+ Also, be 100% sure not to remove other modules.
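+
+The item above about parenthesizing bitwise operators can be checked with a
+short C sketch. The function names are made up for illustration.
+bits_clear() is form (A)/(D). bits_clear_form_b() spells out form (B),
+which is how the compiler parses the unparenthesized form (C), because ==
+binds tighter than & in C.
+

```c
/* Form (A)/(D): true when none of the bits in mask are set in flags. */
int
bits_clear(unsigned int flags, unsigned int mask)
{
	return ((flags & mask) == 0);
}

/*
 * Form (B): what "flags & mask == 0" (form (C)) actually means.
 * The comparison is evaluated first, then ANDed with flags.
 */
int
bits_clear_form_b(unsigned int flags, unsigned int mask)
{
	return ((int)(flags & (mask == 0)));
}
```

+
+With flags = 0x4 and mask = 0x1, bits_clear() returns 1 (the bit is not
+set), while bits_clear_form_b() returns 0, since (mask == 0) evaluates to 0
+first.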
+
+When you want to contribute something to the KAME project, and if *you
+do not mind* the agreement, it would be helpful for the project if you
+kept these rules. Note, however, that we never intend to force
+you to adopt our rules. We would rather respect your own style,
+especially when you have a policy about the style.
+
+
+8. Policy on technology with intellectual property right restriction
+
+There are quite a few IETF documents/whatever which have intellectual property
+right (IPR) restrictions. KAME's stance is stated below.
+
+ The goal of KAME is to provide freely redistributable, BSD-licensed,
+ implementation of Internet protocol technologies.
+ For this purpose, we implement protocols that (1) do not need license
+ contract with IPR holder, and (2) are royalty-free.
+ The reason for (1) is, even if KAME contracts with the IPR holder in
+ question, the users of KAME stack (usually implementers of some other
+ codebase) would need to make a license contract with the IPR holder.
+ It would damage the "freely redistributable" status of KAME codebase.
+
+ By doing so KAME is (implicitly) trying to advocate no-license-contract,
+ royalty-free, release of IPRs.
+
+Note however, as documented in README, we do not guarantee that KAME code
+is free of IPR infringement. You MUST check it if you are to integrate
+KAME into your product (or whatever):
+ READ CAREFULLY: Several countries have legal enforcement for
+ export/import/use of cryptographic software. Check it before playing
+ with the kit. We do not intend to be your legalese clearing house
+ (NO WARRANTY). If you intend to include KAME stack into your product,
+ you'll need to check if the licenses on each file fit your situations,
+ and/or possible intellectual property right issues.
+
+ <end of IMPLEMENTATION>
diff --git a/share/doc/IPv6/Makefile b/share/doc/IPv6/Makefile
new file mode 100644
index 000000000000..62e160cbfcf1
--- /dev/null
+++ b/share/doc/IPv6/Makefile
@@ -0,0 +1,7 @@
+# $FreeBSD$
+
+NO_OBJ=
+FILES= IMPLEMENTATION
+FILESDIR= ${SHAREDIR}/doc/IPv6
+
+.include <bsd.prog.mk>
diff --git a/share/doc/Makefile b/share/doc/Makefile
new file mode 100644
index 000000000000..7eabbd97e9f8
--- /dev/null
+++ b/share/doc/Makefile
@@ -0,0 +1,25 @@
+# From: @(#)Makefile 8.1 (Berkeley) 6/5/93
+# $FreeBSD$
+
+.include <bsd.own.mk>
+
+SUBDIR= ${_bind9} IPv6 legal ${_llvm} ${_roffdocs}
+
+.if ${MK_BIND} != "no"
+_bind9= bind9
+.endif
+
+.if ${MK_CLANG} != "no"
+_llvm= llvm
+.endif
+
+# FIXME this is not a real solution ...
+.if ${MK_GROFF} != "no"
+_roffdocs= papers psd smm usd
+.endif
+
+# Default output format for troff documents is ascii.
+# To generate postscript versions of troff documents, use:
+# make PRINTERDEVICE=ps
+
+.include <bsd.subdir.mk>
diff --git a/share/doc/bind9/Makefile b/share/doc/bind9/Makefile
new file mode 100644
index 000000000000..3aca4e5515a0
--- /dev/null
+++ b/share/doc/bind9/Makefile
@@ -0,0 +1,31 @@
+# $FreeBSD$
+
+BIND_DIR= ${.CURDIR}/../../../contrib/bind9
+SRCDIR= ${BIND_DIR}/doc
+
+.PATH: ${BIND_DIR} ${SRCDIR}/arm ${SRCDIR}/misc
+
+NO_OBJ=
+
+FILESGROUPS= TOP ARM MISC
+TOP= CHANGES COPYRIGHT FAQ HISTORY README
+TOPDIR= ${DOCDIR}/bind9
+ARM= Bv9ARM.ch01.html Bv9ARM.ch02.html Bv9ARM.ch03.html \
+ Bv9ARM.ch04.html Bv9ARM.ch05.html Bv9ARM.ch06.html \
+ Bv9ARM.ch07.html Bv9ARM.ch08.html Bv9ARM.ch09.html \
+ Bv9ARM.ch10.html Bv9ARM.html Bv9ARM.pdf \
+ man.arpaname.html man.ddns-confgen.html man.dig.html \
+ man.dnssec-dsfromkey.html man.dnssec-keyfromlabel.html \
+ man.dnssec-keygen.html man.dnssec-revoke.html \
+ man.dnssec-settime.html man.dnssec-signzone.html \
+ man.genrandom.html man.host.html man.isc-hmac-fixup.html \
+ man.named-checkconf.html man.named-checkzone.html \
+ man.named-journalprint.html man.named.html \
+ man.nsec3hash.html man.nsupdate.html \
+ man.rndc-confgen.html man.rndc.conf.html man.rndc.html
+ARMDIR= ${TOPDIR}/arm
+MISC= dnssec format-options.pl ipv6 migration migration-4to9 \
+ options rfc-compliance roadmap sdb sort-options.pl
+MISCDIR= ${TOPDIR}/misc
+
+.include <bsd.prog.mk>
diff --git a/share/doc/legal/Makefile b/share/doc/legal/Makefile
new file mode 100644
index 000000000000..3ae8eca3b4cb
--- /dev/null
+++ b/share/doc/legal/Makefile
@@ -0,0 +1,8 @@
+# $FreeBSD$
+
+SUBDIR= intel_ipw \
+ intel_iwi \
+ intel_iwn \
+ intel_wpi
+
+.include <bsd.subdir.mk>
diff --git a/share/doc/legal/intel_ipw/Makefile b/share/doc/legal/intel_ipw/Makefile
new file mode 100644
index 000000000000..8f4f822fb4e0
--- /dev/null
+++ b/share/doc/legal/intel_ipw/Makefile
@@ -0,0 +1,7 @@
+# $FreeBSD$
+
+NO_OBJ=
+FILES= ${.CURDIR}/../../../../sys/contrib/dev/ipw/LICENSE
+FILESDIR= ${SHAREDIR}/doc/legal/intel_ipw
+
+.include <bsd.prog.mk>
diff --git a/share/doc/legal/intel_iwi/Makefile b/share/doc/legal/intel_iwi/Makefile
new file mode 100644
index 000000000000..85962379a1a6
--- /dev/null
+++ b/share/doc/legal/intel_iwi/Makefile
@@ -0,0 +1,7 @@
+# $FreeBSD$
+
+NO_OBJ=
+FILES= ${.CURDIR}/../../../../sys/contrib/dev/iwi/LICENSE
+FILESDIR= ${SHAREDIR}/doc/legal/intel_iwi
+
+.include <bsd.prog.mk>
diff --git a/share/doc/legal/intel_iwn/Makefile b/share/doc/legal/intel_iwn/Makefile
new file mode 100644
index 000000000000..9a29dfa96208
--- /dev/null
+++ b/share/doc/legal/intel_iwn/Makefile
@@ -0,0 +1,7 @@
+# $FreeBSD$
+
+NO_OBJ=
+FILES= ${.CURDIR}/../../../../sys/contrib/dev/iwn/LICENSE
+FILESDIR= ${SHAREDIR}/doc/legal/intel_iwn
+
+.include <bsd.prog.mk>
diff --git a/share/doc/legal/intel_wpi/Makefile b/share/doc/legal/intel_wpi/Makefile
new file mode 100644
index 000000000000..81014bedc0a7
--- /dev/null
+++ b/share/doc/legal/intel_wpi/Makefile
@@ -0,0 +1,8 @@
+# $FreeBSD$
+
+NO_OBJ=
+FILES= ${.CURDIR}/../../../../sys/contrib/dev/wpi/LICENSE
+FILESDIR= ${SHAREDIR}/doc/legal/intel_wpi
+
+.include <bsd.prog.mk>
+
diff --git a/share/doc/llvm/Makefile b/share/doc/llvm/Makefile
new file mode 100644
index 000000000000..37493bbc2308
--- /dev/null
+++ b/share/doc/llvm/Makefile
@@ -0,0 +1,15 @@
+# $FreeBSD$
+
+SUBDIR= clang
+
+SRCDIR= ${.CURDIR}/../../../contrib/llvm
+
+.PATH: ${SRCDIR} ${SRCDIR}/lib/Support
+
+NO_OBJ=
+
+FILESGROUPS= TOP
+TOP= LICENSE.TXT COPYRIGHT.regex
+TOPDIR= ${DOCDIR}/llvm
+
+.include <bsd.prog.mk>
diff --git a/share/doc/llvm/clang/Makefile b/share/doc/llvm/clang/Makefile
new file mode 100644
index 000000000000..1b26d6a9a60b
--- /dev/null
+++ b/share/doc/llvm/clang/Makefile
@@ -0,0 +1,13 @@
+# $FreeBSD$
+
+SRCDIR= ${.CURDIR}/../../../../contrib/llvm/tools/clang
+
+.PATH: ${SRCDIR}
+
+NO_OBJ=
+
+FILESGROUPS= TOP
+TOP= LICENSE.TXT
+TOPDIR= ${DOCDIR}/llvm/clang
+
+.include <bsd.prog.mk>
diff --git a/share/doc/papers/Makefile b/share/doc/papers/Makefile
new file mode 100644
index 000000000000..866fe20cd925
--- /dev/null
+++ b/share/doc/papers/Makefile
@@ -0,0 +1,19 @@
+# $FreeBSD$
+
+SUBDIR= beyond4.3 \
+ bufbio \
+ contents \
+ devfs \
+ diskperf \
+ fsinterface \
+ hwpmc \
+ jail \
+ kernmalloc \
+ kerntune \
+ malloc \
+ newvm \
+ relengr \
+ sysperf \
+ timecounter
+
+.include <bsd.subdir.mk>
diff --git a/share/doc/papers/beyond4.3/Makefile b/share/doc/papers/beyond4.3/Makefile
new file mode 100644
index 000000000000..7d1fa492d9a5
--- /dev/null
+++ b/share/doc/papers/beyond4.3/Makefile
@@ -0,0 +1,9 @@
+# From: @(#)Makefile 5.2 (Berkeley) 6/8/93
+# $FreeBSD$
+
+VOLUME= papers
+DOC= beyond43
+SRCS= beyond43.ms
+MACROS= -ms
+
+.include <bsd.doc.mk>
diff --git a/share/doc/papers/beyond4.3/beyond43.ms b/share/doc/papers/beyond4.3/beyond43.ms
new file mode 100644
index 000000000000..b682ffc0d836
--- /dev/null
+++ b/share/doc/papers/beyond4.3/beyond43.ms
@@ -0,0 +1,519 @@
+.\" Copyright (c) 1989 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)beyond43.ms 5.1 (Berkeley) 6/5/90
+.\" $FreeBSD$
+.\"
+.\" *troff -ms
+.rm CM
+.sp 2
+.ce 100
+\fB\s+2Current Research by
+The Computer Systems Research Group
+of Berkeley\s-2\fP
+.ds DT "February 10, 1989
+.\" \fBDRAFT of \*(DT\fP
+.sp 2
+.nf
+Marshall Kirk McKusick
+Michael J Karels
+Keith Sklower
+Kevin Fall
+Marc Teitelbaum
+Keith Bostic
+.fi
+.sp 2
+.ce 1
+\fISummary\fP
+.ce 0
+.PP
+The release of 4.3BSD in April of 1986 addressed many of the
+performance problems and unfinished interfaces
+present in 4.2BSD [Leffler84] [McKusick85].
+The Computer Systems Research Group at Berkeley
+has now embarked on a new development phase to
+update other major components of the system, as well as to offer
+new functionality.
+There are five major ongoing projects.
+The first is to develop an OSI network protocol suite and to integrate
+existing ISO applications into Berkeley UNIX.
+The second is to develop and support an interface compliant with the
+P1003.1 POSIX standard recently approved by the IEEE.
+The third is to refine the TCP/IP networking to improve
+its performance and limit congestion on slow and/or lossy networks.
+The fourth is to provide a standard interface to file systems
+so that multiple local and remote file systems can be supported,
+much as multiple networking protocols are supported by 4.3BSD.
+The fifth is to evaluate alternate access control mechanisms and
+audit the existing security features of the system, particularly
+with respect to network services.
+Other areas of work include multi-architecture support,
+a general purpose kernel memory allocator, disk labels, and
+extensions to the 4.2BSD fast filesystem.
+.PP
+We are planning to finish implementation prototypes for each of the
+five main areas of work over the next year, and provide an informal
+test release sometime next year for interested developers.
+After we incorporate feedback and refinements from the testers,
+the prototypes will appear in the next full Berkeley release, which is typically
+made about a year after the test release.
+.br
+.ne 10
+.sp 2
+.NH
+Recently Completed Projects
+.PP
+There have been several changes in the system that were included
+in the recent 4.3BSD Tahoe release.
+.NH 2
+Multi-architecture support
+.PP
+Support has been added for the DEC VAX 8600/8650, VAX 8200/8250,
+MicroVAXII and MicroVAXIII.
+.PP
+The largest change has been the incorporation of support for the first
+non-VAX processor, the CCI Power 6/32 and 6/32SX. (This addition also
+supports the
+Harris HCX-7 and HCX-9, as well as the Sperry 7000/40 and ICL machines.)
+The Power 6 version of 4.3BSD is largely based on the compilers and
+device drivers done for CCI's 4.2BSD UNIX,
+and is otherwise similar to the VAX release of 4.3BSD.
+The entire source tree, including all kernel and user-level sources,
+has been merged using a structure that will easily accommodate the addition
+of other processor families. A MIPS R2000 has been donated to us,
+making the MIPS architecture a likely candidate for inclusion into a future
+BSD release.
+.NH 2
+Kernel Memory Allocator
+.PP
+The 4.3BSD UNIX kernel used 10 different memory allocation mechanisms,
+each designed for the particular needs of the utilizing subsystem.
+These mechanisms have been replaced by a general purpose dynamic
+memory allocator that can be used by all of the kernel subsystems.
+The design of this allocator takes advantage of known memory usage
+patterns in the UNIX kernel and a hybrid strategy that is time-efficient
+for small allocations and space-efficient for large allocations.
+This allocator replaces the multiple memory allocation interfaces
+with a single easy-to-program interface,
+results in more efficient use of global memory by eliminating
+partitioned and specialized memory pools,
+and is quick enough (approximately 15 VAX instructions) that no
+performance loss is observed relative to the current implementations
+[McKusick88].
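A minimal sketch of the hybrid strategy described above, assuming power-of-two size classes for small requests and rounding to whole pages for large ones. The constants MIN_SHIFT and SMALL_LIMIT and the function names are illustrative assumptions, not taken from the kernel source.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed constants, for illustration only. */
#define MIN_SHIFT   4       /* smallest size class: 1 << 4 = 16 bytes */
#define SMALL_LIMIT 2048    /* requests above this are rounded to pages */

/* Return the power-of-two exponent of the size class that holds `size`. */
static int
size_class(size_t size)
{
	int shift = MIN_SHIFT;

	assert(size > 0 && size <= SMALL_LIMIT);
	while (((size_t)1 << shift) < size)
		shift++;
	return (shift);
}

/* Bytes actually reserved for a request of `size` bytes. */
static size_t
reserved_bytes(size_t size, size_t pagesize)
{
	assert(size > 0 && pagesize > 0);
	if (size <= SMALL_LIMIT)	/* fast path: one list per size class */
		return ((size_t)1 << size_class(size));
	/* large path: waste at most one page */
	return ((size + pagesize - 1) / pagesize * pagesize);
}
```

Under these assumptions a small request is served from a per-class free list with a short lookup, while a large request wastes less than one page.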
+.NH 2
+Disk Labels
+.PP
+During the work on the CCI machine,
+it became obvious that disk geometry and filesystem layout information
+must be stored on each disk in a pack label.
+Disk labels were implemented for the CCI disks and for the most common
+types of disk controllers on the VAX.
+A utility was written to create and maintain the disk information,
+and other user-level programs that use such information now obtain
+it from the disk label.
+The use of this facility has allowed improvements in the file system's
+knowledge of irregular disk geometries such as track-to-track skew.
+.NH 2
+Fat Fast File System
+.PP
+The 4.2 fast file system [McKusick84]
+contained several statically sized structures,
+imposing limits on the number of cylinders per cylinder group,
+inodes per cylinder group,
+and number of distinguished rotational positions.
+The new ``fat'' filesystem allows these limits to be set at filesystem
+creation time.
+Old kernels will treat the new filesystems as read-only,
+and new kernels
+will accommodate both formats.
+The filesystem check facility, \fBfsck\fP, has also been modified to check
+either type.
+.br
+.ne 10
+.sp 2
+.NH
+Current UNIX Research at Berkeley
+.PP
+Since the release of 4.3BSD in mid 1986,
+we have begun work on several new major areas of research.
+Our goal is to incorporate leading-edge research ideas into a stable
+and reliable implementation that solves current problems in
+operating systems development.
+.NH 2
+OSI network protocol development
+.PP
+The network architecture of 4.2BSD was designed to accommodate
+multiple network protocol families and address formats,
+and an implementation of the ISO OSI network protocols
+should enter into this framework without much difficulty.
+We plan to
+implement the OSI connectionless internet protocol (CLNP),
+and device drivers for X.25, 802.3, and possibly 802.5 interfaces, and
+to integrate these with an OSI transport class 4 (TP-4) implementation.
+We will also incorporate into the Berkeley Software Distribution an
+updated ISO Development Environment (ISODE)
+featuring International Standard (IS) versions of utilities.
+ISODE implements the session and presentation layers of the OSI protocol suite,
+and will include an implementation of the file transfer protocol (FTAM).
+It is also possible that an X.400 implementation now being done at
+University College, London and the University of Nottingham
+will be available for testing and distribution.
+.LP
+This implementation comprises four areas.
+.IP 1)
+We are updating the University of
+Wisconsin TP-4 to match GOSIP requirements.
+The University of Wisconsin developed a transport class 4
+implementation for the 4.2BSD kernel under contract to Mitre.
+This implementation must be updated to reflect the National Institute
+of Standards and Technology (NIST, formerly NBS) workshop agreements,
+GOSIP, and 4.3BSD requirements.
+We will make this TP-4 operate with an OSI IP,
+as the original implementation was built to run over the DoD IP.
+.IP 2)
+A kernel version of the OSI IP and ES-IS protocols must be produced.
+We will implement the kernel version of these protocols.
+.IP 3)
+The required device drivers need to be integrated into a BSD kernel.
+4.3BSD has existing device drivers for many Ethernet devices; future
+BSD versions may also support X.25 devices as well as token ring
+networks.
+These device drivers must be integrated
+into the kernel OSI protocol implementations.
+.IP 4)
+The existing OSINET interoperability test network is available so
+that the interoperability of the ISODE and BSD kernel protocols
+can be established through tests with several vendors.
+Testing is crucial because an openly available version of GOSIP protocols
+that does not interoperate with DEC, IBM, SUN, ICL, HIS, and other
+major vendors would be embarrassing.
+To allow testing of the integrated pieces the most desirable
+approach is to provide access to OSINET at UCB.
+A second approach is to do the interoperability testing at
+the site of an existing OSINET member, such as the NBS.
+.NH 2
+Compliance with POSIX 1003
+.PP
+Berkeley became involved several months ago in the development
+of the IEEE POSIX P1003.1 system interface standard.
+Since then, we have been participating in the working groups
+of P1003.2 (shell and application utility interface),
+P1003.6 (security), P1003.7 (system administration), and P1003.8
+(networking).
+.PP
+The IEEE published the POSIX P1003.1 standard in late 1988.
+POSIX related changes to the BSD system have included a new terminal
+driver, support for POSIX sessions and job control, expanded signal
+functionality, restructured directory access routines, and new set-user
+and set-group id facilities.
+We currently have a prototype implementation of the
+POSIX driver with extensions to provide binary compatibility with
+applications developed for the old Berkeley terminal driver.
+We also have a prototype implementation of the 4.2BSD-based POSIX
+job control facility.
+.PP
+The P1003.2 draft is currently being voted on by the IEEE
+P1003.2 balloting group.
+Berkeley is particularly interested in the results of this standard,
+as it will profoundly influence the user environment.
+The other groups are in comparatively early phases, with drafts
+coming to ballot sometime in the 90's.
+Berkeley will continue to participate in these groups, and
+move in the near future toward a P1003.1 and P1003.2 compliant
+system.
+We have many of the utilities outlined in the current P1003.2 draft
+already implemented, and have other parties willing to contribute
+additional implementations.
+.NH 2
+Improvements to the TCP/IP Networking Protocols
+.PP
+The Internet and the Berkeley collection of local-area networks
+have both grown at high rates in the last year.
+The Bay Area Regional Research Network (BARRNet),
+connecting several UC campuses, Stanford and NASA-Ames
+has recently become operational, increasing the complexity
+of the network connectivity.
+Both Internet and local routing algorithms are showing the strain
+of continued growth.
+We have made several changes in the local routing algorithm
+to keep accommodating the current topology,
+and are participating in the development of new routing algorithms
+and standard protocols.
+.PP
+Recent work in collaboration with Van Jacobson of the Lawrence Berkeley
+Laboratory has led to the design and implementation of several new algorithms
+for TCP that improve throughput on both local and long-haul networks
+while reducing unnecessary retransmission.
+The improvement is especially striking when connections must traverse
+slow and/or lossy networks.
+The new algorithms include ``slow-start,''
+a technique for opening the TCP flow control window slowly
+and using the returning stream of acknowledgements as a clock
+to drive the connection at the highest speed tolerated by the intervening
+network.
+A modification of this technique allows the sender to dynamically modify
+the send window size to adjust to changing network conditions.
+In addition, the round-trip timer has been modified to estimate the variance
+in round-trip time, thus allowing earlier retransmission of lost packets
+with less spurious retransmission due to increasing network delay.
+Along with a scheme proposed by Phil Karn of Bellcore,
+these changes reduce unnecessary retransmission over difficult paths
+such as Satnet by nearly two orders of magnitude
+while improving throughput dramatically.
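A minimal sketch of a round-trip timer that tracks both the mean and the variation of the round-trip time, as described above. The text gives no constants; the gains of 1/8 and 1/4 and the timeout of srtt + 4 * rttvar are assumptions borrowed from the form later standardized in RFC 6298, and all names are illustrative.

```c
#include <assert.h>

struct rtt_est {
	double	srtt;	/* smoothed round-trip time, in seconds */
	double	rttvar;	/* smoothed mean deviation, in seconds */
	int	seeded;	/* nonzero once the first sample has been seen */
};

/* Fold one round-trip measurement into the estimator. */
static void
rtt_update(struct rtt_est *e, double sample)
{
	double err, abserr;

	assert(sample >= 0.0);
	if (!e->seeded) {
		/* First sample: start with a deliberately wide deviation. */
		e->srtt = sample;
		e->rttvar = sample / 2.0;
		e->seeded = 1;
		return;
	}
	err = sample - e->srtt;
	abserr = err < 0.0 ? -err : err;
	e->rttvar += (abserr - e->rttvar) / 4.0;	/* assumed gain 1/4 */
	e->srtt += err / 8.0;				/* assumed gain 1/8 */
}

/* Retransmission timeout: mean plus four deviations (assumed factor). */
static double
rtt_timeout(const struct rtt_est *e)
{
	return (e->srtt + 4.0 * e->rttvar);
}
```

When the measured times are steady the deviation term shrinks and the timeout approaches the mean, so lost packets are retransmitted sooner. When delay grows the deviation term widens the timeout and avoids spurious retransmission.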
+.PP
+The current TCP implementation is now being readied
+for more widespread distribution via the network and as a
+standard Berkeley distribution unencumbered by any commercial licensing.
+We are continuing to refine the TCP and IP implementations
+using the ARPANET, BARRNet, the NSF network
+and local campus nets as testbeds.
+In addition, we are incorporating applicable algorithms from this work
+into the TP-4 protocol implementation.
+.NH 2
+Toward a Compatible File System Interface
+.PP
+The most critical shortcoming of the 4.3BSD UNIX system was in the
+area of distributed file systems.
+As with networking protocols,
+there is no single distributed file system
+that provides sufficient speed and functionality for all problems.
+It is frequently necessary to support several different remote
+file system protocols, just as it is necessary to run several
+different network protocols.
+.PP
+As network or remote file systems have been implemented for UNIX,
+several stylized interfaces between the file system implementation
+and the rest of the kernel have been developed.
+Among these are Sun Microsystems' Virtual File System interface (VFS)
+using \fBvnodes\fP [Sandberg85] [Kleiman86],
+Digital Equipment's Generic File System (GFS) architecture [Rodriguez86],
+AT&T's File System Switch (FSS) [Rifkin86],
+the LOCUS distributed file system [Walker85],
+and Masscomp's extended file system [Cole85].
+Other remote file systems have been implemented in research or
+university groups for internal use,
+notably the network file system in the Eighth Edition UNIX
+system [Weinberger84] and two different file systems used at Carnegie Mellon
+University [Satyanarayanan85].
+Numerous other remote file access methods have been devised for use
+within individual UNIX processes,
+many of them by modifications to the C I/O library
+similar to those in the Newcastle Connection [Brownbridge82].
+.PP
+Each design attempts to isolate file system-dependent details
+below a generic interface and to provide a framework within which
+new file systems may be incorporated.
+However, each of these interfaces is different from
+and incompatible with the others.
+Each addresses somewhat different design goals,
+having been based on a different version of UNIX,
+having targeted a different set of file systems with varying characteristics,
+and having selected a different set of file system primitive operations.
+.PP
+Our effort in this area is aimed at providing a common framework to
+support these different distributed file systems simultaneously rather than to
+simply implement yet another protocol.
+This requires a detailed study of the existing protocols,
+and discussion with their implementors to determine whether
+they could modify their implementation to fit within our proposed
+framework. We have studied the various file system interfaces to determine
+their generality, completeness, robustness, efficiency, and aesthetics
+and are currently working on a file system interface
+that we believe includes the best features of
+each of the existing implementations.
+This work and the rationale underlying its development
+have been presented to major software vendors as an early step
+toward convergence on a standard compatible file system interface.
+Briefly, the proposal adopts the 4.3BSD calling convention for file
+name lookup but otherwise is closely related to Sun's VFS
+and DEC's GFS [Karels86].
+.NH 2
+System Security
+.PP
+The recent invasion of the DARPA Internet by a quickly reproducing ``worm''
+highlighted the need for a thorough review of the access
+safeguards built into the system.
+Until now, we have taken a passive approach to dealing with
+weaknesses in the system access mechanisms, rather than actively
+searching for possible weaknesses.
+When we are notified of a problem or loophole in a system utility
+by one of our users,
+we have a well defined procedure for fixing the problem and
+expeditiously disseminating the fix to the BSD mailing list.
+This procedure has proven itself to be effective in
+solving known problems as they arise
+(witness its success in handling the recent worm).
+However, we feel that it would be useful to take a more active
+role in identifying problems before they are reported (or exploited).
+We will make a complete audit of the system
+utilities and network servers to find unintended system access mechanisms.
+.PP
+As a part of the work to make the system more resistant to attack
+from local users or via the network, it will be necessary to produce
+additional documentation on the configuration and operation of the system.
+This documentation will cover such topics as file and directory ownership
+and access, network and server configuration,
+and control of privileged operations such as file system backups.
+.PP
+We are investigating the addition of access control lists (ACLs) for
+filesystem objects.
+ACLs provide a much finer granularity of control over file access permissions
+than the current
+discretionary access control mechanism (mode bits).
+Furthermore, they are necessary
+in environments where C2 level security or better, as defined in the DoD
+TCSEC [DoD83], is required.
+The POSIX P1003.6 security group has made notable progress in determining
+how an ACL mechanism should work, and several vendors have implemented
+ACLs for their commercial systems.
+Berkeley will investigate the existing implementations and determine
+how to best integrate ACLs with the existing mechanism.
+.PP
+A major shortcoming of the present system is that authentication
+over the network is based solely on the privileged port mechanism
+between trusting hosts and users.
+Although privileged ports can only be created by processes running as root
+on a UNIX system,
+such processes are easy for a workstation user to obtain;
+they simply reboot their workstation in single user mode.
+Thus, a better authentication mechanism is needed.
+At present, we believe that the MIT Kerberos authentication
+server [Steiner88] provides the best solution to this problem.
+We propose to investigate Kerberos further as well as other
+authentication mechanisms and then to integrate
+the best one into Berkeley UNIX.
+Part of this integration would be the addition of the
+authentication mechanism into utilities such as
+telnet, login, remote shell, etc.
+We will add support for telnet (eventually replacing rlogin),
+the X window system, and the mail system within an authentication
+domain (a Kerberos \fIrealm\fP).
+We hope to replace the existing password authentication on each host
+with the network authentication system.
+.NH
+References
+.sp
+.IP Brownbridge82
+Brownbridge, D.R., L.F. Marshall, B. Randell,
+``The Newcastle Connection, or UNIXes of the World Unite!,''
+\fISoftware\- Practice and Experience\fP, Vol. 12, pp. 1147-1162, 1982.
+.sp
+.IP Cole85
+.br
+Cole, C.T., P.B. Flinn, A.B. Atlas,
+``An Implementation of an Extended File System for UNIX,''
+\fIUsenix Conference Proceedings\fP,
+pp. 131-150, June, 1985.
+.sp
+.IP DoD83
+.br
+Department of Defense,
+``Trusted Computer System Evaluation Criteria,''
+\fICSC-STD-001-83\fP,
+DoD Computer Security Center, August, 1983.
+.sp
+.IP Karels86
+Karels, M., M. McKusick,
+``Towards a Compatible File System Interface,''
+\fIProceedings of the European UNIX Users Group Meeting\fP,
+Manchester, England, pp. 481-496, September 1986.
+.sp
+.IP Kleiman86
+Kleiman, S.,
+``Vnodes: An Architecture for Multiple File System Types in Sun UNIX,''
+\fIUsenix Conference Proceedings\fP,
+pp. 238-247, June, 1986.
+.sp
+.IP Leffler84
+Leffler, S., M.K. McKusick, M. Karels,
+``Measuring and Improving the Performance of 4.2BSD,''
+\fIUsenix Conference Proceedings\fP, pp. 237-252, June, 1984.
+.sp
+.IP McKusick84
+McKusick, M.K., W. Joy, S. Leffler, R. Fabry,
+``A Fast File System for UNIX,''
+\fIACM Transactions on Computer Systems 2\fP, 3.
+pp. 181-197, August 1984.
+.sp
+.IP McKusick85
+McKusick, M.K., M. Karels, S. Leffler,
+``Performance Improvements and Functional Enhancements in 4.3BSD,''
+\fIUsenix Conference Proceedings\fP, pp. 519-531, June, 1985.
+.sp
+.IP McKusick86
+McKusick, M.K., M. Karels,
+``A New Virtual Memory Implementation for Berkeley UNIX,''
+\fIProceedings of the European UNIX Users Group Meeting\fP,
+Manchester, England, pp. 451-460, September 1986.
+.sp
+.IP McKusick88
+McKusick, M.K., M. Karels,
+``Design of a General Purpose Memory Allocator for the 4.3BSD UNIX Kernel,''
+\fIUsenix Conference Proceedings\fP,
+pp. 295-303, June, 1988.
+.sp
+.IP Rifkin86
+Rifkin, A.P., M.P. Forbes, R.L. Hamilton, M. Sabrio, S. Shah, K. Yueh,
+``RFS Architectural Overview,'' \fIUsenix Conference Proceedings\fP,
+pp. 248-259, June, 1986.
+.sp
+.IP Rodriguez86
+Rodriguez, R., M. Koehler, R. Hyde,
+``The Generic File System,''
+\fIUsenix Conference Proceedings\fP,
+pp. 260-269, June, 1986.
+.sp
+.IP Sandberg85
+Sandberg, R., D. Goldberg, S. Kleiman, D. Walsh, B. Lyon,
+``Design and Implementation of the Sun Network File System,''
+\fIUsenix Conference Proceedings\fP,
+pp. 119-130, June, 1985.
+.sp
+.IP Satyanarayanan85
+Satyanarayanan, M., \fIet al.\fP,
+``The ITC Distributed File System: Principles and Design,''
+\fIProc. 10th Symposium on Operating Systems Principles\fP, pp. 35-50,
+ACM, December, 1985.
+.sp
+.IP Steiner88
+Steiner, J., C. Newman, J. Schiller,
+``\fIKerberos:\fP An Authentication Service for Open Network Systems,''
+\fIUsenix Conference Proceedings\fP, pp. 191-202, February, 1988.
+.sp
+.IP Walker85
+Walker, B.J. and S.H. Kiser, ``The LOCUS Distributed File System,''
+\fIThe LOCUS Distributed System Architecture\fP,
+G.J. Popek and B.J. Walker, ed., The MIT Press, Cambridge, MA, 1985.
+.sp
+.IP Weinberger84
+Weinberger, P.J., ``The Version 8 Network File System,''
+\fIUsenix Conference presentation\fP,
+June, 1984.
diff --git a/share/doc/papers/bufbio/Makefile b/share/doc/papers/bufbio/Makefile
new file mode 100644
index 000000000000..9bdd4874fb20
--- /dev/null
+++ b/share/doc/papers/bufbio/Makefile
@@ -0,0 +1,14 @@
+# $FreeBSD$
+
+VOLUME= papers
+DOC= bio
+SRCS= bio.ms-patched
+EXTRA= bufsize.eps
+MACROS= -ms
+USE_PIC=
+CLEANFILES= bio.ms-patched
+
+bio.ms-patched: bio.ms
+ sed "s;bufsize\.eps;${.CURDIR}/&;" ${.ALLSRC} > ${.TARGET}
+
+.include <bsd.doc.mk>
diff --git a/share/doc/papers/bufbio/bio.ms b/share/doc/papers/bufbio/bio.ms
new file mode 100644
index 000000000000..123f8e7699b7
--- /dev/null
+++ b/share/doc/papers/bufbio/bio.ms
@@ -0,0 +1,830 @@
+.\" ----------------------------------------------------------------------------
+.\" "THE BEER-WARE LICENSE" (Revision 42):
+.\" <phk@FreeBSD.ORG> wrote this file. As long as you retain this notice you
+.\" can do whatever you want with this stuff. If we meet some day, and you think
+.\" this stuff is worth it, you can buy me a beer in return. Poul-Henning Kamp
+.\" ----------------------------------------------------------------------------
+.\"
+.\" $FreeBSD$
+.\"
+.if n .ftr C R
+.nr PI 2n
+.TL
+The case for struct bio
+.br
+- or -
+.br
+A road map for a stackable BIO subsystem in FreeBSD
+.AU
+Poul-Henning Kamp <phk@FreeBSD.org>
+.AI
+The FreeBSD Project
+.AB
+Historically, the only translation performed on I/O requests after
+they left the file-system layer was the logical sub disk implementation
+done in the device driver. No universal standard exists for how sub disks are
+configured and implemented; in fact, pretty much every platform
+and operating system has done it its own way. As FreeBSD migrates to
+other platforms it needs to understand these local conventions to be
+able to co-exist with other operating systems on the same disk.
+.PP
+Recently a number of technologies like RAID have expanded the
+concept of "a disk" a fair bit and while these technologies initially
+were implemented in separate hardware they increasingly migrate into
+the operating systems as standard functionality.
+.PP
+Both of these factors indicate the need for a structured approach to
+systematic "geometry manipulation" facilities in FreeBSD.
+.PP
+This paper contains the road-map for a stackable "BIO" system in
+FreeBSD, which will support these facilities.
+.AE
+.NH
+The miseducation of \fCstruct buf\fP.
+.PP
+To fully appreciate the topic, I include a short historical overview
+of struct buf; it is a most enlightening case of not exactly bit-rot
+but, more appropriately, design-rot.
+.PP
+In the beginning, which for this purpose extends until virtual
+memory was introduced into UNIX, all disk I/O was done from or
+to a struct buf. In the 6th edition sources, as printed in Lions'
+Book, struct buf looks like this:
+.DS
+.ft C
+.ps -1
+struct buf
+{
+ int b_flags; /* see defines below */
+ struct buf *b_forw; /* headed by devtab of b_dev */
+ struct buf *b_back; /* ' */
+ struct buf *av_forw; /* position on free list, */
+ struct buf *av_back; /* if not BUSY*/
+ int b_dev; /* major+minor device name */
+ int b_wcount; /* transfer count (usu. words) */
+ char *b_addr; /* low order core address */
+ char *b_xmem; /* high order core address */
+ char *b_blkno; /* block # on device */
+ char b_error; /* returned after I/O */
+ char *b_resid; /* words not transferred after
+ error */
+} buf[NBUF];
+.ps +1
+.ft P
+.DE
+.PP
+At this point in time, struct buf had only two functions:
+To act as a cache
+and to transport I/O operations to device drivers. For the purpose of
+this document, the cache functionality is uninteresting and will be
+ignored.
+.PP
+The I/O operations functionality consists of three parts:
+.IP "" 5n
+\(bu Where in RAM/core the data is located (b_addr, b_xmem, b_wcount).
+.IP
+\(bu Where on disk the data is located (b_dev, b_blkno).
+.IP
+\(bu Request and result information (b_flags, b_error, b_resid).
+.PP
+In addition to this, the av_forw and av_back elements are
+used by the disk device drivers to put requests on a linked list.
+All in all the majority of struct buf is involved with the I/O
+aspect and only a few fields relate exclusively to the cache aspect.
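A minimal sketch of the three-part split described above, with the I/O fields regrouped into a hypothetical structure. The name io_request, its member names, and the use of bytes rather than words for the count are assumptions made for illustration; they do not appear in the 6th edition sources.

```c
#include <assert.h>

/* Hypothetical regrouping of the I/O fields of struct buf. */
struct io_request {
	/* 1. Where in memory the data is located. */
	char	*addr;		/* cf. b_addr and b_xmem */
	long	 count;		/* cf. b_wcount, here assumed to be in bytes */
	/* 2. Where on disk the data is located. */
	int	 dev;		/* cf. b_dev */
	long	 blkno;		/* cf. b_blkno */
	/* 3. Request and result information. */
	int	 flags;		/* cf. b_flags */
	int	 error;		/* cf. b_error */
	long	 resid;		/* cf. b_resid */
};

/* Bytes transferred once the driver has completed the request. */
static long
io_bytes_done(const struct io_request *req)
{
	assert(req->resid >= 0 && req->resid <= req->count);
	return (req->count - req->resid);
}
```

Nothing in this grouping refers to the cache: a driver needs only these fields to perform a transfer and report its outcome.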
+.PP
+If we step forward to the BSD 4.4-Lite-2 release, struct buf has grown
+a bit here or there:
+.DS
+.ft C
+.ps -1
+struct buf {
+ LIST_ENTRY(buf) b_hash; /* Hash chain. */
+ LIST_ENTRY(buf) b_vnbufs; /* Buffer's associated vnode. */
+ TAILQ_ENTRY(buf) b_freelist; /* Free list position if not active. */
+ struct buf *b_actf, **b_actb; /* Device driver queue when active. */
+ struct proc *b_proc; /* Associated proc; NULL if kernel. */
+ volatile long b_flags; /* B_* flags. */
+ int b_error; /* Errno value. */
+ long b_bufsize; /* Allocated buffer size. */
+ long b_bcount; /* Valid bytes in buffer. */
+ long b_resid; /* Remaining I/O. */
+ dev_t b_dev; /* Device associated with buffer. */
+ struct {
+ caddr_t b_addr; /* Memory, superblocks, indirect etc. */
+ } b_un;
+ void *b_saveaddr; /* Original b_addr for physio. */
+ daddr_t b_lblkno; /* Logical block number. */
+ daddr_t b_blkno; /* Underlying physical block number. */
+ /* Function to call upon completion. */
+ void (*b_iodone) __P((struct buf *));
+ struct vnode *b_vp; /* Device vnode. */
+ long b_pfcent; /* Center page when swapping cluster. */
+ /* XXX pfcent should be int; overld. */
+ int b_dirtyoff; /* Offset in buffer of dirty region. */
+ int b_dirtyend; /* Offset of end of dirty region. */
+ struct ucred *b_rcred; /* Read credentials reference. */
+ struct ucred *b_wcred; /* Write credentials reference. */
+ int b_validoff; /* Offset in buffer of valid region. */
+ int b_validend; /* Offset of end of valid region. */
+};
+.ps +1
+.ft P
+.DE
+.PP
+The main piece of action is the addition of vnodes, a VM system and a
+prototype LFS filesystem, all of which needed some handles on struct
+buf. Comparison will show that the I/O aspect of struct buf is in
+essence unchanged: the length field is now in bytes instead of words,
+the linked list the drivers can use has been renamed (b_actf,
+b_actb) and a b_iodone pointer for callback notification has been added,
+but otherwise there is no change to the fields which
+represent the I/O aspect. All the new fields relate to the cache
+aspect, link buffers to the VM system, provide hacks for file-systems
+(b_lblkno) and so on.
+.PP
+By the time we get to FreeBSD 3.0 more stuff has grown on struct buf:
+.DS
+.ft C
+.ps -1
+struct buf {
+ LIST_ENTRY(buf) b_hash; /* Hash chain. */
+ LIST_ENTRY(buf) b_vnbufs; /* Buffer's associated vnode. */
+ TAILQ_ENTRY(buf) b_freelist; /* Free list position if not active. */
+ TAILQ_ENTRY(buf) b_act; /* Device driver queue when active. *new* */
+ struct proc *b_proc; /* Associated proc; NULL if kernel. */
+ long b_flags; /* B_* flags. */
+ unsigned short b_qindex; /* buffer queue index */
+ unsigned char b_usecount; /* buffer use count */
+ int b_error; /* Errno value. */
+ long b_bufsize; /* Allocated buffer size. */
+ long b_bcount; /* Valid bytes in buffer. */
+ long b_resid; /* Remaining I/O. */
+ dev_t b_dev; /* Device associated with buffer. */
+ caddr_t b_data; /* Memory, superblocks, indirect etc. */
+ caddr_t b_kvabase; /* base kva for buffer */
+ int b_kvasize; /* size of kva for buffer */
+ daddr_t b_lblkno; /* Logical block number. */
+ daddr_t b_blkno; /* Underlying physical block number. */
+ off_t b_offset; /* Offset into file */
+ /* Function to call upon completion. */
+ void (*b_iodone) __P((struct buf *));
+ /* For nested b_iodone's. */
+ struct iodone_chain *b_iodone_chain;
+ struct vnode *b_vp; /* Device vnode. */
+ int b_dirtyoff; /* Offset in buffer of dirty region. */
+ int b_dirtyend; /* Offset of end of dirty region. */
+ struct ucred *b_rcred; /* Read credentials reference. */
+ struct ucred *b_wcred; /* Write credentials reference. */
+ int b_validoff; /* Offset in buffer of valid region. */
+ int b_validend; /* Offset of end of valid region. */
+ daddr_t b_pblkno; /* physical block number */
+ void *b_saveaddr; /* Original b_addr for physio. */
+ caddr_t b_savekva; /* saved kva for transfer while bouncing */
+ void *b_driver1; /* for private use by the driver */
+ void *b_driver2; /* for private use by the driver */
+ void *b_spc;
+ union cluster_info {
+ TAILQ_HEAD(cluster_list_head, buf) cluster_head;
+ TAILQ_ENTRY(buf) cluster_entry;
+ } b_cluster;
+ struct vm_page *b_pages[btoc(MAXPHYS)];
+ int b_npages;
+ struct workhead b_dep; /* List of filesystem dependencies. */
+};
+.ps +1
+.ft P
+.DE
+.PP
+Still we find that the I/O aspect of struct buf is in essence unchanged.
+A couple of fields have been added which allow the driver to hang local
+data off the buf while working on it (b_driver1, b_driver2), and a
+"physical block number" (b_pblkno) has been added.
+.PP
+This b_pblkno is relevant: it has been added because the disklabel/slice
+code has been abstracted out of the device drivers. The filesystem
+asks for b_blkno, the slice/label code translates this into b_pblkno,
+which the device driver operates on.
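+.PP
+A minimal sketch of this 1:1 translation, assuming a hypothetical slice
+descriptor that records where the slice starts and how long it is. The
+names slice_desc, mini_buf and slice_translate are illustrative only and
+are not taken from the FreeBSD sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slice descriptor: position of the slice on the raw disk. */
struct slice_desc {
	int64_t start;		/* first sector of the slice on the disk */
	int64_t size;		/* number of sectors in the slice */
};

/* Only the two fields of interest, named after their struct buf cousins. */
struct mini_buf {
	int64_t b_blkno;	/* block number relative to the slice */
	int64_t b_pblkno;	/* block number relative to the raw disk */
};

/*
 * Translate the slice-relative block number into a disk-relative one.
 * Returns 0 on success, -1 if the request does not fit inside the slice.
 */
int
slice_translate(const struct slice_desc *sp, struct mini_buf *bp, int64_t nsect)
{
	assert(sp != NULL && bp != NULL);
	if (bp->b_blkno < 0 || nsect < 0 || bp->b_blkno + nsect > sp->size)
		return (-1);
	bp->b_pblkno = sp->start + bp->b_blkno;
	return (0);
}
```

+The filesystem keeps using b_blkno untouched, while the driver only ever
+looks at b_pblkno, which is why one extra field was enough for this case.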
+.PP
+After this point some minor cleanups have happened, some unused fields
+have been removed etc but the I/O aspect of struct buf is still only
+a fraction of the entire structure: less than a quarter of the
+bytes in a struct buf are used for the I/O aspect and struct buf
+seems to continue to grow and grow.
+.PP
+Since version 6 as documented in Lions' book, three significant pieces
+of code have emerged which need to do non-trivial translations of
+the I/O request before it reaches the device drivers: CCD, slice/label
+and Vinum. They all basically do the same thing: they map I/O requests from
+a logical space to a physical space, and the mappings they perform
+can be 1:1 or 1:N. \**
+.FS
+It is interesting to note that Lions in his comments to the \fCrkaddr\fP
+routine (p. 16-2) writes \fIThe code in this procedure incorporates
+a special feature for files which extend over more than one disk
+drive. This feature is described in the UPM Section "RK(IV)". Its
+usefulness seems to be restricted.\fP This more than hints at the
+presence already then of various hacks to stripe/span multiple devices.
+.FE
+.PP
+The 1:1 mapping of the slice/label code is rather trivial, and the
+addition of the b_pblkno field catered for the majority of the issues
+this resulted in, leaving but one: Reads or writes to the magic "disklabel"
+or equally magic "MBR" sectors on a disk must be caught, examined and in
+some cases modified before being passed on to the device driver. This need
+resulted in the addition of the b_iodone_chain field which adds a limited
+ability to stack I/O operations.
+.PP
+The 1:N mappings of CCD and Vinum are far more interesting. These two
+subsystems look like a device driver, but rather than drive some piece
+of hardware, they allocate new struct buf data structures, populate
+these and pass them on to other device drivers.
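+.PP
+To illustrate what a 1:N mapping involves, here is a sketch of a plain
+round-robin interleave over several component disks. This is not the
+actual CCD or Vinum algorithm; stripe_target and stripe_map are
+hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Result of mapping one logical block onto the set of component disks. */
struct stripe_target {
	unsigned disk;		/* index of the component disk */
	int64_t	 pblkno;	/* block number on that component */
};

/*
 * Map a logical block to (disk, block) for a simple interleave in which
 * chunks of "interleave" blocks rotate over "ndisks" components.
 */
struct stripe_target
stripe_map(int64_t lblkno, unsigned ndisks, int64_t interleave)
{
	struct stripe_target t;
	int64_t chunk;

	assert(lblkno >= 0 && ndisks > 0 && interleave > 0);
	chunk = lblkno / interleave;		/* which chunk the block is in */
	t.disk = (unsigned)(chunk % ndisks);	/* chunks rotate over the disks */
	t.pblkno = (chunk / ndisks) * interleave + lblkno % interleave;
	return (t);
}
```

+A request which crosses a chunk boundary lands on more than one disk, so
+the subsystem has to issue one new request per component, which is where
+the freshly allocated struct buf instances come from.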
+.PP
+Apart from it being inefficient to lug about a 348-byte data structure
+when 80 bytes would have done, it also leads to significant code rot
+when programmers don't know what to do about the remaining fields or
+even worse: "borrow" a field or two for their own uses.
+.PP
+.ID
+.if t .PSPIC bufsize.eps
+.if n [graph not available in this format]
+.DE
+.I
+Conclusions:
+.IP "" 5n
+\(bu Struct buf is a victim of chronic bloat.
+.IP
+\(bu The I/O aspect of
+struct buf is practically constant and only about \(14 of the total bytes.
+.IP
+\(bu Struct buf currently has several users, vinum, ccd and to
+limited extent diskslice/label, which
+need only the I/O aspect, not the vnode, caching or VM linkage.
+.IP
+.I
+The I/O aspect of struct buf should be put in a separate \fCstruct bio\fP.
+.R
+.NH 1
+Implications for future struct buf improvements
+.PP
+Concerns have been raised about the implications this separation
+will have for future work on struct buf. I will try to address
+these concerns here.
+.PP
+As the existence and popularity of vinum and ccd prove, there is
+a legitimate and valid requirement to be able to do I/O operations
+which are not initiated by a vnode or filesystem operation.
+In other words, an I/O request is a fully valid entity in its own
+right and should be treated like that.
+.PP
+Without doubt, the I/O request has to be tuned to fit the needs
+of struct buf users in the best possible way, and consequently
+any future changes in struct buf are likely to affect the I/O request
+semantics.
+.PP
+One particular change which has been proposed is to drop the present
+requirement that a struct buf be mapped contiguously into kernel
+address space. The argument goes that since many modern drivers use
+physical address DMA to transfer the data, maintaining such a mapping
+is needless overhead.
+.PP
+Of course some drivers will still need to be able to access the
+buffer in kernel address space and some kind of compatibility
+must be provided there.
+.PP
+The question is whether such a change is made impossible by the
+separation of the I/O aspect into its own data structure.
+.PP
+The answer to this is ``no''.
+Anything that could be added to or done with
+the I/O aspect of struct buf can also be added to or done
+with the I/O aspect if it lives in a new "struct bio".
+.NH 1
+Implementing a \fCstruct bio\fP
+.PP
+The first decision to be made was who got to use the name "struct buf",
+and considering the fact that it is the I/O aspect which gets separated
+out and that it only covers about \(14 of the bytes in struct buf,
+obviously the new structure for the I/O aspect gets a new name.
+Examining the naming in the kernel, the "bio" prefix seemed a given,
+for instance, the function to signal completion of an I/O request is
+already named "biodone()".
+.PP
+Making the transition smooth is obviously also a priority and after
+some prototyping \**
+.FS
+The software development technique previously known as "Trial & Error".
+.FE
+it was found that a totally transparent transition could be made by
+embedding a copy of the new "struct bio" as the first element of "struct buf"
+and by using cpp(1) macros to alias the fields to the legacy struct buf
+names.
+.NH 2
+The b_flags problem.
+.PP
+Struct bio was defined by examining all code existing in the driver tree
+and finding all the struct buf fields which were legitimately used (as
+opposed to "hi-jacked" fields).
+One field was found to have "dual-use": the b_flags field.
+This required special attention.
+Examination showed that b_flags was used for three things:
+.IP "" 5n
+\(bu Communication of the I/O command (READ, WRITE, FORMAT, DELETE)
+.IP
+\(bu Communication of ordering and error status
+.IP
+\(bu General status for non I/O aspect consumers of struct buf.
+.PP
+For historic reasons B_WRITE was defined to be zero, which led to
+confusion and bugs. This pushed the decision to have a separate
+"b_iocmd" field in struct buf and struct bio for communicating
+only the action to be performed.
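+.PP
+The kind of bug a zero-valued flag invites can be shown in a few lines.
+In this sketch only the fact that B_WRITE is zero comes from the text;
+the B_READ bit value and the io_cmd enumeration are assumptions made for
+the example and do not reproduce the kernel's definitions.

```c
/* Historic style: direction is a flag bit, and "write" is no bit at all. */
#define B_WRITE	0x00000000	/* zero, as described in the text */
#define B_READ	0x00000001	/* illustrative bit value */

/* Hypothetical command field: every command has a distinct, non-zero value. */
enum io_cmd { CMD_READ = 1, CMD_WRITE, CMD_DELETE, CMD_FORMAT };

/* The tempting test: never true, because B_WRITE has no bits to match. */
int
is_write_naive(long flags)
{
	return ((flags & B_WRITE) != 0);
}

/* What the flag encoding really requires: test for the absence of B_READ. */
int
is_write_by_flag(long flags)
{
	return ((flags & B_READ) == 0);
}

/* With a separate command field the comparison is direct. */
int
is_write_by_cmd(enum io_cmd cmd)
{
	return (cmd == CMD_WRITE);
}
```

+Note that the flag encoding also leaves no room for a third or fourth
+command, while a command field extends naturally.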
+.PP
+The ordering and error status bits were put in a new flag field "b_ioflags".
+This has left sufficiently many now unused bits in b_flags that the b_xflags element
+can now be merged back into b_flags.
+.NH 2
+Definition of struct bio
+.PP
+With the cleanup of b_flags in place, the definition of struct bio looks like this:
+.DS
+.ft C
+.ps -1
+struct bio {
+ u_int bio_cmd; /* I/O operation. */
+ dev_t bio_dev; /* Device to do I/O on. */
+ daddr_t bio_blkno; /* Underlying physical block number. */
+ off_t bio_offset; /* Offset into file. */
+ long bio_bcount; /* Valid bytes in buffer. */
+ caddr_t bio_data; /* Memory, superblocks, indirect etc. */
+ u_int bio_flags; /* BIO_ flags. */
+ struct buf *_bio_buf; /* Parent buffer. */
+ int bio_error; /* Errno for BIO_ERROR. */
+ long bio_resid; /* Remaining I/O in bytes. */
+ void (*bio_done) __P((struct buf *));
+ void *bio_driver1; /* Private use by the callee. */
+ void *bio_driver2; /* Private use by the callee. */
+ void *bio_caller1; /* Private use by the caller. */
+ void *bio_caller2; /* Private use by the caller. */
+ TAILQ_ENTRY(bio) bio_queue; /* Disksort queue. */
+ daddr_t bio_pblkno; /* physical block number */
+ struct iodone_chain *bio_done_chain;
+};
+.ps +1
+.ft P
+.DE
+.NH 2
+Definition of struct buf
+.PP
+After adding a struct bio to struct buf and aliasing the fields into it,
+struct buf looks like this:
+.DS
+.ft C
+.ps -1
+struct buf {
+ /* XXX: b_io must be the first element of struct buf for now /phk */
+ struct bio b_io; /* "Builtin" I/O request. */
+#define b_bcount b_io.bio_bcount
+#define b_blkno b_io.bio_blkno
+#define b_caller1 b_io.bio_caller1
+#define b_caller2 b_io.bio_caller2
+#define b_data b_io.bio_data
+#define b_dev b_io.bio_dev
+#define b_driver1 b_io.bio_driver1
+#define b_driver2 b_io.bio_driver2
+#define b_error b_io.bio_error
+#define b_iocmd b_io.bio_cmd
+#define b_iodone b_io.bio_done
+#define b_iodone_chain b_io.bio_done_chain
+#define b_ioflags b_io.bio_flags
+#define b_offset b_io.bio_offset
+#define b_pblkno b_io.bio_pblkno
+#define b_resid b_io.bio_resid
+ LIST_ENTRY(buf) b_hash; /* Hash chain. */
+ TAILQ_ENTRY(buf) b_vnbufs; /* Buffer's associated vnode. */
+ TAILQ_ENTRY(buf) b_freelist; /* Free list position if not active. */
+ TAILQ_ENTRY(buf) b_act; /* Device driver queue when active. *new* */
+ long b_flags; /* B_* flags. */
+ unsigned short b_qindex; /* buffer queue index */
+ unsigned char b_xflags; /* extra flags */
+[...]
+.ps +1
+.ft P
+.DE
+.PP
+Putting the struct bio as the first element in struct buf during a transition
+period allows a pointer to either to be cast to a pointer of the other,
+which means that certain pieces of code can be left un-converted with the
+use of a couple of casts while the remaining pieces of code are tested.
+The ccd and vinum modules have been left un-converted like this for now.
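+.PP
+The cast works because C guarantees that a pointer to a structure,
+suitably converted, points to its first member. A cut-down sketch, with
+hypothetical bio_sketch and buf_sketch types standing in for the real
+structures and an illustrative error value:

```c
#include <stddef.h>

#define SKETCH_EIO	5	/* illustrative error value */

/* Cut-down stand-in for struct bio. */
struct bio_sketch {
	long	bio_bcount;
	int	bio_error;
};

/* Cut-down stand-in for struct buf with the aliasing macros. */
struct buf_sketch {
	struct bio_sketch b_io;		/* must stay the first member */
#define b_bcount	b_io.bio_bcount
#define b_error		b_io.bio_error
	long	b_flags;
};

/* Converted code: sees only the I/O request. */
void
driver_fail(struct bio_sketch *bip, int error)
{
	bip->bio_error = error;
	bip->bio_bcount = 0;
}

/* Un-converted code: still holds a buf and bridges the gap with a cast. */
void
legacy_strategy(struct buf_sketch *bp)
{
	driver_fail((struct bio_sketch *)bp, SKETCH_EIO);
}
```

+After legacy_strategy() returns, the caller reads the result through the
+old names, bp->b_error and bp->b_bcount, without knowing that the
+storage now lives inside the embedded structure.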
+.PP
+This is basically where FreeBSD-current stands today.
+.PP
+The next step is to substitute struct bio for struct buf in all the code
+which only cares about the I/O aspect: device drivers, diskslice/label.
+The patch to do this is up for review \**
+.FS
+And can be found at http://phk.freebsd.dk/misc
+.FE
+and consists mainly of systematic substitutions like these:
+.DS
+.ft C
+s/struct buf/struct bio/
+s/b_flags/bio_flags/
+s/b_bcount/bio_bcount/
+&c &c
+.ft P
+.DE
+.NH 2
+Future work
+.PP
+It can be successfully argued that the cpp(1) macros used for aliasing
+above are ugly and should be expanded in place. It would certainly
+be trivial to do so, but not by definition worthwhile.
+.PP
+Retaining the aliasing for the b_* and bio_* name-spaces this way
+leaves us with considerable flexibility in modifying the future
+interaction between the two. The DEV_STRATEGY() macro is the single
+point where a struct buf is turned into a struct bio and launched
+into the drivers to fulfil the I/O request and this provides us
+with a single isolated location for performing non-trivial translations.
+.PP
+As an example of this flexibility: It has been proposed to essentially
+drop the b_blkno field and use the b_offset field to communicate the
+on-disk location of the data. b_blkno is a 32 bit offset in units of DEV_BSIZE
+(512) byte sectors, which allows us to address two terabytes worth
+of data. Using b_offset as a 64 bit byte-address would not only allow
+us to address 8 million times larger disks, it would also make it
+possible to accommodate disks which use non-power-of-two sector-size,
+Audio CD-ROMs for instance.
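+.PP
+The figures quoted above can be checked with a few lines of arithmetic.
+The helper names below are made up for this example; only the 32 bit
+block number, the 512 byte sector and the 64 bit byte address come from
+the text.

```c
#include <stdint.h>

#define SKETCH_SECTOR_SHIFT	9	/* 512 byte sectors: 2^9 */
#define SKETCH_BLKNO_BITS	32	/* width of the block number */

/* Bytes reachable through a 32 bit sector number: 2^32 * 2^9 = 2^41. */
uint64_t
blkno_limit_bytes(void)
{
	return ((uint64_t)1 << (SKETCH_BLKNO_BITS + SKETCH_SECTOR_SHIFT));
}

/*
 * Growth factor when moving to a 64 bit byte address.  2^64 itself does
 * not fit in a uint64_t, so the exponents are subtracted instead:
 * 2^64 / 2^41 = 2^23.
 */
uint64_t
offset_gain_factor(void)
{
	return ((uint64_t)1 << (64 - SKETCH_BLKNO_BITS - SKETCH_SECTOR_SHIFT));
}

/* Byte offset of a sector for any sector size, power of two or not. */
uint64_t
sector_to_offset(uint64_t sector, uint32_t sectorsize)
{
	return (sector * sectorsize);
}
```

+2^41 bytes is two terabytes and 2^23 is 8388608, which is the "8 million
+times" of the text. With a byte offset, sector 10 of a disk with 2352
+byte sectors is simply offset 23520.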
+.PP
+The above mentioned flexibility makes an implementation almost trivial:
+.IP "" 5n
+\(bu Add code to DEV_STRATEGY() to populate b_offset from b_blkno in the
+cases where it is not valid. Today it is only valid for a struct buf
+marked B_PHYS.
+.IP
+\(bu Change diskslice/label, ccd, vinum and device drivers to use b_offset
+instead of b_blkno.
+.IP
+\(bu Remove the bio_blkno field from struct bio, add it to struct buf as
+b_blkno and remove the cpp(1) macro which aliased it into struct bio.
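+.PP
+The first of these steps might look roughly like the sketch below. The
+structure, the flag value and the function name are assumptions made for
+illustration; the real DEV_STRATEGY() is a macro operating on struct buf.

```c
#include <stdint.h>

#define SKETCH_BSIZE	512	/* sector size implied by b_blkno */
#define SKETCH_B_PHYS	0x0001	/* illustrative bit, not the kernel's value */

/* Cut-down stand-in for the relevant struct buf fields. */
struct sketch_buf {
	long	b_flags;
	int32_t	b_blkno;	/* 32 bit sector number */
	int64_t	b_offset;	/* 64 bit byte offset */
};

/* Make b_offset valid before the request is handed to the drivers. */
void
sketch_fill_offset(struct sketch_buf *bp)
{
	if (bp->b_flags & SKETCH_B_PHYS)
		return;		/* b_offset is already valid */
	/* Widen before multiplying, or large block numbers overflow. */
	bp->b_offset = (int64_t)bp->b_blkno * SKETCH_BSIZE;
}
```

+Because every request passes through this one point, the consumers of
+struct buf need not change while the drivers are converted one by one.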
+.PP
+Another possible transition could be to not have a "built-in" struct bio
+in struct buf. If for some reason struct bio grows fields of no relevance
+to struct buf it might be cheaper to remove struct bio from struct buf,
+un-alias the fields and have DEV_STRATEGY() allocate a struct bio and populate
+the relevant fields from struct buf.
+This would also be entirely transparent to both users of struct buf and
+struct bio as long as we retain the aliasing mechanism and DEV_STRATEGY().
+.bp
+.NH 1
+Towards a stackable BIO subsystem.
+.PP
+Considering that we now have three distinct pieces of code living
+in the nowhere between DEV_STRATEGY() and the device drivers:
+diskslice/label, ccd and vinum, it is not unreasonable to start
+to look for a more structured and powerful API for these pieces
+of code.
+.PP
+In traditional UNIX semantics a "disk" is a one-dimensional array of
+512 byte sectors which can be read or written. Support for sectors
+of a multiple of 512 bytes was implemented with a sort of
+"don't ask-don't tell" policy where the system administrator would
+specify a larger minimum sector-size to the filesystem, and things
+would "just work", but no formal communication about the size of the
+smallest possible transfer was exchanged between the disk driver and
+the filesystem.
+.PP
+A truly generalised concept of a disk needs to be more flexible and more
+expressive. For instance, a user of a disk will want to know:
+.IP "" 5n
+\(bu What is the sector size. Sector-size these days may not be a power
+of two, for instance Audio CDs have 2352 byte "sectors".
+.IP
+\(bu How many sectors are there.
+.IP
+\(bu Is writing of sectors supported.
+.IP
+\(bu Is freeing of sectors supported. This is important for flash based
+devices where a wear-distribution software or hardware function uses
+the information about which sectors are actually in use to keep the
+usage of the slow erase function to a minimum.
+.IP
+\(bu Is opening this device in a specific mode, (read-only or read-write)
+allowed. The VM system and the file-systems generally assume that nobody
+writes to "their storage" under their feet, and therefore opens which
+would make that possible should be rejected.
+.IP
+\(bu What is the "native" geometry of this device (Sectors/Heads/Cylinders).
+This is useful for staying compatible with badly designed on-disk formats
+from other operating systems.
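+.PP
+The static part of this list could be collected in a small property
+structure, sketched below. The structure and its field names are
+hypothetical, and the open-mode question is left out because it depends
+on who else has the device open rather than on the device itself.

```c
#include <stdint.h>

/* Hypothetical description of a generalised disk. */
struct disk_props {
	uint32_t sectorsize;	/* bytes; need not be a power of two */
	uint64_t nsectors;	/* number of sectors */
	int	 can_write;	/* writing of sectors supported */
	int	 can_delete;	/* freeing of sectors supported */
	uint32_t heads;		/* "native" geometry, for foreign formats */
	uint32_t sectors;	/* sectors per track */
	uint32_t cylinders;
};

/* Total size of the disk in bytes. */
uint64_t
disk_capacity(const struct disk_props *dp)
{
	return ((uint64_t)dp->sectorsize * dp->nsectors);
}

/* A transfer is acceptable if it covers whole sectors inside the disk. */
int
disk_request_ok(const struct disk_props *dp, uint64_t offset, uint64_t length)
{
	if (dp->sectorsize == 0)
		return (0);
	if (offset % dp->sectorsize != 0 || length % dp->sectorsize != 0)
		return (0);
	return (offset <= disk_capacity(dp) &&
	    length <= disk_capacity(dp) - offset);
}
```

+With such a structure a filesystem can ask for the smallest possible
+transfer instead of being told by the administrator.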
+.PP
+Obviously, all of these properties are dynamic in the sense that in
+these days disks are removable devices, and they may therefore change
+at any time. While some devices like CD-ROMs can lock the media in
+place with a special command, this cannot be done for all devices,
+in particular it cannot be done with normal floppy disk drives.
+.PP
+If we adopt such a model for disks, retain the existing
+"strategy/biodone" model of I/O scheduling and decide to use a modular
+or stackable approach to geometry translations, we find that nearly
+endless flexibility emerges:
+mirroring, RAID, striping, interleaving, disk-labels and sub-disks, all of
+these techniques would get a common framework to operate in.
+.PP
+In practice of course, such a scheme must not complicate the use of or
+installation of FreeBSD. The code will have to act and react exactly
+like the current code but fortunately the current behaviour is not at
+all hard to emulate so implementation-wise this is a non-issue.
+.PP
+But let's look at some drawings to see what this means in practice.
+.PP
+Today the plumbing might look like this on a machine:
+.DS
+.PS
+ Ad0: box "disk (ad0)"
+ arrow up from Ad0.n
+ SL0: box "slice/label"
+ Ad1: box "disk (ad1)" with .w at Ad0.e + (.2,0)
+ arrow up from Ad1.n
+ SL1: box "slice/label"
+ Ad2: box "disk (ad2)" with .w at Ad1.e + (.2,0)
+ arrow up from Ad2.n
+ SL2: box "slice/label"
+ Ad3: box "disk (ad3)" with .w at Ad2.e + (.2,0)
+ arrow up from Ad3.n
+ SL3: box "slice/label"
+ DML: box dashed width 4i height .9i with .sw at SL0.sw + (-.2,-.2)
+ "Disk-mini-layer" with .n at DML.s + (0, .1)
+
+ V: box "vinum" at 1/2 <SL1.n, SL2.n> + (0,1.2)
+
+ A0A: arrow up from 1/4 <SL0.nw, SL0.ne>
+ A0B: arrow up from 2/4 <SL0.nw, SL0.ne>
+ A0E: arrow up from 3/4 <SL0.nw, SL0.ne>
+ A1C: arrow up from 2/4 <SL1.nw, SL1.ne>
+ arrow to 1/3 <V.sw, V.se>
+ A2C: arrow up from 2/4 <SL2.nw, SL2.ne>
+ arrow to 2/3 <V.sw, V.se>
+ A3A: arrow up from 1/4 <SL3.nw, SL3.ne>
+ A3E: arrow up from 2/4 <SL3.nw, SL3.ne>
+ A3F: arrow up from 3/4 <SL3.nw, SL3.ne>
+
+ "ad0s1a" with .s at A0A.n + (0, .1)
+ "ad0s1b" with .s at A0B.n + (0, .3)
+ "ad0s1e" with .s at A0E.n + (0, .5)
+ "ad1s1c" with .s at A1C.n + (0, .1)
+ "ad2s1c" with .s at A2C.n + (0, .1)
+ "ad3s4a" with .s at A3A.n + (0, .1)
+ "ad3s4e" with .s at A3E.n + (0, .3)
+ "ad3s4f" with .s at A3F.n + (0, .5)
+
+ V1: arrow up from 1/4 <V.nw, V.ne>
+ V2: arrow up from 2/4 <V.nw, V.ne>
+ V3: arrow up from 3/4 <V.nw, V.ne>
+ "V1" with .s at V1.n + (0, .1)
+ "V2" with .s at V2.n + (0, .1)
+ "V3" with .s at V3.n + (0, .1)
+
+.PE
+.DE
+.PP
+And while this drawing looks nice and clean, the code underneath isn't.
+With a stackable BIO implementation, the picture would look like this:
+.DS
+.PS
+ Ad0: box "disk (ad0)"
+ arrow up from Ad0.n
+ M0: box "MBR"
+ arrow up
+ B0: box "BSD"
+
+ A0A: arrow up from 1/4 <B0.nw, B0.ne>
+ A0B: arrow up from 2/4 <B0.nw, B0.ne>
+ A0E: arrow up from 3/4 <B0.nw, B0.ne>
+
+ Ad1: box "disk (ad1)" with .w at Ad0.e + (.2,0)
+ Ad2: box "disk (ad2)" with .w at Ad1.e + (.2,0)
+ Ad3: box "disk (ad3)" with .w at Ad2.e + (.2,0)
+ arrow up from Ad3.n
+ SL3: box "MBR"
+ arrow up
+ B3: box "BSD"
+
+ V: box "vinum" at 1/2 <Ad1.n, Ad2.n> + (0,.8)
+ arrow from Ad1.n to 1/3 <V.sw, V.se>
+ arrow from Ad2.n to 2/3 <V.sw, V.se>
+
+ A3A: arrow from 1/4 <B3.nw, B3.ne>
+ A3E: arrow from 2/4 <B3.nw, B3.ne>
+ A3F: arrow from 3/4 <B3.nw, B3.ne>
+
+ "ad0s1a" with .s at A0A.n + (0, .1)
+ "ad0s1b" with .s at A0B.n + (0, .3)
+ "ad0s1e" with .s at A0E.n + (0, .5)
+ "ad3s4a" with .s at A3A.n + (0, .1)
+ "ad3s4e" with .s at A3E.n + (0, .3)
+ "ad3s4f" with .s at A3F.n + (0, .5)
+
+ V1: arrow up from 1/4 <V.nw, V.ne>
+ V2: arrow up from 2/4 <V.nw, V.ne>
+ V3: arrow up from 3/4 <V.nw, V.ne>
+ "V1" with .s at V1.n + (0, .1)
+ "V2" with .s at V2.n + (0, .1)
+ "V3" with .s at V3.n + (0, .1)
+
+.PE
+.DE
+.PP
+The first thing we notice is that the disk mini-layer is gone; instead
+separate modules for the Microsoft style MBR and the BSD style disklabel
+are now stacked over the disk. We can also see that Vinum no longer
+needs to go through the BSD/MBR layers if it wants access to the entire
+physical disk; it can be stacked right over the disk.
+.PP
+Now, imagine that a ZIP drive is connected to the machine, and the
+user loads a ZIP disk in it. First the device driver notices the
+new disk and instantiates a new disk:
+.DS
+.PS
+ box "disk (da0)"
+.PE
+.DE
+.PP
+A number of the geometry modules have registered as "auto-discovering"
+and will be polled sequentially to see if any of them recognise what
+is on this disk. The MBR module finds an MBR in sector 0 and attaches
+an instance of itself to the disk:
+.DS
+.PS
+ D: box "disk (da0)"
+ arrow up from D.n
+ M: box "MBR"
+ M1: arrow up from 1/3 <M.nw, M.ne>
+ M2: arrow up from 2/3 <M.nw, M.ne>
+.PE
+.DE
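+.PP
+The recognition step for the MBR module can be sketched as follows. The
+on-disk layout used here (signature bytes 0x55 0xAA at offset 510, four
+16 byte entries from offset 446, type byte at offset 4 of each entry) is
+the conventional PC layout; the function names are hypothetical and no
+module interface is implied.

```c
#include <stddef.h>
#include <stdint.h>

#define MBR_SIZE	512	/* the MBR occupies one 512 byte sector */
#define MBR_SIG_OFF	510	/* signature bytes 0x55 0xAA */
#define MBR_PART_OFF	446	/* first of four partition entries */
#define MBR_PART_LEN	16	/* size of one entry */
#define MBR_NPART	4

/* Return 1 if the sector carries the MBR signature, 0 otherwise. */
int
mbr_taste(const uint8_t *sec, size_t len)
{
	if (len < MBR_SIZE)
		return (0);
	return (sec[MBR_SIG_OFF] == 0x55 && sec[MBR_SIG_OFF + 1] == 0xAA);
}

/* Count the entries in use, i.e. those with a non-zero type byte. */
int
mbr_count_slices(const uint8_t *sec, size_t len)
{
	int i, n = 0;

	if (!mbr_taste(sec, len))
		return (0);
	for (i = 0; i < MBR_NPART; i++)
		if (sec[MBR_PART_OFF + i * MBR_PART_LEN + 4] != 0)
			n++;
	return (n);
}
```

+A module which does not recognise the contents simply answers no, and
+the polling moves on to the next module.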
+.PP
+It finds two "slices" in the MBR and creates two new "disks", one for
+each of these. The polling of modules is repeated and this time the
+BSD label module recognises a FreeBSD label on one of the slices and
+attaches itself:
+.DS
+.PS
+ D: box "disk (da0)"
+ arrow "O" up from D.n
+ M: box "MBR"
+ M1: line up .3i from 1/3 <M.nw, M.ne>
+ arrow "O" left
+ M2: arrow "O" up from 2/3 <M.nw, M.ne>
+ B: box "BSD"
+ B1: arrow "O" up from 1/4 <B.nw, B.ne>
+ B2: arrow "O" up from 2/4 <B.nw, B.ne>
+ B3: arrow "O" up from 3/4 <B.nw, B.ne>
+
+.PE
+.DE
+.PP
+The BSD module finds three partitions, creates them as disks and the
+polling is repeated for each of these. No modules recognise these
+and the process ends. In theory one could have a module recognise
+the UFS superblock and extract from there the path to mount the disk
+on, but this is probably better implemented in a general "device-daemon"
+in user-land.
+.PP
+On this last drawing I have marked with "O" the "disks" which can be
+accessed from user-land or kernel. The VM and file-systems generally
+prefer to have exclusive write access to the disk sectors they use,
+so we need to enforce this policy. Since we cannot know what transformation
+a particular module implements, we need to ask the modules if the open
+is OK, and they may need to ask their neighbours before they can answer.
+.PP
+We decide to mount a filesystem on one of the BSD partitions at the very top.
+The open request is passed to the BSD module, which finds that none of
+the other open partitions (there are none) overlap this one, so it has no
+objections. It then passes the open to the MBR module, which goes through
+basically the same procedure, finds no objections and passes the request to
+the disk driver, which, since it was not previously open, approves of the
+open.
+.PP
+Next we mount a filesystem on the next BSD partition. The
+BSD module again checks for overlapping open partitions and finds none.
+This time, however, it finds that it has already opened the "downstream"
+in R/W mode, so it does not need to ask for permission for that again
+and the open is OK.
+.PP
+Next we mount a msdos filesystem on the other MBR slice. This is the
+same case, the MBR finds no overlapping open slices and has already
+opened "downstream" so the open is OK.
+.PP
+Suppose we now try to open the other slice for writing, the one which
+already has the BSD module attached. The open is passed to the MBR module, which
+notes that the device is already opened for writing by a module (the BSD
+module), and consequently the open is refused.
+.PP
+While this sounds complicated it actually took less than 200 lines of
+code to implement in a prototype implementation.
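+.PP
+The open negotiation walked through above can be condensed into a short
+sketch. This is not the prototype mentioned in the text; it handles only
+write opens, keeps no reference counts and uses hypothetical structure
+and function names. A disk driver is modelled as a module with no lower
+neighbour and a single child covering the whole disk.

```c
#include <stddef.h>
#include <stdint.h>

#define MAXCHILD	8

/* One "disk" exported by a module, e.g. a slice or a partition. */
struct child {
	int64_t	start;		/* first sector within the module's input */
	int64_t	size;		/* length in sectors */
	int	wopen;		/* non-zero while open for writing */
};

struct module {
	struct module *lower;	/* module we are stacked on, NULL for a disk */
	int	lower_idx;	/* which child of "lower" we consume */
	int	down_wopen;	/* downstream already opened for writing */
	int	nchild;
	struct child child[MAXCHILD];
};

/* Ask module m to open its child idx for writing: 0 if OK, -1 if refused. */
int
open_write(struct module *m, int idx)
{
	struct child *c, *o;
	int i;

	if (idx < 0 || idx >= m->nchild)
		return (-1);
	c = &m->child[idx];
	if (c->wopen)
		return (-1);		/* somebody already writes here */
	for (i = 0; i < m->nchild; i++) {
		o = &m->child[i];
		if (i != idx && o->wopen &&
		    c->start < o->start + o->size &&
		    o->start < c->start + c->size)
			return (-1);	/* overlaps another writer */
	}
	if (m->lower != NULL && !m->down_wopen) {
		if (open_write(m->lower, m->lower_idx) != 0)
			return (-1);	/* a neighbour objected */
		m->down_wopen = 1;
	}
	c->wopen = 1;
	return (0);
}
```

+Run against the ZIP disk example, the two BSD partitions and the msdos
+slice open successfully, while a direct write open of the slice under
+the BSD module is refused because the module already holds it.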
+.PP
+Now, the user ejects the ZIP disk. If the hardware can give a notification
+of intent to eject, a call-up from the driver can try to get devices synchronised
+and closed; this is pretty trivial. If the hardware just disappears like
+an unplugged parallel zip drive, a floppy disk or a PC-card, we have no
+choice but to dismantle the setup. The device driver sends a "gone"
+notification to the MBR module, which replicates this upwards to the
+mounted msdosfs and the BSD module. The msdosfs unmounts forcefully,
+invalidates any blocks
+in the buf/vm system and returns. The BSD module replicates the "gone" to
+the two mounted file-systems, which in turn unmount forcefully, invalidate
+blocks and return, after which the BSD module releases any resources held
+and returns, the MBR module releases any resources held and returns, and all
+traces of the device have been removed.
+.PP
+Now, let us get a bit more complicated. We add another disk and mirror
+two of the MBR slices:
+.DS
+.PS
+ D0: box "disk (da0)"
+
+ arrow "O" up from D0.n
+ M0: box "MBR"
+ M01: line up .3i from 1/3 <M0.nw, M0.ne>
+ arrow "O" left
+ M02: arrow "O" up from 2/3 <M0.nw, M0.ne>
+
+ D1: box "disk (da1)" with .w at D0.e + (.2,0)
+ arrow "O" up from D1.n
+ M1: box "MBR"
+ M11: line up .3i from 1/3 <M1.nw, M1.ne>
+ line "O" left
+ M11a: arrow up .2i
+
+ I: box "Mirror" with .s at 1/2 <M02.n, M11a.n>
+ arrow "O" up
+ BB: box "BSD"
+ BB1: arrow "O" up from 1/4 <BB.nw, BB.ne>
+ BB2: arrow "O" up from 2/4 <BB.nw, BB.ne>
+ BB3: arrow "O" up from 3/4 <BB.nw, BB.ne>
+
+ M12: arrow "O" up from 2/3 <M1.nw, M1.ne>
+ B: box "BSD"
+ B1: arrow "O" up from 1/4 <B.nw, B.ne>
+ B2: arrow "O" up from 2/4 <B.nw, B.ne>
+ B3: arrow "O" up from 3/4 <B.nw, B.ne>
+.PE
+.DE
+.PP
+Now assuming that we lose disk da0, the notification goes up like before
+but the mirror module still has a valid mirror from disk da1, so it
+doesn't propagate the "gone" notification further up and the three
+file-systems mounted are not affected.
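+.PP
+The way a "gone" notification travels upwards, and stops at a mirror
+which still has a good copy, can be sketched with a small graph of
+nodes. The node structure and notify_gone() are hypothetical and stand
+in for whatever event mechanism a real implementation would use.

```c
#include <stddef.h>

#define MAXUP	4

/* One module instance (or mounted filesystem) in the graph. */
struct node {
	int	is_mirror;	/* non-zero for a mirror module */
	int	copies;		/* live lower copies, mirrors only */
	int	gone;		/* set once the node has been torn down */
	int	nup;		/* number of consumers stacked on top */
	struct node *up[MAXUP];
};

/* Deliver a "gone" event to n, coming from one of its providers. */
void
notify_gone(struct node *n)
{
	int i;

	if (n->gone)
		return;
	if (n->is_mirror && --n->copies > 0)
		return;		/* a valid copy remains: absorb the event */
	n->gone = 1;
	for (i = 0; i < n->nup; i++)
		notify_gone(n->up[i]);
}
```

+Only when the last copy disappears does the mirror pass the event on to
+the BSD module and the filesystems above it.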
+.PP
+It is possible to modify the graph while in action, as long as the
+modules know that they will not affect any I/O in progress. This is
+very handy for moving things around. At any of the arrows we can
+insert a mirroring module, since it has a 1:1 mapping from input
+to output. Next we can add another copy to the mirror, give the
+mirror time to sync the two copies, detach the first mirror copy
+and remove the mirror module. We have now in essence moved a partition
+from one disk to another transparently.
+.NH 1
+Getting stackable BIO layers from where we are today.
+.PP
+Most of the infrastructure is in place now to implement stackable
+BIO layers:
+.IP "" 5n
+\(bu The dev_t change gave us a public structure where
+information about devices can be put. This enabled us to get rid
+of all the NFOO limits on the number of instances of a particular
+driver/device, and significantly cleaned up the vnode aliasing for
+device vnodes.
+.IP
+\(bu The disk-mini-layer has
+taken the knowledge about diskslice/labels out of the
+majority of the disk-drivers, saving on average 100 lines of code per
+driver.
+.IP
+\(bu The struct bio/buf divorce is giving us an I/O request of manageable
+size which can be modified without affecting all the filesystem and
+VM system users of struct buf.
+.PP
+The missing bits are:
+.IP "" 5n
+\(bu changes to struct bio to make it more
+stackable. This mostly relates to the handling of the biodone()
+event, something which will be transparent to all current users
+of struct buf/bio.
+.IP
+\(bu code to stitch modules together and to pass events and notifications
+between them.
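+.PP
+One possible shape for the completion handling is sketched here: a
+request cloned by a 1:N module remembers its parent, and the parent
+completes when its last child does. This is purely an assumption about
+how such stacking could work; sbio, sbio_clone() and sbio_done() are
+invented for the example and are not part of struct bio as defined
+above.

```c
#include <stddef.h>

/* Hypothetical stackable request. */
struct sbio {
	struct sbio *parent;	/* request this one was cloned from */
	int	children;	/* clones issued on behalf of this request */
	int	completed;	/* clones which have finished */
	int	error;		/* first non-zero error seen */
	int	done;		/* set when the request has completed */
};

/* Initialise child as a clone of parent. */
void
sbio_clone(struct sbio *parent, struct sbio *child)
{
	child->parent = parent;
	child->children = 0;
	child->completed = 0;
	child->error = 0;
	child->done = 0;
	parent->children++;
}

/* Complete bp; the parent completes once all of its clones have. */
void
sbio_done(struct sbio *bp, int error)
{
	struct sbio *pp = bp->parent;

	if (error != 0 && bp->error == 0)
		bp->error = error;
	bp->done = 1;
	if (pp == NULL)
		return;
	if (bp->error != 0 && pp->error == 0)
		pp->error = bp->error;	/* report the first failure upwards */
	if (++pp->completed == pp->children)
		sbio_done(pp, 0);
}
```

+The top-level consumer sees a single completion with the first error
+encountered, however many layers and clones sat underneath.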
+.NH 1
+An Implementation plan for stackable BIO layers
+.PP
+My plan for implementing stackable BIO layers is to first complete
+the struct bio/buf divorce with the already mentioned patch.
+.PP
+The next step is to re-implement the monolithic disk-mini-layer so
+that it becomes the stackable BIO system. Vinum and CCD and all
+other consumers should be unable to tell the difference between
+the current and the new disk-mini-layer. The new implementation
+will initially use a static stacking to remain compatible with the
+current behaviour. This will be the next logical checkpoint commit.
+.PP
+The next step is to make the stackable layers configurable,
+to provide the means to initialise the stacking and to subsequently
+change it. This will be the next logical checkpoint commit.
+.PP
+At this point new functionality can be added inside the stackable
+BIO system: CCD can be re-implemented as a mirror module and a stripe
+module. Vinum can be integrated either as one "macro-module" or
+as separate functions in separate modules. Also modules for other
+purposes can be added, sub-disk handling for Solaris, MacOS, etc
+etc. These modules can be committed one at a time.
diff --git a/share/doc/papers/bufbio/bufsize.eps b/share/doc/papers/bufbio/bufsize.eps
new file mode 100644
index 000000000000..2396ac62aa40
--- /dev/null
+++ b/share/doc/papers/bufbio/bufsize.eps
@@ -0,0 +1,479 @@
+%!PS-Adobe-2.0 EPSF-2.0
+%%Title: a.ps
+%%Creator: $FreeBSD$
+%%CreationDate: Sat Apr 8 08:32:58 2000
+%%DocumentFonts: (atend)
+%%BoundingBox: 50 50 410 302
+%%Orientation: Portrait
+%%EndComments
+/gnudict 256 dict def
+gnudict begin
+/Color false def
+/Solid false def
+/gnulinewidth 5.000 def
+/userlinewidth gnulinewidth def
+/vshift -46 def
+/dl {10 mul} def
+/hpt_ 31.5 def
+/vpt_ 31.5 def
+/hpt hpt_ def
+/vpt vpt_ def
+/M {moveto} bind def
+/L {lineto} bind def
+/R {rmoveto} bind def
+/V {rlineto} bind def
+/vpt2 vpt 2 mul def
+/hpt2 hpt 2 mul def
+/Lshow { currentpoint stroke M
+ 0 vshift R show } def
+/Rshow { currentpoint stroke M
+ dup stringwidth pop neg vshift R show } def
+/Cshow { currentpoint stroke M
+ dup stringwidth pop -2 div vshift R show } def
+/UP { dup vpt_ mul /vpt exch def hpt_ mul /hpt exch def
+ /hpt2 hpt 2 mul def /vpt2 vpt 2 mul def } def
+/DL { Color {setrgbcolor Solid {pop []} if 0 setdash }
+ {pop pop pop Solid {pop []} if 0 setdash} ifelse } def
+/BL { stroke gnulinewidth 2 mul setlinewidth } def
+/AL { stroke gnulinewidth 2 div setlinewidth } def
+/UL { gnulinewidth mul /userlinewidth exch def } def
+/PL { stroke userlinewidth setlinewidth } def
+/LTb { BL [] 0 0 0 DL } def
+/LTa { AL [1 dl 2 dl] 0 setdash 0 0 0 setrgbcolor } def
+/LT0 { PL [] 1 0 0 DL } def
+/LT1 { PL [4 dl 2 dl] 0 1 0 DL } def
+/LT2 { PL [2 dl 3 dl] 0 0 1 DL } def
+/LT3 { PL [1 dl 1.5 dl] 1 0 1 DL } def
+/LT4 { PL [5 dl 2 dl 1 dl 2 dl] 0 1 1 DL } def
+/LT5 { PL [4 dl 3 dl 1 dl 3 dl] 1 1 0 DL } def
+/LT6 { PL [2 dl 2 dl 2 dl 4 dl] 0 0 0 DL } def
+/LT7 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 1 0.3 0 DL } def
+/LT8 { PL [2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 2 dl 4 dl] 0.5 0.5 0.5 DL } def
+/Pnt { stroke [] 0 setdash
+ gsave 1 setlinecap M 0 0 V stroke grestore } def
+/Dia { stroke [] 0 setdash 2 copy vpt add M
+ hpt neg vpt neg V hpt vpt neg V
+ hpt vpt V hpt neg vpt V closepath stroke
+ Pnt } def
+/Pls { stroke [] 0 setdash vpt sub M 0 vpt2 V
+ currentpoint stroke M
+ hpt neg vpt neg R hpt2 0 V stroke
+ } def
+/Box { stroke [] 0 setdash 2 copy exch hpt sub exch vpt add M
+ 0 vpt2 neg V hpt2 0 V 0 vpt2 V
+ hpt2 neg 0 V closepath stroke
+ Pnt } def
+/Crs { stroke [] 0 setdash exch hpt sub exch vpt add M
+ hpt2 vpt2 neg V currentpoint stroke M
+ hpt2 neg 0 R hpt2 vpt2 V stroke } def
+/TriU { stroke [] 0 setdash 2 copy vpt 1.12 mul add M
+ hpt neg vpt -1.62 mul V
+ hpt 2 mul 0 V
+ hpt neg vpt 1.62 mul V closepath stroke
+ Pnt } def
+/Star { 2 copy Pls Crs } def
+/BoxF { stroke [] 0 setdash exch hpt sub exch vpt add M
+ 0 vpt2 neg V hpt2 0 V 0 vpt2 V
+ hpt2 neg 0 V closepath fill } def
+/TriUF { stroke [] 0 setdash vpt 1.12 mul add M
+ hpt neg vpt -1.62 mul V
+ hpt 2 mul 0 V
+ hpt neg vpt 1.62 mul V closepath fill } def
+/TriD { stroke [] 0 setdash 2 copy vpt 1.12 mul sub M
+ hpt neg vpt 1.62 mul V
+ hpt 2 mul 0 V
+ hpt neg vpt -1.62 mul V closepath stroke
+ Pnt } def
+/TriDF { stroke [] 0 setdash vpt 1.12 mul sub M
+ hpt neg vpt 1.62 mul V
+ hpt 2 mul 0 V
+ hpt neg vpt -1.62 mul V closepath fill} def
+/DiaF { stroke [] 0 setdash vpt add M
+ hpt neg vpt neg V hpt vpt neg V
+ hpt vpt V hpt neg vpt V closepath fill } def
+/Pent { stroke [] 0 setdash 2 copy gsave
+ translate 0 hpt M 4 {72 rotate 0 hpt L} repeat
+ closepath stroke grestore Pnt } def
+/PentF { stroke [] 0 setdash gsave
+ translate 0 hpt M 4 {72 rotate 0 hpt L} repeat
+ closepath fill grestore } def
+/Circle { stroke [] 0 setdash 2 copy
+ hpt 0 360 arc stroke Pnt } def
+/CircleF { stroke [] 0 setdash hpt 0 360 arc fill } def
+/C0 { BL [] 0 setdash 2 copy moveto vpt 90 450 arc } bind def
+/C1 { BL [] 0 setdash 2 copy moveto
+ 2 copy vpt 0 90 arc closepath fill
+ vpt 0 360 arc closepath } bind def
+/C2 { BL [] 0 setdash 2 copy moveto
+ 2 copy vpt 90 180 arc closepath fill
+ vpt 0 360 arc closepath } bind def
+/C3 { BL [] 0 setdash 2 copy moveto
+ 2 copy vpt 0 180 arc closepath fill
+ vpt 0 360 arc closepath } bind def
+/C4 { BL [] 0 setdash 2 copy moveto
+ 2 copy vpt 180 270 arc closepath fill
+ vpt 0 360 arc closepath } bind def
+/C5 { BL [] 0 setdash 2 copy moveto
+ 2 copy vpt 0 90 arc
+ 2 copy moveto
+ 2 copy vpt 180 270 arc closepath fill
+ vpt 0 360 arc } bind def
+/C6 { BL [] 0 setdash 2 copy moveto
+ 2 copy vpt 90 270 arc closepath fill
+ vpt 0 360 arc closepath } bind def
+/C7 { BL [] 0 setdash 2 copy moveto
+ 2 copy vpt 0 270 arc closepath fill
+ vpt 0 360 arc closepath } bind def
+/C8 { BL [] 0 setdash 2 copy moveto
+ 2 copy vpt 270 360 arc closepath fill
+ vpt 0 360 arc closepath } bind def
+/C9 { BL [] 0 setdash 2 copy moveto
+ 2 copy vpt 270 450 arc closepath fill
+ vpt 0 360 arc closepath } bind def
+/C10 { BL [] 0 setdash 2 copy 2 copy moveto vpt 270 360 arc closepath fill
+ 2 copy moveto
+ 2 copy vpt 90 180 arc closepath fill
+ vpt 0 360 arc closepath } bind def
+/C11 { BL [] 0 setdash 2 copy moveto
+ 2 copy vpt 0 180 arc closepath fill
+ 2 copy moveto
+ 2 copy vpt 270 360 arc closepath fill
+ vpt 0 360 arc closepath } bind def
+/C12 { BL [] 0 setdash 2 copy moveto
+ 2 copy vpt 180 360 arc closepath fill
+ vpt 0 360 arc closepath } bind def
+/C13 { BL [] 0 setdash 2 copy moveto
+ 2 copy vpt 0 90 arc closepath fill
+ 2 copy moveto
+ 2 copy vpt 180 360 arc closepath fill
+ vpt 0 360 arc closepath } bind def
+/C14 { BL [] 0 setdash 2 copy moveto
+ 2 copy vpt 90 360 arc closepath fill
+ vpt 0 360 arc } bind def
+/C15 { BL [] 0 setdash 2 copy vpt 0 360 arc closepath fill
+ vpt 0 360 arc closepath } bind def
+/Rec { newpath 4 2 roll moveto 1 index 0 rlineto 0 exch rlineto
+ neg 0 rlineto closepath } bind def
+/Square { dup Rec } bind def
+/Bsquare { vpt sub exch vpt sub exch vpt2 Square } bind def
+/S0 { BL [] 0 setdash 2 copy moveto 0 vpt rlineto BL Bsquare } bind def
+/S1 { BL [] 0 setdash 2 copy vpt Square fill Bsquare } bind def
+/S2 { BL [] 0 setdash 2 copy exch vpt sub exch vpt Square fill Bsquare } bind def
+/S3 { BL [] 0 setdash 2 copy exch vpt sub exch vpt2 vpt Rec fill Bsquare } bind def
+/S4 { BL [] 0 setdash 2 copy exch vpt sub exch vpt sub vpt Square fill Bsquare } bind def
+/S5 { BL [] 0 setdash 2 copy 2 copy vpt Square fill
+ exch vpt sub exch vpt sub vpt Square fill Bsquare } bind def
+/S6 { BL [] 0 setdash 2 copy exch vpt sub exch vpt sub vpt vpt2 Rec fill Bsquare } bind def
+/S7 { BL [] 0 setdash 2 copy exch vpt sub exch vpt sub vpt vpt2 Rec fill
+ 2 copy vpt Square fill
+ Bsquare } bind def
+/S8 { BL [] 0 setdash 2 copy vpt sub vpt Square fill Bsquare } bind def
+/S9 { BL [] 0 setdash 2 copy vpt sub vpt vpt2 Rec fill Bsquare } bind def
+/S10 { BL [] 0 setdash 2 copy vpt sub vpt Square fill 2 copy exch vpt sub exch vpt Square fill
+ Bsquare } bind def
+/S11 { BL [] 0 setdash 2 copy vpt sub vpt Square fill 2 copy exch vpt sub exch vpt2 vpt Rec fill
+ Bsquare } bind def
+/S12 { BL [] 0 setdash 2 copy exch vpt sub exch vpt sub vpt2 vpt Rec fill Bsquare } bind def
+/S13 { BL [] 0 setdash 2 copy exch vpt sub exch vpt sub vpt2 vpt Rec fill
+ 2 copy vpt Square fill Bsquare } bind def
+/S14 { BL [] 0 setdash 2 copy exch vpt sub exch vpt sub vpt2 vpt Rec fill
+ 2 copy exch vpt sub exch vpt Square fill Bsquare } bind def
+/S15 { BL [] 0 setdash 2 copy Bsquare fill Bsquare } bind def
+/D0 { gsave translate 45 rotate 0 0 S0 stroke grestore } bind def
+/D1 { gsave translate 45 rotate 0 0 S1 stroke grestore } bind def
+/D2 { gsave translate 45 rotate 0 0 S2 stroke grestore } bind def
+/D3 { gsave translate 45 rotate 0 0 S3 stroke grestore } bind def
+/D4 { gsave translate 45 rotate 0 0 S4 stroke grestore } bind def
+/D5 { gsave translate 45 rotate 0 0 S5 stroke grestore } bind def
+/D6 { gsave translate 45 rotate 0 0 S6 stroke grestore } bind def
+/D7 { gsave translate 45 rotate 0 0 S7 stroke grestore } bind def
+/D8 { gsave translate 45 rotate 0 0 S8 stroke grestore } bind def
+/D9 { gsave translate 45 rotate 0 0 S9 stroke grestore } bind def
+/D10 { gsave translate 45 rotate 0 0 S10 stroke grestore } bind def
+/D11 { gsave translate 45 rotate 0 0 S11 stroke grestore } bind def
+/D12 { gsave translate 45 rotate 0 0 S12 stroke grestore } bind def
+/D13 { gsave translate 45 rotate 0 0 S13 stroke grestore } bind def
+/D14 { gsave translate 45 rotate 0 0 S14 stroke grestore } bind def
+/D15 { gsave translate 45 rotate 0 0 S15 stroke grestore } bind def
+/DiaE { stroke [] 0 setdash vpt add M
+ hpt neg vpt neg V hpt vpt neg V
+ hpt vpt V hpt neg vpt V closepath stroke } def
+/BoxE { stroke [] 0 setdash exch hpt sub exch vpt add M
+ 0 vpt2 neg V hpt2 0 V 0 vpt2 V
+ hpt2 neg 0 V closepath stroke } def
+/TriUE { stroke [] 0 setdash vpt 1.12 mul add M
+ hpt neg vpt -1.62 mul V
+ hpt 2 mul 0 V
+ hpt neg vpt 1.62 mul V closepath stroke } def
+/TriDE { stroke [] 0 setdash vpt 1.12 mul sub M
+ hpt neg vpt 1.62 mul V
+ hpt 2 mul 0 V
+ hpt neg vpt -1.62 mul V closepath stroke } def
+/PentE { stroke [] 0 setdash gsave
+ translate 0 hpt M 4 {72 rotate 0 hpt L} repeat
+ closepath stroke grestore } def
+/CircE { stroke [] 0 setdash
+ hpt 0 360 arc stroke } def
+/Opaque { gsave closepath 1 setgray fill grestore 0 setgray closepath } def
+/DiaW { stroke [] 0 setdash vpt add M
+ hpt neg vpt neg V hpt vpt neg V
+ hpt vpt V hpt neg vpt V Opaque stroke } def
+/BoxW { stroke [] 0 setdash exch hpt sub exch vpt add M
+ 0 vpt2 neg V hpt2 0 V 0 vpt2 V
+ hpt2 neg 0 V Opaque stroke } def
+/TriUW { stroke [] 0 setdash vpt 1.12 mul add M
+ hpt neg vpt -1.62 mul V
+ hpt 2 mul 0 V
+ hpt neg vpt 1.62 mul V Opaque stroke } def
+/TriDW { stroke [] 0 setdash vpt 1.12 mul sub M
+ hpt neg vpt 1.62 mul V
+ hpt 2 mul 0 V
+ hpt neg vpt -1.62 mul V Opaque stroke } def
+/PentW { stroke [] 0 setdash gsave
+ translate 0 hpt M 4 {72 rotate 0 hpt L} repeat
+ Opaque stroke grestore } def
+/CircW { stroke [] 0 setdash
+ hpt 0 360 arc Opaque stroke } def
+/BoxFill { gsave Rec 1 setgray fill grestore } def
+end
+%%EndProlog
+gnudict begin
+gsave
+50 50 translate
+0.050 0.050 scale
+0 setgray
+newpath
+(Helvetica) findfont 140 scalefont setfont
+1.000 UL
+LTb
+630 420 M
+63 0 V
+6269 0 R
+-63 0 V
+546 420 M
+(0) Rshow
+630 1020 M
+63 0 V
+6269 0 R
+-63 0 V
+-6353 0 R
+(50) Rshow
+630 1620 M
+63 0 V
+6269 0 R
+-63 0 V
+-6353 0 R
+(100) Rshow
+630 2220 M
+63 0 V
+6269 0 R
+-63 0 V
+-6353 0 R
+(150) Rshow
+630 2820 M
+63 0 V
+6269 0 R
+-63 0 V
+-6353 0 R
+(200) Rshow
+630 3420 M
+63 0 V
+6269 0 R
+-63 0 V
+-6353 0 R
+(250) Rshow
+630 4020 M
+63 0 V
+6269 0 R
+-63 0 V
+-6353 0 R
+(300) Rshow
+630 4620 M
+63 0 V
+6269 0 R
+-63 0 V
+-6353 0 R
+(350) Rshow
+630 420 M
+0 63 V
+0 4137 R
+0 -63 V
+630 280 M
+(0) Cshow
+1263 420 M
+0 63 V
+0 4137 R
+0 -63 V
+0 -4277 R
+(10) Cshow
+1896 420 M
+0 63 V
+0 4137 R
+0 -63 V
+0 -4277 R
+(20) Cshow
+2530 420 M
+0 63 V
+0 4137 R
+0 -63 V
+0 -4277 R
+(30) Cshow
+3163 420 M
+0 63 V
+0 4137 R
+0 -63 V
+0 -4277 R
+(40) Cshow
+3796 420 M
+0 63 V
+0 4137 R
+0 -63 V
+0 -4277 R
+(50) Cshow
+4429 420 M
+0 63 V
+0 4137 R
+0 -63 V
+0 -4277 R
+(60) Cshow
+5062 420 M
+0 63 V
+0 4137 R
+0 -63 V
+0 -4277 R
+(70) Cshow
+5696 420 M
+0 63 V
+0 4137 R
+0 -63 V
+0 -4277 R
+(80) Cshow
+6329 420 M
+0 63 V
+0 4137 R
+0 -63 V
+0 -4277 R
+(90) Cshow
+6962 420 M
+0 63 V
+0 4137 R
+0 -63 V
+0 -4277 R
+(100) Cshow
+1.000 UL
+LTb
+630 420 M
+6332 0 V
+0 4200 V
+-6332 0 V
+630 420 L
+140 2520 M
+currentpoint gsave translate 90 rotate 0 0 M
+(Bytes) Cshow
+grestore
+3796 70 M
+(CVS revision of <sys/buf.h>) Cshow
+3796 4830 M
+(Sizeof\(struct buf\)) Cshow
+1.000 UL
+LT0
+693 1764 M
+64 384 V
+63 0 V
+63 0 V
+64 -96 V
+63 0 V
+63 0 V
+64 816 V
+63 0 V
+63 0 V
+64 768 V
+63 48 V
+63 0 V
+63 0 V
+64 0 V
+63 0 V
+63 0 V
+64 0 V
+63 0 V
+63 0 V
+64 0 V
+63 48 V
+63 96 V
+64 0 V
+63 0 V
+63 0 V
+64 0 V
+63 0 V
+63 0 V
+64 -48 V
+63 0 V
+63 -48 V
+64 0 V
+63 0 V
+63 96 V
+64 0 V
+63 0 V
+63 0 V
+63 0 V
+64 0 V
+63 0 V
+63 0 V
+64 0 V
+63 0 V
+63 48 V
+64 0 V
+63 48 V
+63 96 V
+64 -48 V
+63 0 V
+63 0 V
+64 0 V
+63 0 V
+63 0 V
+64 0 V
+63 0 V
+63 0 V
+64 0 V
+63 0 V
+63 0 V
+64 0 V
+63 0 V
+63 0 V
+63 0 V
+64 96 V
+63 -96 V
+63 -48 V
+64 48 V
+63 0 V
+63 384 V
+64 0 V
+63 0 V
+63 0 V
+64 0 V
+63 0 V
+63 0 V
+64 0 V
+63 0 V
+63 0 V
+64 0 V
+63 0 V
+63 0 V
+64 0 V
+63 0 V
+63 0 V
+64 0 V
+63 0 V
+63 0 V
+63 48 V
+64 0 V
+63 0 V
+63 96 V
+64 96 V
+63 0 V
+stroke
+grestore
+end
+showpage
+%%Trailer
+%%DocumentFonts: Helvetica
diff --git a/share/doc/papers/contents/Makefile b/share/doc/papers/contents/Makefile
new file mode 100644
index 000000000000..d15ff9c3b4ea
--- /dev/null
+++ b/share/doc/papers/contents/Makefile
@@ -0,0 +1,8 @@
+# $FreeBSD$
+
+VOLUME= papers
+DOC= contents
+SRCS= contents.ms
+MACROS= -ms
+
+.include <bsd.doc.mk>
diff --git a/share/doc/papers/contents/contents.ms b/share/doc/papers/contents/contents.ms
new file mode 100644
index 000000000000..12b287a919c3
--- /dev/null
+++ b/share/doc/papers/contents/contents.ms
@@ -0,0 +1,218 @@
+.\" Copyright (c) 1996 FreeBSD Inc.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" $FreeBSD$
+.\"
+.OH '''Papers Contents'
+.EH 'Papers Contents'''
+.TL
+UNIX Papers coming with FreeBSD
+.PP
+These papers are of both historic and current interest, but most of them are
+many years old.
+More recent documentation is available from
+.>> <a href="http://www.freebsd.org/docs/">
+http://www.FreeBSD.org/docs/
+.>> </a>
+
+.IP
+.tl '\fBBerkeley Pascal''px\fP'
+.if !r.U .nr .U 0
+.if \n(.U \{\
+.br
+.>> <a href="px.html">px.html</a>
+.\}
+.QP
+Berkeley Pascal
+PX Implementation Notes
+.br
+Version 2.0
+.sp
+William N. Joy, M. Kirk McKusick.
+.sp
+Revised January, 1979.
+
+.sp
+.IP
+.tl '\fBDisk Performance''diskperf\fP'
+.if \n(.U \{\
+.br
+.>> <a href="diskperf.html">diskperf.html</a>
+.\}
+.QP
+Performance Effects of Disk Subsystem Choices
+for VAX\(dg Systems Running 4.2BSD UNIX.
+.sp
+Bob Kridle, Marshall Kirk McKusick.
+.sp
+Revised July 27, 1983.
+
+.sp
+.IP
+.tl '\fBTune the 4.2BSD Kernel''kerntune\fP'
+.if \n(.U \{\
+.br
+.>> <a href="kerntune.html">kerntune.html</a>
+.\}
+.QP
+Using gprof to Tune the 4.2BSD Kernel.
+.sp
+Marshall Kirk McKusick.
+.sp
+Revised May 21, 1984 (?).
+
+.sp
+.IP
+.tl '\fBNew Virtual Memory''newvm\fP'
+.if \n(.U \{\
+.br
+.>> <a href="newvm.html">newvm.html</a>
+.\}
+.QP
+A New Virtual Memory Implementation for Berkeley.
+.sp
+Marshall Kirk McKusick, Michael J. Karels.
+.sp
+Revised 1986.
+
+.sp
+.IP
+.tl '\fBKernel Malloc''kernmalloc\fP'
+.if \n(.U \{\
+.br
+.>> <a href="kernmalloc.html">kernmalloc.html</a>
+.\}
+.QP
+Design of a General Purpose Memory Allocator for the 4.3BSD UNIX Kernel.
+.sp
+Marshall Kirk McKusick, Michael J. Karels.
+.sp
+Reprinted from:
+\fIProceedings of the San Francisco USENIX Conference\fP,
+pp. 295-303, June 1988.
+
+.sp
+.IP
+.tl '\fBRelease Engineering''relengr\fP'
+.if \n(.U \{\
+.br
+.>> <a href="releng.html">releng.html</a>
+.\}
+.QP
+The Release Engineering of 4.3\s-1BSD\s0.
+.sp
+Marshall Kirk McKusick, Michael J. Karels, Keith Bostic.
+.sp
+Revised 1989.
+
+.sp
+.IP
+.tl '\fBBeyond 4.3BSD''beyond4.3\fP'
+.if \n(.U \{\
+.br
+.>> <a href="beyond43.html">beyond43.html</a>
+.\}
+.QP
+Current Research by The Computer Systems Research Group of Berkeley.
+.sp
+Marshall Kirk McKusick, Michael J. Karels, Keith Sklower, Kevin Fall,
+Marc Teitelbaum, Keith Bostic.
+.sp
+Revised February 2, 1989.
+
+.sp
+.IP
+.tl '\fBFilesystem Interface''fsinterface\fP'
+.if \n(.U \{\
+.br
+.>> <a href="fsinterface.html">fsinterface.html</a>
+.\}
+.QP
+Toward a Compatible Filesystem Interface.
+.sp
+Michael J. Karels, Marshall Kirk McKusick.
+.sp
+Conference of the European Users' Group, September 1986.
+Last modified April 16, 1991.
+
+.sp
+.IP
+.tl '\fBSystem Performance''sysperf\fP'
+.if \n(.U \{\
+.br
+.>> <a href="sysperf.html">sysperf.html</a>
+.\}
+.QP
+Measuring and Improving the Performance of Berkeley UNIX.
+.sp
+Marshall Kirk McKusick, Samuel J. Leffler, Michael J. Karels.
+.sp
+Revised April 17, 1991.
+
+.sp
+.IP
+.tl '\fBNot Quite NFS''nqnfs\fP'
+.if \n(.U \{\
+.br
+.>> <a href="nqnfs.html">nqnfs.html</a>
+.\}
+.QP
+Not Quite NFS, Soft Cache Consistency for NFS.
+.sp
+Rick Macklem.
+.sp
+Reprinted with permission from the "Proceedings of the Winter 1994 Usenix
+Conference", January 1994, San Francisco.
+
+.sp
+.IP
+.tl '\fBMalloc(3)''malloc\fP'
+.if \n(.U \{\
+.br
+.>> <a href="malloc.html">malloc.html</a>
+.\}
+.QP
+Malloc(3) in modern Virtual Memory environments.
+.sp
+Poul-Henning Kamp.
+.sp
+Revised April 5, 1996.
+
+.sp
+.IP
+.tl '\fBJails: Confining the omnipotent root''jail\fP'
+.if \n(.U \{\
+.br
+.>> <a href="jail.html">jail.html</a>
+.\}
+.QP
+The jail system call sets up a jail and locks the current process in it.
+.sp
+Poul-Henning Kamp, Robert N. M. Watson.
+.sp
+This paper was presented at the 2nd International System Administration
+and Networking Conference "SANE 2000" May 22-25, 2000 in Maastricht,
+The Netherlands and is published in the proceedings.
diff --git a/share/doc/papers/devfs/Makefile b/share/doc/papers/devfs/Makefile
new file mode 100644
index 000000000000..53a79fccab9a
--- /dev/null
+++ b/share/doc/papers/devfs/Makefile
@@ -0,0 +1,9 @@
+# $FreeBSD$
+
+VOLUME= papers
+DOC= devfs
+SRCS= paper.me
+MACROS= -me
+USE_PIC=
+
+.include <bsd.doc.mk>
diff --git a/share/doc/papers/devfs/paper.me b/share/doc/papers/devfs/paper.me
new file mode 100644
index 000000000000..1fc627215ac4
--- /dev/null
+++ b/share/doc/papers/devfs/paper.me
@@ -0,0 +1,1277 @@
+.\" format with ditroff -me
+.\" $FreeBSD$
+.\" format made to look as a paper for the proceedings is to look
+.\" (as specified in the text)
+.if n \{ .po 0
+. ll 78n
+. na
+.\}
+.if t \{ .po 1.0i
+. ll 6.5i
+. nr pp 10 \" text point size
+. nr sp \n(pp+2 \" section heading point size
+. nr ss 1.5v \" spacing before section headings
+.\}
+.nr tm 1i
+.nr bm 1i
+.nr fm 2v
+.he ''''
+.de bu
+.ip \0\s-2\(bu\s+2
+..
+.lp
+.rs
+.ce 5
+.sp
+.sz 14
+.b "Rethinking /dev and devices in the UNIX kernel"
+.sz 12
+.sp
+.i "Poul-Henning Kamp"
+.sp .1
+.i "<phk@FreeBSD.org>"
+.i "The FreeBSD Project"
+.i
+.sp 1.5
+.b Abstract
+.lp
+An outstanding novelty in UNIX at its introduction was the notion
+of ``a file is a file is a file and even a device is a file.''
+Going from ``hardware only changes when the DEC Field engineer is here''
+to ``my toaster has USB'' has put serious strain on the rather crude
+implementation of the ``devices as files'' concept, an implementation which
+has survived practically unchanged for 30 years in most UNIX variants.
+Starting from a high-level view of devices and the semantics that
+have grown around them over the years, this paper takes the audience on a
+grand tour of the redesigned FreeBSD device-I/O system,
+to convey an overview of how it all fits together, and to explain why
+things ended up as they did, how to use the new features and
+in particular how not to.
+.sp
+.if t \{
+.2c
+.\}
+.\" end boilerplate... paper starts here.
+.sh 1 "Introduction"
+.sp
+There are really only two fundamental ways to conceptualise
+I/O devices in an operating system:
+The usual way and the UNIX way.
+.lp
+The usual way is to treat I/O devices as their own class of things,
+possibly several classes of things, and provide APIs tailored
+to the semantics of the devices.
+In practice this means that a program must know what it is dealing
+with: it has to interact with disks one way, tapes another and
+rodents yet a third way, all of which are different from how it
+interacts with a plain disk file.
+.lp
+The UNIX way has never been described better than in the very first
+paper
+published on UNIX by Ritchie and Thompson [Ritchie74]:
+.(q
+Special files constitute the most unusual feature of the UNIX filesystem.
+Each supported I/O device is associated with at least one such file.
+Special files are read and written just like ordinary disk files,
+but requests to read or write result in activation of the associated device.
+An entry for each special file resides in directory /dev,
+although a link may be made to one of these files just as it may to an
+ordinary file.
+Thus, for example, to write on a magnetic tape one may write on the file /dev/mt.
+
+Special files exist for each communication line, each disk, each tape drive,
+and for physical main memory.
+Of course, the active disks and the memory special files are protected from indiscriminate access.
+
+There is a threefold advantage in treating I/O devices this way:
+file and device I/O are as similar as possible;
+file and device names have the same syntax and meaning,
+so that a program expecting a file name as a parameter can be passed a device name;
+finally, special files are subject to the same protection mechanism as regular files.
+.)q
+.lp
+.\" (Why was this so special at the time?)
+At the time, this was quite a strange concept; it was totally accepted,
+for instance, that neither the system administrator nor the users were
+able to interact with a disk as a disk.
+Operating systems simply
+did not provide access to disk other than as a filesystem.
+Most vendors did not even release a program to initialise a
+disk-pack with a filesystem: selling pre-initialised and ``quality
+tested'' disk-packs was quite a profitable business.
+.lp
+In many cases some kind of API for reading and
+writing individual sectors on a disk pack
+did exist in the operating system,
+but more often than not
+it was not listed in the public documentation.
+.sh 2 "The traditional implementation"
+.lp
+.\" (Explain how opening /dev/lpt0 lands you in the right device driver)
+The initial implementation used hardcoded inode numbers [Ritchie98].
+The console
+device would be inode number 5, the paper-tape-punch number 6 and so on,
+even if those inodes were also actual regular files in the filesystem.
+.lp
+For reasons one can only too vividly imagine, this was changed and
+Thompson
+[Thompson78]
+describes how the implementation now used ``major and minor''
+device numbers to index through the devsw array to the correct device driver.
+.lp
+For all intents and purposes, this is the implementation which survives
+in most UNIX-like systems even to this day.
+Apart from the access control and timestamp information which is
+found in all inodes, the special inodes in the filesystem contain only
+one piece of information: the major and minor device numbers, often
+logically OR'ed into one field.
+.lp
+When a program opens a special file, the kernel uses the major number
+to find the entry points in the device driver, and passes the combined
+major and minor numbers as a parameter to the device driver.
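The dispatch described above can be sketched in a few lines. This is a minimal model, not kernel code: the 8-bit minor field, the size of the table and the two driver names are assumptions made for illustration, and only the open entry point is shown.

```python
# Illustrative model only: the 8-bit minor field and the driver names are
# assumptions, not the layout used by any particular kernel.
MINOR_BITS = 8
MINOR_MASK = (1 << MINOR_BITS) - 1


def makedev(major_no, minor_no):
    """Combine major and minor numbers into a single device number."""
    return (major_no << MINOR_BITS) | (minor_no & MINOR_MASK)


def major(dev):
    """Extract the major number: the index into devsw."""
    return dev >> MINOR_BITS


def minor(dev):
    """Extract the minor number: meaningful only to the driver."""
    return dev & MINOR_MASK


class Driver:
    """Stand-in for one devsw entry: a named set of entry points."""

    def __init__(self, name):
        self.name = name

    def d_open(self, dev):
        # The driver receives the combined number and picks out the unit.
        return "%s%d" % (self.name, minor(dev))


# devsw is indexed by major number; unused slots stay empty.
devsw = [None] * 16
devsw[4] = Driver("tty")
devsw[9] = Driver("fd")


def open_special(dev):
    """Model of what happens when a special file is opened."""
    index = major(dev)
    if index >= len(devsw) or devsw[index] is None:
        raise OSError("no driver configured for major %d" % index)
    return devsw[index].d_open(dev)
```

With this table, open_special(makedev(9, 0)) is routed to the driver in slot 9, which sees the whole device number and extracts minor 0 itself. The inode contributes nothing beyond that one number.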
+.sh 1 "The challenge"
+.lp
+Now, we did not talk much about where the special inodes came from
+to begin with.
+They were created by hand, using the
+mknod(2) system call, usually through the mknod(8) program.
+.lp
+In those days a
+computer had a very static hardware configuration\**
+.(f
+\** Unless your assigned field engineer was present on site.
+.)f
+and it certainly did not
+change while the system was up and running, so creating device nodes
+by hand was certainly an acceptable solution.
+.lp
+The first sign that this would not hold up as a solution came with
+the advent of TCP/IP and the telnet(1) program, or more precisely
+with the telnetd(8) daemon.
+In order to support remote login, a ``pseudo-tty'' device driver was implemented:
+basically a tty driver which, instead of hardware, had another device which
+would allow a process to ``act as hardware'' for the tty.
+The telnetd(8) daemon would read and write data on the ``master'' side of
+the pseudo-tty and the user would be running on the ``slave'' side,
+which would act just like any other tty: you could change the erase
+character if you wanted to and all the signals and all that stuff worked.
+.lp
+Obviously with a device requiring no hardware, you can compile as many
+instances into the kernel as you like, as long as you do not use
+too much memory.
+As system after system was connected
+to the ARPANet, ``increasing number of ptys'' became a regular task
+for system administrators, and part of this task was to create
+more special nodes in the filesystem.
+.lp
+Several UNIX vendors also noticed an issue when they sold minicomputers
+in many different configurations: explaining to system administrators
+just which special nodes they would need and how to create them was
+a significant documentation hassle. Some opted for the simple solution
+and pre-populated /dev with every conceivable device node, resulting
+in a predictable slowdown on access to filenames in /dev.
+.lp
+System V UNIX provided a band-aid solution:
+a special boot sequence would take effect if the kernel or
+the hardware had changed since the last reboot.
+This boot procedure would
+amongst other things create the necessary special files in the filesystem,
+based on an intricate system of per device driver configuration files.
+.lp
+In recent years, we have become used to hardware which changes
+configuration at any time: people plug USB, Firewire and PCCard
+devices into their computers.
+These devices can be anything from modems and disks to GPS receivers
+and fingerprint authentication hardware.
+Suddenly maintaining the
+correct set of special devices in ``/dev'' became a major headache.
+.lp
+Along the way, UNIX kernels had learned to deal with multiple filesystem
+types [Heidemann91a] and a ``device-pseudo-filesystem'' was a pretty
+obvious idea.
+The device drivers have a pretty good idea which
+devices they have found in the configuration, so all that is needed is
+to present this information as a filesystem filled with just the right
+special files.
+Experience has shown that this, like most other ``pseudo
+filesystems'', sounds a lot simpler in theory than in practice.
+.sh 1 "Truly understanding devices"
+.lp
+Before we continue, we need to fully understand the
+``device special file'' in UNIX.
+.lp
+First we need to realize that a special file has the nature of
+a pointer from the filesystem into a different namespace;
+a little-understood fact with far-reaching consequences.
+.lp
+One implication of this is that several special files can
+exist in the filename namespace all pointing to the same device
+but each having their own access and timestamp attributes:
+.lp
+.(b M
+.vs -3
+\fC\s-3guest# ls -l /dev/fd0 /tmp/fd0
+crw-r----- 1 root operator 9, 0 Sep 27 19:21 /dev/fd0
+crw-rw-rw- 1 root wheel 9, 0 Sep 27 19:24 /tmp/fd0\fP\s+3
+.vs +3
+.)b
+Obviously, the administrator needs to be on top of this:
+one popular way to exploit an unguarded root prompt is
+to create a replica of the special file /dev/kmem
+in a location where it will not be noticed.
+Since /dev/kmem gives access to the kernel memory,
+gaining any particular
+privilege can be arranged by suitably modifying the kernel's
+data structures through the illicit special file.
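One way for an administrator to stay on top of this is to sweep the filesystem for special files that live outside /dev. The sketch below assumes a POSIX system and uses only Python's standard os and stat modules. The function names are illustrative, and a real audit would also have to decide which findings are legitimate.

```python
import os
import stat


def is_device(mode):
    """True if an st_mode value describes a character or block special file."""
    return stat.S_ISCHR(mode) or stat.S_ISBLK(mode)


def find_device_nodes(top):
    """Yield (path, major, minor) for every special file found below top."""
    for dirpath, dirnames, filenames in os.walk(top):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: look at the node, not a link target
            except OSError:
                continue  # vanished or unreadable; skip it
            if is_device(st.st_mode):
                # st_rdev holds the device number stored in the special inode.
                yield path, os.major(st.st_rdev), os.minor(st.st_rdev)
```

Pointed at a tree such as /tmp or a user's home directory, the generator would normally be expected to yield nothing. Any hit names a node that reaches the same driver as its counterpart in /dev, whatever its own permissions say.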
+.lp
+When NFS appeared it opened a new avenue for this attack:
+People may have root privilege on one machine but not another.
+Since device nodes are not interpreted on the NFS server
+but rather on the local computer,
+a user with root privilege on an NFS client
+computer can create a device node to his liking on a filesystem
+mounted from an NFS server.
+This device node can in turn be used to
+circumvent the security of other computers which mount that filesystem,
+including the server, unless they protect themselves by
+mounting untrusted filesystems with the \fCnodev\fP mount-option,
+so that device entries on them are not honoured.
+.lp
+The fact that the device itself does not actually exist inside the
+filesystem which holds the special file makes it possible
+to perform boot-strapping stunts in the spirit
+of Baron Von Münchausen [raspe1785],
+where a filesystem is (re)mounted using one of its own
+device vnodes:
+.(b M
+.vs -3
+\fC\s-2guest# mount -o ro /dev/fd0 /mnt
+guest# fsck /mnt/dev/fd0
+guest# mount -u -o rw /mnt/dev/fd0 /mnt\fP\s+2
+.vs +3
+.)b
+.lp
+Other interesting details are chroot(2) and jail(2) [Kamp2000] which
+provide filesystem isolation for process-trees.
+Whereas chroot(2) was not implemented as a security tool [Mckusick1999]
+(although it has been widely used as such), the jail(2) security
+facility in FreeBSD provides a pretty convincing ``virtual machine''
+where even the root privilege is isolated and restricted to the designated
+area of the machine.
+Obviously chroot(2) and jail(2) may require access to a well-defined
+subset of devices like /dev/null, /dev/zero and /dev/tty,
+whereas access to other devices such as /dev/kmem
+or any disks could be used to compromise the integrity of the jail(2)
+confinement.
+.lp
+For a long time FreeBSD, like almost all UNIX-like systems, had two kinds
+of devices, ``block'' and
+``character'' special files, the difference being that ``block''
+devices would provide caching and alignment for disk device access.
+This was one of those minor architectural mistakes which took
+forever to correct.
+.lp
+The argument that block devices were a mistake is really very
+very simple: Many devices other than disks have multiple modes
+of access which you select by choosing which special file to use.
+.lp
+Pick any old timer and he will be able to recite painful
+sagas about the crucial difference between the /dev/rmt
+and /dev/nrmt devices for tape access.\**
+.(f
+\** Make absolutely sure you know the difference before you take
+important data on a multi-file 9-track tape to remote locations.
+.)f
+.lp
+Tapes, asynchronous ports, line printer ports and many other devices
+have implemented submodes, selectable by the user
+at a special filename level, but that has not earned them their
+own special file types.
+Only disks\**
+.(f
+\** Well, OK: and some 9-track tapes.
+.)f
+have enjoyed the privilege of getting an entire file type dedicated to
+a minor device mode.
+.lp
+Caching and alignment modes should have been enabled by setting
+some bit in the minor device number on the disk special file,
+not by polluting the filesystem code with another file type.
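To make the alternative concrete, here is a sketch of how a driver might carve submode bits out of the minor number, using the rewind and no-rewind tape devices mentioned above. The bit layout and the helper names are hypothetical, picked for illustration rather than taken from any driver.

```python
# Hypothetical bit layout for a tape driver's minor number; the constants
# are assumptions chosen for illustration.
UNIT_MASK = 0x03  # bits 0-1: which drive
NO_REWIND = 0x04  # bit 2: do not rewind on close (the "n" devices)


def decode_minor(minor_no):
    """Split a minor number into (unit, rewind_on_close)."""
    unit = minor_no & UNIT_MASK
    rewind_on_close = not (minor_no & NO_REWIND)
    return unit, rewind_on_close


def node_name(minor_no):
    """Name the special file that would conventionally carry this minor."""
    unit, rewind_on_close = decode_minor(minor_no)
    prefix = "rmt" if rewind_on_close else "nrmt"
    return prefix + str(unit)
```

Under this layout minor 1 and minor 5 reach the same drive and differ only in what happens at close. A caching or alignment mode for disks could have been selected the same way, with one more bit and no new file type.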
+.lp
+In FreeBSD block devices were not even implemented in a fashion
+which would be of any use, since any write errors would never be
+reported to the writing process. For this reason, and since no
+applications
+were found to be in existence which relied on block devices
+and since historical usage was indeed historical [Mckusick2000],
+block devices were removed from the FreeBSD system.
+This greatly simplified the task of keeping track of open(2)
+reference counts for disks and
+removed much magic special-case code throughout.
+.lp
+.sh 1 "Files, sockets, pipes, SVID IPC and devices"
+.sp
+It is an instructive lesson in inconsistency to look at the
+various types of ``things'' a process can access in UNIX-like
+systems today.
+.lp
+First there are normal files, which are our reference yardstick here:
+they are accessed with open(2), read(2), write(2), mmap(2), close(2)
+and various other auxiliary system calls.
+.lp
+Sockets and pipes are also accessed via file handles but each has
+its own namespace. That means you cannot open(2) a socket,\**
+.(f
+\** This is particularly bizarre in the case of UNIX domain sockets
+which use the filesystem as their namespace and appear in directory
+listings.
+.)f
+but you can read(2) and write(2) to it.
+Sockets and pipes vector off at the file descriptor level and do
+not get in touch with the vnode based part of the kernel at all.
+.lp
+Devices land somewhere in the middle between pipes and sockets on
+one side and normal files on the other.
+They use the filesystem
+namespace, are implemented with vnodes, and can be operated
+on like normal files, but don't actually live in the filesystem.
+.lp
+Devices are in fact special-cased all the way through the vnode system.
+For one thing devices break the ``one file-one vnode''
+rule, making it necessary to chain all vnodes for the same
+device together in
+order to be able to find ``the canonical vnode for this device node'',
+but more importantly, many operations have to be specifically denied
+on special file vnodes since they do not make any sense.
+.lp
+For true inconsistency, consider the SVID IPC mechanisms - not
+only do they not operate via file handles,
+but they also sport a singularly
+ill-conceived 32-bit numeric namespace and a dedicated set of
+system calls for access.
+.lp
+Several people have convincingly argued that this is an inconsistent
+mess, and have proposed and implemented more consistent operating systems
+like Plan 9 from Bell Labs [Pike90a] [Pike92a].
+Unfortunately, the reality is that people are not interested in learning a new
+operating system when the one they have is pretty darn good, and
+consequently research into better and more consistent ways is
+a pretty frustrating [Pike2000] but by no means irrelevant topic.
+.sh 1 "Solving the /dev maintenance problem"
+.lp
+There are a number of obvious, simple but wrong ways one could
+go about solving the ``/dev'' maintenance problem.
+.lp
+The very straightforward way is to hack the namei() kernel function
+responsible for filename translation and lookup.
+It is only a minor matter of programming to
+add code to special-case any lookup which ends up in ``/dev''.
+But this leads to problems: in the case of chroot(2) or jail(2), the
+administrator will want to present only a subset of the available
+devices in ``/dev'', so some kind of state will have to be kept per
+chroot(2)/jail(2) about which devices are visible and
+which devices are hidden, but no obvious location for this information
+is available in the absence of a mount data structure.
+.lp
+It also leads to some unpleasant issues
+because of the fact that ``/dev/foo'' is a synthesised directory
+entry which may or may not actually be present on the filesystem
+which seems to provide ``/dev''.
+The vnodes either have to belong to a filesystem or they
+must be special-cased throughout the vnode layer of the kernel.
+.lp
+Finally there is the simple matter of generality:
+hardcoding the string ``/dev'' in the kernel is not very general.
+.lp
+A cruder solution is to leave it to a daemon: make a special
+device driver, have a daemon read messages from it and create and
+destroy nodes in ``/dev'' in response to these messages.
+.lp
+The main drawback to this idea is that now we have added IPC
+to the mix introducing new and interesting race conditions.
+.lp
+Otherwise this solution is surprisingly effective,
+but chroot(2)/jail(2) requirements prevent a simple implementation
+and running a daemon per jail would become an administrative
+nightmare.
+.lp
+Another pitfall of
+this approach is that we are not able to remount the root filesystem
+read-write at boot until we have a device node for the root device,
+but if this node is missing we cannot create it with a daemon since
+the root filesystem (and hence /dev) is read-only.
+Adding a read-write memory-filesystem mounted on /dev to solve this problem
+does not improve
+the architectural qualities further and certainly the KISS principle has
+been violated by now.
+.lp
+The final and in the end only satisfactory solution is to write a ``DEVFS''
+which mounts on ``/dev''.
+.lp
+The good news is that it does solve the problem with chroot(2) and jail(2):
+just mount a DEVFS instance on the ``dev'' directory inside the filesystem
+subtree where the chroot or jail lives. Having a mountpoint gives us
+a convenient place to keep track of the local state of this DEVFS mount.
+.lp
+The bad news is that it takes a lot of cleanup and care to implement
+a DEVFS in a UNIX kernel.
+.sh 1 "DEVFS architectural decisions"
+.lp
+Before implementing a DEVFS, it is necessary to decide on a range
+of corner cases in behaviour, and some of these choices have proved
+surprisingly hard to settle for the FreeBSD project.
+.sh 2 "The ``persistence'' issue"
+.lp
+When DEVFS in FreeBSD was initially presented at a BoF at the 1995
+USENIX Technical Conference in New Orleans,
+a group of people demanded that it provide ``persistence''
+for administrative changes.
+.lp
+When trying to get a definition of ``persistence'', people can generally
+agree that if the administrator changes the access control bits of
+a device node, they want that mode to survive across reboots.
+.lp
+Once more tricky examples of the sort of manipulations one can do
+on special files are proposed, people rapidly disagree about what
+should be supported and what should not.
+.lp
+For instance, imagine a
+system with one floppy drive which appears in DEVFS as ``/dev/fd0''.
+Now the administrator, in order to get some badly written software
+to run, links this to ``/dev/fd1'':
+.(b M
+\fC\s-2ln /dev/fd0 /dev/fd1\fP\s+2
+.)b
+This works as expected and with persistence in DEVFS, the link is
+still there after a reboot.
+But what if after a reboot another floppy drive has been connected
+to the system?
+This drive would naturally have the name ``/dev/fd1'',
+but this name is now occupied by the administrator's hard link.
+Should the link be broken?
+Should the new floppy drive be called
+``/dev/fd2''? Nobody can agree on anything but the ugliness of the
+situation.
+.lp
+Given that we are no longer dependent on DEC Field engineers to
+change all four wheels to see which one is flat, the basic assumption
+that the machine has a constant hardware configuration is simply no
+longer true.
+The new assumption one should start from when analysing this
+issue is that when the system boots, we cannot know what devices we
+will find, and we cannot know if the devices we do find
+are the same ones we had when the system was last shut down.
+.lp
+And in fact, this is very much the case with laptops today: if I attach
+my IOmega Zip drive to my laptop it appears like a SCSI disk named
+``/dev/da0'', but so does the RAID-5 array attached to the PCI SCSI controller
+installed in my laptop's docking station. If I change mode to ``a+rw''
+on the Zip drive, do I want that mode to apply to the RAID-5 as well?
+Unlikely.
+.lp
+And what if we have persistent information about the mode of
+device ``/dev/sio0'', but we boot and do not find any sio devices?
+Do we keep the information in our device-persistence registry?
+How long do we keep it? If I borrow a modem card,
+set the permissions to some non-standard value like 0666,
+and then attach some other serial device a year from now - do I
+want some old permission changes to come back and haunt me,
+just because they both happened to be ``/dev/sio0''?
+Unlikely.
+.lp
+The fact that more people have laptop computers today than
+five years ago, and the fact that nobody has been able to credibly
+propose where a persistent DEVFS would actually store the
+information about these things in the first place has settled the issue.
+.lp
+Persistence may be the right answer, but to the
+wrong question: persistence is not a desirable property for a DEVFS
+when the hardware configuration may change literally at any time.
+.sh 2 "Who decides on the names?"
+.lp
+In a DEVFS-enabled system, the responsibility for creating nodes in
+/dev shifts to the device drivers, and consequently the device
+drivers get to choose the names of the device files.
+In addition, initial values for the owner, group and mode bits are
+provided by the device driver.
+.lp
+But should it be possible to rename ``/dev/lpt0'' to ``/dev/myprinter''?
+While the obvious affirmative answer is easy to arrive at, it leaves
+a lot to be desired once the implications are unmasked.
+.lp
+Most device drivers know their own name and use it purposefully in
+their debug and log messages to identify themselves.
+Furthermore, the ``NewBus'' [NewBus] infrastructure facility,
+which ties hardware to device drivers, identifies things by name
+and unit numbers.
+.lp
+In fact, a very common way to report errors is:
+.(b M
+.vs -3
+\fC\s-2#define LPT_NAME "lpt" /* our official name */
+[...]
+printf(LPT_NAME
+ ": cannot alloc ppbus (%d)!", error);\fP\s+2
+.vs +3
+.)b
+.lp
+So despite the user renaming the device node pointing to the printer
+to ``myprinter'', this has absolutely no effect in the kernel and can
+be considered a userland aliasing operation.
+.lp
+The decision was therefore made that it should not be possible to rename
+device nodes since it would only lead to confusion and because the desired
+effect could be attained by giving the user the ability to create
+symlinks in DEVFS.
+.sh 2 "On-demand device creation"
+.lp
+Pseudo-devices like pty, tun and bpf,
+but also some real devices, may not pre-emptively create entries for all
+possible device nodes. It would be a pointless waste of resources
+to always create 1000 ptys just in case they are needed,
+and in the worst case more than 1800 device nodes would be needed per
+physical disk to represent all possible slices and partitions.
+.lp
+For pseudo-devices the task at hand is to make a magic device node,
+``/dev/pty'', which when opened will magically transmogrify into the
+first available pty subdevice, maybe ``/dev/pty123''.
+.lp
+Device submodes, on the other hand, work by having multiple
+entries in /dev, each with a different minor number, as a way to instruct
+the device driver in aspects of its operation. The most widespread
+example is probably ``/dev/mt0'' and ``/dev/nmt0'', where the node
+with the extra ``n''
+instructs the tape device driver to not rewind on close.\**
+.(f
+\** This is the answer to the question in footnote number 2.
+.)f
+.lp
+Some UNIX systems have solved the problem for pseudo-devices by
+creating magic cloning devices like ``/dev/tcp''.
+When a cloning device is opened,
+it finds a free instance and, through vnode and file descriptor mangling,
+returns this new device to the opening process.
+.lp
+This scheme has two disadvantages: the complexity of switching vnodes
+in midstream is non-trivial, but even worse is the fact that it
+does not work for
+submodes for a device because it only reacts to one particular /dev entry.
+.lp
+The solution for both needs is a more flexible on-demand device
+creation, implemented in FreeBSD as a two-level lookup.
+When a
+filename is looked up in DEVFS, a match in the existing device nodes is
+sought first and if found, returned.
+If no match is found, device drivers are polled in turn to ask if
+they would be able to synthesise a device node of the given name.
+.lp
+The device driver gets a chance to modify the name
+and create a device with make_dev().
+If one of the drivers succeeds in this, the lookup is started over and
+the newly found device node is returned:
+.(b M
+.vs -3
+\fC\s-2pty_clone()
+    if (name != "pty")
+        return(NULL); /* no luck */
+    n = find_next_unit();
+    dev = make_dev(...,n,"pty%d",n);
+    name = dev->name;
+    return(dev);\fP\s+2
+.vs +3
+.)b
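+.lp
+As a rough userland sketch of the two-level lookup in C, with every
+name hypothetical (ex_lookup(), ex_pty_clone() and ex_make_dev() are
+stand-ins invented for illustration, not the FreeBSD kernel interfaces):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define EX_MAXDEV  16   /* hypothetical table size */
#define EX_NAMELEN 16   /* hypothetical name buffer size */

struct ex_dev {
    char name[EX_NAMELEN];
    int unit;
};

static struct ex_dev ex_devs[EX_MAXDEV];
static int ex_ndevs;

/* Stand-in for make_dev(): register a named node, or NULL if full. */
static struct ex_dev *
ex_make_dev(int unit, const char *base)
{
    struct ex_dev *d;

    if (ex_ndevs >= EX_MAXDEV)
        return (NULL);
    d = &ex_devs[ex_ndevs++];
    d->unit = unit;
    snprintf(d->name, sizeof(d->name), "%s%d", base, unit);
    return (d);
}

/* Clone handler: answers only for "pty" and rewrites the name. */
static struct ex_dev *
ex_pty_clone(char *name, size_t len)
{
    static int next_unit;
    struct ex_dev *d;

    if (strcmp(name, "pty") != 0)
        return (NULL);              /* not ours */
    d = ex_make_dev(next_unit, "pty");
    if (d == NULL)
        return (NULL);
    next_unit++;
    snprintf(name, len, "%s", d->name);
    return (d);
}

typedef struct ex_dev *ex_clone_fn(char *, size_t);
static ex_clone_fn *ex_cloners[] = { ex_pty_clone };

/* First level: search the nodes which already exist. */
static struct ex_dev *
ex_find(const char *name)
{
    for (int i = 0; i < ex_ndevs; i++)
        if (strcmp(ex_devs[i].name, name) == 0)
            return (&ex_devs[i]);
    return (NULL);
}

/* Two-level lookup: existing nodes first, then poll the drivers. */
static struct ex_dev *
ex_lookup(const char *wanted)
{
    char name[EX_NAMELEN];
    struct ex_dev *d;

    assert(wanted != NULL);
    snprintf(name, sizeof(name), "%s", wanted);
    if ((d = ex_find(name)) != NULL)
        return (d);
    for (size_t i = 0; i < sizeof(ex_cloners) / sizeof(ex_cloners[0]); i++)
        if (ex_cloners[i](name, sizeof(name)) != NULL)
            return (ex_find(name)); /* start the lookup over */
    return (NULL);
}
```

+.lp
+Under these assumptions, two successive lookups of ``pty'' yield
+``pty0'' and then ``pty1'', while a later lookup of ``pty0'' is
+satisfied at the first level without involving any driver.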
+.lp
+An interesting mixed use of this mechanism is with the sound device drivers.
+Modern sound devices have multiple channels, presumably to allow the
+user to listen to CNN, Napstered MP3 files and Quake sound effects at
+the same time.
+The only problem is that all applications attempt to open ``/dev/dsp''
+since they have no concept of multiple sound devices.
+The sound device drivers use the cloning facility to direct ``/dev/dsp''
+to the first available sound channel completely transparently to the
+process.
+.lp
+There are very few drawbacks to this mechanism, the major one being
+that ``ls /dev'' now errs on the sparse side instead of the rich when used
+as a system device inventory, a practice which has always been
+of dubious precision at best.
+.sh 2 "Deleting and recreating devices"
+.lp
+Deleting device nodes is no problem to implement, but as likely as not,
+some people will want a method to get them back.
+Since only the device driver knows how to create a given device,
+recreation cannot be performed solely on the basis of the parameters
+provided by a process in userland.
+.lp
+In order to not complicate the code which updates the directory
+structure for a mountpoint to reflect changes in the DEVFS inode list,
+a deleted entry is merely marked with DE_WHITEOUT instead of being
+removed entirely.
+Otherwise a separate list would be needed for inodes which we had
+deleted so that they would not be mistaken for new inodes.
+.lp
+The obvious way to recreate deleted devices is to let mknod(2) do it
+by matching the name and disregarding the major/minor arguments.
+Recreating the device with mknod(2) will simply remove the DE_WHITEOUT
+flag.
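+.lp
+A minimal sketch of the whiteout idea in C follows; the flag value,
+the structure and the wo_ function names are assumptions made for
+illustration and are not taken from the DEVFS sources:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WO_DE_WHITEOUT 0x1  /* hypothetical flag value */

struct wo_dirent {
    const char *name;
    int flags;
};

/* Find an entry by name, whether or not it is whited out. */
static struct wo_dirent *
wo_find(struct wo_dirent *dir, size_t n, const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < n; i++)
        if (strcmp(dir[i].name, name) == 0)
            return (&dir[i]);
    return (NULL);
}

/* Lookup as seen by processes: whited-out entries do not exist. */
static struct wo_dirent *
wo_lookup(struct wo_dirent *dir, size_t n, const char *name)
{
    struct wo_dirent *de = wo_find(dir, n, name);

    if (de != NULL && (de->flags & WO_DE_WHITEOUT))
        return (NULL);
    return (de);
}

/* Deleting hides the entry but keeps it on the list. */
static int
wo_remove(struct wo_dirent *dir, size_t n, const char *name)
{
    struct wo_dirent *de = wo_lookup(dir, n, name);

    if (de == NULL)
        return (-1);
    de->flags |= WO_DE_WHITEOUT;
    return (0);
}

/* mknod-style recreation: match on the name only, clear the flag. */
static int
wo_mknod(struct wo_dirent *dir, size_t n, const char *name)
{
    struct wo_dirent *de = wo_find(dir, n, name);

    if (de == NULL || !(de->flags & WO_DE_WHITEOUT))
        return (-1);        /* nothing to bring back */
    de->flags &= ~WO_DE_WHITEOUT;
    return (0);
}
```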
+.sh 2 "Jail(2), chroot(2) and DEVFS"
+.lp
+The primary requirement from facilities like jail(2) and chroot(2)
+is that it must be possible to control the contents of a DEVFS mount
+point.
+.lp
+Obviously, it would not be desirable for dynamic devices to pop
+into existence in the carefully pruned /dev of jails so it must be
+possible to mark a DEVFS mountpoint as ``no new devices''.
+And in the same way, the jailed root should not be able to recreate
+device nodes which the real root has removed.
+.lp
+These behaviours will be controlled with mount options, but these have not
+yet been implemented because FreeBSD has run out of bitmap flags for
+mount options, and a new unlimited mount option implementation is
+still not in place at the time of writing.
+.lp
+One mount option, ``jaildevfs'', will restrict the contents of the
+DEVFS mountpoint to the ``normal set'' of devices for a jail and
+automatically hide all future devices and make it impossible
+for a jailed root to un-hide hidden entries while letting an un-jailed
+root do so.
+.lp
+Mounting or remounting read-only will prevent all future
+devices from appearing and will make it impossible to
+hide or un-hide entries in the mountpoint.
+This is probably only useful for chroots or jails where no tty
+access is intended since cloning will not work either.
+.lp
+More mount options may be needed as more experience is gained.
+.sh 2 "Default mode, owner & group"
+.lp
+When a device driver creates a device node, and a DEVFS mount adds it
+to its directory tree, it needs to have some values for the access
+control fields: mode, owner and group.
+.lp
+Currently, the device driver specifies the initial values in the
+make_dev() call, but this is far from optimal.
+For one thing, embedding magic UIDs and GIDs in the kernel is simply
+bad style unless they are numerically zero.
+More seriously, they represent compile-time defaults which in these
+enlightened days is rather old-fashioned.
+.lp
+.sh 1 "Cleaning up before we build: struct specinfo and dev_t"
+.lp
+Most of the rest of the paper will be about the various challenges
+and issues in the implementation of DEVFS in FreeBSD.
+All of this should be applicable to other systems derived from
+4.4BSD-Lite as well.
+.lp
+POSIX has defined a type called ``dev_t'' which is the identity of a device.
+This is mainly for use in the few system calls which know about devices:
+stat(2), fstat(2) and mknod(2).
+A dev_t is constructed by logically OR'ing
+the major# and minor# for the device.
+Since those have been defined
+as having no overlapping bits, the major# and minor#
+can be retrieved from the dev_t by a simple masking operation.
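+.lp
+As a minimal sketch of this packing in C, assume an 8-bit major#
+stored above an 8-bit minor#; the real bit layout differs between
+systems, and the ex_ names below are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: major# in bits 8..15, minor# in bits 0..7. */
#define EX_MINOR_BITS 8u
#define EX_MINOR_MASK 0x00ffu
#define EX_MAJOR_MASK 0xff00u

typedef uint16_t ex_dev_t;

/* Build a device number by OR'ing the two non-overlapping fields. */
static ex_dev_t
ex_makedev(unsigned major, unsigned minor)
{
    assert(major <= 0xffu && minor <= 0xffu);
    return ((ex_dev_t)((major << EX_MINOR_BITS) | minor));
}

/* Recover the major#: mask, then shift down to bit 0. */
static unsigned
ex_major(ex_dev_t dev)
{
    return ((dev & EX_MAJOR_MASK) >> EX_MINOR_BITS);
}

/* Recover the minor#: a plain mask is enough. */
static unsigned
ex_minor(ex_dev_t dev)
{
    return (dev & EX_MINOR_MASK);
}
```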
+.lp
+Although the kernel had a well-defined concept of any particular
+device, it did not have a data structure to represent ``a device''.
+The device driver has such a structure, traditionally called ``softc'',
+but the upper layers of the kernel do not (and should not!) have access to the
+device driver's private data structures.
+.lp
+It is an interesting tale how things got to be this way,\**
+.(f
+\** Basically, devices should have been moved up with sockets and
+pipes at the file descriptor level when the VFS layering was introduced,
+rather than have all the special casing throughout the vnode system.
+.)f
+but for now simply note
+what the actual relationship between the data structures was
+in the 4.4BSD release (Fig. 1) [44BSDBook].
+.(z
+.PS 3
+F: box "file" "handle"
+arrow down from F.s
+V: box "vnode"
+arrow right from V.e
+S: box "specinfo"
+arrow down from V.s
+I: box "inode"
+arrow right from I.e
+C: box invis "devsw[]" "[major#]"
+arrow down from C.s
+D: box "device" "driver"
+line right from D.e
+box invis "softc[]" "[minor#]"
+F2: box "file" "handle" at F + (2.5,0)
+arrow down from F2.s
+V2: box "vnode"
+arrow right from V2.e
+S2: box "specinfo"
+arrow down from V2.s
+I2: box "inode"
+arrow left from I2.w
+.PE
+.ce 1
+Fig. 1 - Data structures in 4.4BSD
+.)z
+.lp
+As for all other files, a vnode references a filesystem inode, but
+in addition it points to a ``specinfo'' structure. In the inode
+we find the dev_t which is used to reference the device driver.
+.lp
+Access to the device driver happens by extracting the major# from
+the dev_t, indexing through the global devsw[] array to locate
+the device driver's entry point.
+.lp
+The device driver will extract the minor# from the dev_t and use
+that as the index into the softc array of private data per device.
+.lp
+The ``specinfo'' structure is a little sidekick vnodes grew along the way,
+and is used to find all vnodes which reference the same device (i.e.
+they have the same major# and minor#).
+This linkage is used to determine
+which vnode is the ``chosen one'' for this device, and to keep track of
+open(2)/close(2) against this device.
+The actual implementation was an inefficient hash implementation
+which, depending on the vnode reclamation rate and /dev directory lookup
+traffic, could become a measurable performance liability.
+.sh 2 "The new vnode/inode/dev_t layout"
+.lp
+In the new layout (Fig. 2) the specinfo structure takes a central
+role. There is only one instance of struct specinfo per
+device (i.e. unique major#
+and minor# combination) and all vnodes referencing this device point
+to this structure directly.
+.(z
+.PS 2.25
+F: box "file" "handle"
+arrow down from F.s
+V: box "vnode"
+arrow right from V.e
+S: box "specinfo"
+arrow down from V.s
+I: box "inode"
+F2: box "file" "handle" at F + (2.5,0)
+arrow down from F2.s
+V2: box "vnode"
+arrow left from V2.w
+arrow down from V2.s
+I2: box "inode"
+arrow down from S.s
+D: box "device" "driver"
+.PE
+.ce 1
+Fig. 2 - The new FreeBSD data structures.
+.)z
+.lp
+In userland, a dev_t is still the logical OR of the major# and
+minor#, but this entity is now called a udev_t in the kernel.
+In the kernel a dev_t is now a pointer to a struct specinfo.
+.lp
+All vnodes referencing a device are linked to a list hanging
+directly off the specinfo structure, removing the need for the
+hash table and consequently simplifying and speeding up a lot
+of code dealing with vnode instantiation, retirement and
+name-caching.
+.lp
+The entry points to the device driver are stored in the specinfo
+structure, removing the need for the devsw[] array and allowing
+device drivers to use separate entrypoints for various minor numbers.
+.lp
+This is very convenient for devices which have a ``control''
+device for management and tuning. The control device almost always
+has entirely separate open/close/ioctl implementations [MD.C].
+.lp
+In addition to this, two data elements are included in the specinfo
+structure but ``owned'' by the device driver. Typically the
+device driver will store a pointer to the softc structure in
+one of these, and unit number or mode information in the other.
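+.lp
+The following C sketch illustrates per-device entry points and
+driver-owned fields. Apart from si_drv1, which appears in Fig. 4,
+all names are hypothetical, and the real struct specinfo carries
+considerably more state:

```c
#include <assert.h>
#include <stddef.h>

struct sx_specinfo;

/* Hypothetical, heavily reduced set of driver entry points. */
struct sx_devsw {
    int (*d_open)(struct sx_specinfo *);
};

struct sx_specinfo {
    const struct sx_devsw *si_devsw;    /* per-device entry points */
    void *si_drv1;                      /* owned by the driver: softc */
    int sx_unit;                        /* owned by the driver: unit/mode */
};

struct sx_softc {
    int opens;
};

/* Data node: reaches its softc without any array indexing. */
static int
sx_data_open(struct sx_specinfo *si)
{
    struct sx_softc *sc = si->si_drv1;

    sc->opens++;
    return (0);
}

/* Control node: a separate implementation which touches no unit. */
static int
sx_ctl_open(struct sx_specinfo *si)
{
    (void)si;
    return (0);
}

static const struct sx_devsw sx_data_devsw = { sx_data_open };
static const struct sx_devsw sx_ctl_devsw = { sx_ctl_open };

/* Kernel side: dispatch through the specinfo, not devsw[major#]. */
static int
sx_open(struct sx_specinfo *si)
{
    assert(si != NULL && si->si_devsw != NULL);
    return (si->si_devsw->d_open(si));
}
```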
+.lp
+This removes the need for drivers to find the softc using array
+indexing based on the minor#, and at the same time has obviated
+the need for the compiled-in ``NFOO'' constants which traditionally
+determined how many softc structures and therefore devices
+the driver could support.\**
+.(f
+\** Not to mention all the drivers which implemented panic(2)
+because they forgot to perform bounds checking on the index before
+using it on their softc arrays.
+.)f
+.lp
+There are some trivial technical issues relating to allocating
+the storage for specinfo early in the boot sequence and how to
+find a specinfo from the udev_t/major#+minor#, but they will
+not be discussed here.
+.sh 2 "Creating and destroying devices"
+.lp
+Ideally, devices should only be created and
+destroyed by the device drivers which know what devices are present.
+This is accomplished with the make_dev() and destroy_dev()
+function calls.
+.lp
+Life is seldom quite that simple. The operating system might be called
+on to act as a NFS server for a diskless workstation, possibly even
+of a different architecture, so we still need to be able to represent
+device nodes with no device driver backing in the filesystems and
+consequently we need to be able to create a specinfo from
+the major#+minor# in these inodes when we encounter them.
+In practice this is quite trivial, but in a few places in the code
+one needs to be aware of the existence
+of both ``named'' and ``anonymous'' specinfo structures.
+.lp
+The make_dev() call creates a specinfo structure and populates
+it with driver entry points, major#, minor#, device node name
+(for instance ``lpt0''), UID, GID and access mode bits. The return
+value is a dev_t (i.e., a pointer to struct specinfo).
+If the device driver determines that the device is no longer
+present, it calls destroy_dev(), giving a dev_t as argument
+and the dev_t will be cleaned and converted to an anonymous dev_t.
+.lp
+Once created with make_dev() a named dev_t exists until destroy_dev()
+is called by the driver. The driver can rely on this and keep state
+in the fields in dev_t which are reserved for driver use.
+.sh 1 "DEVFS"
+.lp
+By now we have all the relevant information about each device node
+collected in struct specinfo but we still have one problem to
+solve before we can add the DEVFS filesystem on top of it.
+.sh 2 "The interrupt problem"
+.lp
+Some device drivers, notably the CAM/SCSI subsystem in FreeBSD
+will discover changes in the device configuration inside an interrupt
+routine.
+.lp
+This imposes some limitations on what can and should be done:
+first one should minimise the amount
+of work done in an interrupt routine for performance reasons;
+second, to avoid deadlocks, vnodes and mountpoints should not be
+accessed from an interrupt routine.
+.lp
+Also, in addition to the locking issue,
+a machine can have many instances of DEVFS mounted:
+for a jail(8) based virtual-machine web-server several hundred instances
+are not unheard of, making it far too expensive to update all of them
+in an interrupt routine.
+.lp
+The solution to this problem is to do all the filesystem work on
+the filesystem side of DEVFS and use atomically manipulated integer indices
+(``inode numbers'') as the barrier between the two sides.
+.lp
+The functions called from the device drivers, make_dev(), destroy_dev()
+&c. only manipulate the DEVFS inode number of the dev_t in
+question and do not even get near any mountpoints or vnodes.
+.lp
+For make_dev() the task is to assign a unique inode number to the
+dev_t and store the dev_t in the DEVFS-global inode-to-dev_t array.
+.(b M
+.vs -3
+\fC\s-2make_dev(...)
+ store argument values in dev_t
+ assign unique inode number to dev_t
+ atomically insert dev_t into inode_array\fP\s+2
+.vs +3
+.)b
+.lp
+For destroy_dev() the task is the opposite: clear the inode number
+in the dev_t and NULL the pointer in the devfs-global inode-to-dev_t
+array.
+.(b M
+.vs -3
+\fC\s-2destroy_dev(...)
+ clear fields in dev_t
+ zero dev_t inode number.
+ atomically clear entry in inode_array\fP\s+2
+.vs +3
+.)b
+.lp
+Both functions conclude by atomically incrementing a global variable
+\fCdevfs_generation\fP to leave an indication to the filesystem
+side that something has changed.
+.lp
+By modifying the global state only with atomic instructions, locks
+have been entirely avoided in this part of the code which means that
+the make_dev() and destroy_dev() functions can be called from practically
+anywhere in the kernel at any time.
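+.lp
+A lock-free version of this can be sketched in portable C11 with
+<stdatomic.h>. This is an illustration under assumptions (a small
+fixed table, the slot index doubling as inode number, gx_ names
+invented here), not the kernel code, which uses its own atomic
+primitives:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define GX_NINODES 8    /* hypothetical table size */

struct gx_dev {
    const char *name;
    int inode;          /* 0 means "no inode assigned" */
};

static _Atomic(struct gx_dev *) gx_inode_array[GX_NINODES];
static atomic_uint gx_generation;

/* Driver side: claim a free slot with compare-and-swap; no locks. */
static int
gx_make_dev(struct gx_dev *dev)
{
    assert(dev != NULL && dev->inode == 0);
    for (int i = 1; i < GX_NINODES; i++) {
        struct gx_dev *expected = NULL;

        dev->inode = i;     /* assign the number first ... */
        if (atomic_compare_exchange_strong(&gx_inode_array[i],
            &expected, dev)) {  /* ... then publish atomically */
            atomic_fetch_add(&gx_generation, 1);
            return (i);
        }
    }
    dev->inode = 0;
    return (-1);            /* table full */
}

/* Driver side: clear the number, empty the slot, bump the generation. */
static void
gx_destroy_dev(struct gx_dev *dev)
{
    int i = dev->inode;

    assert(i > 0 && i < GX_NINODES);
    dev->inode = 0;
    atomic_store(&gx_inode_array[i], (struct gx_dev *)NULL);
    atomic_fetch_add(&gx_generation, 1);
}
```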
+.lp
+On the filesystem side of DEVFS, the only two vnode methods which examine
+or rely on the directory structure, VOP_LOOKUP and VOP_READDIR,
+call the function devfs_populate() to update their mountpoint's view
+of the device hierarchy to match current reality before doing any work.
+.(b M
+.vs -3
+\fC\s-2devfs_readdir(...)
+ devfs_populate(...)
+ ...\fP\s+2
+.vs +3
+.)b
+.lp
+The devfs_populate() function compares the current \fCdevfs_generation\fP
+to the value saved in the mountpoint last time devfs_populate() completed
+and if (actually: while) they differ a linear run is made through the
+devfs-global inode-array and the directory tree of the mountpoint is
+brought up to date.
+.lp
+The actual code is slightly more complicated than shown in the pseudo-code
+here because it has to deal with subdirectories and hidden entries.
+.(b M
+.vs -3
+\fC\s-2devfs_populate(...)
+    while (mount->generation != devfs_generation)
+        for i in all inodes
+            if inode created
+                create directory entry
+            else if inode destroyed
+                remove directory entry\fP\s+2
+.vs +3
+.)b
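+.lp
+The generation check can be sketched in C11 as follows. The flat
+per-mount array, the px_ names and the use of a name pointer as the
+``inode exists'' marker are simplifying assumptions; subdirectories
+and hidden entries are left out:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define PX_NINODES 8    /* hypothetical table size */

/* Global side: slot i is non-NULL while inode i exists. */
static _Atomic(const char *) px_inode_array[PX_NINODES];
static atomic_uint px_generation;

/* Per-mount side: a private view plus the generation it reflects. */
struct px_mount {
    const char *dirent[PX_NINODES];
    unsigned generation;
};

static void
px_populate(struct px_mount *mp)
{
    unsigned gen;

    assert(mp != NULL);
    /* "while", not "if": the global state may change during the scan. */
    while (mp->generation != (gen = atomic_load(&px_generation))) {
        for (int i = 0; i < PX_NINODES; i++) {
            const char *name = atomic_load(&px_inode_array[i]);

            if (name != NULL && mp->dirent[i] == NULL)
                mp->dirent[i] = name;   /* create directory entry */
            else if (name == NULL && mp->dirent[i] != NULL)
                mp->dirent[i] = NULL;   /* remove directory entry */
        }
        mp->generation = gen;   /* the view now reflects this generation */
    }
}
```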
+.lp
+Access to the global DEVFS inode table is again implemented
+with atomic instructions and failsafe retries to avoid the
+need for locking.
+.lp
+From a performance point of view this scheme also means that a particular
+DEVFS mountpoint is not updated until it needs to be, and then always by
+a process belonging to the jail in question thus minimising and
+distributing the CPU load.
+.sh 1 "Device-driver impact"
+.lp
+All these changes have had a significant impact on how device drivers
+interact with the rest of the kernel regarding registration of
+devices.
+.lp
+If we look first at the ``before'' image in Fig. 3, we notice first
+the NFOO define which imposes a firm upper limit on the number of
+devices the kernel can deal with.
+Also notice that the softc structure for all of them is allocated
+at compile time.
+This is because most device drivers (and texts on writing device
+drivers) are from before the general
+kernel malloc facility [Mckusick1988] was introduced into the BSD kernel.
+.lp
+.(b M
+.vs -3
+\fC\s-2
+#ifndef NFOO
+# define NFOO 4
+#endif
+
+struct foo_softc {
+    ...
+} foo_softc[NFOO];
+
+int nfoo = 0;
+
+foo_open(dev, ...)
+{
+    int unit = minor(dev);
+    struct foo_softc *sc;
+
+    if (unit >= NFOO || unit >= nfoo)
+        return (ENXIO);
+
+    sc = &foo_softc[unit];
+
+    ...
+}
+
+foo_attach(...)
+{
+    struct foo_softc *sc;
+    static int once;
+
+    ...
+    if (nfoo >= NFOO) {
+        /* Have hardware, can't handle */
+        return (-1);
+    }
+    sc = &foo_softc[nfoo++];
+    if (!once) {
+        cdevsw_add(&cdevsw);
+        once++;
+    }
+    ...
+}
+\fP\s+2
+Fig. 3 - Device-driver, old style.
+.vs +3
+.)b
+.lp
+Also notice how range checking is needed to make sure that the
+minor# is within range. This code gets more complex if device-numbering
+is sparse. Code equivalent to that shown in the foo_open() routine
+would also be needed in foo_read(), foo_write(), foo_ioctl() &c.
+.lp
+Finally notice how the attach routine needs to remember to register
+the cdevsw structure (not shown) when the first device is found.
+.lp
+Now, compare this to our ``after'' image in Fig. 4.
+NFOO is totally gone and so is the compile time allocation
+of space for softc structures.
+.lp
+The foo_open (and foo_close, foo_ioctl &c) functions can now
+derive the softc pointer directly from the dev_t they receive
+as an argument.
+.lp
+.(b M
+.vs -3
+\fC\s-2
+struct foo_softc {
+    ...
+};
+
+int nfoo;
+
+foo_open(dev, ...)
+{
+    struct foo_softc *sc = dev->si_drv1;
+
+    ...
+}
+
+foo_attach(...)
+{
+    struct foo_softc *sc;
+
+    ...
+    sc = MALLOC(..., M_ZERO);
+    if (sc == NULL) {
+        /* Have hardware, can't handle */
+        return (-1);
+    }
+    sc->dev = make_dev(&cdevsw, nfoo,
+        UID_ROOT, GID_WHEEL, 0644,
+        "foo%d", nfoo);
+    nfoo++;
+    sc->dev->si_drv1 = sc;
+    ...
+}
+\fP\s+2
+Fig. 4 - Device-driver, new style.
+.vs +3
+.)b
+.lp
+In foo_attach() we can now attach to all the devices we can
+allocate memory for and we register the cdevsw structure per
+dev_t rather than globally.
+.lp
+This last trick is what allows us to discard all bounds checking
+in the foo_open() &c. routines, because they can only be
+called through the cdevsw, and the cdevsw is only attached to
+dev_t's which foo_attach() has created.
+There is no way to end
+up in foo_open() with a dev_t not created by foo_attach().
+.lp
+In the two examples here, the difference is only 10 lines of source
+code, primarily because only one of the worker functions of the
+device driver is shown.
+In real device drivers it is not uncommon to save 50 or more lines
+of source code which typically is about a percent or two of the
+entire driver.
+.sh 1 "Future work"
+.lp
+Apart from some minor issues to be cleaned up, DEVFS is now a reality
+and future work therefore is likely to concentrate on applying the
+facilities and functionality of DEVFS to FreeBSD.
+.sh 2 "devd"
+.lp
+It would be logical to complement DEVFS with a ``device-daemon'' which
+could configure and de-configure devices as they come and go.
+When a disk appears, mount it.
+When a network interface appears, configure it.
+And in some configurable way allow the user to customise the action,
+so that for instance images will automatically be copied off the
+flash-based media from a camera, &c.
+.lp
+In this context it is good to question how we view dynamic devices.
+If for instance a printer is removed in the middle of a print job
+and another printer arrives a moment later, should the system
+automatically continue the print job on this new printer?
+When a disk-like device arrives, should we always mount it? Should
+we have a database of known disk-like devices to tell us where to
+mount it, what permissions to give the mountpoint?
+Some computers come in multiple configurations, for instance laptops
+with and without their docking station. How do we want to present
+this to the users and what behaviour do the users expect?
+.sh 2 "Pathname length limitations"
+.lp
+In order to simplify memory management in the early stages of boot,
+the pathname relative to the mountpoint is presently stored in a
+small fixed size buffer inside struct specinfo.
+It should be possible to use filenames as long as the system otherwise
+permits, so some kind of extension mechanism is called for.
+.lp
+Since it cannot be guaranteed that memory can be allocated in
+all the possible scenarios where make_dev() can be called, it may
+be necessary to mandate that the caller allocates the buffer if
+the content will not fit inside the default buffer size.
+.sh 2 "Initial access parameter selection"
+.lp
+As it is now, device drivers propose the initial mode, owner and group
+for the device nodes, but it would be more flexible if it were possible
+to give the kernel a set of rules, much like packet filtering rules,
+which allow the user to set the wanted policy for new devices.
+Such a mechanism could also be used to filter new devices for mount
+points in jails and to determine other behaviour.
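+.lp
+Such a rule set might look roughly like the following C sketch, which
+assumes shell-style patterns matched with the POSIX fnmatch(3)
+function and a first-match-wins policy. The rule format and all rx_
+names are hypothetical, since no such mechanism exists at the time
+of writing:

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical rule: a pattern on the node name plus the policy. */
struct rx_rule {
    const char *pattern;
    uid_t uid;
    gid_t gid;
    mode_t mode;
};

struct rx_perm {
    uid_t uid;
    gid_t gid;
    mode_t mode;
};

/* First matching rule wins; otherwise keep the driver's proposal. */
static struct rx_perm
rx_apply(const struct rx_rule *rules, size_t n, const char *name,
    struct rx_perm proposed)
{
    assert(name != NULL);
    for (size_t i = 0; i < n; i++) {
        if (fnmatch(rules[i].pattern, name, 0) == 0) {
            struct rx_perm p = {
                rules[i].uid, rules[i].gid, rules[i].mode
            };
            return (p);
        }
    }
    return (proposed);
}
```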
+.lp
+Doing these things from userland results in some awkward race conditions
+and software bloat for embedded systems, so a kernel approach may be more
+suitable.
+.sh 2 "Applications of on-demand device creation"
+.lp
+The facility for on-demand creation of devices has some very interesting
+possibilities.
+.lp
+One planned use is to enable user-controlled labelling
+of disks.
+Today disks have names like /dev/da0, /dev/ad4, but since
+this numbering is topological any change in the hardware configuration
+may rename the disks, causing /etc/fstab and backup procedures
+to get out of sync with the hardware.
+.lp
+The current idea is to store on the media of the disk a user-chosen
+disk name and allow access through this name, so that for instance
+/dev/mydisk0
+would be a symlink to whatever topological name the disk might have
+at any given time.
+.lp
+To simplify this and to avoid a forest of symlinks, it will probably
+be decided to move all the sub-divisions of a disk into one subdirectory
+per disk so just a single symlink can do the job.
+In practice that means that the current /dev/ad0s2f will become
+something like /dev/ad0/s2f and so on.
+Obviously, in the same way, disks could also be accessed by their
+topological address, down to the specific path in a SAN environment.
+.lp
+Another potential use could be for automated offline data media libraries.
+It would be quite trivial to make it possible to access all the media
+in the library using /dev/lib/$LABEL which would be a remarkable
+simplification compared with most current automated retrieval facilities.
+.lp
+Another use could be to access devices by parameter rather than by
+name. One could imagine sending a printjob to /dev/printer/color/A2
+and behind the scenes a search would be made for a device with the
+correct properties and paper-handling facilities.
+.sh 1 "Conclusion"
+.lp
+DEVFS has been successfully implemented in FreeBSD,
+including a powerful, simple and flexible solution supporting
+pseudo-devices and on-demand device node creation.
+.lp
+Contrary to the trend, the implementation added functionality
+with a net decrease in source lines,
+primarily because of the improved API seen from the device drivers' point of view.
+.lp
+Even if DEVFS is not desired, other 4.4BSD-derived UNIX variants
+would probably benefit from adopting the dev_t/specinfo related
+cleanup.
+.sh 1 "Acknowledgements"
+.lp
+I first got started on DEVFS in 1989 because the abysmal performance
+of the Olivetti M250 computer forced me to implement a network-disk-device
+for Minix in order to retain my sanity.
+That initial work led to a
+crude but working DEVFS for Minix, so obviously both Andrew Tanenbaum
+and Olivetti deserve credit for inspiration.
+.lp
+Julian Elischer implemented a DEVFS for FreeBSD around 1994 which never
+quite made it to maturity and subsequently was abandoned.
+.lp
+Bruce Evans deserves special credit not only for his keen eye for detail
+and his competent criticism, but also for his enthusiastic resistance to the
+very concept of DEVFS.
+.lp
+Many thanks to the people who took time to help me stamp out ``Danglish''
+through their reviews and comments: Chris Demetriou, Paul Richards,
+Brian Somers, Nik Clayton, and Hanne Munkholm.
+Any remaining insults to the proper use of the English language are my own fault.
+.\" (list & why)
+.sh 1 "References"
+.lp
+[44BSDBook]
+McKusick, Bostic, Karels & Quarterman:
+``The Design and Implementation of the 4.4BSD Operating System.''
+Addison-Wesley, 1996, ISBN 0-201-54979-4.
+.lp
+[Heidemann91a]
+John S. Heidemann:
+``Stackable layers: an architecture for filesystem development.''
+Master's thesis, University of California, Los Angeles, July 1991.
+Available as UCLA technical report CSD-910056.
+.lp
+[Kamp2000]
+Poul-Henning Kamp and Robert N. M. Watson:
+``Confining the Omnipotent root.''
+Proceedings of the SANE 2000 Conference.
+Available in FreeBSD distributions in \fC/usr/share/papers\fP.
+.lp
+[MD.C]
+Poul-Henning Kamp et al:
+FreeBSD memory disk driver:
+\fCsrc/sys/dev/md/md.c\fP
+.lp
+[Mckusick1988]
+Marshall Kirk McKusick, Mike J. Karels:
+``Design of a General Purpose Memory Allocator for the 4.3BSD UNIX Kernel''
+Proceedings of the San Francisco USENIX Conference, pp. 295-303, June 1988.
+.lp
+[Mckusick1999]
+Dr. Marshall Kirk McKusick:
+Private email communication.
+\fI``According to the SCCS logs, the chroot call was added by Bill Joy
+on March 18, 1982 approximately 1.5 years before 4.2BSD was released.
+That was well before we had ftp servers of any sort (ftp did not
+show up in the source tree until January 1983). My best guess as
+to its purpose was to allow Bill to chroot into the /4.2BSD build
+directory and build a system using only the files, include files,
+etc contained in that tree. That was the only use of chroot that
+I remember from the early days.''\fP
+.lp
+[Mckusick2000]
+Dr. Marshall Kirk McKusick:
+Private communication at BSDcon2000 conference.
+\fI``I have not used block devices since I wrote FFS and that
+was \fPmany\fI years ago.''\fP
+.lp
+[NewBus]
+NewBus is a subsystem which provides most of the glue between
+hardware and device drivers. Despite its importance,
+no good overview documentation has ever been published
+for it.
+The following article by Alexander Langer in ``Dæmonnews'' is
+the best reference I can come up with:
+\fC\s-2http://www.daemonnews.org/200007/newbus-intro.html\fP\s+2
+.lp
+[Pike2000]
+Rob Pike:
+``Systems Software Research is Irrelevant.''
+\fC\s-2http://www.cs.bell\-labs.com/who/rob/utah2000.pdf\fP\s+2
+.lp
+[Pike90a]
+Rob Pike, Dave Presotto, Ken Thompson and Howard Trickey:
+``Plan 9 from Bell Labs.''
+Proceedings of the Summer 1990 UKUUG Conference.
+.lp
+[Pike92a]
+Rob Pike, Dave Presotto, Ken Thompson, Howard Trickey and Phil Winterbottom:
+``The Use of Name Spaces in Plan 9.''
+Proceedings of the 5th ACM SIGOPS Workshop.
+.lp
+[Raspe1785]
+Rudolf Erich Raspe:
+``Baron Münchhausen's Narrative of his marvellous Travels and Campaigns in Russia.''
+Kearsley, 1785.
+.lp
+[Ritchie74]
+D.M. Ritchie and K. Thompson:
+``The UNIX Time-Sharing System''
+Communications of the ACM, Vol. 17, No. 7, July 1974.
+.lp
+[Ritchie98]
+Dennis Ritchie: private conversation at the USENIX Annual Technical Conference,
+New Orleans, 1998.
+.lp
+[Thompson78]
+Ken Thompson:
+``UNIX Implementation''
+The Bell System Technical Journal, vol 57, 1978, number 6 (part 2) p. 1931ff.
diff --git a/share/doc/papers/diskperf/Makefile b/share/doc/papers/diskperf/Makefile
new file mode 100644
index 000000000000..7f7670c45533
--- /dev/null
+++ b/share/doc/papers/diskperf/Makefile
@@ -0,0 +1,11 @@
+# From: @(#)Makefile 6.3 (Berkeley) 6/8/93
+# $FreeBSD$
+
+VOLUME= papers
+DOC= diskperf
+SRCS= abs.ms motivation.ms equip.ms methodology.ms tests.ms \
+ results.ms conclusions.ms appendix.ms
+MACROS= -ms
+USE_TBL=
+
+.include <bsd.doc.mk>
diff --git a/share/doc/papers/diskperf/abs.ms b/share/doc/papers/diskperf/abs.ms
new file mode 100644
index 000000000000..a61104d5de48
--- /dev/null
+++ b/share/doc/papers/diskperf/abs.ms
@@ -0,0 +1,176 @@
+.\" Copyright (c) 1983 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)abs.ms 6.2 (Berkeley) 4/16/91
+.\"
+.if n .ND
+.TL
+Performance Effects of Disk Subsystem Choices
+for VAX\(dg Systems Running 4.2BSD UNIX*
+.sp
+Revised July 27, 1983
+.AU
+Bob Kridle
+.AI
+mt Xinu
+2560 9th Street
+Suite #312
+Berkeley, California 94710
+.AU
+Marshall Kirk McKusick\(dd
+.AI
+Computer Systems Research Group
+Computer Science Division
+Department of Electrical Engineering and Computer Science
+University of California, Berkeley
+Berkeley, CA 94720
+.AB
+.FS
+\(dgVAX, UNIBUS, and MASSBUS are trademarks of Digital Equipment Corporation.
+.FE
+.FS
+* UNIX is a trademark of Bell Laboratories.
+.FE
+.FS
+\(ddThis work was supported by
+the National Science Foundation under grant MCS80-05144,
+and the Defense Advanced Research Projects Agency (DoD) under
+Arpa Order No. 4031 monitored by Naval Electronic Systems Command under
+Contract No. N00039-82-C-0235.
+.FE
+Measurements were made of the UNIX file system
+throughput for various I/O operations using the most attractive currently
+available Winchester disks and controllers attached to both the
+native busses (SBI/CMI) and the UNIBUS on both VAX 11/780s and VAX 11/750s.
+The tests were designed to highlight the performance of single
+and dual drive subsystems operating in the 4.2BSD
+.I
+fast file system
+.R
+environment.
+Many of the results of the tests were initially counter-intuitive
+and revealed several important aspects of the VAX implementations
+which were surprising to us.
+.PP
+The hardware used included two Fujitsu 2351A
+``Eagle''
+disk drives on each of two foreign-vendor disk controllers
+and two DEC RA-81 disk drives on a DEC UDA-50 disk controller.
+The foreign-vendor controllers were Emulex SC750, SC780
+and Systems Industries 9900 native bus interfaced controllers.
+The DEC UDA-50 controller is a UNIBUS interfaced, heavily buffered
+controller which is the first implementation of a new DEC storage
+system architecture, DSA.
+.PP
+One of the most important results of our testing was the correction
+of several timing parameters in our device handler for devices
+with an RH750/RH780 type interface and having high burst transfer
+rates.
+The correction of these parameters resulted in an increase in
+performance of over twenty percent in some cases.
+In addition, one of the controller manufacturers altered their bus
+arbitration scheme to produce another increase in throughput.
+.AE
+.LP
+.de PT
+.lt \\n(LLu
+.pc %
+.nr PN \\n%
+.tl '\\*(LH'\\*(CH'\\*(RH'
+.lt \\n(.lu
+..
+.af PN i
+.ds LH Performance
+.ds RH Contents
+.bp 1
+.\".if t .ds CF July 27, 1983
+.\".if t .ds LF CSRG TR/8
+.\".if t .ds RF Kridle, et. al.
+.ce
+.B "TABLE OF CONTENTS"
+.LP
+.sp 1
+.nf
+.B "1. Motivation"
+.LP
+.sp .5v
+.nf
+.B "2. Equipment"
+2.1. DEC UDA50 disk controller
+2.2. Emulex SC750/SC780 disk controllers
+2.3. Systems Industries 9900 disk controller
+2.4. DEC RA81 disk drives
+2.5. Fujitsu 2351A disk drives
+.LP
+.sp .5v
+.nf
+.B "3. Methodology"
+.LP
+.sp .5v
+.nf
+.B "4. Tests"
+.LP
+.sp .5v
+.nf
+.B "5. Results"
+.LP
+.sp .5v
+.nf
+.B "6. Conclusions"
+.LP
+.sp .5v
+.nf
+.B Acknowledgements
+.LP
+.sp .5v
+.nf
+.B References
+.LP
+.sp .5v
+.nf
+.B "Appendix A"
+A.1. read_8192
+A.2. write_4096
+A.3. write_8192
+A.4. rewrite_8192
+.ds RH Motivation
+.af PN 1
+.bp 1
+.de _d
+.if t .ta .6i 2.1i 2.6i
+.\" 2.94 went to 2.6, 3.64 to 3.30
+.if n .ta .84i 2.6i 3.30i
+..
+.de _f
+.if t .ta .5i 1.25i 2.5i
+.\" 3.5i went to 3.8i
+.if n .ta .7i 1.75i 3.8i
+..
diff --git a/share/doc/papers/diskperf/appendix.ms b/share/doc/papers/diskperf/appendix.ms
new file mode 100644
index 000000000000..e059249e4143
--- /dev/null
+++ b/share/doc/papers/diskperf/appendix.ms
@@ -0,0 +1,102 @@
+.\" Copyright (c) 1983 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)appendix.ms 6.2 (Berkeley) 4/16/91
+.\"
+.\" .nr H2 1
+.ds RH Appendix A
+.NH
+\s+2Appendix A\s0
+.NH 2
+read_8192
+.PP
+.DS
+#define BUFSIZ 8192
+main( argc, argv)
+char **argv;
+{
+ char buf[BUFSIZ];
+ int i, j;
+
+ j = open(argv[1], 0);
+ for (i = 0; i < 1024; i++)
+ read(j, buf, BUFSIZ);
+}
+.DE
+.NH 2
+write_4096
+.PP
+.DS
+#define BUFSIZ 4096
+main( argc, argv)
+char **argv;
+{
+ char buf[BUFSIZ];
+ int i, j;
+
+ j = creat(argv[1], 0666);
+ for (i = 0; i < 2048; i++)
+ write(j, buf, BUFSIZ);
+}
+.DE
+.NH 2
+write_8192
+.PP
+.DS
+#define BUFSIZ 8192
+main( argc, argv)
+char **argv;
+{
+ char buf[BUFSIZ];
+ int i, j;
+
+ j = creat(argv[1], 0666);
+ for (i = 0; i < 1024; i++)
+ write(j, buf, BUFSIZ);
+}
+.DE
+.bp
+.NH 2
+rewrite_8192
+.PP
+.DS
+#define BUFSIZ 8192
+main( argc, argv)
+char **argv;
+{
+ char buf[BUFSIZ];
+ int i, j;
+
+ j = open(argv[1], 2);
+ for (i = 0; i < 1024; i++)
+ write(j, buf, BUFSIZ);
+}
+.DE
diff --git a/share/doc/papers/diskperf/conclusions.ms b/share/doc/papers/diskperf/conclusions.ms
new file mode 100644
index 000000000000..9e20f1a64708
--- /dev/null
+++ b/share/doc/papers/diskperf/conclusions.ms
@@ -0,0 +1,128 @@
+.\" Copyright (c) 1983 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)conclusions.ms 6.2 (Berkeley) 4/16/91
+.\" $FreeBSD$
+.\"
+.ds RH Conclusions
+.NH
+Conclusions
+.PP
+Peak available throughput is only one criterion
+in most storage system purchasing decisions.
+Most of the VAX UNIX systems we are familiar with
+are not I/O bandwidth constrained.
+Nevertheless, an adequate disk bandwidth is necessary for
+good performance and especially to preserve snappy
+response time.
+All of the disk systems we tested provide more than
+adequate bandwidth for typical VAX UNIX system applications.
+Perhaps in some I/O-intensive applications such as
+image processing, more consideration should be given
+to the peak throughput available.
+In most situations, we feel that other factors are more
+important in making a storage choice between the systems we
+tested.
+Cost, reliability, availability, and support are some of these
+factors.
+The maturity of the technology purchased must also be weighed
+against the future value and expandability of newer technologies.
+.PP
+Two important conclusions about storage systems in general
+can be drawn from these tests.
+The first is that buffering can be effective in smoothing
+the effects of lower bus speeds and bus contention.
+Even though the UDA50 is located on the relatively slow
+UNIBUS, its performance is similar to controllers located on
+the faster processor busses.
+However, the SC780 with only one sector of buffering shows that
+little buffering is needed if the underlying bus is fast enough.
+.PP
+Placing more intelligence in the controller seems to hinder UNIX system
+performance more than it helps.
+Our profiling tests have indicated that UNIX spends about
+the same percentage of time in the SC780 driver and the UDA50 driver
+(about 10-14%).
+Normally UNIX uses a disk sort algorithm that separates reads and
+writes into two seek order queues.
+The read queue has priority over the write queue,
+since reads cause processes to block,
+while writes can be done asynchronously.
+This is particularly useful when generating large files,
+as it allows the disk allocator to read
+new disk maps and begin doing new allocations
+while the blocks allocated out of the previous map are written to disk.
+Because the UDA50 handles all block ordering,
+and because it keeps all requests in a single queue,
+there is no way to force the longer seek needed to get the next disk map.
+This dysfunction causes all the writes to be done before the disk map read,
+which idles the disk until a new set of blocks can be allocated.
+.PP
+The additional functionality of the UDA50 controller that allows it
+to transfer from two drives simultaneously tends to make
+the two drive transfer tests run much more effectively.
+Tuning for the single drive case works more effectively in the two
+drive case than when controllers that cannot handle simultaneous
+transfers are used.
+.ds RH Acknowledgements
+.nr H2 1
+.sp 1
+.NH
+\s+2Acknowledgements\s0
+.PP
+We thank Paul Massigilia and Bill Grace
+of Digital Equipment Corp for helping us run our
+disk tests on their UDA50/RA81.
+We also thank Rich Notari and Paul Ritkowski
+of Emulex for making their machines available
+to us to run our tests of the SC780/Eagles.
+Dan McKinster, then of Systems Industries,
+arranged to make their equipment available for the tests.
+We appreciate the time provided by Bob Gross, Joe Wolf, and
+Sam Leffler on their machines to refine our benchmarks.
+Finally we thank our sponsors,
+the National Science Foundation under grant MCS80-05144,
+and the Defense Advanced Research Projects Agency (DoD) under
+Arpa Order No. 4031 monitored by Naval Electronic Systems Command under
+Contract No. N00039-82-C-0235.
+.ds RH References
+.nr H2 1
+.sp 1
+.NH
+\s+2References\s0
+.LP
+.IP [McKusick83] 20
+M. McKusick, W. Joy, S. Leffler, R. Fabry,
+``A Fast File System for UNIX'',
+\fIACM Transactions on Computer Systems 2\fP, 3,
+pp. 181-197, August 1984.
+.ds RH Appendix A
+.bp
diff --git a/share/doc/papers/diskperf/equip.ms b/share/doc/papers/diskperf/equip.ms
new file mode 100644
index 000000000000..264ea0494737
--- /dev/null
+++ b/share/doc/papers/diskperf/equip.ms
@@ -0,0 +1,177 @@
+.\" Copyright (c) 1983 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)equip.ms 6.2 (Berkeley) 4/16/91
+.\"
+.ds RH Equipment
+.NH
+Equipment
+.PP
+Various combinations of the three manufacturers' disk controllers
+and two pairs of Winchester disk drives were tested on both
+VAX 11/780 and VAX 11/750 CPUs. The Emulex and Systems Industries
+disk controllers were interfaced to Fujitsu 2351A
+``Eagle''
+404 Megabyte disk drives.
+The DEC UDA50 disk controller was interfaced to two DEC RA81
+456 Megabyte Winchester disk drives.
+All three controllers were tested on the VAX 11/780 although
+only the Emulex and DEC controllers were benchmarked on the VAX 11/750.
+Systems Industries makes a VAX 11/750 CMI interface for
+their controller, but we did not have time to test this device.
+In addition, not all the storage systems were tested for
+two drive throughput.
+Each of the controllers and disk drives used in the benchmarks
+is described briefly below.
+.NH 2
+DEC UDA50 disk controller
+.PP
+This is a new controller design which is part of a larger, long range
+storage architecture referred to as
+``DSA''
+or \fBD\fRigital \fBS\fRtorage \fBA\fRrchitecture.
+An important aspect of DSA is migrating a large part
+of the storage management previously handled in the operating
+system to the storage system. Thus, the UDA50 is a much more
+intelligent controller than previous interfaces like the RH750 or
+RH780.
+The UDA50 handles all error correction.
+It also deals with most of the physical storage parameters.
+Typically, system software requests a logical block or
+sequence of blocks.
+The physical locations of these blocks,
+their head, track, and cylinder indices,
+are determined by the controller.
+The UDA50 also orders disk requests to maximize throughput
+where possible, minimizing total seek and rotational delays.
+Where multiple drives are attached to a single controller,
+the UDA50 can interleave
+simultaneous
+data transfers from multiple drives.
+.PP
+The UDA50 is a UNIBUS implementation of a DSA controller.
+It contains 52 sectors of internal buffering to minimize
+the effects of a slow UNIBUS such as the one on the VAX-11/780.
+This buffering also minimizes the effects of contention with
+other UNIBUS peripherals.
+.NH 2
+Emulex SC750/SC780 disk controllers
+.PP
+These two models of the same controller interface to the CMI bus
+of a VAX 11/750 and the SBI bus of a VAX 11/780, respectively.
+To the operating system, they emulate either an RH750 or
+an RH780.
+The controllers install in the
+MASSBUS
+locations in the CPU cabinets and operate from the
+VAX power supplies.
+They provide an
+``SMD''
+or \fBS\fRtorage \fBM\fRodule \fBD\fRrive
+interface to the disk drives.
+Although a large number of disk drives use this interface, we tested
+the controller exclusively connected to Fujitsu 2351A disks.
+.PP
+The controller was first implemented for the VAX-11/750 as the SC750
+model several years ago. Although the SC780 was introduced more
+recently, both are stable products with no bugs known to us.
+.NH 2
+Systems Industries 9900 disk controller
+.PP
+This controller is an evolution of the S.I. 9400 first introduced
+as a UNIBUS SMD interface.
+The 9900 has been enhanced to include an interface to the VAX 11/780 native
+bus, the SBI.
+It has also been upgraded to operate with higher data rate drives such
+as the Fujitsu 2351As we used in this test.
+The controller is contained in its own rack-mounted drawer with an integral
+power supply.
+The interface to the SMD is a four module set which mounts in a
+CPU cabinet slot normally occupied by an RH780.
+The SBI interface derives power from the VAX CPU cabinet power
+supplies.
+.NH 2
+DEC RA81 disk drives
+.PP
+The RA81 is a rack-mountable 456 Megabyte (formatted) Winchester
+disk drive manufactured by DEC.
+It includes a great deal of technology which is an integral part
+of the DEC \fBDSA\fR scheme.
+The novel technology includes a serial packet based communications
+protocol with the controller over a pair of mini-coaxial cables.
+The physical characteristics of the RA81 are shown in the
+table below:
+.DS
+.TS
+box,center,tab(@);
+c s
+l l.
+DEC RA81 Disk Drive Characteristics
+_
+Peak Transfer Rate@2.2 Mbytes/sec.
+Rotational Speed@3,600 RPM
+Data Sectors/Track@51
+Logical Cylinders@1,248
+Logical Data Heads@14
+Data Capacity@456 Mbytes
+Minimum Seek Time@6 milliseconds
+Average Seek Time@28 milliseconds
+Maximum Seek Time@52 milliseconds
+.TE
+.DE
+.NH 2
+Fujitsu 2351A disk drives
+.PP
+The Fujitsu 2351A disk drive is a Winchester disk drive
+with an SMD controller interface.
+Fujitsu has developed a very good reputation for
+reliable storage products over the last several years.
+The 2351A has the following physical characteristics:
+.DS
+.TS
+box,center,tab(@);
+c s
+l l.
+Fujitsu 2351A Disk Drive Characteristics
+_
+Peak Transfer Rate@1.859 Mbytes/sec.
+Rotational Speed@3,961 RPM
+Data Sectors/Track@48
+Cylinders@842
+Data Heads@20
+Data Capacity@404 Mbytes
+Minimum Seek Time@5 milliseconds
+Average Seek Time@18 milliseconds
+Maximum Seek Time@35 milliseconds
+.TE
+.DE
+.ds RH Methodology
+.bp
diff --git a/share/doc/papers/diskperf/methodology.ms b/share/doc/papers/diskperf/methodology.ms
new file mode 100644
index 000000000000..703d7b6f0545
--- /dev/null
+++ b/share/doc/papers/diskperf/methodology.ms
@@ -0,0 +1,111 @@
+.\" Copyright (c) 1983 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)methodology.ms 6.2 (Berkeley) 4/16/91
+.\"
+.ds RH Methodology
+.NH
+Methodology
+.PP
+Our goal was to evaluate the performance of the target peripherals
+in an environment as much like our 4.2BSD UNIX systems as possible.
+There are two basic approaches to creating this kind of test environment.
+These might be termed the \fIindirect\fR and the \fIdirect\fR approach.
+The approach used by DEC in producing most of the performance data
+on the UDA50/RA81 system under VMS is what we term the indirect
+approach.
+We chose to use the direct approach.
+.PP
+The indirect approach used by DEC involves two steps.
+First, the environment in which performance is to be evaluated
+is parameterized.
+In this case, the disk I/O characteristics of VMS were measured
+as to the distribution of various sizes of accesses and the proportion
+of reads and writes.
+This parameterization of
+typical
+I/O activity was termed a
+``vax mix.''
+The second stage involves simulating this mixture of I/O activities
+with the devices to be tested and noting the total volume of transactions
+processed per unit time by each system.
+.PP
+The problems encountered with this indirect approach often
+have to do with the completeness and correctness of the parameterization
+of the context environment.
+For example, the
+``vax mix''
+model constructed for DEC's tests uses a random distribution of seeks
+to the blocks read or written.
+It is not likely that any real system produces a distribution
+of disk transfer locations which is truly random and does not
+exhibit strong locality characteristics.
+.PP
+The methodology chosen by us is direct
+in the sense that it uses the standard structured file system mechanism present
+in the 4.2BSD UNIX operating system to create the sequence of locations
+and sizes of reads and writes to the benchmarked equipment.
+We simply create, write, and read
+files as they would be by users' activities.
+The disk space allocation and disk caching mechanism built into
+UNIX is used to produce the actual device reads and writes as well
+as to determine their size and location on the disk.
+We measure and compare the rate at which these
+.I
+user files
+.R
+can be written, rewritten, or read.
+.PP
+The advantage of this approach is the implicit accuracy in
+testing in the same environment in which the peripheral
+will be used.
+Although this system does not account for the I/O produced
+by some paging and swapping, in our memory rich environment
+these activities account for a relatively small portion
+of the total disk activity.
+.PP
+A more significant disadvantage to the direct approach
+is the occasional difficulty we have in accounting for our
+measured results.
+The apparently straightforward activity of reading or writing a logical file
+on disk can produce a complex mixture of disk traffic.
+File I/O is supported by a file management system that
+buffers disk traffic through an internal cache,
+which allows writes to be handled asynchronously.
+Reads must be done synchronously;
+however, this restriction is moderated by the use of read-ahead.
+Small changes in the performance of the disk controller
+subsystem can result in large and unexpected
+changes in the file system performance,
+as they may change the characteristics of the memory contention
+experienced by the processor.
+.ds RH Tests
+.bp
diff --git a/share/doc/papers/diskperf/motivation.ms b/share/doc/papers/diskperf/motivation.ms
new file mode 100644
index 000000000000..d5fde9d1b933
--- /dev/null
+++ b/share/doc/papers/diskperf/motivation.ms
@@ -0,0 +1,95 @@
+.\" Copyright (c) 1983 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)motivation.ms 6.2 (Berkeley) 4/16/91
+.\"
+.\" $FreeBSD$
+.\"
+.ds RH Motivation
+.NH
+Motivation
+.PP
+These benchmarks were performed for several reasons.
+Foremost was our desire to obtain guidelines to aid
+in choosing one of the most expensive components of any
+VAX UNIX configuration, the disk storage system.
+The range of choices in this area has increased dramatically
+in the last year.
+DEC has become, with the introduction of the UDA50/RA81 system,
+cost competitive
+in the area of disk storage for the first time.
+Emulex's entry into the VAX 11/780 SBI controller
+field, the SC780, represented an important choice for us to examine, given
+our previous success with their VAX 11/750 SC750 controller and
+their UNIBUS controllers.
+The Fujitsu 2351A
+Winchester disk drive represents the lowest cost-per-byte disk storage
+known to us.
+In addition, Fujitsu's reputation for reliability was appealing.
+The many attractive aspects of these components justified a more
+careful examination of their performance under UNIX.
+.PP
+In addition to the direct motivation of developing an effective
+choice of storage systems, we hoped to gain more insight into
+VAX UNIX file system and I/O performance in general.
+What generic characteristics of I/O subsystems are most
+important?
+How important is the location of the controller on the SBI/CMI versus
+the UNIBUS?
+Is extensive buffering in the controller essential or even important?
+How much can be gained by putting more of the storage system
+management and optimization function in the controller as
+DEC does with the UDA50?
+.PP
+We also wanted to resolve particular speculation about the value of
+storage system optimization by a controller in a UNIX
+environment.
+Is the access optimization as effective as that already provided
+by the existing 4.2BSD UNIX device handlers for traditional disks?
+VMS disk handlers do no seek optimization.
+This gives the UDA50 controller an advantage over other controllers
+under VMS which is not likely to be as important to UNIX.
+Are there penalties associated with greater intelligence in the controller?
+.PP
+A third and last reason for evaluating this equipment is comparable
+to the proverbial mountain climber's answer when asked why he climbs
+a particular mountain,
+``It was there.''
+In our case the equipment
+was there.
+We were lucky enough to assemble all the desired disks and controllers
+and get them installed on a temporarily idle VAX 11/780.
+This got us started collecting data.
+Although many of the tests were later rerun on a variety of other systems,
+this initial test bed was essential for working out the testing bugs
+and getting our feet wet.
+.ds RH Equipment
+.bp
diff --git a/share/doc/papers/diskperf/results.ms b/share/doc/papers/diskperf/results.ms
new file mode 100644
index 000000000000..09f61a81824f
--- /dev/null
+++ b/share/doc/papers/diskperf/results.ms
@@ -0,0 +1,337 @@
+.\" Copyright (c) 1983 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)results.ms 6.2 (Berkeley) 4/16/91
+.\"
+.ds RH Results
+.NH
+Results
+.PP
+The following tables indicate the results of our
+test runs.
+Note that each table contains results for tests run
+on two varieties of 4.2BSD file systems.
+The first set of results is always for a file system
+with a basic blocking factor of eight Kilobytes and a
+fragment size of one Kilobyte. The second set of measurements
+is for file systems with a four Kilobyte block size and a
+one Kilobyte fragment size.
+The values in parentheses indicate the percentage of CPU
+time used by the test program.
+In the case of the two disk arm tests,
+the value in parentheses indicates the sum of the CPU percentages
+of the test programs that were run.
+Entries of ``n.m.'' indicate that the value was not measured.
+.DS
+.TS
+box,center;
+c s s s s
+c s s s s
+c s s s s
+l | l s | l s
+l | l s | l s
+l | l l | l l
+l | c c | c c.
+4.2BSD File Systems Tests - \fBVAX 11/750\fR
+=
+Logically Sequential Transfers
+from an \fB8K/1K\fR 4.2BSD File System (Kbytes/sec.)
+_
+Test	Emulex SC750/Eagle	UDA50/RA81
+
+	1 Drive	2 Drives	1 Drive	2 Drives
+_
+read_8192	490 (69%)	620 (96%)	310 (44%)	520 (65%)
+write_4096	380 (99%)	370 (99%)	370 (97%)	360 (98%)
+write_8192	470 (99%)	470 (99%)	320 (71%)	410 (83%)
+rewrite_8192	650 (99%)	620 (99%)	310 (50%)	450 (70%)
+=
+.T&
+c s s s s
+c s s s s
+l | l s | l s
+l | l s | l s
+l | l l | l l
+l | c c | c c.
+Logically Sequential Transfers
+from a \fB4K/1K\fR 4.2BSD File System (Kbytes/sec.)
+_
+Test	Emulex SC750/Eagle	UDA50/RA81
+
+	1 Drive	2 Drives	1 Drive	2 Drives
+_
+read_8192	300 (60%)	400 (84%)	210 (42%)	340 (77%)
+write_4096	320 (98%)	320 (98%)	220 (67%)	290 (99%)
+write_8192	340 (98%)	340 (99%)	220 (65%)	310 (98%)
+rewrite_8192	450 (99%)	450 (98%)	230 (47%)	340 (78%)
+.TE
+.DE
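+.PP
+The paper does not spell out how the table entries were derived from
+the raw timings.
+The following is a minimal sketch, assuming that the rate is the number
+of bytes moved divided by the elapsed time, and that the value in
+parentheses is user plus system time as a share of elapsed time.
+The function names are hypothetical and are not taken from the test programs.

```c
#include <assert.h>

/*
 * Transfer rate in Kbytes/sec for `bytes` moved in `elapsed_sec` seconds.
 * Assumption: 1 Kbyte = 1024 bytes.
 */
double kbytes_per_sec(long bytes, double elapsed_sec)
{
    assert(elapsed_sec > 0.0);
    return ((double)bytes / 1024.0) / elapsed_sec;
}

/*
 * Share of the elapsed time spent on the CPU (user plus system),
 * expressed as a percentage.
 */
double cpu_percent(double user_sec, double sys_sec, double elapsed_sec)
{
    assert(elapsed_sec > 0.0);
    return 100.0 * (user_sec + sys_sec) / elapsed_sec;
}
```

+Under these assumptions, a test that moves eight Megabytes in 16 seconds
+of elapsed time would be reported as 512 Kbytes/sec.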
+.PP
+Note that the rate of write operations on the VAX 11/750 is ultimately
+CPU limited in some cases.
+The write rates saturate the CPU at a lower bandwidth than the reads
+because they must do disk allocation in addition to moving the data
+from the user program to the disk.
+The UDA50/RA81 saturates the CPU at a lower transfer rate for a given
+operation than the SC750/Eagle because
+it causes more memory contention with the CPU.
+We do not know if this contention is caused by
+the UNIBUS controller or the UDA50.
+.PP
+The following table reports the results of test runs on a VAX 11/780
+with 4 Megabytes of main memory.
+.DS
+.TS
+box,center;
+c s s s s s s
+c s s s s s s
+c s s s s s s
+l | l s | l s | l s
+l | l s | l s | l s
+l | l l | l l | l l
+l | c c | c c | c c.
+4.2BSD File Systems Tests - \fBVAX 11/780\fR
+=
+Logically Sequential Transfers
+from an \fB8K/1K\fR 4.2BSD File System (Kbytes/sec.)
+_
+Test	Emulex SC780/Eagle	UDA50/RA81	Sys. Ind. 9900/Eagle
+
+	1 Drive	2 Drives	1 Drive	2 Drives	1 Drive	2 Drives
+_
+read_8192	560 (70%)	480 (58%)	360 (45%)	540 (72%)	340 (41%)	520 (66%)
+write_4096	440 (98%)	440 (98%)	380 (99%)	480 (96%)	490 (96%)	440 (84%)
+write_8192	490 (98%)	490 (98%)	220 (58%)*	480 (92%)	490 (80%)	430 (72%)
+rewrite_8192	760 (100%)	560 (72%)	220 (50%)*	180 (52%)*	490 (60%)	520 (62%)
+=
+.T&
+c s s s s s s
+c s s s s s s
+l | l s | l s | l s
+l | l s | l s | l s
+l | l l | l l | l l
+l | c c | c c | c c.
+Logically Sequential Transfers
+from a \fB4K/1K\fR 4.2BSD File System (Kbytes/sec.)
+_
+Test	Emulex SC780/Eagle	UDA50/RA81	Sys. Ind. 9900/Eagle
+
+	1 Drive	2 Drives	1 Drive	2 Drives	1 Drive	2 Drives
+_
+read_8192	490 (77%)	370 (66%)	n.m.	n.m.	200 (31%)	370 (56%)
+write_4096	380 (98%)	370 (98%)	n.m.	n.m.	200 (46%)	370 (88%)
+write_8192	380 (99%)	370 (97%)	n.m.	n.m.	200 (45%)	320 (76%)
+rewrite_8192	490 (87%)	350 (66%)	n.m.	n.m.	200 (31%)	300 (46%)
+.TE
+* the operation of the hardware was suspect during these tests.
+.DE
+.PP
+The dropoff in reading and writing rates for the two drive SC780/Eagle
+tests is probably due to the file system using insufficient
+rotational delay for these tests.
+We have not fully investigated these times.
+.PP
+The following table compares data rates on VAX 11/750s directly
+with those of VAX 11/780s using the UDA50/RA81 storage system.
+.DS
+.TS
+box,center;
+c s s s s
+c s s s s
+c s s s s
+l | l s | l s
+l | l s | l s
+l | l l | l l
+l | c c | c c.
+4.2BSD File Systems Tests - \fBDEC UDA50 - 750 vs. 780\fR
+=
+Logically Sequential Transfers
+from an \fB8K/1K\fR 4.2BSD File System (Kbytes/sec.)
+_
+Test	VAX 11/750 UNIBUS	VAX 11/780 UNIBUS
+
+	1 Drive	2 Drives	1 Drive	2 Drives
+_
+read_8192	310 (44%)	520 (84%)	360 (45%)	540 (72%)
+write_4096	370 (97%)	360 (100%)	380 (99%)	480 (96%)
+write_8192	320 (71%)	410 (96%)	220 (58%)*	480 (92%)
+rewrite_8192	310 (50%)	450 (80%)	220 (50%)*	180 (52%)*
+=
+.T&
+c s s s s
+c s s s s
+l | l s | l s
+l | l s | l s
+l | l l | l l
+l | c c | c c.
+Logically Sequential Transfers
+from a \fB4K/1K\fR 4.2BSD File System (Kbytes/sec.)
+_
+Test	VAX 11/750 UNIBUS	VAX 11/780 UNIBUS
+
+	1 Drive	2 Drives	1 Drive	2 Drives
+_
+read_8192	210 (42%)	342 (77%)	n.m.	n.m.
+write_4096	215 (67%)	294 (99%)	n.m.	n.m.
+write_8192	215 (65%)	305 (98%)	n.m.	n.m.
+rewrite_8192	227 (47%)	336 (78%)	n.m.	n.m.
+.TE
+* the operation of the hardware was suspect during these tests.
+.DE
+.PP
+The higher throughput available on VAX 11/780s is due to a number
+of factors.
+The larger main memory size allows a larger file system cache.
+The block allocation routines run faster, raising the upper limit
+on the data rates in writing new files.
+.PP
+The next table makes the same comparison using an Emulex controller
+on both systems.
+.DS
+.TS
+box, center;
+c s s s s
+c s s s s
+c s s s s
+l | l s | l s
+l | l s | l s
+l | l l | l l
+l | c c | c c.
+4.2BSD File Systems Tests - \fBEmulex - 750 vs. 780\fR
+=
+Logically Sequential Transfers
+from an \fB8K/1K\fR 4.2BSD File System (Kbytes/sec.)
+_
+Test	VAX 11/750 CMI Bus	VAX 11/780 SBI Bus
+
+	1 Drive	2 Drives	1 Drive	2 Drives
+_
+read_8192	490 (69%)	620 (96%)	560 (70%)	480 (58%)
+write_4096	380 (99%)	370 (99%)	440 (98%)	440 (98%)
+write_8192	470 (99%)	470 (99%)	490 (98%)	490 (98%)
+rewrite_8192	650 (99%)	620 (99%)	760 (100%)	560 (72%)
+=
+.T&
+c s s s s
+c s s s s
+l | l s | l s
+l | l s | l s
+l | l l | l l
+l | c c | c c.
+Logically Sequential Transfers
+from a \fB4K/1K\fR 4.2BSD File System (Kbytes/sec.)
+_
+Test	VAX 11/750 CMI Bus	VAX 11/780 SBI Bus
+
+	1 Drive	2 Drives	1 Drive	2 Drives
+_
+read_8192	300 (60%)	400 (84%)	490 (77%)	370 (66%)
+write_4096	320 (98%)	320 (98%)	380 (98%)	370 (98%)
+write_8192	340 (98%)	340 (99%)	380 (99%)	370 (97%)
+rewrite_8192	450 (99%)	450 (98%)	490 (87%)	350 (66%)
+.TE
+.DE
+.PP
+The following table illustrates the evolution of our testing
+process as both hardware and software problems affecting
+the performance of the Emulex SC780 were corrected.
+The software change was suggested to us by George Goble
+of Purdue University.
+.PP
+The 4.2BSD handler for RH750/RH780 interfaced disk drives
+contains several constants which determine how
+much time is provided between an interrupt signaling the completion
+of a positioning command and the subsequent start of a data transfer
+operation. These lead times are expressed as sectors of rotational delay.
+If they are too small, an extra complete rotation will often be required
+between a seek and subsequent read or write operation.
+The higher bit rate and rotational speed of the 2351A Fujitsu
+disk drives required
+increasing these constants.
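+.PP
+The relationship between a lead expressed in sectors and the time it
+actually gives the driver can be worked through as follows.
+This is only a sketch: the function names are hypothetical, and the
+rotational speeds and sector counts used with it are illustrative
+values, not figures from the 4.2BSD driver or from the drive
+specifications.

```c
#include <assert.h>

/* Time for one sector to pass under the head, in microseconds. */
double sector_time_us(double rpm, int sectors_per_track)
{
    assert(rpm > 0.0 && sectors_per_track > 0);
    double rotation_us = 60.0e6 / rpm;   /* one full revolution */
    return rotation_us / sectors_per_track;
}

/*
 * Time available between the positioning interrupt and the arrival
 * of the target sector, for a lead of `lead_sectors` sectors.
 */
double lead_time_us(double rpm, int sectors_per_track, int lead_sectors)
{
    assert(lead_sectors >= 0);
    return lead_sectors * sector_time_us(rpm, sectors_per_track);
}
```

+With these illustrative numbers, a lead of four sectors on a drive
+turning at 3000 RPM with 50 sectors per track gives 1600 microseconds.
+On a drive that spins faster or packs more sectors onto a track,
+the same four sectors give less time,
+so the constant must grow if the driver is to start the transfer
+before the target sector passes.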
+.PP
+The hardware change involved allowing for slightly longer
+delays in arbitrating for cycles on the SBI bus by
+starting the bus arbitration cycle a little further ahead of
+when the data was ready for transfer.
+Finally, we had to increase the rotational delay between consecutive
+blocks in the file because
+the higher bandwidth from the disk generated more memory contention,
+which slowed down the processor.
+.DS
+.TS
+box,center,expand;
+c s s s s s s
+c s s s s s s
+c s s s s s s
+l | l s | l s | l s
+l | l s | l s | l s
+l | l s | l s | l s
+l | c c | c c | c c
+l | c c | c c | c c.
+4.2BSD File Systems Tests - \fBEmulex SC780 Disk Controller Evolution\fR
+=
+Logically Sequential Transfers
+from an \fB8K/1K\fR 4.2BSD File System (Kbytes/sec.)
+_
+Test	Inadequate Search Lead	OK Search Lead	OK Search Lead
+	Initial SBI Arbitration	Init SBI Arb.	Improved SBI Arb.
+
+	1 Drive	2 Drives	1 Drive	2 Drives	1 Drive	2 Drives
+_
+read_8192	320	370	440 (60%)	n.m.	560 (70%)	480 (58%)
+write_4096	250	270	300 (63%)	n.m.	440 (98%)	440 (98%)
+write_8192	250	280	340 (60%)	n.m.	490 (98%)	490 (98%)
+rewrite_8192	250	290	380 (48%)	n.m.	760 (100%)	560 (72%)
+=
+.T&
+c s s s s s s
+c s s s s s s
+l | l s | l s | l s
+l | l s | l s | l s
+l | l s | l s | l s
+l | c c | c c | c c
+l | c c | c c | c c.
+Logically Sequential Transfers
+from a \fB4K/1K\fR 4.2BSD File System (Kbytes/sec.)
+_
+Test	Inadequate Search Lead	OK Search Lead	OK Search Lead
+	Initial SBI Arbitration	Init SBI Arb.	Improved SBI Arb.
+
+	1 Drive	2 Drives	1 Drive	2 Drives	1 Drive	2 Drives
+_
+read_8192	200	220	280	n.m.	490 (77%)	370 (66%)
+write_4096	180	190	300	n.m.	380 (98%)	370 (98%)
+write_8192	180	200	320	n.m.	380 (99%)	370 (97%)
+rewrite_8192	190	200	340	n.m.	490 (87%)	350 (66%)
+.TE
+.DE
+.ds RH Conclusions
+.bp
diff --git a/share/doc/papers/diskperf/tests.ms b/share/doc/papers/diskperf/tests.ms
new file mode 100644
index 000000000000..e9379311301c
--- /dev/null
+++ b/share/doc/papers/diskperf/tests.ms
@@ -0,0 +1,109 @@
+.\" Copyright (c) 1983 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)tests.ms 6.2 (Berkeley) 4/16/91
+.\" $FreeBSD$
+.\"
+.ds RH Tests
+.NH
+Tests
+.PP
+Our battery of tests consists of four programs,
+read_8192, write_8192, write_4096
+and rewrite_8192 originally written by [McKusick83]
+to evaluate the performance of the new file system in 4.2BSD.
+These programs all follow the same model and are typified by
+read_8192 shown here.
+.DS
+#define BUFSIZ 8192
+main(argc, argv)
+char **argv;
+{
+	char buf[BUFSIZ];
+	int i, j;
+
+	j = open(argv[1], 0);
+	for (i = 0; i < 1024; i++)
+		read(j, buf, BUFSIZ);
+}
+.DE
+The remaining programs are included in appendix A.
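+.PP
+For readers who want to try the same access pattern on a current system,
+the following is a sketch of the read loop in present-day C.
+It is not one of the original test programs:
+the function name read_blocks, its parameters, and its error handling
+are assumptions made for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Read up to `count` blocks of `block_size` bytes from `path`,
 * one read() call per block, discarding the data.
 * Returns the number of bytes actually read, or -1 on error.
 */
long read_blocks(const char *path, size_t block_size, int count)
{
    assert(block_size > 0 && count >= 0);

    char *buf = malloc(block_size);
    if (buf == NULL)
        return -1;

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    long total = 0;
    for (int i = 0; i < count; i++) {
        ssize_t n = read(fd, buf, block_size);
        if (n < 0) {        /* I/O error */
            total = -1;
            break;
        }
        if (n == 0)         /* end of file */
            break;
        total += n;
    }

    close(fd);
    free(buf);
    return total;
}
```

+A call such as read_blocks(path, 8192, 1024) corresponds to the loop in
+read_8192: 1024 reads of 8192 bytes each, or eight Megabytes in total.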
+.PP
+These programs read, write with two different blocking factors,
+and rewrite logical files in a structured file system on the disk
+under test.
+The write programs create new files while the rewrite program
+overwrites an existing file.
+Each of these programs represents an important segment of the
+typical UNIX file system activity with the read program
+representing by far the largest class and the rewrite the smallest.
+.PP
+A blocking factor of 8192 is used by all programs except write_4096.
+This is typical of most 4.2BSD user programs since a standard set of
+I/O support routines is commonly used and these routines buffer
+data in similar block sizes.
+.PP
+For each test run, an empty eight Kilobyte block
+file system was created in the target
+storage system.
+Then each of the four tests was run and timed.
+Each test was run three times:
+the first to clear out any useful data in the cache,
+and the second two to ensure that the experiment
+had stabilized and was repeatable.
+Each test operated on eight Megabytes of data to
+ensure that the cache did not overly influence the results.
+Another file system was then initialized using a
+basic blocking factor of four Kilobytes and the same tests
+were run again and timed.
+A command script for a run appears as follows:
+.DS
+#!/bin/csh
+set time=2
+echo "8K/1K file system"
+newfs /dev/rhp0g eagle
+mount /dev/hp0g /mnt0
+mkdir /mnt0/foo
+echo "write_8192 /mnt0/foo/tst2"
+rm -f /mnt0/foo/tst2
+write_8192 /mnt0/foo/tst2
+rm -f /mnt0/foo/tst2
+write_8192 /mnt0/foo/tst2
+rm -f /mnt0/foo/tst2
+write_8192 /mnt0/foo/tst2
+echo "read_8192 /mnt0/foo/tst2"
+read_8192 /mnt0/foo/tst2
+read_8192 /mnt0/foo/tst2
+read_8192 /mnt0/foo/tst2
+umount /dev/hp0g
+.DE
+.ds RH Results
+.bp
diff --git a/share/doc/papers/fsinterface/Makefile b/share/doc/papers/fsinterface/Makefile
new file mode 100644
index 000000000000..f11021b0d27c
--- /dev/null
+++ b/share/doc/papers/fsinterface/Makefile
@@ -0,0 +1,9 @@
+# From: @(#)Makefile 5.3 (Berkeley) 6/8/93
+# $FreeBSD$
+
+VOLUME= papers
+DOC= fsinterface
+SRCS= fsinterface.ms
+MACROS= -ms
+
+.include <bsd.doc.mk>
diff --git a/share/doc/papers/fsinterface/abstract.ms b/share/doc/papers/fsinterface/abstract.ms
new file mode 100644
index 000000000000..ab8b473170e1
--- /dev/null
+++ b/share/doc/papers/fsinterface/abstract.ms
@@ -0,0 +1,73 @@
+.\" Copyright (c) 1986 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)abstract.ms 5.2 (Berkeley) 4/16/91
+.\"
+.TL
+Toward a Compatible Filesystem Interface
+.AU
+Michael J. Karels
+Marshall Kirk McKusick
+.AI
+Computer Systems Research Group
+Computer Science Division
+Department of Electrical Engineering and Computer Science
+University of California, Berkeley
+Berkeley, California 94720
+.LP
+As network or remote filesystems have been implemented for
+.UX ,
+several stylized interfaces between the filesystem implementation
+and the rest of the kernel have been developed.
+Notable among these are Sun Microsystems' virtual filesystem interface
+using vnodes, Digital Equipment's Generic File System architecture,
+and AT&T's File System Switch.
+Each design attempts to isolate filesystem-dependent details
+below the generic interface and to provide a framework within which
+new filesystems may be incorporated.
+However, each of these interfaces is different from
+and incompatible with the others.
+Each of them addresses somewhat different design goals.
+Each was based upon a different starting version of
+.UX ,
+targeted a different set of filesystems with varying characteristics,
+and uses a different set of primitive operations provided by the filesystem.
+The current study compares the various filesystem interfaces.
+Criteria for comparison include generality, completeness, robustness,
+efficiency and esthetics.
+As a result of this comparison, a proposal for a new filesystem interface
+is advanced that includes the best features of the existing implementations.
+The proposal adopts the calling convention for name lookup introduced
+in 4.3BSD.
+A prototype implementation is described.
+This proposal and the rationale underlying its development
+have been presented to major software vendors
+as an early step toward convergence upon a compatible filesystem interface.
diff --git a/share/doc/papers/fsinterface/fsinterface.ms b/share/doc/papers/fsinterface/fsinterface.ms
new file mode 100644
index 000000000000..453cc7e9d594
--- /dev/null
+++ b/share/doc/papers/fsinterface/fsinterface.ms
@@ -0,0 +1,1176 @@
+.\" Copyright (c) 1986 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)fsinterface.ms 1.4 (Berkeley) 4/16/91
+.\" $FreeBSD$
+.\"
+.nr UX 0
+.de UX
+.ie \\n(UX \s-1UNIX\s0\\$1
+.el \{\
+\s-1UNIX\s0\\$1\(dg
+.FS
+\(dg \s-1UNIX\s0 is a registered trademark of AT&T.
+.FE
+.nr UX 1
+.\}
+..
+.TL
+Toward a Compatible Filesystem Interface
+.AU
+Michael J. Karels
+Marshall Kirk McKusick
+.AI
+Computer Systems Research Group
+Computer Science Division
+Department of Electrical Engineering and Computer Science
+University of California, Berkeley
+Berkeley, California 94720
+.AB
+.LP
+As network or remote filesystems have been implemented for
+.UX ,
+several stylized interfaces between the filesystem implementation
+and the rest of the kernel have been developed.
+.FS
+This is an update of a paper originally presented
+at the September 1986 conference of the European
+.UX
+Users' Group.
+Last modified April 16, 1991.
+.FE
+Notable among these are Sun Microsystems' Virtual Filesystem interface (VFS)
+using vnodes, Digital Equipment's Generic File System (GFS) architecture,
+and AT&T's File System Switch (FSS).
+Each design attempts to isolate filesystem-dependent details
+below a generic interface and to provide a framework within which
+new filesystems may be incorporated.
+However, each of these interfaces is different from
+and incompatible with the others.
+Each of them addresses somewhat different design goals.
+Each was based on a different starting version of
+.UX ,
+targeted a different set of filesystems with varying characteristics,
+and uses a different set of primitive operations provided by the filesystem.
+The current study compares the various filesystem interfaces.
+Criteria for comparison include generality, completeness, robustness,
+efficiency and esthetics.
+Several of the underlying design issues are examined in detail.
+As a result of this comparison, a proposal for a new filesystem interface
+is advanced that includes the best features of the existing implementations.
+The proposal adopts the calling convention for name lookup introduced
+in 4.3BSD, but is otherwise closely related to Sun's VFS.
+A prototype implementation is now being developed at Berkeley.
+This proposal and the rationale underlying its development
+have been presented to major software vendors
+as an early step toward convergence on a compatible filesystem interface.
+.AE
+.NH
+Introduction
+.PP
+As network communications and workstation environments
+became common elements in
+.UX
+systems, several vendors of
+.UX
+systems have designed and built network file systems
+that allow client processes on one
+.UX
+machine to access files on a server machine.
+Examples include Sun's Network File System, NFS [Sandberg85],
+AT&T's recently announced Remote File Sharing, RFS [Rifkin86],
+the LOCUS distributed filesystem [Walker85],
+and Masscomp's extended filesystem [Cole85].
+Other remote filesystems have been implemented in research or university groups
+for internal use, notably the network filesystem in the Eighth Edition
+.UX
+system [Weinberger84] and two different filesystems used at Carnegie-Mellon
+University [Satyanarayanan85].
+Numerous other remote file access methods have been devised for use
+within individual
+.UX
+processes,
+many of them by modifications to the C I/O library
+similar to those in the Newcastle Connection [Brownbridge82].
+.PP
+Multiple network filesystems may frequently
+be found in use within a single organization.
+These circumstances make it highly desirable to be able to transport filesystem
+implementations from one system to another.
+Such portability is considerably enhanced by the use of a stylized interface
+with carefully defined entry points to separate the filesystem from the rest
+of the operating system.
+This interface should be similar to the interface between device drivers
+and the kernel.
+Although varying somewhat among the common versions of
+.UX ,
+the device driver interfaces are sufficiently similar that device drivers
+may be moved from one system to another without major problems.
+A clean, well-defined interface to the filesystem also allows a single
+system to support multiple local filesystem types.
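+.PP
+One common way to realize such an interface in C is a table of function
+pointers through which the filesystem-independent code calls each
+filesystem type.
+The sketch below illustrates only that general idea.
+It is not the VFS, GFS, or FSS interface, and every name in it
+(fs_ops, fs_mount, generic_read, memfs_read) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Entry points that each filesystem type supplies (illustrative only). */
struct fs_ops {
    const char *name;
    long (*read)(void *fsdata, char *buf, size_t len, long offset);
};

/* A mounted filesystem: an operations vector plus private state. */
struct fs_mount {
    const struct fs_ops *ops;
    void *fsdata;
};

/* Filesystem-independent layer: dispatches without knowing the type. */
long generic_read(struct fs_mount *mp, char *buf, size_t len, long offset)
{
    assert(mp != NULL && mp->ops != NULL && mp->ops->read != NULL);
    return mp->ops->read(mp->fsdata, buf, len, offset);
}

/* Toy filesystem whose single "file" is an in-memory string. */
static long memfs_read(void *fsdata, char *buf, size_t len, long offset)
{
    const char *contents = fsdata;
    size_t size = strlen(contents);

    if (offset < 0 || (size_t)offset >= size)
        return 0;                        /* at or past end of file */
    if (len > size - (size_t)offset)
        len = size - (size_t)offset;     /* clip to remaining bytes */
    memcpy(buf, contents + offset, len);
    return (long)len;
}

const struct fs_ops memfs_ops = { "memfs", memfs_read };
```

+Adding another filesystem type in this scheme means supplying another
+fs_ops table; generic_read and its callers do not change.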
+.PP
+For reasons such as these, several filesystem interfaces have been used
+when integrating new filesystems into the system.
+The best-known of these are Sun Microsystems' Virtual File System interface,
+VFS [Kleiman86], and AT&T's File System Switch, FSS.
+Another interface, known as the Generic File System, GFS,
+has been implemented for the ULTRIX\(dd
+.FS
+\(dd ULTRIX is a trademark of Digital Equipment Corp.
+.FE
+system by Digital [Rodriguez86].
+There are numerous differences among these designs.
+The differences may be understood from the varying philosophies
+and design goals of the groups involved, from the systems under which
+the implementations were done, and from the filesystems originally targeted
+by the designs.
+These differences are summarized in the following sections
+within the limitations of the published specifications.
+.NH
+Design goals
+.PP
+There are several design goals which, in varying degrees,
+have driven the various designs.
+Each attempts to divide the filesystem into a filesystem-type-independent
+layer and individual filesystem implementations.
+The division between these layers occurs at somewhat different places
+in these systems, reflecting different views of the diversity and types
+of the filesystems that may be accommodated.
+Compatibility with existing local filesystems has varying importance;
+at the user-process level, each attempts to be completely transparent
+except for a few filesystem-related system management programs.
+The AT&T interface also makes a major effort to retain familiar internal
+system interfaces, and even to retain object-file-level binary compatibility
+with operating system modules such as device drivers.
+Both Sun and DEC were willing to change internal data structures and interfaces
+so that other operating system modules might require recompilation
+or source-code modification.
+.PP
+AT&T's interface both allows and requires filesystems to support the full
+and exact semantics of their previous filesystem,
+including interruptions of system calls on slow operations.
+System calls that deal with remote files are encapsulated
+with their environment and sent to a server where execution continues.
+The system call may be aborted by either client or server, returning
+control to the client.
+Most system calls that descend into the filesystem-dependent layer
+of a filesystem other than the standard local filesystem do not return
+to the higher-level kernel calling routines.
+Instead, the filesystem-dependent code completes the requested
+operation and then executes a non-local goto (\fIlongjmp\fP) to exit the
+system call.
+These efforts to avoid modification of main-line kernel code
+indicate a far greater emphasis on internal compatibility than on modularity,
+clean design, or efficiency.
+.PP
+In contrast, the Sun VFS interface makes major modifications to the internal
+interfaces in the kernel, with a very clear separation
+of filesystem-independent and -dependent data structures and operations.
+The semantics of the filesystem are largely retained for local operations,
+although this is achieved at some expense where it does not fit the internal
+structuring well.
+The filesystem implementations are not required to support the same
+semantics as local
+.UX
+filesystems.
+Several historical features of
+.UX
+filesystem behavior are difficult to achieve using the VFS interface,
+including the atomicity of file and link creation and the use of open files
+whose names have been removed.
+.PP
+A major design objective of Sun's network filesystem,
+statelessness,
+permeates the VFS interface.
+No locking may be done in the filesystem-independent layer,
+and locking in the filesystem-dependent layer may occur only during
+a single call into that layer.
+.PP
+A final design goal of most implementors is performance.
+For remote filesystems,
+this goal tends to be in conflict with the goals of complete semantic
+consistency, compatibility and modularity.
+Sun has chosen performance over modularity in some areas,
+but has emphasized clean separation of the layers within the filesystem
+at the expense of performance.
+Although the performance of RFS is yet to be seen,
+AT&T seems to have considered compatibility far more important than modularity
+or performance.
+.NH
+Differences among filesystem interfaces
+.PP
+The existing filesystem interfaces may be characterized
+in several ways.
+Each system is centered around a few data structures or objects,
+along with a set of primitives for performing operations upon these objects.
+In the original
+.UX
+filesystem [Ritchie74],
+the basic object used by the filesystem is the inode, or index node.
+The inode contains all of the information about a file except its name:
+its type, identification, ownership, permissions, timestamps and location.
+Inodes are identified by the filesystem device number and the index within
+the filesystem.
+The major entry points to the filesystem are \fInamei\fP,
+which translates a filesystem pathname into the underlying inode,
+and \fIiget\fP, which locates an inode by number and installs it in the in-core
+inode table.
+\fINamei\fP performs name translation by iterative lookup
+of each component name in its directory to find its inumber,
+then using \fIiget\fP to return the actual inode.
+If the last component has been reached, this inode is returned;
+otherwise, the inode describes the next directory to be searched.
+The inode returned may be used in various ways by the caller;
+it may be examined, the file may be read or written,
+types and access may be checked, and fields may be modified.
+Modified inodes are automatically written back to the filesystem
+on disk when the last reference is released with \fIiput\fP.
+Although the details are considerably different,
+the same general scheme is used in the faster filesystem in 4.2BSD
+.UX
+[McKusick85].
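+.PP
+The iterative translation described above can be sketched in user-space C.
+This is only a toy under stated assumptions: the table \fIitable\fP,
+the structures, and the names \fInamei_toy\fP and \fIiget_toy\fP are
+hypothetical and are not taken from any kernel source.
+It handles absolute pathnames only and omits reference counts, locking,
+mount points and permission checks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NINODE 8

/* One directory entry: a component name and the inumber it maps to. */
struct dirent_toy {
    const char *name;
    int ino;
};

/* A toy in-core inode; only the fields needed for name translation. */
struct inode_toy {
    int used;                          /* slot holds a real inode */
    int is_dir;                        /* nonzero for directories */
    const struct dirent_toy *entries;  /* directory contents, if is_dir */
    int nentries;
};

/* A tiny fixed "filesystem"; inode 1 is the root directory. */
static const struct dirent_toy root_dir[] = { { "usr", 2 }, { "etc", 3 } };
static const struct dirent_toy usr_dir[]  = { { "bin", 4 } };
static const struct dirent_toy bin_dir[]  = { { "ls", 5 } };

static const struct inode_toy itable[NINODE] = {
    [1] = { 1, 1, root_dir, 2 },
    [2] = { 1, 1, usr_dir, 1 },
    [3] = { 1, 1, NULL, 0 },           /* empty directory */
    [4] = { 1, 1, bin_dir, 1 },
    [5] = { 1, 0, NULL, 0 },           /* regular file */
};

/* Stand-in for iget: locate an inode by number. */
static const struct inode_toy *iget_toy(int ino)
{
    if (ino <= 0 || ino >= NINODE || !itable[ino].used)
        return NULL;
    return &itable[ino];
}

/*
 * Stand-in for namei: translate an absolute pathname to an inumber,
 * one component at a time.  Returns 0 if the name cannot be translated.
 */
int namei_toy(const char *path)
{
    int ino = 1;                                   /* start at the root */
    const struct inode_toy *ip = iget_toy(ino);

    assert(path != NULL);
    while (*path == '/')
        path++;
    while (*path != '\0') {
        size_t len = strcspn(path, "/");           /* length of this component */
        int i, next = 0;

        if (ip == NULL || !ip->is_dir)
            return 0;                              /* cannot search a non-directory */
        for (i = 0; i < ip->nentries; i++) {
            const char *n = ip->entries[i].name;
            if (strlen(n) == len && strncmp(n, path, len) == 0) {
                next = ip->entries[i].ino;
                break;
            }
        }
        if (next == 0)
            return 0;                              /* no such entry */
        ino = next;
        ip = iget_toy(ino);                        /* inode for the next step */
        path += len;
        while (*path == '/')
            path++;
    }
    return ip != NULL ? ino : 0;
}
```

+With this table, \fInamei_toy\fP("/usr/bin/ls") walks three directories
+and yields inumber 5, while a name that passes through a regular file
+or names a missing entry yields 0.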
+.PP
+Both the AT&T interface and, to a lesser extent, the DEC interface
+attempt to preserve the inode-oriented interface.
+Each modifies the inode to allow different varieties of the structure
+for different filesystem types by separating the filesystem-dependent
+parts of the inode into a separate structure or one arm of a union.
+Both interfaces allow operations
+equivalent to the \fInamei\fP and \fIiget\fP operations
+of the old filesystem to be performed in the filesystem-independent
+layer, with entry points to the individual filesystem implementations to support
+the type-specific parts of these operations. Implicit in this interface
+is that files may conveniently be named by and located using a single
+index within a filesystem.
+The GFS provides specific entry points to the filesystems
+to change most file properties rather than allowing arbitrary changes
+to be made to the generic part of the inode.
+.PP
+In contrast, the Sun VFS interface replaces the inode as the primary object
+with the vnode.
+The vnode contains no filesystem-dependent fields except the pointer
+to the set of operations implemented by the filesystem.
+Properties of a vnode that might be transient, such as the ownership,
+permissions, size and timestamps, are maintained by the lower layer.
+These properties may be presented in a generic format upon request;
+callers are expected not to hold this information for any length of time,
+as it may not be up-to-date later on.
+The vnode operations do not include an analogue of \fIiget\fP;
+the only external interface for obtaining vnodes for specific files
+is the name lookup operation.
+(Separate procedures are provided outside of this interface
+that obtain a ``file handle'' for a vnode which may be given
+to a client by a server, such that the vnode may be retrieved
+upon later presentation of the file handle.)
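+.PP
+The separation described above, in which the generic object carries only
+a pointer to an operations vector and a private data pointer,
+can be illustrated with a small sketch.
+All names here (\fIvnode_toy\fP, \fIvnodeops_toy\fP, \fImemnode\fP and
+the two operations) are hypothetical and chosen for illustration;
+they are not the VFS definitions, for which see [Kleiman86].

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct vnode_toy;

/* Attributes in a generic format, filled in by the lower layer on request. */
struct vattr_toy {
    long va_size;
    int  va_uid;
};

/* Operations vector supplied by each filesystem type. */
struct vnodeops_toy {
    int (*vn_lookup)(struct vnode_toy *dvp, const char *name,
                     struct vnode_toy **vpp);
    int (*vn_getattr)(struct vnode_toy *vp, struct vattr_toy *vap);
};

/* The generic object: no filesystem-dependent fields except these two. */
struct vnode_toy {
    const struct vnodeops_toy *v_op;  /* operations for this filesystem */
    void *v_data;                     /* private to the filesystem */
};

/* ---- a hypothetical in-memory filesystem ---- */

struct memnode {
    const char *name;
    long size;
    int uid;
    struct vnode_toy *child;          /* single directory entry, or NULL */
};

static int mem_lookup(struct vnode_toy *dvp, const char *name,
                      struct vnode_toy **vpp)
{
    struct memnode *dp = dvp->v_data;

    if (dp->child != NULL) {
        struct memnode *cp = dp->child->v_data;
        if (strcmp(cp->name, name) == 0) {
            *vpp = dp->child;
            return 0;
        }
    }
    return -1;                        /* not found */
}

static int mem_getattr(struct vnode_toy *vp, struct vattr_toy *vap)
{
    struct memnode *mp = vp->v_data;

    vap->va_size = mp->size;
    vap->va_uid = mp->uid;
    return 0;
}

static const struct vnodeops_toy mem_ops = { mem_lookup, mem_getattr };

struct memnode   file_priv = { "motd", 42, 0, NULL };
struct vnode_toy file_vn   = { &mem_ops, &file_priv };
struct memnode   dir_priv  = { "/", 0, 0, &file_vn };
struct vnode_toy dir_vn    = { &mem_ops, &dir_priv };

/* ---- generic layer: dispatch without knowing the filesystem type ---- */

int vop_lookup(struct vnode_toy *dvp, const char *name, struct vnode_toy **vpp)
{
    assert(dvp != NULL && dvp->v_op != NULL);
    return dvp->v_op->vn_lookup(dvp, name, vpp);
}

int vop_getattr(struct vnode_toy *vp, struct vattr_toy *vap)
{
    assert(vp != NULL && vp->v_op != NULL);
    return vp->v_op->vn_getattr(vp, vap);
}
```

+The generic routines never inspect \fIv_data\fP, and the only way to reach
+the vnode for "motd" is through the lookup operation on its directory;
+its size and owner are obtained on request rather than read from the vnode.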
+.NH
+Name translation issues
+.PP
+Each of the systems described includes a mechanism for performing
+pathname-to-internal-representation translation.
+The style of the name translation function is very different in all
+three systems.
+As described above, the AT&T and DEC systems retain the \fInamei\fP function.
+The two are quite different, however, as the ULTRIX interface uses
+the \fInamei\fP calling convention introduced in 4.3BSD.
+The parameters and context for the name lookup operation
+are collected in a \fInameidata\fP structure which is passed to \fInamei\fP
+for operation.
+Intent to create or delete the named file is declared in advance,
+so that the final directory scan in \fInamei\fP may retain information
+such as the offset in the directory at which the modification will be made.
+Filesystems that use such mechanisms to avoid redundant work
+must therefore lock the directory to be modified so that it may not
+be modified by another process before completion.
+In the System V filesystem, as in previous versions of
+.UX ,
+this information is stored in the per-process \fIuser\fP structure
+by \fInamei\fP for use by a low-level routine called after performing
+the actual creation or deletion of the file itself.
+In 4.3BSD and in the GFS interface, these side effects of \fInamei\fP
+are stored in the \fInameidata\fP structure given as argument to \fInamei\fP,
+which is also presented to the routine implementing file creation or deletion.
+.PP
+The ULTRIX \fInamei\fP routine is responsible for the generic
+parts of the name translation process, such as copying the name into
+an internal buffer, validating it, interpolating
+the contents of symbolic links, and indirecting at mount points.
+As in 4.3BSD, the name is copied into the buffer in a single call,
+according to the location of the name.
+After determining the type of the filesystem at the start of translation
+(the current directory or root directory), it calls the filesystem's
+\fInamei\fP entry with the same structure it received from its caller.
+The filesystem-specific routine translates the name, component by component,
+as long as no mount points are reached.
+It may return after any number of components have been processed.
+\fINamei\fP performs any processing at mount points, then calls
+the correct translation routine for the next filesystem.
+Network filesystems may pass the remaining pathname to a server for translation,
+or they may look up the pathname components one at a time.
+The former strategy would be more efficient,
+but the latter scheme allows mount points within a remote filesystem
+without server knowledge of all client mounts.
+.PP
+The AT&T \fInamei\fP interface is presumably the same as that in previous
+.UX
+systems, accepting the name of a routine to fetch pathname characters
+and an operation (one of: lookup, lookup for creation, or lookup for deletion).
+It translates, component by component, as before.
+If it detects that a mount point crosses to a remote filesystem,
+it passes the remainder of the pathname to the remote server.
+A pathname-oriented request other than open may be completed
+within the \fInamei\fP call,
+avoiding return to the (unmodified) system call handler
+that called \fInamei\fP.
+.PP
+In contrast to the first two systems, Sun's VFS interface has replaced
+\fInamei\fP with \fIlookupname\fP.
+This routine simply calls a new pathname-handling module to allocate
+a pathname buffer and copy in the pathname (copying a character per call),
+then calls \fIlookuppn\fP.
+\fILookuppn\fP performs the iteration over the directories leading
+to the destination file; it copies each pathname component to a local buffer,
+then calls the filesystem \fIlookup\fP entry to locate the vnode
+for that file in the current directory.
+Per-filesystem \fIlookup\fP routines may translate only one component
+per call.
+For creation and deletion of new files, the lookup operation is unmodified;
+the lookup of the final component only serves to check for the existence
+of the file.
+The subsequent creation or deletion call, if any, must repeat the final
+name translation and associated directory scan.
+For new file creation in particular, this is rather inefficient,
+as file creation requires two complete scans of the directory.
+.PP
+Several of the important performance improvements in 4.3BSD
+were related to the name translation process [McKusick85][Leffler84].
+The following changes were made:
+.IP 1. 4
+A system-wide cache of recent translations is maintained.
+The cache is separate from the inode cache, so that multiple names
+for a file may be present in the cache.
+The cache does not hold ``hard'' references to the inodes,
+so that the normal reference pattern is not disturbed.
+.IP 2.
+A per-process cache is kept of the directory and offset
+at which the last successful name lookup was done.
+This allows sequential lookups of all the entries in a directory to be done
+in linear time.
+.IP 3.
+The entire pathname is copied into a kernel buffer in a single operation,
+rather than using two subroutine calls per character.
+.IP 4.
+A pool of pathname buffers is held by \fInamei\fP, avoiding allocation
+overhead.
+.LP
+All of these performance improvements from 4.3BSD are well worth using
+within a more generalized filesystem framework.
+The generalization of the structure may otherwise make an already-expensive
+function even more costly.
+Most of these improvements are present in the GFS system, as it derives
+from the beta-test version of 4.3BSD.
+The Sun system uses a name-translation cache generally like that in 4.3BSD.
+The name cache is a filesystem-independent facility provided for the use
+of the filesystem-specific lookup routines.
+The Sun cache, like that first used at Berkeley but unlike that in 4.3BSD,
+holds a ``hard'' reference to the vnode (increments the reference count).
+The ``soft'' reference scheme in 4.3BSD cannot be used with the current
+NFS implementation, as NFS allocates vnodes dynamically and frees them
+when the reference count returns to zero rather than caching them.
+As a result, fewer names may be held in the cache
+than (local filesystem) vnodes, and the cache distorts the normal reference
+patterns otherwise seen by the LRU cache.
+As the name cache references overflow the local filesystem inode table,
+the name cache must be purged to make room in the inode table.
+Also, to determine whether a vnode is in use (for example,
+before mounting upon it), the cache must be flushed to free any
+cache reference.
+These problems should be corrected
+by the use of the soft cache reference scheme.
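+.PP
+One way a ``soft'' cache reference can be made safe is sketched below.
+The cache entry records an identifier copied from the object when the entry
+is made, and a later hit is accepted only if the identifier still matches.
+The structures and names (\fIinode_toy\fP, \fIncentry\fP, \fIi_id\fP,
+\fIcache_enter\fP, \fIcache_lookup\fP) are assumptions made for
+illustration and are not copied from the 4.3BSD source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NCSIZE 4

/* A toy in-core object slot that may be reused for a different file. */
struct inode_toy {
    int  i_number;
    long i_id;      /* changes every time the slot is reassigned */
    int  i_count;   /* "hard" reference count */
};

/* A name cache entry holding a soft reference. */
struct ncentry {
    const char *nc_name;      /* NULL if the entry is empty */
    struct inode_toy *nc_ip;  /* not counted in i_count */
    long nc_id;               /* copy of i_id when the entry was made */
};

static struct ncentry nc_table[NCSIZE];
static long next_id = 1;

/* (Re)assign an in-core slot to a file; any cached names become stale. */
void inode_assign(struct inode_toy *ip, int inumber)
{
    ip->i_number = inumber;
    ip->i_id = next_id++;
    ip->i_count = 0;
}

/* Enter a name without taking a hard reference on the object. */
void cache_enter(const char *name, struct inode_toy *ip)
{
    static int slot;                   /* simple round-robin replacement */
    struct ncentry *ncp = &nc_table[slot];

    assert(name != NULL && ip != NULL);
    slot = (slot + 1) % NCSIZE;
    ncp->nc_name = name;
    ncp->nc_ip = ip;
    ncp->nc_id = ip->i_id;             /* note: i_count is left untouched */
}

/* Return the cached object, or NULL on a miss or a stale entry. */
struct inode_toy *cache_lookup(const char *name)
{
    int i;

    for (i = 0; i < NCSIZE; i++) {
        struct ncentry *ncp = &nc_table[i];

        if (ncp->nc_name == NULL || strcmp(ncp->nc_name, name) != 0)
            continue;
        if (ncp->nc_ip->i_id != ncp->nc_id) {
            ncp->nc_name = NULL;       /* slot was reused: discard entry */
            continue;
        }
        return ncp->nc_ip;
    }
    return NULL;
}
```

+Because the cache never raises \fIi_count\fP, the object table can reclaim
+a slot whenever its hard references drop to zero, and no cache flush is
+needed to learn whether an object is in use.
+The scheme does assume that slots are reused rather than freed, which is
+the difficulty noted above for dynamically allocated vnodes.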
+.PP
+A final observation on the efficiency of name translation in the current
+Sun VFS architecture is that the number of subroutine calls used
+by a multi-component name lookup is dramatically larger
+than in the other systems.
+The name lookup scheme in GFS suffers from this problem much less,
+without violating the layering.
+.PP
+A final problem to be considered is synchronization and consistency.
+As the filesystem operations are more stylized and broken into separate
+entry points for parts of operations, it is more difficult to guarantee
+consistency throughout an operation and/or to synchronize with other
+processes using the same filesystem objects.
+The Sun interface suffers most severely from this,
+as it forbids the filesystems from locking objects across calls
+to the filesystem.
+It is possible that a file may be created between the time that a lookup
+is performed and a subsequent creation is requested.
+Perhaps more strangely, after a lookup fails to find the target
+of a creation attempt, the actual creation might find that the target
+now exists and is a symbolic link.
+The call will either fail unexpectedly, as the target is of the wrong type,
+or the generic creation routine will have to note the error
+and restart the operation from the lookup.
+This problem will always exist in a stateless filesystem,
+but the VFS interface forces all filesystems to share the problem.
+This restriction against locking between calls also
+forces duplication of work during file creation and deletion.
+This is considered unacceptable.
+.NH
+Support facilities and other interactions
+.PP
+Several support facilities are used by the current
+.UX
+filesystem and require generalization for use by other filesystem types.
+For filesystem implementations to be portable,
+it is desirable that these modified support facilities
+should also have a uniform interface and
+behave in a consistent manner in target systems.
+A prominent example is the filesystem buffer cache.
+The buffer cache in a standard (System V or 4.3BSD)
+.UX
+system contains physical disk blocks with no reference to the files containing
+them.
+This works well for the local filesystem, but has obvious problems
+for remote filesystems.
+Sun has modified the buffer cache routines to describe buffers by vnode
+rather than by device.
+For remote files, the vnode used is that of the file, and the block
+numbers are virtual data blocks.
+For local filesystems, a vnode for the block device is used for cache reference,
+and the block numbers are filesystem physical blocks.
+Use of per-file cache description does not easily accommodate
+caching of indirect blocks, inode blocks, superblocks or cylinder group blocks.
+However, the vnode describing the block device for the cache
+is one created internally,
+rather than the vnode for the device looked up when mounting,
+and it is located by searching a private list of vnodes
+rather than by holding it in the mount structure.
+Although the Sun modification makes it possible to use the buffer
+cache for data blocks of remote files, a better generalization
+of the buffer cache is needed.
+.PP
+The RFS filesystem used by AT&T does not currently cache data blocks
+on client systems; thus the buffer cache is probably unmodified.
+The form of the buffer cache in ULTRIX is unknown to us.
+.PP
+Another subsystem that has a large interaction with the filesystem
+is the virtual memory system.
+The virtual memory system must read data from the filesystem
+to satisfy fill-on-demand page faults.
+For efficiency, this read call is arranged to place the data directly
+into the physical pages assigned to the process (a ``raw'' read) to avoid
+copying the data.
+Although the read operation normally bypasses the filesystem buffer cache,
+consistency must be maintained by checking the buffer cache and copying
+or flushing modified data not yet stored on disk.
+The 4.2BSD virtual memory system, like that of Sun and ULTRIX,
+maintains its own cache of reusable text pages.
+This creates additional complications.
+As the virtual memory systems are redesigned, these problems should be
+resolved by reading through the buffer cache, then mapping the cached
+data into the user address space.
+If the buffer cache or the process pages are changed while the other reference
+remains, the data would have to be copied (``copy-on-write'').
+.PP
+In the meantime, the current virtual memory systems must be used
+with the new filesystem framework.
+Both the Sun and AT&T filesystem interfaces
+provide entry points to the filesystem for optimization of the virtual
+memory system by performing logical-to-physical block number translation
+when setting up a fill-on-demand image for a process.
+The VFS provides a vnode operation analogous to the \fIbmap\fP function of the
+.UX
+filesystem.
+Given a vnode and logical block number, it returns a vnode and block number
+which may be read to obtain the data.
+If the filesystem is local, it returns the private vnode for the block device
+and the physical block number.
+As the \fIbmap\fP operations are all performed at one time, during process
+startup, any indirect blocks for the file will remain in the cache
+after they are once read.
+In addition, the interface provides a \fIstrategy\fP entry that may be used
+for ``raw'' reads from a filesystem device,
+used to read data blocks into an address space without copying.
+This entry uses a buffer header (\fIbuf\fP structure)
+to describe the I/O operation
+instead of a \fIuio\fP structure.
+The buffer-style interface is the same as that used by disk drivers internally.
+This difference allows the current \fIuio\fP primitives to be avoided,
+as they copy all data to/from the current user process address space.
+Instead, for local filesystems these operations could be done internally
+with the standard raw disk read routines,
+which use a \fIuio\fP interface.
+When loading from a remote filesystem,
+the data will be received in a network buffer.
+If network buffers are suitably aligned,
+the data may be mapped into the process address space by a page swap
+without copying.
+In either case, it should be possible to use the standard filesystem
+read entry from the virtual memory system.
+.PP
+Other issues that must be considered in devising a portable
+filesystem implementation include kernel memory allocation,
+the implicit use of user-structure global context,
+which may create problems with reentrancy,
+the style of the system call interface,
+and the conventions for synchronization
+(sleep/wakeup, handling of interrupted system calls, semaphores).
+.NH
+The Berkeley Proposal
+.PP
+The Sun VFS interface has been the most widely used of the three described here.
+It is also the most general of the three, in that filesystem-specific
+data and operations are best separated from the generic layer.
+Although it has several disadvantages which were described above,
+most of them may be corrected with minor changes to the interface
+(and, in a few areas, philosophical changes).
+The DEC GFS has other advantages, in particular the use of the 4.3BSD
+\fInamei\fP interface and optimizations.
+It allows single or multiple components of a pathname
+to be translated in a single call to the specific filesystem
+and thus accommodates filesystems with either preference.
+The FSS is least well understood, as there is little public information
+about the interface.
+However, the design goals are the least consistent with those of the Berkeley
+research groups.
+Accordingly, a new filesystem interface has been devised to avoid
+some of the problems in the other systems.
+The proposed interface derives directly from Sun's VFS,
+but, like GFS, uses a 4.3BSD-style name lookup interface.
+Additional context information has been moved from the \fIuser\fP structure
+to the \fInameidata\fP structure so that name translation may be independent
+of the global context of a user process.
+This is especially desired in any system where kernel-mode servers
+operate as light-weight or interrupt-level processes,
+or where a server may store or cache context for several clients.
+This calling interface has the additional advantage
+that the call parameters need not all be pushed onto the stack for each call
+through the filesystem interface,
+and they may be accessed using short offsets from a base pointer
+(unlike global variables in the \fIuser\fP structure).
+.PP
+The proposed filesystem interface is described very tersely here.
+For the most part, data structures and procedures are analogous
+to those used by VFS, and only the changes will be treated here.
+See [Kleiman86] for complete descriptions of the vfs and vnode operations
+in Sun's interface.
+.PP
+The central data structure for name translation is the \fInameidata\fP
+structure.
+The same structure is used to pass parameters to \fInamei\fP,
+to pass these same parameters to filesystem-specific lookup routines,
+to communicate completion status from the lookup routines back to \fInamei\fP,
+and to return completion status to the calling routine.
+For creation or deletion requests, the parameters to the filesystem operation
+to complete the request are also passed in this same structure.
+The form of the \fInameidata\fP structure is:
+.br
+.ne 2i
+.ID
+.nf
+.ta .5i +\w'caddr_t\0\0\0'u +\w'struct\0\0'u +\w'vnode *nc_prevdir;\0\0\0\0\0'u
+/*
+ * Encapsulation of namei parameters.
+ * One of these is located in the u. area to
+ * minimize space allocated on the kernel stack
+ * and to retain per-process context.
+ */
+struct nameidata {
+ /* arguments to namei and related context: */
+ caddr_t ni_dirp; /* pathname pointer */
+ enum uio_seg ni_seg; /* location of pathname */
+ short ni_nameiop; /* see below */
+ struct vnode *ni_cdir; /* current directory */
+ struct vnode *ni_rdir; /* root directory, if not normal root */
+ struct ucred *ni_cred; /* credentials */
+
+ /* shared between namei, lookup routines and commit routines: */
+ caddr_t ni_pnbuf; /* pathname buffer */
+ char *ni_ptr; /* current location in pathname */
+ int ni_pathlen; /* remaining chars in path */
+ short ni_more; /* more left to translate in pathname */
+ short ni_loopcnt; /* count of symlinks encountered */
+
+ /* results: */
+ struct vnode *ni_vp; /* vnode of result */
+ struct vnode *ni_dvp; /* vnode of intermediate directory */
+
+/* BEGIN UFS SPECIFIC */
+ struct diroffcache { /* last successful directory search */
+ struct vnode *nc_prevdir; /* terminal directory */
+ long nc_id; /* directory's unique id */
+ off_t nc_prevoffset; /* where last entry found */
+ } ni_nc;
+/* END UFS SPECIFIC */
+};
+.DE
+.DS
+.ta \w'#define\0\0'u +\w'WANTPARENT\0\0'u +\w'0x40\0\0\0\0\0\0\0'u
+/*
+ * namei operations and modifiers
+ */
+#define LOOKUP 0 /* perform name lookup only */
+#define CREATE 1 /* setup for file creation */
+#define DELETE 2 /* setup for file deletion */
+#define WANTPARENT 0x10 /* return parent directory vnode also */
+#define NOCACHE 0x20 /* name must not be left in cache */
+#define FOLLOW 0x40 /* follow symbolic links */
+#define NOFOLLOW 0x0 /* don't follow symbolic links (pseudo) */
+.DE
+As in current systems other than Sun's VFS, \fInamei\fP is called
+with an operation request, one of LOOKUP, CREATE or DELETE.
+For a LOOKUP, the operation is exactly like the lookup in VFS.
+CREATE and DELETE allow the filesystem to ensure consistency
+by locking the parent inode (private to the filesystem),
+and (for the local filesystem) to avoid duplicate directory scans
+by storing the new directory entry and its offset in the directory
+in the \fIndirinfo\fP structure.
+This is intended to be opaque to the filesystem-independent levels.
+Not all lookups for creation or deletion are actually followed
+by the intended operation; permission may be denied, the filesystem
+may be read-only, etc.
+Therefore, an entry point to the filesystem is provided
+to abort a creation or deletion operation
+and allow release of any locked internal data.
+After a \fInamei\fP with a CREATE or DELETE flag, the pathname pointer
+is set to point to the last filename component.
+Filesystems that choose to implement creation or deletion entirely
+within the subsequent call to a create or delete entry
+are thus free to do so.
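+.PP
+The sequence just described (lookup with intent, then either commit
+or abort) can be sketched as follows.
+The flag values are those listed above; everything else, including
+\fItoy_namei\fP, \fItoy_create\fP, \fItoy_abortop\fP, the \fIni_offset\fP
+field and the way the operation is separated from its modifiers,
+is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* namei operations and modifiers, as listed in the text */
#define LOOKUP      0
#define CREATE      1
#define DELETE      2
#define WANTPARENT  0x10
#define NOCACHE     0x20
#define FOLLOW      0x40
#define NOFOLLOW    0x0

/* A toy directory: a lock flag and an entry count. */
struct toy_dir {
    int locked;      /* held from namei until commit or abort */
    int nentries;
};

/* A cut-down nameidata carrying state from lookup to commit. */
struct toy_nameidata {
    short ni_nameiop;
    struct toy_dir *ni_dvp;
    int ni_offset;   /* hypothetical: slot chosen by the directory scan */
};

/* Strip the modifier bits, leaving LOOKUP, CREATE or DELETE. */
int namei_op(short nameiop)
{
    return nameiop & ~(WANTPARENT | NOCACHE | FOLLOW);
}

/* Lookup; for CREATE or DELETE, lock the parent and remember the slot. */
int toy_namei(struct toy_nameidata *ndp, struct toy_dir *dp)
{
    int op;

    assert(ndp != NULL && dp != NULL);
    op = namei_op(ndp->ni_nameiop);
    ndp->ni_dvp = dp;
    if (op != LOOKUP) {
        if (dp->locked)
            return -1;                 /* a real kernel would sleep here */
        dp->locked = 1;
        if (op == CREATE)
            ndp->ni_offset = dp->nentries;  /* where the entry will go */
    }
    return 0;
}

/* Commit a creation using the saved offset, then release the lock. */
void toy_create(struct toy_nameidata *ndp)
{
    assert(ndp->ni_dvp != NULL && ndp->ni_dvp->locked);
    ndp->ni_dvp->nentries = ndp->ni_offset + 1;  /* no second scan needed */
    ndp->ni_dvp->locked = 0;
}

/* Abandon the operation, releasing anything locked by toy_namei. */
void toy_abortop(struct toy_nameidata *ndp)
{
    assert(ndp->ni_dvp != NULL);
    ndp->ni_dvp->locked = 0;
}
```

+A caller that is refused permission after \fItoy_namei\fP must call
+\fItoy_abortop\fP; otherwise the directory would remain locked.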
+.PP
+The \fInameidata\fP is used to store context used during name translation.
+The current and root directories for the translation are stored here.
+For the local filesystem, the per-process directory offset cache
+is also kept here.
+A file server could leave the directory offset cache empty,
+could use a single cache for all clients,
+or could hold caches for several recent clients.
+.PP
+Several other data structures are used in the filesystem operations.
+One is the \fIucred\fP structure which describes a client's credentials
+to the filesystem.
+This is modified slightly from the Sun structure;
+the ``accounting'' group ID has been merged into the groups array.
+The actual number of groups in the array is given explicitly
+to avoid use of a reserved group ID as a terminator.
+Also, typedefs introduced in 4.3BSD for user and group ID's have been used.
+The \fIucred\fP structure is thus:
+.DS
+.ta .5i +\w'caddr_t\0\0\0'u +\w'struct\0\0'u +\w'vnode *nc_prevdir;\0\0\0\0\0'u
+/*
+ * Credentials.
+ */
+struct ucred {
+ u_short cr_ref; /* reference count */
+ uid_t cr_uid; /* effective user id */
+ short cr_ngroups; /* number of groups */
+ gid_t cr_groups[NGROUPS]; /* groups */
+ /*
+ * The following either should not be here,
+ * or should be treated as opaque.
+ */
+ uid_t cr_ruid; /* real user id */
+ gid_t cr_svgid; /* saved set-group id */
+};
+.DE
+.PP
+A final structure used by the filesystem interface is the \fIuio\fP
+structure mentioned earlier.
+This structure describes the source or destination of an I/O
+operation, with provision for scatter/gather I/O.
+It is used in the read and write entries to the filesystem.
+The \fIuio\fP structure presented here is modified from the one
+used in 4.2BSD to specify the location of each vector of the operation
+(user or kernel space)
+and to allow an alternate function to be used to implement the data movement.
+The alternate function might perform page remapping rather than a copy,
+for example.
+.DS
+.ta .5i +\w'caddr_t\0\0\0'u +\w'struct\0\0'u +\w'vnode *nc_prevdir;\0\0\0\0\0'u
+/*
+ * Description of an I/O operation which potentially
+ * involves scatter-gather, with individual sections
+ * described by iovec, below. uio_resid is initially
+ * set to the total size of the operation, and is
+ * decremented as the operation proceeds. uio_offset
+ * is incremented by the amount of each operation.
+ * uio_iov is incremented and uio_iovcnt is decremented
+ * after each vector is processed.
+ */
+struct uio {
+ struct iovec *uio_iov;
+ int uio_iovcnt;
+ off_t uio_offset;
+ int uio_resid;
+ enum uio_rw uio_rw;
+};
+
+enum uio_rw { UIO_READ, UIO_WRITE };
+.DE
+.DS
+.ta .5i +\w'caddr_t\0\0\0'u +\w'vnode *nc_prevdir;\0\0\0\0\0'u
+/*
+ * Description of a contiguous section of an I/O operation.
+ * If iov_op is non-null, it is called to implement the copy
+ * operation, possibly by remapping, with the call
+ * (*iov_op)(from, to, count);
+ * where from and to are caddr_t and count is int.
+ * Otherwise, the copy is done in the normal way,
+ * treating base as a user or kernel virtual address
+ * according to iov_segflg.
+ */
+struct iovec {
+ caddr_t iov_base;
+ int iov_len;
+ enum uio_seg iov_segflg;
+ int (*iov_op)();
+};
+.DE
+.DS
+.ta .5i +\w'UIO_USERSPACE\0\0\0\0\0'u
+/*
+ * Segment flag values.
+ */
+enum uio_seg {
+ UIO_USERSPACE, /* from user data space */
+ UIO_SYSSPACE, /* from system space */
+};
+.DE
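+.PP
+The bookkeeping described in the comments above can be illustrated
+by a user-space sketch of a data-movement routine.
+The types are renamed with a \fItoy_\fP prefix to avoid clashing with
+system headers, both segment types are treated as ordinary memory,
+and \fItoy_uiomove\fP and \fIcounting_copy\fP are hypothetical names;
+this is not the kernel routine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum toy_uio_seg { TOY_USERSPACE, TOY_SYSSPACE };
enum toy_uio_rw  { TOY_READ, TOY_WRITE };

struct toy_iovec {
    char *iov_base;
    int   iov_len;
    enum toy_uio_seg iov_segflg;
    int (*iov_op)(char *from, char *to, int count);  /* optional mover */
};

struct toy_uio {
    struct toy_iovec *uio_iov;
    int  uio_iovcnt;
    long uio_offset;
    int  uio_resid;
    enum toy_uio_rw uio_rw;
};

int op_calls;   /* counts uses of the alternate mover below */

/* An alternate mover; it stands in for page remapping but simply copies. */
int counting_copy(char *from, char *to, int count)
{
    op_calls++;
    memcpy(to, from, (size_t)count);
    return 0;
}

/*
 * Move up to n bytes between the buffer cp and the vectors of uio.
 * For TOY_READ data flows from cp into the vectors; for TOY_WRITE it
 * flows from the vectors into cp.  Returns 0, or the mover's error.
 */
int toy_uiomove(char *cp, int n, struct toy_uio *uio)
{
    assert(cp != NULL && uio != NULL && n >= 0);
    while (n > 0 && uio->uio_resid > 0 && uio->uio_iovcnt > 0) {
        struct toy_iovec *iov = uio->uio_iov;
        int cnt = iov->iov_len < n ? iov->iov_len : n;
        char *from, *to;

        if (cnt > uio->uio_resid)
            cnt = uio->uio_resid;
        if (cnt == 0) {                /* empty vector: skip it */
            uio->uio_iov++;
            uio->uio_iovcnt--;
            continue;
        }
        if (uio->uio_rw == TOY_READ) {
            from = cp;
            to = iov->iov_base;
        } else {
            from = iov->iov_base;
            to = cp;
        }
        if (iov->iov_op != NULL) {
            int error = (*iov->iov_op)(from, to, cnt);
            if (error)
                return error;
        } else {
            memcpy(to, from, (size_t)cnt);
        }
        iov->iov_base += cnt;
        iov->iov_len -= cnt;
        uio->uio_resid -= cnt;         /* decremented as the operation proceeds */
        uio->uio_offset += cnt;        /* incremented by the amount moved */
        cp += cnt;
        n -= cnt;
        if (iov->iov_len == 0) {       /* vector finished: advance */
            uio->uio_iov++;
            uio->uio_iovcnt--;
        }
    }
    return 0;
}
```

+A read of seven bytes into two vectors of three and four bytes leaves
+\fIuio_resid\fP at zero and \fIuio_offset\fP advanced by seven,
+with the alternate function invoked only for the vector that names it.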
+.NH
+File and filesystem operations
+.PP
+With the introduction of the data structures used by the filesystem
+operations, the complete list of filesystem entry points may be listed.
+As noted, they derive mostly from the Sun VFS interface.
+Lines marked with \fB+\fP are additions to the Sun definitions;
+lines marked with \fB!\fP are modified from VFS.
+.PP
+The structure describing the externally-visible features of a mounted
+filesystem, \fIvfs\fP, is:
+.DS
+.ta .5i +\w'struct vfsops\0\0\0'u +\w'*vfs_vnodecovered;\0\0\0\0\0'u
+/*
+ * Structure per mounted file system.
+ * Each mounted file system has an array of
+ * operations and an instance record.
+ * The file systems are put on a doubly linked list.
+ */
+struct vfs {
+ struct vfs *vfs_next; /* next vfs in vfs list */
+\fB+\fP struct vfs *vfs_prev; /* prev vfs in vfs list */
+ struct vfsops *vfs_op; /* operations on vfs */
+ struct vnode *vfs_vnodecovered; /* vnode we mounted on */
+ int vfs_flag; /* flags */
+\fB!\fP int vfs_fsize; /* fundamental block size */
+\fB+\fP int vfs_bsize; /* optimal transfer size */
+\fB!\fP uid_t vfs_exroot; /* exported fs uid 0 mapping */
+ short vfs_exflags; /* exported fs flags */
+ caddr_t vfs_data; /* private data */
+};
+.DE
+.DS
+.ta \w'\fB+\fP 'u +\w'#define\0\0'u +\w'VFS_EXPORTED\0\0'u +\w'0x40\0\0\0\0\0'u
+ /*
+ * vfs flags.
+ * VFS_MLOCK lock the vfs so that name lookup cannot proceed past the vfs.
+ * This keeps the subtree stable during mounts and unmounts.
+ */
+ #define VFS_RDONLY 0x01 /* read only vfs */
+\fB+\fP #define VFS_NOEXEC 0x02 /* can't exec from filesystem */
+ #define VFS_MLOCK 0x04 /* lock vfs so that subtree is stable */
+ #define VFS_MWAIT 0x08 /* someone is waiting for lock */
+ #define VFS_NOSUID 0x10 /* don't honor setuid bits on vfs */
+ #define VFS_EXPORTED 0x20 /* file system is exported (NFS) */
+
+ /*
+ * exported vfs flags.
+ */
+ #define EX_RDONLY 0x01 /* exported read only */
+.DE
+.LP
+The operations supported by the filesystem-specific layer
+on an individual filesystem are:
+.DS
+.ta .5i +\w'struct vfsops\0\0\0'u +\w'*vfs_vnodecovered;\0\0\0\0\0'u
+/*
+ * Operations supported on virtual file system.
+ */
+struct vfsops {
+\fB!\fP int (*vfs_mount)( /* vfs, path, data, datalen */ );
+\fB!\fP int (*vfs_unmount)( /* vfs, forcibly */ );
+\fB+\fP int (*vfs_mountroot)();
+ int (*vfs_root)( /* vfs, vpp */ );
+\fB!\fP int (*vfs_statfs)( /* vfs, vp, sbp */ );
+\fB!\fP int (*vfs_sync)( /* vfs, waitfor */ );
+\fB+\fP int (*vfs_fhtovp)( /* vfs, fhp, vpp */ );
+\fB+\fP int (*vfs_vptofh)( /* vp, fhp */ );
+};
+.DE
+.LP
+The \fIvfs_statfs\fP entry returns a structure of the form:
+.DS
+.ta .5i +\w'struct vfsops\0\0\0'u +\w'*vfs_vnodecovered;\0\0\0\0\0'u
+/*
+ * file system statistics
+ */
+struct statfs {
+\fB!\fP short f_type; /* type of filesystem */
+\fB+\fP short f_flags; /* copy of vfs (mount) flags */
+\fB!\fP long f_fsize; /* fundamental file system block size */
+\fB+\fP long f_bsize; /* optimal transfer block size */
+ long f_blocks; /* total data blocks in file system */
+ long f_bfree; /* free blocks in fs */
+ long f_bavail; /* free blocks avail to non-superuser */
+ long f_files; /* total file nodes in file system */
+ long f_ffree; /* free file nodes in fs */
+ fsid_t f_fsid; /* file system id */
+\fB+\fP char *f_mntonname; /* directory on which mounted */
+\fB+\fP char *f_mntfromname; /* mounted filesystem */
+ long f_spare[7]; /* spare for later */
+};
+
+typedef long fsid_t[2]; /* file system id type */
+.DE
+.LP
+The modifications to Sun's interface at this level are minor.
+Additional arguments are present for the \fIvfs_mount\fP and \fIvfs_unmount\fP
+entries.
+\fIvfs_statfs\fP accepts a vnode as well as filesystem identifier,
+as the information may not be uniform throughout a filesystem.
+For example,
+if a client may mount a file tree that spans multiple physical
+filesystems on a server, different sections may have different amounts
+of free space.
+(NFS does not allow remotely-mounted file trees to span physical filesystems
+on the server.)
+The final additions are the entries that support file handles.
+\fIvfs_vptofh\fP is provided for the use of file servers,
+which need to obtain an opaque
+file handle to represent the current vnode for transmission to clients.
+This file handle may later be used to relocate the vnode using \fIvfs_fhtovp\fP
+without requiring the vnode to remain in memory.
+.PP
+Finally, the external form of a filesystem object, the \fIvnode\fP, is:
+.DS
+.ta .5i +\w'struct vnodeops\0\0'u +\w'*v_vfsmountedhere;\0\0\0'u
+/*
+ * vnode types. VNON means no type.
+ */
+enum vtype { VNON, VREG, VDIR, VBLK, VCHR, VLNK, VSOCK };
+
+struct vnode {
+ u_short v_flag; /* vnode flags (see below) */
+ u_short v_count; /* reference count */
+ u_short v_shlockc; /* count of shared locks */
+ u_short v_exlockc; /* count of exclusive locks */
+ struct vfs *v_vfsmountedhere; /* ptr to vfs mounted here */
+ struct vfs *v_vfsp; /* ptr to vfs we are in */
+ struct vnodeops *v_op; /* vnode operations */
+\fB+\fP struct text *v_text; /* text/mapped region */
+ enum vtype v_type; /* vnode type */
+ caddr_t v_data; /* private data for fs */
+};
+.DE
+.DS
+.ta \w'#define\0\0'u +\w'NOFOLLOW\0\0'u +\w'0x40\0\0\0\0\0\0\0'u
+/*
+ * vnode flags.
+ */
+#define VROOT 0x01 /* root of its file system */
+#define VTEXT 0x02 /* vnode is a pure text prototype */
+#define VEXLOCK 0x10 /* exclusive lock */
+#define VSHLOCK 0x20 /* shared lock */
+#define VLWAIT 0x40 /* proc is waiting on shared or excl. lock */
+.DE
+.LP
+The operations supported by the filesystems on individual \fIvnode\fP\^s
+are:
+.DS
+.ta .5i +\w'int\0\0\0\0\0'u +\w'(*vn_getattr)(\0\0\0\0\0'u
+/*
+ * Operations on vnodes.
+ */
+struct vnodeops {
+\fB!\fP int (*vn_lookup)( /* ndp */ );
+\fB!\fP int (*vn_create)( /* ndp, vap, fflags */ );
+\fB+\fP int (*vn_mknod)( /* ndp, vap, fflags */ );
+\fB!\fP int (*vn_open)( /* vp, fflags, cred */ );
+ int (*vn_close)( /* vp, fflags, cred */ );
+ int (*vn_access)( /* vp, fflags, cred */ );
+ int (*vn_getattr)( /* vp, vap, cred */ );
+ int (*vn_setattr)( /* vp, vap, cred */ );
+
+\fB+\fP int (*vn_read)( /* vp, uiop, offp, ioflag, cred */ );
+\fB+\fP int (*vn_write)( /* vp, uiop, offp, ioflag, cred */ );
+\fB!\fP int (*vn_ioctl)( /* vp, com, data, fflag, cred */ );
+ int (*vn_select)( /* vp, which, cred */ );
+\fB+\fP int (*vn_mmap)( /* vp, ..., cred */ );
+ int (*vn_fsync)( /* vp, cred */ );
+\fB+\fP int (*vn_seek)( /* vp, offp, off, whence */ );
+
+\fB!\fP int (*vn_remove)( /* ndp */ );
+\fB!\fP int (*vn_link)( /* vp, ndp */ );
+\fB!\fP int (*vn_rename)( /* src ndp, target ndp */ );
+\fB!\fP int (*vn_mkdir)( /* ndp, vap */ );
+\fB!\fP int (*vn_rmdir)( /* ndp */ );
+\fB!\fP int (*vn_symlink)( /* ndp, vap, nm */ );
+ int (*vn_readdir)( /* vp, uiop, offp, ioflag, cred */ );
+ int (*vn_readlink)( /* vp, uiop, ioflag, cred */ );
+
+\fB+\fP int (*vn_abortop)( /* ndp */ );
+\fB+\fP int (*vn_lock)( /* vp */ );
+\fB+\fP int (*vn_unlock)( /* vp */ );
+\fB!\fP int (*vn_inactive)( /* vp */ );
+};
+.DE
+.DS
+.ta \w'#define\0\0'u +\w'NOFOLLOW\0\0'u +\w'0x40\0\0\0\0\0'u
+/*
+ * flags for ioflag
+ */
+#define IO_UNIT 0x01 /* do io as atomic unit for VOP_RDWR */
+#define IO_APPEND 0x02 /* append write for VOP_RDWR */
+#define IO_SYNC 0x04 /* sync io for VOP_RDWR */
+.DE
+.LP
+The argument types listed in the comments following each operation are:
+.sp
+.IP ndp 10
+A pointer to a \fInameidata\fP structure.
+.IP vap
+A pointer to a \fIvattr\fP structure (vnode attributes; see below).
+.IP fflags
+File open flags, possibly including O_APPEND, O_CREAT, O_TRUNC and O_EXCL.
+.IP vp
+A pointer to a \fIvnode\fP previously obtained with \fIvn_lookup\fP.
+.IP cred
+A pointer to a \fIucred\fP credentials structure.
+.IP uiop
+A pointer to a \fIuio\fP structure.
+.IP ioflag
+Any of the IO flags defined above.
+.IP com
+An \fIioctl\fP command, with type \fIunsigned long\fP.
+.IP data
+A pointer to a character buffer used to pass data to or from an \fIioctl\fP.
+.IP which
+One of FREAD, FWRITE or 0 (select for exceptional conditions).
+.IP off
+A file offset of type \fIoff_t\fP.
+.IP offp
+A pointer to file offset of type \fIoff_t\fP.
+.IP whence
+One of L_SET, L_INCR, or L_XTND.
+.IP fhp
+A pointer to a file handle buffer.
+.sp
+.PP
+Several changes have been made to Sun's set of vnode operations.
+Most obviously, the \fIvn_lookup\fP receives a \fInameidata\fP structure
+containing its arguments and context as described.
+The same structure is also passed to one of the creation or deletion
+entries if the lookup operation is for CREATE or DELETE to complete
+an operation, or to the \fIvn_abortop\fP entry if no operation
+is undertaken.
+For filesystems that perform no locking between lookup for creation
+or deletion and the call to implement that action,
+the final pathname component may be left untranslated by the lookup
+routine.
+In any case, the pathname pointer points at the final name component,
+and the \fInameidata\fP contains a reference to the vnode of the parent
+directory.
+The interface is thus flexible enough to accommodate filesystems
+that are fully stateful or fully stateless, while avoiding redundant
+operations whenever possible.
+One operation remains problematical, the \fIvn_rename\fP call.
+It is tempting to look up the source of the rename for deletion
+and the target for creation.
+However, filesystems that lock directories during such lookups must avoid
+deadlock if the two paths cross.
+For that reason, the source is translated for LOOKUP only,
+with the WANTPARENT flag set;
+the target is then translated with an operation of CREATE.
+.PP
+In addition to the changes concerned with the \fInameidata\fP interface,
+several other changes were made in the vnode operations.
+The \fIvn_rdwr\fP entry was split into \fIvn_read\fP and \fIvn_write\fP;
+frequently, the read/write entry amounts to a routine that checks
+the direction flag, then calls either a read routine or a write routine.
+The two entries may be identical for any given filesystem;
+the direction flag is contained in the \fIuio\fP given as an argument.
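+.PP
+As an illustration of how the two entries may share one implementation,
+the following C sketch installs a single routine in both slots and lets it
+dispatch on the direction flag held in the uio.
+It is only a sketch under stated assumptions: the names myfs_rdwr,
+myfs_doread, myfs_dowrite and struct myfs_rwops are hypothetical,
+the uio is reduced to three fields, and the offp, ioflag and cred
+arguments of the real entries are omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Direction flag, as declared alongside struct uio in this interface. */
enum uio_rw { UIO_READ, UIO_WRITE };

/* Reduced stand-in for struct uio: only the fields this sketch touches. */
struct uio {
	long		uio_offset;	/* current file offset */
	int		uio_resid;	/* bytes left to transfer */
	enum uio_rw	uio_rw;		/* direction of the transfer */
};

struct vnode;				/* opaque in this sketch */

/* Hypothetical counters so the path taken can be observed. */
static int myfs_nreads, myfs_nwrites;

/* Hypothetical read helper: pretends every requested byte was moved. */
static int
myfs_doread(struct vnode *vp, struct uio *uiop)
{
	(void)vp;
	myfs_nreads++;
	uiop->uio_offset += uiop->uio_resid;
	uiop->uio_resid = 0;
	return (0);
}

/* Hypothetical write helper: same bookkeeping, opposite direction. */
static int
myfs_dowrite(struct vnode *vp, struct uio *uiop)
{
	(void)vp;
	myfs_nwrites++;
	uiop->uio_offset += uiop->uio_resid;
	uiop->uio_resid = 0;
	return (0);
}

/* One routine for both entries: check the direction flag, then dispatch. */
static int
myfs_rdwr(struct vnode *vp, struct uio *uiop)
{
	assert(uiop != NULL);
	return (uiop->uio_rw == UIO_READ ?
	    myfs_doread(vp, uiop) : myfs_dowrite(vp, uiop));
}

/* Reduced operations vector holding only the two entries of interest. */
struct myfs_rwops {
	int	(*vn_read)(struct vnode *, struct uio *);
	int	(*vn_write)(struct vnode *, struct uio *);
};

/* Both slots point at the same routine. */
static const struct myfs_rwops myfs_ops = { myfs_rdwr, myfs_rdwr };
```

+Note that both helpers update the offset and residual count in the uio
+before returning, as the interface requires of every read and write operation.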
+.PP
+All of the read and write operations use a \fIuio\fP to describe
+the file offset and buffer locations.
+All of these fields must be updated before return.
+In particular, the \fIvn_readdir\fP entry uses this
+to return a new file offset token for its current location.
+.PP
+Several new operations have been added.
+The first, \fIvn_seek\fP, is a concession to record-oriented files
+such as directories.
+It allows the filesystem to verify that a seek leaves a file at a sensible
+offset, or to return a new offset token relative to an earlier one.
+For most filesystems and files, this operation amounts to performing
+simple arithmetic.
+Another new entry point is \fIvn_mmap\fP, for use in mapping device memory
+into a user process address space.
+Its semantics are not yet decided.
+The final additions are the \fIvn_lock\fP and \fIvn_unlock\fP entries.
+These are used to request that the underlying file be locked against
+changes for short periods of time if the filesystem implementation allows it.
+They are used to maintain consistency
+during internal operations such as \fIexec\fP,
+and may not be used to construct atomic operations from other filesystem
+operations.
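+.PP
+For a byte-stream file, the simple arithmetic performed by \fIvn_seek\fP
+might be sketched in C as follows.
+This is an illustration rather than the actual implementation:
+struct myfs_node and myfs_seek are hypothetical names, the vnode argument
+is replaced by a stand-in holding only the file size, the numeric values
+given to L_SET, L_INCR and L_XTND are assumed, and overflow is not checked.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Values for whence; the numeric values are assumed for this sketch. */
#define L_SET	0	/* offset is absolute */
#define L_INCR	1	/* offset is relative to the current offset */
#define L_XTND	2	/* offset is relative to the end of the file */

/* Hypothetical stand-in for the vnode: only the file size is needed. */
struct myfs_node {
	long	size;
};

/*
 * Compute the new offset for a seek and store it through offp.
 * An unknown whence or a negative result is rejected, and the
 * caller's offset is then left unchanged.
 */
static int
myfs_seek(struct myfs_node *np, long *offp, long off, int whence)
{
	long newoff;

	assert(np != NULL && offp != NULL);
	switch (whence) {
	case L_SET:
		newoff = off;
		break;
	case L_INCR:
		newoff = *offp + off;
		break;
	case L_XTND:
		newoff = np->size + off;
		break;
	default:
		return (EINVAL);
	}
	if (newoff < 0)
		return (EINVAL);
	*offp = newoff;
	return (0);
}
```

+A record-oriented file such as a directory would replace the arithmetic
+with whatever validation or token translation its format requires.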
+.PP
+The attributes of a vnode are not stored in the vnode,
+as they might change with time and may need to be read from a remote
+source.
+Attributes have the form:
+.DS
+.ta .5i +\w'struct vnodeops\0\0'u +\w'*v_vfsmountedhere;\0\0\0'u
+/*
+ * Vnode attributes. A field value of -1
+ * represents a field whose value is unavailable
+ * (getattr) or which is not to be changed (setattr).
+ */
+struct vattr {
+ enum vtype va_type; /* vnode type (for create) */
+ u_short va_mode; /* files access mode and type */
+\fB!\fP uid_t va_uid; /* owner user id */
+\fB!\fP gid_t va_gid; /* owner group id */
+ long va_fsid; /* file system id (dev for now) */
+\fB!\fP long va_fileid; /* file id */
+ short va_nlink; /* number of references to file */
+ u_long va_size; /* file size in bytes (quad?) */
+\fB+\fP u_long va_size1; /* reserved if not quad */
+ long va_blocksize; /* blocksize preferred for i/o */
+ struct timeval va_atime; /* time of last access */
+ struct timeval va_mtime; /* time of last modification */
+ struct timeval va_ctime; /* time file changed */
+ dev_t va_rdev; /* device the file represents */
+ u_long va_bytes; /* bytes of disk space held by file */
+\fB+\fP u_long va_bytes1; /* reserved if va_bytes not a quad */
+};
+.DE
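+.PP
+The convention that a field value of -1 means ``not to be changed''
+for setattr can be illustrated with the C sketch below.
+It rests on assumptions: VNOVAL, struct vattr_subset, vattr_null and
+apply_setattr are hypothetical names, only three fields are modelled,
+and the owner ids are declared as long purely for the example.
+The point worth noting is that an unsigned short field such as va_mode
+must be compared against -1 converted to its own type.

```c
#include <assert.h>
#include <stddef.h>

#define VNOVAL	(-1)	/* hypothetical name for the "no value" marker */

/* Hypothetical subset of struct vattr, enough to show the convention. */
struct vattr_subset {
	unsigned short	va_mode;	/* access mode */
	long		va_uid;		/* owner user id */
	long		va_gid;		/* owner group id */
};

/* Mark every field as "not to be changed". */
static void
vattr_null(struct vattr_subset *vap)
{
	assert(vap != NULL);
	vap->va_mode = (unsigned short)VNOVAL;
	vap->va_uid = VNOVAL;
	vap->va_gid = VNOVAL;
}

/* Copy into cur only those fields of req that carry a real value. */
static void
apply_setattr(struct vattr_subset *cur, const struct vattr_subset *req)
{
	assert(cur != NULL && req != NULL);
	/* The cast matters: va_mode promotes to int and never equals -1. */
	if (req->va_mode != (unsigned short)VNOVAL)
		cur->va_mode = req->va_mode;
	if (req->va_uid != VNOVAL)
		cur->va_uid = req->va_uid;
	if (req->va_gid != VNOVAL)
		cur->va_gid = req->va_gid;
}
```

+A caller changing only the mode would clear the request with vattr_null,
+set va_mode, and pass the request on; the owner ids are then left alone.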
+.NH
+Conclusions
+.PP
+The Sun VFS filesystem interface is the most widely used generic
+filesystem interface.
+Of the interfaces examined, it creates the cleanest separation
+between the filesystem-independent and -dependent layers and data structures.
+It has several flaws, but it is felt that certain changes in the interface
+can ameliorate most of them.
+The interface proposed here includes those changes.
+The proposed interface is now being implemented by the Computer Systems
+Research Group at Berkeley.
+If the design succeeds in improving the flexibility and performance
+of the filesystem layering, it will be advanced as a model interface.
+.NH
+Acknowledgements
+.PP
+The filesystem interface described here is derived from Sun's VFS interface.
+It also includes features similar to those of DEC's GFS interface.
+We are indebted to members of the Sun and DEC system groups
+for long discussions of the issues involved.
+.br
+.ne 2i
+.NH
+References
+
+.IP Brownbridge82 \w'Satyanarayanan85\0\0'u
+Brownbridge, D.R., L.F. Marshall, B. Randell,
+``The Newcastle Connection, or UNIXes of the World Unite!,''
+\fISoftware\- Practice and Experience\fP, Vol. 12, pp. 1147-1162, 1982.
+
+.IP Cole85
+Cole, C.T., P.B. Flinn, A.B. Atlas,
+``An Implementation of an Extended File System for UNIX,''
+\fIUsenix Conference Proceedings\fP,
+pp. 131-150, June, 1985.
+
+.IP Kleiman86
+``Vnodes: An Architecture for Multiple File System Types in Sun UNIX,''
+\fIUsenix Conference Proceedings\fP,
+pp. 238-247, June, 1986.
+
+.IP Leffler84
+Leffler, S., M.K. McKusick, M. Karels,
+``Measuring and Improving the Performance of 4.2BSD,''
+\fIUsenix Conference Proceedings\fP, pp. 237-252, June, 1984.
+
+.IP McKusick84
+McKusick, M.K., W.N. Joy, S.J. Leffler, R.S. Fabry,
+``A Fast File System for UNIX,'' \fITransactions on Computer Systems\fP,
+Vol. 2, pp. 181-197,
+ACM, August, 1984.
+
+.IP McKusick85
+McKusick, M.K., M. Karels, S. Leffler,
+``Performance Improvements and Functional Enhancements in 4.3BSD,''
+\fIUsenix Conference Proceedings\fP, pp. 519-531, June, 1985.
+
+.IP Rifkin86
+Rifkin, A.P., M.P. Forbes, R.L. Hamilton, M. Sabrio, S. Shah, and K. Yueh,
+``RFS Architectural Overview,'' \fIUsenix Conference Proceedings\fP,
+pp. 248-259, June, 1986.
+
+.IP Ritchie74
+Ritchie, D.M. and K. Thompson, ``The Unix Time-Sharing System,''
+\fICommunications of the ACM\fP, Vol. 17, pp. 365-375, July, 1974.
+
+.IP Rodriguez86
+Rodriguez, R., M. Koehler, R. Hyde,
+``The Generic File System,'' \fIUsenix Conference Proceedings\fP,
+pp. 260-269, June, 1986.
+
+.IP Sandberg85
+Sandberg, R., D. Goldberg, S. Kleiman, D. Walsh, B. Lyon,
+``Design and Implementation of the Sun Network Filesystem,''
+\fIUsenix Conference Proceedings\fP,
+pp. 119-130, June, 1985.
+
+.IP Satyanarayanan85
+Satyanarayanan, M., \fIet al.\fP,
+``The ITC Distributed File System: Principles and Design,''
+\fIProc. 10th Symposium on Operating Systems Principles\fP, pp. 35-50,
+ACM, December, 1985.
+
+.IP Walker85
+Walker, B.J. and S.H. Kiser, ``The LOCUS Distributed Filesystem,''
+\fIThe LOCUS Distributed System Architecture\fP,
+G.J. Popek and B.J. Walker, ed., The MIT Press, Cambridge, MA, 1985.
+
+.IP Weinberger84
+Weinberger, P.J., ``The Version 8 Network File System,''
+\fIUsenix Conference presentation\fP,
+June, 1984.
diff --git a/share/doc/papers/fsinterface/slides.t b/share/doc/papers/fsinterface/slides.t
new file mode 100644
index 000000000000..3caaafbeea59
--- /dev/null
+++ b/share/doc/papers/fsinterface/slides.t
@@ -0,0 +1,318 @@
+.\" Copyright (c) 1986 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)slides.t 5.2 (Berkeley) 4/16/91
+.\"
+.so macros
+.nf
+.LL
+Encapsulation of namei parameters
+.NP 0
+.ta .5i +\w'caddr_t\0\0'u +\w'struct\0\0'u +\w'vnode *nc_prevdir;\0\0\0\0\0'u
+struct nameidata {
+ /* arguments and context: */
+ caddr_t ni_dirp;
+ enum uio_seg ni_seg;
+ short ni_nameiop;
+ struct vnode *ni_cdir;
+ struct vnode *ni_rdir;
+ struct ucred *ni_cred;
+.sp .2
+ /* shared with lookup and commit: */
+ caddr_t ni_pnbuf;
+ char *ni_ptr;
+ int ni_pathlen;
+ short ni_more;
+ short ni_loopcnt;
+.sp .2
+ /* results: */
+ struct vnode *ni_vp;
+ struct vnode *ni_dvp;
+.sp .2
+/* BEGIN UFS SPECIFIC */
+ struct diroffcache {
+ struct vnode *nc_prevdir;
+ long nc_id;
+ off_t nc_prevoffset;
+ } ni_nc;
+/* END UFS SPECIFIC */
+};
+.bp
+
+
+.LL
+Namei operations and modifiers
+
+.NP 0
+.ta \w'#define\0\0'u +\w'WANTPARENT\0\0'u +\w'0x40\0\0\0\0\0\0\0'u
+#define LOOKUP 0 /* name lookup only */
+#define CREATE 1 /* setup for creation */
+#define DELETE 2 /* setup for deletion */
+#define WANTPARENT 0x10 /* return parent vnode also */
+#define NOCACHE 0x20 /* remove name from cache */
+#define FOLLOW 0x40 /* follow symbolic links */
+.bp
+
+.LL
+Namei operations and modifiers
+
+.NP 0
+.ta \w'#define\0\0'u +\w'WANTPARENT\0\0'u +\w'0x40\0\0\0\0\0\0\0'u
+#define LOOKUP 0
+#define CREATE 1
+#define DELETE 2
+#define WANTPARENT 0x10
+#define NOCACHE 0x20
+#define FOLLOW 0x40
+.bp
+
+
+.LL
+Credentials
+
+.NP 0
+.ta .5i +\w'caddr_t\0\0\0'u +\w'struct\0\0'u +\w'vnode *nc_prevdir;\0\0\0\0\0'u
+struct ucred {
+ u_short cr_ref;
+ uid_t cr_uid;
+ short cr_ngroups;
+ gid_t cr_groups[NGROUPS];
+ /*
+ * The following either should not be here,
+ * or should be treated as opaque.
+ */
+ uid_t cr_ruid;
+ gid_t cr_svgid;
+};
+.bp
+.LL
+Scatter-gather I/O
+.NP 0
+.ta .5i +\w'caddr_t\0\0\0'u +\w'struct\0\0'u +\w'vnode *nc_prevdir;\0\0\0\0\0'u
+struct uio {
+ struct iovec *uio_iov;
+ int uio_iovcnt;
+ off_t uio_offset;
+ int uio_resid;
+ enum uio_rw uio_rw;
+};
+
+enum uio_rw { UIO_READ, UIO_WRITE };
+
+
+
+.ta .5i +\w'caddr_t\0\0\0'u +\w'vnode *nc_prevdir;\0\0\0\0\0'u
+struct iovec {
+ caddr_t iov_base;
+ int iov_len;
+ enum uio_seg iov_segflg;
+ int (*iov_op)();
+};
+.bp
+.LL
+Per-filesystem information
+.NP 0
+.ta .25i +\w'struct vfsops\0\0\0'u +\w'*vfs_vnodecovered;\0\0\0\0\0'u
+struct vfs {
+ struct vfs *vfs_next;
+\fB+\fP struct vfs *vfs_prev;
+ struct vfsops *vfs_op;
+ struct vnode *vfs_vnodecovered;
+ int vfs_flag;
+\fB!\fP int vfs_fsize;
+\fB+\fP int vfs_bsize;
+\fB!\fP uid_t vfs_exroot;
+ short vfs_exflags;
+ caddr_t vfs_data;
+};
+
+.NP 0
+.ta \w'\fB+\fP 'u +\w'#define\0\0'u +\w'VFS_EXPORTED\0\0'u +\w'0x40\0\0\0\0\0'u
+ /* vfs flags: */
+ #define VFS_RDONLY 0x01
+\fB+\fP #define VFS_NOEXEC 0x02
+ #define VFS_MLOCK 0x04
+ #define VFS_MWAIT 0x08
+ #define VFS_NOSUID 0x10
+ #define VFS_EXPORTED 0x20
+
+ /* exported vfs flags: */
+ #define EX_RDONLY 0x01
+.bp
+
+
+.LL
+Operations supported on virtual file system.
+
+.NP 0
+.ta .25i +\w'int\0\0'u +\w'*vfs_mountroot();\0'u
+struct vfsops {
+\fB!\fP int (*vfs_mount)(vfs, path, data, len);
+\fB!\fP int (*vfs_unmount)(vfs, forcibly);
+\fB+\fP int (*vfs_mountroot)();
+ int (*vfs_root)(vfs, vpp);
+ int (*vfs_statfs)(vfs, sbp);
+\fB!\fP int (*vfs_sync)(vfs, waitfor);
+\fB+\fP int (*vfs_fhtovp)(vfs, fhp, vpp);
+\fB+\fP int (*vfs_vptofh)(vp, fhp);
+};
+.bp
+
+
+.LL
+Dynamic file system information
+
+.NP 0
+.ta .5i +\w'struct\0\0\0'u +\w'*vfs_vnodecovered;\0\0\0\0\0'u
+struct statfs {
+\fB!\fP short f_type;
+\fB+\fP short f_flags;
+\fB!\fP long f_fsize;
+\fB+\fP long f_bsize;
+ long f_blocks;
+ long f_bfree;
+ long f_bavail;
+ long f_files;
+ long f_ffree;
+ fsid_t f_fsid;
+\fB+\fP char *f_mntonname;
+\fB+\fP char *f_mntfromname;
+ long f_spare[7];
+};
+
+typedef long fsid_t[2];
+.bp
+.LL
+Filesystem objects (vnodes)
+.NP 0
+.ta .25i +\w'struct vnodeops\0\0'u +\w'*v_vfsmountedhere;\0\0\0'u
+enum vtype { VNON, VREG, VDIR, VBLK, VCHR, VLNK, VSOCK };
+
+struct vnode {
+ u_short v_flag;
+ u_short v_count;
+ u_short v_shlockc;
+ u_short v_exlockc;
+ struct vfs *v_vfsmountedhere;
+ struct vfs *v_vfsp;
+ struct vnodeops *v_op;
+\fB+\fP struct text *v_text;
+ enum vtype v_type;
+ caddr_t v_data;
+};
+.ta \w'#define\0\0'u +\w'NOFOLLOW\0\0'u +\w'0x40\0\0\0\0\0\0\0'u
+
+/* vnode flags */
+#define VROOT 0x01
+#define VTEXT 0x02
+#define VEXLOCK 0x10
+#define VSHLOCK 0x20
+#define VLWAIT 0x40
+.bp
+.LL
+Operations on vnodes
+
+.NP 0
+.ta .25i +\w'int\0\0'u +\w'(*vn_getattr)(\0\0\0\0\0'u
+struct vnodeops {
+\fB!\fP int (*vn_lookup)(ndp);
+\fB!\fP int (*vn_create)(ndp, vap, fflags);
+\fB+\fP int (*vn_mknod)(ndp, vap, fflags);
+\fB!\fP int (*vn_open)(vp, fflags, cred);
+ int (*vn_close)(vp, fflags, cred);
+ int (*vn_access)(vp, fflags, cred);
+ int (*vn_getattr)(vp, vap, cred);
+ int (*vn_setattr)(vp, vap, cred);
+.sp .5
+\fB+\fP int (*vn_read)(vp, uiop,
+ offp, ioflag, cred);
+\fB+\fP int (*vn_write)(vp, uiop,
+ offp, ioflag, cred);
+\fB!\fP int (*vn_ioctl)(vp, com,
+ data, fflag, cred);
+ int (*vn_select)(vp, which, cred);
+\fB+\fP int (*vn_mmap)(vp, ..., cred);
+ int (*vn_fsync)(vp, cred);
+\fB+\fP int (*vn_seek)(vp, offp, off,
+ whence);
+.bp
+.LL
+Operations on vnodes (cont)
+
+.NP 0
+.ta .25i +\w'int\0\0'u +\w'(*vn_getattr)(\0\0\0\0\0'u
+
+\fB!\fP int (*vn_remove)(ndp);
+\fB!\fP int (*vn_link)(vp, ndp);
+\fB!\fP int (*vn_rename)(sndp, tndp);
+\fB!\fP int (*vn_mkdir)(ndp, vap);
+\fB!\fP int (*vn_rmdir)(ndp);
+\fB!\fP int (*vn_symlink)(ndp, vap, nm);
+\fB!\fP int (*vn_readdir)(vp, uiop,
+ offp, ioflag, cred);
+\fB!\fP int (*vn_readlink)(vp, uiop,
+ offp, ioflag, cred);
+.sp .5
+\fB+\fP int (*vn_abortop)(ndp);
+\fB!\fP int (*vn_inactive)(vp);
+};
+
+.NP 0
+.ta \w'#define\0\0'u +\w'NOFOLLOW\0\0'u +\w'0x40\0\0\0\0\0'u
+/* flags for ioflag */
+#define IO_UNIT 0x01
+#define IO_APPEND 0x02
+#define IO_SYNC 0x04
+.bp
+
+.LL
+Vnode attributes
+
+.NP 0
+.ta .5i +\w'struct timeval\0\0'u +\w'*v_vfsmountedhere;\0\0\0'u
+struct vattr {
+ enum vtype va_type;
+ u_short va_mode;
+\fB!\fP uid_t va_uid;
+\fB!\fP gid_t va_gid;
+ long va_fsid;
+\fB!\fP long va_fileid;
+ short va_nlink;
+ u_long va_size;
+\fB+\fP u_long va_size1;
+ long va_blocksize;
+ struct timeval va_atime;
+ struct timeval va_mtime;
+ struct timeval va_ctime;
+ dev_t va_rdev;
+\fB!\fP u_long va_bytes;
+\fB+\fP u_long va_bytes1;
+};
diff --git a/share/doc/papers/hwpmc/Makefile b/share/doc/papers/hwpmc/Makefile
new file mode 100644
index 000000000000..d24fe06d9d2d
--- /dev/null
+++ b/share/doc/papers/hwpmc/Makefile
@@ -0,0 +1,8 @@
+# $FreeBSD$
+
+VOLUME= papers
+DOC= hwpmc
+SRCS= hwpmc.ms
+MACROS= -ms
+
+.include <bsd.doc.mk>
diff --git a/share/doc/papers/hwpmc/hwpmc.ms b/share/doc/papers/hwpmc/hwpmc.ms
new file mode 100644
index 000000000000..9061bb7a69a3
--- /dev/null
+++ b/share/doc/papers/hwpmc/hwpmc.ms
@@ -0,0 +1,34 @@
+.\" Copyright (c) 2004 Joseph Koshy.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY JOSEPH KOSHY AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL JOSEPH KOSHY OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" $FreeBSD$
+.\"
+.OH '''Using Hardware Performance Monitoring Counters'
+.EH 'HWPMC'''
+.TL
+Using Hardware Performance Monitoring Counters in FreeBSD
+.sp
+\s-2FreeBSD 5.2.1\s+2
+.sp
+\fRJuly, 2004\fR
+.PP
diff --git a/share/doc/papers/jail/Makefile b/share/doc/papers/jail/Makefile
new file mode 100644
index 000000000000..5d493542ea98
--- /dev/null
+++ b/share/doc/papers/jail/Makefile
@@ -0,0 +1,14 @@
+# $FreeBSD$
+
+VOLUME= papers
+DOC= jail
+SRCS= paper.ms-patched
+EXTRA= implementation.ms mgt.ms future.ms jail01.eps
+MACROS= -ms
+USE_SOELIM=
+CLEANFILES= paper.ms-patched
+
+paper.ms-patched: paper.ms
+ sed "s;jail01\.eps;${.CURDIR}/&;" ${.ALLSRC} > ${.TARGET}
+
+.include <bsd.doc.mk>
diff --git a/share/doc/papers/jail/future.ms b/share/doc/papers/jail/future.ms
new file mode 100644
index 000000000000..01c325d4d19c
--- /dev/null
+++ b/share/doc/papers/jail/future.ms
@@ -0,0 +1,104 @@
+.\"
+.\" $FreeBSD$
+.\"
+.NH
+Future Directions
+.PP
+The jail facility has already been deployed in numerous capacities and
+a few opportunities for improvement have manifested themselves.
+.NH 2
+Improved Virtualisation
+.PP
+As it stands, the jail code provides a strict subset of system resources
+to the jail environment, based on access to processes, files, network
+resources, and privileged services.
+Virtualisation, or making the jail environments appear to be fully
+functional FreeBSD systems, allows maximum application support and the
+ability to offer a wide range of services within a jail environment.
+However, there are a number of limitations on the degree of virtualisation
+in the current code, and removing these limitations will enhance the
+ability to offer services in a jail environment.
+Two areas that deserve greater attention are the virtualisation of
+network resources, and management of scheduling resources.
+.PP
+Currently, a single IP address may be allocated to each jail, and all
+communication from the jail is limited to that IP address.
+In particular, these addresses are IPv4 addresses.
+There has been substantial interest in improving interface virtualisation,
+allowing one or more addresses to be assigned to an interface, and
+removing the requirement that the address be an IPv4 address, allowing
+the use of IPv6.
+Also, access to raw sockets is currently prohibited, as the current
+implementation of raw sockets allows access to raw IP packets associated
+with all interfaces.
+Limiting the scope of the raw socket would allow its safe use within
+a jail, re-enabling support for ping, and other network debugging and
+evaluation tools.
+.PP
+Another area of great interest to the current consumers of the jail code
+is the ability to limit the impact of one jail on the CPU resources
+available for other jails.
+Specifically, this would require that the jail of a process play a role in
+its scheduling parameters.
+Prior work in the area of lottery scheduling, currently available as
+patches on FreeBSD 2.2.x, might be leveraged to allow some degree of
+partitioning between jail environments \s-2[LOTTERY1] [LOTTERY2]\s+2.
+However, as the current scheduling mechanism is targeted at time
+sharing, and FreeBSD does not currently support real time preemption
+of processes in kernel, complete partitioning is not possible within the
+current framework.
+.NH 2
+Improved Management
+.PP
+Management of jail environments is currently somewhat ad hoc--creating
+and starting jails is a well-documented procedure, but day-to-day
+management of jails, as well as special case procedures such as shutdown,
+are not well analysed and documented.
+The current kernel process management infrastructure does not have the
+ability to manage pools of processes in a jail-centric way.
+For example, it is possible, from within a jail, to deliver a signal to all
+processes in that jail, but it is not possible to atomically target all
+processes within a jail from outside of the jail.
+If the jail code is to effectively limit the behaviour of a jail, the
+ability to shut it down cleanly is paramount.
+Similarly, shutting down a jail cleanly from within is also not well
+defined, the traditional shutdown utilities having been written with
+a host environment in mind.
+This suggests a number of improvements, both in the kernel and in the
+user-land utility set.
+.PP
+First, the ability to address kernel-centric management mechanisms at
+jails is important.
+One way in which this might be done is to assign a unique jail id, not
+unlike a process id or process group id, at jail creation time.
+A new jailkill() syscall would permit the direction of signals to
+specific jailids, allowing for the effective termination of all processes
+in the jail.
+A unique jailid could also supplant the hostname as the unique
+identifier for a jail, allowing the hostname to be changed by the
+processes in the jail without interfering with jail management.
+.PP
+More carefully defining the user-land semantics of a jail during startup
+and shutdown is also important.
+The traditional FreeBSD environment makes use of an init process to
+bring the system up during the boot process, and to assist in shutdown.
+A similar technique might be used for jail, in effect a jailinit,
+formulated to handle the clean startup and shutdown, including calling
+out to jail-local /etc/rc.shutdown, and other useful shutdown functions.
+A jailinit would also present a central location for delivering
+management requests to within a jail from the host environment, allowing
+the host environment to request the shutdown of the jail cleanly, before
+resorting to terminating processes, in the same style as the host
+environment shutting down before killing all processes and halting the
+kernel.
+.PP
+Improvements in the host environment would also assist in improving
+jail management, possibly including automated runtime jail management tools,
+tools to more easily construct the per-jail file system area, and
+the inclusion of jail shutdown as part of normal system shutdown.
+.PP
+These improvements in the jail framework would improve both raw
+functionality and usability from a management perspective.
+The jail code has raised significant interest in the FreeBSD community,
+and it is hoped that this type of improved functionality will be
+available in upcoming releases of FreeBSD.
diff --git a/share/doc/papers/jail/implementation.ms b/share/doc/papers/jail/implementation.ms
new file mode 100644
index 000000000000..eafc8f25c9c7
--- /dev/null
+++ b/share/doc/papers/jail/implementation.ms
@@ -0,0 +1,126 @@
+.\"
+.\" $FreeBSD$
+.\"
+.NH
+Implementation of jail in the FreeBSD kernel.
+.NH 2
+The jail(2) system call, allocation, refcounting and deallocation of
+\fCstruct prison\fP.
+.PP
+The jail(2) system call is implemented as a non-optional system call
+in FreeBSD. Other system calls are controlled by compile time options
+in the kernel configuration file, but due to the minute footprint of
+the jail implementation, it was decided to make it a standard
+facility in FreeBSD.
+.PP
+The implementation of the system call is straightforward: a data structure
+is allocated and populated with the arguments provided. The data structure
+is attached to the current process' \fCstruct proc\fP, its reference count
+set to one and a call to the
+chroot(2) syscall implementation completes the task.
+.PP
+Hooks in the code implementing process creation and destruction maintain
+the reference count on the data structure and free it when the last reference
+is lost.
+Any new process created by a process in a jail will inherit a reference
+to the jail, which effectively puts the new process in the same jail.
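+.PP
+The reference counting described above might be sketched in C as follows.
+This is a simplified illustration, not the kernel code: the structures are
+reduced to the fields needed here, and the names proc_jail, proc_fork_hook,
+proc_exit_hook and prisons_freed, as well as the field names, are assumptions
+made for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Reduced stand-ins; the field names are assumptions for this sketch. */
struct prison {
	int		pr_ref;		/* number of processes holding a reference */
};

struct proc {
	struct prison	*p_prison;	/* NULL when the process is not jailed */
};

static int prisons_freed;		/* hypothetical counter, for observation */

/* Put a process in a fresh jail; the reference count starts at one. */
static int
proc_jail(struct proc *p)
{
	struct prison *pr;

	assert(p != NULL && p->p_prison == NULL);
	pr = calloc(1, sizeof(*pr));
	if (pr == NULL)
		return (-1);
	pr->pr_ref = 1;
	p->p_prison = pr;
	return (0);
}

/* Process creation hook: the child inherits a reference to the jail. */
static void
proc_fork_hook(const struct proc *parent, struct proc *child)
{
	child->p_prison = parent->p_prison;
	if (child->p_prison != NULL)
		child->p_prison->pr_ref++;
}

/* Process destruction hook: drop the reference, free on the last one. */
static void
proc_exit_hook(struct proc *p)
{
	struct prison *pr = p->p_prison;

	p->p_prison = NULL;
	if (pr != NULL && --pr->pr_ref == 0) {
		free(pr);
		prisons_freed++;
	}
}
```

+Because the only way to gain a reference is to inherit one at creation time,
+the sketch also reflects the rule that a process cannot be attached to an
+existing jail from outside.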
+.PP
+There is no way to modify the contents of the data structure describing
+the jail after its creation, and no way to attach a process to an existing
+jail if it was not created from inside that jail.
+.NH 2
+Fortification of the chroot(2) facility for filesystem name scoping.
+.PP
+A number of ways to escape the confines of a chroot(2)-created subscope
+of the filesystem view have been identified over the years.
+chroot(2) was never intended to be a security mechanism as such, but even
+then the ftp daemon largely depended on the security provided by
+chroot(2) to provide the ``anonymous ftp'' access method.
+.PP
+Three classes of escape routes existed: recursive chroot(2) escapes,
+``..'' based escapes and fchdir(2) based escapes.
+All of these exploited the fact that chroot(2) didn't try sufficiently
+hard to enforce the new root directory.
+.PP
+New code was added to detect and thwart these escapes, amongst
+other things by tracking the directory of the first level of chroot(2)
+experienced by a process and refusing backwards traversal across
+this directory, as well as additional code to refuse chroot(2) if
+file-descriptors were open referencing directories.
+.NH 2
+Restriction of process visibility and interaction.
+.PP
+A macro was already available in the kernel to determine if one process
+could affect another process. This macro did the rather complex checking
+of uid and gid values. It was felt that the complexity of the macro was
+approaching the lower edge of IOCCC entrance criteria, and it was therefore
+converted to a proper function named \fCp_trespass(p1, p2)\fP which does
+all the previous checks and additionally checks the jail aspect of the access.
+The check is implemented such that access fails if the origin process is jailed
+but the target process is not in the same jail.
+.PP
+Process visibility is provided through two mechanisms in FreeBSD,
+the \fCprocfs\fP file system and a sub-tree of the \fCsysctl\fP tree.
+Both of these were modified to report only the processes in the same
+jail to a jailed process.
+.NH 2
+Restriction to one IP number.
+.PP
+Restricting TCP and UDP access to just one IP number was done almost
+entirely in the code which manages ``protocol control blocks''.
+When a jailed process binds a socket, the IP number provided by
+the process is not used; instead, the pre-configured IP number of
+the jail is used.
+.PP
+BSD based TCP/IP network stacks sport a special interface, the loop-back
+interface, which has the ``magic'' IP number 127.0.0.1.
+This is often used by processes to contact servers on the local machine,
+and consequently special handling for jails was needed.
+To handle this case it was necessary to also intercept and modify the
+behaviour of connection establishment, and when the 127.0.0.1 address
+is seen from a jailed process, substitute the jail's configured IP number.
+.PP
+Finally the APIs through which the network configuration and connection
+state may be queried were modified to report only information relevant
+to the configured IP number of a jailed process.
+.NH 2
+Adding jail awareness to selected device drivers.
+.PP
+A couple of device drivers needed to be taught about jails; the ``pty''
+driver is one of them. The pty driver provides ``virtual terminals'' to
+services like telnet, ssh, rlogin and X11 terminal window programs.
+Therefore jails need access to the pty driver, and code had to be added
+to enforce that a particular virtual terminal is not accessed from more
+than one jail at the same time.
+.NH 2
+General restriction of super-user powers for jailed super-users.
+.PP
+This item proved to be the simplest but most tedious to implement:
+tedious because a manual review of all places where the kernel allowed
+the super-user special powers was called for,
+simple because very few places were required to let a jailed root through.
+Of the approximately 260 checks in the FreeBSD 4.0 kernel, only
+about 35 will let a jailed root through.
+.PP
+Since the default is for jailed roots to not receive privilege, new
+code or drivers in the FreeBSD kernel are automatically jail-aware: they
+will refuse jailed roots privilege.
+The other part of this protection comes from the fact that a jailed
+root cannot create new device nodes with the mknod(2) system call, so
+unless the machine administrator creates device nodes for a particular
+device inside the jail's filesystem tree, the driver in effect does
+not exist in the jail.
+.PP
+As a side-effect of this work the suser(9) API was cleaned up and
+extended to cater for not only the jail facility, but also to make room
+for future partitioning facilities.
+.NH 2
+Implementation statistics
+.PP
+The change of the suser(9) API modified approx. 350 source lines
+distributed over approx. 100 source files. The vast majority of
+these changes were generated automatically with a script.
+.PP
+The implementation of the jail facility added approx. 200 lines of
+code in total, distributed over approx. 50 files, and about 200 lines
+in two new kernel files.
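
Editorial sketch (not part of the patch): the lifecycle described above for the jail(2) system call (allocate and attach, inherit on process creation, free when the last reference is lost) and the jail check added to p_trespass() can be modelled in user space. The following is a minimal Python model under stated assumptions, not the kernel code: the names Prison, Proc, jail, fork, exit_proc and jail_denies, and all of their fields, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prison:
    """Toy stand-in for struct prison; field names are hypothetical."""
    hostname: str
    ip: str
    ref: int = 0
    freed: bool = False


@dataclass
class Proc:
    """Toy stand-in for struct proc."""
    pid: int
    prison: Optional[Prison] = None


def jail(p: Proc, hostname: str, ip: str) -> None:
    """Model of jail(2): allocate, populate, attach, refcount set to one."""
    p.prison = Prison(hostname, ip, ref=1)


def fork(parent: Proc, pid: int) -> Proc:
    """A child inherits a reference to its parent's jail, if there is one."""
    child = Proc(pid, parent.prison)
    if child.prison is not None:
        child.prison.ref += 1
    return child


def exit_proc(p: Proc) -> None:
    """Drop the reference; mark the prison freed when the last one is lost."""
    pr, p.prison = p.prison, None
    if pr is not None:
        pr.ref -= 1
        if pr.ref == 0:
            pr.freed = True


def jail_denies(origin: Proc, target: Proc) -> bool:
    """Jail part of the access check: access fails if the origin is jailed
    and the target is not in the same jail."""
    return origin.prison is not None and origin.prison is not target.prison
```

Note that the check is asymmetric in this model, as in the text: an unjailed host process is not refused access to a jailed process by the jail rule, only the reverse direction is.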
diff --git a/share/doc/papers/jail/jail01.eps b/share/doc/papers/jail/jail01.eps
new file mode 100644
index 000000000000..ffcfa30386f1
--- /dev/null
+++ b/share/doc/papers/jail/jail01.eps
@@ -0,0 +1,234 @@
+%!PS-Adobe-2.0 EPSF-2.0
+%%Title: jail01.eps
+%%Creator: fig2dev Version 3.2 Patchlevel 1
+%%CreationDate: Fri Mar 24 20:37:59 2000
+%%For: $FreeBSD$
+%%Orientation: Portrait
+%%BoundingBox: 0 0 425 250
+%%Pages: 0
+%%BeginSetup
+%%EndSetup
+%%Magnification: 1.0000
+%%EndComments
+/$F2psDict 200 dict def
+$F2psDict begin
+$F2psDict /mtrx matrix put
+/col-1 {0 setgray} bind def
+/col0 {0.000 0.000 0.000 srgb} bind def
+/col1 {0.000 0.000 1.000 srgb} bind def
+/col2 {0.000 1.000 0.000 srgb} bind def
+/col3 {0.000 1.000 1.000 srgb} bind def
+/col4 {1.000 0.000 0.000 srgb} bind def
+/col5 {1.000 0.000 1.000 srgb} bind def
+/col6 {1.000 1.000 0.000 srgb} bind def
+/col7 {1.000 1.000 1.000 srgb} bind def
+/col8 {0.000 0.000 0.560 srgb} bind def
+/col9 {0.000 0.000 0.690 srgb} bind def
+/col10 {0.000 0.000 0.820 srgb} bind def
+/col11 {0.530 0.810 1.000 srgb} bind def
+/col12 {0.000 0.560 0.000 srgb} bind def
+/col13 {0.000 0.690 0.000 srgb} bind def
+/col14 {0.000 0.820 0.000 srgb} bind def
+/col15 {0.000 0.560 0.560 srgb} bind def
+/col16 {0.000 0.690 0.690 srgb} bind def
+/col17 {0.000 0.820 0.820 srgb} bind def
+/col18 {0.560 0.000 0.000 srgb} bind def
+/col19 {0.690 0.000 0.000 srgb} bind def
+/col20 {0.820 0.000 0.000 srgb} bind def
+/col21 {0.560 0.000 0.560 srgb} bind def
+/col22 {0.690 0.000 0.690 srgb} bind def
+/col23 {0.820 0.000 0.820 srgb} bind def
+/col24 {0.500 0.190 0.000 srgb} bind def
+/col25 {0.630 0.250 0.000 srgb} bind def
+/col26 {0.750 0.380 0.000 srgb} bind def
+/col27 {1.000 0.500 0.500 srgb} bind def
+/col28 {1.000 0.630 0.630 srgb} bind def
+/col29 {1.000 0.750 0.750 srgb} bind def
+/col30 {1.000 0.880 0.880 srgb} bind def
+/col31 {1.000 0.840 0.000 srgb} bind def
+
+end
+save
+-117.0 298.0 translate
+1 -1 scale
+
+/cp {closepath} bind def
+/ef {eofill} bind def
+/gr {grestore} bind def
+/gs {gsave} bind def
+/sa {save} bind def
+/rs {restore} bind def
+/l {lineto} bind def
+/m {moveto} bind def
+/rm {rmoveto} bind def
+/n {newpath} bind def
+/s {stroke} bind def
+/sh {show} bind def
+/slc {setlinecap} bind def
+/slj {setlinejoin} bind def
+/slw {setlinewidth} bind def
+/srgb {setrgbcolor} bind def
+/rot {rotate} bind def
+/sc {scale} bind def
+/sd {setdash} bind def
+/ff {findfont} bind def
+/sf {setfont} bind def
+/scf {scalefont} bind def
+/sw {stringwidth} bind def
+/tr {translate} bind def
+/tnt {dup dup currentrgbcolor
+ 4 -2 roll dup 1 exch sub 3 -1 roll mul add
+ 4 -2 roll dup 1 exch sub 3 -1 roll mul add
+ 4 -2 roll dup 1 exch sub 3 -1 roll mul add srgb}
+ bind def
+/shd {dup dup currentrgbcolor 4 -2 roll mul 4 -2 roll mul
+ 4 -2 roll mul srgb} bind def
+/$F2psBegin {$F2psDict begin /$F2psEnteredState save def} def
+/$F2psEnd {$F2psEnteredState restore end} def
+%%EndProlog
+
+$F2psBegin
+10 setmiterlimit
+n -1000 5962 m -1000 -1000 l 10022 -1000 l 10022 5962 l cp clip
+ 0.06000 0.06000 sc
+/Courier-BoldOblique ff 180.00 scf sf
+7725 3600 m
+gs 1 -1 sc (10.0.0.2) dup sw pop neg 0 rm col0 sh gr
+% Polyline
+15.000 slw
+n 9000 3300 m 9000 4275 l gs col0 s gr
+% Polyline
+2 slc
+n 7875 3225 m 7800 3225 l gs col0 s gr
+% Polyline
+0 slc
+n 7875 4125 m 7800 4125 l gs col0 s gr
+% Polyline
+n 7875 3225 m 7875 4425 l gs col0 s gr
+% Polyline
+n 7875 3825 m 7800 3825 l gs col0 s gr
+% Polyline
+n 7875 3525 m 7800 3525 l gs col0 s gr
+% Polyline
+n 8175 3825 m 7875 3825 l gs col0 s gr
+% Polyline
+2 slc
+n 7875 4425 m 7800 4425 l gs col0 s gr
+/Courier-Bold ff 180.00 scf sf
+8700 3900 m
+gs 1 -1 sc (fxp0) dup sw pop neg 0 rm col0 sh gr
+% Polyline
+0 slc
+7.500 slw
+n 2925 1425 m 3075 1425 l gs col0 s gr
+% Polyline
+15.000 slw
+n 2475 1350 m 2472 1347 l 2465 1342 l 2453 1334 l 2438 1323 l 2420 1311 l
+ 2401 1299 l 2383 1289 l 2366 1281 l 2351 1275 l 2338 1274 l
+ 2325 1275 l 2314 1279 l 2303 1285 l 2291 1293 l 2278 1303 l
+ 2264 1314 l 2250 1326 l 2236 1339 l 2222 1353 l 2209 1366 l
+ 2198 1379 l 2188 1391 l 2181 1403 l 2177 1414 l 2175 1425 l
+ 2177 1436 l 2181 1447 l 2188 1459 l 2198 1471 l 2209 1484 l
+ 2222 1497 l 2236 1511 l 2250 1524 l 2264 1536 l 2278 1547 l
+ 2291 1557 l 2303 1565 l 2314 1571 l 2325 1575 l 2338 1576 l
+ 2351 1575 l 2366 1569 l 2383 1561 l 2401 1551 l 2420 1539 l
+ 2438 1527 l 2453 1516 l 2465 1508 l 2472 1503 l 2475 1500 l gs col0 s gr
+/Courier-Bold ff 180.00 scf sf
+2550 1500 m
+gs 1 -1 sc (lo0) col0 sh gr
+/Courier-BoldOblique ff 180.00 scf sf
+3075 1500 m
+gs 1 -1 sc (127.0.0.1) col0 sh gr
+% Polyline
+7.500 slw
+n 2100 3525 m 2250 3525 l gs col0 s gr
+% Polyline
+n 2550 2100 m 2250 2400 l 2250 4500 l 2550 4800 l gs col0 s gr
+/Courier-Bold ff 180.00 scf sf
+1950 3600 m
+gs 1 -1 sc (/) col0 sh gr
+/Courier-Bold ff 180.00 scf sf
+2550 3900 m
+gs 1 -1 sc (jail_1/) col0 sh gr
+/Courier-Bold ff 180.00 scf sf
+2550 4200 m
+gs 1 -1 sc (jail_2/) col0 sh gr
+/Courier-Bold ff 180.00 scf sf
+2550 4500 m
+gs 1 -1 sc (jail_3/) col0 sh gr
+/Courier-Bold ff 180.00 scf sf
+2550 2400 m
+gs 1 -1 sc (dev/) col0 sh gr
+/Courier-Bold ff 180.00 scf sf
+2550 2700 m
+gs 1 -1 sc (etc/) col0 sh gr
+/Courier-Bold ff 180.00 scf sf
+2550 3000 m
+gs 1 -1 sc (usr/) col0 sh gr
+/Courier-Bold ff 180.00 scf sf
+2550 3300 m
+gs 1 -1 sc (var/) col0 sh gr
+/Courier-Bold ff 180.00 scf sf
+2550 3600 m
+gs 1 -1 sc (home/) col0 sh gr
+% Polyline
+n 3375 3825 m 3900 3825 l 4950 1800 l 5100 1800 l gs col0 s gr
+% Polyline
+n 3375 4125 m 3900 4125 l 4950 3900 l 5100 3900 l gs col0 s gr
+% Polyline
+n 5400 900 m 5100 1200 l 5100 2400 l 5400 2700 l gs col0 s gr
+% Polyline
+n 5400 3000 m 5100 3300 l 5100 4500 l 5400 4800 l gs col0 s gr
+% Polyline
+n 4650 825 m 4650 2775 l 6675 2775 l 6675 3375 l 7950 3375 l 7950 825 l
+ cp gs col0 s gr
+% Polyline
+n 4650 2775 m 4650 4950 l 6300 4950 l 6300 3675 l 7950 3675 l 7950 3375 l
+ 6675 3375 l 6675 2775 l cp gs col0 s gr
+/Courier-Bold ff 180.00 scf sf
+5400 1200 m
+gs 1 -1 sc (dev/) col0 sh gr
+/Courier-Bold ff 180.00 scf sf
+5400 1500 m
+gs 1 -1 sc (etc/) col0 sh gr
+/Courier-Bold ff 180.00 scf sf
+5400 1800 m
+gs 1 -1 sc (usr/) col0 sh gr
+/Courier-Bold ff 180.00 scf sf
+5400 2100 m
+gs 1 -1 sc (var/) col0 sh gr
+/Courier-Bold ff 180.00 scf sf
+5400 2400 m
+gs 1 -1 sc (home/) col0 sh gr
+/Courier-Bold ff 180.00 scf sf
+5400 3300 m
+gs 1 -1 sc (dev/) col0 sh gr
+/Courier-Bold ff 180.00 scf sf
+5400 3600 m
+gs 1 -1 sc (etc/) col0 sh gr
+/Courier-Bold ff 180.00 scf sf
+5400 3900 m
+gs 1 -1 sc (usr/) col0 sh gr
+/Courier-Bold ff 180.00 scf sf
+5400 4200 m
+gs 1 -1 sc (var/) col0 sh gr
+/Courier-Bold ff 180.00 scf sf
+5400 4500 m
+gs 1 -1 sc (home/) col0 sh gr
+/Courier-BoldOblique ff 180.00 scf sf
+7725 3300 m
+gs 1 -1 sc (10.0.0.1) dup sw pop neg 0 rm col0 sh gr
+/Courier-BoldOblique ff 180.00 scf sf
+7725 4500 m
+gs 1 -1 sc (10.0.0.5) dup sw pop neg 0 rm col0 sh gr
+/Courier-BoldOblique ff 180.00 scf sf
+7725 4200 m
+gs 1 -1 sc (10.0.0.4) dup sw pop neg 0 rm col0 sh gr
+/Courier-BoldOblique ff 180.00 scf sf
+7725 3900 m
+gs 1 -1 sc (10.0.0.3) dup sw pop neg 0 rm col0 sh gr
+% Polyline
+15.000 slw
+n 9000 3825 m 8775 3825 l gs col0 s gr
+$F2psEnd
+rs
diff --git a/share/doc/papers/jail/jail01.fig b/share/doc/papers/jail/jail01.fig
new file mode 100644
index 000000000000..d4ef1655e195
--- /dev/null
+++ b/share/doc/papers/jail/jail01.fig
@@ -0,0 +1,86 @@
+#FIG 3.2
+# $FreeBSD$
+Landscape
+Center
+Inches
+A4
+100.00
+Single
+-2
+1200 2
+6 7725 3150 9075 4500
+6 8700 3225 9075 4350
+2 1 0 2 0 7 100 0 -1 0.000 0 0 -1 0 0 2
+ 9000 3825 8775 3825
+2 1 0 2 0 7 100 0 -1 0.000 0 0 -1 0 0 2
+ 9000 3300 9000 4275
+-6
+2 1 0 2 0 7 100 0 -1 0.000 0 2 -1 0 0 2
+ 7875 3225 7800 3225
+2 1 0 2 0 7 100 0 -1 0.000 0 0 -1 0 0 2
+ 7875 4125 7800 4125
+2 1 0 2 0 7 100 0 -1 0.000 0 0 -1 0 0 2
+ 7875 3225 7875 4425
+2 1 0 2 0 7 100 0 -1 0.000 0 0 -1 0 0 2
+ 7875 3825 7800 3825
+2 1 0 2 0 7 100 0 -1 0.000 0 0 -1 0 0 2
+ 7875 3525 7800 3525
+2 1 0 2 0 7 100 0 -1 0.000 0 0 -1 0 0 2
+ 8175 3825 7875 3825
+2 1 0 2 0 7 100 0 -1 0.000 0 2 -1 0 0 2
+ 7875 4425 7800 4425
+4 2 0 100 0 14 12 0.0000 4 180 420 8700 3900 fxp0\001
+-6
+6 2100 1200 4050 1650
+2 1 0 1 0 7 100 0 -1 0.000 0 0 -1 0 0 2
+ 2925 1425 3075 1425
+3 2 0 2 0 7 100 0 -1 0.000 0 0 0 5
+ 2475 1350 2325 1275 2175 1425 2325 1575 2475 1500
+ 0.000 -1.000 -1.000 -1.000 0.000
+4 0 0 100 0 14 12 0.0000 4 135 315 2550 1500 lo0\001
+4 0 0 100 0 15 12 0.0000 4 135 945 3075 1500 127.0.0.1\001
+-6
+6 1950 2100 3300 4800
+2 1 0 1 0 7 100 0 -1 0.000 0 0 -1 0 0 2
+ 2100 3525 2250 3525
+2 1 0 1 0 7 100 0 -1 0.000 0 0 -1 0 0 4
+ 2550 2100 2250 2400 2250 4500 2550 4800
+4 0 0 100 0 14 12 0.0000 4 150 105 1950 3600 /\001
+4 0 0 100 0 14 12 0.0000 4 180 735 2550 3900 jail_1/\001
+4 0 0 100 0 14 12 0.0000 4 180 735 2550 4200 jail_2/\001
+4 0 0 100 0 14 12 0.0000 4 180 735 2550 4500 jail_3/\001
+4 0 0 100 0 14 12 0.0000 4 165 420 2550 2400 dev/\001
+4 0 0 100 0 14 12 0.0000 4 150 420 2550 2700 etc/\001
+4 0 0 100 0 14 12 0.0000 4 150 420 2550 3000 usr/\001
+4 0 0 100 0 14 12 0.0000 4 150 420 2550 3300 var/\001
+4 0 0 100 0 14 12 0.0000 4 165 525 2550 3600 home/\001
+-6
+2 1 0 1 0 7 100 0 -1 0.000 0 0 -1 0 0 4
+ 3375 3825 3900 3825 4950 1800 5100 1800
+2 1 0 1 0 7 100 0 -1 0.000 0 0 -1 0 0 4
+ 3375 4125 3900 4125 4950 3900 5100 3900
+2 1 0 1 0 7 100 0 -1 0.000 0 0 -1 0 0 4
+ 5400 900 5100 1200 5100 2400 5400 2700
+2 1 0 1 0 7 100 0 -1 0.000 0 0 -1 0 0 4
+ 5400 3000 5100 3300 5100 4500 5400 4800
+2 3 0 1 0 7 100 0 -1 0.000 0 0 -1 0 0 7
+ 4650 825 4650 2775 6675 2775 6675 3375 7950 3375 7950 825
+ 4650 825
+2 3 0 1 0 7 100 0 -1 0.000 0 0 -1 0 0 9
+ 4650 2775 4650 4950 6300 4950 6300 3675 7950 3675 7950 3375
+ 6675 3375 6675 2775 4650 2775
+4 0 0 100 0 14 12 0.0000 4 165 420 5400 1200 dev/\001
+4 0 0 100 0 14 12 0.0000 4 150 420 5400 1500 etc/\001
+4 0 0 100 0 14 12 0.0000 4 150 420 5400 1800 usr/\001
+4 0 0 100 0 14 12 0.0000 4 150 420 5400 2100 var/\001
+4 0 0 100 0 14 12 0.0000 4 165 525 5400 2400 home/\001
+4 0 0 100 0 14 12 0.0000 4 165 420 5400 3300 dev/\001
+4 0 0 100 0 14 12 0.0000 4 150 420 5400 3600 etc/\001
+4 0 0 100 0 14 12 0.0000 4 150 420 5400 3900 usr/\001
+4 0 0 100 0 14 12 0.0000 4 150 420 5400 4200 var/\001
+4 0 0 100 0 14 12 0.0000 4 165 525 5400 4500 home/\001
+4 2 0 100 0 15 12 0.0000 4 135 840 7725 3300 10.0.0.1\001
+4 2 0 100 0 15 12 0.0000 4 135 840 7725 4500 10.0.0.5\001
+4 2 0 100 0 15 12 0.0000 4 135 840 7725 4200 10.0.0.4\001
+4 2 0 100 0 15 12 0.0000 4 135 840 7725 3900 10.0.0.3\001
+4 2 0 100 0 15 12 0.0000 4 135 840 7725 3600 10.0.0.2\001
diff --git a/share/doc/papers/jail/mgt.ms b/share/doc/papers/jail/mgt.ms
new file mode 100644
index 000000000000..f3ab716dfbf0
--- /dev/null
+++ b/share/doc/papers/jail/mgt.ms
@@ -0,0 +1,216 @@
+.\"
+.\" $FreeBSD$
+.\"
+.NH
+Managing Jails and the Jail File System Environment
+.NH 2
+Creating a Jail Environment
+.PP
+While the jail(2) call could be used in a number of ways, the expected
+configuration creates a complete FreeBSD installation for each jail.
+This includes copies of all relevant system binaries, data files, and its
+own \fC/etc\fP directory.
+Such a configuration maximises the independence of various jails,
+and reduces the chance of interference between jails,
+especially when it is desirable to provide root access within a jail to
+a less trusted user.
+.PP
+On a box making use of the jail facility, we refer to two types of
+environment: the host environment, and the jail environment.
+The host environment is the real operating system environment, which is
+used to configure interfaces, and start up the jails.
+There are then one or more jail environments, effectively virtual
+FreeBSD machines.
+When configuring Jail for use, it is necessary to configure both the
+host and jail environments to prevent overlap.
+.PP
+As jailed virtual machines are generally bound to an IP address configured
+using the normal IP alias mechanism, those jail IP addresses are also
+accessible to host environment applications.
+If the accessibility of some host applications in the jail environment is
+not desirable, it is necessary to configure those applications to only
+listen on appropriate addresses.
+.PP
+In most of the production environments where jail is currently in use,
+one IP address is allocated to the host environment, and then a number
+are allocated to jail boxes, with each jail box receiving a unique IP.
+In this situation, it is sufficient to configure the networking applications
+on the host to listen only on the host IP.
+Generally, this consists of specifying the appropriate IP address to be
+used by inetd and SSH, and disabling applications that are not capable
+of limiting their address scope, such as sendmail, the port mapper, and
+syslogd.
+Other third party applications that have been installed on the host must also be
+configured in this manner, or users connecting to the jailbox will
+discover the host environment service, unless the jailbox has
+specifically bound a service to that port.
+In some situations, this can actually be the desirable behaviour.
+.PP
+The jail environments must also be custom-configured.
+This consists of building and installing a miniature version of the
+FreeBSD file system tree off of a subdirectory in the host environment,
+usually \fC/usr/jail\fP, or \fC/data/jail\fP, with a subdirectory per jail.
+Appropriate instructions for generating this tree are included in the
+jail(8) man page, but generally this process may be automated using the
+FreeBSD build environment.
+.PP
+One notable difference from the default FreeBSD install is that only
+a limited set of device nodes should be created.
+.PP
+To improve storage efficiency, a fair number of the binaries in the system tree
+may be deleted, as they are not relevant in a jail environment.
+This includes the kernel, boot loader, and related files, as well as
+hardware and network configuration tools.
+.PP
+After the creation of the jail tree, the easiest way to configure it is
+to start up the jail in single-user mode.
+The sysinstall admin tool may be used to help with the task, although
+it is not installed by default as part of the system tree.
+These tools should be run in the jail environment, or they will affect
+the host environment's configuration.
+.DS
+.ft C
+.ps -2
+# mkdir /data/jail/192.168.11.100/stand
+# cp /stand/sysinstall /data/jail/192.168.11.100/stand
+# jail /data/jail/192.168.11.100 testhostname 192.168.11.100 \e
+ /bin/sh
+.ps +2
+.R
+.DE
+.PP
+After running the jail command, the shell is now within the jail environment,
+and all further commands
+will be limited to the scope of the jail until the shell exits.
+If the network alias has not yet been configured, then the jail will be
+unable to access the network.
+.PP
+The startup configuration of the jail environment may be adjusted so
+as to quell warnings from services that cannot run in the jail.
+Any per-system configuration required for a normal FreeBSD system
+is also required for each jailbox.
+Typically, this includes:
+.IP "" 5n
+\(bu Create empty /etc/fstab
+.IP
+\(bu Disable portmapper
+.IP
+\(bu Run newaliases
+.IP
+\(bu Disable interface configuration
+.IP
+\(bu Configure the resolver
+.IP
+\(bu Set root password
+.IP
+\(bu Set timezone
+.IP
+\(bu Add any local accounts
+.IP
+\(bu Install any packages
+.NH 2
+Starting Jails
+.PP
+Jails are typically started by executing their /etc/rc script in much
+the same manner as a shell was started in the previous section.
+Before starting the jail, any relevant networking configuration
+should also be performed.
+Typically, this involves adding an additional IP address to the
+appropriate network interface, setting network properties for the
+IP address using IP filtering, forwarding, and bandwidth shaping,
+and mounting a process file system for the jail, if the ability to
+debug processes from within the jail is desired.
+.DS
+.ft C
+.ps -2
+# ifconfig ed0 inet add 192.168.11.100 netmask 255.255.255.255
+# mount -t procfs proc /data/jail/192.168.11.100/proc
+# jail /data/jail/192.168.11.100 testhostname 192.168.11.100 \e
+ /bin/sh /etc/rc
+.ps +2
+.ft P
+.DE
+.PP
+A few warnings are generated for sysctls that are not permitted
+to be set within the jail, but the end result is a set of processes
+in an isolated process environment, bound to a single IP address.
+Normal procedures for accessing a FreeBSD machine apply: telnetting in
+through the network reveals a telnet prompt, login, and shell.
+.DS
+.ft C
+.ps -2
+% ps ax
+ PID TT STAT TIME COMMAND
+ 228 ?? SsJ 0:18.73 syslogd
+ 247 ?? IsJ 0:00.05 inetd -wW
+ 249 ?? IsJ 0:28.43 cron
+ 252 ?? SsJ 0:30.46 sendmail: accepting connections on port 25
+ 291 ?? IsJ 0:38.53 /usr/local/sbin/sshd
+93694 ?? SJ 0:01.01 sshd: rwatson@ttyp0 (sshd)
+93695 p0 SsJ 0:00.06 -csh (csh)
+93700 p0 R+J 0:00.00 ps ax
+.ps +2
+.ft P
+.DE
+.PP
+It is immediately obvious that the environment is within a jailbox: there
+is no init process, no kernel daemons, and a J flag is present beside all
+processes indicating the presence of a jail.
+.PP
+As with any FreeBSD system, accounts may be created and deleted,
+mail is delivered, logs are generated, packages may be added, and the
+system may be hacked into if configured incorrectly, or running a buggy
+version of a piece of software.
+However, all of this happens strictly within the scope of the jail.
+.NH 2
+Jail Management
+.PP
+Jail management is an interesting prospect, as there are two perspectives
+from which a jail environment may be administered: from within the jail,
+and from the host environment.
+From within the jail, as described above, the process is remarkably similar
+to any regular FreeBSD install, although certain actions are prohibited,
+such as mounting file systems, modifying system kernel properties, etc.
+The only area that really differs is that of shutting
+the system down: the processes within the jail may deliver signals
+between them, allowing all processes to be killed, but bringing the
+system back up requires intervention from outside of the jailbox.
+.PP
+From outside of the jail, there is a range of capabilities, as well
+as limitations.
+The jail environment is, in effect, a subset of the host environment:
+the jail file system appears as part of the host file system, and may
+be directly modified by processes in the host environment.
+Processes within the jail appear in the process listing of the host,
+and may likewise be signalled or debugged.
+The host process file system makes the hostname of the jail environment
+accessible in /proc/procnum/status, allowing utilities in the host
+environment to manage processes based on jailname.
+However, the default configuration allows privileged processes within
+jails to set the hostname of the jail, which makes the status file less
+useful from a management perspective if the contents of the jail are
+malicious.
+To prevent a jail from changing its hostname, the
+"security.jail.set_hostname_allowed" sysctl may be set to 0 prior to
+starting any jails.
+.PP
+One aspect immediately observable in an environment with multiple jails
+is that uids and gids are local to each jail environment: the uid associated
+with a process in one jail may be for a different user than in another
+jail.
+This collision of identifiers is only visible in the host environment,
+as normally processes from one jail are never visible in an environment
+with another scope for user/uid and group/gid mapping.
+Managers in the host environment should understand these scoping issues,
+or confusion and unintended consequences may result.
+.PP
+Jailed processes are subject to the normal restrictions present for
+any processes, including resource limits, and limits placed by the network
+code, including firewall rules.
+By specifying firewall rules for the IP address bound to a jail, it is
+possible to place connectivity and bandwidth limitations on individual
+jails, restricting services that may be consumed or offered.
+.PP
+Management of jails is an area that will see further improvement in
+future versions of FreeBSD. Some of these potential improvements are
+discussed later in this paper.
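
Editorial sketch (not part of the patch): the start-up sequence shown under "Starting Jails" (add the IP alias, mount a process file system, then run /etc/rc inside the jail) can be captured in a small helper. The Python function below is a hypothetical illustration that only builds the argument lists; it executes nothing, and it mirrors the command syntax printed in this paper, which may differ from other versions of jail(8). The function name and its parameters are assumptions made for the example.

```python
def jail_start_commands(root, hostname, ip, iface):
    """Return argv lists mirroring the three commands shown in the paper.

    root     -- directory holding the jail's file system tree
    hostname -- hostname given to the jail
    ip       -- the single IP address the jail is bound to
    iface    -- host network interface that receives the alias
    """
    return [
        # Add the jail's address as an alias on the host interface.
        ["ifconfig", iface, "inet", "add", ip, "netmask", "255.255.255.255"],
        # Optional: procfs, if debugging from inside the jail is desired.
        ["mount", "-t", "procfs", "proc", root + "/proc"],
        # Start the jail by running its own /etc/rc.
        ["jail", root, hostname, ip, "/bin/sh", "/etc/rc"],
    ]


if __name__ == "__main__":
    for argv in jail_start_commands(
        "/data/jail/192.168.11.100", "testhostname", "192.168.11.100", "ed0"
    ):
        print("# " + " ".join(argv))
```

Returning argument lists rather than running the commands keeps the sketch side-effect free; an administrator would review the output before executing it as root in the host environment.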
diff --git a/share/doc/papers/jail/paper.ms b/share/doc/papers/jail/paper.ms
new file mode 100644
index 000000000000..60be9f2748bd
--- /dev/null
+++ b/share/doc/papers/jail/paper.ms
@@ -0,0 +1,438 @@
+.\"
+.\" $FreeBSD$
+.\"
+.if n .ftr C R
+.ig TL
+.ds CH "
+.nr PI 2n
+.nr PS 12
+.nr LL 15c
+.nr PO 3c
+.nr FM 3.5c
+.po 3c
+.TL
+Jails: Confining the omnipotent root.
+.FS
+This paper was presented at the 2nd International System Administration and Networking Conference "SANE 2000" May 22-25, 2000 in Maastricht, The Netherlands and is published in the proceedings.
+.FE
+.AU
+Poul-Henning Kamp <phk@FreeBSD.org>
+.AU
+Robert N. M. Watson <rwatson@FreeBSD.org>
+.AI
+The FreeBSD Project
+.FS
+This work was sponsored by \fChttp://www.servetheweb.com/\fP and
+donated to the FreeBSD Project for inclusion in the FreeBSD
+OS. FreeBSD 4.0-RELEASE was the first release including this
+code.
+Follow-on work was sponsored by Safeport Network Services,
+\fChttp://www.safeport.com/\fP
+.FE
+.AB
+The traditional UNIX security model is simple but inexpressive.
+Adding fine-grained access control improves the expressiveness,
+but often dramatically increases both the cost of system management
+and implementation complexity.
+In environments with a more complex management model, with delegation
+of some management functions to parties under varying degrees of trust,
+the base UNIX model and most natural
+extensions are inappropriate at best.
+Where multiple mutually un-trusting parties are introduced,
+``inappropriate'' rapidly transitions to ``nightmarish'', especially
+with regard to data integrity and privacy protection.
+.PP
+The FreeBSD ``Jail'' facility provides the ability to partition
+the operating system environment, while maintaining the simplicity
+of the UNIX ``root'' model.
+In Jail, users with privilege find that the scope of their requests
+is limited to the jail, allowing system administrators to delegate
+management capabilities for each virtual machine
+environment.
+Creating virtual machines in this manner has many potential uses; the
+most popular thus far has been for providing virtual machine services
+in Internet Service Provider environments.
+.AE
+.NH
+Introduction
+.PP
+The UNIX access control mechanism is designed for an environment with two
+types of users: those with and those without administrative privilege.
+Within this framework, every attempt is made to provide an open
+system, allowing easy sharing of files and inter-process communication.
+As a member of the UNIX family, FreeBSD inherits these
+security properties.
+Users of FreeBSD in non-traditional UNIX environments must balance
+their need for strong application support, high network performance
+and functionality, and low total cost of ownership with the need
+for alternative security models that are difficult or impossible to
+implement with the UNIX security mechanisms.
+.PP
+One such consideration is the desire to delegate some (but not all)
+administrative functions to untrusted or less trusted parties, and
+simultaneously impose system-wide mandatory policies on process
+interaction and sharing.
+Attempting to create such an environment in the current-day FreeBSD
+security environment is both difficult and costly: in many cases,
+the burden of implementing these policies falls on user
+applications, which means an increase in the size and complexity
+of the code base, in turn translating to higher development
+and maintenance cost, as well as less overall flexibility.
+.PP
+This abstract risk becomes more clear when applied to a practical,
+real-world example:
+many web service providers turn to the FreeBSD
+operating system to host customer web sites, as it provides a
+high-performance, network-centric server environment.
+However, these providers have a number of concerns on their plate, both in
+terms of protecting the integrity and confidentiality of their own
+files and services from their customers, as well as protecting the files
+and services of one customer from (accidental or
+intentional) access by any other customer.
+At the same time, a provider would like to provide
+substantial autonomy to customers, allowing them to install and
+maintain their own software, and to manage their own services,
+such as web servers and other content-related daemon programs.
+.PP
+This problem space points strongly in the direction of a partitioning
+solution, in which customer processes and storage are isolated from those of
+other customers, both in terms of accidental disclosure of data or process
+information, and in terms of the ability to modify files or processes
+outside of a compartment.
+Delegation of management functions within the system must
+be possible, but not at the cost of system-wide requirements, including
+integrity and privacy protection between partitions.
+.PP
+However, UNIX-style access control makes it notoriously difficult to
+compartmentalise functionality.
+While mechanisms such as chroot(2) provide a modest
+level of compartmentalisation, it is well known
+that these mechanisms have serious shortcomings, both in terms of the
+scope of their functionality, and effectiveness at what they provide \s-2[CHROOT]\s+2.
+.PP
+In the case of the chroot(2) call, a process's visibility of
+the file system name-space is limited to a single subtree.
+However, the compartmentalisation does not extend to the process
+or networking spaces and therefore both observation of and interference
+with processes outside their compartment is possible.
+.PP
+To this end, we describe the new FreeBSD ``Jail'' facility, which
+provides a strong partitioning solution, leveraging existing
+mechanisms, such as chroot(2), to create what effectively amounts to a
+virtual machine environment. Processes in a jail are provided
+full access to the files that they may manipulate, processes they
+may influence, and network services they can make use of, and neither
+access to nor visibility of files, processes or network services outside
+their partition.
+.PP
+Unlike other fine-grained security solutions, Jail does not
+substantially increase the policy management requirements for the
+system administrator, as each Jail is a virtual FreeBSD environment
+permitting local policy to be independently managed, with much the
+same properties as the main system itself, making Jail easy to use
+for the administrator, and far more compatible with applications.
+.NH
+Traditional UNIX Security, or, ``God, root, what difference?'' \s-2[UF]\s+2.
+.PP
+The traditional UNIX access model assigns numeric uids to each user of the
+system. In turn, each process ``owned'' by a user will be tagged with that
+user's uid in an unforgeable manner. The uids serve two purposes: first,
+they determine how discretionary access control mechanisms will be applied, and
+second, they are used to determine whether special privileges are accorded.
+.PP
+In the case of discretionary access controls, the primary object protected is
+a file. The uid (and related gids indicating group membership) are mapped to
+a set of rights for each object, courtesy of the UNIX file mode, in effect acting
+as a limited form of access control list. Jail is, in general, not concerned
+with modifying the semantics of discretionary access control mechanisms,
+although there are important implications from a management perspective.
+.PP
+For the purposes of determining whether special privileges are accorded to a
+process, the check is simple: ``is the numeric uid equal to 0?''.
+If so, the
+process is acting with ``super-user privileges'', and all access checks are
+granted, in effect allowing the process the ability to do whatever it wants
+to \**.
+.FS
+\&... no matter how patently stupid it may be.
+.FE
+.PP
+For the purposes of human convenience, uid 0 is canonically allocated
+to the ``root'' user \s-2[ROOT]\s+2.
+For the purposes of jail, this behaviour is extremely relevant: many of
+these privileged operations can be used to manage system hardware and
+configuration, file system name-space, and special network operations.
+.PP
+Many limitations to this model are immediately clear: the root user is a
+single, concentrated source of privilege that is exposed to many pieces of
+software, and as such an immediate target for attacks. In the event of a
+compromise of the root capability set, the attacker has complete control over
+the system. Even without an attacker, the risks of a single administrative
+account are serious: delegating a narrow scope of capability to an
+inexperienced administrator is difficult, as the granularity of delegation is
+that of all system management abilities. These features make the omnipotent
+root account a sharp, efficient and extremely dangerous tool.
+.PP
+The BSD family of operating systems have implemented the ``securelevel''
+mechanism which allows the administrator to block certain configuration
+and management functions from being performed by root,
+until the system is restarted and brought up into single-user mode.
+While this does provide some amount of protection in the case of a root
+compromise of the machine, it does nothing to address the need for
+delegation of certain root abilities.
+.NH
+Other Solutions to the Root Problem
+.PP
+Many operating systems attempt to address these limitations by providing
+fine-grained access controls for system resources \s-2[BIBA]\s+2.
+These efforts vary in
+degrees of success, but almost all suffer from at least three serious
+limitations:
+.PP
+First, increasing the granularity of security controls increases the
+complexity of the administration process, in turn increasing both the
+opportunity for incorrect configuration and the demand on
+administrator time and resources. In many cases, the increased complexity
+results in significant frustration for the administrator, which may result
+in two
+disastrous types of policy: ``all doors open as it's too much trouble'', and
+``trust that the system is secure, when in fact it isn't''.
+.PP
+The extent of the trouble is best illustrated by the fact that an entire
+niche industry has emerged providing tools to manage fine-grained security
+controls \s-2[UAS]\s+2.
+.PP
+Second, usefully segregating capabilities and assigning them to running code
+and users is very difficult. Many privileged operations in UNIX seem
+independent, but are in fact closely related, and granting one
+privilege may, in effect, transitively grant many others. For example, in
+some trusted operating systems, a system capability may be assigned to a
+running process to allow it to read any file, for the purposes of backup.
+However, this capability is, in effect, equivalent to the ability to switch to
+any other account, as the ability to access any file provides access to system
+keying material, which in turn provides the ability to authenticate as any
+user. Similarly, many operating systems attempt to segregate management
+capabilities from auditing capabilities. In a number of these operating
+systems, however, ``management capabilities'' permit the administrator to
+assign ``auditing capabilities'' to itself, or another account, circumventing
+the segregation of capability.
+.PP
+Finally, introducing new security features often involves introducing new
+security management APIs. When fine-grained capabilities are introduced to
+replace the setuid mechanism in UNIX-like operating systems, applications that
+previously did an ``appropriateness check'' to see if they were running as
+root before executing must now be changed to know that they need not run as
+root. In the case of applications running with privilege and executing other
+programs, there is now a new set of privileges that must be voluntarily given
+up before executing another program. These changes can introduce significant
+incompatibility for existing applications, and make life more difficult for
+application developers who may not be aware of differing security semantics on
+different systems \s-2[POSIX1e]\s+2.
+.NH
+The Jail Partitioning Solution
+.PP
+Jail neatly side-steps the majority of these problems through partitioning.
+Rather
+than introduce additional fine-grained access control mechanisms, we partition
+a FreeBSD environment (processes, file system, network resources) into a
+management environment, and optionally subset Jail environments. In doing so,
+we simultaneously maintain the existing UNIX security model, allowing
+multiple users and a privileged root user in each jail, while
+limiting the scope of root's activities to that jail.
+Consequently the administrator of a
+FreeBSD machine can partition the machine into separate jails, and provide
+access to the super-user account in each of these without losing control of
+the overall environment.
+.PP
+A process in a partition is referred to as ``in jail''. When a FreeBSD
+system is booted up after a fresh install, no processes will be in jail.
+When
+a process is placed in a jail, it, and any descendants of the process created
+after the jail creation, will be in that jail. A process may be in only one
+jail, and after creation, it cannot leave the jail.
+Jails are created when a
+privileged process calls the jail(2) syscall, with a description of the jail as an
+argument to the call. Each call to jail(2) creates a new jail; the only way
+for a new process to enter the jail is by inheriting access to the jail from
+another process already in that jail.
+Processes may never
+leave the jail they created, or were created in.
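+.PP
+As an illustration only, the membership rules above can be sketched as a
+small model in C. This is not the kernel implementation: the names
+model_jail() and model_fork(), the layout of struct proc, and the
+convention that jail id 0 means ``not in any jail'' are all assumptions
+made for the example, as is the rule that a process already in a jail
+may not create another one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical process record; jail_id 0 means "not in any jail". */
struct proc {
    int pid;
    int jail_id;
};

/* Next jail identifier to hand out (illustrative bookkeeping only). */
static int next_jail_id = 1;

/*
 * Model of a jail(2)-style call: every successful call creates a new
 * jail and places the caller in it.  Returns the new jail id, or -1 if
 * the caller is already jailed (an assumption of this model).
 */
int model_jail(struct proc *p)
{
    assert(p != NULL);
    if (p->jail_id != 0)
        return -1;
    p->jail_id = next_jail_id++;
    return p->jail_id;
}

/*
 * Model of process creation: the child inherits the parent's jail.
 * Inheritance is the only way a new process enters an existing jail.
 */
struct proc model_fork(const struct proc *parent, int child_pid)
{
    struct proc child;

    assert(parent != NULL);
    child.pid = child_pid;
    child.jail_id = parent->jail_id;
    return child;
}
```

+.PP
+In this model a child returned by model_fork() always carries its
+parent's jail id, and nothing ever resets a nonzero id, which is the
+sense in which a process cannot leave its jail.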
+.KF
+.if t .PSPIC jail01.eps 4i
+.ce 1
+Fig. 1 \(em Schematic diagram of machine with two configured jails
+.sp
+.KE
+.PP
+Membership in a jail involves a number of restrictions: access to the file
+name-space is restricted in the style of chroot(2), the ability to bind network
+resources is limited to a specific IP address, the ability to manipulate
+system resources and perform privileged operations is sharply curtailed, and
+the ability to interact with other processes is limited to only processes
+inside the same jail.
+.PP
+Jail takes advantage of the existing chroot(2) behaviour to limit access to the
+file system name-space for jailed processes. When a jail is created, it is
+bound to a particular file system root.
+Processes are unable to manipulate files that they cannot address,
+and as such the integrity and confidentiality of files outside of the jail
+file system root are protected. Traditional mechanisms for breaking out of
+chroot(2) have been blocked.
+In the expected and documented configuration, each jail is provided
+with its own exclusive file system root, and standard FreeBSD directory layout,
+but this is not mandated by the implementation.
+.PP
+Each jail is bound to a single IP address: processes within the jail may not
+make use of any other IP address for outgoing or incoming connections; this
+includes the ability to restrict what network services a particular jail may
+offer. As FreeBSD distinguishes attempts to bind all IP addresses from
+attempts to bind a particular address, bind requests for all IP addresses are
+redirected to the individual Jail address. Some network functionality
+associated with privileged calls is disabled wholesale due to the nature of the
+functionality offered; in particular, facilities that would allow ``spoofing''
+of IP addresses or the generation of disruptive traffic have been disabled.
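+.PP
+The address handling described above can be sketched as follows. This is
+a simplified model under stated assumptions, not the FreeBSD code: the
+function name model_check_bind(), its parameters, and the use of plain
+32-bit integers for IPv4 addresses are hypothetical choices made for
+the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Wildcard address 0.0.0.0, the value conventionally used for INADDR_ANY. */
#define MODEL_INADDR_ANY 0u

/*
 * Decide whether a bind request is acceptable.
 *   jailed  - nonzero if the caller is in a jail
 *   jail_ip - the single address assigned to that jail
 *   addr    - in: requested address; out: address actually bound
 * Returns 0 if the bind may proceed, -1 if it is refused.
 */
int model_check_bind(int jailed, uint32_t jail_ip, uint32_t *addr)
{
    assert(addr != NULL);
    if (!jailed)
        return 0;               /* no jail restriction applies */
    if (*addr == MODEL_INADDR_ANY) {
        *addr = jail_ip;        /* wildcard is redirected to the jail address */
        return 0;
    }
    if (*addr == jail_ip)
        return 0;               /* explicit request for the jail address */
    return -1;                  /* any other address is off limits */
}
```

+.PP
+The point of rewriting the wildcard rather than rejecting it is that
+unmodified servers, which commonly bind all addresses, keep working
+inside a jail.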
+.PP
+Processes running without root privileges will notice few, if any, differences
+between a jailed and an un-jailed environment. Processes running with
+root privileges will find that many restrictions apply to the privileged calls
+they may make. Some calls will now return an access error \(em for example, an
+attempt to create a device node will now fail. Others will have a more
+limited scope than normal \(em attempts to bind a reserved port number on all
+available addresses will result in binding only the address associated with
+the jail. Other calls will succeed as normal: root may read a file owned by
+any uid, as long as it is accessible through the jail file system name-space.
+.PP
+Processes within the jail will find that they are unable to interact with, or
+even verify the existence of,
+processes outside the jail \(em processes within the jail are
+prevented from delivering signals to processes outside the jail, from
+connecting to those processes with debuggers, and even from seeing them in the
+sysctl or process file system monitoring mechanisms. Jail does not prevent,
+nor is it intended to prevent, the use of covert channels or communications
+mechanisms via accepted interfaces \(em for example, two processes may communicate
+via sockets over the IP network interface. Nor does it attempt to provide
+scheduling services based on the partition; however, it does prevent calls
+that interfere with normal process operation.
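+.PP
+A minimal sketch of the process isolation rule, assuming a hypothetical
+integer jail id per process with 0 meaning ``outside every jail''. The
+names model_can_see() and model_signal() are invented for the example,
+and the treatment of unjailed processes as unrestricted is an assumption
+of the model.

```c
#include <assert.h>

/*
 * Returns 1 if a process in actor_jail may observe or act on a process
 * in target_jail, 0 otherwise.
 */
int model_can_see(int actor_jail, int target_jail)
{
    assert(actor_jail >= 0 && target_jail >= 0);
    if (actor_jail == 0)
        return 1;   /* assumption: unjailed processes are not restricted */
    return actor_jail == target_jail;
}

/*
 * Signal delivery, debugger attachment and process listing would all
 * consult the same predicate in this model.  Returns 0 on success and
 * -1 when the target is treated as if it did not exist.
 */
int model_signal(int actor_jail, int target_jail)
{
    return model_can_see(actor_jail, target_jail) ? 0 : -1;
}
```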
+.PP
+As a result of these attempts to retain the standard FreeBSD API and
+framework, almost all applications will run unaffected. Standard system
+services such as Telnet, FTP, and SSH all behave normally, as do most third
+party applications, including the popular Apache web server.
+.NH
+Jail Implementation
+.PP
+Processes running with root privileges in the jail find that there are serious
+restrictions on what they are capable of doing \(em in particular, activities that
+would extend outside of the jail:
+.IP "" 5n
+\(bu Modifying the running kernel by direct access and loading kernel
+modules is prohibited.
+.IP
+\(bu Modifying any of the network configuration, interfaces, addresses, and
+routing table is prohibited.
+.IP
+\(bu Mounting and unmounting file systems is prohibited.
+.IP
+\(bu Creating device nodes is prohibited.
+.IP
+\(bu Accessing raw, divert, or routing sockets is prohibited.
+.IP
+\(bu Modifying kernel runtime parameters, such as most sysctl settings, is
+prohibited.
+.IP
+\(bu Changing securelevel-related file flags is prohibited.
+.IP
+\(bu Accessing network resources not associated with the jail is prohibited.
+.PP
+Other privileged activities are permitted as long as they are limited to the
+scope of the jail:
+.IP "" 5n
+\(bu Signalling any process within the jail is permitted.
+.IP
+\(bu Changing the ownership and mode of any file within the jail is permitted, as
+long as the file flags permit this.
+.IP
+\(bu Deleting any file within the jail is permitted, as long as the file flags
+permit this.
+.IP
+\(bu Binding reserved TCP and UDP port numbers on the jail's IP address is
+permitted. (Attempts to bind TCP and UDP ports using INADDR_ANY will be
+redirected to the jail's IP address.)
+.IP
+\(bu Functions which operate on the uid/gid space are all permitted since they
+act as labels for file system objects or processes
+which are partitioned off by other mechanisms.
+.PP
+These restrictions on root access limit the scope of root processes, enabling
+most applications to run un-hindered, but preventing calls that might allow an
+application to reach beyond the jail and influence other processes or
+system-wide configuration.
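+.PP
+The two lists above amount to a policy table, which can be sketched as
+follows. The enumeration and the function model_root_may() are
+hypothetical names for this example, and the model is deliberately
+coarse: it treats ``most sysctl settings'' as all of them, ignores file
+flags, and ignores securelevel for root outside a jail.

```c
#include <assert.h>

/* Privileged operations named in the lists above (illustrative labels). */
enum model_op {
    OP_MODIFY_KERNEL,          /* direct access, loading kernel modules */
    OP_CONFIGURE_NETWORK,      /* interfaces, addresses, routing table */
    OP_MOUNT,                  /* mounting and unmounting file systems */
    OP_MKNOD,                  /* creating device nodes */
    OP_RAW_SOCKET,             /* raw, divert, or routing sockets */
    OP_SET_SYSCTL,             /* kernel runtime parameters */
    OP_CHANGE_SECURE_FLAGS,    /* securelevel-related file flags */
    OP_SIGNAL_IN_JAIL,         /* signalling a process in the same jail */
    OP_CHOWN_CHMOD,            /* ownership and mode of files in the jail */
    OP_UNLINK,                 /* deleting files in the jail */
    OP_BIND_RESERVED_PORT,     /* reserved ports on the jail's address */
    OP_CHANGE_UID_GID          /* functions operating on the uid/gid space */
};

/* Returns 1 if a root process may perform op, 0 if it is refused. */
int model_root_may(int jailed, enum model_op op)
{
    if (!jailed)
        return 1;   /* simplification: securelevel is not modelled */

    switch (op) {
    case OP_MODIFY_KERNEL:
    case OP_CONFIGURE_NETWORK:
    case OP_MOUNT:
    case OP_MKNOD:
    case OP_RAW_SOCKET:
    case OP_SET_SYSCTL:
    case OP_CHANGE_SECURE_FLAGS:
        return 0;   /* would reach beyond the jail */
    case OP_SIGNAL_IN_JAIL:
    case OP_CHOWN_CHMOD:
    case OP_UNLINK:
    case OP_BIND_RESERVED_PORT:
    case OP_CHANGE_UID_GID:
        return 1;   /* confined to the jail's own scope */
    }
    return 0;       /* unknown operations are refused */
}
```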
+.PP
+.so implementation.ms
+.so mgt.ms
+.so future.ms
+.NH
+Conclusion
+.PP
+The jail facility provides FreeBSD with a conceptually simple security
+partitioning mechanism, allowing the delegation of administrative rights
+within virtual machine partitions.
+.PP
+The implementation relies on
+restricting access within the jail environment to a well-defined subset
+of the overall host environment. This includes limiting interaction
+between processes, and to files, network resources, and privileged
+operations. Administrative overhead is reduced through avoiding
+fine-grained access control mechanisms, and maintaining a consistent
+administrative interface across partitions and the host environment.
+.PP
+The jail facility has already seen widespread deployment, in particular as
+a vehicle for delivering ``virtual private server'' services.
+.PP
+The jail code is included in the base system as part of FreeBSD 4.0-RELEASE,
+and fully documented in the jail(2) and jail(8) man-pages.
+.bp
+.SH
+Notes & References
+.IP \s-2[BIBA]\s+2 .5i
+K. J. Biba, Integrity Considerations for Secure
+Computer Systems, USAF Electronic Systems Division, 1977
+.IP \s-2[CHROOT]\s+2 .5i
+Dr. Marshall Kirk McKusick, private communication:
+``According to the SCCS logs, the chroot call was added by Bill Joy
+on March 18, 1982 approximately 1.5 years before 4.2BSD was released.
+That was well before we had ftp servers of any sort (ftp did not
+show up in the source tree until January 1983). My best guess as
+to its purpose was to allow Bill to chroot into the /4.2BSD build
+directory and build a system using only the files, include files,
+etc contained in that tree. That was the only use of chroot that
+I remember from the early days.''
+.IP \s-2[LOTTERY1]\s+2 .5i
+David Petrou and John Milford. Proportional-Share Scheduling:
+Implementation and Evaluation in a Widely-Deployed Operating System,
+December 1997.
+.nf
+\s-2\fChttp://www.cs.cmu.edu/~dpetrou/papers/freebsd_lottery_writeup98.ps\fP\s+2
+\s-2\fChttp://www.cs.cmu.edu/~dpetrou/code/freebsd_lottery_code.tar.gz\fP\s+2
+.IP \s-2[LOTTERY2]\s+2 .5i
+Carl A. Waldspurger and William E. Weihl. Lottery Scheduling: Flexible Proportional-Share Resource Management, Proceedings of the First Symposium on Operating Systems Design and Implementation (OSDI '94), pages 1-11, Monterey, California, November 1994.
+.nf
+\s-2\fChttp://www.research.digital.com/SRC/personal/caw/papers.html\fP\s+2
+.IP \s-2[POSIX1e]\s+2 .5i
+Draft Standard for Information Technology \(em
+Portable Operating System Interface (POSIX) \(em
+Part 1: System Application Program Interface (API) \(em Amendment:
+Protection, Audit and Control Interfaces [C Language]
+IEEE Std 1003.1e Draft 17 Editor Casey Schaufler
+.IP \s-2[ROOT]\s+2 .5i
+Historically, other names have been used at times; Zilog, for instance,
+called the super-user account ``zeus''.
+.IP \s-2[UAS]\s+2 .5i
+One such niche product is the ``UAS'' system to maintain and audit
+RACF configurations on MVS systems.
+.nf
+\s-2\fChttp://www.entactinfo.com/products/uas/\fP\s+2
+.IP \s-2[UF]\s+2 .5i
+Quote from the User-Friendly cartoon by Illiad.
+.nf
+\s-2\fChttp://www.userfriendly.org/cartoons/archives/98nov/19981111.html\fP\s+2
diff --git a/share/doc/papers/kernmalloc/Makefile b/share/doc/papers/kernmalloc/Makefile
new file mode 100644
index 000000000000..02908918a474
--- /dev/null
+++ b/share/doc/papers/kernmalloc/Makefile
@@ -0,0 +1,14 @@
+# From: @(#)Makefile 1.8 (Berkeley) 6/8/93
+# $FreeBSD$
+
+VOLUME= papers
+DOC= kernmalloc
+SRCS= kernmalloc.t appendix.ms
+EXTRA= alloc.fig usage.tbl
+MACROS= -ms
+USE_EQN=
+USE_PIC=
+USE_SOELIM=
+USE_TBL=
+
+.include <bsd.doc.mk>
diff --git a/share/doc/papers/kernmalloc/alloc.fig b/share/doc/papers/kernmalloc/alloc.fig
new file mode 100644
index 000000000000..1ef260b9ac7c
--- /dev/null
+++ b/share/doc/papers/kernmalloc/alloc.fig
@@ -0,0 +1,115 @@
+.\" Copyright (c) 1988 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)alloc.fig 5.1 (Berkeley) 4/16/91
+.\"
+.PS
+scale=100
+define m0 |
+[ box invis ht 16 wid 32 with .sw at 0,0
+line from 4,12 to 4,4
+line from 8,12 to 8,4
+line from 12,12 to 12,4
+line from 16,12 to 16,4
+line from 20,12 to 20,4
+line from 24,12 to 24,4
+line from 28,12 to 28,4
+line from 0,16 to 0,0
+line from 0,8 to 32,8
+] |
+
+define m1 |
+[ box invis ht 16 wid 32 with .sw at 0,0
+line from 8,12 to 8,4
+line from 16,12 to 16,4
+line from 24,12 to 24,4
+line from 0,8 to 32,8
+line from 0,16 to 0,0
+] |
+
+define m2 |
+[ box invis ht 16 wid 32 with .sw at 0,0
+line from 0,8 to 32,8
+line from 0,16 to 0,0
+] |
+
+define m3 |
+[ box invis ht 16 wid 31 with .sw at 0,0
+line from 15,12 to 15,4
+line from 0,8 to 31,8
+line from 0,16 to 0,0
+] |
+
+box invis ht 212 wid 580 with .sw at 0,0
+"\f1\s10\&kernel memory pages\f1\s0" at 168,204
+"\f1\s10\&Legend:\f1\s0" at 36,144
+"\f1\s10\&cont \- continuation of previous page\f1\s0" at 28,112 ljust
+"\f1\s10\&free \- unused page\f1\s0" at 28,128 ljust
+"\f1\s10\&Usage:\f1\s0" at 34,87
+"\f1\s10\&memsize(addr)\f1\s0" at 36,71 ljust
+"\f1\s10\&char *addr;\f1\s0" at 66,56 ljust
+"\f1\s10\&{\f1\s0" at 36,43 ljust
+"\f1\s10\&return(kmemsizes[(addr \- kmembase) / \s-1PAGESIZE\s+1]);\f1\s0" at 66,29 ljust
+"\f1\s10\&}\f1\s0" at 36,8 ljust
+line from 548,192 to 548,176
+line from 548,184 to 580,184 dotted
+"\f1\s10\&1024,\f1\s0" at 116,168
+"\f1\s10\&256,\f1\s0" at 148,168
+"\f1\s10\&512,\f1\s0" at 180,168
+"\f1\s10\&3072,\f1\s0" at 212,168
+"\f1\s10\&cont,\f1\s0" at 276,168
+"\f1\s10\&cont,\f1\s0" at 244,168
+"\f1\s10\&128,\f1\s0" at 308,168
+"\f1\s10\&128,\f1\s0" at 340,168
+"\f1\s10\&free,\f1\s0" at 372,168
+"\f1\s10\&cont,\f1\s0" at 404,168
+"\f1\s10\&128,\f1\s0" at 436,168
+"\f1\s10\&1024,\f1\s0" at 468,168
+"\f1\s10\&free,\f1\s0" at 500,168
+"\f1\s10\&cont,\f1\s0" at 532,168
+"\f1\s10\&cont,\f1\s0" at 564,168
+m2 with .nw at 100,192
+m1 with .nw at 132,192
+m3 with .nw at 164,192
+m2 with .nw at 196,192
+m2 with .nw at 228,192
+m2 with .nw at 260,192
+m0 with .nw at 292,192
+m0 with .nw at 324,192
+m2 with .nw at 356,192
+m2 with .nw at 388,192
+m0 with .nw at 420,192
+m2 with .nw at 452,192
+m2 with .nw at 484,192
+m2 with .nw at 516,192
+"\f1\s10\&kmemsizes[] = {\f1\s0" at 100,168 rjust
+"\f1\s10\&char *kmembase\f1\s0" at 97,184 rjust
+.PE
diff --git a/share/doc/papers/kernmalloc/appendix.ms b/share/doc/papers/kernmalloc/appendix.ms
new file mode 100644
index 000000000000..058912700b9f
--- /dev/null
+++ b/share/doc/papers/kernmalloc/appendix.ms
@@ -0,0 +1,275 @@
+.\" $FreeBSD$
+.am vS
+..
+.am vE
+..
+'ss 23
+'ds _ \d\(mi\u
+'ps 9z
+'vs 10p
+'ds - \(mi
+'ds / \\h'\\w' 'u-\\w'/'u'/
+'ds /* \\h'\\w' 'u-\\w'/'u'/*
+'bd B 3
+'bd S B 3
+'nr cm 0
+'nf
+'de vH
+'ev 2
+'ft 1
+'sp .35i
+'tl '\s14\f3\\*(=F\fP\s0'\\*(=H'\f3\s14\\*(=F\fP\s0'
+'sp .25i
+'ft 1
+\f2\s12\h'\\n(.lu-\w'\\*(=f'u'\\*(=f\fP\s0\h'|0u'
+.sp .05i
+'ev
+'ds =G \\*(=F
+..
+'de vF
+'ev 2
+'sp .35i
+'ie o 'tl '\f2\\*(=M''Page % of \\*(=G\fP'
+'el 'tl '\f2Page % of \\*(=G''\\*(=M\fP'
+'bp
+'ev
+'ft 1
+'if \\n(cm=1 'ft 2
+..
+'de ()
+'pn 1
+..
+'de +C
+'nr cm 1
+'ft 2
+'ds +K
+'ds -K
+..
+'de -C
+'nr cm 0
+'ft 1
+'ds +K \f3
+'ds -K \fP
+..
+'+C
+'-C
+'am +C
+'ne 3
+..
+'de FN
+\f2\s14\h'\\n(.lu-\w'\\$1'u'\\$1\fP\s0\h'|0u'\c
+.if r x .if \\nx .if d =F .tm \\$1 \\*(=F \\n%
+'ds =f \&...\\$1
+..
+'de FC
+.if r x .if \\nx .if d =F .tm \\$1 \\*(=F \\n%
+'ds =f \&...\\$1
+..
+'de -F
+'rm =f
+..
+'ft 1
+'lg 0
+'-F
+.\" Copyright (c) 1988 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)appendix.t 5.1 (Berkeley) 4/16/91
+.\"
+.bp
+.H 1 "Appendix A - Implementation Details"
+.LP
+.nf
+.vS
+\fI\h'\w' 'u-\w'/'u'/\fP\fI*\fP\c\c
+'+C
+
+ \fI*\fP Constants for setting the parameters of the kernel memory allocator\&.
+ \fI*\fP
+ \fI*\fP 2 \fI*\fP\fI*\fP MINBUCKET is the smallest unit of memory that will be
+ \fI*\fP allocated\&. It must be at least large enough to hold a pointer\&.
+ \fI*\fP
+ \fI*\fP Units of memory less than or equal to MAXALLOCSAVE will permanently
+ \fI*\fP allocate physical memory; requests for these size pieces of memory
+ \fI*\fP are quite fast\&. Allocations greater than MAXALLOCSAVE must
+ \fI*\fP always allocate and free physical memory; requests for these size
+ \fI*\fP allocations should be done infrequently as they will be slow\&.
+ \fI*\fP Constraints: CLBYTES <= MAXALLOCSAVE <= 2 \fI*\fP\fI*\fP (MINBUCKET + 14)
+ \fI*\fP and MAXALLOCSAVE must be a power of two\&.
+ \fI*\fP\fI\h'\w' 'u-\w'/'u'/\fP\c
+'-C
+
+\*(+K#define\*(-K MINBUCKET\h'|31n'4\h'|51n'\fI\h'\w' 'u-\w'/'u'/\fP\fI*\fP\c\c
+'+C
+ 4 => min allocation of 16 bytes \fI*\fP\fI\h'\w' 'u-\w'/'u'/\fP\c
+'-C
+
+'FN MAXALLOCSAVE
+\*(+K#define\*(-K MAXALLOCSAVE\h'|31n'(2 \fI*\fP CLBYTES)
+
+\fI\h'\w' 'u-\w'/'u'/\fP\fI*\fP\c\c
+'+C
+
+ \fI*\fP Maximum amount of kernel dynamic memory\&.
+ \fI*\fP Constraints: must be a multiple of the pagesize\&.
+ \fI*\fP\fI\h'\w' 'u-\w'/'u'/\fP\c
+'-C
+
+'FN MAXKMEM
+\*(+K#define\*(-K MAXKMEM\h'|31n'(1024 \fI*\fP PAGESIZE)
+
+\fI\h'\w' 'u-\w'/'u'/\fP\fI*\fP\c\c
+'+C
+
+ \fI*\fP Arena for all kernel dynamic memory allocation\&.
+ \fI*\fP This arena is known to start on a page boundary\&.
+ \fI*\fP\fI\h'\w' 'u-\w'/'u'/\fP\c
+'-C
+
+\*(+Kextern\*(-K \*(+Kchar\*(-K kmembase[MAXKMEM];
+
+\fI\h'\w' 'u-\w'/'u'/\fP\fI*\fP\c\c
+'+C
+
+ \fI*\fP Array of descriptors that describe the contents of each page
+ \fI*\fP\fI\h'\w' 'u-\w'/'u'/\fP\c
+'-C
+
+\*(+Kstruct\*(-K kmemsizes \*(+K{\*(-K
+\h'|11n'\*(+Kshort\*(-K\h'|21n'ks\*_indx;\h'|41n'\fI\h'\w' 'u-\w'/'u'/\fP\fI*\fP\c\c
+'+C
+ bucket index, size of small allocations \fI*\fP\fI\h'\w' 'u-\w'/'u'/\fP\c
+'-C
+
+\h'|11n'u\*_short\h'|21n'ks\*_pagecnt;\h'|41n'\fI\h'\w' 'u-\w'/'u'/\fP\fI*\fP\c\c
+'+C
+ for large allocations, pages allocated \fI*\fP\fI\h'\w' 'u-\w'/'u'/\fP\c
+'-C
+
+\*(+K}\*(-K\c\c
+'-F
+ kmemsizes[MAXKMEM \fI\h'\w' 'u-\w'/'u'/\fP PAGESIZE];
+'FC MAXALLOCSAVE
+
+\fI\h'\w' 'u-\w'/'u'/\fP\fI*\fP\c\c
+'+C
+
+ \fI*\fP Set of buckets for each size of memory block that is retained
+ \fI*\fP\fI\h'\w' 'u-\w'/'u'/\fP\c
+'-C
+
+\*(+Kstruct\*(-K kmembuckets \*(+K{\*(-K
+\h'|11n'caddr\*_t kb\*_next;\h'|41n'\fI\h'\w' 'u-\w'/'u'/\fP\fI*\fP\c\c
+'+C
+ list of free blocks \fI*\fP\fI\h'\w' 'u-\w'/'u'/\fP\c
+'-C
+
+\*(+K}\*(-K\c\c
+'-F
+ bucket[MINBUCKET + 16];
+.bp
+\fI\h'\w' 'u-\w'/'u'/\fP\fI*\fP\c\c
+'+C
+
+ \fI*\fP Macro to convert a size to a bucket index\&. If the size is constant,
+ \fI*\fP this macro reduces to a compile time constant\&.
+ \fI*\fP\fI\h'\w' 'u-\w'/'u'/\fP\c
+'-C
+
+'FN MINALLOCSIZE
+\*(+K#define\*(-K MINALLOCSIZE\h'|31n'(1 << MINBUCKET)
+'FN BUCKETINDX
+\*(+K#define\*(-K BUCKETINDX(size) \e
+\h'|11n'(size) <= (MINALLOCSIZE \fI*\fP 128) \e
+\h'|21n'? (size) <= (MINALLOCSIZE \fI*\fP 8) \e
+\h'|31n'? (size) <= (MINALLOCSIZE \fI*\fP 2) \e
+\h'|41n'? (size) <= (MINALLOCSIZE \fI*\fP 1) \e
+\h'|51n'? (MINBUCKET + 0) \e
+\h'|51n': (MINBUCKET + 1) \e
+\h'|41n': (size) <= (MINALLOCSIZE \fI*\fP 4) \e
+\h'|51n'? (MINBUCKET + 2) \e
+\h'|51n': (MINBUCKET + 3) \e
+\h'|31n': (size) <= (MINALLOCSIZE\fI*\fP 32) \e
+\h'|41n'? (size) <= (MINALLOCSIZE \fI*\fP 16) \e
+\h'|51n'? (MINBUCKET + 4) \e
+\h'|51n': (MINBUCKET + 5) \e
+\h'|41n': (size) <= (MINALLOCSIZE \fI*\fP 64) \e
+\h'|51n'? (MINBUCKET + 6) \e
+\h'|51n': (MINBUCKET + 7) \e
+\h'|21n': (size) <= (MINALLOCSIZE \fI*\fP 2048) \e
+\h'|31n'\fI\h'\w' 'u-\w'/'u'/\fP\fI*\fP\c\c
+'+C
+ etc \&.\&.\&. \fI*\fP\fI\h'\w' 'u-\w'/'u'/\fP\c
+'-C
+
+
+\fI\h'\w' 'u-\w'/'u'/\fP\fI*\fP\c\c
+'+C
+
+ \fI*\fP Macro versions for the usual cases of malloc\fI\h'\w' 'u-\w'/'u'/\fPfree
+ \fI*\fP\fI\h'\w' 'u-\w'/'u'/\fP\c
+'-C
+
+'FN MALLOC
+\*(+K#define\*(-K MALLOC(space, cast, size, flags) \*(+K{\*(-K \e
+\h'|11n'\*(+Kregister\*(-K \*(+Kstruct\*(-K kmembuckets \fI*\fPkbp = &bucket[BUCKETINDX(size)]; \e
+\h'|11n'\*(+Klong\*(-K s = splimp(); \e
+\h'|11n'\*(+Kif\*(-K (kbp\*->kb\*_next == NULL) \*(+K{\*(-K \e
+\h'|21n'(space) = (cast)malloc(size, flags); \e
+\h'|11n'\*(+K}\*(-K \*(+Kelse\*(-K \*(+K{\*(-K \e
+\h'|21n'(space) = (cast)kbp\*->kb\*_next; \e
+\h'|21n'kbp\*->kb\*_next = \fI*\fP(caddr\*_t \fI*\fP)(space); \e
+\h'|11n'\*(+K}\*(-K \e
+\h'|11n'splx(s); \e
+\*(+K}\*(-K\c\c
+'-F
+
+'FC BUCKETINDX
+
+'FN FREE
+\*(+K#define\*(-K FREE(addr) \*(+K{\*(-K \e
+\h'|11n'\*(+Kregister\*(-K \*(+Kstruct\*(-K kmembuckets \fI*\fPkbp; \e
+\h'|11n'\*(+Kregister\*(-K \*(+Kstruct\*(-K kmemsizes \fI*\fPksp = \e
+\h'|21n'&kmemsizes[((addr) \*- kmembase) \fI\h'\w' 'u-\w'/'u'/\fP PAGESIZE]; \e
+\h'|11n'\*(+Klong\*(-K s = splimp(); \e
+\h'|11n'\*(+Kif\*(-K (1 << ksp\*->ks\*_indx > MAXALLOCSAVE) \*(+K{\*(-K \e
+\h'|21n'free(addr); \e
+\h'|11n'\*(+K}\*(-K \*(+Kelse\*(-K \*(+K{\*(-K \e
+\h'|21n'kbp = &bucket[ksp\*->ks\*_indx]; \e
+\h'|21n'\fI*\fP(caddr\*_t \fI*\fP)(addr) = kbp\*->kb\*_next; \e
+\h'|21n'kbp\*->kb\*_next = (caddr\*_t)(addr); \e
+\h'|11n'\*(+K}\*(-K \e
+\h'|11n'splx(s); \e
+\*(+K}\*(-K\c\c
+'-F
+
+'FC BUCKETINDX
+.vE
diff --git a/share/doc/papers/kernmalloc/appendix.t b/share/doc/papers/kernmalloc/appendix.t
new file mode 100644
index 000000000000..bcd3e8ce7ef7
--- /dev/null
+++ b/share/doc/papers/kernmalloc/appendix.t
@@ -0,0 +1,137 @@
+.\" Copyright (c) 1988 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)appendix.t 5.1 (Berkeley) 4/16/91
+.\"
+.bp
+.H 1 "Appendix A - Implementation Details"
+.LP
+.nf
+.vS
+/*
+ * Constants for setting the parameters of the kernel memory allocator.
+ *
+ * 2 ** MINBUCKET is the smallest unit of memory that will be
+ * allocated. It must be at least large enough to hold a pointer.
+ *
+ * Units of memory less than or equal to MAXALLOCSAVE will permanently
+ * allocate physical memory; requests for these size pieces of memory
+ * are quite fast. Allocations greater than MAXALLOCSAVE must
+ * always allocate and free physical memory; requests for these size
+ * allocations should be done infrequently as they will be slow.
+ * Constraints: CLBYTES <= MAXALLOCSAVE <= 2 ** (MINBUCKET + 14)
+ * and MAXALLOCSAVE must be a power of two.
+ */
+#define MINBUCKET 4 /* 4 => min allocation of 16 bytes */
+#define MAXALLOCSAVE (2 * CLBYTES)
+
+/*
+ * Maximum amount of kernel dynamic memory.
+ * Constraints: must be a multiple of the pagesize.
+ */
+#define MAXKMEM (1024 * PAGESIZE)
+
+/*
+ * Arena for all kernel dynamic memory allocation.
+ * This arena is known to start on a page boundary.
+ */
+extern char kmembase[MAXKMEM];
+
+/*
+ * Array of descriptors that describe the contents of each page
+ */
+struct kmemsizes {
+ short ks_indx; /* bucket index, size of small allocations */
+ u_short ks_pagecnt; /* for large allocations, pages allocated */
+} kmemsizes[MAXKMEM / PAGESIZE];
+
+/*
+ * Set of buckets for each size of memory block that is retained
+ */
+struct kmembuckets {
+ caddr_t kb_next; /* list of free blocks */
+} bucket[MINBUCKET + 16];
+.bp
+/*
+ * Macro to convert a size to a bucket index. If the size is constant,
+ * this macro reduces to a compile time constant.
+ */
+#define MINALLOCSIZE (1 << MINBUCKET)
+#define BUCKETINDX(size) \
+	(size) <= (MINALLOCSIZE * 128) \
+		? (size) <= (MINALLOCSIZE * 8) \
+			? (size) <= (MINALLOCSIZE * 2) \
+				? (size) <= (MINALLOCSIZE * 1) \
+					? (MINBUCKET + 0) \
+					: (MINBUCKET + 1) \
+				: (size) <= (MINALLOCSIZE * 4) \
+					? (MINBUCKET + 2) \
+					: (MINBUCKET + 3) \
+			: (size) <= (MINALLOCSIZE * 32) \
+				? (size) <= (MINALLOCSIZE * 16) \
+					? (MINBUCKET + 4) \
+					: (MINBUCKET + 5) \
+				: (size) <= (MINALLOCSIZE * 64) \
+					? (MINBUCKET + 6) \
+					: (MINBUCKET + 7) \
+		: (size) <= (MINALLOCSIZE * 2048) \
+	/* etc ... */
+
+/*
+ * Macro versions for the usual cases of malloc/free
+ */
+#define MALLOC(space, cast, size, flags) { \
+	register struct kmembuckets *kbp = &bucket[BUCKETINDX(size)]; \
+	long s = splimp(); \
+	if (kbp->kb_next == NULL) { \
+		(space) = (cast)malloc(size, flags); \
+	} else { \
+		(space) = (cast)kbp->kb_next; \
+		kbp->kb_next = *(caddr_t *)(space); \
+	} \
+	splx(s); \
+}
+
+#define FREE(addr) { \
+	register struct kmembuckets *kbp; \
+	register struct kmemsizes *ksp = \
+		&kmemsizes[((caddr_t)(addr) - kmembase) / PAGESIZE]; \
+	long s = splimp(); \
+	if ((1 << ksp->ks_indx) > MAXALLOCSAVE) { \
+		free(addr); \
+	} else { \
+		kbp = &bucket[ksp->ks_indx]; \
+		*(caddr_t *)(addr) = kbp->kb_next; \
+		kbp->kb_next = (caddr_t)(addr); \
+	} \
+	splx(s); \
+}
+.vE
diff --git a/share/doc/papers/kernmalloc/kernmalloc.t b/share/doc/papers/kernmalloc/kernmalloc.t
new file mode 100644
index 000000000000..d074c9ed48d4
--- /dev/null
+++ b/share/doc/papers/kernmalloc/kernmalloc.t
@@ -0,0 +1,653 @@
+.\" Copyright (c) 1988 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)kernmalloc.t 5.1 (Berkeley) 4/16/91
+.\" $FreeBSD$
+.\"
+.\" reference a system routine name
+.de RN
+\fI\\$1\fP\^(\h'1m/24u')\\$2
+..
+.\" reference a header name
+.de H
+.NH \\$1
+\\$2
+..
+.\" begin figure
+.\" .FI "title"
+.nr Fn 0 1
+.de FI
+.ds Lb Figure \\n+(Fn
+.ds Lt \\$1
+.KF
+.DS B
+.nf
+..
+.\"
+.\" end figure
+.de Fe
+.DE
+.ce
+\\*(Lb. \\*(Lt
+.sp
+.KE
+..
+.EQ
+delim $$
+.EN
+.ds CH "
+.pn 295
+.sp
+.rs
+.ps -1
+.sp -1
+.fi
+Reprinted from:
+\fIProceedings of the San Francisco USENIX Conference\fP,
+pp. 295-303, June 1988.
+.ps
+.\".sp |\n(HMu
+.rm CM
+.nr PO 1.25i
+.TL
+Design of a General Purpose Memory Allocator for the 4.3BSD UNIX\(dg Kernel
+.ds LF Summer USENIX '88
+.ds CF "%
+.ds RF San Francisco, June 20-24
+.EH 'Design of a General Purpose Memory ...''McKusick, Karels'
+.OH 'McKusick, Karels''Design of a General Purpose Memory ...'
+.FS
+\(dgUNIX is a registered trademark of AT&T in the US and other countries.
+.FE
+.AU
+Marshall Kirk McKusick
+.AU
+Michael J. Karels
+.AI
+Computer Systems Research Group
+Computer Science Division
+Department of Electrical Engineering and Computer Science
+University of California, Berkeley
+Berkeley, California 94720
+.AB
+The 4.3BSD UNIX kernel uses many memory allocation mechanisms,
+each designed for the particular needs of the utilizing subsystem.
+This paper describes a general purpose dynamic memory allocator
+that can be used by all of the kernel subsystems.
+The design of this allocator takes advantage of known memory usage
+patterns in the UNIX kernel and a hybrid strategy that is time-efficient
+for small allocations and space-efficient for large allocations.
+This allocator replaces the multiple memory allocation interfaces
+with a single easy-to-program interface,
+results in more efficient use of global memory by eliminating
+partitioned and specialized memory pools,
+and is quick enough that no performance loss is observed
+relative to the current implementations.
+The paper concludes with a discussion of our experience in using
+the new memory allocator,
+and directions for future work.
+.AE
+.LP
+.H 1 "Kernel Memory Allocation in 4.3BSD
+.PP
+The 4.3BSD kernel has at least ten different memory allocators.
+Some of them handle large blocks,
+some of them handle small chained data structures,
+and others include information to describe I/O operations.
+Often the allocations are for small pieces of memory that are only
+needed for the duration of a single system call.
+In a user process such short-term
+memory would be allocated on the run-time stack.
+Because the kernel has a limited run-time stack,
+it is not feasible to allocate even moderate blocks of memory on it.
+Consequently, such memory must be allocated through a more dynamic mechanism.
+For example,
+when the system must translate a pathname,
+it must allocate a one kilobyte buffer to hold the name.
+Other blocks of memory must be more persistent than a single system call
+and really have to be allocated from dynamic memory.
+Examples include protocol control blocks that remain throughout
+the duration of the network connection.
+.PP
+Demands for dynamic memory allocation in the kernel have increased
+as more services have been added.
+Each time a new type of memory allocation has been required,
+a specialized memory allocation scheme has been written to handle it.
+Often the new memory allocation scheme has been built on top
+of an older allocator.
+For example, the block device subsystem provides a crude form of
+memory allocation through the allocation of empty buffers [Thompson78].
+The allocation is slow because of the implied semantics of
+finding the oldest buffer, pushing its contents to disk if they are dirty,
+and moving physical memory into or out of the buffer to create
+the requested size.
+To reduce the overhead, a ``new'' memory allocator was built in 4.3BSD
+for name translation that allocates a pool of empty buffers.
+It keeps them on a free list so they can
+be quickly allocated and freed [McKusick85].
+.PP
+This memory allocation method has several drawbacks.
+First, the new allocator can only handle a limited range of sizes.
+Second, it depletes the buffer pool, as it diverts memory intended
+to buffer disk blocks to other purposes.
+Finally, it creates yet another interface of
+which the programmer must be aware.
+.PP
+A generalized memory allocator is needed to reduce the complexity
+of writing code inside the kernel.
+Rather than providing many semi-specialized ways of allocating memory,
+the kernel should provide a single general purpose allocator.
+With only a single interface,
+programmers do not need to figure
+out the most appropriate way to allocate memory.
+If a good general purpose allocator is available,
+it helps avoid the syndrome of creating yet another special
+purpose allocator.
+.PP
+To ease the task of understanding how to use it,
+the memory allocator should have an interface similar to the interface
+of the well-known memory allocator provided for
+applications programmers through the C library routines
+.RN malloc
+and
+.RN free .
+Like the C library interface,
+the allocation routine should take a parameter specifying the
+size of memory that is needed.
+The range of sizes for memory requests should not be constrained.
+The free routine should take a pointer to the storage being freed,
+and should not require additional information such as the size
+of the piece of memory being freed.
+.H 1 "Criteria for a Kernel Memory Allocator
+.PP
+The design specification for a kernel memory allocator is similar to,
+but not identical to,
+the design criteria for a user level memory allocator.
+The first criterion for a memory allocator is that it make good use
+of the physical memory.
+Good use of memory is measured by the amount of memory needed to hold
+a set of allocations at any point in time.
+Percentage utilization is expressed as:
+.ie t \{\
+.EQ
+utilization~=~requested over required
+.EN
+.\}
+.el \{\
+.sp
+.ce
+\fIutilization\fP=\fIrequested\fP/\fIrequired\fP
+.sp
+.\}
+Here, ``requested'' is the sum of the memory that has been requested
+and not yet freed.
+``Required'' is the amount of memory that has been
+allocated for the pool from which the requests are filled.
+An allocator requires more memory than requested because of fragmentation
+and a need to have a ready supply of free memory for future requests.
+A perfect memory allocator would have a utilization of 100%.
+In practice,
+having a 50% utilization is considered good [Korn85].
+.PP
+Good memory utilization in the kernel is more important than
+in user processes.
+Because user processes run in virtual memory,
+unused parts of their address space can be paged out.
+Thus pages in the process address space
+that are part of the ``required'' pool but are not
+being ``requested'' need not tie up physical memory.
+Because the kernel is not paged,
+all pages in the ``required'' pool are held by the kernel and
+cannot be used for other purposes.
+To keep the kernel utilization percentage as high as possible,
+it is desirable to release unused memory in the ``required'' pool
+rather than to hold it as is typically done with user processes.
+Because the kernel can directly manipulate its own page maps,
+releasing unused memory is fast;
+a user process must do a system call to release memory.
+.PP
+The most important criterion for a memory allocator is that it be fast.
+Because memory allocation is done frequently,
+a slow memory allocator will degrade the system performance.
+Speed of allocation is more critical when executing in the
+kernel than in user code,
+because the kernel must allocate many data structures that user
+processes can allocate cheaply on their run-time stack.
+In addition, the kernel represents the platform on which all user
+processes run,
+and if it is slow, it will degrade the performance of every process
+that is running.
+.PP
+Another problem with a slow memory allocator is that programmers
+of frequently-used kernel interfaces will feel that they
+cannot afford to use it as their primary memory allocator.
+Instead they will build their own memory allocator on top of the
+original by maintaining their own pool of memory blocks.
+Multiple allocators reduce the efficiency with which memory is used.
+The kernel ends up with many different free lists of memory
+instead of a single free list from which all allocations can be drawn.
+For example,
+consider the case of two subsystems that need memory.
+If they have their own free lists,
+the amount of memory tied up in the two lists will be the
+sum of the greatest amount of memory that each of
+the two subsystems has ever used.
+If they share a free list,
+the amount of memory tied up in the free list may be as low as the
+greatest amount of memory that either subsystem used.
+As the number of subsystems grows,
+the savings from having a single free list grow.
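+.PP
+The two-subsystem example can be checked with a small calculation.
+The following is a minimal sketch, not code from the kernel;
+the function names and the idea of sampling usage at fixed times are
+assumptions made for illustration.
+Separate free lists retain the sum of the per-subsystem peaks,
+while a shared list retains only the peak of the combined usage.

```c
#include <assert.h>
#include <stddef.h>

/* Peak of a single usage trace (memory in use at each sample time). */
static long
peak(const long *use, size_t n)
{
	long max = 0;
	size_t i;

	assert(use != NULL);
	for (i = 0; i < n; i++)
		if (use[i] > max)
			max = use[i];
	return (max);
}

/* Separate free lists: each list retains its own subsystem's peak. */
long
separate_lists(const long *a, const long *b, size_t n)
{
	return (peak(a, n) + peak(b, n));
}

/* Shared free list: the list retains the peak of the combined usage. */
long
shared_list(const long *a, const long *b, size_t n)
{
	long max = 0;
	size_t i;

	assert(a != NULL && b != NULL);
	for (i = 0; i < n; i++)
		if (a[i] + b[i] > max)
			max = a[i] + b[i];
	return (max);
}
```

+If one subsystem peaks at eight kilobytes while the other is idle,
+and the reverse happens later,
+the separate lists hold sixteen kilobytes while the shared list holds eight.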
+.H 1 "Existing User-level Implementations
+.PP
+There are many different algorithms and
+implementations of user-level memory allocators.
+A survey of those available on UNIX systems appeared in [Korn85].
+Nearly all of the memory allocators tested made good use of memory,
+though most of them were too slow for use in the kernel.
+The fastest memory allocator in the survey by nearly a factor of two
+was the memory allocator provided on 4.2BSD, originally
+written by Chris Kingsley at the California Institute of Technology.
+Unfortunately,
+the 4.2BSD memory allocator also wasted twice as much memory
+as its nearest competitor in the survey.
+.PP
+The 4.2BSD user-level memory allocator works by maintaining a set of lists
+that are ordered by increasing powers of two.
+Each list contains a set of memory blocks of its corresponding size.
+To fulfill a memory request,
+the size of the request is rounded up to the next power of two.
+A piece of memory is then removed from the list corresponding
+to the specified power of two and returned to the requester.
+Thus, a request for a block of memory of size 53 returns
+a block from the 64-sized list.
+A typical memory allocation requires a roundup calculation
+followed by a linked list removal.
+Only if the list is empty is a real memory allocation done.
+The free operation is also fast;
+the block of memory is put back onto the list from which it came.
+The correct list is identified by a size indicator stored
+immediately preceding the memory block.
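+.PP
+The roundup and list operations described above can be sketched as follows.
+This is an illustration under stated assumptions,
+not the 4.2BSD library source:
+the names, the smallest block size, and the number of lists are hypothetical,
+and the size indicator stored in front of each block is omitted,
+so the caller passes the list index explicitly.

```c
#include <assert.h>
#include <stddef.h>

#define MINSHIFT 4	/* hypothetical: smallest block is 16 bytes */
#define NLISTS 12	/* hypothetical: lists for 16 bytes to 32 kilobytes */

struct block {
	struct block *next;	/* link, valid only while the block is free */
};

static struct block *freelist[NLISTS];

/* Index of the list whose block size is the request rounded up to 2^k. */
int
list_index(size_t size)
{
	int idx = 0;
	size_t blksize = (size_t)1 << MINSHIFT;

	while (blksize < size) {
		blksize <<= 1;
		idx++;
	}
	assert(idx < NLISTS);
	return (idx);
}

/* Block size served by a given list. */
size_t
list_size(int idx)
{
	return ((size_t)1 << (MINSHIFT + idx));
}

/* Fast path of allocation: unlink the first block, or NULL if empty. */
void *
list_get(int idx)
{
	struct block *bp = freelist[idx];

	if (bp != NULL)
		freelist[idx] = bp->next;
	return (bp);
}

/* Free: put the block back at the head of the list it came from. */
void
list_put(int idx, void *p)
{
	struct block *bp = p;

	bp->next = freelist[idx];
	freelist[idx] = bp;
}
```

+A request of size 53 selects the list of 64-byte blocks,
+as in the example in the text.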
+.H 1 "Considerations Unique to a Kernel Allocator
+.PP
+There are several special conditions that arise when writing a
+memory allocator for the kernel that do not apply to a user process
+memory allocator.
+First, the maximum memory allocation can be determined at
+the time that the machine is booted.
+This number is never more than the amount of physical memory on the machine,
+and is typically much less since a machine with all its
+memory dedicated to the operating system is uninteresting to use.
+Thus, the kernel can statically allocate a set of data structures
+to manage its dynamically allocated memory.
+These data structures never need to be
+expanded to accommodate memory requests;
+yet, if properly designed, they need not be large.
+For a user process, the maximum amount of memory that may be allocated
+is a function of the maximum size of its virtual memory.
+Although it could allocate static data structures to manage
+its entire virtual memory,
+even if they were efficiently encoded they would potentially be huge.
+The other alternative is to allocate data structures as they are needed.
+However, that adds extra complications such as new
+failure modes if it cannot allocate space for additional
+structures and additional mechanisms to link them all together.
+.PP
+Another special condition of the kernel memory allocator is that it
+can control its own address space.
+Unlike user processes that can only grow and shrink their heap at one end,
+the kernel can keep an arena of kernel addresses and allocate
+pieces from that arena which it then populates with physical memory.
+The effect is much the same as a user process that has parts of
+its address space paged out when they are not in use,
+except that the kernel can explicitly control the set of pages
+allocated to its address space.
+The result is that the ``working set'' of pages in use by the
+kernel exactly corresponds to the set of pages that it is really using.
+.FI "One day memory usage on a Berkeley time-sharing machine"
+.so usage.tbl
+.Fe
+.PP
+A final special condition that applies to the kernel is that
+all of the different uses of dynamic memory are known in advance.
+Each one of these uses of dynamic memory can be assigned a type.
+For each type of dynamic memory that is allocated,
+the kernel can provide allocation limits.
+One reason given for having separate allocators is that
+no single allocator could starve the rest of the kernel of all
+its available memory and thus a single runaway
+client could not paralyze the system.
+By putting limits on each type of memory,
+the single general purpose memory allocator can provide the same
+protection against memory starvation.\(dg
+.FS
+\(dgOne might seriously ask what good it is if ``only''
+one subsystem within the kernel hangs, when that subsystem is something like the
+network on a diskless workstation.
+.FE
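+.PP
+A per-type limit of the kind described above might be enforced with
+a small accounting record for each type.
+The sketch below is an assumption about how such a check could look;
+the structure, its field names, and the two routines are hypothetical,
+and the text does not say how the type is recovered when memory is freed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-type accounting record; the names are illustrative. */
struct kmemtype {
	const char *kt_name;	/* for example "mbuf" or "namei" */
	long kt_inuse;		/* bytes currently allocated to this type */
	long kt_limit;		/* most bytes this type may hold at once */
};

/*
 * Charge an allocation of `size' bytes to a type.
 * Returns 1 if the request fits under the type's limit, 0 otherwise.
 */
int
type_charge(struct kmemtype *kt, long size)
{
	assert(kt != NULL && size >= 0);
	if (kt->kt_inuse + size > kt->kt_limit)
		return (0);	/* a runaway client is stopped here */
	kt->kt_inuse += size;
	return (1);
}

/* Credit a freed allocation back to its type. */
void
type_credit(struct kmemtype *kt, long size)
{
	assert(kt != NULL && size >= 0 && size <= kt->kt_inuse);
	kt->kt_inuse -= size;
}
```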
+.PP
+\*(Lb shows the memory usage of the kernel over a one day period
+on a general timesharing machine at Berkeley.
+The ``In Use'', ``Free'', and ``Mem Use'' fields are instantaneous values;
+the ``Requests'' field is the number of allocations since system startup;
+the ``High Use'' field is the maximum value of
+the ``Mem Use'' field since system startup.
+The figure demonstrates that most
+allocations are for small objects.
+Large allocations occur infrequently,
+and are typically for long-lived objects
+such as buffers to hold the superblock for
+a mounted file system.
+Thus, a memory allocator only needs to be
+fast for small pieces of memory.
+.H 1 "Implementation of the Kernel Memory Allocator
+.PP
+In reviewing the available memory allocators,
+we found that none of their strategies could be used without some modification.
+The kernel memory allocator that we ended up with is a hybrid
+of the fast memory allocator found in the 4.2BSD C library
+and a slower but more-memory-efficient first-fit allocator.
+.PP
+Small allocations are done using the 4.2BSD power-of-two list strategy;
+the typical allocation requires only a computation of
+the list to use and the removal of an element if it is available,
+so it is quite fast.
+Macros are provided to avoid the cost of a subroutine call.
+Only if the request cannot be fulfilled from a list is a call
+made to the allocator itself.
+To ensure that the allocator is always called for large requests,
+the lists corresponding to large allocations are always empty.
+Appendix A shows the data structures and implementation of the macros.
+.PP
+Similarly, freeing a block of memory can be done with a macro.
+The macro computes the list on which to place the request
+and puts it there.
+The free routine is called only if the block of memory is
+considered to be a large allocation.
+Including the cost of blocking out interrupts,
+the allocation and freeing macros generate respectively
+only nine and sixteen (simple) VAX instructions.
+.PP
+Because of the inefficiency of power-of-two allocation strategies
+for large allocations,
+a different strategy is used for allocations larger than two kilobytes.
+The selection of two kilobytes is derived from our statistics on
+the utilization of memory within the kernel,
+which showed that 95 to 98% of allocations are of size one kilobyte or less.
+A frequent caller of the memory allocator
+(the name translation function)
+always requests a one kilobyte block.
+Additionally the allocation method for large blocks is based on allocating
+pieces of memory in multiples of pages.
+Consequently the actual allocation sizes for requests of size
+$2~times~pagesize$ or less are identical under both strategies.\(dg
+.FS
+\(dgTo understand why this number is $size 8 {2~times~pagesize}$ one
+observes that the power-of-two algorithm yields sizes of 1, 2, 4, 8, \&...
+pages while the large block algorithm that allocates in multiples
+of pages yields sizes of 1, 2, 3, 4, \&... pages.
+Thus for allocations of sizes between one and two pages
+both algorithms use two pages;
+it is not until allocations of sizes between two and three pages
+that a difference emerges where the power-of-two algorithm will use
+four pages while the large block algorithm will use three pages.
+.FE
+In 4.3BSD on the VAX, the (software) page size is one kilobyte,
+so two kilobytes is the smallest logical cutoff.
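+.PP
+The choice of cutoff can be verified by computing the number of pages
+each strategy uses for a given request.
+The following sketch assumes the one kilobyte page size quoted above;
+the function names are hypothetical.

```c
#include <assert.h>

#define PAGESIZE 1024	/* software page size on the VAX, per the text */

/* Pages used when the request is rounded up to a power-of-two page count. */
long
pow2_pages(long size)
{
	long pages = 1;

	assert(size > 0);
	while (pages * PAGESIZE < size)
		pages *= 2;
	return (pages);
}

/* Pages used when the request is rounded up to a whole number of pages. */
long
multiple_pages(long size)
{
	assert(size > 0);
	return ((size + PAGESIZE - 1) / PAGESIZE);
}
```

+Both functions return two pages for any request between 1025 and 2048 bytes;
+at 2049 bytes the power-of-two strategy uses four pages
+and the page-multiple strategy uses three.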
+.PP
+Large allocations are first rounded up to be a multiple of the page size.
+The allocator then uses a first-fit algorithm to find space in the
+kernel address arena set aside for dynamic allocations.
+Thus a request for a five kilobyte piece of memory will use exactly
+five pages (five kilobytes) of memory rather than eight kilobytes as with
+the power-of-two allocation strategy.
+When a large piece of memory is freed,
+the memory pages are returned to the free memory pool,
+and the address space is returned to the kernel address arena
+where it is coalesced with adjacent free pieces.
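+.PP
+A first-fit search over the arena can be sketched with a per-page
+in-use map.
+This representation is a simplification chosen for the example
+and is not claimed to be the one used in the kernel:
+with a map, adjacent free pieces merge implicitly,
+whereas an allocator that keeps a list of free extents
+must coalesce them explicitly.
+The arena size and the function names are hypothetical.

```c
#include <assert.h>

#define NPAGES 16		/* hypothetical arena size, in pages */

static char inuse[NPAGES];	/* 1 if the page is part of an allocation */

/*
 * First fit: return the index of the first run of `npages' free pages
 * and mark it in use, or -1 if no run is long enough.
 */
int
arena_alloc(int npages)
{
	int start, run = 0, i;

	assert(npages > 0);
	for (i = 0; i < NPAGES; i++) {
		run = inuse[i] ? 0 : run + 1;
		if (run == npages) {
			start = i - npages + 1;
			for (i = start; i < start + npages; i++)
				inuse[i] = 1;
			return (start);
		}
	}
	return (-1);
}

/* Release a run of pages; adjacent free runs merge automatically. */
void
arena_free(int start, int npages)
{
	int i;

	assert(start >= 0 && npages > 0 && start + npages <= NPAGES);
	for (i = start; i < start + npages; i++)
		inuse[i] = 0;
}
```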
+.PP
+Another technique to improve both the efficiency of memory utilization
+and the speed of allocation
+is to cluster same-sized small allocations on a page.
+When a list for a power-of-two allocation is empty,
+a new page is allocated and divided into pieces of the needed size.
+This strategy speeds future allocations as several pieces of memory
+become available as a result of the call into the allocator.
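+.PP
+Dividing a fresh page into same-sized pieces might look like the sketch below.
+It assumes a one kilobyte page and a piece size that divides the page evenly;
+the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PAGESIZE 1024	/* one kilobyte pages, as in the text */

struct piece {
	struct piece *next;
};

/*
 * Divide one page into pieces of `size' bytes and chain them together.
 * Returns the head of the new free list; *countp receives its length.
 */
struct piece *
carve_page(char *page, size_t size, size_t *countp)
{
	struct piece *head = NULL;
	size_t off, n = 0;

	assert(size >= sizeof(struct piece) && PAGESIZE % size == 0);
	/* Walk backwards so the list ends up in increasing address order. */
	for (off = PAGESIZE; off >= size; off -= size) {
		struct piece *p = (struct piece *)(void *)(page + off - size);

		p->next = head;
		head = p;
		n++;
	}
	*countp = n;
	return (head);
}
```

+One call for 128-byte pieces yields eight list elements,
+so the next seven requests of that size avoid the allocator entirely.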
+.PP
+.FI "Calculation of allocation size"
+.so alloc.fig
+.Fe
+Because the size is not specified when a block of memory is freed,
+the allocator must keep track of the sizes of the pieces it has handed out.
+The 4.2BSD user-level allocator stores the size of each block
+in a header just before the allocation.
+However, this strategy doubles the memory requirement for allocations that
+require a power-of-two-sized block.
+Therefore,
+instead of storing the size of each piece of memory with the piece itself,
+the size information is associated with the memory page.
+\*(Lb shows how the kernel determines
+the size of a piece of memory that is being freed,
+by calculating the page in which it resides,
+and looking up the size associated with that page.
+Eliminating the cost of the overhead per piece improved utilization
+far more than expected.
+The reason is that many allocations in the kernel are for blocks of
+memory whose size is exactly a power of two.
+These requests would be nearly doubled if the user-level strategy were used.
+Now they can be accommodated with no wasted memory.
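+.PP
+The page-based size lookup can be sketched as follows.
+The arena, the per-page array, and the function names are stand-ins
+invented for the example;
+Appendix A shows the data structures the allocator itself uses.

```c
#include <assert.h>
#include <stddef.h>

#define PAGESIZE 1024		/* one kilobyte pages, as in the text */
#define NPAGES 8		/* hypothetical arena size, in pages */

char arena[NPAGES * PAGESIZE];		/* stands in for the kernel arena */
static short pagesize_log2[NPAGES];	/* per page: log2 of the piece size */

/* Record that a page has been divided into pieces of size 2^shift. */
void
page_setsize(int page, int shift)
{
	assert(page >= 0 && page < NPAGES);
	pagesize_log2[page] = (short)shift;
}

/* Recover the size of a piece from its address alone. */
size_t
piece_size(const void *addr)
{
	size_t page = (size_t)((const char *)addr - arena) / PAGESIZE;

	assert(page < NPAGES);
	return ((size_t)1 << pagesize_log2[page]);
}
```

+Because the size lives in the per-page array,
+a 128-byte request consumes exactly 128 bytes of the page,
+with no header in front of it.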
+.PP
+The allocator can be called both from the top half of the kernel,
+which is willing to wait for memory to become available,
+and from the interrupt routines in the bottom half of the kernel
+that cannot wait for memory to become available.
+Clients indicate their willingness (and ability) to wait with a flag
+to the allocation routine.
+For clients that are willing to wait,
+the allocator guarantees that their request will succeed.
+Thus, these clients need not check the return value from the allocator.
+If memory is unavailable and the client cannot wait,
+the allocator returns a null pointer.
+These clients must be prepared to cope with this
+(hopefully infrequent) condition
+(usually by giving up and hoping to do better later).
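+.PP
+The two calling conventions can be illustrated with a stand-in allocator.
+Everything in this sketch is hypothetical:
+the flag names, the routine names, and the single variable that
+models whether memory is available.
+It runs as an ordinary user program and only mimics the behavior
+described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define K_WAIT   0	/* hypothetical flag: caller may sleep for memory */
#define K_NOWAIT 1	/* hypothetical flag: caller cannot sleep */

static int memory_available = 1;	/* stand-in for the state of the pool */

/* Stand-in allocator: a K_NOWAIT request fails when the pool is empty. */
void *
kalloc(size_t size, int flags)
{
	assert(size > 0);
	if (!memory_available) {
		if (flags == K_NOWAIT)
			return (NULL);
		/* A real kernel would sleep here until memory is freed. */
		memory_available = 1;
	}
	return (malloc(size));
}

/* Interrupt-level client: must check for failure and give up for now. */
int
intr_client(void)
{
	void *p = kalloc(128, K_NOWAIT);

	if (p == NULL)
		return (-1);	/* drop the work and retry later */
	free(p);
	return (0);
}

/* Top-half client: the request is assumed to succeed, so no check. */
int
top_client(void)
{
	void *p = kalloc(128, K_WAIT);

	free(p);
	return (0);
}
```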
+.H 1 "Results of the Implementation
+.PP
+The new memory allocator was written about a year ago.
+Conversion from the old memory allocators to the new allocator
+has been going on ever since.
+Many of the special purpose allocators have been eliminated.
+This list includes
+.RN calloc ,
+.RN wmemall ,
+and
+.RN zmemall .
+Many of the special purpose memory allocators built on
+top of other allocators have also been eliminated.
+For example, the allocator that was built on top of the buffer pool allocator
+.RN geteblk
+to allocate pathname buffers in
+.RN namei
+has been eliminated.
+Because the typical allocation is so fast,
+we have found that none of the special purpose pools are needed.
+Indeed, the cost of allocation is about the same as the previous cost of
+allocating buffers from the network pool (\fImbuf\fP\^s).
+Consequently applications that used to allocate network
+buffers for their own uses have been switched over to using
+the general purpose allocator without increasing their running time.
+.PP
+Quantifying the performance of the allocator is difficult because
+it is hard to measure the amount of time spent allocating
+and freeing memory in the kernel.
+The usual approach is to compile a kernel for profiling
+and then compare the running time of the routines that
+implemented the old abstraction versus those that implement the new one.
+The old routines are difficult to quantify because
+individual routines were used for more than one purpose.
+For example, the
+.RN geteblk
+routine was used both to allocate one kilobyte memory blocks
+and for its intended purpose of providing buffers to the filesystem.
+Differentiating these uses is often difficult.
+To get a measure of the cost of memory allocation before
+putting in our new allocator,
+we summed up the running time of all the routines whose
+exclusive task was memory allocation.
+To this total we added the fraction
+of the running time of the multi-purpose routines that could
+clearly be identified as memory allocation usage.
+This number showed that approximately three percent of
+the time spent in the kernel could be accounted to memory allocation.
+.PP
+The new allocator is difficult to measure
+because the usual case of the memory allocator is implemented as a macro.
+Thus, its running time is a small fraction of the running time of the
+numerous routines in the kernel that use it.
+To get a bound on the cost,
+we changed the macro always to call the memory allocation routine.
+Running in this mode, the memory allocator accounted for six percent
+of the time spent in the kernel.
+Factoring out the cost of the statistics collection and the
+subroutine call overhead for the cases that could
+normally be handled by the macro,
+we estimate that the allocator would account for
+at most four percent of time in the kernel.
+These measurements show that the new allocator does not introduce
+significant new run-time costs.
+.PP
+The other major success has been in keeping the size information
+on a per-page basis.
+This technique allows the most frequently requested sizes to be
+allocated without waste.
+It also reduces the amount of bookkeeping information associated
+with the allocator to four kilobytes of information
+per megabyte of memory under management (with a one kilobyte page size).
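+.PP
+The bookkeeping figure follows from the size of the per-page descriptor.
+The sketch below uses a structure with the same shape as the one in
+Appendix A under hypothetical names,
+and assumes two-byte shorts, which gives a four-byte descriptor.

```c
#include <assert.h>
#include <stddef.h>

/* Per-page descriptor with the same shape as kmemsizes in Appendix A. */
struct pagedesc {
	short pd_indx;			/* bucket index for small allocations */
	unsigned short pd_pagecnt;	/* page count for large allocations */
};

/* Bytes of descriptors needed to manage `membytes' of memory. */
size_t
bookkeeping_bytes(size_t membytes, size_t pagesize)
{
	assert(pagesize > 0 && membytes % pagesize == 0);
	return ((membytes / pagesize) * sizeof(struct pagedesc));
}
```

+One megabyte divided into one kilobyte pages needs 1024 descriptors
+of four bytes each, or four kilobytes.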
+.H 1 "Future Work
+.PP
+Our next project is to convert many of the static
+kernel tables to be dynamically allocated.
+Static tables include the process table, the file table,
+and the mount table.
+Making these tables dynamic will have two benefits.
+First, it will reduce the amount of memory
+that must be statically allocated at boot time.
+Second, it will eliminate the arbitrary upper limit imposed
+by the current static sizing
+(although a limit will be retained to constrain runaway clients).
+Other researchers have already shown the memory savings
+achieved by this conversion [Rodriguez88].
+.PP
+Under the current implementation,
+memory is never moved from one size list to another.
+With the 4.2BSD memory allocator this causes problems,
+particularly for large allocations where a process may use
+a quarter megabyte piece of memory once,
+which is then never available for any other size request.
+In our hybrid scheme,
+memory can be shuffled between large requests so that large blocks
+of memory are never stranded as they are with the 4.2BSD allocator.
+However, pages allocated to small requests are allocated once
+to a particular size and never changed thereafter.
+If a burst of requests came in for a particular size,
+that size would acquire a large amount of memory
+that would then not be available for other future requests.
+.PP
+In practice, we do not find that the free lists become too large.
+However, we have been investigating ways to handle such problems
+if they occur in the future.
+Our current investigations involve a routine
+that can run as part of the idle loop that would sort the elements
+on each of the free lists into order of increasing address.
+Since any given page has only one size of elements allocated from it,
+the effect of the sorting would be to sort the list into distinct pages.
+When all the pieces of a page became free,
+the page itself could be released back to the free pool so that
+it could be allocated to another purpose.
+Although there is no guarantee that all the pieces of a page would ever
+be freed,
+most allocations are short-lived, lasting only for the duration of
+an open file descriptor, an open network connection, or a system call.
+As new allocations would be made from the page sorted to
+the front of the list,
+return of elements from pages at the back would eventually
+allow pages later in the list to be freed.
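+.PP
+The effect of sorting a free list can be sketched as follows.
+This is only an illustration of the idea under investigation,
+not an implementation of it:
+the free list is modeled as an array of arena offsets,
+each offset is assumed to appear once,
+and the function name is hypothetical.
+After sorting, a page can be released when as many consecutive
+entries fall within it as the page has pieces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PAGESIZE 1024	/* one kilobyte pages, as in the text */

/* Order two arena offsets by increasing address. */
static int
cmp_offset(const void *a, const void *b)
{
	size_t x = *(const size_t *)a, y = *(const size_t *)b;

	return ((x > y) - (x < y));
}

/*
 * Sort the free pieces of one size by address, then count the pages
 * whose pieces are all free.  `off' holds arena offsets of free pieces.
 */
size_t
releasable_pages(size_t *off, size_t n, size_t piecesize)
{
	size_t per_page, i, run = 0, pages = 0;

	assert(piecesize > 0 && piecesize <= PAGESIZE);
	per_page = PAGESIZE / piecesize;
	qsort(off, n, sizeof(off[0]), cmp_offset);
	for (i = 0; i < n; i++) {
		/* Extend the run while pieces stay within the same page. */
		if (i > 0 && off[i] / PAGESIZE == off[i - 1] / PAGESIZE)
			run++;
		else
			run = 1;
		if (run == per_page)
			pages++;
	}
	return (pages);
}
```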
+.PP
+Two of the traditional UNIX
+memory allocators remain in the current system.
+The terminal subsystem uses \fIclist\fP\^s (character lists).
+That part of the system is expected to undergo major revision within
+the next year or so, and it will probably be changed to use
+\fImbuf\fP\^s as it is merged into the network system.
+The other major allocator that remains is
+.RN getblk ,
+the routine that manages the filesystem buffer pool memory
+and associated control information.
+Only the filesystem uses
+.RN getblk
+in the current system;
+it manages the constant-sized buffer pool.
+We plan to merge the filesystem buffer cache into the virtual memory system's
+page cache in the future.
+This change will allow the size of the buffer pool to be changed
+according to memory load,
+but will require a policy for balancing memory needs
+with filesystem cache performance.
+.H 1 "Acknowledgments
+.PP
+In the spirit of community support,
+we have made various versions of our allocator available to our test sites.
+They have been busily burning it in and giving
+us feedback on their experiences.
+We acknowledge their invaluable input.
+The feedback from the Usenix program committee on the initial draft of
+our paper suggested numerous important improvements.
+.H 1 "References
+.LP
+.IP Korn85 \w'Rodriguez88\0\0'u
+David Korn, Kiem-Phong Vo,
+``In Search of a Better Malloc,''
+\fIProceedings of the Portland Usenix Conference\fP,
+pp. 489-506, June 1985.
+.IP McKusick85
+M. McKusick, M. Karels, S. Leffler,
+``Performance Improvements and Functional Enhancements in 4.3BSD,''
+\fIProceedings of the Portland Usenix Conference\fP,
+pp. 519-531, June 1985.
+.IP Rodriguez88
+Robert Rodriguez, Matt Koehler, Larry Palmer, Ricky Palmer,
+``A Dynamic UNIX Operating System,''
+\fIProceedings of the San Francisco Usenix Conference\fP,
+June 1988.
+.IP Thompson78
+Ken Thompson,
+``UNIX Implementation,''
+\fIBell System Technical Journal\fP, volume 57, number 6,
+pp. 1931-1946, 1978.
diff --git a/share/doc/papers/kernmalloc/spell.ok b/share/doc/papers/kernmalloc/spell.ok
new file mode 100644
index 000000000000..10c3ab7d8ed4
--- /dev/null
+++ b/share/doc/papers/kernmalloc/spell.ok
@@ -0,0 +1,57 @@
+BUCKETINDX
+CLBYTES
+CM
+Karels
+Kiem
+Koehler
+Korn
+Korn85
+MAXALLOCSAVE
+MAXALLOCSIZE
+MAXKMEM
+MINALLOCSIZE
+MINBUCKET
+Matt
+McKusick
+McKusick85
+Mem
+Phong
+Ricky
+Rodriguez88
+S.Leffler
+Thompson78
+ULTRIX
+Usenix
+VAX
+Vo
+arptbl
+caddr
+devbuf
+extern
+fragtbl
+freelist
+geteblk
+indx
+ioctlops
+kb
+kbp
+kmembase
+kmembuckets
+kmemsizes
+ks
+ksp
+mbuf
+mbufs
+namei
+pagecnt
+pathname
+pcb
+pp
+routetbl
+runtime
+splimp
+splx
+superblk
+temp
+wmemall
+zmemall
diff --git a/share/doc/papers/kernmalloc/usage.tbl b/share/doc/papers/kernmalloc/usage.tbl
new file mode 100644
index 000000000000..c5ebdfee0508
--- /dev/null
+++ b/share/doc/papers/kernmalloc/usage.tbl
@@ -0,0 +1,75 @@
+.\" Copyright (c) 1988 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)usage.tbl 5.1 (Berkeley) 4/16/91
+.\"
+.TS
+box;
+c s s s
+c c c c
+n n n n.
+Memory statistics by bucket size
+=
+Size In Use Free Requests
+_
+128 329 39 3129219
+256 0 0 0
+512 4 0 16
+1024 17 5 648771
+2048 13 0 13
+2049\-4096 0 0 157
+4097\-8192 2 0 103
+8193\-16384 0 0 0
+16385\-32768 1 0 1
+.TE
+.DE
+.DS B
+.TS
+box;
+c s s s s
+c c c c c
+c n n n n.
+Memory statistics by type
+=
+Type In Use Mem Use High Use Requests
+_
+mbuf 6 1K 17K 3099066
+devbuf 13 53K 53K 13
+socket 37 5K 6K 1275
+pcb 55 7K 8K 1512
+routetbl 229 29K 29K 2424
+fragtbl 0 0K 1K 404
+zombie 3 1K 1K 24538
+namei 0 0K 5K 648754
+ioctlops 0 0K 1K 12
+superblk 24 34K 34K 24
+temp 0 0K 8K 258
+.TE
diff --git a/share/doc/papers/kerntune/0.t b/share/doc/papers/kerntune/0.t
new file mode 100644
index 000000000000..90fa2bf3a934
--- /dev/null
+++ b/share/doc/papers/kerntune/0.t
@@ -0,0 +1,129 @@
+.\" Copyright (c) 1984 M. K. McKusick
+.\" Copyright (c) 1984 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)0.t 1.2 (Berkeley) 11/8/90
+.\"
+.EQ
+delim $$
+.EN
+.if n .ND
+.TL
+Using gprof to Tune the 4.2BSD Kernel
+.AU
+Marshall Kirk McKusick
+.AI
+Computer Systems Research Group
+Computer Science Division
+Department of Electrical Engineering and Computer Science
+University of California, Berkeley
+Berkeley, California 94720
+.AB
+This paper describes how the \fIgprof\fP profiler
+accounts for the running time of called routines
+in the running time of the routines that call them.
+It then explains how to configure a profiling kernel on
+the 4.2 Berkeley Software Distribution of
+.UX
+for the VAX\(dd
+.FS
+\(dd VAX is a trademark of Digital Equipment Corporation.
+.FE
+and discusses tradeoffs in techniques for collecting
+profile data.
+\fIGprof\fP identifies problems
+that severely affect the overall performance of the kernel.
+Once a potential problem area is identified,
+benchmark programs are devised to highlight the bottleneck.
+These benchmarks verify that the problem exists and provide
+a metric against which to validate proposed solutions.
+Two caches are added to the kernel to alleviate the bottleneck
+and \fIgprof\fP is used to validate their effectiveness.
+.AE
+.LP
+.de PT
+.lt \\n(LLu
+.pc %
+.nr PN \\n%
+.tl '\\*(LH'\\*(CH'\\*(RH'
+.lt \\n(.lu
+..
+.af PN i
+.ds LH 4.2BSD Performance
+.ds RH Contents
+.bp 1
+.if t .ds CF May 21, 1984
+.if t .ds LF
+.if t .ds RF McKusick
+.ce
+.B "TABLE OF CONTENTS"
+.LP
+.sp 1
+.nf
+.B "1. Introduction"
+.LP
+.sp .5v
+.nf
+.B "2. The \fIgprof\fP Profiler"
+\0.1. Data Presentation
+\0.1.1. The Flat Profile
+\0.1.2. The Call Graph Profile
+\0.2. Profiling the Kernel
+.LP
+.sp .5v
+.nf
+.B "3. Using \fIgprof\fP to Improve Performance"
+\0.1. Using the Profiler
+\0.2. An Example of Tuning
+.LP
+.sp .5v
+.nf
+.B "4. Conclusions"
+.LP
+.sp .5v
+.nf
+.B Acknowledgements
+.LP
+.sp .5v
+.nf
+.B References
+.af PN 1
+.bp 1
+.de _d
+.if t .ta .6i 2.1i 2.6i
+.\" 2.94 went to 2.6, 3.64 to 3.30
+.if n .ta .84i 2.6i 3.30i
+..
+.de _f
+.if t .ta .5i 1.25i 2.5i
+.\" 3.5i went to 3.8i
+.if n .ta .7i 1.75i 3.8i
+..
diff --git a/share/doc/papers/kerntune/1.t b/share/doc/papers/kerntune/1.t
new file mode 100644
index 000000000000..49b653f501f8
--- /dev/null
+++ b/share/doc/papers/kerntune/1.t
@@ -0,0 +1,49 @@
+.\" Copyright (c) 1984 M. K. McKusick
+.\" Copyright (c) 1984 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)1.t 1.2 (Berkeley) 11/8/90
+.\" $FreeBSD$
+.\"
+.ds RH Introduction
+.NH 1
+Introduction
+.PP
+The purpose of this paper is to describe the tools and techniques
+that are available for improving the performance of the kernel.
+The primary tool used to measure the kernel is the hierarchical
+profiler \fIgprof\fP.
+The profiler enables the user to measure the cost of
+the abstractions that the kernel provides to the user.
+Once the expensive abstractions are identified,
+optimizations are postulated to help improve their performance.
+These optimizations are each individually
+verified to ensure that they are producing a measurable improvement.
diff --git a/share/doc/papers/kerntune/2.t b/share/doc/papers/kerntune/2.t
new file mode 100644
index 000000000000..2857dc29ad5b
--- /dev/null
+++ b/share/doc/papers/kerntune/2.t
@@ -0,0 +1,234 @@
+.\" Copyright (c) 1984 M. K. McKusick
+.\" Copyright (c) 1984 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)2.t 1.3 (Berkeley) 11/8/90
+.\"
+.ds RH The \fIgprof\fP Profiler
+.NH 1
+The \fIgprof\fP Profiler
+.PP
+The purpose of the \fIgprof\fP profiling tool is to
+help the user evaluate alternative implementations
+of abstractions.
+The \fIgprof\fP design takes advantage of the fact that the kernel,
+though large, is structured and hierarchical.
+We provide a profile in which the execution time
+for a set of routines that implement an
+abstraction is collected and charged
+to that abstraction.
+The profile can be used to compare and assess the costs of
+various implementations [Graham82] [Graham83].
+.NH 2
+Data presentation
+.PP
+The data is presented to the user in two different formats.
+The first presentation simply lists the routines
+without regard to the amount of time their descendants use.
+The second presentation incorporates the call graph of the
+kernel.
+.NH 3
+The Flat Profile
+.PP
+The flat profile consists of a list of all the routines
+that are called during execution of the kernel,
+with the count of the number of times they are called
+and the number of seconds of execution time for which they
+are themselves accountable.
+The routines are listed in decreasing order of execution time.
+A list of the routines that are never called during execution of
+the kernel is also available
+to verify that nothing important is omitted by
+this profiling run.
+The flat profile gives a quick overview of the routines that are used,
+and shows the routines that are themselves responsible
+for large fractions of the execution time.
+In practice,
+this profile usually shows that no single function
+is overwhelmingly responsible for
+the total time of the kernel.
+Notice that for this profile,
+the individual times sum to the total execution time.
+.NH 3
+The Call Graph Profile
+.PP
+Ideally, we would like to print the call graph of the kernel,
+but we are limited by the two-dimensional nature of our output
+devices.
+We cannot assume that a call graph is planar,
+and even if it is, that we can print a planar version of it.
+Instead, we choose to list each routine,
+together with information about
+the routines that are its direct parents and children.
+This listing presents a window into the call graph.
+Based on our experience,
+both parent information and child information
+are important,
+and should be available without searching
+through the output.
+Figure 1 shows a sample \fIgprof\fP entry.
+.KF
+.DS L
+.TS
+box center;
+c c c c c l l
+c c c c c l l
+c c c c c l l
+l n n n c l l.
+ called/total \ \ parents
+index %time self descendants called+self name index
+ called/total \ \ children
+_
+ 0.20 1.20 4/10 \ \ \s-1CALLER1\s+1 [7]
+ 0.30 1.80 6/10 \ \ \s-1CALLER2\s+1 [1]
+[2] 41.5 0.50 3.00 10+4 \s-1EXAMPLE\s+1 [2]
+ 1.50 1.00 20/40 \ \ \s-1SUB1\s+1 <cycle1> [4]
+ 0.00 0.50 1/5 \ \ \s-1SUB2\s+1 [9]
+ 0.00 0.00 0/5 \ \ \s-1SUB3\s+1 [11]
+.TE
+.ce
+Figure 1. Profile entry for \s-1EXAMPLE\s+1.
+.DE
+.KE
+.PP
+The major entries of the call graph profile are the entries from the
+flat profile, augmented by the time propagated to each
+routine from its descendants.
+This profile is sorted by the sum of the time for the routine
+itself plus the time inherited from its descendants.
+The profile shows which of the higher level routines
+spend large portions of the total execution time
+in the routines that they call.
+For each routine, we show the amount of time passed by each child
+to the routine, which includes time for the child itself
+and for the descendants of the child
+(and thus the descendants of the routine).
+We also show the percentage these times represent of the total time
+accounted to the child.
+Similarly, the parents of each routine are listed,
+along with time,
+and percentage of total routine time,
+propagated to each one.
+.PP
+Cycles are handled as single entities.
+The cycle as a whole is shown as though it were a single routine,
+except that members of the cycle are listed in place of the children.
+Although the number of calls of each member
+from within the cycle is shown,
+these calls do not affect time propagation.
+When a child is a member of a cycle,
+the time shown is the appropriate fraction of the time
+for the whole cycle.
+Self-recursive routines have their calls broken
+down into calls from the outside and self-recursive calls.
+Only the outside calls affect the propagation of time.
+.PP
+The example shown in Figure 2 is the fragment of a call graph
+corresponding to the entry in the call graph profile listing
+shown in Figure 1.
+.KF
+.DS L
+.so fig2.pic
+.ce
+Figure 2. Example call graph fragment.
+.DE
+.KE
+.PP
+The entry is for routine \s-1EXAMPLE\s+1, which has
+the Caller routines as its parents,
+and the Sub routines as its children.
+The reader should keep in mind that all information
+is given \fIwith respect to \s-1EXAMPLE\s+1\fP.
+The index in the first column shows that \s-1EXAMPLE\s+1
+is the second entry in the profile listing.
+The \s-1EXAMPLE\s+1 routine is called ten times, four times by \s-1CALLER1\s+1,
+and six times by \s-1CALLER2\s+1.
+Consequently 40% of \s-1EXAMPLE\s+1's time is propagated to \s-1CALLER1\s+1,
+and 60% of \s-1EXAMPLE\s+1's time is propagated to \s-1CALLER2\s+1.
+The self and descendant fields of the parents
+show the amount of self and descendant time \s-1EXAMPLE\s+1
+propagates to them (but not the time used by
+the parents directly).
+Note that \s-1EXAMPLE\s+1 calls itself recursively four times.
+The routine \s-1EXAMPLE\s+1 calls routine \s-1SUB1\s+1 twenty times, \s-1SUB2\s+1 once,
+and never calls \s-1SUB3\s+1.
+Since \s-1SUB2\s+1 is called a total of five times,
+20% of its self and descendant time is propagated to \s-1EXAMPLE\s+1's
+descendant time field.
+Because \s-1SUB1\s+1 is a member of \fIcycle 1\fR,
+the self and descendant times
+and call count fraction
+are those for the cycle as a whole.
+Since cycle 1 is called a total of forty times
+(not counting calls among members of the cycle),
+it propagates 50% of the cycle's self and descendant
+time to \s-1EXAMPLE\s+1's descendant time field.
+Finally each name is followed by an index that shows
+where on the listing to find the entry for that routine.
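+.PP
+The propagation arithmetic behind Figure 1 can be checked with a short
+sketch.
+The Python fragment below is an illustration only and is not part of
+\fIgprof\fP; the function name \fIpropagate\fP is hypothetical, and the
+whole-routine totals used for \s-1SUB2\s+1 and cycle 1 are assumptions
+back-computed from the fractions shown in the figure.

```python
from fractions import Fraction


def propagate(self_time, desc_time, calls, total_calls):
    """Return the (self, descendant) time a callee passes to one caller.

    The callee's time is split in proportion to the number of
    non-recursive calls made by that caller.
    """
    share = Fraction(calls, total_calls)
    return (float(self_time * share), float(desc_time * share))


# EXAMPLE itself: 0.50 s self, 3.00 s descendants, 10 outside calls.
EXAMPLE_SELF, EXAMPLE_DESC = Fraction("0.50"), Fraction("3.00")
to_caller1 = propagate(EXAMPLE_SELF, EXAMPLE_DESC, 4, 10)
to_caller2 = propagate(EXAMPLE_SELF, EXAMPLE_DESC, 6, 10)

# Assumed totals, inferred from Figure 1 rather than stated in it:
# cycle 1 has 3.00 s self and 2.00 s descendants over 40 outside calls;
# SUB2 has 0.00 s self and 2.50 s descendants over 5 calls.
from_sub1 = propagate(Fraction("3.00"), Fraction("2.00"), 20, 40)
from_sub2 = propagate(Fraction("0.00"), Fraction("2.50"), 1, 5)
# SUB3 is never called by EXAMPLE, so its share is zero whatever its totals.
from_sub3 = propagate(Fraction(0), Fraction(0), 0, 5)

# The descendant time of EXAMPLE is everything its children pass up.
descendants = sum(s + d for s, d in (from_sub1, from_sub2, from_sub3))
```

+Under these assumptions the parent rows come out as 0.20/1.20 and
+0.30/1.80, and the times passed up by the children sum to the 3.00
+seconds shown in the descendants field of \s-1EXAMPLE\s+1.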
+.NH 2
+Profiling the Kernel
+.PP
+A 4.2BSD kernel that automatically collects profiling information
+as it operates can be built simply by specifying the
+.B \-p
+option to \fIconfig\fP\|(8) when configuring a kernel.
+The program counter sampling can be driven by the system clock,
+or by an alternate real time clock.
+The latter is highly recommended as use of the system clock results
+in statistical anomalies in accounting for
+the time spent in the kernel clock routine.
+.PP
+Once a profiling system has been booted, statistics gathering is
+handled by \fIkgmon\fP\|(8).
+\fIKgmon\fP allows profiling to be started and stopped
+and the internal state of the profiling buffers to be dumped.
+\fIKgmon\fP can also be used to reset the state of the internal
+buffers to allow multiple experiments to be run without
+rebooting the machine.
+The profiling data can then be processed with \fIgprof\fP\|(1)
+to obtain information regarding the system's operation.
+.PP
+A profiled system is about 5-10% larger in its text space because of
+the calls to count the subroutine invocations.
+When the system executes,
+the profiling data is stored in a buffer that is 1.2
+times the size of the text space.
+Because all the information is summarized in memory,
+it is not necessary to have a trace file
+being continuously dumped to disk.
+The overhead for running a profiled system varies;
+under normal load we see anywhere from 5% to 25%
+of the system time spent in the profiling code.
+Thus the system is noticeably slower than an unprofiled system,
+yet not so slow that it cannot be used in a production environment.
+This is important since it allows us to gather data
+in a real environment rather than trying to
+devise synthetic work loads.
diff --git a/share/doc/papers/kerntune/3.t b/share/doc/papers/kerntune/3.t
new file mode 100644
index 000000000000..e03236b4bac6
--- /dev/null
+++ b/share/doc/papers/kerntune/3.t
@@ -0,0 +1,290 @@
+.\" Copyright (c) 1984 M. K. McKusick
+.\" Copyright (c) 1984 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)3.t 1.2 (Berkeley) 11/8/90
+.\"
+.ds RH Techniques for Improving Performance
+.NH 1
+Techniques for Improving Performance
+.PP
+This section gives several hints on general optimization techniques.
+It then proceeds with an example of how they can be
+applied to the 4.2BSD kernel to improve its performance.
+.NH 2
+Using the Profiler
+.PP
+The profiler is a useful tool for improving
+a set of routines that implement an abstraction.
+It can be helpful in identifying poorly coded routines,
+and in evaluating the new algorithms and code that replace them.
+Taking full advantage of the profiler
+requires a careful examination of the call graph profile,
+and a thorough knowledge of the abstractions underlying
+the kernel.
+.PP
+The easiest optimization that can be performed
+is a small change
+to a control construct or data structure.
+An obvious starting point
+is to expand a small frequently called routine inline.
+The drawback to inline expansion is that the data abstractions
+in the kernel may become less parameterized,
+hence less clearly defined.
+The profiling will also become less useful since the loss of
+routines will make its output coarser-grained.
+.PP
+Further potential for optimization lies in routines that
+implement data abstractions whose total execution
+time is long.
+If the data abstraction function cannot easily be sped up,
+it may be advantageous to cache its results,
+and eliminate the need to rerun
+it for identical inputs.
+These and other ideas for program improvement are discussed in
+[Bentley81].
+.PP
+This tool is best used in an iterative approach:
+profiling the kernel,
+eliminating one bottleneck,
+then finding some other part of the kernel
+that begins to dominate execution time.
+.PP
+A completely different use of the profiler is to analyze the control
+flow of an unfamiliar section of the kernel.
+By running an example that exercises the unfamiliar section of the kernel,
+and then using \fIgprof\fR, you can get a view of the
+control structure of the unfamiliar section.
+.NH 2
+An Example of Tuning
+.PP
+The first step is to come up with a method for generating
+profile data.
+We prefer to run a profiling system for about a one day
+period on one of our general timesharing machines.
+While this is not as reproducible as a synthetic workload,
+it certainly represents a realistic test.
+We have run one day profiles on several
+occasions over a three month period.
+Despite the long period of time that elapsed
+between the test runs, the shapes of the profiles,
+as measured by the number of times each system call
+entry point was called, were remarkably similar.
+.PP
+A second alternative is to write a small benchmark
+program to repeatedly exercise a suspected bottleneck.
+While these benchmarks are not useful as a long term profile,
+they can give quick feedback on whether a hypothesized
+improvement is really having an effect.
+It is important to realize that the only real assurance
+that a change has a beneficial effect is through
+long term measurements of general timesharing.
+We have numerous examples where a benchmark program
+suggests vast improvements while the change
+in the long term system performance is negligible,
+and conversely examples in which the benchmark program runs more slowly,
+but the long term system performance improves significantly.
+.PP
+An investigation of our long term profiling showed that
+the single most expensive function performed by the kernel
+is path name translation.
+We find that our general time sharing systems do about
+500,000 name translations per day.
+The cost of doing name translation in the original 4.2BSD
+is 24.2 milliseconds per call,
+representing 40% of the time processing system calls,
+which is 19% of the total cycles in the kernel,
+or 11% of all cycles executed on the machine.
+The times are shown in Figure 3.
+.KF
+.DS L
+.TS
+center box;
+l r r.
+part time % of kernel
+_
+self 14.3 ms/call 11.3%
+child 9.9 ms/call 7.9%
+_
+total 24.2 ms/call 19.2%
+.TE
+.ce
+Figure 3. Call times for \fInamei\fP.
+.DE
+.KE
+.PP
+The system measurements collected showed the
+pathname translation routine, \fInamei\fP,
+was clearly worth optimizing.
+An inspection of \fInamei\fP shows that
+it consists of two nested loops.
+The outer loop is traversed once per pathname component.
+The inner loop performs a linear search through a directory looking
+for a particular pathname component.
+.PP
+Our first idea was to observe that many programs
+step through a directory performing an operation on
+each entry in turn.
+This caused us to modify \fInamei\fP to cache
+the directory offset of the last pathname
+component looked up by a process.
+The cached offset is then used
+as the point at which a search in the same directory
+begins. Changing directories invalidates the cache, as
+does modifying the directory.
+For programs that step sequentially through a directory with
+$N$ files, search time decreases from $O ( N sup 2 )$
+to $O(N)$.
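+.PP
+The effect of the cached offset on a sequential scan can be illustrated
+with a small model.
+The Python sketch below is not the kernel code; the names \fIlookup\fP
+and \fIscan_cost\fP, and the list that stands in for a directory, are
+assumptions made for illustration.
+It counts the directory entries examined when every name is looked up in
+order, with and without starting each search at the offset just past the
+previous hit.

```python
def lookup(directory, name, start=0):
    """Linear search of `directory` for `name`, beginning at `start`.

    The search wraps around to the beginning, so a stale starting
    offset costs time but never causes a miss.
    Returns (index, entries_examined); index is None if not found.
    """
    n = len(directory)
    for examined in range(1, n + 1):
        i = (start + examined - 1) % n
        if directory[i] == name:
            return i, examined
    return None, n


def scan_cost(directory, use_cache):
    """Total entries examined when looking up every name in order."""
    total = 0
    offset = 0  # hypothetical stand-in for the per-process cached offset
    for name in directory:
        index, examined = lookup(directory, name, offset if use_cache else 0)
        total += examined
        offset = index + 1  # next search starts just past the last hit
    return total


names = ["file%03d" % i for i in range(600)]
uncached = scan_cost(names, use_cache=False)
cached = scan_cost(names, use_cache=True)
```

+For a 600 entry directory the model examines 180300 entries without the
+cached offset and 600 with it, which is the quadratic to linear
+reduction described above.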
+.PP
+The cost of the cache is about 20 lines of code
+(about 0.2 kilobytes)
+and 16 bytes per process, with the cached data
+stored in a process's \fIuser\fP vector.
+.PP
+As a quick benchmark to verify the effectiveness of the
+cache we ran ``ls \-l''
+on a directory containing 600 files.
+Before the per-process cache this command
+used 22.3 seconds of system time.
+After adding the cache the program used the same amount
+of user time, but the system time dropped to 3.3 seconds.
+.PP
+This change prompted our rerunning a profiled system
+on a machine containing the new \fInamei\fP.
+The results showed that the time in \fInamei\fP
+dropped by only 2.6 ms/call and
+still accounted for 36% of the system call time,
+18% of the kernel, or about 10% of all the machine cycles.
+This amounted to a drop in system time from 57% to about 55%.
+The results are shown in Figure 4.
+.KF
+.DS L
+.TS
+center box;
+l r r.
+part time % of kernel
+_
+self 11.0 ms/call 9.2%
+child 10.6 ms/call 8.9%
+_
+total 21.6 ms/call 18.1%
+.TE
+.ce
+Figure 4. Call times for \fInamei\fP with per-process cache.
+.DE
+.KE
+.PP
+The small performance improvement
+was caused by a low cache hit ratio.
+Although the cache was 90% effective when hit,
+it was only usable on about 25% of the names being translated.
+An additional reason for the small improvement was that
+although the amount of time spent in \fInamei\fP itself
+decreased substantially,
+more time was spent in the routines that it called
+since each directory had to be accessed twice;
+once to search from the middle to the end,
+and once to search from the beginning to the middle.
+.PP
+Most missed names were caused by path name components
+other than the last.
+Thus Robert Elz introduced a system wide cache of most recent
+name translations.
+The cache is keyed on a name and the
+inode and device number of the directory that contains it.
+Associated with each entry is a pointer to the corresponding
+entry in the inode table.
+This has the effect of short circuiting the outer loop of \fInamei\fP.
+For each path name component,
+\fInamei\fP first looks in its cache of recent translations
+for the needed name.
+If it exists, the directory search can be completely eliminated.
+If the name is not recognized,
+then the per-process cache may still be useful in
+reducing the directory search time.
+The two caching schemes complement each other well.
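+.PP
+The lookup order described above can be sketched as follows.
+This Python model is an assumption-laden illustration, not the 4.2BSD
+implementation: the class \fINameCache\fP, the function
+\fIsearch_directory\fP, and the toy directory table are all hypothetical,
+and the per-process offset cache is left out.
+It shows only how keying on the device, the directory inode, and the
+component name lets a repeated translation skip the directory search.

```python
class NameCache:
    """Toy system-wide cache of recent name translations."""

    def __init__(self):
        self.entries = {}  # (device, dir_inode, name) -> inode number
        self.hits = 0
        self.misses = 0

    def translate(self, device, dir_inode, name, search_directory):
        """Return the inode for `name`, searching only on a cache miss."""
        key = (device, dir_inode, name)
        if key in self.entries:
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        inode = search_directory(device, dir_inode, name)
        if inode is not None:
            self.entries[key] = inode
        return inode


# Hypothetical file system: directory inode -> {component name: inode}.
DIRECTORIES = {2: {"usr": 10}, 10: {"include": 20}, 20: {"stdio.h": 30}}


def search_directory(device, dir_inode, name):
    """Stand-in for the linear directory search done on a miss."""
    return DIRECTORIES.get(dir_inode, {}).get(name)


def namei(cache, path, device=0, root=2):
    """Translate `path` one component at a time (the outer loop)."""
    inode = root
    for component in path.strip("/").split("/"):
        inode = cache.translate(device, inode, component, search_directory)
        if inode is None:
            return None
    return inode


cache = NameCache()
first = namei(cache, "/usr/include/stdio.h")
second = namei(cache, "/usr/include/stdio.h")
```

+In this model the first translation misses on all three components and
+the second hits on all three.
+A real cache would also need a bounded size and a way to discard entries
+when a directory changes; the sketch omits both.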
+.PP
+The cost of the name cache is about 200 lines of code
+(about 1.2 kilobytes)
+and 44 bytes per cache entry.
+Depending on the size of the system,
+about 200 to 1000 entries will normally be configured,
+using 10-44 kilobytes of physical memory.
+The name cache is resident in memory at all times.
+.PP
+After adding the system wide name cache we reran ``ls \-l''
+on the same directory.
+The user time remained the same;
+however, the system time rose slightly to 3.7 seconds.
+This was not surprising as \fInamei\fP
+now had to maintain the cache,
+but was never able to make any use of it.
+.PP
+Another profiled system was created and measurements
+were collected over a one day period. These measurements
+showed a 6 ms/call decrease in \fInamei\fP, with
+\fInamei\fP accounting for only 31% of the system call time,
+16% of the time in the kernel,
+or about 7% of all the machine cycles.
+System time dropped from 55% to about 49%.
+The results are shown in Figure 5.
+.KF
+.DS L
+.TS
+center box;
+l r r.
+part time % of kernel
+_
+self 9.5 ms/call 9.6%
+child 6.1 ms/call 6.1%
+_
+total 15.6 ms/call 15.7%
+.TE
+.ce
+Figure 5. Call times for \fInamei\fP with both caches.
+.DE
+.KE
+.PP
+Statistics on the performance of both caches show
+the large performance improvement is
+caused by the high hit ratio.
+On the profiled system a 60% hit rate was observed in
+the system wide cache. This, coupled with the 25%
+hit rate in the per-process offset cache yielded an
+effective cache hit rate of 85%.
+While the system wide cache reduces the amount of time both in
+the routines that \fInamei\fP calls and in \fInamei\fP itself
+(since fewer directories need to be accessed or searched),
+it is interesting to note that the actual percentage of system
+time spent in \fInamei\fP itself increases even though the
+actual time per call decreases.
+This is because less total time is being spent in the kernel,
+hence a smaller absolute time becomes a larger total percentage.
diff --git a/share/doc/papers/kerntune/4.t b/share/doc/papers/kerntune/4.t
new file mode 100644
index 000000000000..38bae438ae85
--- /dev/null
+++ b/share/doc/papers/kerntune/4.t
@@ -0,0 +1,99 @@
+.\" Copyright (c) 1984 M. K. McKusick
+.\" Copyright (c) 1984 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)4.t 1.2 (Berkeley) 11/8/90
+.\"
+.ds RH Conclusions
+.NH 1
+Conclusions
+.PP
+We have created a profiler that aids in the evaluation
+of the kernel.
+For each routine in the kernel,
+the profile shows the extent to which that routine
+helps support various abstractions,
+and how that routine uses other abstractions.
+The profile assesses the cost of routines
+at all levels of the kernel decomposition.
+The profiler is easily used,
+and can be compiled into the kernel.
+It adds only five to thirty percent execution overhead to the kernel
+being profiled,
+produces no additional output while the kernel is running
+and allows the kernel to be measured in its real environment.
+Kernel profiles can be used to identify bottlenecks in performance.
+We have shown how to improve performance
+by caching recently calculated name translations.
+The combined caches added to the name translation process
+reduce the average cost of translating a pathname to an inode by 35%.
+These changes reduce the percentage of time spent running
+in the system by nearly 9%.
+.nr H2 1
+.ds RH Acknowledgements
+.NH
+\s+2Acknowledgements\s0
+.PP
+I would like to thank Robert Elz for sharing his ideas and
+his code for caching system-wide names.
+Thanks also to all the users at Berkeley who provided all the
+input to generate the kernel profiles.
+This work was supported by
+the Defense Advanced Research Projects Agency (DoD) under
+ARPA Order No. 4031 monitored by Naval Electronic Systems Command under
+Contract No. N00039-82-C-0235.
+.ds RH References
+.nr H2 1
+.sp 2
+.NH
+\s+2References\s-2
+.LP
+.IP [Bentley81] 20
+Bentley, J. L.,
+``Writing Efficient Code'',
+Department of Computer Science,
+Carnegie-Mellon University,
+Pittsburgh, Pennsylvania,
+CMU-CS-81-116, 1981.
+.IP [Graham82] 20
+Graham, S., Kessler, P., McKusick, M.,
+``gprof: A Call Graph Execution Profiler'',
+Proceedings of the SIGPLAN '82 Symposium on Compiler Construction,
+Volume 17, Number 6, June 1982. pp. 120-126.
+.IP [Graham83] 20
+Graham, S., Kessler, P., McKusick, M.,
+``An Execution Profiler for Modular Programs'',
+Software - Practice and Experience,
+Volume 13, 1983. pp. 671-685.
+.IP [Ritchie74] 20
+Ritchie, D. M. and Thompson, K.,
+``The UNIX Time-Sharing System'',
+CACM 17, 7. July 1974. pp. 365-375.
diff --git a/share/doc/papers/kerntune/Makefile b/share/doc/papers/kerntune/Makefile
new file mode 100644
index 000000000000..33416d6e3c1b
--- /dev/null
+++ b/share/doc/papers/kerntune/Makefile
@@ -0,0 +1,14 @@
+# From: @(#)Makefile 1.5 (Berkeley) 6/8/93
+# $FreeBSD$
+
+VOLUME= papers
+DOC= kerntune
+SRCS= 0.t 1.t 2.t 3.t 4.t
+EXTRA= fig2.pic
+MACROS= -ms
+USE_EQN=
+USE_PIC=
+USE_SOELIM=
+USE_TBL=
+
+.include <bsd.doc.mk>
diff --git a/share/doc/papers/kerntune/fig2.pic b/share/doc/papers/kerntune/fig2.pic
new file mode 100644
index 000000000000..6731ca99f972
--- /dev/null
+++ b/share/doc/papers/kerntune/fig2.pic
@@ -0,0 +1,57 @@
+.\" Copyright (c) 1987 M. K. McKusick
+.\" Copyright (c) 1987 The Regents of the University of California.
+.\" All rights reserved.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\" 3. All advertising materials mentioning features or use of this software
+.\" must display the following acknowledgement:
+.\" This product includes software developed by the University of
+.\" California, Berkeley and its contributors.
+.\" 4. Neither the name of the University nor the names of its contributors
+.\" may be used to endorse or promote products derived from this software
+.\" without specific prior written permission.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\"
+.\" @(#)fig2.pic 1.2 (Berkeley) 11/8/90
+.\"
+.PS
+ellipse ht .3i wid .75i "\s-1CALLER1\s+1"
+ellipse ht .3i wid .75i "\s-1CALLER2\s+1" at 1st ellipse + (2i,0i)
+ellipse ht .3i wid .8i "\s-1EXAMPLE\s+1" at 1st ellipse + (1i,-.5i)
+ellipse ht .3i wid .5i "\s-1SUB1\s+1" at 1st ellipse - (0i,1i)
+ellipse ht .3i wid .5i "\s-1SUB2\s+1" at 3rd ellipse - (0i,.5i)
+ellipse ht .3i wid .5i "\s-1SUB3\s+1" at 2nd ellipse - (0i,1i)
+line <- from 1st ellipse up .5i left .5i chop .1875i
+line <- from 1st ellipse up .5i right .5i chop .1875i
+line <- from 2nd ellipse up .5i left .5i chop .1875i
+line <- from 2nd ellipse up .5i right .5i chop .1875i
+arrow from 1st ellipse to 3rd ellipse chop
+arrow from 2nd ellipse to 3rd ellipse chop
+arrow from 3rd ellipse to 4th ellipse chop
+arrow from 3rd ellipse to 5th ellipse chop .15i chop .15i
+arrow from 3rd ellipse to 6th ellipse chop
+arrow from 4th ellipse down .5i left .5i chop .1875i
+arrow from 4th ellipse down .5i right .5i chop .1875i
+arrow from 5th ellipse down .5i left .5i chop .1875i
+arrow from 5th ellipse down .5i right .5i chop .1875i
+arrow from 6th ellipse down .5i left .5i chop .1875i
+arrow from 6th ellipse down .5i right .5i chop .1875i
+.PE
diff --git a/share/doc/papers/malloc/Makefile b/share/doc/papers/malloc/Makefile
new file mode 100644
index 000000000000..00e1e3d87a3e
--- /dev/null
+++ b/share/doc/papers/malloc/Makefile
@@ -0,0 +1,10 @@
+# From: @(#)Makefile 6.3 (Berkeley) 6/8/93
+# $FreeBSD$
+
+VOLUME= papers
+DOC= malloc
+SRCS= abs.ms intro.ms kernel.ms malloc.ms problems.ms alternatives.ms \
+ performance.ms implementation.ms conclusion.ms
+MACROS= -ms
+
+.include <bsd.doc.mk>
diff --git a/share/doc/papers/malloc/abs.ms b/share/doc/papers/malloc/abs.ms
new file mode 100644
index 000000000000..f58d719f4c28
--- /dev/null
+++ b/share/doc/papers/malloc/abs.ms
@@ -0,0 +1,35 @@
+.\"
+.\" ----------------------------------------------------------------------------
+.\" "THE BEER-WARE LICENSE" (Revision 42):
+.\" <phk@FreeBSD.org> wrote this file. As long as you retain this notice you
+.\" can do whatever you want with this stuff. If we meet some day, and you think
+.\" this stuff is worth it, you can buy me a beer in return. Poul-Henning Kamp
+.\" ----------------------------------------------------------------------------
+.\"
+.\" $FreeBSD$
+.\"
+.if n .ND
+.TL
+Malloc(3) in modern Virtual Memory environments.
+.sp
+Revised
+Fri Apr 5 12:50:07 1996
+.AU
+Poul-Henning Kamp
+.AI
+<phk@FreeBSD.org>
+Den Andensidste Viking
+Valbygaardsvej 8
+DK-4200 Slagelse
+Denmark
+.AB
+Malloc/free is one of the oldest parts of the C language environment
+and obviously the world has changed a bit since it was first made.
+The fact that most UNIX kernels have changed from swap/segment to
+virtual memory/page based memory management has not been sufficiently
+reflected in the implementations of the malloc/free API.
+.PP
+A new implementation was designed, written, tested and benchmarked
+with an eye on the workings and performance characteristics of modern
+Virtual Memory systems.
+.AE
diff --git a/share/doc/papers/malloc/alternatives.ms b/share/doc/papers/malloc/alternatives.ms
new file mode 100644
index 000000000000..5a46f9520984
--- /dev/null
+++ b/share/doc/papers/malloc/alternatives.ms
@@ -0,0 +1,45 @@
+.\"
+.\" ----------------------------------------------------------------------------
+.\" "THE BEER-WARE LICENSE" (Revision 42):
+.\" <phk@FreeBSD.org> wrote this file. As long as you retain this notice you
+.\" can do whatever you want with this stuff. If we meet some day, and you think
+.\" this stuff is worth it, you can buy me a beer in return. Poul-Henning Kamp
+.\" ----------------------------------------------------------------------------
+.\"
+.\" $FreeBSD$
+.\"
+.ds RH Alternative implementations
+.NH
+Alternative implementations
+.PP
+These problems were actually the inspiration for the first alternative
+malloc implementations.
+Since their main aim was debugging, they would often use techniques
+like allocating a guard zone before and after the chunk,
+and possibly filling these guard zones
+with some pattern, so accesses outside the allocated chunk could be detected
+with some decent probability.
+Another widely used technique is to use tables to keep track of which
+chunks are actually in which state and so on.
+.PP
+This class of debugging has been taken to its practical extreme by
+the product "Purify" which does the entire memory-coloring exercise
+and not only keeps track of what is in use and what isn't, but also
+detects if the first reference is a read (which would return undefined
+values) and other such violations.
+.PP
+Later, actual complete implementations of malloc arrived, but many of
+these still based their workings on the basic schema mentioned previously,
+disregarding that in the meantime virtual memory and paging have
+become the standard environment.
+.PP
+The most widely used "alternative" malloc is undoubtedly ``gnumalloc''
+which has received wide acclaim and certainly runs faster than
+most stock mallocs. It does, however, tend to fare badly in
+cases where paging is the norm rather than the exception.
+.PP
+The particular malloc that prompted this work basically didn't bother
+reusing storage until the kernel forced it to do so by refusing
+further allocations with sbrk(2).
+That may make sense if you work alone on your own personal mainframe,
+but as a general policy it is less than optimal.
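Note on alternatives.ms: the guard-zone technique described in that file (a zone before and after the chunk, filled with a pattern so that stray writes can be detected) can be sketched roughly as follows. This is only an illustration built on top of the standard malloc(3); the names guarded_malloc, guarded_check and guarded_free, the 16-byte zone size and the 0xA5 fill byte are all assumptions made for the example and are not taken from any of the debugging mallocs the paper refers to.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HDR_SIZE   16     /* room for the saved request size (assumed) */
#define GUARD_SIZE 16     /* bytes of guard zone on each side (assumed) */
#define GUARD_BYTE 0xA5   /* fill pattern for the guard zones (assumed) */

/*
 * Layout of one allocation:
 *   [ header | guard zone | user data | guard zone ]
 * The returned pointer refers to the start of the user data.
 */
void *guarded_malloc(size_t size)
{
    unsigned char *raw;

    if (size > SIZE_MAX - (HDR_SIZE + 2 * GUARD_SIZE))
        return NULL;                      /* request would overflow */
    raw = malloc(HDR_SIZE + GUARD_SIZE + size + GUARD_SIZE);
    if (raw == NULL)
        return NULL;
    memcpy(raw, &size, sizeof size);      /* remember the request size */
    memset(raw + HDR_SIZE, GUARD_BYTE, GUARD_SIZE);
    memset(raw + HDR_SIZE + GUARD_SIZE + size, GUARD_BYTE, GUARD_SIZE);
    return raw + HDR_SIZE + GUARD_SIZE;
}

/* Returns 1 if both guard zones still hold the pattern, 0 otherwise. */
int guarded_check(const void *ptr)
{
    const unsigned char *user = ptr;
    const unsigned char *raw = user - GUARD_SIZE - HDR_SIZE;
    size_t size, i;

    memcpy(&size, raw, sizeof size);
    for (i = 0; i < GUARD_SIZE; i++) {
        if (raw[HDR_SIZE + i] != GUARD_BYTE)
            return 0;                     /* write before the chunk */
        if (user[size + i] != GUARD_BYTE)
            return 0;                     /* write after the chunk */
    }
    return 1;
}

/* Releases the whole block, including header and guard zones. */
void guarded_free(void *ptr)
{
    if (ptr != NULL)
        free((unsigned char *)ptr - GUARD_SIZE - HDR_SIZE);
}
```

As the paper says, detection is only probabilistic: a stray write that happens to store the fill byte, or that lands beyond the guard zone, goes unnoticed.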
diff --git a/share/doc/papers/malloc/conclusion.ms b/share/doc/papers/malloc/conclusion.ms
new file mode 100644
index 000000000000..da7d7e98bdb5
--- /dev/null
+++ b/share/doc/papers/malloc/conclusion.ms
@@ -0,0 +1,48 @@
+.\"
+.\" ----------------------------------------------------------------------------
+.\" "THE BEER-WARE LICENSE" (Revision 42):
+.\" <phk@FreeBSD.org> wrote this file. As long as you retain this notice you
+.\" can do whatever you want with this stuff. If we meet some day, and you think
+.\" this stuff is worth it, you can buy me a beer in return. Poul-Henning Kamp
+.\" ----------------------------------------------------------------------------
+.\"
+.\" $FreeBSD$
+.\"
+.ds RH Conclusion and experience.
+.NH
+Conclusion and experience.
+.PP
+In general the performance differences between gnumalloc and this
+malloc are not that big.
+The major difference comes when primary storage is seriously
+over-committed, in which case gnumalloc
+wastes time paging in pages it's not going to use.
+In such cases a difference of as much as a factor of five in wall-clock
+time has been seen.
+Apart from that, gnumalloc and this implementation are pretty
+much on par performance-wise.
+.PP
+Several legacy programs in the 4.4BSD-Lite distribution had
+code that depended on the memory returned from malloc
+being zeroed. In a couple of cases, free(3) was called more than
+once for the same allocation, and a few cases even called free(3)
+with pointers to objects in the data section or on the stack.
+.PP
+A couple of users have reported that using this malloc on other
+platforms yielded "pretty impressive results", but no hard benchmarks
+have been made.
+.ds RH Acknowledgements & references.
+.NH
+Acknowledgements & references.
+.PP
+The first implementation of this algorithm was actually a file system,
+done in assembler using 5-hole ``Baudot'' paper tape for a drum storage
+device attached to a 20-bit germanium transistor computer with 2000 words
+of memory, but that was many years ago.
+.PP
+Peter Wemm <peter@FreeBSD.org> came up with the idea to store the
+page-directory in mmap(2)'ed memory instead of in the heap.
+This has proven to be a good move.
+.PP
+Lars Fredriksen <fredriks@mcs.com> found and identified a
+fence-post bug in the code.
diff --git a/share/doc/papers/malloc/implementation.ms b/share/doc/papers/malloc/implementation.ms
new file mode 100644
index 000000000000..2507e4cb1b77
--- /dev/null
+++ b/share/doc/papers/malloc/implementation.ms
@@ -0,0 +1,225 @@
+.\"
+.\" ----------------------------------------------------------------------------
+.\" "THE BEER-WARE LICENSE" (Revision 42):
+.\" <phk@FreeBSD.org> wrote this file. As long as you retain this notice you
+.\" can do whatever you want with this stuff. If we meet some day, and you think
+.\" this stuff is worth it, you can buy me a beer in return. Poul-Henning Kamp
+.\" ----------------------------------------------------------------------------
+.\"
+.\" $FreeBSD$
+.\"
+.ds RH Implementation
+.NH
+Implementation
+.PP
+A new malloc(3) implementation was written to meet the goals,
+and to the extent possible to address the shortcomings listed previously.
+.PP
+The source is 1218 lines of C code, and can be found in FreeBSD 2.2
+(and probably later versions as well) as src/lib/libc/stdlib/malloc.c.
+.PP
+The main data structure is the
+.I page-directory
+which contains a
+.B void*
+for each page we have control over.
+The value can be one of:
+.IP
+.B MALLOC_NOT_MINE
+Another part of the code may call brk(2) to get a piece of the cake.
+Consequently, we cannot rely on the memory we get from the kernel
+being one consecutive piece of memory, and therefore we need a way to
+mark such pages as "untouchable".
+.IP
+.B MALLOC_FREE
+This is a free page.
+.IP
+.B MALLOC_FIRST
+This is the first page in a (multi-)page allocation.
+.IP
+.B MALLOC_FOLLOW
+This is a subsequent page in a multi-page allocation.
+.IP
+.B
+struct pginfo*
+.R
+A pointer to a structure describing a partitioned page.
+.PP
+In addition, there exists a linked list of small data structures that
+describe the free space as runs of free pages.
+.PP
+Notice that these structures are not part of the free pages themselves,
+but rather allocated with malloc so that the free pages themselves
+are never referenced while they are free.
+.PP
+When a request for storage comes in, it will be treated as a ``page''
+allocation if it is bigger than half a page.
+The free list will be searched and the first run of free pages that
+can satisfy the request is used. The first page gets set to
+.B MALLOC_FIRST
+status. If more than that one page is needed, the rest of them get
+.B MALLOC_FOLLOW
+status in the page-directory.
+.PP
+If there were no pages on the free list, brk(2) will be called, and
+the pages will get added to the page-directory with status
+.B MALLOC_FREE
+and the search restarts.
+.PP
+Freeing a number of pages is done by changing their state in the
+page directory to MALLOC_FREE, and then traversing the free-pages list to
+find the right place for this run of pages, possibly collapsing
+with the two neighboring runs into one run and, if possible,
+releasing some memory back to the kernel by calling brk(2).
+.PP
+If the request is less than or equal to half of a page, its size will be
+rounded up to the nearest power of two before being processed
+and if the request is less than some minimum size, it is rounded up to
+that size.
+.PP
+These sub-page allocations are served from pages which are split up
+into some number of equal size chunks.
+For each of these pages a
+.B
+struct pginfo
+.R
+describes the size of the chunks on this page, how many there are,
+how many are free and so on.
+The description consists of a bitmap of used chunks, and various counters
+and numbers used to keep track of the stuff in the page.
+.PP
+For each size of sub-page allocation, the pginfo structures for the
+pages that have free chunks in them form a list.
+The heads of these lists are stored in predetermined slots at
+the beginning of the page directory to make access fast.
+.PP
+To allocate a chunk of some size, the head of the list for the
+corresponding size is examined, and a free chunk found. The number
+of free chunks on that page is decreased by one and, if zero, the
+pginfo structure is unlinked from the list.
+.PP
+To free a chunk, the page is derived from the pointer; the page-directory entry
+for that page contains a pointer to the pginfo structure, where the
+free bit is set for the chunk and the number of free chunks is increased by
+one. If the count is now equal to one, the pginfo structure is linked into the
+proper place on the list for this size of chunks.
+If the count increases to match the number of chunks on the page, the
+pginfo structure is unlinked from the list and free(3)'ed and the
+actual page itself is free(3)'ed too.
+.PP
+To be 100% correct performance-wise these lists should be ordered
+according to the recent number of accesses to that page. This
+information is not available and it would essentially mean a reordering
+of the list on every memory reference to keep it up-to-date.
+Instead they are ordered according to the address of the pages.
+Interestingly enough, in practice this comes out to almost the same
+thing performance-wise.
+.PP
+It's not that surprising after all: it's the difference between
+following the crowd and actively directing where it can go; either
+way you can end up in the middle of it all.
+.PP
+The side effect of this compromise is that it also uses less storage,
+and the list never has to be reordered; all the ordering happens when
+pages are added or deleted.
+.PP
+It is an interesting twist to the implementation that the
+.B
+struct pginfo
+.R
+is allocated with malloc.
+That is, "as with malloc" to be painfully correct.
+The code knows the special case where the first (couple of) allocations on
+the page is actually the pginfo structure and deals with it accordingly.
+This avoids some silly "chicken and egg" issues.
+.ds RH Bells and whistles.
+.NH
+Bells and whistles.
+.PP
+brk(2) is actually not a very fast system call when you ask for storage.
+This is mainly because of the need by the kernel to zero the pages before
+handing them over, so this implementation does not release
+heap pages until there is a large chunk to release back to the kernel.
+Chances are pretty good that we will need it again pretty soon anyway.
+Since these pages are not accessed at all, they will soon be paged out
+and don't affect anything but swap-space usage.
+.PP
+The page directory is actually kept in a mmap(2)'ed piece of
+anonymous memory. This avoids some rather silly cases that
+would otherwise have to be handled when the page directory
+has to be extended.
+.PP
+One particularly nice feature is that all pointers passed to free(3)
+and realloc(3) can be checked conclusively for validity:
+First the pointer is masked to find the page. The page directory
+is then examined; it must contain either MALLOC_FIRST, in which
+case the pointer must point exactly at the page, or it can contain
+a struct pginfo*, in which case the pointer must point to one of
+the chunks described by that structure.
+Warnings will be printed on
+.B stderr
+and nothing will be done with
+the pointer if it is found to be invalid.
+.PP
+An environment variable
+.B MALLOC_OPTIONS
+allows the user some control over the behavior of malloc.
+Some of the more interesting options are:
+.IP
+.B Abort
+If malloc fails to allocate storage, core-dump the process with
+a message rather than expect it to handle this correctly.
+It's amazing how few programs actually handle this condition correctly,
+and consequently the havoc they can create is all the more creative or
+destructive.
+.IP
+.B Dump
+Writes malloc statistics to a file called ``malloc.out'' prior
+to process termination.
+.IP
+.B Hint
+Pass a hint to the kernel about pages we no longer need through the
+madvise(2) system call. This can help performance on machines that
+page heavily by eliminating unnecessary page-ins and page-outs of
+unused data.
+.IP
+.B Realloc
+Always do a free and malloc when realloc(3) is called.
+For programs doing garbage collection using realloc(3), this makes the
+heap collapse faster since malloc will reallocate from the
+lowest available address.
+The default
+is to leave things alone if the size of the allocation is still in
+the same size-class.
+.IP
+.B Junk
+will explicitly fill the allocated area with a particular value
+to try to detect if programs rely on it being zero.
+.IP
+.B Zero
+will explicitly zero out the allocated chunk of memory, while any
+space after the allocation in the chunk will be filled with the
+junk value to try to catch out-of-chunk references.
+.ds RH The road not yet taken.
+.NH
+The road not yet taken.
+.PP
+A couple of avenues were explored that could be interesting in some
+set of circumstances.
+.PP
+Using mmap(2) instead of brk(2) was actually slower, since brk(2)
+knows a lot of the things that mmap has to find out first.
+.PP
+In general there is little room for further improvement of the
+time-overhead of the malloc; further improvements will have to
+be in the area of improving paging behavior.
+.PP
+It is still under consideration to add a feature such that
+if realloc is called with two zero arguments, the internal
+allocations will be reallocated to perform a garbage collect.
+This could be used in certain types of programs to collapse
+the memory use, but so far it doesn't seem to be worth the effort.
+.PP
+Malloc/Free can be a significant point of contention in multi-threaded
+programs. Fine-grained locking of the data-structures inside the
+implementation should be implemented to avoid excessive spin-waiting.
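Note on implementation.ms: the request-sizing rules described in that file (anything bigger than half a page becomes a page allocation; smaller requests are rounded up to a power of two, with a floor at some minimum size) and the pointer masking used when validating free(3) arguments can be sketched as below. This is a sketch under stated assumptions, not the code from malloc.c: the 4096-byte page size, the 16-byte minimum chunk and the function names chunk_size_for, pages_for and page_base are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u   /* assumed page size; the real value is per-system */
#define MIN_CHUNK 16u     /* assumed minimum sub-page chunk size */

/*
 * Returns the chunk size used to serve a sub-page request, or 0 if the
 * request is bigger than half a page and must be a page allocation.
 */
size_t chunk_size_for(size_t request)
{
    size_t size = MIN_CHUNK;

    if (request > PAGE_SIZE / 2)
        return 0;
    while (size < request)      /* round up to the next power of two */
        size <<= 1;
    return size;
}

/* Number of whole pages needed to hold a page allocation. */
size_t pages_for(size_t request)
{
    return request / PAGE_SIZE + (request % PAGE_SIZE != 0);
}

/* Mask an address down to the start of the page that contains it. */
uintptr_t page_base(uintptr_t addr)
{
    return addr & ~(uintptr_t)(PAGE_SIZE - 1);
}
```

With these assumed constants, a 17-byte request is served from a page of 32-byte chunks, while a 2049-byte request takes a whole page. For a pointer into a MALLOC_FIRST page, validity then amounts to checking that the pointer equals page_base of itself.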
diff --git a/share/doc/papers/malloc/intro.ms b/share/doc/papers/malloc/intro.ms
new file mode 100644
index 000000000000..0ee87c959a2b
--- /dev/null
+++ b/share/doc/papers/malloc/intro.ms
@@ -0,0 +1,74 @@
+.\"
+.\" ----------------------------------------------------------------------------
+.\" "THE BEER-WARE LICENSE" (Revision 42):
+.\" <phk@FreeBSD.org> wrote this file. As long as you retain this notice you
+.\" can do whatever you want with this stuff. If we meet some day, and you think
+.\" this stuff is worth it, you can buy me a beer in return. Poul-Henning Kamp
+.\" ----------------------------------------------------------------------------
+.\"
+.\" $FreeBSD$
+.\"
+.ds RH Introduction
+.NH
+Introduction
+.PP
+Most programs need to allocate storage dynamically in addition
+to whatever static storage the compiler reserved at compile-time.
+To C programmers this fact is rather obvious, but for many years
+this was not an accepted and recognized fact, and many languages
+still used today don't support this notion adequately.
+.PP
+The classic UNIX kernel provides two very simple and powerful
+mechanisms for obtaining dynamic storage, the execution stack
+and the heap.
+The stack is usually put at the far upper end of the address-space,
+from where it grows down as far as needed, though this may depend on
+the CPU design.
+The heap starts at the end of the
+.B bss
+segment and grows upwards as needed.
+.PP
+There isn't really a kernel-interface to the stack as such.
+The kernel will allocate some amount of memory for it,
+not even telling the process the exact size.
+If the process needs more space than that, it will simply try to access
+it, hoping that the kernel will detect that an access has been
+attempted outside the allocated memory, and try to extend it.
+If the kernel fails to extend the stack (this could be because of lack
+of resources or permissions, or because it may just be impossible
+to do in the first place), the process will usually be shot down by the
+kernel.
+.PP
+In the C language, there exists a little-used interface to the stack,
+.B alloca(3) ,
+which will explicitly allocate space on the stack.
+This is not an interface to the kernel, but merely an adjustment
+done to the stack-pointer such that space will be available and
+unharmed by any subroutine calls yet to be made while the context
+of the current subroutine is intact.
+.PP
+Due to the nature of normal use of the stack, there is no corresponding
+"free" operator, but instead the space is returned when the current
+function returns to its caller and the stack frame is dismantled.
+This is the cause of much grief, and probably the single most important
+reason that alloca(3) is not, and should not be, used widely.
+.PP
+The heap on the other hand has an explicit kernel-interface in the
+system call
+.B brk(2) .
+The argument to brk(2) is a pointer to where the process wants the
+heap to end.
+There is also an interface called
+.B sbrk(2)
+taking an increment to the current end of the heap, but this is merely a
+.B libc
+front for brk(2).
+.PP
+In addition to these two memory resources, modern virtual memory kernels
+provide the mmap(2)/munmap(2) interface which allows almost complete
+control over any bit of virtual memory in the process address space.
+.PP
+Because of the generality of the mmap(2) interface and the way the
+data structures representing the regions are laid out, sbrk(2) is actually
+faster in use than the equivalent mmap(2) call, simply because
+mmap(2) has to search for information that is implicit in the sbrk(2) call.
diff --git a/share/doc/papers/malloc/kernel.ms b/share/doc/papers/malloc/kernel.ms
new file mode 100644
index 000000000000..952e95ccd962
--- /dev/null
+++ b/share/doc/papers/malloc/kernel.ms
@@ -0,0 +1,56 @@
+.\"
+.\" ----------------------------------------------------------------------------
+.\" "THE BEER-WARE LICENSE" (Revision 42):
+.\" <phk@FreeBSD.org> wrote this file. As long as you retain this notice you
+.\" can do whatever you want with this stuff. If we meet some day, and you think
+.\" this stuff is worth it, you can buy me a beer in return. Poul-Henning Kamp
+.\" ----------------------------------------------------------------------------
+.\"
+.\" $FreeBSD$
+.\"
+.ds RH The kernel and memory
+.NH
+The kernel and memory
+.PP
+Brk(2) isn't a particularly convenient interface;
+it was probably made more to fit the memory model of the
+hardware being used than to fill the needs of the programmers.
+.PP
+Before paged and/or virtual memory systems became
+common, the most popular memory management facility used for
+UNIX was segments.
+This was also very often the only vehicle for imposing protection on
+various parts of memory.
+Depending on the hardware, segments can be anything, and consequently
+how the kernels exploited them varied a lot from UNIX to UNIX and from
+machine to machine.
+.PP
+Typically a process would have one segment for the text section, one
+for the data and bss section combined and one for the stack.
+On some systems the text shared a segment with the data and bss, and was
+consequently just as writable as them.
+.PP
+In this setup all the brk(2) system call has to do is to find the
+right amount of free storage, possibly moving things around in physical
+memory, maybe even swapping out a segment or two to make space,
+and change the upper limit on the data segment according to the address given.
+.PP
+In a more modern page based virtual memory implementation this is still
+pretty much the situation, except that the granularity is now pages:
+The kernel finds the right number of free pages, possibly paging some
+pages out to free them up, and then plugs them into the page-table of
+the process.
+.PP
+As such the difference is very small; the real difference is that in
+the old world of swapping, either the entire process was in primary
+storage or it wouldn't be selected to be run. In a modern VM kernel,
+a process might only have a subset of its pages in primary memory;
+the rest will be paged in if and when the process tries to access them.
+.PP
+Only very few programs deal with the brk(2) interface directly.
+The few that do usually have their own memory management facilities.
+LISP or FORTH interpreters are good examples.
+Most other programs use the
+.B malloc(3)
+interface instead, and leave it to the malloc implementation to
+use brk(2) to get storage allocated from the kernel.
diff --git a/share/doc/papers/malloc/malloc.ms b/share/doc/papers/malloc/malloc.ms
new file mode 100644
index 000000000000..4f3cf7d80def
--- /dev/null
+++ b/share/doc/papers/malloc/malloc.ms
@@ -0,0 +1,72 @@
+.\"
+.\" ----------------------------------------------------------------------------
+.\" "THE BEER-WARE LICENSE" (Revision 42):
+.\" <phk@FreeBSD.org> wrote this file. As long as you retain this notice you
+.\" can do whatever you want with this stuff. If we meet some day, and you think
+.\" this stuff is worth it, you can buy me a beer in return. Poul-Henning Kamp
+.\" ----------------------------------------------------------------------------
+.\"
+.\" $FreeBSD$
+.\"
+.ds RH Malloc and free
+.NH
+Malloc and free
+.PP
+The job of malloc(3) is to turn the rather simple
+brk(2) facility into a service programs can
+actually use without getting hurt.
+.PP
+The archetypical malloc(3) implementation keeps track of the memory between
+the end of the bss section, as defined by the
+.B _end
+symbol, and the current brk(2) point using a linked list of chunks of memory.
+Each item on the list has a status as either free or used, a pointer
+to the next entry and in most cases to the previous as well, to speed
+up inserts and deletes in the list.
+.PP
+When a malloc(3) request comes in, the list is traversed from the
+front and if a free chunk big enough to hold the request is found,
+it is returned; if the free chunk is bigger than the size requested,
+a new free chunk is made from the excess and put back on the list.
+.PP
+When a chunk is
+.B free(3) 'ed,
+the chunk is found in the list, its status
+is changed to free and if one or both of the surrounding chunks
+are free, they are collapsed to one.
+.PP
+A third kind of request,
+.B realloc(3) ,
+will resize
+a chunk, trying to avoid copying the contents if possible.
+It is seldom used, and has only had a significant impact on performance
+in a few special situations.
+The typical pattern of use is to malloc(3) a chunk of the maximum size
+needed, read in the data and adjust the size of the chunk to match the
+size of the data read using realloc(3).
+.PP
+For reasons of efficiency, the original implementation of malloc(3)
+put the small structure used to contain the next and previous pointers
+plus the state of the chunk right before the chunk itself.
+.PP
+As a matter of fact, the canonical malloc(3) implementation can be
+studied in the ``Old testament'', chapter 8 verse 7 [Kernighan & Ritchie].
+.PP
+Various optimisations can be applied to the above basic algorithm:
+.IP
+If in freeing a chunk, we end up with the last chunk on the list being
+free, we can return that to the kernel by calling brk(2) with the first
+address of that chunk and then make the previous chunk the last on the
+chain by terminating its ``next'' pointer.
+.IP
+A best-fit algorithm can be used instead of first-fit at an expense
+of memory, because statistically fewer chances to brk(2) backwards will
+present themselves.
+.IP
+Splitting the list in two, one for used and one for free chunks, to
+speed the searching.
+.IP
+Putting free chunks on one of several free lists, depending on their size,
+to speed allocation.
+.IP
+\&...
diff --git a/share/doc/papers/malloc/performance.ms b/share/doc/papers/malloc/performance.ms
new file mode 100644
index 000000000000..773f92ab7832
--- /dev/null
+++ b/share/doc/papers/malloc/performance.ms
@@ -0,0 +1,113 @@
+.\"
+.\" ----------------------------------------------------------------------------
+.\" "THE BEER-WARE LICENSE" (Revision 42):
+.\" <phk@FreeBSD.org> wrote this file. As long as you retain this notice you
+.\" can do whatever you want with this stuff. If we meet some day, and you think
+.\" this stuff is worth it, you can buy me a beer in return. Poul-Henning Kamp
+.\" ----------------------------------------------------------------------------
+.\"
+.\" $FreeBSD$
+.\"
+.ds RH Performance
+.NH
+Performance
+.PP
+Performance for a malloc(3) implementation comes as two variables:
+.IP
+A: How much time does it use for searching and manipulating data structures?
+We will refer to this as ``overhead time''.
+.IP
+B: How well does it manage the storage?
+This rather vague metric we call ``quality of allocation''.
+.PP
+The overhead time is easy to measure: just do a lot of malloc/free calls
+of various kinds and combinations, and compare the results.
+.PP
+The quality of allocation is not quite as simple as that.
+One measure of quality is the size of the process, which should obviously
+be minimized.
+Another measure is the execution time of the process.
+This is not an obvious indicator of quality, but people will generally
+agree that it should be minimized as well, and if malloc(3) can do
+anything to help, it should.
+An explanation of why it is still a good metric follows:
+.PP
+In a traditional segment/swap kernel, the desirable behavior of a process
+is to keep the brk(2) as low as possible, thus minimizing the size of the
+data/bss/heap segment, which in turn translates to a smaller process and
+a smaller probability of the process being swapped out, and therefore a
+faster execution time on average.
+.PP
+In a paging environment this is not a bad choice for a default, but
+a couple of details need to be looked at much more carefully.
+.PP
+First of all, the size of a process becomes a more vague concept since
+only the pages that are actually used need to be in primary storage
+for execution to progress, and they only need to be there when used.
+That implies that many more processes can fit in the same amount of
+primary storage, since most processes have a high degree of locality
+of reference and thus only need some fraction of their pages to actually
+do their job.
+.PP
+From this it follows that the interesting size of the process is some
+subset of the total amount of virtual memory occupied by the process.
+This number isn't a constant; it varies depending on the whereabouts
+of the process, and it may indeed fluctuate wildly over the lifetime
+of the process.
+.PP
+One of the names for this vague concept is ``current working set''.
+It has been defined many different ways over the years, mostly to
+satisfy and support claims in marketing or benchmark contexts.
+.PP
+For now we can simply say that it is the number of pages the process
+needs in order to run at a sufficiently low paging rate in a congested
+primary storage.
+(If primary storage isn't congested, this is not really important
+of course, but most systems would be better off using the pages for
+disk-cache or similar functions, so from that perspective it will
+always be congested.)
+If the number of pages is too small, the process will wait for its
+pages to be read from secondary storage much of the time; if it's too
+big, the space could be used better for something else.
+.PP
+From the view of any single process, this number of pages is
+``all of my pages'', but from the point of view of the OS it should
+be tuned to maximise the total throughput of all the processes on
+the machine at the time.
+This is usually done using various kinds of least-recently-used
+replacement algorithms to select page candidates for replacement.
+.PP
+With this knowledge, can we decide what the performance goal is for
+a modern malloc(3)?
+Well, it's almost as simple as it used to be:
+.B
+Minimize the number of pages accessed.
+.R
+.PP
+This really is the core of it all.
+If the number of accessed pages is smaller, then locality of reference is
+higher, and all kinds of caches (which is essentially what the
+primary storage is in a VM system) work better.
+.PP
+It's interesting to notice that the classical malloc fails on this one
+because the information about free chunks is kept with the free
+chunks themselves. In some of the benchmarks this came out as all the
+pages being paged in every time a malloc call was made, because malloc
+had to traverse the free list to find a suitable chunk for the allocation.
+If memory is not in use, then you shouldn't access it.
+.PP
+The secondary goal is more evident:
+.B
+Try to work in pages.
+.R
+.PP
+That makes it easier for the kernel, and wastes less virtual memory.
+Most modern implementations do this when they interact with the
+kernel, but few try to avoid objects spanning pages.
+.PP
+If an object's size
+is less than or equal to a page, there is no reason for it to span two pages.
+Having objects span pages means that two pages must be
+paged in if that object is accessed.
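+.PP
+Whether an object spans pages is simple arithmetic on its address and size.
+The helper names below are hypothetical, and the 4096-byte page size used in
+the usage note is an assumption; a real program would ask the system for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of pages touched by the object occupying [addr, addr + size). */
size_t pages_touched(uintptr_t addr, size_t size, size_t page_size)
{
    assert(size > 0 && page_size > 0);
    uintptr_t first = addr / page_size;              /* page of first byte */
    uintptr_t last = (addr + size - 1) / page_size;  /* page of last byte */
    return (size_t)(last - first + 1);
}

/* Nonzero if the object crosses at least one page boundary. */
int spans_pages(uintptr_t addr, size_t size, size_t page_size)
{
    return pages_touched(addr, size, page_size) > 1;
}
```

+With 4096-byte pages, a 200-byte object at address 4000 touches two pages,
+while the same object placed at address 4096 touches only one.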
+.PP
+With this analysis in hand, we can start coding.
diff --git a/share/doc/papers/malloc/problems.ms b/share/doc/papers/malloc/problems.ms
new file mode 100644
index 000000000000..980f2e97ddba
--- /dev/null
+++ b/share/doc/papers/malloc/problems.ms
@@ -0,0 +1,54 @@
+.\"
+.\" ----------------------------------------------------------------------------
+.\" "THE BEER-WARE LICENSE" (Revision 42):
+.\" <phk@FreeBSD.org> wrote this file. As long as you retain this notice you
+.\" can do whatever you want with this stuff. If we meet some day, and you think
+.\" this stuff is worth it, you can buy me a beer in return. Poul-Henning Kamp
+.\" ----------------------------------------------------------------------------
+.\"
+.\" $FreeBSD$
+.\"
+.ds RH The problems
+.NH
+The problems
+.PP
+Even though malloc(3) is a lot simpler to use
+than the raw brk(2)/sbrk(2) interface,
+or maybe exactly because
+of that,
+a lot of problems arise from its use.
+.IP
+Writing to memory outside the allocated chunk.
+The most likely result being that the data structure used to hold
+the links and flags about this chunk or the next one gets trashed.
+.IP
+Freeing a pointer to memory not allocated by malloc.
+This is often a pointer that points to an object on the stack or in the
+data section; in newer implementations of C it may even be in the text
+section, where it is likely to be read-only.
+Some malloc implementations detect this, some don't.
+.IP
+Freeing a modified pointer.
+This is a very common mistake: freeing
+not the pointer malloc(3) returned, but rather some offset from it.
+Some mallocs will handle this correctly if the offset is positive.
+.IP
+Freeing the same pointer more than once.
+.IP
+Accessing memory in a chunk after it has been free(3)'ed.
+.PP
+The handling of these problems has traditionally been weak.
+A core-dump was the most common form of ``handling'', but in rare
+cases one could experience the famous ``malloc: corrupt arena.''
+message before the core-dump.
+Even worse though, very often the program will just continue,
+possibly giving wrong results.
+.PP
+An entirely different form of problem is that
+the memory returned by malloc(3) can contain any value.
+Unfortunately most kernels, correctly, zero out the storage they
+provide with brk(2), and thus the storage malloc returns will be zeroed
+in many cases as well, so programmers are not particularly apt to notice
+that their code depends on malloc'ed storage being zeroed.
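+.PP
+A small C sketch of code that does not depend on malloc'ed storage being
+zeroed is shown here.
+The struct counter type and the function names are invented for the example;
+the point is only that every field is set explicitly, or that calloc(3) is
+used when zeroed storage is actually wanted.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record type used only for this illustration. */
struct counter {
    long hits;
    long misses;
};

/* malloc(3) promises nothing about contents, so set every field. */
struct counter *counter_new(void)
{
    struct counter *c = malloc(sizeof *c);
    if (c == NULL)
        return NULL;
    c->hits = 0;
    c->misses = 0;
    return c;
}

/* Alternative: calloc(3) is specified to return zeroed storage. */
struct counter *counter_new_zeroed(void)
{
    return calloc(1, sizeof(struct counter));
}

void counter_hit(struct counter *c)
{
    assert(c != NULL);
    c->hits++;
}
```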
+.PP
+With problems this big and error handling this weak, it is not
+surprising that problems are hard and time-consuming to find and fix.
diff --git a/share/doc/papers/newvm/0.t b/share/doc/papers/newvm/0.t<