<?xml version="1.0" encoding="iso-8859-1" standalone="no"?>
<!DOCTYPE article PUBLIC "-//FreeBSD//DTD DocBook XML V4.5-Based Extension//EN"
	"../../../share/xml/freebsd45.dtd">

<article lang='en'>
  <title>Writing a GEOM Class</title>
  <articleinfo>

    <authorgroup>
      <author>
        <firstname>Ivan</firstname>
        <surname>Voras</surname>
        <affiliation>
          <address><email>ivoras@FreeBSD.org</email>
          </address>
        </affiliation>
      </author>
    </authorgroup>

    <legalnotice id="trademarks" role="trademarks">
      &tm-attrib.freebsd;
      &tm-attrib.cvsup;
      &tm-attrib.intel;
      &tm-attrib.general;
    </legalnotice>

    <pubdate>$FreeBSD$</pubdate>

    <releaseinfo>$FreeBSD$</releaseinfo>

    <abstract>

      <para>This text documents some starting points in developing
      GEOM classes, and kernel modules in general. It is assumed
      that the reader is familiar with C userland programming.</para>

    </abstract>

  </articleinfo>

<!-- Introduction -->
<sect1 id="intro">
  <title>Introduction</title>

  <sect2 id="intro-docs">
    <title>Documentation</title>

    <para>Documentation on kernel programming is scarce. It is one of
      the few areas where there is nearly nothing in the way of friendly
      tutorials, and the phrase <quote>use the source!</quote> really
      holds true. However, there are some bits and pieces (some of
      them seriously outdated) floating around that should be studied
      before beginning to code:</para>

    <itemizedlist>

      <listitem><para>The <ulink
        url="&url.books.developers-handbook;/index.html">FreeBSD
        Developer's Handbook</ulink> &mdash; part of the documentation
        project, it does not contain anything specific to kernel
        programming, but rather some general useful information.</para></listitem>

      <listitem><para>The <ulink
        url="&url.books.arch-handbook;/index.html">FreeBSD
        Architecture Handbook</ulink> &mdash; also from the documentation
        project, contains descriptions of several low-level facilities
        and procedures.  The most important chapter is 13, <ulink
        url="&url.books.arch-handbook;/driverbasics.html">Writing
        FreeBSD device drivers</ulink>.</para></listitem>

      <listitem><para>The Blueprints section of <ulink
        url="http://www.freebsddiary.org">FreeBSD Diary</ulink> web
        site &mdash; contains several interesting articles on kernel
        facilities.</para></listitem>

      <listitem><para>The man pages in section 9 &mdash; for important
        documentation on kernel functions.</para></listitem>

      <listitem><para>The &man.geom.4; man page and <ulink
        url="http://phk.freebsd.dk/pubs/">PHK's GEOM slides</ulink>
        &mdash; for a general introduction to the GEOM
        subsystem.</para></listitem>

      <listitem><para>Man pages &man.g.bio.9;, &man.g.event.9;, &man.g.data.9;,
        &man.g.geom.9;, &man.g.provider.9;, &man.g.consumer.9;, &man.g.access.9;,
        and others linked from those, for documentation on specific
        functionality.</para></listitem>

      <listitem><para>The &man.style.9; man page &mdash; for documentation on
        the coding-style conventions which must be followed for any code
        which is to be committed to the FreeBSD CVS tree.</para></listitem>

    </itemizedlist>

    </sect2>
  </sect1>

  <sect1 id="prelim">
    <title>Preliminaries</title>

    <para>The best way to do kernel development is to have (at least)
      two separate computers. One of these would contain the
      development environment and sources, and the other would be used
      to test the newly written code by network-booting and
      network-mounting filesystems from the first one.  This way if
      the new code contains bugs and crashes the machine, it will not
      mess up the sources (and other <quote>live</quote> data). The
      second system does not even require a proper display.  Instead, it
      could be connected with a serial cable or KVM to the first
      one.</para>

    <para>But, since not everybody has two or more computers handy, there are
      a few things that can be done to prepare an otherwise <quote>live</quote>
      system for developing kernel code. This setup is also applicable
      for developing in a <ulink url="http://www.vmware.com/">VMware</ulink>
      or <ulink url="http://www.qemu.org/">QEMU</ulink> virtual machine (the
      next best thing after a dedicated development machine).</para>

    <sect2 id="prelim-system">
      <title>Modifying a system for development</title>

      <para>For any kernel programming, a kernel with
        <option>INVARIANTS</option> enabled is a must-have, so add
        these lines to your kernel configuration file:</para>

       <programlisting>options INVARIANT_SUPPORT
options INVARIANTS</programlisting>

      <para>For more debugging you should also include WITNESS support,
        which will alert you to mistakes in locking:</para>

       <programlisting>options WITNESS</programlisting>

      <para>For debugging crash dumps, a kernel with debug symbols is
        needed:</para>

      <programlisting>  makeoptions    DEBUG=-g</programlisting>

      <para>With the usual way of installing the kernel (<command>make
        installkernel</command>) the debug kernel will not be
        automatically installed. It is called
        <filename>kernel.debug</filename> and located in
        <filename>/usr/obj/usr/src/sys/KERNELNAME/</filename>.  For
        convenience it should be copied to
        <filename>/boot/kernel/</filename>.</para>

      <para>Another convenience is enabling the kernel debugger so you
        can examine a kernel panic when it happens. For this, enter
        the following lines in your kernel configuration file:</para>

      <programlisting>options KDB
options DDB
options KDB_TRACE</programlisting>

      <para>For this to work you might need to set a sysctl (if it is
        not on by default):</para>

      <programlisting>  debug.debugger_on_panic=1</programlisting>

      <para>Kernel panics will happen, so care should be taken with
        the filesystem cache. In particular, having softupdates might
        mean the latest file version could be lost if a panic occurs
        before it is committed to storage.  Disabling softupdates
        incurs a large performance hit, and still does not guarantee
        data consistency.  Mounting the filesystem with the <quote>sync</quote> option
        is needed for that.  As a compromise, the softupdates cache delays can
        be shortened. There are three sysctls that are useful for
        this (best set in
        <filename>/etc/sysctl.conf</filename>):</para>

      <programlisting>kern.filedelay=5
kern.dirdelay=4
kern.metadelay=3</programlisting>

      <para>The numbers represent seconds.</para>

      <para>For debugging kernel panics, kernel core dumps are
        required. Since a kernel panic might make filesystems
        unusable, this crash dump is first written to a raw
        partition. Usually, this is the swap partition.  This partition must be at
        least as large as the physical RAM in the machine. On the
        next boot, the dump is copied to a regular file.
        This happens after filesystems are checked and mounted, and
        before swap is enabled.  This is controlled with two
        <filename>/etc/rc.conf</filename> variables:</para>

      <programlisting>dumpdev="/dev/ad0s4b"
dumpdir="/usr/core"</programlisting>

      <para>The <varname>dumpdev</varname> variable specifies the swap
        partition and <varname>dumpdir</varname> tells the system
        where in the filesystem to relocate the core dump on reboot.</para>
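
      <para>Once the dump has been relocated, it can be examined with
        the <command>kgdb</command> debugger.  The following is only a
        sketch: it assumes that the debug kernel was copied to
        <filename>/boot/kernel/</filename> as described earlier, and
        that the dump was saved under the name
        <filename>vmcore.0</filename> (the number increases with each
        saved dump):</para>

      <programlisting># Open the saved dump together with the matching debug kernel (as root).
kgdb /boot/kernel/kernel.debug /usr/core/vmcore.0</programlisting>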

      <para>Writing kernel core dumps is slow, so
        if you have lots of memory (&gt;256M) and lots of panics it could
        be frustrating to sit and wait while it is done (twice: first
        to write it to swap, then to relocate it to the filesystem). It is
        convenient then to limit the amount of RAM the system will use
        via a <filename>/boot/loader.conf</filename> tunable:</para>

      <programlisting>  hw.physmem="256M"</programlisting>

      <para>If the panics are frequent and filesystems large (or you
        simply do not trust softupdates+background fsck) it is advisable
        to turn background fsck off via an
        <filename>/etc/rc.conf</filename> variable:</para>

      <programlisting>  background_fsck="NO"</programlisting>

      <para>This way, the filesystems will always get checked when
        needed.  Note that with background fsck, a new panic could happen while
        it is checking the disks. Again, the safest way is not to have
        many local filesystems by using another computer as an NFS
        server.</para>
    </sect2>

    <sect2 id="prelim-starting">
      <title>Starting the project</title>

      <para>For the purpose of creating a new GEOM class, an empty
        subdirectory has to be created under an arbitrary user-accessible
        directory. You do not have to create the module directory under
        <filename>/usr/src</filename>.</para>
    </sect2>

    <sect2 id="prelim-makefile">
      <title>The Makefile</title>

      <para>It is good practice to create
        <filename>Makefile</filename>s for every nontrivial coding
        project, which of course includes kernel modules.</para>

      <para>Creating the <filename>Makefile</filename> is simple
        thanks to an extensive set of helper routines provided by the
        system. In short, here is how a minimal <filename>Makefile</filename>
        looks for a kernel module:</para>

      <programlisting>SRCS=g_journal.c
KMOD=geom_journal

.include &lt;bsd.kmod.mk&gt;</programlisting>

      <para>This <filename>Makefile</filename> (with changed filenames)
        will do for any kernel module, and a GEOM class can reside in just
        one kernel module. If more than one source file is required, list
        them all in the <envar>SRCS</envar> variable, separated by
        whitespace.</para>
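
      <para>With such a <filename>Makefile</filename> in place, a
        typical build-and-test cycle might look like the following
        sketch.  The module file name is assumed to follow from the
        <envar>KMOD</envar> value in the example
        <filename>Makefile</filename>; loading and unloading require
        root privileges:</para>

      <programlisting># Build the module in the project directory.
make

# Load the freshly built module from the current directory.
kldload ./geom_journal.ko

# Check that it is loaded, then unload it again.
kldstat
kldunload geom_journal</programlisting>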
    </sect2>
  </sect1>

  <sect1 id="kernelprog">
    <title>On FreeBSD kernel programming</title>

    <sect2 id="kernelprog-memalloc">
      <title>Memory allocation</title>

      <para>See &man.malloc.9;. Basic memory allocation is only
        slightly different from its userland equivalent. Most
        notably, <function>malloc</function>() and
        <function>free</function>() accept additional parameters, as
        described in the man page.</para>

      <para>A <quote>malloc type</quote> must be declared in the
        declaration section of a source file, like this:</para>

      <programlisting>  static MALLOC_DEFINE(M_GJOURNAL, "gjournal data", "GEOM_JOURNAL Data");</programlisting>

      <para>To use this macro, <filename>sys/param.h</filename>,
        <filename>sys/kernel.h</filename> and
        <filename>sys/malloc.h</filename> headers must be
        included.</para>
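
      <para>As a minimal sketch of how such a malloc type is used, the
        fragment below allocates and frees a zeroed structure.  The
        structure and function names
        (<structname>g_example_softc</structname>,
        <function>g_example_alloc</function>() and
        <function>g_example_free</function>()) are hypothetical and
        serve only as an illustration:</para>

      <programlisting>#include &lt;sys/param.h&gt;
#include &lt;sys/kernel.h&gt;
#include &lt;sys/malloc.h&gt;

static MALLOC_DEFINE(M_GJOURNAL, "gjournal data", "GEOM_JOURNAL Data");

/* Hypothetical private data, used only for illustration. */
struct g_example_softc {
	int	sc_flags;
};

static struct g_example_softc *
g_example_alloc(void)
{

	/*
	 * M_WAITOK may sleep until memory is available, so the call
	 * does not return NULL.  M_ZERO zero-fills the allocation.
	 */
	return (malloc(sizeof(struct g_example_softc), M_GJOURNAL,
	    M_WAITOK | M_ZERO));
}

static void
g_example_free(struct g_example_softc *sc)
{

	/* The malloc type passed to free() must match the allocation. */
	free(sc, M_GJOURNAL);
}</programlisting>

      <para>Because <varname>M_WAITOK</varname> may sleep, a function
        like this one is only suitable for contexts in which sleeping
        is allowed.</para>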

      <para>There is another mechanism for allocating memory, the UMA
        (Universal Memory Allocator). See &man.uma.9; for details.
        It is a special type of allocator mainly used for fast
        allocation of many same-sized items (for
        example, dynamic arrays of structs).</para>
    </sect2>

    <sect2 id="kernelprog-lists">
      <title>Lists and queues</title>

      <para>See &man.queue.3;. There are many cases in which a list of
        things needs to be maintained. Fortunately, this data
        structure is implemented (in several ways) by C macros
        included in the system. The most used list type is TAILQ
        because it is the most flexible. It is also the one with the largest
        memory requirements (its elements are doubly-linked) and
        the slowest (although the difference is on
        the order of a few CPU instructions, so it should not be
        taken seriously).</para>
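
      <para>A short sketch of TAILQ usage follows.  The event structure
        and the function are hypothetical; only the
        <literal>TAILQ_*</literal> macros come from
        &man.queue.3;:</para>

      <programlisting>#include &lt;sys/param.h&gt;
#include &lt;sys/queue.h&gt;

/* Hypothetical list element; the TAILQ_ENTRY field holds the links. */
struct g_example_event {
	int				ev_type;
	TAILQ_ENTRY(g_example_event)	ev_link;
};

/* Declares "struct g_example_evq", the head of the list. */
TAILQ_HEAD(g_example_evq, g_example_event);

/* Queues both events, sums their ev_type fields and empties the list. */
static int
g_example_sum(struct g_example_event *a, struct g_example_event *b)
{
	struct g_example_evq queue;
	struct g_example_event *ev;
	int sum;

	TAILQ_INIT(&amp;queue);
	TAILQ_INSERT_TAIL(&amp;queue, a, ev_link);
	TAILQ_INSERT_TAIL(&amp;queue, b, ev_link);

	sum = 0;
	TAILQ_FOREACH(ev, &amp;queue, ev_link)
		sum += ev-&gt;ev_type;

	/* Remove elements from the head until the list is empty. */
	while ((ev = TAILQ_FIRST(&amp;queue)) != NULL)
		TAILQ_REMOVE(&amp;queue, ev, ev_link);
	return (sum);
}</programlisting>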

      <para>If data retrieval speed is very important, see
        &man.tree.3; and &man.hashinit.9;.</para>
    </sect2>

    <sect2 id="kernelprog-bios">
      <title>BIOs</title>

      <para>Structure <structname>bio</structname> is used for any and
        all Input/Output operations concerning GEOM. It basically
        contains information about what device ('provider') should
        satisfy the request, request type, offset, length, pointer to
        a buffer, and a bunch of <quote>user-specific</quote> flags
        and fields that can help implement various hacks.</para>

      <para>The important thing here is that <structname>bio</structname>s
        are handled asynchronously. That means that, in most parts of the code,
        there is no analogue to userland's &man.read.2; and
        &man.write.2; calls that do not return until a request is
        done. Rather, a developer-supplied function is called as a
        notification when the request gets completed (or results in
        error).</para>

      <para>The asynchronous programming model (also
        called <quote>event-driven</quote>) is somewhat harder
        than the far more common imperative model used in userland
        (at least it takes a
        while to get used to it). In some cases the helper routines
        <function>g_write_data</function>() and
        <function>g_read_data</function>() can be used, but <emphasis>not
        always</emphasis>. In particular, they cannot be used when
        a mutex is held; for example, the GEOM topology mutex or
        the internal mutex held during the <function>.start</function>() and
        <function>.stop</function>() functions.</para>

    </sect2>
  </sect1>

  <sect1 id="geom">
    <title>On GEOM programming</title>

    <sect2 id="geom-ggate">
      <title>Ggate</title>

      <para>If maximum performance is not needed, a much simpler way
        of making a data transformation is to implement it in userland
        via the ggate (GEOM gate) facility. Unfortunately, there is no
        easy way to convert between, or even share code between the
        two approaches.</para>
    </sect2>

    <sect2 id="geom-class">
      <title>GEOM class</title>

      <para>GEOM classes are transformations on the data. These transformations
        can be combined in a tree-like fashion. Instances of GEOM classes are
        called <emphasis>geoms</emphasis>.</para>

      <para>Each GEOM class has several <quote>class methods</quote> that get called
        when there is no geom instance available (or they are simply not
        bound to a single instance):</para>

      <itemizedlist>

        <listitem><para><function>.init</function> is called when GEOM
          becomes aware of a GEOM class (e.g. when the kernel module
          gets loaded).</para></listitem>

        <listitem><para><function>.fini</function> is called when GEOM
          abandons the class (e.g. when the module gets
          unloaded).</para></listitem>

        <listitem><para><function>.taste</function> is called after
          <function>.init</function>, once for
          each provider the system has available.  If applicable, this
          function will usually create and start a geom
          instance.</para></listitem>

        <listitem><para><function>.destroy_geom</function> is called when
          the geom should be disbanded.</para></listitem>

        <listitem><para><function>.ctlreq</function> is called when the user
          requests reconfiguration of an existing geom.</para></listitem>

      </itemizedlist>

      <para>Also defined are the GEOM event functions, which will get
        copied to the geom instance.</para>

      <para>Field <function>.geom</function> in the
        <structname>g_class</structname> structure is a LIST of geoms
        instantiated from the class.</para>

      <para>These functions are called from the g_event kernel thread.</para>
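
      <para>As a sketch of how the class methods are wired together, a
        class is typically described by a statically initialized
        <structname>g_class</structname> structure and registered with
        the <literal>DECLARE_GEOM_CLASS</literal> macro.  The class name
        <literal>EXAMPLE</literal> and the
        <function>g_example_*</function>() functions are hypothetical,
        and the taste method shown here simply declines every
        provider:</para>

      <programlisting>#include &lt;sys/param.h&gt;
#include &lt;sys/systm.h&gt;
#include &lt;sys/kernel.h&gt;
#include &lt;sys/module.h&gt;
#include &lt;sys/lock.h&gt;
#include &lt;sys/mutex.h&gt;
#include &lt;sys/bio.h&gt;
#include &lt;sys/malloc.h&gt;
#include &lt;geom/geom.h&gt;

static struct g_geom *
g_example_taste(struct g_class *mp, struct g_provider *pp, int flags)
{

	g_topology_assert();
	/* A real class would look for its metadata on pp here. */
	return (NULL);		/* Decline: no geom is created. */
}

static int
g_example_destroy_geom(struct gctl_req *req, struct g_class *mp,
    struct g_geom *gp)
{

	g_topology_assert();
	/* Let GEOM tear the geom down once it is no longer in use. */
	g_wither_geom(gp, ENXIO);
	return (0);
}

static struct g_class g_example_class = {
	.name = "EXAMPLE",
	.version = G_VERSION,
	.taste = g_example_taste,
	.destroy_geom = g_example_destroy_geom,
	/* .init, .fini and .ctlreq are left unset in this sketch. */
};

DECLARE_GEOM_CLASS(g_example_class, g_example);</programlisting>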

    </sect2>

    <sect2 id="geom-softc">
      <title>Softc</title>

      <para>The name <quote>softc</quote> is a legacy term for
        <quote>driver private data</quote>. The name most probably
        comes from the archaic term <quote>software control block</quote>.
        In GEOM, it is a structure (more precisely, a pointer to a
        structure) that can be attached to a geom instance to hold
        whatever data is private to the geom instance. Most GEOM classes
        keep the following members in it:</para>

      <itemizedlist>
        <listitem><para><varname>struct g_provider *provider</varname> : The
        <quote>provider</quote> this geom instantiates</para></listitem>

        <listitem><para><varname>uint16_t n_disks</varname> : Number of
          consumers this geom consumes</para></listitem>

        <listitem><para><varname>struct g_consumer **disks</varname> : Array
          of <varname>struct g_consumer*</varname>. (It is not possible
          to use just single indirection because the
          <varname>struct g_consumer</varname> objects
          are created on our behalf by GEOM.)</para></listitem>
      </itemizedlist>

      <para>The <structname>softc</structname> structure contains all
        the state of a geom instance. Every geom instance has its own
        softc.</para>
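
      <para>Put together, a softc with the members listed above might
        be declared as in the following sketch.  The structure name and
        the two helper functions are hypothetical; the
        <varname>softc</varname> field of
        <structname>g_geom</structname> is where the pointer is
        kept:</para>

      <programlisting>#include &lt;sys/param.h&gt;
#include &lt;sys/systm.h&gt;
#include &lt;geom/geom.h&gt;

struct g_example_softc {
	struct g_provider	 *provider;	/* provider this geom instantiates */
	uint16_t		  n_disks;	/* number of consumers */
	struct g_consumer	**disks;	/* array of n_disks pointers */
};

/* Attach the private data to a geom instance. */
static void
g_example_set_softc(struct g_geom *gp, struct g_example_softc *sc)
{

	gp-&gt;softc = sc;
}

/* Get the private data back, for example in an event function. */
static struct g_example_softc *
g_example_get_softc(struct g_geom *gp)
{

	return (gp-&gt;softc);
}</programlisting>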
    </sect2>

    <sect2 id="geom-metadata">
      <title>Metadata</title>

      <para>The format of metadata is more or less class-dependent, but
        MUST start with:</para>

      <itemizedlist>

        <listitem><para>16 byte buffer for null-terminated signature
          (usually the class name)</para></listitem>

        <listitem><para>uint32 version ID</para></listitem>

      </itemizedlist>

      <para>It is assumed that geom classes know how to handle metadata
        with version IDs lower than theirs.</para>

      <para>Metadata is located in the last sector of the provider
        (and thus must fit in it).</para>

      <para>(All this is implementation-dependent but all existing
        code works like that, and it is supported by libraries.)</para>
    </sect2>

    <sect2 id="geom-creating">
      <title>Labeling/creating a geom</title>

      <para>The sequence of events is:</para>

      <itemizedlist>

        <listitem><para>The user calls the &man.geom.8; utility (or one of its
          hardlinked friends).</para></listitem>

        <listitem><para>The utility figures out which geom class it is
          supposed to handle and searches for the
          <filename>geom_<replaceable>CLASSNAME</replaceable>.so</filename>
          library (usually in
          <filename>/lib/geom</filename>).</para></listitem>

        <listitem><para>It &man.dlopen.3;-s the library and extracts the
          definitions of command-line parameters and helper
          functions.</para></listitem>

      </itemizedlist>

      <para>In the case of creating/labeling a new geom, this is what
      happens:</para>

      <itemizedlist>

        <listitem><para>&man.geom.8; looks in the command-line arguments
          for the command (usually <option>label</option>), and calls a helper
          function.</para></listitem>

        <listitem><para>The helper function checks parameters and gathers
          metadata, which it proceeds to write to all concerned
          providers.</para></listitem>

        <listitem><para>This <quote>spoils</quote> existing geoms (if any) and
          initiates a new round of <quote>tasting</quote> of the providers. The
          intended geom class recognizes the metadata and brings the
          geom up.</para></listitem>

      </itemizedlist>

      <para>(The above sequence of events is implementation-dependent
        but all existing code works like that, and it is supported by
        libraries.)</para>

    </sect2>

    <sect2 id="geom-command">
      <title>Geom command structure</title>

      <para>The helper <filename>geom_CLASSNAME.so</filename> library
        exports the <structname>class_commands</structname> structure,
        which is an array of <structname>struct g_command</structname>
        elements. Commands are of uniform format and look like:</para>

      <programlisting>  verb [-options] geomname [other]</programlisting>

      <para>Common verbs are:</para>

      <itemizedlist>

        <listitem><para>label &mdash; to write metadata to devices so they can be
          recognized at tasting and brought up in geoms</para></listitem>

        <listitem><para>destroy &mdash; to destroy metadata, so the geoms get
          destroyed</para></listitem>

      </itemizedlist>

      <para>Common options are:</para>

      <itemizedlist>
        <listitem><para><literal>-v</literal> : be verbose</para></listitem>
        <listitem><para><literal>-f</literal> : force</para></listitem>
      </itemizedlist>

      <para>Many actions, such as labeling and destroying metadata, can
        be performed in userland. For this, <structname>struct
        g_command</structname> provides the field
        <varname>gc_func</varname> that can be set to a function (in
        the same <filename>.so</filename>) that will be called to
        process a verb. If <varname>gc_func</varname> is NULL, the
        command will be passed to the kernel module, to the
        <function>.ctlreq</function> function of the geom
        class.</para>
    </sect2>

    <sect2 id="geom-geoms">
      <title>Geoms</title>

      <para>Geoms are instances of GEOM classes. They have internal
        data (a softc structure) and some functions with which they
        respond to external events.</para>

      <para>The event functions are:</para>

      <itemizedlist>
        <listitem><para><function>.access</function> : calculates
        permissions (read/write/exclusive)</para></listitem>

        <listitem><para><function>.dumpconf</function> : returns
        XML-formatted information about the geom</para></listitem>

        <listitem><para><function>.orphan</function> : called when some
        underlying provider gets disconnected</para></listitem>

        <listitem><para><function>.spoiled</function> : called when some
        underlying provider gets written to</para></listitem>

        <listitem><para><function>.start</function> : handles I/O</para></listitem>
      </itemizedlist>

      <para>These functions are called from the <function>g_down</function>
        kernel thread, and there can be no sleeping in this context
        (see the definition of sleeping elsewhere). This limits what can be done
        quite a bit, but forces the handling to be fast.</para>

      <para>Of these, the most important function for doing actual
        useful work is the <function>.start</function>() function,
        which is called when a BIO request arrives for a provider
        managed by an instance of a geom class.</para>
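
      <para>The sketch below shows how the event functions might be
        attached to a newly created geom.  All
        <function>g_example_*</function>() names and the
        <literal>.example</literal> name suffix are hypothetical, and
        the <function>.start</function>() function is only a
        placeholder that rejects every request:</para>

      <programlisting>#include &lt;sys/param.h&gt;
#include &lt;sys/systm.h&gt;
#include &lt;sys/kernel.h&gt;
#include &lt;sys/bio.h&gt;
#include &lt;sys/errno.h&gt;
#include &lt;geom/geom.h&gt;

/* .start: placeholder; a real class would transform and forward bp. */
static void
g_example_start(struct bio *bp)
{

	g_io_deliver(bp, EOPNOTSUPP);
}

/* .access: pass the requested access counts down to our consumer. */
static int
g_example_access(struct g_provider *pp, int dr, int dw, int de)
{

	return (g_access(LIST_FIRST(&amp;pp-&gt;geom-&gt;consumer), dr, dw, de));
}

/* .orphan: the underlying provider went away, so dismantle the geom. */
static void
g_example_orphan(struct g_consumer *cp)
{

	g_topology_assert();
	g_wither_geom(cp-&gt;geom, ENXIO);
}

/* Create a geom on top of pp; called with the topology lock held. */
static struct g_geom *
g_example_create(struct g_class *mp, struct g_provider *pp)
{
	struct g_geom *gp;
	struct g_consumer *cp;
	struct g_provider *newpp;

	g_topology_assert();
	gp = g_new_geomf(mp, "%s.example", pp-&gt;name);
	gp-&gt;start = g_example_start;
	gp-&gt;access = g_example_access;
	gp-&gt;orphan = g_example_orphan;

	cp = g_new_consumer(gp);
	if (g_attach(cp, pp) != 0) {
		g_destroy_consumer(cp);
		g_destroy_geom(gp);
		return (NULL);
	}

	/* Publish a provider with the same size as the underlying one. */
	newpp = g_new_providerf(gp, "%s", gp-&gt;name);
	newpp-&gt;mediasize = pp-&gt;mediasize;
	newpp-&gt;sectorsize = pp-&gt;sectorsize;
	g_error_provider(newpp, 0);
	return (gp);
}</programlisting>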
    </sect2>

    <sect2 id="geom-threads">
      <title>Geom threads</title>

      <para>There are three kernel threads created and run by the GEOM
      framework:</para>

      <itemizedlist>
        <listitem><para><literal>g_down</literal> : Handles requests coming
          from high-level entities (such as a userland request) on the
          way to physical devices</para></listitem>

        <listitem><para><literal>g_up</literal> : Handles responses from
          device drivers to requests made by higher-level
          entities</para></listitem>

        <listitem><para><literal>g_event</literal> : Handles all other
          cases: creation of geom instances, access counting, <quote>spoil</quote>
          events, etc.</para></listitem>
      </itemizedlist>

      <para>When a user process issues <quote>read data X at offset Y
        of a file</quote> request, this is what happens:</para>

      <itemizedlist>

        <listitem><para>The filesystem converts the request into a struct bio
          instance and passes it to the GEOM subsystem. It knows what geom
          instance should handle it because filesystems are hosted
          directly on a geom instance.</para></listitem>

        <listitem><para>The request ends up as a call to the
          <function>.start</function>() function made on the g_down
          thread and reaches the top-level geom instance.</para></listitem>

        <listitem><para>This top-level geom instance (for example the
          partition slicer) determines that the request should be
          routed to a lower-level instance (for example the disk
          driver). It makes a copy of the bio request (bio requests
          <emphasis>ALWAYS</emphasis> need to be copied between
          instances, with <function>g_clone_bio</function>()!),
          modifies the data offset and target provider fields and
          executes the copy with
          <function>g_io_request</function>().</para></listitem>

        <listitem><para>The disk driver gets the bio request also as a call
          to <function>.start</function>() on the
          <literal>g_down</literal> thread. It talks to hardware,
          gets the data back, and calls
          <function>g_io_deliver</function>() on the bio.</para></listitem>

        <listitem><para>Now, the notification of bio completion
          <quote>bubbles up</quote> in the <literal>g_up</literal>
          thread. First the partition slicer gets
          <function>.done</function>() called in the
          <literal>g_up</literal> thread. It uses information stored
          in the bio to free the cloned <structname>bio</structname>
          structure (with <function>g_destroy_bio</function>()) and
          calls <function>g_io_deliver</function>() on the original
          request.</para></listitem>

        <listitem><para>The filesystem gets the data and transfers it to
          userland.</para></listitem>
      </itemizedlist>

      <para>See the &man.g.bio.9; man page for information on how the data is
        passed back and forth in the <structname>bio</structname>
        structure (note in particular the <varname>bio_parent</varname>
        and <varname>bio_children</varname> fields and how they are
        handled).</para>
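
      <para>The request path described above can be sketched as a pair
        of functions for a simple class that forwards reads and writes
        to a single lower provider at a fixed offset.  The softc layout
        and the <function>g_example_*</function>() names are
        hypothetical; the geom is assumed to have exactly one attached
        consumer:</para>

      <programlisting>#include &lt;sys/param.h&gt;
#include &lt;sys/systm.h&gt;
#include &lt;sys/kernel.h&gt;
#include &lt;sys/bio.h&gt;
#include &lt;sys/errno.h&gt;
#include &lt;geom/geom.h&gt;

/* Hypothetical private data: a fixed offset into the lower provider. */
struct g_example_softc {
	off_t	sc_offset;
};

/* Runs in g_up when the lower layer has finished the cloned request. */
static void
g_example_done(struct bio *cbp)
{
	struct bio *pbp;

	pbp = cbp-&gt;bio_parent;
	if (pbp-&gt;bio_error == 0)
		pbp-&gt;bio_error = cbp-&gt;bio_error;
	pbp-&gt;bio_completed += cbp-&gt;bio_completed;
	g_destroy_bio(cbp);
	pbp-&gt;bio_inbed++;
	/* Deliver the original only when every clone has come back. */
	if (pbp-&gt;bio_children == pbp-&gt;bio_inbed)
		g_io_deliver(pbp, pbp-&gt;bio_error);
}

/* Runs in g_down for every request sent to our provider. */
static void
g_example_start(struct bio *bp)
{
	struct g_example_softc *sc;
	struct g_geom *gp;
	struct bio *cbp;

	gp = bp-&gt;bio_to-&gt;geom;
	sc = gp-&gt;softc;
	if (bp-&gt;bio_cmd != BIO_READ &amp;&amp; bp-&gt;bio_cmd != BIO_WRITE) {
		g_io_deliver(bp, EOPNOTSUPP);
		return;
	}
	cbp = g_clone_bio(bp);		/* does not sleep, but may fail */
	if (cbp == NULL) {
		g_io_deliver(bp, ENOMEM);
		return;
	}
	cbp-&gt;bio_done = g_example_done;
	cbp-&gt;bio_offset = bp-&gt;bio_offset + sc-&gt;sc_offset;
	g_io_request(cbp, LIST_FIRST(&amp;gp-&gt;consumer));
}</programlisting>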

      <para>One important feature is: <emphasis>THERE CAN BE NO SLEEPING IN G_UP
        AND G_DOWN THREADS</emphasis>. This means that none of the following
        things can be done in those threads (the list is of course not
        complete, but only informative):</para>

      <itemizedlist>
        <listitem><para>Calls to <function>msleep</function>() and
          <function>tsleep</function>(), obviously.</para></listitem>

        <listitem><para>Calls to <function>g_write_data</function>() and
          <function>g_read_data</function>(), because these sleep
          between passing the data to consumers and
          returning.</para></listitem>

        <listitem><para>Waiting for I/O.</para></listitem>

        <listitem><para>Calls to &man.malloc.9; and
          <function>uma_zalloc</function>() with the
          <varname>M_WAITOK</varname> flag set.</para></listitem>

        <listitem><para>sx and other sleepable locks.</para></listitem>
      </itemizedlist>

      <para>This restriction is here to stop GEOM code from clogging the I/O
        request path, since sleeping is usually not
        time-bound and there can be no guarantees on how long it will
        take (there are also some other, more technical reasons). It
        also means that there is not much that can be done in those
        threads; for example, almost any complex thing requires memory
        allocation. Fortunately, there is a way out: creating
        additional kernel threads.</para>
    </sect2>

    <sect2 id="geom-kernelthreads">
      <title>Kernel threads for use in geom code</title>

      <para>Kernel threads are created with the &man.kthread.create.9;
        function, and they are similar to userland threads in
        behaviour, except that they cannot return to the caller to signify
        termination, but must call &man.kthread.exit.9;.</para>

      <para>In GEOM code, the usual use of threads is to offload
        processing of requests from the <literal>g_down</literal> thread
        (the <function>.start</function>() function). These threads
        look like <quote>event handlers</quote>: they have a linked
        list of events associated with them (events can be posted to it
        by various functions in various threads, so it must be
        protected by a mutex), take the events from the list one by
        one, and process them in a big <literal>switch</literal>()
        statement.</para>
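
      <para>A minimal sketch of such an event-handler loop is shown
        below.  It uses a bio queue as the event list; the softc
        layout, the function names and the wait message are
        hypothetical.  Creating the thread, initializing the mutex and
        the queue, and terminating the thread (see
        &man.kthread.create.9; and &man.kthread.exit.9;) are left out
        of the sketch:</para>

      <programlisting>#include &lt;sys/param.h&gt;
#include &lt;sys/systm.h&gt;
#include &lt;sys/kernel.h&gt;
#include &lt;sys/lock.h&gt;
#include &lt;sys/mutex.h&gt;
#include &lt;sys/bio.h&gt;
#include &lt;sys/errno.h&gt;
#include &lt;geom/geom.h&gt;

/* Hypothetical private data: a queue of requests and its mutex. */
struct g_example_softc {
	struct mtx		sc_queue_mtx;
	struct bio_queue_head	sc_queue;
};

/* Called from .start() in g_down: must not sleep, so only queue bp. */
static void
g_example_queue(struct g_example_softc *sc, struct bio *bp)
{

	mtx_lock(&amp;sc-&gt;sc_queue_mtx);
	bioq_insert_tail(&amp;sc-&gt;sc_queue, bp);
	mtx_unlock(&amp;sc-&gt;sc_queue_mtx);
	wakeup(sc);
}

/* Body of the worker thread; arg is the softc. */
static void
g_example_worker(void *arg)
{
	struct g_example_softc *sc;
	struct bio *bp;

	sc = arg;
	for (;;) {
		/* Sleep until a request is queued; msleep() drops the mutex. */
		mtx_lock(&amp;sc-&gt;sc_queue_mtx);
		while ((bp = bioq_first(&amp;sc-&gt;sc_queue)) == NULL)
			msleep(sc, &amp;sc-&gt;sc_queue_mtx, PRIBIO, "gexmpl", 0);
		bioq_remove(&amp;sc-&gt;sc_queue, bp);
		mtx_unlock(&amp;sc-&gt;sc_queue_mtx);

		switch (bp-&gt;bio_cmd) {
		case BIO_READ:
		case BIO_WRITE:
			/*
			 * Sleeping is allowed here.  A real class would do
			 * its work and forward or complete bp; this sketch
			 * has nothing to do, so it rejects the request.
			 */
			/* FALLTHROUGH */
		default:
			g_io_deliver(bp, EOPNOTSUPP);
			break;
		}
	}
}</programlisting>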

      <para>The main benefit of using a thread to handle I/O requests
        is that it can sleep when needed. This sounds good, but
        should be carefully thought out. Sleeping is
        convenient but can very effectively destroy the performance of the
        geom transformation. Extremely performance-sensitive classes
        probably should do all the work in the
        <function>.start</function>() function call, taking great care
        to handle out-of-memory and similar errors.</para>

      <para>The other benefit of having an event-handler thread like
        that is to serialize all the requests and responses coming
        from different geom threads into one thread. This is also very
        convenient but can be slow. In most cases, handling of
        <function>.done</function>() requests can be left to the
        <literal>g_up</literal> thread.</para>

      <para>Mutexes in the FreeBSD kernel (see &man.mutex.9;) have
        one distinction from their more common userland cousins: the
        code cannot sleep while holding
        a mutex. If the code needs to sleep a lot, &man.sx.9; locks
        may be more appropriate.  On the other hand, if you do almost
        everything in a single thread, you may get away with no
        mutexes at all.</para>

    </sect2>

  </sect1>

</article>