<!--
	The Vinum Volume Manager
	By Greg Lehey (grog at lemis dot com)

	Added to the Handbook by Hiten Pandya <hiten@uk.FreeBSD.org>
	and Tom Rhodes <trhodes@FreeBSD.org>

	For the FreeBSD Documentation Project
	$FreeBSD$
-->

<chapter id="vinum-vinum">
  <title>The Vinum Volume Manager</title>
  
  <sect1 id="vinum-synopsis">
    <title>Synopsis</title>

    <para>No matter what disks you have, there will always be limitations:</para>
      <itemizedlist>
	<listitem>
	  <para>They can be too small.</para>
	</listitem>

	<listitem>
	  <para>They can be too slow.</para>
	</listitem>

	<listitem>
	  <para>They can be too unreliable.</para>
	</listitem>
      </itemizedlist>
  </sect1>

  <sect1 id="vinum-intro">
    <sect1info>
      <authorgroup>
	<author>
	  <firstname>Greg</firstname>
	  <surname>Lehey</surname>
	  <contrib>Originally written by </contrib>
	</author>
      </authorgroup>
    </sect1info>

    <title>Disks are too small</title>

    <indexterm><primary>Vinum</primary></indexterm>
    <indexterm><primary>Volume</primary>
    <secondary>Manager</secondary></indexterm>
    
    <para><emphasis>Vinum</emphasis> is a
      <emphasis>Volume Manager</emphasis>, a virtual disk driver that
      addresses these three problems.  Various solutions to these problems
      have been proposed and implemented; let us look at each in more
      detail.</para>

    <para>Disks are getting bigger, but so are data storage requirements.
      Often you will find you want a file system that is bigger than the disks
      you have available.  Admittedly, this problem is not as acute as it was
      ten years ago, but it still exists.  Some systems have solved this by
      creating an abstract device which stores its data on a number of disks.</para>
  </sect1>

  <sect1 id="vinum-access-bottlenecks">
    <title>Access bottlenecks</title>
    <para>Modern systems frequently need to access data in a highly
      concurrent manner.  For example, large FTP or HTTP servers can maintain
      thousands of concurrent sessions and have multiple 100&nbsp;Mbit/s connections
      to the outside world, well beyond the sustained transfer rate of most
      disks.</para>

    <para>Current disk drives can transfer data sequentially at up to
      70&nbsp;MB/s, but this value is of little importance in an environment
      where many independent processes access a drive; each of them may then
      achieve only a fraction of that rate.  In such cases it is more
      interesting to view the problem from the viewpoint of the disk
      subsystem: the important parameter is the load that a transfer places
      on the subsystem, in other words the time for which a transfer occupies
      the drives involved in the transfer.</para>

    <para>In any disk transfer, the drive must first position the heads, wait
      for the first sector to pass under the read head, and then perform the
      transfer.  These actions can be considered to be atomic: it does not make
      any sense to interrupt them.</para>

    <para><anchor id="vinum-latency">
      Consider a typical transfer of about 10&nbsp;kB: the current generation of
      high-performance disks can position the heads in an average of 3.5&nbsp;ms.  The
      fastest drives spin at 15,000&nbsp;rpm, so the average rotational latency
      (half a revolution) is 2&nbsp;ms.  At 70&nbsp;MB/s, the transfer itself takes about
      150&nbsp;&mu;s, almost nothing compared to the positioning time.  In such a
      case, the effective transfer rate drops to well under 2&nbsp;MB/s and is
      clearly highly dependent on the transfer size.</para>
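
    <para>Putting these example figures together gives a rough estimate of
      the cost of a small transfer (queueing delays and controller overhead
      are ignored):</para>

    <programlisting>
positioning:    3.5 ms (seek) + 2 ms (rotational latency)  = 5.5 ms
transfer:       10 kB at 70 MB/s                           = about 0.15 ms
total:          about 5.65 ms for a 10 kB transfer
effective rate: 10 kB / 5.65 ms                            = about 1.8 MB/s</programlisting>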

    <para>The traditional and obvious solution to this bottleneck is
      <quote>more spindles</quote>: rather than using one large disk, use
      several smaller disks with the same aggregate storage space.  Each disk is
      capable of positioning and transferring independently, so the effective
      throughput increases by a factor close to the number of disks used.
    </para>

    <para>The exact throughput improvement is, of course, smaller than the
      number of disks involved: although each drive is capable of transferring
      in parallel, there is no way to ensure that the requests are evenly
      distributed across the drives.  Inevitably the load on one drive will be
      higher than on another.</para>

    <indexterm>
      <primary>concatenation</primary>
      <secondary>Vinum</secondary>
    </indexterm>
    <indexterm>
      <primary>Vinum</primary>
      <secondary>concatenation</secondary>
    </indexterm>

    <para>The evenness of the load on the disks is strongly dependent on
      the way the data is shared across the drives.  In the following
      discussion, it is convenient to think of the disk storage as a large
      number of data sectors which are addressable by number, rather like the
      pages in a book.  The most obvious method is to divide the virtual disk
      into groups of consecutive sectors the size of the individual physical
      disks and store them in this manner, rather like taking a large book and
      tearing it into smaller sections.  This method is called
      <emphasis>concatenation</emphasis> and has the advantage that the disks
      are not required to have any specific size relationships.  It works
      well when the access to the virtual disk is spread evenly about its
      address space.  When access is concentrated on a smaller area, the
      improvement is less marked.  <xref linkend="vinum-concat"> illustrates
      the sequence in which storage units are allocated in a concatenated
      organization.</para>

    <para>
      <figure id="vinum-concat">
	<title>Concatenated organization</title>
	<graphic fileref="vinum/vinum-concat">
      </figure>
    </para>

    <indexterm>
      <primary>striping</primary>
      <secondary>Vinum</secondary>
    </indexterm>
    <indexterm>
      <primary>Vinum</primary>
      <secondary>striping</secondary>
    </indexterm>

    <para>An alternative mapping is to divide the address space into smaller,
      equal-sized components and store them sequentially on different devices.
      For example, the first 256 sectors may be stored on the first disk, the
      next 256 sectors on the next disk and so on.  After filling the last
      disk, the process repeats until the disks are full.  This mapping is called
      <emphasis>striping</emphasis> or <acronym>RAID-0</acronym>

    <footnote>
      <indexterm>
	<primary>RAID</primary>
      </indexterm>
      <indexterm>
	<primary>Redundant</primary>
	<secondary>Array of Inexpensive Disks</secondary>
      </indexterm>
    
      <para><acronym>RAID</acronym> stands for <emphasis>Redundant Array of
      Inexpensive Disks</emphasis> and offers various forms of fault tolerance,
      though the name is misleading in the case of <acronym>RAID-0</acronym>:
      it provides no redundancy at all.</para>
    </footnote>.

    Striping requires somewhat more effort to locate the data, and it can cause
    additional I/O load where a transfer is spread over multiple disks, but it
    can also provide a more constant load across the disks.
    <xref linkend="vinum-striped"> illustrates the sequence in which storage
    units are allocated in a striped organization.</para>

    <para>
      <figure id="vinum-striped">
        <title>Striped organization</title>
	<graphic fileref="vinum/vinum-striped">
      </figure>
    </para>
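
    <para>To make the mapping concrete, consider a hypothetical striped
      organization with a stripe size of 256 sectors on four disks, and ask
      where virtual sector 1000 is stored (the numbers are purely
      illustrative):</para>

    <programlisting>
stripe number  = 1000 / 256          = 3, remainder 232
disk           = 3 modulo 4          = disk 3 (counting from 0)
offset on disk = (3 / 4) * 256 + 232 = sector 232 of that disk</programlisting>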
  </sect1>

  <sect1 id="vinum-data-integrity">
    <title>Data integrity</title>
      <para>The final problem with current disks is that they are unreliable.
	Although disk drive reliability has increased tremendously over the last
	few years, they are still the most likely core component of a server to
	fail.  When they do, the results can be catastrophic: replacing a failed
	disk drive and restoring data to it can take days.</para>

      <indexterm>
	<primary>mirroring</primary>
	<secondary>Vinum</secondary>
      </indexterm>
      <indexterm>
	<primary>Vinum</primary>
	<secondary>mirroring</secondary>
      </indexterm>
      <indexterm>
	<primary>RAID</primary>
	<secondary>level 1</secondary>
      </indexterm>
      <indexterm>
	<primary>RAID-1</primary>
      </indexterm>
      
      <para>The traditional way to approach this problem has been
	<emphasis>mirroring</emphasis>, keeping two copies of the data
	on different physical hardware.  Since the advent of the
	<acronym>RAID</acronym> levels, this technique has also been called
	<acronym>RAID level 1</acronym> or <acronym>RAID-1</acronym>.  Any
	write to the volume writes to both locations; a read can be satisfied from
	either, so if one drive fails, the data is still available on the other
	drive.</para>
    
      <para>Mirroring has two problems:</para>
    
	<itemizedlist>
	  <listitem>
	    <para>The price.  It requires twice as much disk storage as
	      a non-redundant solution.</para>
	  </listitem>

	  <listitem>
	    <para>The performance impact.  Writes must be performed to
	      both drives, so they take up twice the bandwidth of a non-mirrored
	      volume.  Reads do not suffer a performance penalty; they can even
	      be faster, since each read can be satisfied from either drive.</para>
	  </listitem>
	</itemizedlist>

      <para><indexterm><primary>RAID-5</primary></indexterm>An alternative
	solution is <emphasis>parity</emphasis>, implemented in the
	<acronym>RAID</acronym> levels 2, 3, 4 and 5.  Of these,
	<acronym>RAID-5</acronym> is the most interesting.  As implemented in
	Vinum, a <acronym>RAID-5</acronym> plex is a variant of a striped
	plex in which each stripe includes a parity block containing the
	parity of the other blocks in the stripe.  As required by
	<acronym>RAID-5</acronym>, the location of this parity block changes from one
	stripe to the next.  <xref linkend="vinum-raid5-org"> illustrates this
	organization; the numbers in the data blocks indicate the relative
	block numbers.</para>

      <para>
	<figure id="vinum-raid5-org">
	  <title>RAID-5 organization</title>
	  <graphic fileref="vinum/vinum-raid5-org">
	</figure>
      </para>

      <para>Compared to mirroring, <acronym>RAID-5</acronym> has the advantage of requiring
	significantly less storage space.  Read access is similar to that of
	striped organizations, but write access is significantly slower,
	approximately 25% of the read performance.  If one drive fails, the array
	can continue to operate in degraded mode: a read from one of the remaining
	accessible drives continues normally, but a read from the failed drive is
	recalculated from the corresponding block from all the remaining drives.
      </para>
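
      <para>The parity block of each stripe is conventionally the bitwise
	exclusive OR of the data blocks in the stripe, which is what makes the
	reconstruction possible.  As a sketch (the three-block stripe is
	illustrative only):</para>

      <programlisting>
parity = block0 XOR block1 XOR block2

(if the drive holding block1 fails, its contents can be recomputed)
block1 = parity XOR block0 XOR block2</programlisting>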
  </sect1>

  <sect1 id="vinum-objects">
    <title>Vinum objects</title>
      <para>In order to address these problems, Vinum implements a four-level
	hierarchy of objects:</para>

      <itemizedlist>
	<listitem>
	  <para>The most visible object is the virtual disk, called a
	    <emphasis>volume</emphasis>.  Volumes have essentially the same
	    properties as a UNIX&trade; disk drive, though there are some minor
	    differences.  For one, they have no size limitations.</para>
	</listitem>

	<listitem>
	  <para>Volumes are composed of <emphasis>plexes</emphasis>, each of which
	    represents the total address space of a volume.  This level in the
	    hierarchy thus provides redundancy.  Think of plexes as individual
	    disks in a mirrored array, each containing the same data.</para>
	</listitem>

	<listitem>
	  <para>Since Vinum exists within the UNIX&trade; disk storage framework,
	    it would be possible to use UNIX&trade; partitions as the building
	    block for multi-disk plexes, but in fact this turns out to be too
	    inflexible: UNIX&trade; disks can have only a limited number of partitions.
	    Instead, Vinum subdivides a single UNIX&trade; partition (the
	    <emphasis>drive</emphasis>) into contiguous areas called
	    <emphasis>subdisks</emphasis>, which it uses as building blocks for plexes.</para>
	</listitem>
      
	<listitem>
	  <para>Subdisks reside on Vinum <emphasis>drives</emphasis>,
	    currently UNIX&trade; partitions.  Vinum drives can contain any number of
	    subdisks.  With the exception of a small area at the beginning of the
	    drive, which is used for storing configuration and state information,
	    the entire drive is available for data storage.</para>
	</listitem>
      </itemizedlist>

      <para>The following sections describe the way these objects provide the
	functionality required of Vinum.</para>

    <sect2>
      <title>Volume size considerations</title>

      <para>Plexes can include multiple subdisks spread over all drives in the
	Vinum configuration.  As a result, the size of an individual drive does
	not limit the size of a plex, and thus of a volume.</para>
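
      <para>For example, a volume larger than any single drive could be built
	by concatenating subdisks from several drives.  A hypothetical
	fragment (drive names in the style of the examples below):</para>

      <programlisting>
volume bigvol
  plex org concat
    sd length 2048m drive a
    sd length 2048m drive b</programlisting>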
    </sect2>
    
    <sect2>
      <title>Redundant data storage</title>
      <para>Vinum implements mirroring by attaching multiple plexes to a
	volume.  Each plex is a representation of the data in a volume.  A
	volume may contain between one and eight plexes.</para>

      <para>Although a plex represents the complete data of a volume, it is
	possible for parts of the representation to be physically missing,
	either by design (by not defining a subdisk for parts of the plex) or by
	accident (as a result of the failure of a drive).  As long as at least
	one plex can provide the data for the complete address range of the
	volume, the volume is fully functional.</para>
    </sect2>
    
    <sect2>
      <title>Performance issues</title>
      <para>Vinum implements both concatenation and striping at the plex
	level:</para>

      <itemizedlist>
	<listitem>
	  <para>A <emphasis>concatenated plex</emphasis> uses the
	    address space of each subdisk in turn.</para>
	</listitem>

	<listitem>
	  <para>A <emphasis>striped plex</emphasis> stripes the data
	  across each subdisk.  The subdisks must all have the same size, and
	  there must be at least two subdisks in order to distinguish it from a
	  concatenated plex.</para>
	</listitem>
      </itemizedlist>
    </sect2>

    <sect2>
      <title>Which plex organization?</title>
      <para>The version of Vinum supplied with FreeBSD &rel.current; implements
	two kinds of plex:</para>
    
      <itemizedlist>
	<listitem>
	  <para>Concatenated plexes are the most flexible: they can
	    contain any number of subdisks, and the subdisks may be of different
	    lengths.  The plex may be extended by adding additional subdisks.  They
	    require slightly less <acronym>CPU</acronym> time than striped plexes,
	    though the difference in overhead is not measurable in practice.
	    On the other hand, they are most susceptible to hot spots, where one
	    disk is very active and others are idle.</para>
        </listitem>

	<listitem>
	  <para>The greatest advantage of striped (<acronym>RAID-0</acronym>)
	    plexes is that they reduce hot spots: by choosing an optimum-sized
	    stripe (about 256&nbsp;kB), you can even out the load on the
	    component drives.  The disadvantages of this approach are
	    (fractionally) more complex code and restrictions on subdisks:
	    they must all be the same size, and extending a plex by adding
	    new subdisks is so complicated that Vinum currently does not
	    implement it.  Vinum imposes an additional, trivial restriction:
	    a striped plex must have at least two subdisks, since otherwise
	    it is indistinguishable from a concatenated plex.</para>
	</listitem>
      </itemizedlist>
    
      <para><xref linkend="vinum-comparison"> summarizes the advantages
	and disadvantages of each plex organization.</para>
    
      <table id="vinum-comparison">
	<title>Vinum Plex organizations</title>
	<tgroup cols="5">
	  <thead>
	    <row>
	      <entry>Plex type</entry>
	  	<entry>Minimum subdisks</entry>
	  	<entry>Can add subdisks</entry>
	  	<entry>Must be equal size</entry>
	  	<entry>Application</entry>
	    </row>
	  </thead>

	  <tbody>
	    <row>
	      <entry>concatenated</entry>
	      <entry>1</entry>
	      <entry>yes</entry>
	      <entry>no</entry>
	      <entry>Large data storage with maximum placement flexibility
	        and moderate performance</entry>
	    </row>
	    
	    <row>
	      <entry>striped</entry>
	      <entry>2</entry>
	      <entry>no</entry>
	      <entry>yes</entry>
	      <entry>High performance in combination with highly concurrent
		access</entry>
	    </row>
	  </tbody>
	</tgroup>
      </table>
    </sect2>
  </sect1>
  
  <sect1 id="vinum-examples">
    <title>Some examples</title>
    <para>Vinum maintains a <emphasis>configuration database</emphasis>
      which describes the objects known to an individual system.  Initially, the
      user creates the configuration database from one or more configuration files
      with the aid of the &man.vinum.8; utility program.  Vinum stores a copy of
      its configuration database on each disk slice (which Vinum calls a
      <emphasis>device</emphasis>) under its control.  This database is updated on
      each state change, so that a restart accurately restores the state of each
      Vinum object.</para>
  
    <sect2>
      <title>The configuration file</title>
      <para>The configuration file describes individual Vinum objects.  The
	definition of a simple volume might be:</para>

      <programlisting>
drive a device /dev/da3h
volume myvol
  plex org concat
    sd length 512m drive a</programlisting>

      <para>This file describes four Vinum objects:</para>

      <itemizedlist>
	<listitem>
	  <para>The <emphasis>drive</emphasis> line describes a disk
	    partition (<emphasis>drive</emphasis>) and its location relative to the
	    underlying hardware.  It is given the symbolic name
	    <emphasis>a</emphasis>.  This separation of the symbolic names from the
	    device names allows disks to be moved from one location to another
	    without confusion.</para>
	</listitem>

	<listitem>
	  <para>The <emphasis>volume</emphasis> line describes a volume.
	    The only required attribute is the name, in this case
	    <emphasis>myvol</emphasis>.</para>
	</listitem>

	<listitem>
	  <para>The <emphasis>plex</emphasis> line defines a plex.  The
	    only required parameter is the organization, in this case
	    <emphasis>concat</emphasis>.  No name is necessary: the system
	    automatically generates a name from the volume name by adding the suffix
	    <emphasis>.p</emphasis><emphasis>x</emphasis>, where
	    <emphasis>x</emphasis> is the number of the plex in the volume.  Thus
	    this plex will be called <emphasis>myvol.p0</emphasis>.</para>
	</listitem>

	<listitem>
	  <para>The <emphasis>sd</emphasis> line describes a subdisk.
	    The minimum specifications are the name of a drive on which to store it,
	    and the length of the subdisk.  As with plexes, no name is necessary:
	    the system automatically assigns names derived from the plex name by
	    adding the suffix <emphasis>.s</emphasis><emphasis>x</emphasis>, where
	    <emphasis>x</emphasis> is the number of the subdisk in the plex.  Thus
	    Vinum gives this subdisk the name <emphasis>myvol.p0.s0</emphasis>.</para>
	</listitem>
      </itemizedlist>

      <para>After processing this file, &man.vinum.8; produces the following
	output:</para>

      <programlisting>
&prompt.root; vinum -&gt; <command>create config1</command>
Configuration summary
Drives:         1 (4 configured)
Volumes:        1 (4 configured)
Plexes:         1 (8 configured)
Subdisks:       1 (16 configured)

D a                     State: up       Device /dev/da3h        Avail: 2061/2573 MB (80%)

V myvol                 State: up       Plexes:       1 Size:        512 MB

P myvol.p0            C State: up       Subdisks:     1 Size:        512 MB

S myvol.p0.s0           State: up       PO:        0  B Size:        512 MB</programlisting>

      <para>This output shows the brief listing format of &man.vinum.8;.  It
	is represented graphically in <xref linkend="vinum-simple-vol">.</para>

      <para>
	<figure id="vinum-simple-vol">
	  <title>A simple Vinum volume</title>
	  <graphic fileref="vinum/vinum-simple-vol">
	</figure>
      </para>

      <para>This figure, and the ones which follow, represent a volume, which
	contains the plexes, which in turn contain the subdisks.  In this trivial
	example, the volume contains one plex, and the plex contains one subdisk.</para>

      <para>This particular volume has no specific advantage over a conventional
	disk partition.  It contains a single plex, so it is not redundant.  The
	plex contains a single subdisk, so there is no difference in storage
	allocation from a conventional disk partition.  The following sections
	illustrate various more interesting configuration methods.</para>
    </sect2>

    <sect2>
      <title>Increased resilience: mirroring</title>
      <para>The resilience of a volume can be increased by mirroring.  When
	laying out a mirrored volume, it is important to ensure that the subdisks
	of each plex are on different drives, so that a drive failure will not
	take down both plexes.  The following configuration mirrors a volume:</para>

      <programlisting>
drive b device /dev/da4h
volume mirror
  plex org concat
    sd length 512m drive a
  plex org concat
    sd length 512m drive b</programlisting>

      <para>In this example, it was not necessary to specify a definition of
	drive <emphasis>a</emphasis> again, since Vinum keeps track of all
	objects in its configuration database.  After processing this
	definition, the configuration looks like:</para>


      <programlisting>
Drives:         2 (4 configured)
Volumes:        2 (4 configured)
Plexes:         3 (8 configured)
Subdisks:       3 (16 configured)

D a                     State: up       Device /dev/da3h        Avail: 1549/2573 MB (60%)
D b                     State: up       Device /dev/da4h        Avail: 2061/2573 MB (80%)

V myvol                 State: up       Plexes:       1 Size:        512 MB
V mirror                State: up       Plexes:       2 Size:        512 MB

P myvol.p0            C State: up       Subdisks:     1 Size:        512 MB
P mirror.p0           C State: up       Subdisks:     1 Size:        512 MB
P mirror.p1           C State: initializing     Subdisks:     1 Size:        512 MB

S myvol.p0.s0           State: up       PO:        0  B Size:        512 MB
S mirror.p0.s0          State: up       PO:        0  B Size:        512 MB
S mirror.p1.s0          State: empty    PO:        0  B Size:        512 MB</programlisting>
	
      <para><xref linkend="vinum-mirrored-vol"> shows the structure
	graphically.</para>

      <para>
	<figure id="vinum-mirrored-vol">
	  <title>A mirrored Vinum volume</title>
	  <graphic fileref="vinum/vinum-mirrored-vol">
	</figure>
      </para>

      <para>In this example, each plex contains the full 512&nbsp;MB of address
	space.  As in the previous example, each plex contains only a single
	subdisk.</para>
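
      <para>Note that the second plex is still initializing: it does not yet
	contain a copy of the data.  One way to bring such a plex up to date
	is to start it explicitly, which causes Vinum to copy the data from
	the up-to-date plex (a sketch; see &man.vinum.8; for the details and
	options):</para>

      <screen>&prompt.root; <userinput>vinum start mirror.p1</userinput></screen>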
    </sect2>

    <sect2>
      <title>Optimizing performance</title>
      <para>The mirrored volume in the previous example is more resistant to
	failure than an unmirrored volume, but its performance is lower: each write
	to the volume requires a write to both drives, using up a greater
	proportion of the total disk bandwidth.  Performance considerations demand
	a different approach: instead of mirroring, the data is striped across as
	many disk drives as possible.  The following configuration shows a volume
	with a plex striped across four disk drives:</para>

	<programlisting>
drive c device /dev/da5h
drive d device /dev/da6h
volume striped
  plex org striped 512k
    sd length 128m drive a
    sd length 128m drive b
    sd length 128m drive c
    sd length 128m drive d</programlisting>

      <para>As before, it is not necessary to define the drives which are
	already known to Vinum.  After processing this definition, the
	configuration looks like:</para>

      <programlisting>
Drives:         4 (4 configured)
Volumes:        3 (4 configured)
Plexes:         4 (8 configured)
Subdisks:       7 (16 configured)

D a                     State: up       Device /dev/da3h        Avail: 1421/2573 MB (55%)
D b                     State: up       Device /dev/da4h        Avail: 1933/2573 MB (75%)
D c                     State: up       Device /dev/da5h        Avail: 2445/2573 MB (95%)
D d                     State: up       Device /dev/da6h        Avail: 2445/2573 MB (95%)

V myvol                 State: up       Plexes:       1 Size:        512 MB
V mirror                State: up       Plexes:       2 Size:        512 MB
V striped               State: up       Plexes:       1 Size:        512 MB

P myvol.p0            C State: up       Subdisks:     1 Size:        512 MB
P mirror.p0           C State: up       Subdisks:     1 Size:        512 MB
P mirror.p1           C State: initializing     Subdisks:     1 Size:        512 MB
P striped.p0          S State: up       Subdisks:     4 Size:        512 MB

S myvol.p0.s0           State: up       PO:        0  B Size:        512 MB
S mirror.p0.s0          State: up       PO:        0  B Size:        512 MB
S mirror.p1.s0          State: empty    PO:        0  B Size:        512 MB
S striped.p0.s0         State: up       PO:        0  B Size:        128 MB
S striped.p0.s1         State: up       PO:      512 kB Size:        128 MB
S striped.p0.s2         State: up       PO:     1024 kB Size:        128 MB
S striped.p0.s3         State: up       PO:     1536 kB Size:        128 MB</programlisting>

      <para>
	<figure id="vinum-striped-vol">
	  <title>A striped Vinum volume</title>
	  <graphic fileref="vinum/vinum-striped-vol">
	</figure>
      </para>

      <para>This volume is represented in
	<xref linkend="vinum-striped-vol">.  The darkness of the stripes
	indicates the position within the plex address space: the lightest stripes
	come first, the darkest last.</para>
    </sect2>
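
    <sect2>
      <title>Resilience: RAID-5</title>
      <para>The device nodes listed in <xref linkend="vinum-object-naming">
	assume a volume called <emphasis>raid5</emphasis>.  Such a volume
	could be defined along the same lines as the previous examples.  The
	following is a hypothetical fragment: the <emphasis>raid5</emphasis>
	organization keyword matches the one shown in the on-disk
	configuration in <xref linkend="vinum-config">, and the device name
	for drive <emphasis>e</emphasis> is an assumption:</para>

      <programlisting>
drive e device /dev/da7h
volume raid5
  plex org raid5 512k
    sd length 128m drive a
    sd length 128m drive b
    sd length 128m drive c
    sd length 128m drive d
    sd length 128m drive e</programlisting>
    </sect2>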

    <sect2>
      <title>Resilience and performance</title>
      <para><anchor id="vinum-resilience">With sufficient hardware, it is
	possible to build volumes which show both increased resilience and
	increased performance compared to standard UNIX&trade; partitions.  A typical
	configuration file might be:</para>

      <programlisting>
volume raid10
  plex org striped 512k
    sd length 102480k drive a
    sd length 102480k drive b
    sd length 102480k drive c
    sd length 102480k drive d
    sd length 102480k drive e
  plex org striped 512k
    sd length 102480k drive c
    sd length 102480k drive d
    sd length 102480k drive e
    sd length 102480k drive a
    sd length 102480k drive b</programlisting>

      <para>The subdisks of the second plex are offset by two drives from those
	of the first plex: this helps ensure that writes do not go to the same
	subdisks even if a transfer goes over two drives.</para>

      <para><xref linkend="vinum-raid10-vol"> represents the structure
	of this volume.</para>

      <para>
	<figure id="vinum-raid10-vol">
	  <title>A mirrored, striped Vinum volume</title>
	  <graphic fileref="vinum/vinum-raid10-vol">
        </figure>
      </para>
    </sect2>
  </sect1>
  
  <sect1 id="vinum-object-naming">
    <title>Object naming</title>
    <para>As described above, Vinum assigns default names to plexes and
      subdisks, although they may be overridden.  Overriding the default names
      is not recommended: experience with the VERITAS volume manager, which
      allows arbitrary naming of objects, has shown that this flexibility does
      not bring a significant advantage, and it can cause confusion.</para>

    <para>Names may contain any non-blank character, but it is recommended to
      restrict them to letters, digits and the underscore character.  The names
      of volumes, plexes and subdisks may be up to 64 characters long, and the
      names of drives may be up to 32 characters long.</para>
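
    <para>Objects can nevertheless be named explicitly with the
      <emphasis>name</emphasis> keyword, the same keyword that appears in the
      on-disk configuration shown in <xref linkend="vinum-config">.  A
      hypothetical fragment:</para>

    <programlisting>
volume data
  plex name data.mirror org concat
    sd name data.disk0 length 512m drive a</programlisting>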

    <para><indexterm><primary>/dev/vinum</primary></indexterm>Vinum objects
      are assigned device nodes in the hierarchy <filename>/dev/vinum</filename>.
      The configuration shown above would cause Vinum to create the following
      device nodes:</para>

    <itemizedlist>
      <listitem>
	<para>The control devices <devicename>/dev/vinum/control</devicename> and
	  <devicename>/dev/vinum/controld</devicename>, which are used by 
	  &man.vinum.8; and the Vinum daemon respectively.</para>
      </listitem>

      <listitem>
	<para>Block and character device entries for each volume.
	  These are the main devices used by Vinum.  The block device names are
	  the name of the volume, while the character device names follow the BSD 
	  tradition of prepending the letter <emphasis>r</emphasis> to the name.  
	  Thus the configuration above would include the block devices 
	  <devicename>/dev/vinum/myvol</devicename>, 
	  <devicename>/dev/vinum/mirror</devicename>,  
	  <devicename>/dev/vinum/striped</devicename>, 
	  <devicename>/dev/vinum/raid5</devicename> and 
	  <devicename>/dev/vinum/raid10</devicename>, and the character devices 
	  <devicename>/dev/vinum/rmyvol</devicename>, 
	  <devicename>/dev/vinum/rmirror</devicename>,
	  <devicename>/dev/vinum/rstriped</devicename>, 
	  <devicename>/dev/vinum/rraid5</devicename> and 
	  <devicename>/dev/vinum/rraid10</devicename>.  
	  There is obviously a problem here: it is possible to have two volumes 
	  called <emphasis>r</emphasis> and <emphasis>rr</emphasis>, but there 
	  will be a conflict creating the device node 
	  <devicename>/dev/vinum/rr</devicename>: is it a character device for 
	  volume <emphasis>r</emphasis> or a block device for volume 
	  <emphasis>rr</emphasis>?  Currently Vinum does not address this 
	  conflict: the first-defined volume will get the name.</para>
      </listitem>

      <listitem>
	<para>A directory <devicename>/dev/vinum/drive</devicename>
	  with entries for each drive.  These entries are in fact symbolic links
	  to the corresponding disk nodes.</para>
      </listitem>

      <listitem>
	<para>A directory <filename>/dev/vinum/vol</filename> with
	  entries for each volume.  It contains subdirectories for each plex,
	  which in turn contain subdirectories for their component subdisks.</para>
      </listitem>

      <listitem>
	<para>The directory <devicename>/dev/vinum/plex</devicename>, which
	  contains block device nodes for each plex, and the directories
	  <devicename>/dev/vinum/sd</devicename> and
	  <devicename>/dev/vinum/rsd</devicename>, which contain block and
	  character device nodes respectively for each subdisk.</para>
      </listitem>
    </itemizedlist>

    <para>For example, consider the following configuration file:</para>
	<programlisting>
drive drive1 device /dev/sd1h
drive drive2 device /dev/sd2h
drive drive3 device /dev/sd3h
drive drive4 device /dev/sd4h
volume s64 setupstate
  plex org striped 64k
    sd length 100m drive drive1
    sd length 100m drive drive2
    sd length 100m drive drive3
    sd length 100m drive drive4</programlisting>

    <para>After processing this file, &man.vinum.8; creates the following
      structure in <filename>/dev/vinum</filename>:</para>

    <programlisting>
brwx------  1 root  wheel   25, 0x40000001 Apr 13 16:46 Control
brwx------  1 root  wheel   25, 0x40000002 Apr 13 16:46 control
brwx------  1 root  wheel   25, 0x40000000 Apr 13 16:46 controld
drwxr-xr-x  2 root  wheel       512 Apr 13 16:46 drive
drwxr-xr-x  2 root  wheel       512 Apr 13 16:46 plex
crwxr-xr--  1 root  wheel   91,   2 Apr 13 16:46 rs64
drwxr-xr-x  2 root  wheel       512 Apr 13 16:46 rsd
drwxr-xr-x  2 root  wheel       512 Apr 13 16:46 rvol
brwxr-xr--  1 root  wheel   25,   2 Apr 13 16:46 s64
drwxr-xr-x  2 root  wheel       512 Apr 13 16:46 sd
drwxr-xr-x  3 root  wheel       512 Apr 13 16:46 vol

/dev/vinum/drive:
total 0
lrwxr-xr-x  1 root  wheel  9 Apr 13 16:46 drive1 -&gt; /dev/sd1h
lrwxr-xr-x  1 root  wheel  9 Apr 13 16:46 drive2 -&gt; /dev/sd2h
lrwxr-xr-x  1 root  wheel  9 Apr 13 16:46 drive3 -&gt; /dev/sd3h
lrwxr-xr-x  1 root  wheel  9 Apr 13 16:46 drive4 -&gt; /dev/sd4h

/dev/vinum/plex:
total 0
brwxr-xr--  1 root  wheel   25, 0x10000002 Apr 13 16:46 s64.p0

/dev/vinum/rsd:
total 0
crwxr-xr--  1 root  wheel   91, 0x20000002 Apr 13 16:46 s64.p0.s0
crwxr-xr--  1 root  wheel   91, 0x20100002 Apr 13 16:46 s64.p0.s1
crwxr-xr--  1 root  wheel   91, 0x20200002 Apr 13 16:46 s64.p0.s2
crwxr-xr--  1 root  wheel   91, 0x20300002 Apr 13 16:46 s64.p0.s3

/dev/vinum/rvol:
total 0
crwxr-xr--  1 root  wheel   91,   2 Apr 13 16:46 s64

/dev/vinum/sd:
total 0
brwxr-xr--  1 root  wheel   25, 0x20000002 Apr 13 16:46 s64.p0.s0
brwxr-xr--  1 root  wheel   25, 0x20100002 Apr 13 16:46 s64.p0.s1
brwxr-xr--  1 root  wheel   25, 0x20200002 Apr 13 16:46 s64.p0.s2
brwxr-xr--  1 root  wheel   25, 0x20300002 Apr 13 16:46 s64.p0.s3

/dev/vinum/vol:
total 1
brwxr-xr--  1 root  wheel   25,   2 Apr 13 16:46 s64
drwxr-xr-x  3 root  wheel       512 Apr 13 16:46 s64.plex

/dev/vinum/vol/s64.plex:
total 1
brwxr-xr--  1 root  wheel   25, 0x10000002 Apr 13 16:46 s64.p0
drwxr-xr-x  2 root  wheel       512 Apr 13 16:46 s64.p0.sd

/dev/vinum/vol/s64.plex/s64.p0.sd:
total 0
brwxr-xr--  1 root  wheel   25, 0x20000002 Apr 13 16:46 s64.p0.s0
brwxr-xr--  1 root  wheel   25, 0x20100002 Apr 13 16:46 s64.p0.s1
brwxr-xr--  1 root  wheel   25, 0x20200002 Apr 13 16:46 s64.p0.s2
brwxr-xr--  1 root  wheel   25, 0x20300002 Apr 13 16:46 s64.p0.s3</programlisting>

    <para>Although it is recommended that plexes and subdisks should not be
      given specific names, Vinum drives must be named.  This makes it
      possible to move a drive to a different location and still recognize it
      automatically.</para>

    <sect2>
      <title>Creating file systems</title>
	<para>Volumes appear to the system to be identical to disks, with one
	  exception: unlike UNIX&trade; drives, Vinum does not partition volumes,
	  so they do not contain a partition table.  This has required
	  modification to some disk utilities, notably &man.newfs.8;, which
	  previously tried to interpret the last letter of a Vinum volume name
	  as a partition identifier.  For example, a disk drive may have a name
	  like <devicename>/dev/ad0a</devicename> or
	  <devicename>/dev/da2h</devicename>.  These names represent the first
	  partition (<devicename>a</devicename>) on the first (0) IDE disk
	  (<devicename>ad</devicename>) and the eighth partition
	  (<devicename>h</devicename>) on the third (2) SCSI disk
	  (<devicename>da</devicename>) respectively.  By contrast, a Vinum volume
	  might be called <devicename>/dev/vinum/concat</devicename>, a name which
	  has no relationship with a partition name.</para>

	<para>Normally, &man.newfs.8; interprets the name of the disk and
	  complains if it cannot understand it.  For example:</para>

	<screen>&prompt.root; <userinput>newfs /dev/vinum/concat</userinput>
newfs: /dev/vinum/concat: can't figure out file system partition</screen>

	<para>In order to create a file system on this volume, use the
	  <option>-v</option> option to &man.newfs.8;:</para>

	<screen>&prompt.root; <userinput>newfs -v /dev/vinum/concat</userinput></screen>
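
	<para>Once the file system has been created, the volume can be mounted
	  like any other disk device (assuming the mount point
	  <filename>/mnt</filename> exists):</para>

	<screen>&prompt.root; <userinput>mount /dev/vinum/concat /mnt</userinput></screen>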

    </sect2>
  </sect1>
  
  <sect1 id="vinum-config">
    <title>Configuring Vinum</title>
    <para>The <filename>GENERIC</filename> kernel does not contain Vinum.  It is
	possible to build a special kernel which includes Vinum, but this is not
	recommended.  The standard way to start Vinum is as a kernel module
	(<acronym>kld</acronym>).  You do not even need to use &man.kldload.8; 
	for Vinum: when you start &man.vinum.8;, it checks whether the module 
	has been loaded, and if it is not, it loads it automatically.</para>
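
    <para>Should you prefer to load the module by hand anyway, you can do so
	explicitly:</para>

    <screen>&prompt.root; <userinput>kldload vinum</userinput></screen>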


    <sect2>
      <title>Startup</title>
	<para>Vinum stores configuration information on the disk slices in
	  essentially the same form as in the configuration files.  When reading
	  from the configuration database, Vinum recognizes a number of keywords
	  which are not allowed in the configuration files.  For example, a disk
	  configuration might contain the following text:</para>

	<programlisting>volume myvol state up
volume bigraid state down
plex name myvol.p0 state up org concat vol myvol
plex name myvol.p1 state up org concat vol myvol
plex name myvol.p2 state init org striped 512b vol myvol
plex name bigraid.p0 state initializing org raid5 512b vol bigraid
sd name myvol.p0.s0 drive a plex myvol.p0 state up len 1048576b driveoffset 265b plexoffset 0b
sd name myvol.p0.s1 drive b plex myvol.p0 state up len 1048576b driveoffset 265b plexoffset 1048576b
sd name myvol.p1.s0 drive c plex myvol.p1 state up len 1048576b driveoffset 265b plexoffset 0b
sd name myvol.p1.s1 drive d plex myvol.p1 state up len 1048576b driveoffset 265b plexoffset 1048576b
sd name myvol.p2.s0 drive a plex myvol.p2 state init len 524288b driveoffset 1048841b plexoffset 0b
sd name myvol.p2.s1 drive b plex myvol.p2 state init len 524288b driveoffset 1048841b plexoffset 524288b
sd name myvol.p2.s2 drive c plex myvol.p2 state init len 524288b driveoffset 1048841b plexoffset 1048576b
sd name myvol.p2.s3 drive d plex myvol.p2 state init len 524288b driveoffset 1048841b plexoffset 1572864b
sd name bigraid.p0.s0 drive a plex bigraid.p0 state initializing len 4194304b driveoffset 1573129b plexoffset 0b
sd name bigraid.p0.s1 drive b plex bigraid.p0 state initializing len 4194304b driveoffset 1573129b plexoffset 4194304b
sd name bigraid.p0.s2 drive c plex bigraid.p0 state initializing len 4194304b driveoffset 1573129b plexoffset 8388608b
sd name bigraid.p0.s3 drive d plex bigraid.p0 state initializing len 4194304b driveoffset 1573129b plexoffset 12582912b
sd name bigraid.p0.s4 drive e plex bigraid.p0 state initializing len 4194304b driveoffset 1573129b plexoffset 16777216b</programlisting>
  
	<para>The obvious differences here are the presence of explicit location
	  information and naming (both of which may also be used by the user,
	  though this is discouraged) and the information on the states (which
	  are not available to the user).  Vinum does not store information
	  about drives in the configuration information: it finds the drives
	  by scanning the configured disk drives for partitions with a Vinum
	  label.  This enables Vinum to identify drives correctly even if they
	  have been assigned different UNIX&trade; drive IDs.</para>
  
      <sect3>
	<title>Automatic startup</title>
	  <para>In order to start Vinum automatically when you boot the system,
	    ensure that you have the following line in your
	    <filename>/etc/rc.conf</filename>:</para>

	<programlisting>start_vinum="YES"		# set to YES to start vinum</programlisting>

	<para>If you do not have a file <filename>/etc/rc.conf</filename>, create
	  one with this content.  This will cause the system to load the Vinum
	  <acronym>kld</acronym> at startup, and to start any objects mentioned in
	  the configuration.  This is done before mounting file systems, so it is
	  possible to automatically &man.fsck.8; and mount file systems on Vinum
	  volumes.</para>
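
	<para>For example, a file system on the <devicename>concat</devicename>
	  volume created earlier could be mounted at boot time with an
	  <filename>/etc/fstab</filename> entry along these lines (the
	  <filename>/mnt</filename> mount point is an assumption):</para>

	<programlisting>/dev/vinum/concat	/mnt	ufs	rw	2	2</programlisting>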

	<para>When you start Vinum with the <command>vinum start</command> command,
	  Vinum reads the configuration database from one of the Vinum drives.
	  Under normal circumstances, each drive contains an identical copy of the
	  configuration database, so it does not matter which drive is read.  After
	  a crash, however, Vinum must determine which drive was updated most
	  recently and read the configuration from this drive.  It then updates the
	  configuration if necessary from progressively older drives.</para>

      </sect3>
    </sect2>
  </sect1>
</chapter>