<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE article PUBLIC "-//FreeBSD//DTD DocBook XML V5.0-Based Extension//EN"
	"http://www.FreeBSD.org/XML/share/xml/freebsd50.dtd">
<!-- $FreeBSD$ -->
<!-- FreeBSD Documentation Project -->
<article xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0" xml:lang="en">
  <info><title>Design elements of the &os; VM system</title>
    

    <authorgroup>
      <author><personname><firstname>Matthew</firstname><surname>Dillon</surname></personname><affiliation>
	  <address>
	    <email>dillon@apollo.backplane.com</email>
	  </address>
	</affiliation></author>
    </authorgroup>

    <legalnotice xml:id="trademarks" role="trademarks">
      &tm-attrib.freebsd;
      &tm-attrib.linux;
      &tm-attrib.microsoft;
      &tm-attrib.opengroup;
      &tm-attrib.general;
    </legalnotice>

    <pubdate>$FreeBSD$</pubdate>

    <releaseinfo>$FreeBSD$</releaseinfo>

    <abstract>
      <para>The title is really just a fancy way of saying that I am going to
	attempt to describe the whole VM enchilada, hopefully in a way that
	everyone can follow.  For the last year I have concentrated on a number
	of major kernel subsystems within &os;, with the VM and Swap
	subsystems being the most interesting and NFS being <quote>a necessary
	chore</quote>.  I rewrote only small portions of the code.  In the VM
	arena the only major rewrite I have done is to the swap subsystem.
	Most of my work was cleanup and maintenance, with only moderate code
	rewriting and no major algorithmic adjustments within the VM
	subsystem.  The bulk of the VM subsystem's theoretical base remains
	unchanged and a lot of the credit for the modernization effort in the
	last few years belongs to John Dyson and David Greenman.  Not being a
	historian like Kirk, I will not attempt to tag all the various features
	with people's names, since I will invariably get it wrong.</para>
    </abstract>

    <legalnotice xml:id="legalnotice">
      <para>This article was originally published in the January 2000 issue of
	<link xlink:href="http://www.daemonnews.org/">DaemonNews</link>.  This
	version of the article may include updates from Matt and other authors
	to reflect changes in &os;'s VM implementation.</para>
    </legalnotice>
  </info>

  <sect1 xml:id="introduction">
    <title>Introduction</title>

    <para>Before moving along to the actual design let's spend a little time
      on the necessity of maintaining and modernizing any long-living
      codebase.  In the programming world, algorithms tend to be more
      important than code and it is precisely due to BSD's academic roots that
      a great deal of attention was paid to algorithm design from the
      beginning.  More attention paid to the design generally leads to a clean
      and flexible codebase that can be fairly easily modified, extended, or
      replaced over time.  While BSD is considered an <quote>old</quote>
      operating system by some people, those of us who work on it tend to view
      it more as a <quote>mature</quote> codebase which has various components
      modified, extended, or replaced with modern code.  It has evolved, and
      &os; is at the bleeding edge no matter how old some of the code might
      be.  This is an important distinction to make and one that is
      unfortunately lost to many people.  The biggest error a programmer can
      make is to not learn from history, and this is precisely the error that
      many other modern operating systems have made.  &windowsnt; is the best example
      of this, and the consequences have been dire.  Linux also makes this
      mistake to some degree&mdash;enough that we BSD folk can make small
      jokes about it every once in a while, anyway.  Linux's problem is simply
      one of a lack of experience and history to compare ideas against, a
      problem that is easily and rapidly being addressed by the Linux
      community in the same way it has been addressed in the BSD
      community&mdash;by continuous code development.  The &windowsnt; folk, on the
      other hand, repeatedly make the same mistakes solved by &unix; decades ago
      and then spend years fixing them. Over and over again.  They have a
      severe case of <quote>not designed here</quote> and <quote>we are always
      right because our marketing department says so</quote>.  I have little
      tolerance for anyone who cannot learn from history.</para>

    <para>Much of the apparent complexity of the &os; design, especially in
      the VM/Swap subsystem, is a direct result of having to solve serious
      performance issues that occur under various conditions.  These issues
      are not due to bad algorithmic design but instead rise from
      environmental factors.  In any direct comparison between platforms,
      these issues become most apparent when system resources begin to get
      stressed.  As I describe &os;'s VM/Swap subsystem the reader should
      always keep two points in mind:</para>

    <orderedlist>
      <listitem>
        <para>The most important aspect of performance design is what is
          known as <quote>Optimizing the Critical Path</quote>.  It is often
          the case that performance optimizations add a little bloat to the
          code in order to make the critical path perform better.</para>
      </listitem>

      <listitem>
        <para>A solid, generalized design outperforms a heavily-optimized
          design over the long run.  While a generalized design may end up
          being slower than a heavily-optimized design when they are
          first implemented, the generalized design tends to be easier to
          adapt to changing conditions and the heavily-optimized design
          winds up having to be thrown away.</para>
      </listitem>
    </orderedlist>

    <para>Any codebase that will survive and be maintainable for
      years must therefore be designed properly from the beginning even if it
      costs some performance.  Twenty years ago people were still arguing that
      programming in assembly was better than programming in a high-level
      language because it produced code that was ten times as fast.  Today,
      the fallibility of that argument is obvious&mdash;as are
      the parallels to algorithmic design and code generalization.</para>
  </sect1>

  <sect1 xml:id="vm-objects">
    <title>VM Objects</title>

    <para>The best way to begin describing the &os; VM system is to look at
      it from the perspective of a user-level process.  Each user process sees
      a single, private, contiguous VM address space containing several types
      of memory objects.  These objects have various characteristics.  Program
      code and program data are effectively a single memory-mapped file (the
      binary file being run), but program code is read-only while program data
      is copy-on-write.  Program BSS is just memory allocated and filled with
      zeros on demand, called demand zero page fill.  Arbitrary files can be
      memory-mapped into the address space as well, which is how the shared
      library mechanism works.  Such mappings can require modifications to
      remain private to the process making them.  The fork system call adds an
      entirely new dimension to the VM management problem on top of the
      complexity already given.</para>
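
    <para>As a rough, user-level illustration of these object types (a
      sketch, not the kernel code that sets them up at exec time), the same
      kinds of mappings can be requested directly with
      <function>mmap()</function>: a private file mapping behaves like
      program data (copy-on-write), and an anonymous mapping behaves like
      BSS (demand zero page fill):</para>

    <programlisting><![CDATA[/*
 * Illustration only: a private (copy-on-write) file mapping and a
 * demand-zero anonymous mapping, the user-level analogues of program
 * data and BSS.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
	int fd = open("/bin/sh", O_RDONLY);	/* any file will do */

	if (fd < 0) {
		perror("open");
		return (1);
	}

	/* Like program data: file-backed, private, copy-on-write. */
	char *data = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE, fd, 0);

	/* Like BSS: anonymous, demand zero page fill. */
	char *bss = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON, -1, 0);

	if (data == MAP_FAILED || bss == MAP_FAILED) {
		perror("mmap");
		return (1);
	}

	data[0] ^= 0xff;	/* forces a private copy of this page */
	printf("bss starts zeroed: %d\n", bss[0]);

	munmap(data, 4096);
	munmap(bss, 4096);
	close(fd);
	return (0);
}]]></programlisting>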

    <para>A program binary data page (which is a basic copy-on-write page)
      illustrates the complexity.  A program binary contains a preinitialized
      data section which is initially mapped directly from the program file.
      When a program is loaded into a process's VM space, this area is
      initially memory-mapped and backed by the program binary itself,
      allowing the VM system to free/reuse the page and later load it back in
      from the binary.  The moment a process modifies this data, however, the
      VM system must make a private copy of the page for that process.  Since
      the private copy has been modified, the VM system may no longer free it,
      because there is no longer any way to restore it later on.</para>

    <para>You will notice immediately that what was originally a simple file
      mapping has become much more complex.  Data may be modified on a
      page-by-page basis whereas the file mapping encompasses many pages at
      once.  The complexity further increases when a process forks.  When a
      process forks, the result is two processes&mdash;each with their own
      private address spaces, including any modifications made by the original
      process prior to the call to <function>fork()</function>.  It would be
      silly for the VM system to make a complete copy of the data at the time
      of the <function>fork()</function> because it is quite possible that at
      least one of the two processes will only need to read from that page
      from then on, allowing the original page to continue to be used.  What
      was a private page is made copy-on-write again, since each process
      (parent and child) expects its own personal post-fork modifications to
      remain private to itself and not affect the other.</para>
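
    <para>The following user-level sketch demonstrates that behavior (again
      an illustration rather than kernel code): after
      <function>fork()</function>, a write by either process is satisfied
      from a private copy of the page, so the other process never sees the
      change:</para>

    <programlisting><![CDATA[/*
 * Illustration of post-fork copy-on-write: parent and child share the
 * same physical pages until one of them writes.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
	char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON, -1, 0);

	if (page == MAP_FAILED)
		return (1);
	strcpy(page, "pre-fork data");

	pid_t pid = fork();
	if (pid == 0) {
		/* Child: this write takes a COW fault and modifies a
		 * private copy; the parent's page is unaffected. */
		strcpy(page, "child's private copy");
		printf("child sees:  %s\n", page);
		_exit(0);
	}
	waitpid(pid, NULL, 0);
	printf("parent sees: %s\n", page);	/* still "pre-fork data" */
	return (0);
}]]></programlisting>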

    <para>&os; manages all of this with a layered VM Object model.  The
      original binary program file winds up being the lowest VM Object layer.
      A copy-on-write layer is pushed on top of that to hold those pages which
      had to be copied from the original file.  If the program modifies a data
      page belonging to the original file the VM system takes a fault and
      makes a copy of the page in the higher layer.  When a process forks,
      additional VM Object layers are pushed on.  This might make a little
      more sense with a fairly basic example.  A <function>fork()</function>
      is a common operation for any *BSD system, so this example will consider
      a program that starts up, and forks.  When the process starts, the VM
      system creates an object layer, let's call this A:</para>

    <mediaobject>
      <imageobject>
        <imagedata fileref="fig1"/>
      </imageobject>

      <textobject>
	<literallayout class="monospaced">+---------------+
|       A       |
+---------------+</literallayout>
      </textobject>

      <textobject>
	<phrase>A picture</phrase>
      </textobject>
    </mediaobject>

    <para>A represents the file&mdash;pages may be paged in and out of the
      file's physical media as necessary.  Paging in from the disk is
      reasonable for a program, but we really do not want to page back out and
      overwrite the executable.  The VM system therefore creates a second
      layer, B, that will be physically backed by swap space:</para>

    <mediaobject>
      <imageobject>
        <imagedata fileref="fig2"/>
      </imageobject>

      <textobject>
	<literallayout class="monospaced">+---------------+
|       B       |
+---------------+
|       A       |
+---------------+</literallayout>
      </textobject>
    </mediaobject>

    <para>On the first write to a page after this, a new page is created in B,
      and its contents are initialized from A.  All pages in B can be paged in
      or out to a swap device.  When the program forks, the VM system creates
      two new object layers&mdash;C1 for the parent, and C2 for the
      child&mdash;that rest on top of B:</para>

    <mediaobject>
      <imageobject>
        <imagedata fileref="fig3"/>
      </imageobject>

      <textobject>
	<literallayout class="monospaced">+-------+-------+
|   C1  |   C2  |
+-------+-------+
|       B       |
+---------------+
|       A       |
+---------------+</literallayout>
      </textobject>
    </mediaobject>

    <para>In this case, let's say a page in B is modified by the original
      parent process.  The process will take a copy-on-write fault and
      duplicate the page in C1, leaving the original page in B untouched.
      Now, let's say the same page in B is modified by the child process.  The
      process will take a copy-on-write fault and duplicate the page in C2.
      The original page in B is now completely hidden since both C1 and C2
      have a copy and B could theoretically be destroyed if it does not
      represent a <quote>real</quote> file; however, this sort of optimization is not
      trivial to make because it is so fine-grained.  &os; does not make
      this optimization.  Now, suppose (as is often the case) that the child
      process does an <function>exec()</function>.  Its current address space
      is usually replaced by a new address space representing a new file.  In
      this case, the C2 layer is destroyed:</para>

    <mediaobject>
      <imageobject>
        <imagedata fileref="fig4"/>
      </imageobject>

      <textobject>
	<literallayout class="monospaced">+-------+
|   C1  |
+-------+-------+
|       B       |
+---------------+
|       A       |
+---------------+</literallayout>
      </textobject>
    </mediaobject>

    <para>In this case, the number of children of B drops to one, and all
      accesses to B now go through C1.  This means that B and C1 can be
      collapsed together.  Any pages in B that also exist in C1 are deleted
      from B during the collapse.  Thus, even though the optimization in the
      previous step could not be made, we can recover the dead pages when
      either of the processes exit or <function>exec()</function>.</para>

    <para>This model creates a number of potential problems.  The first is that
      you can wind up with a relatively deep stack of layered VM Objects which
      can cost scanning time and memory when you take a fault.  Deep
      layering can occur when processes fork and then fork again (either
      parent or child).  The second problem is that you can wind up with dead,
      inaccessible pages deep in the stack of VM Objects.  In our last example
      if both the parent and child processes modify the same page, they both
      get their own private copies of the page and the original page in B is
      no longer accessible by anyone.  That page in B can be freed.</para>

    <para>&os; solves the deep layering problem with a special optimization
      called the <quote>All Shadowed Case</quote>.  This case occurs if either
      C1 or C2 takes sufficient COW faults to completely shadow all pages in B.
      Let's say that C1 achieves this.  C1 can now bypass B entirely, so rather
      than having C1-&gt;B-&gt;A and C2-&gt;B-&gt;A we now have C1-&gt;A and C2-&gt;B-&gt;A.  But
      look what also happened&mdash;now B has only one reference (C2), so we
      can collapse B and C2 together.  The end result is that B is deleted
      entirely and we have C1-&gt;A and C2-&gt;A.  It is often the case that B will
      contain a large number of pages and neither C1 nor C2 will be able to
      completely overshadow it.  If we fork again and create a set of D
      layers, however, it is much more likely that one of the D layers will
      eventually be able to completely overshadow the much smaller dataset
      represented by C1 or C2.  The same optimization will work at any point in
      the graph and the grand result of this is that even on a heavily forked
      machine VM Object stacks tend to not get much deeper than 4.  This is
      true of both the parent and the children and true whether the parent is
      doing the forking or whether the children cascade forks.</para>

    <para>The dead page problem still exists in the case where C1 or C2 do not
      completely overshadow B.  Due to our other optimizations this case does
      not represent much of a problem and we simply allow the pages to be
      dead.  If the system runs low on memory it will swap them out, eating a
      little swap, but that is it.</para>

    <para>The advantage to the VM Object model is that
      <function>fork()</function> is extremely fast, since no real data
      copying need take place.  The disadvantage is that you can build a
      relatively complex VM Object layering that slows page fault handling
      down a little, and you spend memory managing the VM Object structures.
      The optimizations &os; makes prove to reduce the problems enough
      that they can be ignored, leaving no real disadvantage.</para>
  </sect1>

  <sect1 xml:id="swap-layers">
    <title>SWAP Layers</title>

    <para>Private data pages are initially either copy-on-write or zero-fill
      pages.  When a change, and therefore a copy, is made, the original
      backing object (usually a file) can no longer be used to save a copy of
      the page when the VM system needs to reuse it for other purposes.  This
      is where SWAP comes in.  SWAP is allocated to create backing store for
      memory that does not otherwise have it.  &os; allocates the swap
      management structure for a VM Object only when it is actually needed.
      However, the swap management structure has had problems
      historically:</para>

    <itemizedlist>
      <listitem>
        <para>Under &os; 3.X the swap management structure preallocates an
          array that encompasses the entire object requiring swap backing
          store&mdash;even if only a few pages of that object are
          swap-backed.  This creates a kernel memory fragmentation problem
          when large objects are mapped, or processes with large run sizes
          (RSS) fork.</para>
      </listitem>

      <listitem>
        <para>Also, in order to keep track of swap space, a <quote>list of
          holes</quote> is kept in kernel memory, and this tends to get
          severely fragmented as well.  Since the <quote>list of
          holes</quote> is a linear list, the swap allocation and freeing
          performance is a non-optimal O(n)-per-page.</para>
      </listitem>

      <listitem>
        <para>It requires kernel memory allocations to take place during
          the swap freeing process, and that creates low memory deadlock
          problems.</para>
      </listitem>

      <listitem>
        <para>The problem is further exacerbated by holes created due to
          the interleaving algorithm.</para>
      </listitem>

      <listitem>
        <para>Also, the swap block map can become fragmented fairly easily
          resulting in non-contiguous allocations.</para>
      </listitem>

      <listitem>
        <para>Kernel memory must also be allocated on the fly for additional
          swap management structures when a swapout occurs.</para>
      </listitem>
    </itemizedlist>

    <para>It is evident from that list that there was plenty of room for
       improvement.  For &os; 4.X, I completely rewrote the swap
       subsystem:</para>

    <itemizedlist>
      <listitem>
        <para>Swap management structures are allocated through a hash
          table rather than a linear array giving them a fixed allocation
          size and much finer granularity.</para>
      </listitem>

      <listitem>
        <para>Rather than using a linearly linked list to keep track of
          swap space reservations, it now uses a bitmap of swap blocks
          arranged in a radix tree structure with free-space hinting in
          the radix node structures.  This effectively makes swap
          allocation and freeing an O(1) operation.</para>
      </listitem>

      <listitem>
        <para>The entire radix tree bitmap is also preallocated in
          order to avoid having to allocate kernel memory during critical
          low memory swapping operations.  After all, the system tends to
          swap when it is low on memory so we should avoid allocating
          kernel memory at such times in order to avoid potential
          deadlocks.</para>
      </listitem>

      <listitem>
        <para>To reduce fragmentation the radix tree is capable
          of allocating large contiguous chunks at once, skipping over
          smaller fragmented chunks.</para>
      </listitem>
    </itemizedlist>

    <para>I did not take the final step of having an
      <quote>allocating hint pointer</quote> that would trundle
      through a portion of swap as allocations were made in order to further
      guarantee contiguous allocations or at least locality of reference, but
      I ensured that such an addition could be made.</para>
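
    <para>To make the bitmap idea concrete, here is a deliberately
      simplified, user-level sketch of bitmap block accounting with a
      single allocation hint.  It is not the kernel allocator: the real
      code arranges the bitmap as a radix tree with free-space hinting in
      every node, which is what removes the linear scan shown here:</para>

    <programlisting><![CDATA[/*
 * Toy flat-bitmap block allocator with a "next free" hint.  Illustrative
 * only; the real swap allocator uses a radix tree of bitmaps so that
 * allocation and freeing stay close to O(1) even when fragmented.
 */
#include <limits.h>
#include <stddef.h>

#define	NBLOCKS	1024

static unsigned long bitmap[NBLOCKS / (sizeof(unsigned long) * CHAR_BIT)];
static size_t hint;			/* where to start the next search */

int
blk_alloc(void)
{
	const size_t nbits = sizeof(unsigned long) * CHAR_BIT;

	for (size_t n = 0; n < NBLOCKS; n++) {
		size_t i = (hint + n) % NBLOCKS;

		if ((bitmap[i / nbits] & (1UL << (i % nbits))) == 0) {
			bitmap[i / nbits] |= 1UL << (i % nbits);
			hint = i + 1;
			return ((int)i);
		}
	}
	return (-1);			/* no swap blocks left */
}

void
blk_free(int i)
{
	const size_t nbits = sizeof(unsigned long) * CHAR_BIT;

	bitmap[i / nbits] &= ~(1UL << (i % nbits));
	if ((size_t)i < hint)
		hint = i;		/* remember the lowest known-free block */
}

int
main(void)
{
	int a = blk_alloc();		/* block 0 */
	int b = blk_alloc();		/* block 1 */

	blk_free(a);
	return (blk_alloc() == a && b == a + 1 ? 0 : 1);
}]]></programlisting>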
  </sect1>

  <sect1 xml:id="freeing-pages">
    <title>When to free a page</title>

    <para>Since the VM system uses all available memory for disk caching,
      there are usually very few truly-free pages.  The VM system depends on
      being able to properly choose pages which are not in use to reuse for
      new allocations.  Selecting the optimal pages to free is possibly the
      single-most important function any VM system can perform because if it
      makes a poor selection, the VM system may be forced to unnecessarily
      retrieve pages from disk, seriously degrading system performance.</para>

    <para>How much overhead are we willing to suffer in the critical path to
      avoid freeing the wrong page?  Each wrong choice we make will cost us
      hundreds of thousands of CPU cycles and a noticeable stall of the
      affected processes, so we are willing to endure a significant amount of
      overhead in order to be sure that the right page is chosen.  This is why
      &os; tends to outperform other systems when memory resources become
      stressed.</para>

    <para>The free page determination algorithm is built upon a history of the
      use of memory pages.  To acquire this history, the system takes advantage
      of a page-used bit feature that most hardware page tables have.</para>

    <para>In any case, the page-used bit is cleared and at some later point
      the VM system comes across the page again and sees that the page-used
      bit has been set.  This indicates that the page is still being actively
      used.  If the bit is still clear it is an indication that the page is not
      being actively used.  By testing this bit periodically, a use history (in
      the form of a counter) for the physical page is developed.  When the VM
      system later needs to free up some pages, checking this history becomes
      the cornerstone of determining the best candidate page to reuse.</para>
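
    <para>The following sketch shows the general shape of such a use
      counter.  It is only an illustration of the idea; the function and
      constants are made up for the example, not taken from the &os;
      pageout code:</para>

    <programlisting><![CDATA[/*
 * Illustrative sketch of building a per-page use history from periodic
 * samples of the page-used bit.  The counter climbs while the page keeps
 * being referenced and decays when it does not.
 */
#include <stdio.h>

#define	ACT_MAX		64	/* made-up bounds for the example */
#define	ACT_ADVANCE	3
#define	ACT_DECLINE	1

struct page_stats {
	int	act_count;	/* use history for one physical page */
};

void
sample_page(struct page_stats *p, int used_bit_was_set)
{
	if (used_bit_was_set) {
		p->act_count += ACT_ADVANCE;
		if (p->act_count > ACT_MAX)
			p->act_count = ACT_MAX;
	} else if (p->act_count > 0) {
		p->act_count -= ACT_DECLINE;
	}
	/* The page-used bit is cleared after each sample so that the next
	 * pass observes only new references. */
}

int
main(void)
{
	struct page_stats p = { 0 };
	int samples[] = { 1, 1, 1, 0, 0, 1, 0, 0 };

	for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		sample_page(&p, samples[i]);
		printf("sample %zu: act_count = %d\n", i, p.act_count);
	}
	return (0);
}]]></programlisting>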

    <sidebar>
      <title>What if the hardware has no page-used bit?</title>

      <para>For those platforms that do not have this feature, the system
	actually emulates a page-used bit.  It unmaps or protects a page,
	forcing a page fault if the page is accessed again.  When the page
	fault is taken, the system simply marks the page as having been used
	and unprotects the page so that it may be used.  While taking such page
	faults just to determine if a page is being used appears to be an
	expensive proposition, it is much less expensive than reusing the page
	for some other purpose only to find that a process needs it back and
	then have to go to disk.</para>
    </sidebar>
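
    <para>A user-space analogue of that emulation can be built with
      <function>mprotect()</function> and a fault handler.  The sketch
      below assumes a 4096-byte page size and merely stands in for what
      the kernel does with real page tables:</para>

    <programlisting><![CDATA[/*
 * Emulating a page-used bit from user space: revoke access to a page,
 * and treat the resulting fault as "this page was referenced".
 */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define	PGSIZE	4096			/* assumed page size for this sketch */

static char *page;
static volatile sig_atomic_t page_was_used;

static void
fault_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)sig;
	(void)ctx;
	if ((char *)si->si_addr >= page && (char *)si->si_addr < page + PGSIZE) {
		page_was_used = 1;	/* record the reference */
		/* Restore access so the faulting instruction can restart. */
		mprotect(page, PGSIZE, PROT_READ | PROT_WRITE);
	}
}

int
main(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = fault_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGSEGV, &sa, NULL);
	sigaction(SIGBUS, &sa, NULL);	/* some platforms report SIGBUS */

	page = mmap(NULL, PGSIZE, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON, -1, 0);
	if (page == MAP_FAILED)
		return (1);
	page[0] = 1;

	/* "Clear" the emulated page-used bit by revoking access. */
	mprotect(page, PGSIZE, PROT_NONE);
	page_was_used = 0;

	page[0] = 2;			/* faults; the handler notes the use */
	printf("page was used: %d\n", (int)page_was_used);
	return (0);
}]]></programlisting>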

    <para>&os; makes use of several page queues to further refine the
      selection of pages to reuse as well as to determine when dirty pages
      must be flushed to their backing store.  Since page tables are dynamic
      entities under &os;, it costs virtually nothing to unmap a page from
      the address space of any processes using it.  When a page candidate has
      been chosen based on the page-use counter, this is precisely what is
      done.  The system must make a distinction between clean pages which can
      theoretically be freed up at any time, and dirty pages which must first
      be written to their backing store before being reusable.  When a page
      candidate has been found it is moved to the inactive queue if it is
      dirty, or the cache queue if it is clean.  A separate algorithm based on
      the dirty-to-clean page ratio determines when dirty pages in the
      inactive queue must be flushed to disk.  Once this is accomplished, the
      flushed pages are moved from the inactive queue to the cache queue.  At
      this point, pages in the cache queue can still be reactivated by a VM
      fault at relatively low cost.  However, pages in the cache queue are
      considered to be <quote>immediately freeable</quote> and will be reused
      in an LRU (least-recently used) fashion when the system needs to
      allocate new memory.</para>
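
    <para>A simplified model of those transitions is sketched below.  The
      type and function names are hypothetical and only mirror the flow
      just described (deactivate to the inactive or cache queue, launder
      dirty pages to the cache queue, reactivate on a fault):</para>

    <programlisting><![CDATA[/*
 * Hypothetical model of the page queue transitions; not kernel code.
 */
#include <stdio.h>

enum pagequeue { PQ_ACTIVE, PQ_INACTIVE, PQ_CACHE, PQ_FREE };

struct mpage {
	enum pagequeue	queue;
	int		dirty;
};

void
deactivate(struct mpage *m)
{
	/* Chosen for reuse based on its use history: unmap it and classify
	 * it by whether it still has to be written to backing store. */
	m->queue = m->dirty ? PQ_INACTIVE : PQ_CACHE;
}

void
laundered(struct mpage *m)
{
	/* A dirty page has been flushed to its backing store. */
	m->dirty = 0;
	m->queue = PQ_CACHE;
}

void
reactivate(struct mpage *m)
{
	/* A VM fault touched the page again: cheap revival. */
	m->queue = PQ_ACTIVE;
}

int
main(void)
{
	struct mpage m = { PQ_ACTIVE, 1 };	/* an active, dirty page */

	deactivate(&m);		/* dirty, so it goes to the inactive queue */
	laundered(&m);		/* flushed, so it moves to the cache queue */
	reactivate(&m);		/* a fault pulls it back to the active queue */
	printf("queue %d, dirty %d\n", (int)m.queue, m.dirty);
	return (0);
}]]></programlisting>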

    <para>It is important to note that the &os; VM system attempts to
      separate clean and dirty pages for the express reason of avoiding
      unnecessary flushes of dirty pages (which eat I/O bandwidth), and it
      does not move pages between the various page queues gratuitously when the
      memory subsystem is not being stressed.  This is why you will see some
      systems with very low cache queue counts and high active queue counts
      when doing a <command>systat -vm</command> command.  As the VM system
      becomes more stressed, it makes a greater effort to maintain the various
      page queues at the levels determined to be the most effective.</para>

    <para>An urban
      myth has circulated for years that Linux did a better job avoiding
      swapouts than &os;, but this in fact is not true.  What was actually
      occurring was that &os; was proactively paging out unused pages in
      order to make room for more disk cache while Linux was keeping unused
      pages in core and leaving less memory available for cache and process
      pages.  I do not know whether this is still true today.</para>
  </sect1>

  <sect1 xml:id="prefault-optimizations">
    <title>Pre-Faulting and Zeroing Optimizations</title>

    <para>Taking a VM fault is not expensive if the underlying page is already
      in core and can simply be mapped into the process, but it can become
      expensive if you take a whole lot of them on a regular basis.  A good
      example of this is running a program such as &man.ls.1; or &man.ps.1;
      over and over again.  If the program binary is mapped into memory but
      not mapped into the page table, then all the pages that will be accessed
      by the program will have to be faulted in every time the program is run.
      This is unnecessary when the pages in question are already in the VM
      Cache, so &os; will attempt to pre-populate a process's page tables
      with those pages that are already in the VM Cache.  One thing that
      &os; does not yet do is pre-copy-on-write certain pages on exec.  For
      example, if you run the &man.ls.1; program while running <command>vmstat
	1</command> you will notice that it always takes a certain number of
      page faults, even when you run it over and over again.  These are
      zero-fill faults, not program code faults (which were pre-faulted in
      already).  Pre-copying pages on exec or fork is an area that could use
      more study.</para>

    <para>A large percentage of page faults that occur are zero-fill faults.
      You can usually see this by observing the <command>vmstat -s</command>
      output.  These occur when a process accesses pages in its BSS area.  The
      BSS area is expected to be initially zero but the VM system does not
      bother to allocate any memory at all until the process actually accesses
      it.  When a fault occurs the VM system must not only allocate a new page,
      it must zero it as well.  To optimize the zeroing operation the VM system
      has the ability to pre-zero pages and mark them as such, and to request
      pre-zeroed pages when zero-fill faults occur.  The pre-zeroing occurs
      whenever the CPU is idle but the number of pages the system pre-zeros is
      limited in order to avoid blowing away the memory caches.  This is an
      excellent example of adding complexity to the VM system in order to
      optimize the critical path.</para>
  </sect1>

  <sect1 xml:id="page-table-optimizations">
    <title>Page Table Optimizations</title>

    <para>The page table optimizations make up the most contentious part of
      the &os; VM design and they have shown some strain with the advent of
      serious use of <function>mmap()</function>.  I think this is actually a
      feature of most BSDs though I am not sure when it was first introduced.
      There are two major optimizations.  The first is that hardware page
      tables do not contain persistent state but instead can be thrown away at
      any time with only a minor amount of management overhead.  The second is
      that every active page table entry in the system has a governing
      <literal>pv_entry</literal> structure which is tied into the
      <literal>vm_page</literal> structure.  &os; can simply iterate
      through those mappings that are known to exist while Linux must check
      all page tables that <emphasis>might</emphasis> contain a specific
      mapping to see if it does, which can incur O(n^2) overhead in certain
      situations.  It is because of this that &os; tends to make better
      choices on which pages to reuse or swap when memory is stressed, giving
      it better performance under load. However, &os; requires kernel
      tuning to accommodate large-shared-address-space situations such as
      those that can occur in a news system because it may run out of
      <literal>pv_entry</literal> structures.</para>

    <para>Both Linux and &os; need work in this area.  &os; is trying to
      maximize the advantage of a potentially sparse active-mapping model (not
      all processes need to map all pages of a shared library, for example),
      whereas Linux is trying to simplify its algorithms.  &os; generally
      has the performance advantage here at the cost of wasting a little extra
      memory, but &os; breaks down in the case where a large file is
      massively shared across hundreds of processes.  Linux, on the other hand,
      breaks down in the case where many processes are sparsely-mapping the
      same shared library and also runs non-optimally when trying to determine
      whether a page can be reused or not.</para>
  </sect1>

  <sect1 xml:id="page-coloring-optimizations">
    <title>Page Coloring</title>

    <para>We will end with the page coloring optimizations.  Page coloring is a
      performance optimization designed to ensure that accesses to contiguous
      pages in virtual memory make the best use of the processor cache.  In
      ancient times (i.e. 10+ years ago) processor caches tended to map
      virtual memory rather than physical memory.  This led to a huge number of
      problems including having to clear the cache on every context switch in
      some cases, and problems with data aliasing in the cache.  Modern
      processor caches map physical memory precisely to solve those problems.
      This means that two side-by-side pages in a process's address space may
      not correspond to two side-by-side pages in the cache.  In fact, if you
      are not careful side-by-side pages in virtual memory could wind up using
      the same page in the processor cache&mdash;leading to cacheable data
      being thrown away prematurely and reducing CPU performance.  This is true
      even with multi-way set-associative caches (though the effect is
      mitigated somewhat).</para>

    <para>&os;'s memory allocation code implements page coloring
      optimizations, which means that the memory allocation code will attempt
      to locate free pages that are contiguous from the point of view of the
      cache.  For example, if page 16 of physical memory is assigned to page 0
      of a process's virtual memory and the cache can hold 4 pages, the page
      coloring code will not assign page 20 of physical memory to page 1 of a
      process's virtual memory.  It would, instead, assign page 21 of physical
      memory.  The page coloring code attempts to avoid assigning page 20
      because this maps over the same cache memory as page 16 and would result
      in non-optimal caching.  This code adds a significant amount of
      complexity to the VM memory allocation subsystem as you can well
      imagine, but the result is well worth the effort.  Page coloring makes VM
      memory as deterministic as physical memory with regard to cache
      performance.</para>
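
    <para>The arithmetic behind that example can be written out directly.
      The sketch below only illustrates the color calculation; it is not
      the kernel's allocator:</para>

    <programlisting><![CDATA[/*
 * Page coloring arithmetic: with a cache that covers ncolors pages, two
 * pages interfere in the cache exactly when they have the same color.
 */
#include <assert.h>

static unsigned
page_color(unsigned long pindex, unsigned ncolors)
{
	return ((unsigned)(pindex % ncolors));
}

int
main(void)
{
	unsigned ncolors = 4;	/* the cache in the example holds 4 pages */

	/* Virtual page 0 was given physical page 16; both have color 0. */
	assert(page_color(16, ncolors) == page_color(0, ncolors));

	/* Physical page 20 would collide with page 16 in the cache... */
	assert(page_color(20, ncolors) == page_color(16, ncolors));

	/* ...so the allocator prefers physical page 21, whose color matches
	 * virtual page 1. */
	assert(page_color(21, ncolors) == page_color(1, ncolors));
	return (0);
}]]></programlisting>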
  </sect1>

  <sect1 xml:id="conclusion">
    <title>Conclusion</title>

    <para>Virtual memory in modern operating systems must address a number of
      different issues efficiently and for many different usage patterns.  The
      modular and algorithmic approach that BSD has historically taken allows
      us to study and understand the current implementation as well as
      relatively cleanly replace large sections of the code.  There have been a
      number of improvements to the &os; VM system in the last several
      years, and work is ongoing.</para>
  </sect1>

  <sect1 xml:id="allen-briggs-qa">
    <title>Bonus QA session by Allen Briggs
      <email>briggs@ninthwonder.com</email></title>

    <qandaset>
      <qandaentry>
	<question>
	  <para>What is <quote>the interleaving algorithm</quote> that you
	    refer to in your listing of the ills of the &os; 3.X swap
	    arrangements?</para>
	</question>

	<answer>
	  <para>&os; uses a fixed swap interleave which defaults to 4.  This
	    means that &os; reserves space for four swap areas even if you
	    only have one, two, or three.  Since swap is interleaved the linear
	    address space representing the <quote>four swap areas</quote> will be
	    fragmented if you do not actually have four swap areas.  For
	    example, if you have two swap areas A and B, &os;'s address
	    space representation for that swap space will be interleaved in
	    blocks of 16 pages:

	  <literallayout>A B C D A B C D A B C D A B C D</literallayout>

	  <para>&os; 3.X uses a <quote>sequential list of free
	    regions</quote> approach to accounting for the free swap areas.
	    The idea is that large blocks of free linear space can be
	    represented with a single list node
	    (<filename>kern/subr_rlist.c</filename>).  But due to the
	    fragmentation the sequential list winds up being insanely
	    fragmented.  In the above example, completely unused swap will
	    have A and B shown as <quote>free</quote> and C and D shown as
	    <quote>all allocated</quote>.  Each A-B sequence requires a list
	    node to account for it because C and D are holes, so the list node
	    cannot be combined with the next A-B sequence.</para>
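
	  <para>The interleave itself is simple arithmetic.  The sketch below
	    uses the numbers from the example (16-page stripes, four slots)
	    and is illustrative rather than taken from the kernel:</para>

	  <programlisting><![CDATA[/*
 * Which interleave slot a linear swap page belongs to, using the numbers
 * from the example above.
 */
#include <stdio.h>

#define	STRIPE_PAGES	16	/* pages per interleave block */
#define	NSLOTS		4	/* four reserved swap slots: A, B, C, D */

static int
swap_slot(unsigned long swap_page)
{
	return ((int)((swap_page / STRIPE_PAGES) % NSLOTS));
}

int
main(void)
{
	/* With only two real swap areas (slots A and B), every stripe that
	 * lands in slot C or D is a permanent hole in the linear space. */
	for (unsigned long pg = 0; pg < 8 * STRIPE_PAGES; pg += STRIPE_PAGES)
		printf("pages %lu-%lu -> slot %c\n",
		    pg, pg + STRIPE_PAGES - 1, "ABCD"[swap_slot(pg)]);
	return (0);
}]]></programlisting>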

	  <para>Why do we interleave our swap space instead of just tacking swap
	    areas onto the end and doing something fancier?  Because it is a whole
	    lot easier to allocate linear swaths of an address space and have
	    the result automatically be interleaved across multiple disks than
	    it is to try to put that sophistication elsewhere.</para>

	  <para>The fragmentation causes other problems.  With a linear list
	    under 3.X, and such a huge amount of inherent
	    fragmentation, allocating and freeing swap winds up being an O(N)
	    algorithm instead of an O(1) algorithm.  Combine that with other
	    factors (heavy swapping) and you start getting into O(N^2) and
	    O(N^3) levels of overhead, which is bad.  The 3.X system may also
	    need to allocate KVM during a swap operation to create a new list
	    node which can lead to a deadlock if the system is trying to
	    pageout pages in a low-memory situation.</para>

	  <para>Under 4.X we do not use a sequential list.  Instead we use a
	    radix tree and bitmaps of swap blocks rather than ranged list
	    nodes.  We take the hit of preallocating all the bitmaps required
	    for the entire swap area up front but it winds up wasting less
	    memory due to the use of a bitmap (one bit per block) instead of a
	    linked list of nodes.  The use of a radix tree instead of a
	    sequential list gives us nearly O(1) performance no matter how
	    fragmented the tree becomes.</para>
	</answer>
      </qandaentry>

      <qandaentry>
	<question>
	  <para>How is the separation of clean and dirty (inactive) pages
	    related to the situation where you see low cache queue counts and
	    high active queue counts in <command>systat -vm</command>?  Do the
	    systat stats roll the active and dirty pages together for the
	    active queue count?</para>

	  <para>I do not get the following:</para>

	  <blockquote>
	    <para>It is important to note that the &os; VM system attempts
	      to separate clean and dirty pages for the express reason of
	      avoiding unnecessary flushes of dirty pages (which eats I/O
	      bandwidth), nor does it move pages between the various page
	      queues gratuitously when the memory subsystem is not being
	      stressed.  This is why you will see some systems with very low
	      cache queue counts and high active queue counts when doing a
	      <command>systat -vm</command> command.</para>
	  </blockquote>
	</question>

	<answer>
	  <para>Yes, that is confusing.  The relationship is
	    <quote>goal</quote> versus <quote>reality</quote>.  Our goal is to
	    separate the pages but the reality is that if we are not in a
	    memory crunch, we do not really have to.</para>

	  <para>What this means is that &os; will not try very hard to
	    separate out dirty pages (inactive queue) from clean pages (cache
	    queue) when the system is not being stressed, nor will it try to
	    deactivate pages (active queue -&gt; inactive queue) when the system
	    is not being stressed, even if they are not being used.</para>
	</answer>
      </qandaentry>

      <qandaentry>
	<question>
	  <para> In the &man.ls.1; / <command>vmstat 1</command> example,
	    would not some of the page faults be data page faults (COW from
	    executable file to private page)?  I.e., I would expect the page
	    faults to be some zero-fill and some program data.  Or are you
	    implying that &os; does do pre-COW for the program data?</para>
	</question>

	<answer>
	  <para>A COW fault can be either zero-fill or program-data.  The
	    mechanism is the same either way because the backing program-data
	    is almost certainly already in the cache.  I am indeed lumping the
	    two together.  &os; does not pre-COW program data or zero-fill,
	    but it <emphasis>does</emphasis> pre-map pages that exist in its
	    cache.</para>
	</answer>
      </qandaentry>

      <qandaentry>
	<question>
	  <para>In your section on page table optimizations, can you give a
	    little more detail about <literal>pv_entry</literal> and
	    <literal>vm_page</literal> (or should vm_page be
	    <literal>vm_pmap</literal>&mdash;as in 4.4, cf. pp. 180-181 of
	    McKusick, Bostic, Karels, Quarterman)?  Specifically, what kind of
	    operation/reaction would require scanning the mappings?</para>

	  <para>How does Linux do in the case where &os; breaks down
	    (sharing a large file mapping over many processes)?</para>
	</question>

	<answer>
	  <para>A <literal>vm_page</literal> represents an (object,index#)
	    tuple.  A <literal>pv_entry</literal> represents a hardware page
	    table entry (pte).  If you have five processes sharing the same
	    physical page, and three of those processes' page tables actually
	    map the page, that page will be represented by a single
	    <literal>vm_page</literal> structure and three
	    <literal>pv_entry</literal> structures.</para>

	  <para><literal>pv_entry</literal> structures only represent pages
	    mapped by the MMU (one <literal>pv_entry</literal> represents one
	    pte).  This means that when we need to remove all hardware
	    references to a <literal>vm_page</literal> (in order to reuse the
	    page for something else, page it out, clear it, dirty it, and so
	    forth) we can simply scan the linked list of
	    <literal>pv_entry</literal>'s associated with that
	    <literal>vm_page</literal> to remove or modify the pte's from
	    their page tables.</para>
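
	  <para>In very rough C, the relationship looks like the sketch below.
	    The declarations are hypothetical and heavily simplified (they are
	    not the real kernel headers), but they show why removing every
	    mapping of a page is just a walk of a short list:</para>

	  <programlisting><![CDATA[/*
 * Hypothetical, heavily simplified declarations: one vm_page, plus one
 * pv_entry for each page table that actually maps it.
 */
#include <sys/queue.h>

struct pmap;				/* a process's page tables */
struct vm_object;
void	pte_remove(struct pmap *, unsigned long va);	/* hypothetical helper */

struct pv_entry {
	struct pmap		*pv_pmap;	/* whose page table holds the pte */
	unsigned long		 pv_va;		/* virtual address of the mapping */
	TAILQ_ENTRY(pv_entry)	 pv_link;	/* other mappings of this page */
};

struct vm_page {
	struct vm_object	*object;	/* the (object, index#) identity */
	unsigned long		 pindex;
	TAILQ_HEAD(, pv_entry)	 pv_head;	/* every pte mapping this page */
};

/*
 * Remove all hardware references to a page by walking only the mappings
 * known to exist, instead of searching every page table in the system.
 */
void
page_remove_all_mappings(struct vm_page *m)
{
	struct pv_entry *pv;

	TAILQ_FOREACH(pv, &m->pv_head, pv_link)
		pte_remove(pv->pv_pmap, pv->pv_va);
}]]></programlisting>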

	  <para>Under Linux there is no such linked list.  In order to remove
	    all the hardware page table mappings for a
	    <literal>vm_page</literal>, Linux must index into every VM object
	    that <emphasis>might</emphasis> have mapped the page.  For
	    example, if you have 50 processes all mapping the same shared
	    library and want to get rid of page X in that library, you need to
	    index into the page table for each of those 50 processes even if
	    only 10 of them have actually mapped the page.  So Linux is
	    trading off the simplicity of its design against performance.
	    Many VM algorithms which are O(1) or O(small N) under &os; wind
	    up being O(N), O(N^2), or worse under Linux.  Since the pte's
	    representing a particular page in an object tend to be at the same
	    offset in all the page tables they are mapped in, reducing the
	    number of accesses into the page tables at the same pte offset
	    will often avoid blowing away the L1 cache line for that offset,
	    which can lead to better performance.</para>

	  <para>&os; has added complexity (the <literal>pv_entry</literal>
	    scheme) in order to increase performance (to limit page table
	    accesses to <emphasis>only</emphasis> those pte's that need to be
	    modified).</para>

	  <para>But &os; has a scaling problem that Linux does not in that
	    there are a limited number of <literal>pv_entry</literal>
	    structures and this causes problems when you have massive sharing
	    of data.  In this case you may run out of
	    <literal>pv_entry</literal> structures even though there is plenty
	    of free memory available.  This can be fixed easily enough by
	    bumping up the number of <literal>pv_entry</literal> structures in
	    the kernel config, but we really need to find a better way to do
	    it.</para>

	  <para>In regard to the memory overhead of a page table versus the
	    <literal>pv_entry</literal> scheme: Linux uses
	    <quote>permanent</quote> page tables that are not thrown away, but
	    does not need a <literal>pv_entry</literal> for each potentially
	    mapped pte.  &os; uses <quote>throw away</quote> page tables but
	    adds in a <literal>pv_entry</literal> structure for each
	    actually-mapped pte.  I think memory utilization winds up being
	    about the same, giving &os; an algorithmic advantage with its
	    ability to throw away page tables at will with very low
	    overhead.</para>
	</answer>
      </qandaentry>

      <qandaentry>
	<question>
	  <para>Finally, in the page coloring section, it might help to have a
	    little more description of what you mean here.  I did not quite
	    follow it.</para>
	</question>

	<answer>
	  <para>Do you know how an L1 hardware memory cache works?  I will
	    explain: Consider a machine with 16MB of main memory but only 128K
	    of L1 cache.  Generally the way this cache works is that each 128K
	    block of main memory uses the <emphasis>same</emphasis> 128K of
	    cache.  If you access offset 0 in main memory and then offset
	    128K in main memory you can wind up throwing away the
	    cached data you read from offset 0!</para>

	  <para>Now, I am simplifying things greatly.  What I just described
	    is what is called a <quote>direct mapped</quote> hardware memory
	    cache.  Most modern caches are what are called
	    2-way-set-associative or 4-way-set-associative caches.  The
	    set associativity allows you to access up to N different memory
	    regions that overlap the same cache memory without destroying the
	    previously cached data.  But only N.</para>

	  <para>So if I have a 4-way set associative cache I can access offset
	    0, offset 128K, 256K and offset 384K and still be able to access
	    offset 0 again and have it come from the L1 cache.  If I then
	    access offset 512K, however, one of the four previously cached
	    data objects will be thrown away by the cache.</para>

	  <para>It is extremely important&hellip;
	    <emphasis>extremely</emphasis> important for most of a processor's
	    memory accesses to be able to come from the L1 cache, because the
	    L1 cache operates at the processor frequency.  The moment you have
	    an L1 cache miss and have to go to the L2 cache or to main memory,
	    the processor will stall and potentially sit twiddling its fingers
	    for <emphasis>hundreds</emphasis> of instructions worth of time
	    waiting for a read from main memory to complete.  Main memory (the
	    dynamic ram you stuff into a computer) is
	    <emphasis>slow</emphasis>, when compared to the speed of a modern
	    processor core.</para>

	  <para>Ok, so now onto page coloring: All modern memory caches are
	    what are known as <emphasis>physical</emphasis> caches.  They
	    cache physical memory addresses, not virtual memory addresses.
	    This allows the cache to be left alone across a process context
	    switch, which is very important.</para>

	  <para>But in the &unix; world you are dealing with virtual address
	    spaces, not physical address spaces.  Any program you write will
	    see the virtual address space given to it.  The actual
	    <emphasis>physical</emphasis> pages underlying that virtual
	    address space are not necessarily physically contiguous! In fact,
	    you might have two pages that are side by side in a process's
	    address space which wind up being at offset 0 and offset 128K in
	    <emphasis>physical</emphasis> memory.</para>

	  <para>A program normally assumes that two side-by-side pages will be
	    optimally cached.  That is, that you can access data objects in
	    both pages without having them blow away each other's cache entry.
	    But this is only true if the physical pages underlying the virtual
	    address space are contiguous (insofar as the cache is
	    concerned).</para>

	  <para>This is what Page coloring does.  Instead of assigning
	    <emphasis>random</emphasis> physical pages to virtual addresses,
	    which may result in non-optimal cache performance, Page coloring
	    assigns <emphasis>reasonably-contiguous</emphasis> physical pages
	    to virtual addresses.  Thus programs can be written under the
	    assumption that the characteristics of the underlying hardware
	    cache are the same for their virtual address space as they would
	    be if the program had been run directly in a physical address
	    space.</para>

	  <para>Note that I say <quote>reasonably</quote> contiguous rather
	    than simply <quote>contiguous</quote>.  From the point of view of a
	    128K direct mapped cache, the physical address 0 is the same as
	    the physical address 128K.  So two side-by-side pages in your
	    virtual address space may wind up being offset 128K and offset
	    132K in physical memory, but could also easily be offset 128K and
	    offset 4K in physical memory and still retain the same cache
	    performance characteristics.  So page-coloring does
	    <emphasis>not</emphasis> have to assign truly contiguous pages of
	    physical memory to contiguous pages of virtual memory, it just
	    needs to make sure it assigns contiguous pages from the point of
	    view of cache performance and operation.</para>
	</answer>
      </qandaentry>
    </qandaset>
  </sect1>
</article>