src/module, branch zfs-0.6.2

Use directory xattrs for symlinks

2013-08-22T20:30:44+00:00

There is currently a subtle bug in the SA implementation which
can crop up which prevents us from safely using multiple variable
length SAs in one object.

Fortunately, the only existing use case for this are symlinks with
SA based xattrs.  Therefore, until the root cause in the SA code
can be identified and fixed we prevent adding SA xattrs to symlinks.

Signed-off-by: Brian Behlendorf 
Issue #1468

Revert "Evict meta data from ghost lists + l2arc headers"

2013-08-22T19:15:37+00:00

This reverts commit fadd0c4da1e2ccd6014800d8b1a0fd117dd323e8 which
introduced a regression in honoring the meta limit.

Signed-off-by: Brian Behlendorf 
Close #1660

Linux 3.11 compat: fops->iterate()

2013-08-15T23:19:07+00:00

Commit torvalds/linux@2233f31aade393641f0eaed43a71110e629bb900
replaced ->readdir() with ->iterate() in struct file_operations.
All filesystems must now use the new ->iterate method.

To handle this the code was reworked to use the new ->iterate
interface.  Care was taken to keep the majority of changes
confined to the ZPL layer which is already Linux specific.
However, minor changes were required to the common zfs_readdir()
function.

Compatibility with older kernels was accomplished by adding
versions of the trivial dir_emit* helper functions.  Also the
various *_readdir() functions were reworked in to wrappers
which create a dir_context structure to pass to the new
*_iterate() functions.

Unfortunately, the new dir_emit* functions prevent us from
passing a private pointer to the filldir function.  The xattr
directory code leveraged this ability through zfs_readdir()
to generate the list of xattr names.  Since we can no longer
use zfs_readdir() a simplified zpl_xattr_readdir() function
was added to perform the same task.

Signed-off-by: Richard Yao 
Signed-off-by: Brian Behlendorf 
Closes #1653
Issue #1591

Fix z_wr_iss_h zio_execute() import hang

2013-08-15T22:20:36+00:00

Because we need to be more frugal about our stack usage under
Linux.  The __zio_execute() function was modified to re-dispatch
zios to a ZIO_TASKQ_ISSUE thread when we're in a context which
is known to be stack heavy.  Those two contexts are the sync
thread and what ever thread is performing spa initialization.

Unfortunately, this change introduced an unlikely bug which can
result in a zio being re-dispatched indefinitely and never being
executed.  If during spa initialization we handle a zio with
ZIO_PRIORITY_NOW it will be moved to the high priority queue.
When __zio_execute() is called again for the zio it will mis-
interpret the context and re-dispatch it again.  The system
will get stuck spinning re-dispatching the zio and making no
forward progress.

To fix this rare issue __zio_execute() has been updated not
to re-dispatch zios on either the ZIO_TASKQ_ISSUE or
ZIO_TASKQ_ISSUE_HIGH task queues.

In practice this issue was rarely reported and can usually
be fixed by rebooting the system and importing the pool again.

Signed-off-by: Brian Behlendorf 
Closes #1455

Illumos #3618 ::zio dcmd does not show timestamp data

2013-08-12T23:46:50+00:00

3618 ::zio dcmd does not show timestamp data
Reviewed by: Adam Leventhal 
Reviewed by: George Wilson 
Reviewed by: Christopher Siden 
Reviewed by: Garrett D'Amore 
Approved by: Dan McDonald 

References:
  http://www.illumos.org/issues/3618
  illumos/illumos-gate@c55e05cb35da47582b7afd38734d2f0d9c6deb40

Notes on porting to ZFS on Linux:

The original changeset mostly deals with mdb ::zio dcmd.
However, in order to provide the requested functionality
it modifies vdev and zio structures to keep the timing data
in nanoseconds instead of ticks. It is these changes that
are ported over in the commit in hand.

One visible change of this commit is that the default value
of 'zfs_vdev_time_shift' tunable is changed:

    zfs_vdev_time_shift = 6
        to
    zfs_vdev_time_shift = 29

The original value of 6 was inherited from OpenSolaris and
was subotimal - since it shifted the raw tick value - it
didn't compensate for different tick frequencies on Linux and
OpenSolaris. The former has HZ=1000, while the latter HZ=100.

(Which itself led to other interesting performance anomalies
under non-trivial load. The deadline scheduler delays the IO
according to its priority - the lower priority the further
the deadline is set. The delay is measured in units of
"shifted ticks". Since the HZ value was 10 times higher,
the delay units were 10 times shorter. Thus really low
priority IO like resilver (delay is 10 units) and scrub
(delay is 20 units) were scheduled much sooner than intended.
The overall effect is that resilver and scrub IO consumed
more bandwidth at the expense of the other IO.)

Now that the bookkeeping is done is nanoseconds the shift
behaves correctly for any tick frequency (HZ).

Ported-by: Cyril Plisko 
Signed-off-by: Brian Behlendorf 
Closes #1643

Linux 3.8 compat: Support CONFIG_UIDGID_STRICT_TYPE_CHECKS

2013-08-09T22:31:52+00:00

When CONFIG_UIDGID_STRICT_TYPE_CHECKS is enabled uid_t/git_t are
replaced by kuid_t/kgid_t, which are structures instead of integral
types. This causes any code that uses an integral type to fail to build.
The User Namespace functionality introduced in Linux 3.8 requires
CONFIG_UIDGID_STRICT_TYPE_CHECKS, so we could not build against any
kernel that supported it.

We resolve this by converting between the new kuid_t/kgid_t structures
and the original uid_t/gid_t types.

Signed-off-by: Richard Yao 
Signed-off-by: Brian Behlendorf 
Closes #1589

Evict meta data from ghost lists + l2arc headers

2013-08-09T17:06:12+00:00

When the meta limit is exceeded the ARC evicts some meta data
buffers from the mfu+mru lists.  Unfortunately, for meta data
heavy workloads it's possible for these buffers to accumulate
on the ghost lists if arc_c doesn't exceed arc_size.

To handle this case arc_adjust_meta() has been entended to
explicitly evict meta data buffers from the ghost lists in
proportion to what was evicted from the mfu+mru lists.

If this is insufficient we request that the VFS release
some inodes and dentries.  This will result in the release
of some dnodes which are counted as 'other' metadata.

Signed-off-by: Brian Behlendorf

Allow arc_evict_ghost() to only evict meta data

2013-08-09T17:06:08+00:00

The default behavior of arc_evict_ghost() is to start by evicting
data buffers.  Then only if the requested number of bytes to evict
cannot be satisfied by data buffers move on to meta data buffers.

This is ideal for honoring arc_c since it's preferable to keep the
meta data cached.  However, if we're trying to free memory from the
arc to honor the meta limit it's a problem because we will need to
discard all the data to get to the meta data.

To avoid this issue the arc_evict_ghost() is now passed a fourth
argumented describing which buffer type to start with.  The
arc_evict() function already behaves exactly like this for a
same reason so this is consistent with the existing code.

All existing callers have been updated to pass ARC_BUFC_DATA so
this patch introduces no functional change.  New callers may
pass ARC_BUFC_METADATA to skip immediately to evicting meta
data leaving the normal data untouched.

Signed-off-by: Brian Behlendorf

Illumos #3137 L2ARC compression

2013-08-08T20:27:21+00:00

3137 L2ARC compression
Reviewed by: George Wilson 
Reviewed by: Matthew Ahrens 
Approved by: Dan McDonald 

References:
  illumos/illumos-gate@aad02571bc59671aa3103bb070ae365f531b0b62
  https://www.illumos.org/issues/3137
  http://wiki.illumos.org/display/illumos/L2ARC+Compression

Notes for Linux port:

A l2arc_nocompress module option was added to prevent the
compression of l2arc buffers regardless of how a dataset's
compression property is set.  This allows the legacy behavior
to be preserved.

Ported by: James H 
Signed-off-by: Brian Behlendorf 
Closes #1379

Return -1 from arc_shrinker_func()

2013-08-08T16:20:56+00:00

This is analogous to SPL commit zfsonlinux/spl@b9b3715.  While
we don't have clear evidence of systems getting caught here
indefinately like in the SPL this ensures that it will never
happen.

Signed-off-by: Richard Yao 
Signed-off-by: Brian Behlendorf 
Closes #1579