aboutsummaryrefslogtreecommitdiff
path: root/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil_impl.h
diff options
context:
space:
mode:
authorAlexander Motin <mav@FreeBSD.org>2018-01-21 23:18:42 +0000
committerAlexander Motin <mav@FreeBSD.org>2018-01-21 23:18:42 +0000
commit272f833cde1fa8604b2b06433303c63dca031ea8 (patch)
tree0bccea6f1d40648116e1f23ef9312ced4765fa0c /sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil_impl.h
parentd252c97fc09d49bc3a2d022defc76be46d9b08a9 (diff)
parentebd7699264cc35ac0c89dd3a8ab0046ba7dd1f78 (diff)
downloadsrc-272f833cde1fa8604b2b06433303c63dca031ea8.tar.gz
src-272f833cde1fa8604b2b06433303c63dca031ea8.zip
MFV r328227: 8909 8585 can cause a use-after-free kernel panic
illumos/illumos-gate@94ddd0900a8838f62bba15e270649a42f4ef9f81 https://www.illumos.org/issues/8909: There's a race condition that exists if `zil_free_lwb` races with either `zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`. Here's an example panic due to this bug: > ::status debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40 operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc) image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513 panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in mo dule "zfs" due to a NULL pointer dereference dump content: kernel pages only > $c zio_shrink+0x12() zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20) zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8) zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8) zil_commit+0x80(ffffff03dcd15cc0, 9a9) zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0) fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0) write+0x250(42, fffffd7ff4832000, 2000) sys_syscall+0x177() If there's an outstanding lwb that's in `zil_commit_waiter_timeout` waiting to timeout, waiting on it's waiter's CV, we must be sure not to call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB may be freed and can result in a use-after-free situation where the stale lwb pointer stored in the `zil_commit_waiter_t` structure of the thread waiting on the waiter's CV is used. A similar situation can occur if an lwb is issued to disk, and thus in the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the disk is servicing that lwb. In this situation, the lwb will be freed by `zil_free_lwb`, which will result in a use-after-free situation when the lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called. This race condition is prevented in `zil_close` by calling `zil_commit` before `zil_free_lwb` is called, which will ensure all outstanding (i.e. all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states) reach the `LWB_STATE_DONE` state before the lwb's are freed (`zil_commit` will not return untill all the lwb's are `LWB_STATE_DONE`). Further, this race condition is prevented in `zil_sync` by only calling `zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set. All lwb's not in the `LWB_STATE_DONE` state will have a non-null value for this pointer; the pointer is only cleared in `zil_lwb_flush_vdevs_done`, at which point the lwb's state will be changed to `LWB_STATE_DONE`. This race is present in `zil_suspend`, leading to this bug. At first glance, it would appear as though this would not be true because `zil_suspend` will call `zil_commit`, just like `zil_close`, but the problem is that `zil_suspend` will set the zilog's `zl_suspend` field prior to calling `zil_commit`. Further, in `zil_commit`, if `zl_suspend` is set, `zil_commit` will take a special branch of logic and use `txg_wait_synced` instead of performing the normal `zil_commit` logic. This call to `txg_wait_synced` might be good enough for the data to reach disk safely before it returns, but it does not ensure that all outstanding lwb's reach the `LWB_STATE_DONE` state before it returns. This is because, if there's an lwb "stuck" in `zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will maintain a non-null value for it's `lwb_buf` field and thus `zil_sync` will not free that lwb. Thus, even though the lwb's data is already on disk, the lwb will be left lingering, waiting on the CV, and will eventually timeout and be issued to disk even though the write is unnesseary. So, after `zil_commit` is called from `zil_suspend`, we incorrectly assume that there are not outstanding lwb's, and proceed to free all lwb's found on the zilog's lwb list. As a result, we free the lwb that will later be used `zil_commit_waiter_timeout`. Reviewed by: John Kennedy <jwk404@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Igor Kozhukhov <igor@dilos.org> Approved by: Robert Mustacchi <rm@joyent.com> Author: Prakash Surya <prakash.surya@delphix.com>
Notes
Notes: svn path=/head/; revision=328228
Diffstat (limited to 'sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil_impl.h')
-rw-r--r--sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil_impl.h39
1 files changed, 32 insertions, 7 deletions
diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil_impl.h b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil_impl.h
index 8702ce82091c..1825ced56663 100644
--- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil_impl.h
+++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zil_impl.h
@@ -37,13 +37,38 @@ extern "C" {
#endif
/*
- * Possbile states for a given lwb structure. An lwb will start out in
- * the "closed" state, and then transition to the "opened" state via a
- * call to zil_lwb_write_open(). After the lwb is "open", it can
- * transition into the "issued" state via zil_lwb_write_issue(). After
- * the lwb's zio completes, and the vdev's are flushed, the lwb will
- * transition into the "done" state via zil_lwb_write_done(), and the
- * structure eventually freed.
+ * Possbile states for a given lwb structure.
+ *
+ * An lwb will start out in the "closed" state, and then transition to
+ * the "opened" state via a call to zil_lwb_write_open(). When
+ * transitioning from "closed" to "opened" the zilog's "zl_issuer_lock"
+ * must be held.
+ *
+ * After the lwb is "opened", it can transition into the "issued" state
+ * via zil_lwb_write_issue(). Again, the zilog's "zl_issuer_lock" must
+ * be held when making this transition.
+ *
+ * After the lwb's zio completes, and the vdev's are flushed, the lwb
+ * will transition into the "done" state via zil_lwb_write_done(). When
+ * transitioning from "issued" to "done", the zilog's "zl_lock" must be
+ * held, *not* the "zl_issuer_lock".
+ *
+ * The zilog's "zl_issuer_lock" can become heavily contended in certain
+ * workloads, so we specifically avoid acquiring that lock when
+ * transitioning an lwb from "issued" to "done". This allows us to avoid
+ * having to acquire the "zl_issuer_lock" for each lwb ZIO completion,
+ * which would have added more lock contention on an already heavily
+ * contended lock.
+ *
+ * Additionally, correctness when reading an lwb's state is often
+ * acheived by exploiting the fact that these state transitions occur in
+ * this specific order; i.e. "closed" to "opened" to "issued" to "done".
+ *
+ * Thus, if an lwb is in the "closed" or "opened" state, holding the
+ * "zl_issuer_lock" will prevent a concurrent thread from transitioning
+ * that lwb to the "issued" state. Likewise, if an lwb is already in the
+ * "issued" state, holding the "zl_lock" will prevent a concurrent
+ * thread from transitioning that lwb to the "done" state.
*/
typedef enum {
LWB_STATE_CLOSED,